Solutions for Migrating Unstructured Data from On-Premises to AWS

6 min read

Three specific use cases around unstructured data migration to AWS.

During cloud migrations, we come across scenarios where there is a need to migrate or transfer files (typically unstructured data), from on-premises (SAN/NAS) to a specific storage service in AWS (e.g., EBS/EFS/S3/FSx). These can be files generated by the application, user uploads, integration files that are created by one application and consumed by others (B2B), etc. In most cases, these unstructured data may vary in total size from a few MBs to 1 TB, and most importantly, the underlying application is not expected to undergo a lot of remediation to utilize the target AWS service.

In this blog post, we share our experience with three specific use cases around unstructured data migration to AWS:

  • In the first scenario, where the requirement is to share data among multiple VMs/applications, we describe how unstructured data from a Network Attached Storage (NAS) was migrated to AWS.
  • In the second scenario, we talk about how we migrated B2B data to AWS Storage.
  • In the third scenario, where the unstructured data exists in the native file system (NTFS, xfs or ext4) and not exposed to the network as a File Share, we discuss how the data in Windows/Linux instances is migrated to AWS.

1. From network attached storage (NAS) to AWS using AWS DataSync

Problem/scenario

Application A picks up incoming files from an Application X, processes them and generates data files that are 50–300 GB. That, then, becomes the input for another Application Y to consume. The data is shared by means of an NFS Storage accessible to all three applications.

Application A is being migrated to AWS and the Applications X and Y continue to remain on-premises. We used AWS Elastic File System (EFS) to replace NFS on AWS. However, that makes it difficult for the applications to read/write from a common storage solution, and network latency slows down Application X and Application Y

Solution

In this case, we used AWS DataSync Service to perform the initial migration of nearly 1 TB of data from the on-premises NFS storage to AWS EFS.

AWS DataSync can transfer data between any two network storage or object storage. These could be network file systems (NFS), server message block (SMB) file servers, Hadoop distributed file systems (HDFS), self-managed object storage, AWS Snowcone, Amazon Simple Storage Service (Amazon S3) buckets, Amazon Elastic File System (Amazon EFS) file systems, Amazon FSx for Windows File Server file systems, Amazon FSx for Lustre file systems and Amazon FSx for OpenZFS file systems.

These are network file systems (NFS), server message block (SMB) file servers, Hadoop distributed file systems (HDFS), self-managed object storage, AWS Snowcone, Amazon Simple Storage Service (Amazon S3) buckets, Amazon Elastic File System (Amazon EFS) file systems, Amazon FSx for Windows File Server file systems, Amazon FSx for Lustre file systems and Amazon FSx for OpenZFS file systems.

To solve the need for the applications to read/write from a common storage solution and address the network latency involved during read/write operations across the Direct Connect, we scheduled a regular synchronization of the specific input and output folders using the AWS DataSync service between the NFS and EFS. This means that all three applications look at same set of files after the sync is complete.

Challenges

  • Syncs can be scheduled at minimum one-hour intervals. This soft limit can be modified for up to 15-minutes intervals, however, that leads to performance issues and subsequent sync schedules getting queued up, which forms a loop.
  • Bidirectional Syncs were configured to run in a queued fashion. That is, only one-way sync can be executed at a time. Applications will have to read the files after the sync interval is completed. In our case, files are generated only one time per day, so this challenge was mitigated by scheduling the read/writes in a timely fashion.

Cost implications

  • No fixed/upfront cost and only $0.0125 per gigabyte (GB) for data transferred.
  • AWS DataSync Agent (virtual appliance) must be installed on a dedicated VM on-premises.

2. Data/Files in FTP locations to AWS via AWS Transfer family

Problem/scenario

Another Application B had to process lot of unstructured data from an FTP location. These files were transferred to the application server through SFTP by dependent applications. Since this application is moved to AWS, we must have the dependent applications also transfer these files to a storage location in AWS.

Solution

AWS Transfer Family provides options to transfer your unstructured data to an S3 bucket or an EFS Storage using SFTP, FTPS or FTP protocols. This easily integrates with any standard FTP client (GUI- or CLI-based) and thus allows you to transfer your data from on-premises to AWS. As a managed service backed by in-built autoscaling features, it can be deployed in up to three availability zones to achieve high availability and resiliency.

Private VPC endpoints are available to securely transfer data within the internal network.

AWS Transfer Family can also be used for a one-time data migration for B2B Managed File Transfer.

AWS Transfer Family can also be used for a one-time data migration for B2B Managed File Transfer.

We used an EFS mount on the application server and directed the other dependent applications to use the AWS Transfer Family SFTP private endpoint to send the files securely. The authentication was handled via SSH Key Pair so that there is no hardcoded username/password in either location. This way, we do not expose the application server over SSH port 22, which was a client-mandated security control.

Challenges

It was very easy to set up and get going because our application was running in Linux.

However, FSx is not a supported target storage option because AWS Transfer Family suits use cases for a target application hosted on Linux platforms. Some additional programming is needed to access an S3 Bucket if a Windows-based application must consume these managed services.

Cost implications

There is a $0.30 per hour fixed charge while the service is enabled and $0.04 per gigabyte (GB) data upload/download charges are applicable.

3. From Windows/Linux local storage to AWS using rsync/robocopy

Problem/scenario

Application C used (read/write) a lot of data from a native file system, which was needed in AWS when the application was migrated to AWS. This data on native file systems could not be migrated as-is to EBS Volumes or EFS Storage because both source and target should be network file storage to use the AWS native file/data transfer solutions.

While we could have presented the native file system as NFS share and used AWS DataSync as in the second scenario, this would have required additional installation and configuration on source servers, which is usually not desired in case of Migrations.

Solution

We used traditional tools like rsync/robocopy to copy data to AWS Storage like EFS (mounted on EC2) or EBS volumes.

We used traditional tools like rsync/robocopy to copy data to AWS Storage like EFS (mounted on EC2) or EBS volumes.

We used a shell script based on rsync to pull data from an on-premises server to the EC2 instance, keeping in mind the security mandate not to expose EC2 instances on SSH port 22. Due to rsync features and good bandwidth available with Direct Connect, the data migration was seamless.

Challenges

While rsync/robocopy is a good fit for the above problem, it may not be suitable if the following characteristics are exhibited by the application and the environment:

  1. If on-prem and target storage is a network file system, then the preferred option would be to use AWS DataSync due to its advanced features to schedule, etc.
  2. If the size of the data exceeds 1-2 TB, which would otherwise lead to bandwidth throttling.
  3. Migration of data and then regular synchronization of data between on-prem and AWS.
  4. Security rules in place in most organizations prevent inbound security group rules from allowing direct access to EC2 instances on Port 22. In such cases, the 'pull' from on-prem storage can be 'initiated from AWS' and now only the outbound security group in AWS needs to allow traffic over Port 22, which organizations would allow.

Cost implications

There are no ingress charges, and it is $0.08 to $0.12 per gigabyte (GB) for egress to Internet/on-premises.

Conclusion

In this post, we discussed very common use cases in data migrations to AWS Cloud and how native and traditional tools are used to tackle some unique situations. To summarize our experience, a quick comparison of these tools is depicted below:

In this post, we discussed very common use cases in data migrations to AWS Cloud and how native and traditional tools are used to tackle peculiar situations. To summarize our experience, a quick comparison of these tools is depicted below:

We did not discuss the option of using AWS Snow Family due to feasibility issues in the scenarios. It requires physical access to a data center and is only appropriate for transferring very large data (in many TBs) — our data was not very large for any of the above use cases.

Similarly, AWS Storage Gateway was not considered as it is ideal for on-prem backup/archival/DR scenarios and none of the use cases had that requirement.

There are managed services available on AWS for data migrations and each of them cater to a very specific set of use cases.

We will continue to share our experience as we encounter new scenarios for transferring or storing unstructured data in AWS.

References

Be the first to hear about news, product updates, and innovation from IBM Cloud