Move S3 Objects faster without any hurdles

Balachandar Paulraj
3 min read · Mar 21, 2021

If you have ever tried to copy or move millions of objects from one S3 bucket to another, you know it is not as simple as it sounds. Below are some of the options that first come to mind.

  1. Initiating copy/move operations through the AWS console: This operation must not be interrupted at any cost. If your organization enforces automatic sign-out after idle time, every timeout or failure means re-triggering the operation manually, which adds to the total time needed.
  2. AWS CLI cp command: This breaks far less often than the console, but when it does, it is hard to identify which files were not copied to the target bucket, and manual retries can introduce duplicates in the target bucket (a sketch of a typical recursive copy follows this list).
  3. S3 Replication: This suits many use cases and is fully managed by AWS. But turning on replication requires versioning to be enabled on both the source and destination buckets. Versioning introduces a lot of issues, especially when a Spark job writes to the source bucket: the job creates a staging directory and then transfers its contents to the S3 bucket, so every run of the Spark job creates a different staging directory and leaves behind new versions of invalid staging data.
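For reference, a typical recursive copy with the CLI looks like the minimal sketch below (bucket and folder names are placeholders). If the command dies partway through, the CLI leaves no manifest of which objects were already copied:

# Recursively copy every object under the prefix; an interruption
# leaves no record of what has already been transferred.
aws s3 cp s3://source-bucket/source-folder s3://target-bucket/target-folder --recursive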

Given all of the issues above while transferring data, it is worth looking into the options below.

  1. S3 Batch Operations: S3 Batch Operations is a feature introduced by AWS to perform large-scale batch operations on S3 objects across buckets. A batch operation expects a manifest (an input CSV file containing bucket and object details) as its input and executes the given operation on top of it. The core component of Batch Operations is the job, which holds the details of the S3 bucket and the objects on which the operation needs to be performed. A job executes an operation across the list of Amazon S3 objects specified in the manifest file, and the supported operations include 1) PUT copy object, 2) PUT object tagging, 3) PUT object ACL, 4) Initiate S3 Glacier restore and 5) Invoke AWS Lambda function, etc. A minimal sketch of creating a copy job from the CLI follows.
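The sketch below creates a copy job with the AWS CLI; the account ID, bucket names, role ARN and manifest ETag are all placeholders to replace with your own values:

# Create a Batch Operations job that copies every object listed in the
# manifest into target-bucket and writes a completion report.
aws s3control create-job \
  --account-id 111122223333 \
  --operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::target-bucket"}}' \
  --manifest '{"Spec":{"Format":"S3BatchOperations_CSV_20180820","Fields":["Bucket","Key"]},"Location":{"ObjectArn":"arn:aws:s3:::manifest-bucket/manifest.csv","ETag":"<manifest-etag>"}}' \
  --report '{"Bucket":"arn:aws:s3:::report-bucket","Format":"Report_CSV_20180820","Enabled":true,"Prefix":"batch-reports","ReportScope":"AllTasks"}' \
  --priority 10 \
  --role-arn arn:aws:iam::111122223333:role/batch-operations-role \
  --no-confirmation-required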
  2. S3DistCp: S3DistCp is an extension of Apache DistCp (an open-source tool to copy large amounts of data) that is optimized to work with AWS S3. S3DistCp is more scalable and efficient for copying large numbers of objects across AWS buckets in parallel. S3DistCp uses MapReduce to copy in a distributed manner, sharing the copy, error-handling and reporting tasks across several servers. S3DistCp also supports incremental copies of data using the outputManifest and previousManifest options, which compare the current output with the previous manifest to identify the remaining objects. Consider the below example, which does an incremental copy from one S3 bucket to another.
s3-dist-cp \
--src s3://source-bucket/source-folder/hourly_table \
--dest s3://target-bucket/target-folder/hourly_table \
--srcPattern '.*\.log' \
--outputManifest=manifest-newFile.gz \
--previousManifest=s3://log-bucket/log-folder/manifest-oldFile.gz

3. AWS CLI sync command:

  • The aws s3 sync command keeps the source and target directories in sync.
  • It syncs directories and S3 prefixes, recursively copying new and updated files from the source directory to the destination.
  • The main difference between cp and sync is that sync copies only new or modified files, so re-running it after a failure never introduces duplicates in the target bucket (see the sketch below).
  • It creates folders in the destination only if they contain one or more new or modified files.
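As a minimal sketch (bucket and folder names are placeholders), --dryrun previews what would be copied, and the sync itself can simply be re-run after any failure:

# Preview which objects would be copied, without touching the target
aws s3 sync s3://source-bucket/source-folder s3://target-bucket/target-folder --dryrun

# Run the sync; re-running it copies only new or changed objects
aws s3 sync s3://source-bucket/source-folder s3://target-bucket/target-folder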

Thank you for reading this post. Please share any comments or feedback.
