Move S3 Objects faster without any hurdles

  1. Initiating copy/move operations through the AWS console: This operation shouldn't be interrupted at any cost. If your organization enforces something like automatic sign-out after idle time, the copy can be cut short mid-way, and re-triggering it manually after every timeout or failure adds more time to complete the operation.
  2. AWS CLI cp command: This doesn't break as frequently as the AWS console, but when it does, it's hard to identify which files were not copied to the target bucket, and any manual retry can introduce duplicates in the target bucket.
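For reference, a typical recursive copy with the CLI looks roughly like this (the bucket names and prefixes are placeholders):

```shell
# Recursively copy every object under a prefix to the target bucket.
aws s3 cp s3://source-bucket/source-folder/ s3://target-bucket/target-folder/ --recursive

# Narrow the copy with exclude/include filters, e.g. only .log files:
aws s3 cp s3://source-bucket/source-folder/ s3://target-bucket/target-folder/ \
  --recursive --exclude "*" --include "*.log"
```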
  3. S3 Replication: This suits most project use cases and is completely managed by AWS, but turning on replication requires versioning to be enabled on both the source and destination buckets. Versioning introduces a lot of issues, especially when a Spark job writes to the source bucket: the job first writes to a staging directory and then transfers its contents to the final S3 location. Every run of the Spark job creates a different staging directory, so replication picks up many versions of invalid, staging-only data.
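To illustrate the versioning prerequisite, enabling replication looks roughly like this (bucket names are placeholders, and replication.json is a hypothetical file holding the replication rules and IAM role):

```shell
# Versioning must be enabled on BOTH buckets before replication can be configured.
aws s3api put-bucket-versioning --bucket source-bucket \
  --versioning-configuration Status=Enabled
aws s3api put-bucket-versioning --bucket target-bucket \
  --versioning-configuration Status=Enabled

# Attach a replication configuration (rules and the IAM role live in replication.json).
aws s3api put-bucket-replication --bucket source-bucket \
  --replication-configuration file://replication.json
```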
  1. S3 Batch Operations: S3 Batch Operations is a feature introduced by AWS to perform large-scale batch operations on S3 objects across buckets. A batch operation expects a manifest (an input CSV file containing bucket and object details) as its input and executes the given operation on top of it. The core component of Batch Operations is the job, which holds the details of the S3 bucket along with the objects on which the required operation needs to be performed. A job executes an operation on the list of Amazon S3 objects specified in the manifest file, and the supported operations include 1) PUT copy object, 2) PUT object tagging, 3) PUT object ACL, 4) Initiate S3 Glacier restore and 5) Invoke AWS Lambda function, etc.
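As a rough sketch, such a copy job can be created from the CLI like this (the account ID, all ARNs and the manifest ETag below are placeholders, not real values):

```shell
# Create a Batch Operations copy job from a CSV manifest of Bucket,Key rows.
# Every ID/ARN and the ETag here is a placeholder for illustration only.
aws s3control create-job \
  --account-id 111122223333 \
  --operation '{"S3PutObjectCopy":{"TargetResource":"arn:aws:s3:::target-bucket"}}' \
  --manifest '{"Spec":{"Format":"S3BatchOperations_CSV_20180820","Fields":["Bucket","Key"]},"Location":{"ObjectArn":"arn:aws:s3:::manifest-bucket/manifest.csv","ETag":"example-etag"}}' \
  --report '{"Bucket":"arn:aws:s3:::report-bucket","Format":"Report_CSV_20180820","Enabled":true,"ReportScope":"AllTasks"}' \
  --priority 10 \
  --role-arn arn:aws:iam::111122223333:role/batch-operations-role \
  --no-confirmation-required
```

The completion report written to the report bucket is what makes failures traceable, which is exactly what the plain cp approach lacks.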
  2. S3DistCp: S3DistCp is an extension of Apache DistCp (an open-source tool for copying large amounts of data) that is optimized to work with AWS S3. S3DistCp is more scalable and efficient for copying large numbers of objects in parallel across AWS buckets. It uses MapReduce to copy in a distributed manner, sharing the copy, error-handling and reporting tasks across several servers. S3DistCp also supports incremental copies using the outputManifest and previousManifest options, which compare the current output with the previous manifest to identify the remaining objects. Consider the below example that does an incremental copy from one S3 bucket to another.
s3-dist-cp \
  --src s3://source-bucket/source-folder/hourly_table \
  --dest s3://target-bucket/target-folder/hourly_table \
  --srcPattern '.*\.log' \
  --outputManifest=manifest-newFile.gz \
  --previousManifest=s3://log-bucket/log-folder/manifest-oldFile.gz
  • The AWS CLI sync command keeps source and target directories in sync.
  • It syncs directories and S3 prefixes, recursively copying new and updated files from the source directory to the destination.
  • The main difference between cp and sync is that sync never introduces duplicates, since it copies only new or modified files.
  • It only creates target folders in the destination if they contain one or more new or modified files.
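A minimal sync invocation, with placeholder bucket names, would look like:

```shell
# Copy only new and modified objects; re-running this is safe and
# will not create duplicates in the target bucket.
aws s3 sync s3://source-bucket/source-folder/ s3://target-bucket/target-folder/

# Add --delete to also remove destination objects that no longer
# exist in the source, keeping both sides fully in sync.
aws s3 sync s3://source-bucket/source-folder/ s3://target-bucket/target-folder/ --delete
```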
