Synchronize large objects to S3 efficiently

I need to synchronize about 30 GB of git repositories to S3. These repos may contain some very large pack files, on the order of 2 GB each.

I know that S3 has recently added support for large objects, and has new APIs that allow the objects to be uploaded as several parallel chunks. Is there a good command-line tool for Linux that allows me to efficiently synchronize large objects with S3 in a fashion similar to s3sync?
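For reference, here is a minimal sketch of what a parallel multipart upload looks like through a modern SDK such as boto3. The bucket name, key, and file path are placeholders, and the part size and concurrency are just illustrative values:

```python
import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Objects above multipart_threshold are split into multipart_chunksize parts
# and uploaded by max_concurrency threads in parallel.
config = TransferConfig(
    multipart_threshold=64 * 1024 * 1024,  # switch to multipart above 64 MB
    multipart_chunksize=64 * 1024 * 1024,  # 64 MB parts
    max_concurrency=8,                     # 8 parallel upload threads
)

s3.upload_file(
    Filename="repo.git/objects/pack/pack-1234.pack",   # placeholder path
    Bucket="my-backup-bucket",                          # placeholder bucket
    Key="repos/repo.git/objects/pack/pack-1234.pack",
    Config=config,
)
```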

Answer

If these features were added recently, they may not have made it into the userland tools yet. But I'll go out on a limb anyway and recommend JetS3t. I've been using its Synchronize tool to keep roughly 96 GB of files synchronized with Amazon S3.

However, you need to be aware that you can't modify or replace a chunk of data in place in something stored on S3: if something changes in one of those 2 GB files, you are going to have to re-upload the whole file.
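To make that concrete, here is a minimal sync sketch (not any particular tool's implementation) that re-uploads a file in full whenever it differs from what is stored. It assumes boto3, a placeholder bucket, and that the stored object was uploaded in a single PUT, since only then does the ETag equal the MD5 of the content:

```python
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"  # placeholder

def local_md5(path):
    h = hashlib.md5()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(1024 * 1024), b""):
            h.update(block)
    return h.hexdigest()

def sync_file(path, key):
    try:
        head = s3.head_object(Bucket=BUCKET, Key=key)
        if head["ETag"].strip('"') == local_md5(path):
            return  # unchanged, nothing to upload
    except ClientError:
        pass  # object does not exist yet
    # Any change, however small, means re-uploading the entire file:
    # S3 has no partial-update or in-place-patch operation.
    s3.upload_file(path, BUCKET, key)
```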

There do exist some tools that break files down into chunks of some block size X, so that a modification costs less because only the affected blocks have to be re-uploaded instead of the entire file. How well this works depends on the chunking algorithm and on how the file is being modified.
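A minimal sketch of that block-level scheme (the general idea, not how s3fs or brackup actually implement it): split the file into fixed-size blocks, store each block under a content-addressed key, and upload only blocks whose hash is not already present. The bucket, prefix, and block size are placeholders:

```python
import hashlib
import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"    # placeholder
BLOCK_SIZE = 32 * 1024 * 1024  # "block size X" -- 32 MB here

def block_exists(key):
    try:
        s3.head_object(Bucket=BUCKET, Key=key)
        return True
    except ClientError:
        return False

def upload_in_blocks(path, prefix):
    manifest = []  # ordered list of block hashes, for later reassembly
    with open(path, "rb") as f:
        while True:
            block = f.read(BLOCK_SIZE)
            if not block:
                break
            digest = hashlib.sha256(block).hexdigest()
            key = f"{prefix}/blocks/{digest}"
            manifest.append(digest)
            # Only blocks that changed (or are new) get uploaded.
            if not block_exists(key):
                s3.put_object(Bucket=BUCKET, Key=key, Body=block)
    # The manifest records which blocks, in which order, make up the file.
    s3.put_object(Bucket=BUCKET, Key=f"{prefix}/manifest",
                  Body="\n".join(manifest).encode())
```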

tl;dr:

  1. If the data is static and not going to change, use something like Synchronize from JetS3t.
  2. If it's going to change over time, consider something like s3fs or one of the other backup systems, such as brackup, that break large files into chunks stored on S3 to lower the cost of modifying a file.
  3. Employ some form of delta/incremental backup that stores the deltas of each change on S3 in addition to the original copy of the files (see the sketch after this list).
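A minimal sketch of option 3: keep the original copy on S3 and upload only a binary delta when the file changes. This assumes the third-party bsdiff4 package for computing the delta, and a placeholder bucket and keys; in practice you would keep the previously uploaded version cached locally rather than downloading it for every diff:

```python
import boto3
import bsdiff4  # assumed third-party binary-diff library

s3 = boto3.client("s3")
BUCKET = "my-backup-bucket"  # placeholder

def upload_delta(path, base_key, delta_key):
    # Fetch the copy already on S3 (or use a local cache of it) and diff
    # the current local file against it.
    base = s3.get_object(Bucket=BUCKET, Key=base_key)["Body"].read()
    with open(path, "rb") as f:
        current = f.read()
    patch = bsdiff4.diff(base, current)
    # The delta is usually far smaller than the 2 GB file it describes.
    s3.put_object(Bucket=BUCKET, Key=delta_key, Body=patch)

def restore(base_key, delta_key):
    # Rebuild the current version from the original copy plus the delta.
    base = s3.get_object(Bucket=BUCKET, Key=base_key)["Body"].read()
    patch = s3.get_object(Bucket=BUCKET, Key=delta_key)["Body"].read()
    return bsdiff4.patch(base, patch)
```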

Attribution
Source: Link, Question Author: emk, Answer Author: Pharaun
