AWS : Backup with Amazon Glacier

Introduction

Paranoia. It is a condition I live with when it comes to backing up my data. Almost every aspect of my life has a digital footprint of some size, and with so much invested in my digital trove, I go to great lengths to protect it.

I have one primary hard drive, which is the master copy and lives in a military-specification waterproof, drop-resistant housing. Mirrored frequently from that are two backup drives, in the same style of housing, which are stored in a one-hour-rated fireproof and waterproof safe inside my house. Mirrored quarterly from those is a third backup drive, stored in a portable fireproof safe at an offsite location.

Needless to say, I still feel very exposed and am always looking for additional protection!

A Little About Glacier

As soon as I read about Amazon Glacier, I was immediately intrigued. Glacier is an archival data store provided by Amazon, designed to store large amounts of data very inexpensively and in an extremely redundant, durable way. The major caveat is that it can take a long time (typically 3-5 hours) to retrieve your data once it is archived. When one is considering the prospect of restoring one's entire digital life, however, that seems like a minor limitation...

Amazon advertises "eleven nines" of durability. That is a 99.999999999% chance that your data will still be intact after a year, which is very impressive. In addition to the durability of its data store, Glacier also utilizes checksumming to preserve file integrity. Much like file systems that use checksumming (such as ZFS on Linux), this protects your files from "bitrot" and other phenomena that could cause file corruption even in a durable system.

Glacier is very inexpensive, at least compared to other personal cloud backup solutions. At only $0.01 per GB per month, my entire holdings (about 500 GB, at that price) cost me a mere $5 a month. Compared to other services that charge an average of $10 a month for a mere 100 GB, this is very attractive.

Glacier organizes data into vaults and archives. A vault is a collection of archives and is intended to organize your data at a very high level. An archive is equivalent to a single file. If you need to restore your data you can request individual archives and do not need to restore the entire vault at once.

A very common use case for Glacier is to store legacy data that does not need to be accessed (at least not frequently) but must be retained regardless: old log files, prior-year financial records, and so forth.

S3 and Glacier can work closely together through the use of Lifecycle Policies, allowing you to automatically migrate objects directly from S3 into Glacier based on certain conditions. As such, Glacier is not really optimized for the end user; it is intended more as a system service.
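
If you are curious what such a policy looks like in practice, here is a rough sketch using the AWS SDK for Java. The bucket name, rule id, and key prefix are placeholders of my own:

import com.amazonaws.auth.ClasspathPropertiesFileCredentialsProvider;
import com.amazonaws.services.s3.AmazonS3Client;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration.Rule;
import com.amazonaws.services.s3.model.BucketLifecycleConfiguration.Transition;
import com.amazonaws.services.s3.model.StorageClass;

// A sketch: transition any object under "logs/" to Glacier 30 days after creation.
AmazonS3Client s3 = new AmazonS3Client(new ClasspathPropertiesFileCredentialsProvider());

Rule archiveRule = new Rule()
    .withId("ArchiveOldLogs") // hypothetical rule name
    .withPrefix("logs/")      // hypothetical key prefix
    .withStatus(BucketLifecycleConfiguration.ENABLED)
    .withTransition(new Transition()
        .withDays(30)
        .withStorageClass(StorageClass.Glacier));

s3.setBucketLifecycleConfiguration("<YOUR BUCKET NAME>",
    new BucketLifecycleConfiguration().withRules(archiveRule));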

But that has never stopped you before, now has it? So how do you use Amazon Glacier for personal backups? 

There are two approaches: one simple, the other more advanced.

Create a Vault

Before you send any data to Glacier you need to create a vault. This is quite simple. Go to the Management Console. Click on the Glacier icon to go to the Glacier console. From here, on the upper left, there is a Create Vault button. All you need to do is provide a name and your vault will be created.
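
If you would rather script it, the Java API we will use later on can do the same thing. A minimal sketch, assuming a Glacier client configured as shown further down:

// Create a vault programmatically; the result echoes back the vault's URI.
CreateVaultRequest request = new CreateVaultRequest().withVaultName("<YOUR VAULT NAME HERE>");
CreateVaultResult result = client.createVault(request);
System.out.println("Created vault at: " + result.getLocation());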

Like most other things in AWS, Glacier vaults reside in a specific Region. Migrating data from S3 to Glacier within the same Region is free, but if you migrate data across Regions there is an additional surcharge.

Now, on to using Glacier for personal backups!

Backups: Third Party GUI

Amazon does not provide a robust GUI interface for Amazon Glacier. After all... that is not what they really intended it for. But there are a host of enterprising individuals who have used the Amazon APIs to create excellent third party clients, and there are several to choose from.

My personal preference is FastGlacier.

NOTE: FastGlacier is Windows-only, which may be too limiting for some people. I no longer have a Mac, but I am told that CloudBerry Backup is the Mac user's client of choice.

Setting up your client will, of course, vary based on which one you choose. FastGlacier prompts you to create an account in which you store your Amazon Security Credentials. (See my first AWS blog for more information on retrieving these.) These credentials will be used to list your vaults and the archives they contain, and also to handle uploading files for you.

That is really all there is to it. The third party tooling has made accessing Glacier quite easy, and for many people these clients will more than meet their needs!

Backups: Java API

If you want a real challenge, though, you can create your own client using the Java API. Under the covers, all third party clients use the Amazon APIs, so any functionality you see in a third party application can be implemented by you as well.

There are two methods for uploading files to Glacier: one file at a time, and multipart. File at a time is the simplest, so we will look at that first. Below is the relevant code:

import java.io.File;

import com.amazonaws.auth.ClasspathPropertiesFileCredentialsProvider;
import com.amazonaws.regions.Region;
import com.amazonaws.regions.Regions;
import com.amazonaws.services.glacier.AmazonGlacierClient;
import com.amazonaws.services.glacier.transfer.ArchiveTransferManager;
import com.amazonaws.services.glacier.transfer.UploadResult;

// Create the Glacier client, pulling credentials from AwsCredentials.properties.
AmazonGlacierClient client = new AmazonGlacierClient(new ClasspathPropertiesFileCredentialsProvider());
client.setRegion(Region.getRegion(Regions.US_WEST_2));

// The high level transfer manager handles the whole upload in a single call.
ArchiveTransferManager atm = new ArchiveTransferManager(client, new ClasspathPropertiesFileCredentialsProvider());

UploadResult result = atm.upload(
    "<YOUR VAULT NAME HERE>",
    "<YOUR FILE NAME HERE>",
    new File("<YOUR FILE NAME HERE>")
);

NOTE: Code can be downloaded from GitHub here.

As always with AWS, the first thing you do is create a client object for the AWS service you are using (AmazonGlacierClient, in this case). You should set the Region appropriate for you as well. Here we are using the ClasspathPropertiesFileCredentialsProvider, which expects an AwsCredentials.properties file containing your Security Credentials.
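
If you have not used that provider before, AwsCredentials.properties is just a two-line properties file at the root of your classpath. A sketch, with Amazon's documented example keys standing in for real ones:

accessKey = AKIAIOSFODNN7EXAMPLE
secretKey = wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY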

The API provides a high level ArchiveTransferManager that allows you to post a file in a single request. Simply call the upload() method with your vault name, archive name, and the file you wish to upload and you are done!
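
Worth noting: the UploadResult holds the archive id, which is the only handle Glacier gives you back for retrieval. A sketch of the round trip (the output file name is my own placeholder); keep in mind that the download call blocks while Glacier stages the archive, which takes hours:

// Glacier identifies archives by id, not by name, so save this somewhere safe.
String archiveId = result.getArchiveId();

// Retrieval is the mirror image. ArchiveTransferManager polls until Glacier
// has prepared the archive, so expect this call to take several hours.
atm.download("<YOUR VAULT NAME HERE>", archiveId, new File("restored.bin"));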

This is all well and good until we try to upload a large file (such as a massive TrueCrypt encrypted file container...). A large file is much more likely to fail in the middle of the upload, and may not upload efficiently depending on your network and hardware setup.

As a result, Amazon provides an API for doing multipart uploads. This allows you to break a file up into parts and upload each part separately, which lets you recover from a failure and resume uploading where you left off. Depending on your architecture, you may also be able to achieve higher throughput by parallelizing uploads.

Multipart uploading is a little more difficult. First, there is a lifecycle:

  1. Request a multipart upload and get an upload id to correlate the parts to one another.
  2. Upload each part.
  3. Request completion of the multipart upload.

This makes the code somewhat more complicated. Second, recall that Amazon Glacier uses checksumming to preserve file integrity. When you upload a single file, as in the prior example, the SDK computes the checksum and sends it with the file. When you are doing a multipart upload, however, you need to provide the aggregate checksum for all the parts. You also need to tell Amazon how the parts are ordered so it can reassemble the file correctly.
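
The SDK's TreeHashGenerator can also compute that aggregate checksum straight from the file, which makes a handy sanity check. A sketch; the value below should match the one we assemble from the per-part checksums later:

// The tree hash of the entire file, computed in one shot.
String expectedChecksum = TreeHashGenerator.calculateTreeHash(new File("<YOUR FILE NAME>"));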

The source code for the following example is available here. We will analyze it in sections:

AmazonGlacierClient client = new AmazonGlacierClient(new ClasspathPropertiesFileCredentialsProvider());

client.setRegion(Region.getRegion(Regions.US_WEST_2));  

As before, we first create the Glacier client, use classpath properties file security credentials, and configure it with the correct Region.

String fileName = "<YOUR FILE NAME>";

File file = new File(fileName);

// Request a multipart upload. partSize is defined elsewhere in the full source;
// this code defaults it to 1 MB (1024 * 1024 bytes).
InitiateMultipartUploadRequest initiateRequest = new InitiateMultipartUploadRequest()  
    .withVaultName("<YOUR VAULT NAME>")
    .withArchiveDescription(fileName)
    .withPartSize("" + partSize);

InitiateMultipartUploadResult initiateResult = client.initiateMultipartUpload(initiateRequest);

// Get an upload id that is used to tie each upload part together
String uploadId = initiateResult.getUploadId();  

Next we create a request to initiate the multipart upload. This notifies Glacier that a multipart upload is incoming, and Glacier responds with the upload id, which you will use to tie each subsequent part of the upload together.

// Upload each part and collect final checksum.
String checksum = uploadParts(client, file, uploadId);  

The logic to upload each part is sufficiently complex that I pulled it into a separate method. We will revisit it momentarily. For now, note that the uploadParts() method returns a checksum for the entire file that will be used below.

// Conclude the multipart upload.
CompleteMultipartUploadRequest completeRequest = new CompleteMultipartUploadRequest()  
    .withVaultName("<YOUR VAULT NAME>")
    .withUploadId(uploadId)
    .withChecksum(checksum)
    .withArchiveSize(String.valueOf(file.length()));

CompleteMultipartUploadResult completeResult = client.completeMultipartUpload(completeRequest);  

We conclude by creating a request to complete the multipart upload. This signals Glacier to reassemble and then commit the file.
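
One detail worth noting: Glacier identifies archives by id, not by name, so hold on to the id in the result; you will need it for any future retrieval. A sketch:

// Record the archive id somewhere durable; it is required to retrieve the archive later.
System.out.println("Archive ID: " + completeResult.getArchiveId());
System.out.println("Checksum:   " + completeResult.getChecksum());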

Now back to uploading each part.

private static String uploadParts(AmazonGlacierClient client, File upload, String uploadId) throws Exception {  
    long pos = 0; // long, not int: archives can easily exceed 2 GB

    int bytesRead = 0;

    FileInputStream uploadStream = new FileInputStream(upload);

    byte[] uploadBuffer = new byte[partSize];

    List<byte[]> partChecksums = new ArrayList<byte[]>();

    while(pos < upload.length()) {
        bytesRead = uploadStream.read(uploadBuffer, 0, uploadBuffer.length);
        if (bytesRead == -1) { break; }

        byte[] part = Arrays.copyOf(uploadBuffer, bytesRead);
        String partChecksum = TreeHashGenerator.calculateTreeHash(new ByteArrayInputStream(part));

        partChecksums.add(BinaryUtils.fromHex(partChecksum));

        UploadMultipartPartRequest partRequest = new UploadMultipartPartRequest()
            .withVaultName("<YOUR VAULT NAME>")
            .withBody(new ByteArrayInputStream(part))
            .withChecksum(partChecksum)
            .withRange(String.format("bytes %s-%s/*", pos, pos + bytesRead - 1)) // Standard HTTP Content-Range style format.
            .withUploadId(uploadId);

        System.out.print("Upload: ");

        client.uploadMultipartPart(partRequest);

        System.out.println("SUCCEEDED! (pos = " + pos + ")");

        pos = pos + bytesRead;
    }

    uploadStream.close();

    // The tree hash of the part checksums is the tree hash of the entire file.
    return TreeHashGenerator.calculateTreeHash(partChecksums);
}

A lot is going on there, so we will take it one step at a time.

The first several lines are just setup. The real logic starts with the while loop. A fixed number of bytes is read from the file on each pass. Each part can be as large as 4 GB, and Glacier allows a maximum of 10,000 parts per multipart upload.

This code uses a default part size of 1 MB (1024 × 1024 bytes), however, which artificially lowers the maximum file size: at 10,000 parts of 1 MB each, the largest archive you can upload is roughly 10 GB. Depending on your needs you can change the part size to whatever you like, as long as it is at most 4 GB and a power-of-two multiple of 1 MB.
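
If you would rather size parts dynamically than hard-code them, a small helper along these lines (my own sketch, not part of the downloadable code) picks the smallest valid part size for a given file:

// A sketch: Glacier part sizes must be 1 MB times a power of two, up to 4 GB.
// Pick the smallest such size that keeps the file within the 10,000 part limit.
private static long choosePartSize(long fileLength) {
    long partSize = 1024L * 1024L;             // 1 MB minimum
    long maxPartSize = 4096L * 1024L * 1024L;  // 4 GB maximum
    while (fileLength > partSize * 10000L && partSize < maxPartSize) {
        partSize *= 2;
    }
    return partSize;
}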

After the part is read, a checksum is calculated for that part and stored in a list of part checksums.

Next, a part upload request is made to Glacier. In that request you specify the vault name, the bytes of the part, the checksum, the range (discussed below), and the upload id. The upload id ties all the part uploads together so Glacier knows they belong to the same archive.

The range is specified in the standard HTTP Content-Range format and indicates to Glacier which bytes of the file this part contains. This allows Glacier to reassemble the parts into a single file before committing it.

Once the loop completes, a single checksum is calculated from all the part checksums. This checksum is returned, and is sent along with the complete multipart upload request.

And that is all there is!

Questions? Comments? Email me at: [email protected]!