Amazon EBS, snapshots as incremental backups

I'm working on an automated mechanism for our EBS volumes to be backed up on a daily basis.

I know quite well the steps to create a new snapshot. Apparently it's all quite simple, you have an EBS volume which you can snapshot, and you can restore the snapshot anytime. Fine.

But my concern is the size of the snapshots. I know these snapshots are stored compressed in S3, and we're charged based on their size. If we have large amounts of data, each backup we make could add significantly to the invoice.

However, according to Amazon's pages, these snapshots are incremental. That would solve my problem, as the daily backup would only upload the data that has changed since the last snapshot. But this leads me to the next question: if the backup is incremental and we're only uploading the modified data, where is the original data being stored? (i.e., the first snapshot, which obviously couldn't have been done incrementally...)

Unfortunately, I haven't been able to find this information anywhere in Amazon's documentation.

Does anybody have experience with snapshots and their billing?

I'd appreciate any help, thanks!

Boarder answered 24/6, 2011 at 14:52 Comment(0)

I don't think that you'll find detailed documentation as to how the snapshots are implemented; it's not something I have come across. They do have documentation for "Projecting Costs". However, I think if you know how it works, you can intuit the bill, and feel more at ease with it.

Note that these snapshots are not "incremental" in the way we may have come to understand that term from the DOS operating system. In DOS, the "archive" bit was set when a file was modified, and an "incremental" backup copied only the files that had their "archive" bit set. The backup process would clear the archive attribute, so a future edit to the file would cause it to be backed up "incrementally" once again.

With snapshots, each block of the volume is flagged if it is modified; it's not done on a file-by-file basis. After the first snapshot, only blocks that have been flagged as modified are backed up, just like "incremental" backups in DOS. But that's where the similarities end: for each block it doesn't have to copy, it doesn't just skip it; it writes a pointer to where the last (unchanged) copy of the data is stored.

When you make the first snapshot of a volume, the data is broken up into blocks. From Amazon: "Volume data is broken up into chunks before being transferred to Amazon S3. While the size of the chunks could change through future optimizations, the number [...] can be estimated by dividing the size of the data that has changed since the last snapshot by 4MB."
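That estimate is simple arithmetic. As a rough illustration (a sketch, assuming a fixed 4 MB chunk size and that the chunk count is just the changed bytes rounded up to whole chunks):

```python
import math

CHUNK_SIZE = 4 * 1024 * 1024  # 4 MB, per Amazon's documented estimate


def estimate_chunks(changed_bytes):
    """Estimate how many chunks a snapshot would transfer to S3."""
    return math.ceil(changed_bytes / CHUNK_SIZE)


# First snapshot of a 1 GiB volume: all data counts as changed.
print(estimate_chunks(1024 ** 3))          # 256 chunks
# A later snapshot after modifying ~10 MB of data:
print(estimate_chunks(10 * 1024 * 1024))   # 3 chunks
```

So a daily snapshot that touches only a small fraction of the volume transfers (and is billed for) a correspondingly small number of chunks.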

The next snapshot you make consists of data for only those blocks that have changed, and pointers to the blocks that haven't changed. Those pointers point to blocks of data in the previous snapshot.

The next snapshot (n) is made by recording the data of each block that has changed since the previous snapshot (n-1), along with pointers for the blocks that haven't changed since the previous snapshot (n-1). These pointers point to corresponding blocks in the previous snapshot, which may contain data or another pointer to its previous snapshot. Eventually, every pointer ends up at a block of real data (one that hasn't changed since that snapshot was created).
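The chain described above can be modeled in a few lines of Python. This is a toy model, not Amazon's implementation: each snapshot stores real data only for changed blocks, and everything else is resolved by walking back through earlier snapshots (the "pointers"):

```python
def take_snapshot(volume, prev_snapshot, changed_blocks):
    """Store real data only for changed blocks; unchanged blocks are
    'pointers' (here: simply absent, resolved by walking the chain)."""
    return {"prev": prev_snapshot,
            "data": {i: volume[i] for i in changed_blocks}}


def read_block(snap, i):
    """Follow the pointer chain back until a snapshot holds real data
    for block i."""
    while snap is not None:
        if i in snap["data"]:
            return snap["data"][i]
        snap = snap["prev"]
    raise KeyError(i)


# A volume with 4 blocks
vol = ["a0", "b0", "c0", "d0"]
s1 = take_snapshot(vol, None, {0, 1, 2, 3})  # first snapshot: all blocks
vol[1] = "b1"
s2 = take_snapshot(vol, s1, {1})             # only block 1 changed
print([read_block(s2, i) for i in range(4)])  # ['a0', 'b1', 'c0', 'd0']
```

Restoring from s2 walks the chain and reassembles the full volume, even though s2 itself stored only one block of real data.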

Now let's say you decide to delete snapshot (x). Snapshot (x) has snapshots made before it (x-1) and after it (x+1). Amazon replaces the pointers in snapshot (x+1) with pointers and data from snapshot (x) (the one being deleted). As a result, any actual data in snapshot (x) is copied to snapshot (x+1), unless snapshot (x+1) already has its own, more recent copy of that block.
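That merge step can be sketched in the same toy-model terms (an assumption-laden illustration of the behavior described, not Amazon's actual code): deleting a middle snapshot folds its real data forward into the next snapshot wherever the next one doesn't already have newer data for that block.

```python
def delete_predecessor(next_snap):
    """Delete next_snap's predecessor (snapshot x), folding its real
    data forward into next_snap (snapshot x+1)."""
    doomed = next_snap["prev"]
    if doomed is None:
        return
    for i, data in doomed["data"].items():
        # Keep next_snap's own (more recent) copy if it has one.
        next_snap["data"].setdefault(i, data)
    next_snap["prev"] = doomed["prev"]


# Chain: s1 (full) -> s2 (block 1 changed) -> s3 (block 2 changed)
s1 = {"prev": None, "data": {0: "a0", 1: "b0", 2: "c0", 3: "d0"}}
s2 = {"prev": s1, "data": {1: "b1"}}
s3 = {"prev": s2, "data": {2: "c2"}}

delete_predecessor(s3)     # delete s2: its "b1" is copied into s3
print(s3["data"])          # {2: 'c2', 1: 'b1'}
print(s3["prev"] is s1)    # True
```

After the merge, s3 still restores the volume exactly as it was at s3's point in time; only the ability to restore to s2's point in time is lost.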

This is how snapshots work, where the data is stored, and why the size of the snapshots is manageable. You can understand from this how deleting a snapshot destroys only your ability to bring back the volume as it was at the point in time when that snapshot was created, without destroying your ability to use the other snapshots. Unlike simple, traditional "incremental" backups that don't use pointers, the snapshots that remain are updated as needed to maintain their usefulness when a snapshot they depend on is deleted. This is why it makes sense that Amazon charges more for intelligent snapshot storage than for simple copies of EBS volumes. Finally, it's understandable that it's difficult to predict how much snapshot storage is going to cost, since it is so dynamic.

Sphenoid answered 25/6, 2011 at 2:37 Comment(4)
I find your comment very useful. I'm very interested and curious about that process; however, it doesn't seem to be publicly documented. I haven't been very lucky in my Google searches so far; as you said, detailed documentation isn't easily available. Do you know any useful link on that matter? The main issue is that we're snapshotting quite a few volumes weekly and don't want to get too scared by the next bill. Many thanks. – Boarder
I still don't think that you'll find detailed documentation, and I'm not holding out; I haven't seen any. My understanding is synthesized from the study of data structures, specifically doubly linked lists. No need to fear the next bill. Test snapshotting a few smaller volumes hourly. In Amazon, click on "Account" (found on the far right of the menu above the console) and select "Usage Reports" from the menu on the left. For Service, choose "Amazon Elastic Compute Cloud". For Usage Types, choose "All Usage Types". For Operations, choose "SnapShotPutUsage" or "SnapShotUsage". [Out of room. :)] – Sphenoid
You're right, I can see the bill before it gets too late. However, it would have been useful to find some proper documentation :(, Amazon is disappointing me a little bit. Thank you again for all your help!! – Boarder
@Boarder Can you please update us: how did it turn out for you? How much was your bill? – Apostolate
