I was able to answer some of my questions by reading through the Windows Azure Drives whitepaper, which explains in detail how an Azure Drive is built on top of Page Blobs. This means that it should be covered by the Windows Azure Storage SLA, which states:
Windows Azure has separate SLA’s for compute and storage. For compute, we guarantee that when you deploy two or more role instances in different fault and upgrade domains your Internet facing roles will have external connectivity at least 99.95% of the time. Additionally, we will monitor all of your individual role instances and guarantee that 99.9% of the time we will detect when a role instance’s process is not running and initiate corrective action.
For storage, we guarantee that at least 99.9% of the time we will successfully process correctly formatted requests that we receive to add, update, read and delete data. We also guarantee that your storage accounts will have connectivity to our Internet gateway.
This gives a yearly downtime window of around 262.8 minutes (about 4.4 hours) for web/worker roles at 99.95% and 525.6 minutes (about 8.8 hours) for storage, or for roles that require access to Azure Drives, at 99.9%. Windows Azure has regions similar to what Amazon AWS offers, but within a region there are no distinct Availability Zones. Instead there are Upgrade Domains and Fault Domains, which are used for rolling out updates and for placing role instances on different hardware racks. Fault domains are not user configurable, so if you want a higher level of availability you have to set up separate services in another region.
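The downtime figures follow directly from the SLA percentages. As a quick check, here is a minimal sketch of the arithmetic, assuming a 365-day year; the function name `allowed_downtime_minutes` is just an illustrative helper, not part of any SDK.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years are ignored


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the yearly downtime, in minutes, that an availability target permits."""
    return (100.0 - availability_percent) / 100.0 * MINUTES_PER_YEAR


# The two SLA levels quoted above.
for label, sla in [("compute, 99.95%", 99.95), ("storage, 99.9%", 99.9)]:
    minutes = allowed_downtime_minutes(sla)
    print(f"{label}: {minutes:.1f} minutes (~{minutes / 60:.1f} hours) per year")
```

Note that an SLA of this kind describes when service credits are owed, not a promise that downtime will be spread evenly or kept under that budget.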
I was not able to find a similar description of how Amazon EBS volumes are implemented, but it appears that they are NOT backed by Amazon S3 and instead run on a separate storage system. Amazon S3 is designed for 99.999999999% durability and 99.99% availability, but all that is mentioned for EBS is:
Amazon EBS volumes are placed in a specific Availability Zone, and can then be attached to instances also in that same Availability Zone.
Each storage volume is automatically replicated within the same Availability Zone. This prevents data loss due to failure of any single hardware component.
Amazon EBS also provides the ability to create point-in-time snapshots of volumes, which are persisted to Amazon S3. These snapshots can be used as the starting point for new Amazon EBS volumes, and protect data for long-term durability. The same snapshot can be used to instantiate as many volumes as you wish.
They also indicate that EBS has an expected annual failure rate of between 0.1% and 0.5%, compared with typical hard drives, which fail at around 4% a year. Since EBS volumes are based entirely in one Availability Zone, it is also important to create snapshots for backups:
EBS volumes have redundancy built-in, which means that they will not fail if an individual drive fails or some other single failure occurs. But they are not as redundant as S3 storage which replicates data into multiple availability zones: an EBS volume lives entirely in one availability zone. This means that making snapshot backups, which are stored in S3, is important for long-term data safeguarding.
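To put the quoted failure rates in perspective, the sketch below estimates the chance of losing at least one volume in a year across a fleet. It assumes failures are independent and that every volume has the same annual failure rate, which is a simplification; the fleet size of 100 and the helper name `prob_any_failure` are illustrative choices, not figures from Amazon.

```python
def prob_any_failure(annual_failure_rate: float, volumes: int) -> float:
    """Probability that at least one of `volumes` fails within a year.

    Assumes independent failures with an identical annual failure rate.
    """
    return 1.0 - (1.0 - annual_failure_rate) ** volumes


FLEET_SIZE = 100  # hypothetical number of volumes

for label, afr in [("EBS, low end", 0.001), ("EBS, high end", 0.005), ("typical disk", 0.04)]:
    chance = prob_any_failure(afr, FLEET_SIZE)
    print(f"{label} ({afr:.1%} AFR): {chance:.1%} chance of at least one failure")
```

Even at the low end of the range, a large enough fleet should expect to lose a volume eventually, which is why the snapshots matter.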
The postmortem report for the recent EBS/EC2 outage goes into much more detail about the architecture of EBS and indicates that the trigger was an invalid network configuration change. That change caused a number of volumes to become disconnected from their mirrors and quickly led to a “re-mirroring storm,” in which a large number of volumes were effectively “stuck” while their nodes searched the cluster for the storage space needed for new replicas.
This, combined with a few race conditions, improper back-off timeouts, and software bugs, caused the prolonged outage that affected multiple availability zones. Amazon has stated that they are taking a number of actions to prevent this from occurring in the future, including making the EBS control plane more tolerant of failures in individual availability zones.
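For readers unfamiliar with why back-off behavior matters in a retry storm, here is a minimal sketch of capped exponential backoff with random jitter, a common general technique for keeping many clients from retrying in lockstep. It is not Amazon's actual fix; the function name and the `base`, `cap`, and `attempts` parameters are assumptions made for illustration.

```python
import random
from typing import List


def backoff_delays(base: float, cap: float, attempts: int, rng: random.Random) -> List[float]:
    """Return one randomized delay (in seconds) per retry attempt.

    The upper bound doubles with each attempt until it reaches `cap`, and the
    actual delay is drawn uniformly between 0 and that bound so that many
    clients retrying at once spread out instead of colliding.
    """
    delays = []
    for attempt in range(attempts):
        ceiling = min(cap, base * 2 ** attempt)
        delays.append(rng.uniform(0.0, ceiling))
    return delays


# Example: eight retries, starting at a 0.5 s bound and capped at 30 s.
print(backoff_delays(base=0.5, cap=30.0, attempts=8, rng=random.Random()))
```

Without the cap and the jitter, thousands of nodes that fail at the same moment tend to retry at the same moment too, which keeps the overloaded service overloaded.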
In the end, systems that were designed to expect and tolerate failures were much less affected by the AWS outage. At a minimum, any system using Azure Drives or Amazon EBS should create regular backups using the provided snapshot feature, and may even want to consider shipping the snapshots to a separate region or a completely separate storage provider.
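As a rough illustration of the EBS half of that advice, the sketch below snapshots a volume and then copies the snapshot to a second region. It assumes the two client arguments behave like boto3 EC2 clients (using their `create_snapshot`, `get_waiter("snapshot_completed")`, and `copy_snapshot` calls); the function name `snapshot_and_copy` is hypothetical, and error handling, tagging, and retention are left out. The Azure Drive equivalent, snapshotting the underlying page blob, is not shown.

```python
def snapshot_and_copy(source_ec2, dest_ec2, volume_id, source_region,
                      description="scheduled backup"):
    """Snapshot an EBS volume, then copy the snapshot to another region.

    `source_ec2` and `dest_ec2` are assumed to be EC2 clients for the source
    and destination regions. Returns (source_snapshot_id, copied_snapshot_id).
    """
    snapshot = source_ec2.create_snapshot(VolumeId=volume_id, Description=description)
    snapshot_id = snapshot["SnapshotId"]

    # A snapshot has to reach the "completed" state before it can be copied.
    source_ec2.get_waiter("snapshot_completed").wait(SnapshotIds=[snapshot_id])

    # The copy is requested from the destination region's client.
    copy = dest_ec2.copy_snapshot(
        SourceRegion=source_region,
        SourceSnapshotId=snapshot_id,
        Description=f"copy of {snapshot_id}: {description}",
    )
    return snapshot_id, copy["SnapshotId"]
```

With boto3 installed and credentials configured, the clients could be created with something like `boto3.client("ec2", region_name="us-east-1")` for the source and another region name for the destination, and the function run on a schedule.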