A Stormy Weekend in the Cloud
Starting late Friday night, a major outage in the eastern region of Amazon Web Services, the largest cloud hosting provider on the web, left a large swath of companies that depend on cloud services (Netflix, Instagram, and Pinterest, among others) with hours of downtime. It prompted many to question the reliability of cloud services, just as last year’s more serious, multi-day EBS outage had. The outage itself was caused (ironically) by a thunderstorm and the power failure that followed, and its effects lasted well into Saturday.
At BloomReach, cloud-based services, including Amazon’s, play a huge role in our infrastructure. And necessarily so: like so many others working with Big Data, we benefit greatly from the ability to spin up large compute clusters for data-intensive applications easily and on demand. We now have over three years’ experience running AWS (and several other cloud providers) in a production environment. Since our services are used on almost every page load by some of the largest, highest-traffic sites on the web, we focus obsessively on their reliability. Internally, outages like this one are significant concerns that demand a lot of engineering attention, both when designing our infrastructure for high availability and later, in the “war room,” when incidents actually occur. Now that the dust has settled, we thought we’d share a few thoughts on the subject.
During this incident, we suffered zero downtime for our critical customer-facing and end-user-visible services, including our APIs, CDN, and pixel-serving systems. None of our customers experienced any degradation of these services. This was mostly due to our aggressive use of multiple Availability Zones (AZs) for high availability in the eastern region. We were also helped by the fact that we avoid using Elastic Block Store (EBS) volumes for critical serving systems, as historically they have proven much less reliable than other parts of the AWS infrastructure. As a result, EC2 instance failures in one AZ do not take down our services, and our more critical systems were not affected by the EBS-related problems that followed the recovery of power. Finally, even in the case of a full-scale regional outage at AWS, we have global DNS failover mechanisms that let us serve from alternate hosting providers on relatively short notice.
In short, during this outage, as during the extensive EBS outage of April 2011, we were able to maintain continuous uptime on critical services. Still, it definitely wasn’t a walk on the beach. Our corporate website (right here) and dashboard took a multi-hour hiatus while we built a new server from the previous day’s backup. There was plenty of manual work to do as the EC2 instances, EBS volumes, ELBs, and RDS stores of our less critical back-end processing systems were impaired or lost in the power outage and then only partially recovered (a few of our EC2 instances and EBS volumes never came back and had to be replaced).
Does the shift of systems to the cloud mean less reliability? Bottom line: we don’t think so. But there are key differences. As with any piece of infrastructure, cloud infrastructure has risks, and you need to mitigate them with redundancy and by reducing dependencies wherever possible. The real differences between cloud-based hosting and traditional hosting are twofold. First, as more and more companies come to rely on the same cloud providers, like AWS, failures have an increasingly broad impact on the web, across many sites and services. Second, when using cloud services, you have much less internal technical visibility and control, especially for the more complex services like EBS or RDS. AWS and other cloud hosting providers generally cannot reveal detailed internals of their services, for security and business reasons. Effectively, this means you can’t be sure when or how things will be fixed when they go wrong. Instead, you need to plan for alternatives should these services fail. (This could mean falling back to instance storage instead of EBS, or to an alternate, possibly reduced, set of services on another cloud provider with different infrastructure.)
In the realm of high availability, it is impossible to plan for all eventualities, or to be 100% certain you have the right design. You only really learn from failure. (Indeed, it’s often best to actively cause failures in a controlled way, so you understand the consequences; a sketch of the idea follows below.) This is true for traditional data centers as well as cloud-based deployments, and we have certainly learned a lot about it as we have scaled our systems.
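To make the idea concrete, here is a minimal sketch of controlled failure injection, in the spirit of Netflix’s Chaos Monkey: terminate one random EC2 instance that has opted in via a tag, then watch whether your service degrades. This is not our production tooling; the boto3 client and the `chaos-eligible` tag name are illustrative assumptions.

```python
# Minimal failure-injection sketch (illustrative, not production tooling):
# terminate one random instance tagged "chaos-eligible" to exercise failover.
# Assumes boto3 is installed and AWS credentials are configured.
import random

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

# Find running instances that have opted in to failure injection.
reservations = ec2.describe_instances(
    Filters=[
        {"Name": "tag:chaos-eligible", "Values": ["true"]},  # hypothetical tag
        {"Name": "instance-state-name", "Values": ["running"]},
    ]
)["Reservations"]

instances = [
    inst["InstanceId"]
    for res in reservations
    for inst in res["Instances"]
]

if instances:
    victim = random.choice(instances)
    print(f"Terminating {victim}; the service should tolerate the loss")
    ec2.terminate_instances(InstanceIds=[victim])
else:
    print("No chaos-eligible instances found; nothing to do")
```

Restricting the filter to a single AZ turns the same script into a crude AZ-failover drill: take out everything in one zone and confirm that traffic shifts to the others.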
In fact, based on our own experience with these failures, we would conclude that AWS has a very good track record of availability, especially when used in a way that minimizes risks. In particular, here are a few of our own technical suggestions and observations:
- Always use multiple AZs — or better, multiple regions, if you can afford the network and latency costs — for critical systems. Test your AZ failover mechanisms.
- S3 has a very good record of durability. EC2, when used across multiple AZs, has a good record of availability. EBS, unfortunately, compromises a bit on both durability and availability; this is not surprising given its underlying complexity compared to instance storage. For durability, keep your data safely backed up with snapshots, S3, or other means (a simple snapshot job is sketched after this list). For availability, EBS is often placed in the serving path to work around the transience of EC2 instances; but if you can architect around instance storage instead, handling replication and failover yourself, it may be a more reliable alternative for critical serving systems. The same applies to RDS.
- AZs do provide additional reliability. Nonetheless, have backup plans for a full-region meltdown. However unlikely, several incidents, such as the failure of the EBS control plane last year and the multi-AZ network connectivity problems earlier this year, have shown that it is possible.
- Elastic MapReduce (EMR) has a somewhat higher incidence of operational failures, but this is usually not a significant problem, since its typical use cases are not time-sensitive.
- Avoid deep tie-in to any particular service offered by one provider. For example, have mechanisms to deploy on providers besides EC2, and global DNS solutions that allow for failover (one such failover check is sketched below).
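For the durability point above, a periodic snapshot job is often enough. Here is a hedged sketch that snapshots every EBS volume carrying a given tag; the `backup=daily` tag and the lack of a retention policy are simplifications for illustration.

```python
# Sketch of a periodic EBS backup job (illustrative, not production code):
# snapshot every volume tagged "backup=daily". Assumes boto3 and credentials.
import datetime

import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")

volumes = ec2.describe_volumes(
    Filters=[{"Name": "tag:backup", "Values": ["daily"]}]  # hypothetical tag
)["Volumes"]

stamp = datetime.datetime.utcnow().strftime("%Y-%m-%d")
for vol in volumes:
    snap = ec2.create_snapshot(
        VolumeId=vol["VolumeId"],
        Description=f"daily backup {stamp} of {vol['VolumeId']}",
    )
    print(f"Started snapshot {snap['SnapshotId']} for {vol['VolumeId']}")
```

A real job would also prune old snapshots and alert on failures, but the core pattern is this small.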
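And for the DNS-failover point, the mechanism can be as simple as a health check that repoints a record at a standby on another provider when the primary fails. The sketch below uses the Route 53 API purely because it is familiar; the zone ID, hostnames, and health-check endpoint are all assumptions, and in practice you would want the DNS control plane itself to be independent of the provider you are failing away from.

```python
# Sketch of DNS-based failover (illustrative): if the primary endpoint fails
# a health check, repoint the record at a standby on another provider.
# Zone ID, hostnames, and the /health endpoint are hypothetical.
import urllib.request

import boto3

ZONE_ID = "Z_EXAMPLE"                      # hypothetical hosted zone
RECORD = "api.example.com."
PRIMARY = "primary.aws.example.com"
STANDBY = "standby.otherhost.example.com"


def healthy(host: str) -> bool:
    """Crude HTTP health check against a serving endpoint."""
    try:
        with urllib.request.urlopen(f"http://{host}/health", timeout=5) as resp:
            return resp.status == 200
    except OSError:
        return False


if not healthy(PRIMARY):
    route53 = boto3.client("route53")
    route53.change_resource_record_sets(
        HostedZoneId=ZONE_ID,
        ChangeBatch={
            "Comment": "fail over to standby provider",
            "Changes": [{
                "Action": "UPSERT",
                "ResourceRecordSet": {
                    "Name": RECORD,
                    "Type": "CNAME",
                    "TTL": 60,  # short TTL so the switch propagates quickly
                    "ResourceRecords": [{"Value": STANDBY}],
                },
            }],
        },
    )
```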
We’d also like to thank the AWS team for working relentlessly to deliver continuously improving cloud services, for working with us directly on some of these challenges, and for what must have been a very long weekend of work bringing things back online.
Finally, if you have any thoughts on scaling and reliability in the cloud, we on the infrastructure engineering team would be glad to hear from you! Drop us a comment or an e-mail.