So the title of this post is a play on lyrics from the Firefly theme song which occurred to me at the recent cloud workshop as I attempted to amuse my cloud friends, some of whom I think occasionally feel a bit like the crew of the Serenity as they retreat from a sort of Traditional IT Alliance. There was much discussion of the AWS partial service outage at the Cloud Computing Workshop last weekend organized by the Silicon Valley Cloud Computing Group. The outage has been spun in every possible way including suggestions it demonstrates cloud services are inherently unsafe. I think the more interesting take-away from the outage is that certain organizations' disaster avoidance plans - something we can do in the cloud, but less often in traditional IT - worked.
The notion of disaster avoidance is kind of like disaster recovery (DR) on steroids. Traditional DR often consists of a standby or secondary DR site with a plan to fail over to the secondary site with some acceptable loss of state or data. These fail-over plans tend to become stale and can be complicated by unforeseen technical issues, which can increase the time and effort required for fail-over. The target for this fail-over time is often called the recovery time objective, or RTO. The secondary environment generally has to equal the primary in hardware capacity and can be a large capex that is not always highly productive or well monetized.
Disaster avoidance scenarios feature multiple live application instances in different geographic locations that each service transactions according to load balancing or regional distribution. If the instances continuously replicate state, a failure of one instance can theoretically yield no loss of service as transactions are re-routed to the remaining instances. In reality, there is probably not a way to have zero service interruption, as transactions in flight in a failing environment can be lost, but minimal loss seems feasible. Disaster avoidance is fantastically expensive and hard to do in the world of traditional IT, but it seems people are succeeding in building DA in the cloud, where a given consumer's infrastructure cost is a tiny fraction of the whole.
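To make the re-routing idea concrete, here is a minimal sketch in Python, assuming a simple round-robin router in front of several live instances. The `Instance` and `Router` names and the site labels are hypothetical, invented for illustration; real deployments would use a load balancer or DNS-based routing with health checks rather than anything this simple.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass
class Instance:
    name: str             # hypothetical site label, e.g. "east"
    healthy: bool = True  # would normally be set by a health check


class Router:
    """Round-robin over live instances, skipping any marked unhealthy."""

    def __init__(self, instances):
        self.instances = list(instances)
        self._ring = cycle(self.instances)

    def route(self):
        # Try each instance at most once per request.
        for _ in range(len(self.instances)):
            candidate = next(self._ring)
            if candidate.healthy:
                return candidate.name
        raise RuntimeError("no healthy instances left")


east, west, eu = Instance("east"), Instance("west"), Instance("eu")
router = Router([east, west, eu])

before = [router.route() for _ in range(3)]  # all three sites take traffic
west.healthy = False                         # simulate a site failure
after = [router.route() for _ in range(4)]   # traffic flows to the survivors
```

The sketch only covers routing. It says nothing about the harder part, which is keeping state replicated so that the surviving instances can actually serve the re-routed transactions.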
At the workshop, Adrian Cockcroft and Chris Pinkham discussed the outage and the Netflix success story, which features disaster avoidance. Adrian described how they redesigned their platform to survive both atomic and systemic failures and how these design decisions served them well as they became aware of the degradations in the affected AWS region and decided to evacuate the region. The outage began in the wee hours of the morning when Netflix's traffic load was at its lowest point of the day. This allowed Netflix to remain operational in the zone, as the application workload was low enough that it continued to function despite the infrastructure degradation. As they realized the outage would continue into the peak demand period of the new day, they made the decision to evacuate the zone by reconfiguring load balancers to direct all traffic to the unaffected zones. This was the first time they had done this, and Adrian remarked that it was harder than they would have liked but worked in the end. They were able to evacuate the problem zone before entering their peak demand period and consequently had a minimal service disruption. Afterward I asked how they did this with no significant loss of application state; the answer seems to be that their data layers use S3 and/or SimpleDB and so this data gets replicated between regions automagically for the most part.
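The evacuation decision described above can be sketched as a small calculation: drain one zone's share of traffic onto the others, then check whether the remaining zones can carry the projected peak. This is a rough illustration only, not Netflix's actual procedure. The function names, zone labels, weights, and capacity figures are all made up for the example.

```python
def evacuate(weights, zone):
    """Return new traffic weights with `zone` drained.

    The drained zone's share is spread over the remaining zones in
    proportion to their existing weights. The result sums to 1.0.
    """
    remaining = {z: w for z, w in weights.items() if z != zone and w > 0}
    total = sum(remaining.values())
    if total == 0:
        raise ValueError("no zone left to absorb traffic")
    new_weights = {z: w / total for z, w in remaining.items()}
    new_weights[zone] = 0.0
    return new_weights


def can_absorb(peak_load, capacity, weights):
    """True if each zone's share of peak_load fits within its capacity."""
    return all(peak_load * w <= capacity[z] for z, w in weights.items())


# Hypothetical numbers: relative weights and requests/sec capacity per zone.
weights = {"zone-a": 2, "zone-b": 1, "zone-c": 1}
capacity = {"zone-a": 1000, "zone-b": 600, "zone-c": 600}

drained = evacuate(weights, "zone-a")
ok_at_peak = can_absorb(1000, capacity, drained)   # 500 per surviving zone
too_much = can_absorb(1400, capacity, drained)     # 700 per surviving zone
```

With these made-up figures the two surviving zones can carry a peak of 1000 requests per second but not 1400, which is the kind of check that makes evacuating before the peak period attractive.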
Adrian also gave an interesting talk recounting the Netflix experience in moving to AWS after realizing that it was impossible to build and operate data centers fast enough to accommodate their accelerating, and unpredictable, growth rate. They also realized they wanted disaster avoidance capabilities, in support of very high availability, and the cost of doing this in the cloud was an order of magnitude less.
There was much discussion of failure scenarios; a significant disruption to SimpleDB or S3 control planes, for example, could have made the outage more severe. AWS will probably design in more redundancy, regional fault tolerance and survivability as a result of this, and the next incident may well have a smaller impact. In the final analysis, all complex systems occasionally have service degradations, and most of us have experienced a complex cascade failure during our careers regardless of whether we use cloud-based or traditional on-premises infrastructure. In the long run, it's probably not a bad strategy to use multiple cloud providers, both for disaster avoidance and in order to leverage these options during contract negotiations. While it might not be feasible for an application the size of Netflix to operate on multiple providers, it's certainly an option for many of the rest of us.
Chris Pinkham made some interesting remarks on troubleshooting complex failures at scale to the effect that looking for "root causes" in very complex systems is the wrong approach. Complex systems at scale almost always have some level of failures and degradations present and major outages are not so much caused, as triggered, by a confluence of problems that combine in just the right way. Identifying these sorts of triggers, he argued, is a better method for getting to high uptime at scale. He also made some insightful remarks that very complex systems sometimes cannot be fully tested and threat modeled for fault conditions and the best approach is to plan for occasional degradations and design for survivability at the systemic level.
John Adams gave a very interesting talk on Twitter's experience growing and operating at scale, which provided a look inside their technical operations.