Aftershock: An online movie giant rides a “cloud” tremor

On April 21, 2011, a digital earthquake hit the United States and reverberated around the world. Online businesses rattled and hummed as they coped with a massive vibration that shook their faith in the biggest IT revolution since the Internet: cloud computing.

Some businesses shared the same cloud skyscraper, and all went down at once. There were some big names in this skyscraper, including Netflix, Quora and FourSquare -- the very poster children of the cloud revolution.

The result is something IT departments call "downtime" or an "outage", a very troubling phenomenon which affects trust, security and business growth. It's during moments such as these that customers question the stability and security of the cloud in the face of both natural and man-made disasters.

The epicenter of this digital quake was in Northern Virginia, United States, at 10 p.m. The cloud data centers were owned by none other than Amazon, which has become one of the key Infrastructure-as-a-Service (IaaS) providers in the U.S. along with others, such as Virtual Internet.

One by one, FourSquare, Reddit and others tweeted downtime, since the cheap Amazon EC2 and Database services largely power their digital assets.

Amazon traced the issue to "stuck volumes", a failure to read from and write to disk in a subset of its Elastic Block Store (EBS) volumes.

"The trigger for this event was a network configuration change. We will audit our change process and increase the automation to prevent this mistake from happening in the future," said Amazon.

"However, we focus on building software and services to survive failures. Much of the work that will come out of this event will be to further protect the EBS service in the face of a similar failure in the future."

But while troubling, the event was also extremely valuable, because not everybody was affected in the same way or to the same degree. One company that came through better than most was Netflix.

"When we redesigned for cloud this Amazon failure was exactly the sort of issue that we wanted to be resilient to," said Netflix.

The tale of Netflix is thus a tale of survival in the cloud. It is also a call to action for IT managers involved in migrating their services to virtualized platforms.

Netflix had been waiting for an event like this to happen. The outage tested three principles the company had built into its new networks:

  • Fail Fast: Set aggressive timeouts such that failing components don’t make the entire system crawl to a halt.
  • Fallbacks: Each feature is designed to degrade or fall back to a lower quality representation. For example if we cannot generate personalized rows of movies for a user we will fall back to cached (stale) or un-personalized results.
  • Feature Removal: If a feature is non-critical then if it’s slow we may remove the feature from any given page to prevent it from impacting the member experience.
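The three principles above can be sketched in a few lines of Python. This is only an illustration, not Netflix's code: the names `call_with_timeout`, `personalized_rows`, `reviews_widget`, `build_page` and `CACHED_ROWS`, as well as the 0.2-second timeout, are assumptions made up for the example, and the "remote calls" are simulated with `time.sleep`.

```python
import concurrent.futures
import time

# Hypothetical stale-but-available results standing in for a cache.
CACHED_ROWS = ["Popular right now", "New releases"]

# Shared worker pool so a slow call never blocks the caller past its timeout.
_POOL = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def call_with_timeout(fn, timeout, *args):
    """Fail fast: return (ok, value), giving up after `timeout` seconds.

    The abandoned call keeps running in its worker thread; we just stop waiting.
    """
    future = _POOL.submit(fn, *args)
    try:
        return True, future.result(timeout=timeout)
    except concurrent.futures.TimeoutError:
        return False, None


def personalized_rows(user_id, delay):
    """Stand-in for a remote personalization call; `delay` simulates latency."""
    time.sleep(delay)
    return [f"Top picks for {user_id}"]


def reviews_widget(delay):
    """Stand-in for a non-critical feature; `delay` simulates latency."""
    time.sleep(delay)
    return "member reviews"


def build_page(user_id, rows_delay, reviews_delay, timeout=0.2):
    page = {}
    # Fallback: degrade to cached, un-personalized rows if the call is slow.
    ok, rows = call_with_timeout(personalized_rows, timeout, user_id, rows_delay)
    page["rows"] = rows if ok else CACHED_ROWS
    # Feature removal: a slow non-critical widget is left off the page.
    ok, widget = call_with_timeout(reviews_widget, timeout, reviews_delay)
    if ok:
        page["reviews"] = widget
    return page
```

With both delays at zero the page carries personalized rows and the reviews widget; with both delays above the timeout it carries only the cached rows, so the page degrades instead of stalling.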

Further, they worked redundancy into the system. "When provisioning capacity, we use AWS reserved instances in every zone, and reserve more than we actually use, so that we will have guaranteed capacity to allocate if any one zone fails," said Netflix.
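To see why reserving "more than we actually use" matters, consider a back-of-the-envelope calculation. The sizing rule below is an assumption for illustration (Netflix did not publish its formula): if peak load needs a given number of instances spread over several zones, each zone must reserve enough that the remaining zones can absorb the full peak when any one zone is lost. The function name and the figures are hypothetical.

```python
import math


def reserved_per_zone(peak_instances, zones):
    """Instances to reserve in each zone so that the other zones can carry
    the full peak load after any single zone fails (assumed sizing rule)."""
    if zones < 2:
        raise ValueError("need at least two zones to survive a zone failure")
    return math.ceil(peak_instances / (zones - 1))


peak, zones = 300, 3                        # illustrative numbers only
per_zone = reserved_per_zone(peak, zones)   # 150 per zone
total_reserved = per_zone * zones           # 450 reserved for a peak of 300
```

Under these assumed numbers the reservation is 50 percent larger than peak usage, which is the price of guaranteed capacity when one of three zones disappears.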

Netflix also leveraged NoSQL solutions wherever possible to take advantage of the added availability and durability -- even though this came at some expense to consistency. This allowed its system to degrade rather than fail.

Netflix admitted that much of the danger was averted with manual actions to remove dependence on the Amazon East Coast zones. As it grows and becomes a worldwide operation it will need to automate this intervention.

Further, architectural limitations with load balancing affected front-end services. “This meant that when the outage happened, we had to manually update all of our ELB endpoints to completely avoid the failed zone,” said Netflix.

One of the biggest lessons learned by the team involved embracing failure.

"One of the first systems our engineers built in AWS is called the Chaos Monkey. The Chaos Monkey’s job is to randomly kill instances and services within our architecture. If we aren’t constantly testing our ability to succeed despite failure, then it isn’t likely to work when it matters most – in the event of an unexpected outage," said Netflix.
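The idea behind the Chaos Monkey can be illustrated with a toy simulation. This is a sketch under stated assumptions, not Netflix's tool: the `Fleet` class, the `chaos_monkey` function and the instance names are hypothetical, and nothing here talks to AWS. It only shows the pattern of killing a random instance and then checking that the service can still serve.

```python
import random


class Fleet:
    """Hypothetical in-memory stand-in for a group of running instances."""

    def __init__(self, instance_ids):
        self.running = set(instance_ids)

    def terminate(self, instance_id):
        self.running.discard(instance_id)

    def can_serve(self, minimum=1):
        """The service is considered healthy with at least `minimum` instances."""
        return len(self.running) >= minimum


def chaos_monkey(fleet, rng):
    """Terminate one randomly chosen running instance and return its id."""
    if not fleet.running:
        return None
    victim = rng.choice(sorted(fleet.running))
    fleet.terminate(victim)
    return victim


fleet = Fleet(["i-a", "i-b", "i-c"])
rng = random.Random(42)            # seeded so a test run is repeatable
killed = chaos_monkey(fleet, rng)

# The check that matters: the service survives the loss of any one instance.
assert fleet.can_serve(minimum=2)
```

Running such a check routinely, rather than waiting for a real outage, is the point of the quote: resilience that is never exercised is unlikely to hold when it is needed.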

Industry observers offered their own thoughts on the Amazon failure, especially with regard to web hosting Service Level Agreements (SLAs).

“When you make the move into the cloud, you are doing so exactly because you want to give up control over the infrastructure level,” said the O’Reilly blog.

“The knee-jerk reaction is to look for an SLA from your cloud provider to cover this lack of control. The better reaction is to deploy applications in the cloud designed to make your lack of control irrelevant. It's not simply an availability issue; it also extends to other aspects of cloud computing like security and governance. You don't need no stinking SLA.”
