When Reddit Goes Down, Is The Cloud To Blame?
I’ve recently become addicted to Reddit, the extremely popular (more than 1 billion pageviews per month) social news and views site that breeds in-jokes like fungus. So when Reddit crashed for six hours starting Friday morning, it left many thousands of active users jonesing for a fix. When administrators finally came back with an explanation, they laid part of the blame on themselves, and the rest on Amazon Elastic Block Store (EBS). The outage is causing a lot of debate about the role of the cloud when demand gets high.
For every last down and dirty technical detail, feel free to check out Reddit’s official blog entry explaining the outage and how it was fixed. But long story short: Around the same time Amazon EBS started experiencing major latency issues, Reddit’s Postgres replication cluster started having inexplicable, serious issues, with data getting copied onto slave drives but not the masters. Reddit system administrators were forced to bring the site down to attempt to clean up the tangle of data the glitch left behind.
Eventually, they managed to fix the problem on their end at about the same time Amazon finished moving Reddit’s EBS volumes to better hardware. While Reddit has nothing but praise for the cost savings EBS offers over a traditional, on-premises SAN, it says the Amazon cloud service had reliability issues before this incident. As a result, Reddit is moving off EBS and onto local storage directly attached to Amazon EC2 instances.
Reddit being Reddit, this has attracted many opinions from armchair cloud experts. ReadWriteCloud has a very good roundup of comments, including several from former Reddit employees claiming the site should have left Amazon Web Services years ago. But the really interesting perspective comes from user “youknowitistrue,” whose comment boils down to the claim that it’s an “open secret” that on-premises infrastructure is always cheaper than IaaS clouds.
I think there are many TalkinCloud readers and managed services providers who would disagree. But it still raises some interesting questions, especially about Amazon Web Services in particular. Reddit may have a ton of visitors, a fairly robust feature set for users, and a ton of content. But it doesn’t serve any media itself, and pointedly keeps Adobe Flash and other complex web design elements to a minimum.
If Amazon Web Services can’t handle Reddit, should you trust it with your customers’ enterprise applications? We certainly welcome your perspectives.