Microsoft Azure Outage: What Went Wrong?
In a post on the Azure blog, Jason Zander, corporate vice president for Microsoft Azure, noted that the issues caused service interruptions across the United States, Europe and parts of Asia. The outage, as we mentioned yesterday, took down several Microsoft services, as well as third-party cloud services hosted on Microsoft’s infrastructure-as-a-service (IaaS) offering.
Zander apologized to customers before providing additional background on the outage. It was a good move, especially considering that Microsoft took its share of flak for the way it handled communication during a major service outage.
According to Zander, the performance update scheduled for Azure Storage was underway when “an issue was discovered that resulted in reduced capacity across services utilizing Azure Storage, including virtual machines, Visual Studio Online, websites, search and other Microsoft services.”
Zander explained further: “During the rollout we discovered an issue that resulted in storage blob front ends going into an infinite loop, which had gone undetected during flighting. The net result was an inability for the front ends to take on further traffic, which in turn caused other services built on top to experience issues.”
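For readers unfamiliar with the term, "flighting" generally refers to pushing a change to a small slice of infrastructure before a broad rollout. Zander's post does not describe Microsoft's actual tooling, so the following is only a minimal sketch of the general idea, assuming a simple stage-by-stage rollout with a health check after each stage. Every name in it (staged_rollout, apply_update, is_healthy, roll_back, and the stage labels) is hypothetical and not taken from Azure.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class RolloutResult:
    completed: bool              # True only if every stage passed its health check
    updated: List[str]           # stages still running the update at the end
    failed_stage: Optional[str]  # first stage that failed its health check, if any


def staged_rollout(
    stages: List[str],
    apply_update: Callable[[str], None],
    is_healthy: Callable[[str], bool],
    roll_back: Callable[[str], None],
) -> RolloutResult:
    """Apply an update one stage at a time (hypothetical sketch).

    If a stage fails its health check, every stage updated so far is
    rolled back in reverse order and later stages are never touched.
    """
    updated: List[str] = []
    for stage in stages:
        apply_update(stage)
        updated.append(stage)
        if not is_healthy(stage):
            for done in reversed(updated):
                roll_back(done)
            return RolloutResult(completed=False, updated=[], failed_stage=stage)
    return RolloutResult(completed=True, updated=updated, failed_stage=None)
```

A gate like this only stops a rollout if the health check actually exercises the failing code path. A defect that stays hidden in the early slice, as Zander says happened here, would pass straight through it.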
Unfortunately, discovering the issue was only the first step. The change that caused it was quickly rolled back, Zander wrote, but the storage front ends had to be restarted to undo the damage. It took some time to get customers back online, and a small group of customers was still experiencing issues yesterday when Zander published his post, but almost everything seemed to be working by 11:45 a.m. UTC, approximately 11 hours after the service disruption began.