Leap Year Bug Causes Massive Microsoft Windows Azure Outage

Matthew Weinberger

March 1, 2012

The Microsoft Windows Azure compute cloud experienced a service disruption from 8:45 p.m. Eastern Feb. 28 to 5:57 a.m. Eastern Feb. 29, 2012, leaving many customers without access to their applications. This time Microsoft had Julius Caesar to blame: The outage apparently was due to a software bug that resulted in an “incorrect time calculation” for yesterday’s Leap Day.

Microsoft Corporate VP of Server and Cloud Bill Laing, the man in charge of Azure’s engineering organization, took to a blog entry to explain the situation and to beg forgiveness: “First let me apologize for any inconvenience this disruption has caused our customers,” Laing wrote.

Here’s Laing’s summary of the issue and how it was (mostly) resolved, from that blog entry:

Yesterday, February 28th, 2012 at 5:45 PM PST Windows Azure operations became aware of an issue impacting the compute service in a number of regions. The issue was quickly triaged and it was determined to be caused by a software bug. While final root cause analysis is in progress, this issue appears to be due to a time calculation that was incorrect for the leap year. Once we discovered the issue we immediately took steps to protect customer services that were already up and running, and began creating a fix for the issue. The fix was successfully deployed to most of the Windows Azure sub-regions and we restored Windows Azure service availability to the majority of our customers and services by 2:57AM PST, Feb 29th.
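
For readers wondering how a leap day can trip up date arithmetic, here is a minimal Python sketch of one common failure. It is an illustration only, not Microsoft's code: Laing's post says root cause analysis is still in progress, and the function names below are hypothetical.

```python
from datetime import date


def naive_one_year_later(d: date) -> date:
    """Bump the year and keep month and day (hypothetical helper).

    Raises ValueError when d is Feb. 29 and the following year
    is not a leap year, because that date does not exist.
    """
    return d.replace(year=d.year + 1)


def safe_one_year_later(d: date) -> date:
    """Same calculation, but fall back to Feb. 28 when needed."""
    try:
        return d.replace(year=d.year + 1)
    except ValueError:
        # Only reachable for Feb. 29: clamp to the last valid day.
        return d.replace(year=d.year + 1, day=28)


leap_day = date(2012, 2, 29)

try:
    naive_one_year_later(leap_day)
except ValueError as err:
    print("naive calculation failed:", err)

print("safe calculation:", safe_one_year_later(leap_day))
```

The naive version works on 365 days out of every four years and fails only on Feb. 29, which is why this kind of bug can sit unnoticed until a leap day arrives.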

As of this writing, though, the Windows Azure Service Health Dashboard is still reporting problems in some regions, and Laing promised that fixes are coming to still-affected users ASAP. It’s worth noting that even at its peak, the disruption affected only Microsoft Windows Azure compute, not storage.

An interesting side note to this story: Vineet Jain, CEO of hybrid cloud storage vendor Egnyte, reached out to TalkinCloud with a statement suggesting that if Azure users had kept some of their cloud infrastructure behind the firewall, they could have avoided this costly disruption to their business:

“By maintaining a behind the firewall presence and syncing that to the public cloud, companies are creating an insurance policy just for these situations,” Jain said.

That’s certainly food for thought. In the meantime, between this outage and a series of Microsoft Office 365 disruptions, Redmond has a long way to go before customers and partners are likely to fully trust Microsoft’s cloud initiatives.
