Windows Azure Europe Cloud Outage: Microsoft Root Cause Analysis

August 3, 2012

By samdizzy

Why did Microsoft’s Windows Azure cloud suffer an outage in Western Europe on July 26? The software giant has just published a root cause analysis (RCA) about the incident.

In a blog post, Windows Azure General Manager Mike Neil said the 2-hour, 24-minute loss of connectivity affected a portion of the West Europe sub-region of Microsoft’s cloud.

“Prior to this incident, we added new capacity to the West Europe sub-region in response to increased demand. However, the limit in corresponding devices was not adjusted during the validation process to match this new capacity. Because of a rapid increase in usage in this cluster, the threshold was exceeded, resulting in a sizeable amount of network management messages. The increased management traffic in turn, triggered bugs in some of the cluster’s hardware devices, causing these to reach 100% CPU utilization impacting data traffic.”
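To make that failure mode concrete, here is a minimal sketch of the kind of validation check Neil's explanation implies was missing: compare each cluster's configured device limit against its provisioned capacity and flag any that lag behind. This is not Microsoft's tooling. The `Cluster` class, its field names and the numbers are all hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class Cluster:
    name: str
    provisioned_nodes: int  # capacity after an expansion (hypothetical field)
    device_limit: int       # limit configured on network devices (hypothetical field)


def limit_mismatches(clusters):
    """Return the names of clusters whose device limit is below provisioned capacity."""
    return [c.name for c in clusters if c.device_limit < c.provisioned_nodes]


# Made-up example data: cluster-b was expanded but its limit was never raised.
clusters = [
    Cluster("cluster-a", provisioned_nodes=1000, device_limit=1200),
    Cluster("cluster-b", provisioned_nodes=1500, device_limit=1200),
]

print(limit_mismatches(clusters))
```

Running a check like this as part of capacity validation would surface the stale limit before rising usage crosses it.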

Neil described Microsoft’s corrective steps and apologized for the impact and inconvenience the outage caused customers.

My spin? Big cloud services like Windows Azure and Amazon Web Services are still far more reliable than most on-premises IT systems. Skeptical? Imagine how many individuals called their PC support desks today — worldwide — to report personal PC or local server problems. We don’t hear about those thousands of individual nightmares each day. But when a cloud like Azure goes dark for a few hours, it makes headlines worldwide.

Admittedly, that theory provides little consolation to VARs and channel partners that have customers running in a cloud that goes dark. A growing number of channel partners, particularly managed services providers (MSPs), are starting to leverage remote monitoring and management (RMM) software to help customers track cloud SLA issues.

Another side note: Microsoft proactively alerted the media about the root cause analysis. Although I don’t always see eye-to-eye with Microsoft, I do respect the company’s tireless efforts to keep the media informed.
