Microsoft experienced another Azure outage this week and had to turn to Twitter to notify users that they were aware of the problem.
In a tweet time-stamped 7:12 a.m. on Sept. 4, 2018, Azure Support stated, “Engineers are aware of an issue affecting resources in South Central US. For continued updates please visit the Azure status page at status.azure.com.”
That message surprised even hardened cloud veterans, such as SIOS Technology’s Dave Bermingham, who serves as the resiliency software provider's technical evangelist and cloud MVP. In a blog post, Bermingham recounted his experience with the Azure outage, which he suspected started an hour or two before Microsoft’s tweet. That suspicion was validated by tweets from customers asking @AzureSupport about problems with South Central U.S.
Adding insult to injury, Microsoft’s own recommended updates link, status.azure.com, was also unresponsive, according to Bermingham, making it nearly impossible to ascertain the breadth and depth of the outage, which he suspected was much larger than originally indicated by Microsoft. Upon further investigation, Bermingham discovered that services that relied on Azure Active Directory may have been impacted as well, and customers attempting to provision new subscriptions were encountering problems.
Fast-forward 24 hours, and it seems that some Azure users were still encountering difficulties as evidenced by Microsoft’s 11:00 UTC update.
It wasn't until this afternoon that Redmond declared all systems normal.
The outage, it turned out, was caused by a severe weather event, a lightning strike near one of Microsoft’s data centers, which in turn triggered a structured power-down process. Bermingham notes that no one can blame Microsoft for a natural disaster such as a lightning strike; however, he says, it should serve as a reminder for partners that responsibility for customer up-time does not rest entirely with the cloud provider.
“If your only disaster-recovery plan is to call, tweet and email Microsoft until the issue is resolved, you just received a rude awakening,” he said. “It is up to you to ensure you have covered all the bases.”
Bermingham followed that statement with some advice on how MSPs might mitigate disasters such as this particular Azure outage.
- Implement Availability Sets (Fault Domains/Update Domains): In this scenario, even if you had built failover clusters or leveraged Azure load balancers and Availability Sets, the entire region went offline, so you still would have been out of luck. While it is a best practice to leverage Availability Sets, especially for planned downtime, they aren't bulletproof.
- Define Availability Zones: While not yet available in the South Central U.S. region, the Availability Zones being rolled out in Azure could have minimized the impact of the outage. Assuming the lightning strike impacted only one data center, data centers in the other Availability Zones should have remained operational. However, outages of non-regional services, such as Azure Active Directory (AAD), seem to have impacted multiple regions, so Bermingham doesn’t think Availability Zones would have isolated customers completely.
- Use global load balancers and cross-region failover clusters: Whether you are building SANLess clusters that cross regions or using global load balancers to spread the load across multiple regions, you could have minimized the impact of the larger outage, but you may have still been susceptible to the AAD outage.
- Adopt hybrid- and cross-cloud architectures: About the only way to guarantee resiliency in a cloudwide failure scenario such as the one Azure just experienced is a DR plan that replicates data in real time to a target outside the primary cloud provider, along with a plan to bring applications online quickly in that other location. The two locations should be entirely independent and should not rely on services, such as AAD, from your primary location being available. The DR location could be another cloud provider; in this case, AWS or Google Cloud Platform seem like logical alternatives. It could also be your own or a secondary customer data center, although that somewhat defeats the purpose of running in the cloud in the first place. Also, look into new "backup as a service" offerings that can be sold by partners.
- Examine your SaaS options: While software-as-a-service offerings such as Azure Active Directory, Azure SQL Database and the many others from major cloud providers seem enticing, you need to plan for the worst-case scenario. Because you are trusting a business-critical application to a single vendor, you may have very little control over DR options, including recovery outside of the current cloud service provider.
For partners, this might be one of the more difficult truths to present to customers. Partners should vet the SaaS provider's DR options before implementing any critical service. If recovery outside of the cloud isn't an option, line-of-business decision makers need to be aware of that before you sign them up for the service. Minimally, make business stakeholders aware that if the cloud service provider has a really bad day and that service is offline, there might be nothing you or they can do about it other than call and complain.
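To make the Availability Zones recommendation concrete, here is a minimal Azure CLI sketch (not from Bermingham's post; the resource group, VM and load-balancer names are hypothetical, and it assumes the Azure CLI is installed and logged in, and that the chosen region supports zones). It spreads two VMs across separate zones behind a zone-redundant Standard load balancer:

```shell
# Hypothetical resource names; assumes an authenticated Azure CLI session
# and a region (e.g., East US 2) that supports Availability Zones.
az group create --name rg-ha-demo --location eastus2

# Standard-SKU public IP and load balancer are zone-redundant by default
az network public-ip create --resource-group rg-ha-demo \
    --name pip-ha-demo --sku Standard

az network lb create --resource-group rg-ha-demo --name lb-ha-demo \
    --sku Standard --public-ip-address pip-ha-demo

# Pin each VM to a different zone, so a single-data-center failure
# (such as a lightning strike) takes out at most one of them
az vm create --resource-group rg-ha-demo --name vm-z1 \
    --image UbuntuLTS --zone 1
az vm create --resource-group rg-ha-demo --name vm-z2 \
    --image UbuntuLTS --zone 2
```

As Bermingham cautions, this protects only against a single-zone failure; it would not have helped with the multi-region AAD problems he observed, which is why the cross-cloud options above still matter.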
According to Bermingham, in the not-too-distant future, cloud-service customers will start to hear more and more about hybrid-cloud options and cross-cloud availability, which he says are the only ways an organization can insulate itself from downtime like this latest Azure outage. Partners need to educate themselves and their customers on architectures that deliver robust high availability, business continuity and DR.