Intermedia Back Online After Partial Hosted Exchange Outage
Intermedia, one of the largest and best-known providers of Hosted Exchange, suffered a partial service outage on April 15. Business seemed to be getting back to normal by April 16. Here’s a look at what happened, and how Intermedia got its customers and partners back online. Plus, a look at the way Intermedia continued to communicate with the media during the outage, instead of going radio silent.
First, a little background and some qualifiers. The VAR Guy’s parent, Nine Lives Media Inc., depends on a range of hosted and cloud services. Our resident blogger knows how painful it can be when online systems go dark. Even worse, he despises it when cloud providers go mum during a service outage.
To Intermedia’s credit, the company sent The VAR Guy updates on April 16, explaining the service disruption and the corrective actions the company took. Here’s a look at what happened, through the eyes of Bob Leibholz, VP of sales and business development at Intermedia. Leibholz provided most of the information on the afternoon of April 16, while The VAR Guy was flying from San Francisco to New York. Hence, our resident blogger is a bit delayed with this report.
What Happened, Part I
At 12:27 p.m. Eastern on April 16, Leibholz offered the following overview:
Yesterday morning [April 15, 2010], a SAN hardware failure (one of two main processors rebooted itself in a panic mode!) in Domain 20 impacted 12% of Intermedia’s mailboxes (30,000 of about 250,000). The SAN failure caused a cascade of issues, including an attempt by the system to automatically reboot about 50 servers simultaneously. While the hardware issue was diagnosed and a workaround activated midday, impacted mailboxes experienced serious residual mail queue and connectivity delays. All queues were cleared and mail appeared to be functioning normally by 1 a.m. PT.
This morning [April 16, 2010] at approximately 4 a.m. PT, mail delays reappeared. At this time the problem with the SMTP hubs is unknown, but is most likely fallout from yesterday’s issue. We paused external email delivery to Domain 20 while resolving the problem.
The SAN hardware failure also generated mail filtering issues on Domain 21, creating mail queue delays in 20,000 mailboxes yesterday afternoon. These delays were resolved by early evening Pacific time, and everything continues to function normally.
As you know, we use new, premium hardware to minimize downtime. In the event downtime occurs, we are committed to transparency. Our service status web page — publicly available — was updated as the above issues unfolded. We’ve also been in touch personally and repeatedly with affected partners to the extent possible. Per our standard practice, we will also follow-up with a detailed Reason for Outage report that includes specific corrective actions. This report will be issued to all impacted customers.
What Happened, Part II
At 12:56 p.m. on April 16, Leibholz was back with another update:
Right now, all hubs are processing mail and we have started accepting external mail. Delays exist. Delays for sending internally or externally (both only pass through the hubs) is about 15 minutes. External to internal mail delivery will be significantly delayed—possibly a few hours as we catch up and clear out the queue. I have not sent a global update yet, because we just started receiving external mail and I want to make certain the system holds.
Looking Ahead and Lessons Learned
Fast forward to the present (Sunday, April 18, 2010). The VAR Guy is flying to Miami for another conference. But the flight has WiFi, and Leibholz offered the following update over email:
Service was back up for all Domain 20 customers around 1:00 Friday [April 16]. There were mail queues clearing for the 20th domain on Friday afternoon, so there was a delay in inbound mail until that stuff processed. But my understanding is that all queues were cleared later that afternoon. All other domains were fine on Friday.
No doubt, the cloud still isn’t perfect. Everyone from corporate giants (Microsoft, Rackspace, Amazon) to upstarts (ConnectWise, Intermedia) has suffered some form of outage in recent months.
But the best companies remain responsive in their corporate communications. Instead of going silent or speculating about the problem, Intermedia promised The VAR Guy a factual update as soon as some information was available. Leibholz delivered those updates.
April 15 and 16 certainly weren’t fun days for Intermedia and a portion of the company’s partner base. But Intermedia’s track record as a white-label service for solutions providers has been strong, The VAR Guy believes.
All that said, The VAR Guy hates even the briefest service outage as much as the next guy…