Here’s How the Software-Defined Network that Powers Azure Works
For a couple of decades now, Albert Greenberg has been quietly transforming the internet. Something of a legend in network engineering circles, he did some of the fundamental early research on software-defined networking and has been one of the principal engineering minds behind both AT&T’s turn from a phone company into an internet company and the design of one of the world’s largest cloud platforms: Microsoft Azure.
While at AT&T Labs, he created the “fair queuing” scheduling algorithm, which became transformational for the company, and in 2005 he co-authored a paper on using software to centrally control an entire network, outlining the ideas behind what we now call software-defined networks and network functions virtualization.
The IEEE, which called him a visionary who “revolutionized how the industry designs and operates large-scale backbone and data center networks critical to today’s cloud services,” has given him an award, and so have the ACM, SIGCOMM, and the National Academy of Engineering. In short, Greenberg is one of the architects behind the infrastructure that powers the world of business and communications as it exists today.
He came to Microsoft in 2007 as a principal researcher, but for the last seven years Greenberg has been in charge of the design and development of networking and SDN for Azure, a role that spans physical data center networking, edge network services, and everything in between.
In a recent interview with Data Center Knowledge, Greenberg told us about the key principles behind Azure’s SDN. His Virtual Layer 2 (VL2) architecture, which he describes as “a Layer 3 cross fabric that spans the entire data center,” is at the heart of the Azure and Bing networks and part of what he calls the software-defined data center.
Automation is key to managing a massive, high-bandwidth network built with commodity components. The network state service that Azure uses as its control plane abstracts away the individual networks. You could think of it as PowerShell’s Desired State Configuration for Windows Server, applied at enormous scale: it gets away from “snowflakes,” the individual, manually created custom network configurations.
“The network state service allows us to express the target state, or the desired state, and drive from the observed state to the desired state for each and every device and every virtual network, and every network virtual function,” he explained to us. “The network as a whole is moving always from whatever we observe into the target state through the network state service. It’s running constantly and allows us to manage these huge networks without people making judgment calls. Human beings don’t have many nines in their judgment calls, and we don’t want things to be a judgment call; we want the system to have enough capacity that it’s ok to have failures and just route around it. There’s no other way to get high reliability.”
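The observed-to-desired loop Greenberg describes can be sketched in a few lines of Python. This is a minimal illustration and not Azure's network state service: the `Device` class, the `diff` and `reconcile` functions, and the settings shown are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Device:
    """Hypothetical stand-in for a switch, virtual network, or network function."""
    name: str
    observed: dict = field(default_factory=dict)

    def apply(self, key, value):
        # A real system would push configuration to the device here.
        if value is None:
            self.observed.pop(key, None)
        else:
            self.observed[key] = value


def diff(observed, desired):
    """Return the settings that must change to reach the desired state."""
    changes = {}
    for key, value in desired.items():
        if observed.get(key) != value:
            changes[key] = value
    for key in observed:
        if key not in desired:
            changes[key] = None  # None means "remove this setting"
    return changes


def reconcile(devices, desired_state):
    """One pass of the loop: drive every device toward its desired state."""
    total_changes = 0
    for device in devices:
        target = desired_state.get(device.name, {})
        for key, value in diff(device.observed, target).items():
            device.apply(key, value)
            total_changes += 1
    return total_changes


devices = [
    Device("tor-1", {"mtu": 1500, "debug_port": "on"}),  # a hand-edited "snowflake"
    Device("tor-2", {"mtu": 9000}),
]
desired = {"tor-1": {"mtu": 9000}, "tor-2": {"mtu": 9000}}

first_pass = reconcile(devices, desired)   # corrects the two stray settings on tor-1
second_pass = reconcile(devices, desired)  # nothing left to change
```

Running the same pass repeatedly is harmless once the network has converged, which is what lets a loop like this run constantly without a person deciding when to intervene.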
VL2 relies on directly controlling the ASICs in network switches; there is no dedicated network hardware available to drive a network at this scale. Take ExpressRoute, for example: the private connections customers make directly from their own networks into Azure are provisioned inside the virtual Azure network. Greenberg describes it as a virtual data center-scale router. “If we have say 1,000 customers each with 10,000 routes, that’s [10 million] routes. You can’t buy a router that has that many routes, but you can distribute it across the data center, and it’s all software-controlled.”
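To make the scale argument concrete: with the per-customer figures in the quote, 1,000 customers times 10,000 routes comes to 10 million routes, yet spread over the half-million servers the article cites for a region, that is only a few dozen routes per host on average. The Python sketch below works through the arithmetic and shows one possible way to spread routes across hosts. Hash-based sharding is an assumption made for illustration, not a description of how ExpressRoute places routes, and `owner_host` is a hypothetical helper.

```python
import hashlib
from collections import Counter

CUSTOMERS = 1_000
ROUTES_PER_CUSTOMER = 10_000
HOSTS = 500_000  # upper bound the article gives for one region

total_routes = CUSTOMERS * ROUTES_PER_CUSTOMER
avg_routes_per_host = total_routes / HOSTS


def owner_host(customer_id, prefix, num_hosts):
    """Map a (customer, prefix) pair to a host index with a stable hash."""
    key = f"{customer_id}:{prefix}".encode()
    digest = hashlib.sha256(key).digest()
    return int.from_bytes(digest[:8], "big") % num_hosts


# Small demo: 20 customers x 50 routes spread over 10 hosts.
load = Counter()
for customer in range(20):
    for i in range(50):
        prefix = f"10.{customer}.{i}.0/24"
        load[owner_host(customer, prefix, 10)] += 1

print(f"{total_routes:,} routes, about {avg_routes_per_host:.0f} per host")
print("demo routes per host:", dict(sorted(load.items())))
```

No single table has to hold everything; each host only needs the slice it is responsible for, and software decides which slice that is.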
To be able to mix and match the best switch hardware as vendors develop it without wasting time porting its Azure Cloud Switch software every time, Microsoft uses the open source Switch Abstraction Interface (SAI) API to program the ASICs. The company open sourced SAI via the Open Compute Project, the open source data center technology community launched by Facebook.
That’s how the load balancer for Azure is distributed throughout the entire data center, alongside the load it’s balancing. “We created a software load balancer for the whole data center; it’s not one device or two devices backing each other up; there’s software load-balancing functionality in every single host. That’s how we get incredible scale; we load-balance the load balancer itself!”
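One way a load balancer can live in every host without a central device is for each host to compute the same flow-to-backend mapping independently. The Python sketch below uses rendezvous (highest-random-weight) hashing to do that. The technique is chosen here for illustration and the article does not say which algorithm Azure uses; `HostAgent`, `pick_backend`, and the addresses are hypothetical.

```python
import hashlib


def _score(flow, backend):
    """Deterministic pseudo-random score for a (flow, backend) pair."""
    data = f"{flow}|{backend}".encode()
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def pick_backend(flow, backends):
    """Rendezvous hashing: choose the backend with the highest score for this flow."""
    return max(backends, key=lambda backend: _score(flow, backend))


class HostAgent:
    """Hypothetical per-host load-balancing function; holds only the VIP table."""

    def __init__(self, name, vip_table):
        self.name = name
        self.vip_table = vip_table  # virtual IP -> list of backend addresses

    def route(self, flow):
        src_ip, src_port, vip, dst_port, proto = flow
        return pick_backend(flow, self.vip_table[vip])


vip_table = {"203.0.113.10": ["10.0.0.1", "10.0.0.2", "10.0.0.3", "10.0.0.4"]}
host_a = HostAgent("host-a", vip_table)
host_b = HostAgent("host-b", vip_table)

flow = ("198.51.100.7", 40312, "203.0.113.10", 443, "tcp")
# Both hosts reach the same decision without talking to each other.
same_choice = host_a.route(flow) == host_b.route(flow)
```

A useful property of this scheme is that removing one backend only moves the flows that were assigned to it; every other flow keeps its backend, so adding hosts adds load-balancing capacity without reshuffling traffic.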
And to coordinate all that across a data center with 500,000 servers, Azure uses a Virtual Networking SDN controller built on Azure Service Fabric, Microsoft’s cloud framework for building distributed applications.
The fabric is built using microservices. “The idea of microservices was perfect for us,” Greenberg told us. “We love these single-function microservices that are independently deployable and independently updateable. If there’s a problem you roll back, if it fails it’s ok, if we need more copies we grow Service Fabric. We found that allowed us to have a huge leg up in building our resilient controller.”
The network never stops monitoring and testing itself. “We build these massive, high-bandwidth physical networks to take care of any fault and failure, but we also have a huge number of probes running with lots of logs and data, and we use the cloud itself to mine that data so we can diagnose and isolate that specific bad component.”
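As a rough illustration of how probe data can point to a specific bad component, the Python sketch below ranks components by the share of failed probes that crossed them. It is a deliberately simple heuristic under assumed inputs, not Azure's diagnostic pipeline: the probe format (a path plus a success flag), the component names, and `suspect_components` are all hypothetical.

```python
from collections import defaultdict


def suspect_components(probes, min_probes=2):
    """Rank components by failure rate across the probes that traversed them.

    probes: iterable of (path, ok) where path is a list of component names
    and ok is True if the probe succeeded.
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for path, ok in probes:
        for component in path:
            totals[component] += 1
            if not ok:
                failures[component] += 1
    rates = {
        component: failures[component] / totals[component]
        for component in totals
        if totals[component] >= min_probes  # ignore components with too little data
    }
    # Highest failure rate first; ties broken by name for a stable order.
    return sorted(rates.items(), key=lambda item: (-item[1], item[0]))


probes = [
    (["tor-1", "agg-1", "tor-2"], True),
    (["tor-1", "agg-2", "tor-2"], False),
    (["tor-3", "agg-2", "tor-4"], False),
    (["tor-3", "agg-1", "tor-4"], True),
    (["tor-1", "agg-1", "tor-4"], True),
]

ranked = suspect_components(probes)
top_suspect = ranked[0][0]  # "agg-2": every probe through it failed
```

With enough overlapping probes, the healthy components on a failed path are cleared by the successful probes that also crossed them, which is why a large volume of probe data helps isolate the one bad part.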
The data mining is done with Trill, the .NET query processing engine developed by Microsoft Research, which is also used for the analytics behind Bing Ads and for query processing in Azure Stream Analytics. It’s ideal for Azure networking because it can handle enormous scale.
A single cloud availability region, which can be up to half a million servers, has a single “truth,” or network state, propagated across all those servers. Smaller cluster managers, each overseeing about a thousand servers, receive instructions on what the desired state is for the network and start programming their respective clusters via agents running on each host. Greenberg referred to those agents as “stateless foot soldiers,” which implement instructions from the cluster managers.
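The three tiers described above (one regional source of truth, cluster managers of about a thousand servers each, and stateless agents on every host) can be sketched as follows. This is a toy model under stated assumptions: the class names, the `publish` and `push` methods, and the shape of the state dictionary are invented for the example and do not come from Azure.

```python
class HostAgent:
    """Stateless "foot soldier": it makes no decisions, it only applies instructions."""

    def __init__(self, host_id):
        self.host_id = host_id
        self.programmed = None  # what is currently programmed on this host

    def program(self, state):
        self.programmed = dict(state)


class ClusterManager:
    """Oversees one cluster and programs each of its hosts via the agents."""

    def __init__(self, agents):
        self.agents = agents

    def push(self, desired_state):
        for agent in self.agents:
            agent.program(desired_state)
        return len(self.agents)


class RegionController:
    """Holds the single regional "truth" and hands it to every cluster manager."""

    def __init__(self, num_hosts, cluster_size=1000):
        agents = [HostAgent(i) for i in range(num_hosts)]
        self.clusters = [
            ClusterManager(agents[i:i + cluster_size])
            for i in range(0, num_hosts, cluster_size)
        ]

    def publish(self, desired_state):
        return sum(cluster.push(desired_state) for cluster in self.clusters)


# A scaled-down region: 5,000 hosts in clusters of 1,000.
region = RegionController(num_hosts=5_000)
programmed = region.publish({"vnet-42": {"subnet": "10.1.0.0/16"}})
print(len(region.clusters), "cluster managers programmed", programmed, "hosts")
```

Because the agents keep no decisions of their own, a failed agent or host can be replaced and simply handed the current desired state again by its cluster manager.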
“We can build in the resiliency we need by having redundancy at every level in the cluster manager and [at] the regional level. That gives us this overall principle that we apply to compute, to storage, and to the network, of always optimizing for a network-wide view of all the assets to meet network-wide objectives on utilization and allocations. That’s how we’re able to create the software-defined data center.”