Microsoft Azure's Big Bet on SDN
The demands on the network are growing exponentially as more and more customers run their applications in the cloud. As a result, the hardware-based networks of the past are no longer flexible or cost-effective enough to handle rapidly growing and changing workloads and requirements. This is why Software Defined Networking (SDN) is a cornerstone of Microsoft Azure, the Microsoft public cloud, which runs both Microsoft's own and third-party applications. With over 100 datacenters operating globally at hyper-scale, Azure storage and compute usage doubling every six months, and 1,000 new Azure users every day, Microsoft has had to learn how to run a software-defined datacenter within its own infrastructure to deliver reliable, scalable Azure services to a rapidly growing user base.
When it comes to networking, Microsoft has had to build a model that provides customers the same level of control over network features and services when running in the cloud as they have on their own dedicated networks. Through one of the largest SDN deployments in the world, Azure delivers rich, flexible, on-demand provisioned per-customer virtual networks (Vnets) to meet this customer need. Vnets have seen explosive growth in Azure over the last few years, reinforcing the need for a reliable, highly scalable method of delivery.
Vnets are built using overlay and Network Functions Virtualization (NFV) technologies implemented in software running on inexpensive servers on top of a shared physical network. The physical network itself is built using commodity gear optimized for performance and reliability. By focusing the hardware on providing a high-speed forwarding plane and focusing the software on creating a highly flexible control plane, Vnets deliver a wide set of features and functions at a higher scale and reliability than can be delivered on dedicated infrastructure. These industry-leading capabilities include routing and service chaining, access control lists (ACLs), load balancing, and IP addressing. Additionally, through VPN gateways or a private peering solution called Microsoft Azure ExpressRoute, Azure Vnets can be deployed as an extension of a customer’s on-premises network, and customers can deploy virtual appliances in arbitrary topologies inside Vnets. A key differentiator for Microsoft Vnets is that, with Windows Azure Pack and System Center, the same technologies are available for the private cloud, and custom hardware can plug into Vnets in the private cloud as needed by customers.
The key challenges in delivering Vnets are scale, reliability, and security. A public cloud environment like Azure consists of millions of cores and virtual machines (VMs), hosting hundreds of thousands of customers spread across the globe. Vnets must be provisioned on the order of seconds, and millions of Vnet operations must be supported per day. Large Vnets consisting of tens of thousands of VMs must co-exist with small Vnets consisting of one or two VMs. To achieve this, Microsoft applies SDN principles, combining distributed, highly available controllers with host-based components.
Azure Controllers are organized as a set of inter-connected and hierarchical services. This includes services for MAC management, IP address management, ACLs or connectivity management, and Vnet management. Each service is partitioned to scale, and it runs consensus-based protocols on multiple instances to achieve high availability. A partition manager service is responsible for partitioning the load amongst these services based on subscriptions. Gateway manager services then use the partition service to determine where to route requests.
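The subscription-based partitioning described above can be illustrated with a minimal sketch. This is not Azure's implementation: the class names, the partition names, and the choice of a SHA-256 hash to pick a partition are all assumptions made for illustration. It only shows the general idea of a partition manager that maps each subscription to one partition, and a gateway that consults it before routing a request.

```python
import hashlib


class PartitionManager:
    """Hypothetical partition manager: maps a subscription to one partition."""

    def __init__(self, partitions):
        if not partitions:
            raise ValueError("at least one partition is required")
        self.partitions = list(partitions)

    def partition_for(self, subscription_id: str) -> str:
        # Hashing is an assumption; it gives a stable, evenly spread mapping.
        digest = hashlib.sha256(subscription_id.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % len(self.partitions)
        return self.partitions[index]


class GatewayManager:
    """Hypothetical gateway: asks the partition manager where to send a request."""

    def __init__(self, partition_manager: PartitionManager):
        self.partition_manager = partition_manager

    def route(self, subscription_id: str, request: str):
        target = self.partition_manager.partition_for(subscription_id)
        return (target, request)


# Example usage with made-up partition and subscription names.
manager = PartitionManager(["vnet-mgmt-0", "vnet-mgmt-1", "vnet-mgmt-2"])
gateway = GatewayManager(manager)
target, request = gateway.route("subscription-a", "create-vnet")
```

Because the mapping depends only on the subscription identifier, every request for the same subscription lands on the same partition, which is what lets each partition run its own consensus group independently. A production design would also need to handle repartitioning when the partition count changes, which this sketch ignores.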
These services are built using Microsoft’s service platform, Service Fabric. Service Fabric provides a highly available platform for building and hosting application services that automatically update and self-heal. Service Fabric has been battle-tested in Azure and in several Microsoft cloud services such as Azure SQL Database, Cortana, and Azure Data Factory. In addition, there is an address lookup service that is itself implemented as a hierarchical service. Network functions like load balancing and VPNs are implemented as a combination of a distributed control plane and a stateless scale-out data plane running on commodity servers. A stateless service called Network Service Manager (NSM) acts as a worker and drives programming from the network controller to the NFV services. NSM also drives programming on all the Azure hosts.
At the host level, Azure SDN consists of network agents programming a virtual switch. Microsoft’s private cloud and Azure public cloud both use the same SDN v-switch, the Azure Virtual Filtering Platform (VFP). VFP is a programmable switch based on match-action tables that provides data plane primitives to apply actions on packets, including encap/decap, stateful NAT, quality of service, metering, ACLs, and more. VFP provides stateful (connection-based) matching as a basic primitive, recognizing that users usually want to program rules for connections rather than just packets. VFP implements rule compilation logic and optimized data structures for fast packet processing and fast rule updates, caching and tracking all active flows in the system. VFP exposes VFPAPI, an easy-to-program abstract SDN interface, to network agents. The agents receive policy from the controllers and program it as match-action rules through VFPAPI.
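To make the match-action idea concrete, here is a minimal sketch of a rule table with a per-flow cache. It is not VFP or VFPAPI: the `Packet`, `Rule`, and `MatchActionSwitch` names, the priority convention (lower number wins), and the string actions are all hypothetical. The sketch assumes the first packet of a flow is evaluated against the ordered rules and the decision is then cached under the flow's 5-tuple, so later packets of the same flow skip rule evaluation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

FlowKey = Tuple[str, str, int, int, str]


@dataclass(frozen=True)
class Packet:
    src_ip: str
    dst_ip: str
    src_port: int
    dst_port: int
    protocol: str

    def flow_key(self) -> FlowKey:
        # The 5-tuple identifies the flow this packet belongs to.
        return (self.src_ip, self.dst_ip, self.src_port, self.dst_port, self.protocol)


@dataclass
class Rule:
    priority: int                      # assumption: lower value is matched first
    match: Callable[[Packet], bool]    # predicate over packet fields
    action: str                        # e.g. "allow", "drop", "encap"


class MatchActionSwitch:
    """Hypothetical match-action table with a flow cache."""

    def __init__(self, default_action: str = "drop"):
        self.rules: List[Rule] = []
        self.flow_cache: Dict[FlowKey, str] = {}
        self.default_action = default_action

    def add_rule(self, rule: Rule) -> None:
        self.rules.append(rule)
        self.rules.sort(key=lambda r: r.priority)
        # Cached decisions may be stale after a rule change, so discard them.
        self.flow_cache.clear()

    def process(self, packet: Packet) -> str:
        key = packet.flow_key()
        if key in self.flow_cache:
            return self.flow_cache[key]          # fast path: known flow
        action = next(
            (r.action for r in self.rules if r.match(packet)),
            self.default_action,
        )
        self.flow_cache[key] = action            # slow path result is cached
        return action


# Example usage: allow HTTPS, drop everything else by default.
switch = MatchActionSwitch()
switch.add_rule(Rule(priority=10, match=lambda p: p.dst_port == 443, action="allow"))
decision = switch.process(Packet("10.0.0.4", "10.0.0.5", 50000, 443, "tcp"))
```

Clearing the whole cache on every rule update is the simplest correct choice for a sketch. A real switch would need something far more selective to keep rule updates fast, and true connection tracking would also cover the reverse direction of each flow.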
By leveraging host components and doing much of the packet processing on each host running in the datacenter, the Azure SDN data plane scales massively. Not only does it scale out, but it also scales up, with per-node bandwidth growing from 1G to 10G to now 40G and still increasing. To scale up, Azure has invested heavily in network interface controller (NIC) offloads with Azure SmartNICs. With SmartNIC, Microsoft is bringing Field Programmable Gate Array (FPGA) technology into servers to achieve the programmability of SDN with the performance of dedicated hardware.
To protect tenants from each other, the Azure host also implements mechanisms for network isolation. Each VM is allowed only L3 connectivity to any other VM (even on the same host), thereby ensuring that a VM cannot hijack traffic for another VM. Protocols such as DHCP and ARP are secured, effectively putting each VM in its own VLAN. Additionally, by virtualizing the address space of each customer’s VMs, Azure SDN ensures that one customer cannot send traffic into, or receive traffic from, another customer’s network or the physical Azure infrastructure.
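The isolation that address virtualization provides can be sketched as a lookup keyed by both the Vnet and the customer address. This is an illustration only: the `VnetDirectory` class, the `forward` function, the field names in the returned dictionary, and all addresses are hypothetical, and the real address lookup service and encapsulation format are not described here. The point is that two customers can use the same private address without colliding, and a destination that is not registered in the sender's own Vnet simply cannot be resolved.

```python
from typing import Dict, Optional, Tuple


class VnetDirectory:
    """Hypothetical mapping of (vnet_id, customer_ip) to a physical host address."""

    def __init__(self):
        self._mappings: Dict[Tuple[str, str], str] = {}

    def register(self, vnet_id: str, customer_ip: str, host_ip: str) -> None:
        self._mappings[(vnet_id, customer_ip)] = host_ip

    def resolve(self, vnet_id: str, customer_ip: str) -> Optional[str]:
        # Lookups are always scoped to a single Vnet.
        return self._mappings.get((vnet_id, customer_ip))


def forward(directory: VnetDirectory, src_vnet: str, dst_customer_ip: str) -> Optional[dict]:
    """Return a simplified encapsulation header, or None if the packet is dropped."""
    host_ip = directory.resolve(src_vnet, dst_customer_ip)
    if host_ip is None:
        return None  # destination is not part of the sender's Vnet
    return {"outer_dst": host_ip, "vnet_id": src_vnet, "inner_dst": dst_customer_ip}


# Example usage: two customers reuse 10.0.0.4 without conflict.
directory = VnetDirectory()
directory.register("vnet-a", "10.0.0.4", "192.0.2.10")
directory.register("vnet-b", "10.0.0.4", "192.0.2.20")
directory.register("vnet-b", "10.0.0.9", "192.0.2.21")
header = forward(directory, "vnet-a", "10.0.0.4")
blocked = forward(directory, "vnet-a", "10.0.0.9")
```

Because the sending host, not the VM, supplies the Vnet identifier used for the lookup, a VM has no way to name an address in another customer's network or in the physical infrastructure.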
Azure SDN is built on robust distributed systems technologies and has over time been enriched to match the varied needs of a growing set of applications, legacy and new, being deployed into Azure. Without SDN, it would not have been possible to deliver the scale, rich network semantics, or security that customers desire.