Inside the Microsoft Azure Datacenter Architecture
This is a summary of a great session by Mark Russinovich talking about Azure.
Inside Azure Datacenters
Data center design
When Microsoft plans new data centers they start with geographies (geos), which are gradually coming down to the country level (Germany, France, the UK, etc). Within a geo there are at least two regions (a region pair). Each region is defined by a bandwidth and latency envelope:
- < 2 ms latency diameter (round trip)
- Customers see regions, not data centers
- The regions in a pair sit in different fault and flood zones, electrical grids, hurricane zones, etc
- The regions in a pair are typically hundreds of miles apart
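To put the "< 2 ms round trip" figure in context, here is a rough back-of-envelope calculation (my own, not from the session) of how much fibre distance fits inside that latency budget:

```python
# Back-of-envelope: how far apart can two data centers be and still fit
# inside a < 2 ms round-trip latency envelope?
# Light travels through optical fibre at roughly 2/3 of its vacuum speed,
# which works out to about 200 km per millisecond.
FIBRE_SPEED_KM_PER_MS = 200

rtt_budget_ms = 2                      # the 2 ms round-trip envelope
one_way_ms = rtt_budget_ms / 2         # 1 ms in each direction
max_fibre_km = FIBRE_SPEED_KM_PER_MS * one_way_ms

print(max_fibre_km)  # 200.0 -> at most ~200 km of fibre, before switching delays
```

This is why the tight latency envelope applies within a region, while the regions in a pair, being hundreds of miles apart, communicate with noticeably higher latency.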
Within a region are Availability Zones to allow very fast failover.
- < 600 microsecond latency diameter
- At least three AZs to allow quorum based failover
- Subscriptions are striped in groups of 3 over multiple physical AZs
Availability Zones are in preview for VMs, Managed Disks, VIPs, VM Scale Sets (VMSS) and Load Balancers (LB)
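The striping means that "zone 1" for one subscription need not be the same physical AZ as "zone 1" for another, which spreads load and avoids everyone piling into the same physical zone. A hypothetical sketch of such a per-subscription mapping (the real algorithm is not public; all names here are invented):

```python
import hashlib
from itertools import permutations

# Hypothetical sketch: each subscription gets a stable permutation of the
# region's physical AZs for its three logical zones. NOT the real algorithm.
PHYSICAL_AZS = ["az-a", "az-b", "az-c"]
PERMS = list(permutations(PHYSICAL_AZS))  # the 6 possible orderings

def logical_to_physical(subscription_id: str, logical_zone: int) -> str:
    """Map a subscription's logical zone (1-3) to a physical AZ."""
    digest = hashlib.sha256(subscription_id.encode()).digest()
    perm = PERMS[digest[0] % len(PERMS)]  # stable per subscription
    return perm[logical_zone - 1]

# The mapping is deterministic for a given subscription, but different
# subscriptions will often see a different physical AZ behind "zone 1".
print(logical_to_physical("sub-1234", 1))
```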
Data center security
Data center security is multi-level, covering access approval, the perimeter, the building and the server environment. For example, a technician coming to the data center is cleared at the gate, then again inside the building, and is then only given access to the precise location within the data center where they need to perform their work.
Data center sustainability
Data center sustainability is an ongoing area of improvement. Today the data centers are carbon neutral; by 2018, 50% of energy use is intended to come from renewables, with the long-term goal of powering the data centers entirely from renewable energy. The renewable sources used depend on the location of the data center. To help reduce energy requirements, Microsoft is also developing fuel cells, reducing the amount of energy that has to be delivered to the cabinet.
Inside Azure Physical Networking
Microsoft have a global optical network (over 30,000 miles of dark fibre), with an inter-region backbone using SWAN for traffic shaping and an intra-region collection of Regional Network Gateways. There are over 4,500 peering points and 130 edge locations. The Marea cable (a joint venture between Microsoft, Facebook and Telxius) went live last month between Virginia and Spain, after an over-reliance on the existing trans-Atlantic cables between New York and England was noticed. The capacity of the new fibre cable is 160 Tbps.
Regional Network Gateways
Regional Network Gateways (RNGs) are built on 100 Gbps network optics, with two RNGs per region in Small, Medium and Large t-shirt sizes depending on the number of data centers they serve. They are based on a Microsoft technology project called Madison, which uses QSFP28 form factor connectors drawing less than 4.5 W per plug, with an 80 km reach using 40 channels. This allows the deployment of 1.689 Pbps of inter-data-center switching, shrinking what would be 16×2 racks of traditional networking gear down to 2 racks, with massively reduced power consumption to boot.
Inside Azure Logical Networking
Software Defined Networking (SDN) consists of a management, a control, and a data plane. For example, the management plane creates a tenant; the control plane plumbs the tenant's ACLs down to the switches; and the data plane applies those ACLs to flows on the hosts. The idea is to push as much of the configuration as possible down to the host, with a scalable management architecture sitting on top of it.
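The division of labour between the three planes can be illustrated with a toy sketch (invented names and structures, nothing like the actual Azure implementation):

```python
# Toy illustration of the three SDN planes. All names are invented.
switch_acls = {}  # state the control plane has plumbed toward the hosts

def create_tenant(tenant, allowed_ports):
    """Management plane: capture intent ('this tenant allows these ports')."""
    plumb_acls(tenant, allowed_ports)

def plumb_acls(tenant, allowed_ports):
    """Control plane: distribute the tenant's ACLs to the switching layer."""
    switch_acls[tenant] = set(allowed_ports)

def allow_flow(tenant, dst_port):
    """Data plane: per-flow decision made on the host, using pushed-down state."""
    return dst_port in switch_acls.get(tenant, set())

create_tenant("contoso", [80, 443])
print(allow_flow("contoso", 443))  # True
print(allow_flow("contoso", 22))   # False
```

The key point the sketch captures is that the per-packet decision uses only local state on the host; the management layer above it never sits on the data path.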
Here is the overall Azure Architecture
Focusing in on the networking, we have Resource Providers (RPs), in particular the Network Resource Provider, and underneath that the Network Regional Manager, which manages resources at the region level. The Network RP works with the Compute RP, coming together on a server through a Node Agent and the Network Agent within it, which manages the load balancers, the virtual network, and the virtual platform.
Microsoft are being asked more and more to keep service traffic off the Internet, and VNet service endpoints are what is used to do that. Today this is supported for Storage accounts and Azure SQL Database, but the aim is to make it universally available within the next 12 months or so. So you can define rules that only allow access to a storage account from a specific VNet, which removes access from anywhere else, including over the Internet.
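Conceptually, a service-endpoint rule is a simple allow-list evaluated at the service. A hypothetical sketch (the rule names are invented; the real evaluation happens inside the Azure networking stack):

```python
# Hypothetical sketch of a service-endpoint style rule: a storage account
# that only accepts traffic originating from one specific VNet.
allowed_vnets = {"prod-vnet"}

def access_permitted(source: str) -> bool:
    """source is a VNet name, or 'internet' for any non-VNet origin."""
    return source in allowed_vnets

print(access_permitted("prod-vnet"))  # True
print(access_permitted("internet"))   # False
```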
With all of this logic being pushed down to the host there is a lot of processing to do. Host SDN scaling is currently at 40 Gbps, soon to be 50 Gbps, and ultimately looking at 100 Gbps down to the host. This burns a lot of CPU cycles at 40 Gbps, so Microsoft have replaced vSwitches with FPGA-based SmartNICs. This removes the CPU load and reduces latency from 115 microseconds down to 30 microseconds, allowing the FPGA to achieve near line-rate throughput, while still retaining all of the ACL and load-balancing logic in the Filtering Platform.
Inside Azure Servers
The servers in the Azure data centers have moved from a gen 2 server (1 Gbps network, 32 GB RAM) to gen 6 (50 Gbps network, 192 GB RAM). The gen 6 server is more aligned with the workloads people actually run.
The gen 6 server contains FPGAs, high-density hard disks, M.2 SSDs, Intel/AMD/ARM64 processors and battery-backed RAM. It also contains a NIST-compliant custom ASIC designed to enforce pre-boot integrity, boot integrity and runtime integrity of all of the firmware components in the server.
We’re using FPGAs to accelerate everything, and are working on a project to accelerate deep learning to the point where it can be real-time (under 1 ms) rather than the current batch operation. The current gen delivers around 1.5 teraflops; the next gen looks to be around 40 teraflops.
Inside Azure Compute
Going back to the overall Azure architecture, we now want to drill down on the Compute Resource Provider.
The story starts with the Azure Resource Manager (ARM), which is a global service run as micro-services on Service Fabric. It is weakly consistent and tolerates the failure of regions without impact to the service. This is how the Azure portal can show you all your resources regardless of which regions they are in.
Beneath ARM we can see resource providers that are regional in nature, such as compute, regional network manager, regional directory service, software load balancer. These again are tolerant to failures where failures within one region won’t impact services from another region.
Underneath that we have clusters. In the compute world a cluster is between 1,000 and 3,000 servers, each of which is called a node, where agents support the higher-level services, just like in the SDN stack.
The Compute Resource Provider consists of at least seven micro-services, including a repository for images and extensions, patch management, an orchestrator, a disk manager and, of course, the VM manager itself. The diagram below shows the relationships between the various sub-components.
To show the power of Service Fabric, Mark did a demo deploying containers to a single cluster of just over 3,500 virtual servers. The time measured was for the containers to actually be in a running state: over 1,000,000 containers were deployed in 1 minute and 45 seconds!
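That headline number works out to a startling sustained rate:

```python
# The demo's figures: one million containers reaching the running state
# in 1 minute 45 seconds.
containers = 1_000_000
elapsed_s = 1 * 60 + 45          # 105 seconds

rate = containers / elapsed_s
print(round(rate))               # ~9,524 containers started per second
```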
The compute platform is being made more secure using ‘confidential computing’. This is a new service aimed at protecting confidential data, and the design takes Microsoft out of the trust equation through the use of trusted execution environments (TEEs). A TEE is a black box where your code and data meet: the TEE can prove that it is running trusted code, and only then is it given the decryption keys. It can then run the code and access the data, but nothing outside the TEE can. Today the TEE uses Virtual Secure Mode (VSM) from Hyper-V; another TEE will use Software Guard Extensions (SGX) from Intel. The advantage of this approach over SQL Always Encrypted is that instead of just being able to determine equivalence, you can run any computation or query you like against the encrypted data, and the TEE will process it against the decrypted data within the TEE itself.
Discussions about storage now generally come down to capacity. Microsoft are seeing systems that need exabytes to zettabytes of data, so of course Microsoft is focused on reducing the cost associated with storage. The gen 0 system is shown below at a cost of 1, and the current gen 4 system is 0.05× that cost. Microsoft believes there are still over 80% efficiency gains left in flash and hard disk technologies, and over 95% for tape. The consistent factor seems to be the ratio of price per byte between flash, hard disk and tape: roughly 10 to 1 from flash to hard disk, and again 10 to 1 from hard disk to tape.
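Those ratios compound in a way that is easy to check:

```python
# The session's rough price-per-byte ratios: tape = 1, hard disk = 10x tape,
# flash = 10x hard disk. So flash ends up ~100x the per-byte cost of tape.
TAPE_COST = 1
DISK_COST = 10 * TAPE_COST
FLASH_COST = 10 * DISK_COST

# And the generational improvement: gen 4 at 0.05x the cost of gen 0.
GEN4_RELATIVE_COST = 0.05

print(FLASH_COST)                     # 100 -> flash vs tape, per byte
print(round(1 / GEN4_RELATIVE_COST))  # ~20x cheaper than gen 0
```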
At Microsoft tape is used in Azure to provide archival data, with a rack containing two robots and 72 tape drives, allowing 16 of these side by side for expansion. A Microsoft research project called Pelican aims to use disk in a giant storage server, with over 1,100 hard disks. The smart piece in this is that Pelican only powers up the disks needed for the operation undertaken. Pelican is intended to be deployed in data centers that are not big enough to house a tape expansion area.
Microsoft has another project, called Silica, which stores data on glass. The issue has always been getting lasers to pulse for a short enough time, but that has now been solved. A femtosecond laser burns a 6 × 6 pattern of voxels into a 25 micron square, with each voxel storing 8 bits in the glass, so a piece of glass about 1 inch square can hold 50 TB of data!
Project Palix is a storage project that looks to use DNA to store data. It is 1,000 times denser than tape and can survive for over 2,000 years without rewriting. The idea is that you encode the data using gene sequencing. If successful this would be a quantum leap in storage: a single rack could then hold 1 zettabyte of data.
Inside the Azure OS
Windows Azure is built on Windows Server 2016. There are customizations to really optimise the OS for a hyperscale cloud, such as:
- removal of all 32 bit code
- no managed code
- no language packs
- removal of all unused drivers
- minimal roles
There is still the challenge of updating the OS. A lot of updates require a reboot, and when you’re talking about tens of thousands of servers in a data center running a large number of virtual machines, that is quite an impact.
To help with that Microsoft are working towards reducing the need for reboots, in fact they’re confident that they’ll get to a place where reboots are not needed at all for patching.
Virtual Machine Preserving Host Update (VM-PHU)
The idea is to replace the host OS with a new host OS whilst pausing the VMs for a short period of time. The aim is 9 seconds, which is the TCP timeout in a reliable network.
VM-PHU light can replace the entire virtualization stack except for the hypervisor itself, so we can replace networking and storage stacks in around 3 seconds.
VM-PHU Ultra Light
VM-PHU Ultra Light can replace everything in the virtualization stack except for the component that manages the core state. All of the fixes done in Azure could have been patched with Ultra Light. This takes around 200 ms.
Hot Patching aims to reduce the impact to near zero. Not every change is suitable for hot patching, but it has been used over 50 times in Azure to address vulnerabilities. The idea is that the running code is modified so that instead of calling the known-bad function it calls a known-good function instead. The changes are made while the system is running; a demo showed a hot patch applied to a network driver to address a bug in a vSwitch, with no loss of traffic at all.
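As a loose analogy (at the Python level, nothing like a real kernel hot patch), the technique amounts to redirecting callers from the known-bad function to a known-good one while everything keeps running:

```python
import types

# Analogy only: swap a known-bad function for a known-good replacement at
# runtime, the way a kernel hot patch redirects a function's entry point.
def checksum_bad(data):
    return sum(data) % 255        # known-bad: wrong modulus

def checksum_good(data):
    return sum(data) % 256        # known-good replacement

# Callers reach the function through an indirection point...
module = types.SimpleNamespace(checksum=checksum_bad)

# ...so the "hot patch" is a single atomic rebind, no restart needed.
module.checksum = checksum_good
print(module.checksum([255]))     # 255 (the bad version would return 0)
```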
As you can see the session covered a lot of areas, and to be honest it totally blew us away. It is well worth watching if you have a spare 90 minutes!