Earlier this month, Microsoft Azure Cloud was experiencing problems that left users unable to access its cloud services. The outage in South Central US affected several Azure Cloud services and caused them to go offline for U.S. users. The reason for the outage was stated as “severe weather”. Microsoft is currently conducting a root cause analysis to find out the exact reason.
Many services went offline due to cooling system failure causing the servers to overheat and turn themselves off.
What did the RCA reveal about the Azure outage
High energy storms associated with Hurricane Gordon hit the southern area of Texas near Microsoft Azure’s data centers for South Central US. Many data centers were affected and experienced voltage fluctuations. Lightning-induced increased electrical activity caused significant voltage swells. The rise in voltages, in turn, caused a portion of one data center to switch to generator power.
The power swells also shut down the mechanical cooling systems despite surge suppressors being in place. With the cooling systems being offline, temperatures exceeded the thermal buffer within the cooling system. The safe operational temperature threshold exceeded which initiated an automated shutdown of devices.
The shutdown mechanism is installed to preserve infrastructure and data integrity. But in this incident, the temperatures increased pretty quickly in some areas of the datacenter causing hardware damage before a shutdown could be initiated. Many storage servers and some network devices and power units were damaged.
Microsoft is taking steps to prevent further damage as the storms are still active in the area. They are switching the remaining data centers to generator power to stabilize power supply. For recovery of damaged units, the first step taken was to recover the Azure Software Load Balancers (SLBs) for storage scale units. The next step was to recover the storage servers and the data on them by replacing failed components and migrating data to healthy storage units while validating that no data is corrupted.
The Azure website also states that the “Impacted customers will receive a credit pursuant to the Microsoft Azure Service Level Agreement, in their October billing statement.”
A detailed analysis will be available on their website in the coming weeks. For more details on the RCA and customer impact, visit the Azure website.