News

Why did last week’s Azure cloud outage happen? Here’s Microsoft’s Root Cause Analysis Summary.

2 min read

Earlier this month, Microsoft Azure Cloud was experiencing problems that left users unable to access its cloud services. The outage in South Central US affected several Azure Cloud services and caused them to go offline for U.S. users. The reason for the outage was stated as “severe weather”. Microsoft is currently conducting a root cause analysis to find out the exact reason.

Many services went offline due to cooling system failure causing the servers to overheat and turn themselves off.

What did the RCA reveal about the Azure outage

High energy storms associated with Hurricane Gordon hit the southern area of Texas near Microsoft Azure’s data centers for South Central US. Many data centers were affected and experienced voltage fluctuations. Lightning-induced increased electrical activity caused significant voltage swells. The rise in voltages, in turn, caused a portion of one data center to switch to generator power.

The power swells also shut down the mechanical cooling systems despite surge suppressors being in place. With the cooling systems being offline, temperatures exceeded the thermal buffer within the cooling system. The safe operational temperature threshold exceeded which initiated an automated shutdown of devices.

The shutdown mechanism is installed to preserve infrastructure and data integrity. But in this incident, the temperatures increased pretty quickly in some areas of the datacenter causing hardware damage before a shutdown could be initiated. Many storage servers and some network devices and power units were damaged.

Microsoft is taking steps to prevent further damage as the storms are still active in the area. They are switching the remaining data centers to generator power to stabilize power supply. For recovery of damaged units, the first step taken was to recover the Azure Software Load Balancers (SLBs) for storage scale units. The next step was to recover the storage servers and the data on them by replacing failed components and migrating data to healthy storage units while validating that no data is corrupted.

The Azure website also states that the “Impacted customers will receive a credit pursuant to the Microsoft Azure Service Level Agreement, in their October billing statement.

A detailed analysis will be available on their website in the coming weeks. For more details on the RCA and customer impact, visit the Azure website.

Read next

Real clouds take out Microsoft’s Azure Cloud; users, developers suffer indefinite Azure outage

Microsoft Azure’s new governance DApp: An enterprise blockchain without mining

Microsoft Azure now supports NVIDIA GPU Cloud (NGC)

Prasad Ramesh

Data science enthusiast. Cycling, music, food, movies. Likes FPS and strategy games.

Share
Published by
Prasad Ramesh

Recent Posts

Top life hacks for prepping for your IT certification exam

I remember deciding to pursue my first IT certification, the CompTIA A+. I had signed…

3 years ago

Learn Transformers for Natural Language Processing with Denis Rothman

Key takeaways The transformer architecture has proved to be revolutionary in outperforming the classical RNN…

3 years ago

Learning Essential Linux Commands for Navigating the Shell Effectively

Once we learn how to deploy an Ubuntu server, how to manage users, and how…

3 years ago

Clean Coding in Python with Mariano Anaya

Key-takeaways:   Clean code isn’t just a nice thing to have or a luxury in software projects; it's a necessity. If we…

3 years ago

Exploring Forms in Angular – types, benefits and differences   

While developing a web application, or setting dynamic pages and meta tags we need to deal with…

3 years ago

Gain Practical Expertise with the Latest Edition of Software Architecture with C# 9 and .NET 5

Software architecture is one of the most discussed topics in the software industry today, and…

3 years ago