Security

How 3 glitches in Azure Active Directory MFA caused a 14-hour long multi-factor authentication outage in Office 365, Azure and Dynamics services

Earlier this week, Microsoft published a report on the causes of last week’s multi-factor authentication outage in Office 365 and Azure, which prevented users from signing in to their cloud services for 14 hours.

Microsoft researchers reported finding three issues that combined to cause the login glitch. Notably, all three occurred within a single system, Azure Active Directory Multi-Factor Authentication, the service Microsoft uses to monitor and manage multi-factor login for Azure, Office 365, and Dynamics.

According to the Microsoft researchers, “There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time.”

The three root causes for the multi-factor authentication outage

Microsoft’s report identified three independent root causes. It said that gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes, which extended the mitigation time.

1. The first root cause manifested as a latency issue in the MFA frontend’s communication with its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger the second root cause.

2. The second root cause was a race condition in processing responses from the MFA backend server. It led to recycles of the MFA frontend server processes, which could trigger additional latency and the third root cause (below) on the MFA backend.

3. The third root cause was a previously undetected issue in the MFA backend server that was triggered by the second root cause. This issue caused processes to accumulate on the MFA backend, leading to resource exhaustion. At that point the backend could not process any further requests from the MFA frontend, while otherwise appearing healthy in Microsoft’s monitoring.
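
The third root cause describes a backend that looked healthy to monitoring while it could no longer accept work. The following is a minimal sketch of how that can happen in general. It is not Microsoft’s implementation, and the class name `Backend`, the fields `max_in_flight` and `in_flight`, and the 0.9 threshold are all hypothetical. It contrasts a liveness-only probe with a probe that also looks at how much stuck work has accumulated.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Toy stand-in for a backend server (hypothetical, for illustration only)."""
    max_in_flight: int           # assumed capacity limit
    in_flight: int = 0           # requests accepted but never completed
    process_alive: bool = True   # all that a liveness-only probe can see

    def accept(self) -> bool:
        """Accept a request if capacity remains; return False when exhausted."""
        if self.in_flight >= self.max_in_flight:
            return False
        self.in_flight += 1
        return True

    def shallow_health(self) -> bool:
        """Liveness-only probe: passes as long as the process is up."""
        return self.process_alive

    def deep_health(self, threshold: float = 0.9) -> bool:
        """Also fail when accumulated work is close to the capacity limit."""
        saturated = self.in_flight >= threshold * self.max_in_flight
        return self.process_alive and not saturated


backend = Backend(max_in_flight=100)
# Simulate work that piles up and never completes.
accepted = sum(backend.accept() for _ in range(150))
```

In this toy model the backend stops accepting requests after its capacity is used up, yet `shallow_health()` keeps passing. Only the probe that inspects accumulated work reports the problem.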

On the day of the outage, these glitches first hit EMEA and APAC customers, and then US subscribers.

According to The Register, “Microsoft would eventually solve the problem by turning the servers off and on again after applying mitigations. Because the services had presented themselves as healthy, actually identifying and mitigating the trio of bugs took some time.”

Microsoft said, “The initial diagnosis of these issues was difficult because the various events impacting the service were overlapping and did not manifest as separate issues.” The company is looking into ways to prevent such an outage from recurring by reviewing how it handles updates and testing. It also plans to review its internal monitoring services and how it contains failures once they begin.
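
The report does not say which containment mechanisms Microsoft will adopt. One common general-purpose pattern for containing a failure once it begins is a circuit breaker, which stops a caller from sending requests to a dependency that keeps failing. The sketch below is an assumption-laden illustration, not anything taken from the report: the class `CircuitBreaker`, its parameters `failure_threshold` and `reset_after`, and the injected `clock` are all hypothetical names.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker sketch (hypothetical, for illustration only)."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.reset_after = reset_after              # seconds before a trial call
        self.clock = clock                          # injectable for testing
        self.failures = 0
        self.opened_at = None                       # None means the circuit is closed

    def allow(self) -> bool:
        """Return True if the caller may send a request to the dependency."""
        if self.opened_at is None:
            return True
        # After the cool-down, let a trial request through (half-open state).
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self) -> None:
        """A successful call closes the circuit and clears the failure count."""
        self.failures = 0
        self.opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) the circuit at the threshold."""
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()


# Usage with a fake clock so the behaviour is deterministic.
now = [0.0]
breaker = CircuitBreaker(failure_threshold=3, reset_after=30.0, clock=lambda: now[0])
for _ in range(3):
    breaker.record_failure()
blocked = not breaker.allow()      # circuit is open, calls are rejected
now[0] = 31.0
trial_allowed = breaker.allow()    # cool-down elapsed, one trial call may proceed
```

A frontend wrapped this way would fail fast instead of piling more requests onto a backend that has stopped responding, which limits how far one fault spreads.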

For more detail, head over to Microsoft Azure’s official page.

Savia Lobo

A Data science fanatic. Loves to be updated with the tech happenings around the globe. Loves singing and composing songs. Believes in putting the art in smart.
