How 3 glitches in Azure Active Directory MFA caused a 14-hour long multi-factor authentication outage in Office 365, Azure and Dynamics services

Earlier this week, Microsoft published a report on the cause of last week’s multi-factor authentication outage in Office 365 and Azure, which prevented users from signing in to their cloud services for 14 hours.

Microsoft researchers reported that three issues combined to cause the login glitch. Interestingly, all three occurred within a single system: Azure Active Directory Multi-Factor Authentication, the service Microsoft uses to monitor and manage multi-factor login for the Azure, Office 365, and Dynamics services.

According to the Microsoft researchers, “There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes which caused an extended mitigation time.”

The three root causes for the multi-factor authentication outage

In its report, Microsoft identified three independent root causes. It said that gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes, which extended the mitigation time.

1. The first root cause manifested as a latency issue in the MFA frontend’s communication with its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger the second root cause.

2. The second root cause is a race condition in processing responses from the MFA backend server. It led to recycles of the MFA frontend server processes, which can trigger additional latency and the third root cause (below) on the MFA backend.

3. The third root cause was a previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue causes an accumulation of processes on the MFA backend, leading to resource exhaustion on the backend, at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in Microsoft’s monitoring.
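
The monitoring gap in the third root cause, a backend that cannot serve requests yet still looks healthy, can be illustrated with a toy model. The sketch below is only an illustration under stated assumptions: the `Backend` class, its worker-slot pool, and both health checks are hypothetical and are not taken from Microsoft's report or from any Azure API.

```python
from dataclasses import dataclass


@dataclass
class Backend:
    """Toy model of a backend with a fixed pool of worker slots (hypothetical)."""

    max_workers: int = 4
    stuck_workers: int = 0  # workers that were started but never finished

    def leak_worker(self) -> None:
        """Simulate a request whose worker never completes and keeps its slot."""
        self.stuck_workers = min(self.stuck_workers + 1, self.max_workers)

    def handle(self) -> bool:
        """Return True if a free worker slot exists to serve a new request."""
        return self.stuck_workers < self.max_workers

    def shallow_health(self) -> bool:
        """Liveness-style check: only reports that the process is up."""
        return True

    def deep_health(self) -> bool:
        """Capacity-aware check: healthy only if a request could be served."""
        return self.handle()


backend = Backend()
for _ in range(4):
    backend.leak_worker()  # stuck processes accumulate until the pool is full

print(backend.shallow_health())  # True: the process is still running
print(backend.deep_health())     # False: no capacity left for new requests
```

A check that only confirms the process is alive keeps reporting success after every slot is taken, whereas a check that exercises the request path fails as soon as capacity runs out.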

On the day of the outage, these glitches first hit EMEA and APAC customers, and then US subscribers.

According to The Register, “Microsoft would eventually solve the problem by turning the servers off and on again after applying mitigations. Because the services had presented themselves as healthy, actually identifying and mitigating the trio of bugs took some time.”

Microsoft said, “The initial diagnosis of these issues was difficult because the various events impacting the service were overlapping and did not manifest as separate issues.” The company is looking into ways to prevent such an outage from recurring by reviewing how it handles updates and testing. It also plans to review its internal monitoring services and how it contains failures once they begin.
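
One widely used way to contain a failure once it begins is a circuit breaker, which stops sending requests to a dependency that keeps failing. The report does not say which techniques Microsoft will adopt, so the following is a generic sketch and not Microsoft's approach; `CircuitBreaker`, `failing_backend`, and the threshold and cooldown values are all hypothetical.

```python
import time


class CircuitBreaker:
    """Minimal breaker: opens after N consecutive failures, retries after a cooldown."""

    def __init__(self, max_failures=3, cooldown=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # time the breaker opened, or None while closed

    def call(self, func):
        # While open, fail fast until the cooldown has elapsed.
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: backend call skipped")
            self.opened_at = None  # cooldown over: allow one trial call
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # a success resets the failure count
        return result


def failing_backend():
    """Stand-in for a backend call that times out (hypothetical)."""
    raise TimeoutError("backend did not respond")


breaker = CircuitBreaker(max_failures=2, cooldown=60.0)
for _ in range(3):
    try:
        breaker.call(failing_backend)
    except Exception as exc:
        print(type(exc).__name__)
# Prints TimeoutError, TimeoutError, RuntimeError
```

After two consecutive timeouts the breaker opens, so the third call is rejected immediately instead of adding more load to a backend that is already struggling.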

For more details, head over to Microsoft Azure’s official page.

Savia Lobo
A Data science fanatic. Loves to be updated with the tech happenings around the globe. Loves singing and composing songs. Believes in putting the art in smart.
