3 min read
Microsoft researchers reported that they found out three issues that combined to cause the log-in glitch. Interestingly, all these three glitches occurred within a single system, i.e. Azure Active Directory Multi-Factor Authentication, a service which Microsoft uses to monitor and manage multi-factor login for the Azure, Office 365, and Dynamics services.
According to the Microsoft researchers, “There were three independent root causes discovered. In addition, gaps in telemetry and monitoring for the MFA services delayed the from identification and understanding of these root causes which caused an extended mitigation time.”
All three glitches occurred within a single system: Azure Active Directory Multi-Factor Authentication. Microsoft uses that service to handle multi-factor login for the Azure, Office 364, and Dynamics services.
The three root causes for the multi-factor authentication outage
Microsoft, in their report, discovered three independent root causes. They said that the gaps in telemetry and monitoring for the MFA services delayed the identification and understanding of these root causes, which caused an extended mitigation time.
1. The first root cause manifested as latency issue in the MFA frontend’s communication to its cache services. This issue began under high load once a certain traffic threshold was reached. Once the MFA services experienced this first issue, they became more likely to trigger second root cause.
2. The second root cause is a race condition in processing responses from the MFA backend server that led to recycles of the MFA frontend server processes which can trigger additional latency and the third root cause (below) on the MFA backend.
- The third identified root cause was previously undetected issue in the backend MFA server that was triggered by the second root cause. This issue causes accumulation of processes on the MFA backend leading to resource exhaustion on the backend at which point it was unable to process any further requests from the MFA frontend while otherwise appearing healthy in our monitoring.
On the day of the outage, these glitches first hit EMEA and APAC customers, and the US subscribers.
According to The Register, “Microsoft would eventually solve the problem by turning the servers off and on again after applying mitigations. Because the services had presented themselves as healthy, actually identifying and mitigating the trio of bugs took some time.”
Microsoft said, “The initial diagnosis of these issues was difficult because the various events impacting the service were overlapping and did not manifest as separate issues”. The company is further looking into ways to prevent the repetition of such an outage in the future by reviewing how it handles updates and testing. They also plan to review its internal monitoring services and how it contains failures once they begin.
To know more about this in detail, head over to Microsoft Azure’s official page.