“Yesterday, as a result of a server configuration change, many people had trouble accessing our apps and services. We’ve now resolved the issues and our systems are recovering. We’re very sorry for the inconvenience and appreciate everyone’s patience,” said Facebook in a tweet.
The outage started at around 09:00 Pacific Time (16:00 UTC) on Wednesday and wasn’t fully resolved until 23:00 (06:00 UTC on Thursday) – an extraordinary delay for a service used by billions globally.
That brief and vague explanation – with no promise of an in-depth report to come – has left users and observers surprised and disappointed. Any company providing a service of similar size and impact, such as a network operator, would be expected to provide constant updates and make its executives available to publicly explain what went wrong.
It’s not like Facebook is allergic to revealing technical details about itself: it has a whole sub-site dedicated to its internal software and data-center engineering work, though there’s not a word about its latest outage. In contrast, Google suffered a cloud platform outage, too, for about four hours yesterday, and its postmortem is detailed: a key part of its backend storage system was overloaded with requests after changes were made to counter a sudden demand for space to hold object metadata, ultimately causing services to stall.
Similarly, in January Microsoft suffered an outage of approximately four hours that affected several of its cloud services. It traced the problem to a third-party network provider issue affecting authentication to Microsoft accounts, immediately shifted to an alternate network provider, and then gave users a detailed report on the outage.
Unlike almost every other company running a communications service for millions of users, Facebook does not even provide a system status dashboard for the public. It has a dashboard for app developers. “We are currently experiencing issues that may cause some API requests to take longer or fail unexpectedly. We are investigating the issue and working on a resolution,” it noted a few hours ago, somewhat stating the bleeding obvious.
While communications companies go out of their way to reach out to media outlets and explain major multi-hour outages in order to maintain public confidence in their networks, Facebook seems to feel no obligation to do so.
We need a fair explanation!
Digging into the limited explanation of a “server configuration change” as the source of the problem, that terminology is so vague as to be useless:
- What sort of change?
- On what servers?
- What was the change intended to achieve?
- Was it tested beforehand?
- Was it rolled out gradually, or suddenly across all regions – and if the latter, why?
- Why was a rollback not immediately initiated?
- And if it was, why didn’t it work?
- Why did it take 14 hours to resolve?
These are some of the questions that you would expect a huge technology company to answer.
Instead, the best explanation we’ve found is a hypothetical rundown by Facebook’s former chief information security officer, Alex Stamos, who assumes that Facebook engineers did initiate an automated rollback but that “the automated system doesn’t know how to handle the problem, and gets stuck in some kind of loop that causes more damage. Humans have to step in, stop it, and restart a complex web of interdependent services on hundreds of thousands of systems.”
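To make the failure mode Stamos describes more concrete, here is a minimal Python sketch of one common safeguard against it: an automated rollback that retries only a fixed number of times and then hands over to humans instead of looping. Nothing here reflects Facebook’s actual tooling, which has not been disclosed; the names `bounded_rollback`, `apply_rollback`, `is_healthy`, and `max_attempts` are all hypothetical and chosen purely for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class RollbackResult:
    recovered: bool   # True if a health check passed after a rollback attempt
    attempts: int     # number of rollback attempts actually made
    escalated: bool   # True if automation gave up and asked for a human
    log: List[str] = field(default_factory=list)


def bounded_rollback(
    apply_rollback: Callable[[], None],
    is_healthy: Callable[[], bool],
    max_attempts: int = 3,
) -> RollbackResult:
    """Try an automated rollback at most max_attempts times, then escalate.

    Both callables are hypothetical stand-ins: apply_rollback would restore
    the previous configuration, and is_healthy would run a health check.
    """
    log: List[str] = []
    for attempt in range(1, max_attempts + 1):
        apply_rollback()
        if is_healthy():
            log.append(f"attempt {attempt}: healthy")
            return RollbackResult(True, attempt, False, log)
        log.append(f"attempt {attempt}: still unhealthy")
    # The retry budget is spent: stop instead of looping indefinitely.
    log.append("giving up: paging a human operator")
    return RollbackResult(False, max_attempts, True, log)


if __name__ == "__main__":
    # Simulated system that never recovers, so the automation must escalate.
    result = bounded_rollback(lambda: None, lambda: False, max_attempts=3)
    for line in result.log:
        print(line)
```

The point of the retry budget is that the automation stops by design once it can no longer help, rather than compounding the damage. Whether anything like this was in place at Facebook, or failed, is exactly what a proper postmortem would tell us.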
Just this month, US Senator Elizabeth Warren (D-MA) made the argument that services like Facebook, Google, and Amazon have become so large and so fundamental in the digital era that they should be viewed – and legislated as – “platform utilities”, and that the revenue-making aspects of these companies (products, ads, etc.) should be broken off as separate entities.
When Facebook refuses even to provide a proper explanation for a 14-hour outage, the argument that there needs to be legislative oversight only grows stronger.
Related to this, The New York Times revealed yesterday that Facebook is being investigated by a grand jury for possible criminal charges for sharing people’s private data with other companies without seeking the consent of, or even informing, those who were affected.
Is there more to this than meets the eye?
The other big question is how a “server configuration change” took down not just Facebook but also its Messenger, WhatsApp, and Instagram services. One theory is that Facebook has connected them, or attempted to connect them, at a low level, merging them into one broad platform. In January, CEO Mark Zuckerberg announced that his instant-chat applications and social network would be intertwined.
Was the outage a result of Facebook trying to combine systems and get ahead of regulators, especially when, this month, a public debate has opened up over whether Facebook’s takeover of Instagram and WhatsApp should be rolled back?
The timing of it all makes today’s breaking news of two top executives leaving Facebook in less than a year even more enigmatic. CEO Mark Zuckerberg wrote on his blog about the departure of Chris Cox, Chief Product Officer, and Chris Daniels, WhatsApp Vice President.
We will wait and watch for Facebook to come up with a detailed explanation, though that seems very unlikely.