Update: The article has been updated to include Google’s response to Sunday’s service disruption.
Over the weekend, Google Cloud suffered a major outage that took down a number of Google services, including YouTube, G Suite, and Gmail. It also affected services that depend on Google, such as Snapchat, Nest, Discord, Shopify and more. The problem was first reported by East Coast users in the U.S. around 3 PM ET / 12 PM PT, and the company resolved it after more than four hours. According to Downdetector, users in the UK, France, Austria, Spain, and Brazil also reported being affected by the outage.
Google Cloud down pic.twitter.com/QRWV1KdjSW
— خالد مهنا (@DrKMhana) June 2, 2019
In a statement posted on its Google Cloud Platform status page, the company said it was experiencing a multi-region issue with the Google Compute Engine. “We are experiencing high levels of network congestion in the eastern USA, affecting multiple services in Google Cloud, GSuite, and YouTube. Users may see a slow performance or intermittent errors. We believe we have identified the root cause of the congestion and expect to return to normal service shortly,” the company said.
The issue was resolved four hours after Google acknowledged the downtime. “The network congestion issue in the eastern USA, affecting Google Cloud, G Suite, and YouTube has been resolved for all affected users as of 4:00 pm US/Pacific,” the company said in a statement.
“We will conduct an internal investigation of this issue and make appropriate improvements to our systems to help prevent or minimize future recurrence. We will provide a detailed report of this incident once we have completed our internal investigation. This detailed report will contain information regarding SLA credits.”
The outage caused real disruption. Not only did it affect some of the apps most used by netizens, such as YouTube and Snapchat, but people also reported that they were unable to use their Nest-controlled devices, for example to turn on their AC or open their “smart” locks to let people into the house.
Today @googlecloud went down and people can't use their Nest devices to open their doors. 🤦♂️
— David Iach 🚀 (@davidiach) June 2, 2019
Even Shopify experienced problems because of the Google outage, which prevented some stores (both brick-and-mortar and online) from processing credit card payments for hours.
Due to @googlecloud outage, @Shopify has been down all afternoon. Shops running on the platform may have collectively lost $millions already, due to lost sales and ad-spend. pic.twitter.com/NUwZg3lMDA
— Larry Weru (@LarryWeru) June 2, 2019
It is startling that so many of the world’s most popular applications depend on a single backend in the hands of one company. It is also surprising how many businesses rely on just one hosting service. At the very least, companies should set up a contingency plan in case the service goes down again.
It's such a smart idea to connect so much of our infrastructure to single points of failure, with much of the risks coupled. Smart homes ftw. pic.twitter.com/qiD7l6RWfJ
— zeynep tufekci (@zeynep) June 2, 2019
Half the internet is down
Including Shopify, Snapchat, Youtube, Google Drive
All are hosted on the Google Cloud
Now you see who owns the internet
— Severin Alexander B. (@SeverinAlexB) June 2, 2019
Others argued that the Google Cloud outage is proof that cloud-based gaming isn’t ready for mass audiences yet. At this year’s Game Developers Conference (GDC), Google marked its entry into the game industry with Stadia, its new cloud-based platform for streaming games. It will launch later this year in select markets, including the U.S., Canada, the U.K., and Europe.
Google and its services down but yall excited for stadia and cloud gaming in general pic.twitter.com/9lg1LyR8Xd
— FOOT ON NECK HD! (@BrokenGamezHDR) June 2, 2019
It sucks that so many services are tied to Google. If their servers go down, everything else under their umbrella goes down lol Stadia included when the time comes, enjoy your cloud streaming
— Soul Kiwami @ E3 HYPE (@soul_societyy) June 2, 2019
On Monday, Google released an apologetic update on the outage, outlining the incident, how it was detected, and the company’s response.
“In essence, the root cause of Sunday’s disruption was a configuration change that was intended for a small number of servers in a single region. The configuration was incorrectly applied to a larger number of servers across several neighboring regions, and it caused those regions to stop using more than half of their available network capacity. The network traffic to/from those regions then tried to fit into the remaining network capacity, but it did not. The network became congested, and our networking systems correctly triaged the traffic overload and dropped larger, less latency-sensitive traffic in order to preserve smaller latency-sensitive traffic flows, much as urgent packages may be couriered by bicycle through even the worst traffic jam,” Google wrote.
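To make the triage idea concrete, here is a minimal Python sketch of how traffic might be prioritized once capacity shrinks. It is an illustration only, not Google's implementation: the `Flow` class, the `triage` function, the flow names, and all bandwidth figures are hypothetical, and real networks rely on QoS classes and queuing rather than a simple greedy loop like this one.

```python
from dataclasses import dataclass


@dataclass
class Flow:
    name: str                # hypothetical label for the traffic flow
    size_mbps: int           # bandwidth the flow needs (made-up figure)
    latency_sensitive: bool  # True for interactive traffic


def triage(flows, capacity_mbps):
    """Admit flows until capacity runs out, preferring latency-sensitive
    flows first and, within each group, smaller flows first."""
    ordered = sorted(flows, key=lambda f: (not f.latency_sensitive, f.size_mbps))
    admitted, dropped, used = [], [], 0
    for flow in ordered:
        if used + flow.size_mbps <= capacity_mbps:
            admitted.append(flow)
            used += flow.size_mbps
        else:
            dropped.append(flow)
    return admitted, dropped


flows = [
    Flow("dns-lookup", 1, True),
    Flow("video-call", 20, True),
    Flow("api-request", 5, True),
    Flow("video-upload", 300, False),
    Flow("batch-backup", 400, False),
]

full_capacity = 1000    # assumed capacity before the bad configuration
reduced_capacity = 400  # assumed: more than half of the capacity is gone

admitted, dropped = triage(flows, reduced_capacity)
print("admitted:", [f.name for f in admitted])
print("dropped: ", [f.name for f in dropped])
```

With the assumed numbers, every flow fits at full capacity, but at the reduced capacity the largest bulk transfer is dropped while the small interactive flows get through, which matches the behavior Google describes.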
Next, Google’s engineering teams are conducting a thorough post-mortem to understand all the contributing factors to both the network capacity loss and the slow restoration.