On Friday, the 9th of November, at 4:30 am US/Pacific time, Google Kubernetes Engine suffered a service disruption: users were potentially unable to launch node pools through the Cloud Console UI. The team acknowledged the issue and said it would provide more information by 04:45 am US/Pacific time the same day.
However, the issue was not resolved by then. The team posted another status update assuring users that the Engineering Team was working on mitigation, and promised a further update with current details by 06:00 pm US/Pacific.
In the meantime, affected customers were advised to use the gcloud command-line tool to create new node pools.
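The status updates did not spell out the exact command, so the following is only a rough sketch of what the workaround might look like. It assembles a `gcloud container node-pools create` invocation in Python and prints it for review rather than running it. The pool, cluster and zone names, the node count, and the helper name `build_node_pool_command` are placeholders assumed for illustration, not details from the incident report.

```python
import shlex


def build_node_pool_command(pool, cluster, zone, num_nodes=3):
    """Return the argument list for creating a GKE node pool with gcloud."""
    return [
        "gcloud", "container", "node-pools", "create", pool,
        "--cluster", cluster,
        "--zone", zone,
        "--num-nodes", str(num_nodes),
    ]


# Placeholder names, assumed for illustration only.
command = build_node_pool_command("example-pool", "example-cluster", "us-central1-a")

# Print the command instead of executing it, so it can be checked first.
print(" ".join(shlex.quote(arg) for arg in command))
```

Pasting the printed line into a terminal with an authenticated gcloud installation would be the manual equivalent of the step that the Cloud Console UI could not complete.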
An update announcing that the issue was finally resolved was posted on Sunday, the 11th of November, stating that services had been restored on Friday at 14:30 US/Pacific time. However, no proper explanation was provided for what led to the service disruption. The team did say that it would conduct an internal investigation and implement appropriate improvements to its systems to help prevent or minimize future recurrence of the issue.
According to a user’s summary on Hacker News, “Some users here are reporting that other GCP services not mentioned by Google’s blog are experiencing problems. Some users here are reporting that they have received no response from GCP support, even over a time span of 40+ hours since the support request was submitted.”
According to another user, “When everything works, GCP is the best. Stable, fast, simple, reliable. When things stop working, GCP is the worst. They require way too much work before escalating issues or attempting to find a solution.”
Looking at the timeline of the service downtime, we can’t help but agree.
Users have also expressed disappointment over how the outage was managed.
With users demanding a root cause analysis of the situation, it is only fitting that Google provide one so that users can regain their trust in the company.
You can check out Google Cloud’s blog post detailing the timeline of the downtime.