For the second time in less than a week, Cloudflare was at the center of a major internet outage, which took down many websites for about an hour yesterday due to a software glitch. Last week, Cloudflare users faced a major outage when Verizon accidentally rerouted a large chunk of internet traffic after it wrongly accepted a network misconfiguration from a small ISP in Pennsylvania, USA.
Cloudflare’s CTO, John Graham-Cumming, wrote that yesterday’s outage was caused by a massive spike in CPU utilization across the network.
Many users complained of seeing “502 errors” in their browsers when they tried to visit sites served by Cloudflare. Downdetector, the website that keeps users updated on ongoing outages and service interruptions, also flashed a 502 error message.
DownDetector is unable to tell if cloudflare is down because cloudflare is down. 🙃 pic.twitter.com/pBSgv9QISc
— Tarjei Husøy (@t_husoy) July 2, 2019
Graham-Cumming wrote, “This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels”.
A single misconfigured rule: the actual cause of the outage
The actual cause of the outage was a single misconfigured rule within the Cloudflare Web Application Firewall (WAF), introduced during a routine deployment of new Cloudflare WAF Managed Rules. Although the company has automated systems to run test suites and a procedure for deploying changes progressively to prevent incidents, these WAF rules were deployed globally in one go, causing yesterday’s outage.
so that cloudflare outage was a caused by a single regex rule deployed globally in one go🤦♂️ pic.twitter.com/xws5kQZ59K
— mjosdwez (@mjos_crypto) July 2, 2019
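To see why a single regex rule can pin CPUs, consider how a pattern with stacked greedy wildcards behaves. The sketch below is hypothetical and is not Cloudflare’s actual WAF rule (their engine also differs from Python’s `re`); it only illustrates the general failure mode: on non-matching input, the engine tries every way of splitting the string between the wildcards before giving up, so matching cost grows super-linearly with input length.

```python
# Hypothetical demonstration of regex backtracking driving up CPU time.
# The ".*.*=" shape (two greedy wildcards before a literal) is the kind
# of construct that backtracks heavily; this is NOT Cloudflare's rule.
import re
import time

PATTERN = re.compile(r".*.*=.*")

def time_match(text):
    """Return (matched, seconds) for PATTERN anchored at the start of text."""
    start = time.perf_counter()
    matched = PATTERN.match(text) is not None
    return matched, time.perf_counter() - start

# On a string that never contains '=', every split between the two '.*'
# wildcards is tried before the match fails, so time grows roughly
# quadratically as the input doubles.
for n in (1000, 2000, 4000):
    _, secs = time_match("x" * n)
    print(f"n={n}: {secs:.4f}s")
```

Engines with linear-time guarantees (such as RE2) avoid this class of blow-up by construction, which is one reason backtracking regexes in hot paths are treated as a reliability risk.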
After identifying the root cause, Cloudflare issued a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU usage back to normal and restored traffic at 14:09 UTC. After verifying that the problem was fixed correctly, the team re-enabled the WAF Managed Rulesets at 14:52 UTC.
CTO of Cloudflare makes point that outage was ultimately an engineering management issue, not an individual https://t.co/dUgevl5asM
— SwiftOnSecurity (@SwiftOnSecurity) July 3, 2019
“Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future”, the Cloudflare blog states.
One user pointed out that Cloudflare should not have rolled the change out globally when it was meant to go through a staged rollout.
The post mortem of today’s @Cloudflare outage is up: it’s a sobering read.
– testing in production (dark launch) isn’t without its perils. It can take down your production
– don’t do *global* deploys, even for staged rollout of feature flags/dark traffic. https://t.co/HvRkWcIh8H pic.twitter.com/triEvf7NE9
— Cindy Sridharan (@copyconstruct) July 2, 2019
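The advice in the tweet above can be made concrete with a small sketch. This is a hypothetical illustration, not Cloudflare’s deployment system: a change is enabled for a growing percentage of locations, bucketed deterministically, so a bad rule fails within a small blast radius first instead of everywhere at once.

```python
# Hypothetical staged-rollout sketch (not Cloudflare's actual system).
# Each location is hashed into a stable bucket; a change is enabled only
# for buckets below the current rollout percentage.
import hashlib

def in_rollout(location_id: str, percent: int) -> bool:
    """Deterministically decide whether a location is in the rollout stage."""
    digest = hashlib.sha256(location_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]  # stable value in 0..65535
    return bucket < 65536 * percent // 100

# Ramp the change through progressively larger stages.
locations = [f"pop-{i}" for i in range(200)]
for percent in (1, 10, 50, 100):
    enabled = sum(in_rollout(loc, percent) for loc in locations)
    print(f"{percent}% stage -> {enabled}/200 locations enabled")
```

Hashing the location ID (rather than picking at random each time) keeps the enabled set stable between stages, so a problem observed at 1% can be traced before the same change reaches 100%.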
Cloudflare confirms the outage was ‘a mistake’ and not an attack
There was also speculation that the outage was caused by a DDoS attack from China, Iran, North Korea, and elsewhere, which Graham-Cumming tweeted was untrue: “It was not an attack by anyone from anywhere.”
Cloudflare’s CEO, Matthew Prince, also confirmed that the outage was not the result of an attack but a “mistake on our part.”
I've seen a bunch of speculation that today's @Cloudflare outage was caused by a DDoS from China, Iran, North Korea, etc. etc.
It was not an attack by anyone from anywhere.
— John Graham-Cumming (@jgrahamc) July 2, 2019
Many users applauded Cloudflare for acknowledging that this was an organizational/engineering management issue and not an individual’s fault.
I give Cloudflare a hard time about abuse resourcing, but gotta give kudos here. A single regex bringing down Cloudflare globally is not a person issue, it’s an org/design issue. Glad to see transparency and no scapegoats. https://t.co/8C1BW1FdvX
— Kevin Beaumont 🌈 (@GossiTheDog) July 2, 2019
Prince told Inc., “I’m not an alarmist or a conspiracy theorist, but you don’t have to be either to recognize that it is ultimately your responsibility to have a plan. If all it takes for half the internet to go dark for 20 minutes is some poorly deployed software code, imagine what happens when the next time it’s intentional.”
To learn more about this incident in detail, read Cloudflare’s official blog post.