
For the second time in less than a week, Cloudflare was at the center of a major internet outage that took many websites offline for about an hour yesterday, this time due to a software glitch. Last week, Cloudflare users faced a major outage when Verizon accidentally rerouted internet traffic after wrongly accepting a network misconfiguration from a small ISP in Pennsylvania, USA.

Cloudflare’s CTO John Graham-Cumming wrote that yesterday’s outage was caused by a massive spike in CPU utilization across the network.


Many users complained of seeing “502 errors” in their browsers when they tried to visit websites served by Cloudflare. Downdetector, the website that tracks ongoing outages and service interruptions, also displayed a 502 error message.

Graham-Cumming wrote, “This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels”.

A single misconfigured WAF rule was the actual cause of the outage

The cause of the outage was a single misconfigured rule within the Cloudflare Web Application Firewall (WAF), deployed during a routine release of new Cloudflare WAF Managed rules. Although the company has automated test suites and a procedure for deploying changes progressively to contain incidents, these WAF rules were deployed globally in one go, causing yesterday’s outage.
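To make the contrast concrete, here is a minimal, hypothetical sketch of what a progressive (canary) rollout looks like compared with pushing a rule to the entire fleet at once. The fleet names, stage percentages, and health check below are assumptions for illustration only; they do not describe Cloudflare’s actual deployment tooling.

```python
import time

# Hypothetical sketch of progressive (canary) deployment versus a global
# one-shot deploy. None of these names come from Cloudflare's tooling; the
# point is shipping a rule to a growing fraction of the fleet and halting
# if health checks regress before the whole fleet is affected.

FLEET = [f"edge-{i:03d}" for i in range(100)]   # pretend edge machines
STAGES = [0.01, 0.10, 0.50, 1.00]               # fraction of the fleet per stage

def push_rule(host: str, rule_id: str) -> None:
    print(f"deploying {rule_id} to {host}")

def fleet_is_healthy() -> bool:
    # A real system would inspect CPU and error-rate metrics here;
    # this stub always passes so the sketch runs end to end.
    return True

def progressive_deploy(rule_id: str) -> bool:
    deployed: set[str] = set()
    for fraction in STAGES:
        for host in FLEET[: int(len(FLEET) * fraction)]:
            if host not in deployed:
                push_rule(host, rule_id)
                deployed.add(host)
        time.sleep(0.1)                          # stand-in for a real "bake" period
        if not fleet_is_healthy():
            print(f"halting and rolling back {rule_id} on {len(deployed)} hosts")
            return False
    return True

progressive_deploy("waf-managed-rule-example")
```

Deploying globally in one go skips exactly the stage where a bad rule would have been caught while it was only running on a small slice of machines.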

These new rules were intended to improve the blocking of inline JavaScript used in attacks. “Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%”, Graham-Cumming writes.
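The post-mortem quoted here doesn’t reproduce the rule itself, but a regular expression driving CPU to 100% is the classic symptom of catastrophic backtracking. The snippet below is a textbook illustration of that effect, not Cloudflare’s actual WAF rule: a pattern with nested quantifiers whose matching time roughly doubles with every extra character of non-matching input.

```python
import re
import time

# Illustration only: this is NOT the rule Cloudflare deployed. "(a+)+$" is a
# textbook catastrophic-backtracking pattern: on input that almost matches,
# the engine retries every way of splitting the run of "a"s before failing.
pattern = re.compile(r"(a+)+$")

for n in (18, 22, 26):
    payload = "a" * n + "b"              # the trailing "b" forces a failed match
    start = time.perf_counter()
    pattern.match(payload)
    elapsed = time.perf_counter() - start
    print(f"{n} chars: {elapsed:.3f}s")  # time grows exponentially with n
```

A rule like this can pass functional tests, since it still matches what it is supposed to match, and only reveals its cost on the right traffic, which is why CPU and latency budgets matter as much as correctness checks before a global deploy.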

After identifying the cause, Cloudflare issued a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU back to normal and restored traffic at 1409 UTC. Once the team had verified that the problem was fixed correctly, the WAF Managed Rulesets were re-enabled at 1452 UTC.
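A ‘global kill’ of this kind amounts to a configuration flag, consulted on every request, that operators can flip fleet-wide without shipping new code. The sketch below is purely illustrative of that idea; the class and field names are assumptions, not Cloudflare’s internal implementation.

```python
# Hypothetical sketch of a "global kill" switch: a single flag, checked before
# any managed rules run, that operators can flip to disable the whole rule set.
class WAF:
    def __init__(self) -> None:
        self.managed_rules_enabled = True
        self.managed_rules = ["block-inline-js-attack"]   # placeholder rule name

    def inspect(self, request: str) -> str:
        if not self.managed_rules_enabled:       # the global kill short-circuits here
            return "allow (managed rules disabled)"
        for rule in self.managed_rules:
            # Evaluate each rule against the request (evaluation omitted).
            pass
        return "allow (rules evaluated)"

waf = WAF()
print(waf.inspect("GET /index.html"))
waf.managed_rules_enabled = False                # operators issue the global kill
print(waf.inspect("GET /index.html"))
```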

“Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future”, the Cloudflare blog states.

One user commented that Cloudflare should not have rolled the feature out globally when it was supposed to go through a staged rollout.

Cloudflare confirms the outage was ‘a mistake’ and not an attack

Cloudflare also faced speculation that the outage was caused by a DDoS attack from China, Iran, North Korea, and others. Graham-Cumming tweeted that these claims were untrue: “It was not an attack by anyone from anywhere”.

Cloudflare’s CEO, Matthew Prince, also confirmed that the outage was not the result of an attack but a “mistake on our part.”

Many users applauded Cloudflare for acknowledging that the outage was an organizational and engineering-management issue rather than an individual’s fault.

Prince told Inc., “I’m not an alarmist or a conspiracy theorist, but you don’t have to be either to recognize that it is ultimately your responsibility to have a plan. If all it takes for half the internet to go dark for 20 minutes is some poorly deployed software code, imagine what happens when the next time it’s intentional.”

For more details on this news, read Cloudflare’s official blog post.

