
Cloudflare suffers 2nd major internet outage in a week. This time due to globally deploying a rogue regex rule.


For the second time in less than a week, Cloudflare was at the center of a major internet outage. Yesterday, a software glitch made many websites unreachable for about an hour. Last week, Cloudflare users faced a major outage when Verizon accidentally rerouted IP packets after it wrongly accepted a network misconfiguration from a small ISP in Pennsylvania, USA.

Cloudflare’s CTO, John Graham-Cumming, wrote that yesterday’s outage was due to a massive spike in CPU utilization across the network.


Many users complained of seeing “502 errors” in their browsers when they tried to visit sites served by Cloudflare. Downdetector, a website that tracks ongoing outages and service interruptions, also displayed a 502 error message.

Graham-Cumming wrote, “This CPU spike was caused by a bad software deploy that was rolled back. Once rolled back the service returned to normal operation and all domains using Cloudflare returned to normal traffic levels”.

A single misconfigured rule, the actual cause of the outage

The cause of the outage was a single misconfigured rule within the Cloudflare Web Application Firewall (WAF), deployed during a routine deployment of new Cloudflare WAF Managed Rules. Although the company has automated systems to run test suites and a procedure for deploying progressively to prevent incidents, these WAF rules were deployed globally in one go and caused yesterday’s outage.
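To make the idea of deploying progressively concrete, here is a minimal sketch of a staged rollout that stops and undoes itself when CPU climbs too high. It is not Cloudflare’s actual procedure: the stage names, the 80% CPU limit, and the `deploy`, `cpu_after`, and `rollback` callbacks are all hypothetical and exist only for illustration.

```python
from typing import Callable, List

# Hypothetical stage names and CPU threshold, chosen only for illustration.
STAGES: List[str] = ["canary", "one-region", "global"]
CPU_LIMIT = 0.80


def progressive_deploy(
    stages: List[str],
    deploy: Callable[[str], None],
    cpu_after: Callable[[str], float],
    rollback: Callable[[str], None],
) -> List[str]:
    """Deploy stage by stage; undo everything and stop if CPU exceeds the limit.

    Returns the stages still running the new rule when the function ends.
    """
    done: List[str] = []
    for stage in stages:
        deploy(stage)
        done.append(stage)
        if cpu_after(stage) > CPU_LIMIT:
            # Roll back in reverse order, most recent stage first.
            for finished in reversed(done):
                rollback(finished)
            return []
    return done
```

With a rollout like this, a rule that pins CPU would be caught at the first stage and would never reach the `global` stage.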

These new rules were to improve the blocking of inline JavaScript that is used in attacks. “Unfortunately, one of these rules contained a regular expression that caused CPU to spike to 100% on our machines worldwide. This 100% CPU spike caused the 502 errors that our customers saw. At its worst traffic dropped by 82%”, Graham-Cumming writes.
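The article does not reproduce the offending rule, so the following is only a generic illustration of how a regular expression can consume a great deal of CPU in a backtracking engine such as Python’s `re` module. Both patterns are hypothetical textbook examples, not Cloudflare’s rule. They accept the same strings, but the nested one tries a rapidly growing number of ways to split the input before it gives up on a non-matching string.

```python
import re
import time

# Hypothetical patterns for illustration only; neither is Cloudflare's rule.
NESTED = re.compile(r"(a+)+$")  # nested quantifiers: prone to heavy backtracking
FLAT = re.compile(r"a+$")       # accepts the same strings without nesting


def time_match(pattern, text):
    """Return (matched, seconds) for one match attempt anchored at the start."""
    start = time.perf_counter()
    matched = pattern.match(text) is not None
    return matched, time.perf_counter() - start


if __name__ == "__main__":
    for n in (14, 16, 18, 20):
        text = "a" * n + "b"  # the trailing "b" forces the match to fail
        _, slow = time_match(NESTED, text)
        _, fast = time_match(FLAT, text)
        print(f"n={n:2d} nested={slow:.4f}s flat={fast:.6f}s")
```

Running the script should show the time for the nested pattern growing sharply as `n` increases, while the flat pattern stays almost constant. Exact timings depend on the machine.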

After finding the actual cause of the issue, Cloudflare issued a ‘global kill’ on the WAF Managed Rulesets, which instantly dropped CPU usage back to normal and restored traffic at 14:09 UTC. After verifying that the problem was fixed correctly, the company re-enabled the WAF Managed Rulesets at 14:52 UTC.

“Our testing processes were insufficient in this case and we are reviewing and making changes to our testing and deployment process to avoid incidents like this in the future”, the Cloudflare blog states.

One user commented that Cloudflare should have been more careful about deploying the change globally instead of staging the rollout.

Cloudflare confirms the outage was ‘a mistake’ and not an attack

Cloudflare also faced speculation that the outage was caused by a DDoS attack from China, Iran, North Korea, or elsewhere. Graham-Cumming tweeted that this was untrue: “It was not an attack by anyone from anywhere”.

Cloudflare’s CEO, Matthew Prince, also confirmed that the outage was not the result of an attack but a “mistake on our part.”

Many users applauded Cloudflare for acknowledging that this was an organizational and engineering management issue and not an individual’s fault.

Prince told Inc., “I’m not an alarmist or a conspiracy theorist, but you don’t have to be either to recognize that it is ultimately your responsibility to have a plan. If all it takes for half the internet to go dark for 20 minutes is some poorly deployed software code, imagine what happens when the next time it’s intentional.”

For more details, read Cloudflare’s official blog.


Savia Lobo

A Data science fanatic. Loves to be updated with the tech happenings around the globe. Loves singing and composing songs. Believes in putting the art in smart.

