Cloudflare RCA: Major outage was a lot more than “a regular expression went bad”

On July 2, 2019, Cloudflare suffered a major outage due to a massive spike in CPU utilization across its network. Ten days later, on July 12, Cloudflare’s CTO, John Graham-Cumming, released a report detailing how the Cloudflare service went down for 27 minutes.

During the outage, the company speculated that the cause was a single misconfigured rule within the Cloudflare Web Application Firewall (WAF), pushed out during a routine deployment of new Cloudflare WAF Managed Rules. That speculation turned out to be correct: the rule exhausted every CPU core that handles HTTP/HTTPS traffic on the Cloudflare network worldwide.

Graham-Cumming said they are “constantly improving WAF Managed Rules to respond to new vulnerabilities and threats”.

The CPU exhaustion was caused by a single WAF rule that contained a poorly written regular expression that ended up creating excessive backtracking.

cloudflare-rca-major-outage-was-a-lot-more-than-a-regular-expression-went-bad-img-0

Source: Cloudflare report
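To make "excessive backtracking" concrete, here is a minimal sketch in Python. It is a toy matcher written for illustration only. It is not Cloudflare's regex engine and does not use the rule from the outage. The function name `count_steps` and the token-list pattern format are assumptions made for this example. It supports only literal characters and a greedy `.*`, and it counts how many match attempts a naive backtracking strategy makes.

```python
def count_steps(tokens, text):
    """Count match attempts made by a naive backtracking matcher.

    `tokens` is a hypothetical, simplified pattern format: a list whose
    items are either the string ".*" (greedy wildcard) or a single
    literal character. The whole of `text` must match.
    """
    steps = 0

    def match(ti, si):
        nonlocal steps
        steps += 1  # one attempt to match tokens[ti:] against text[si:]
        if ti == len(tokens):
            return si == len(text)
        tok = tokens[ti]
        if tok == ".*":
            # Greedy: consume as much as possible first, then back off
            # one character at a time whenever the rest fails to match.
            for end in range(len(text), si - 1, -1):
                if match(ti + 1, end):
                    return True
            return False
        # Literal character: must match exactly, then continue.
        return si < len(text) and text[si] == tok and match(ti + 1, si + 1)

    match(0, 0)
    return steps


if __name__ == "__main__":
    # The input never contains "=", so every attempt ends in failure
    # and the matcher has to explore all the ways to split the input.
    for n in (10, 20, 40):
        text = "x" * n
        one = count_steps([".*", "="], text)
        two = count_steps([".*", ".*", "="], text)
        print(n, one, two)
```

With a single `.*`, the number of attempts in this toy matcher grows linearly with the input length. With two adjacent `.*` tokens it grows quadratically, because each position the first wildcard gives back forces the second wildcard to retry every remaining split. Each additional adjacent wildcard raises the growth rate further, which is how a short pattern can consume a disproportionate amount of CPU on inputs that do not match.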

The regular expression at the heart of the outage is:

cloudflare-rca-major-outage-was-a-lot-more-than-a-regular-expression-went-bad-img-1

Graham-Cumming says Cloudflare deploys dozens of new rules to the WAF every week and has numerous systems in place to prevent any negative impact from those deployments. He shared a list of the weaknesses that together led to the major outage.

cloudflare-rca-major-outage-was-a-lot-more-than-a-regular-expression-went-bad-img-2

What’s Cloudflare doing to mend the situation?

Graham-Cumming said the company has stopped all release work on the WAF completely and is taking the following steps:

cloudflare-rca-major-outage-was-a-lot-more-than-a-regular-expression-went-bad-img-3

He says that, for the longer term, Cloudflare is “moving away from the Lua WAF that I wrote years ago”. The company plans to port the WAF to the new firewall engine, which gives customers a flexible and intuitive way to control requests, inspired by the widely known Wireshark language. This will make the WAF faster and add yet another layer of protection.

Users have appreciated Cloudflare’s efforts to respond to the outage immediately and to be completely transparent about its root cause in a full post-mortem report.

https://twitter.com/fatih/status/1150014793253904386

https://twitter.com/nealmcquaid/status/1150754753825165313

https://twitter.com/_stevejansen/status/1150928689053470720

“We are ashamed of the outage and sorry for the impact on our customers. We believe the changes we’ve made mean such an outage will never recur,” Graham-Cumming writes.

Read the complete in-depth report on the Cloudflare blog.
