Global Cloudflare outage due to internal database change

Cloudflare recently experienced a global outage caused by a database permissions update, triggering widespread 5xx errors across its CDN and security services.
The disruption began around 11:20 UTC on November 18, cutting off access to customer sites and even locking Cloudflare’s own team out of its internal dashboard. According to a post-mortem published by CEO Matthew Prince, the root cause was a subtle regression introduced during a routine enhancement to its ClickHouse database cluster.
Engineers were rolling out a change intended to improve security by making table access explicit for each user. The update, however, had an unforeseen side effect on the Bot Management system: a metadata query that had historically returned a clean list of columns from the default database suddenly began returning duplicate rows pulled in from the underlying r0 database shards as well.
Prince explained the technical nuance in the blog post:
The change… allowed all users to access accurate metadata about the tables they have access to. Unfortunately, past assumptions dictated that the list of columns returned by a query like this would only include the “default” database.
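The failure mode is easy to reproduce in miniature. The sketch below is illustrative only (the ColumnMeta struct, the table name, and the column names are assumptions, not Cloudflare’s schema); it shows how a metadata lookup that never filters on the database name starts returning one row per database once the underlying r0 tables become visible, doubling the column list:

```rust
/// One row of ClickHouse-style table metadata; the names are illustrative,
/// not Cloudflare's actual schema.
#[derive(Debug, Clone)]
struct ColumnMeta {
    database: String,
    table: String,
    column: String,
}

/// Collect column names for a table, optionally filtering by database.
fn feature_columns(meta: &[ColumnMeta], table: &str, database: Option<&str>) -> Vec<String> {
    meta.iter()
        .filter(|m| m.table == table)
        .filter(|m| database.map_or(true, |db| m.database == db))
        .map(|m| m.column.clone())
        .collect()
}

fn main() {
    // Simulated metadata after the permissions change: the same table is now
    // reported once under `default` and once under the underlying `r0` database.
    let meta: Vec<ColumnMeta> = ["default", "r0"]
        .iter()
        .flat_map(|db| {
            (0..3).map(move |i| ColumnMeta {
                database: db.to_string(),
                table: "bot_features".to_string(),
                column: format!("feature_{i}"),
            })
        })
        .collect();

    // Legacy query shape: no database filter, so every column appears twice.
    let unfiltered = feature_columns(&meta, "bot_features", None);
    // Defensive query shape: filter explicitly on the `default` database.
    let filtered = feature_columns(&meta, "bot_features", Some("default"));

    println!("without filter: {} columns", unfiltered.len()); // 6
    println!("with filter:    {} columns", filtered.len()); // 3
}
```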
This additional data doubled the size of the “features file,” a set of configurations used to track bot threats. Cloudflare’s core proxy software pre-allocates memory for this file as a performance optimization, and it enforces a hard safety limit of 200 features. When the oversized file propagated across the network, it exceeded that limit and caused the Bot Management module to crash.
(Source: Cloudflare blog article)
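In outline, the crash path resembles the sketch below. This is not Cloudflare’s code: the FeaturesFile type, the load_features function, and the input format are assumptions, and only the 200-feature cap comes from the post-mortem. The point is that a pre-allocated buffer with a hard cap turns an oversized file into an error, and code that treats that error as impossible turns it into a crash:

```rust
/// Hard cap on the number of bot-management features (the 200 figure comes
/// from the post-mortem; everything else here is illustrative).
const MAX_FEATURES: usize = 200;

/// Parsed contents of the "features file".
struct FeaturesFile {
    features: Vec<String>,
}

/// Pre-allocate room for up to MAX_FEATURES entries and reject anything larger.
fn load_features(entries: &[String]) -> Result<FeaturesFile, String> {
    if entries.len() > MAX_FEATURES {
        return Err(format!(
            "features file has {} entries, limit is {}",
            entries.len(),
            MAX_FEATURES
        ));
    }
    let mut features = Vec::with_capacity(MAX_FEATURES);
    features.extend_from_slice(entries);
    Ok(FeaturesFile { features })
}

fn main() {
    // A doubled file: roughly twice the expected number of features.
    let doubled: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();

    match load_features(&doubled) {
        Ok(file) => println!("loaded {} features", file.features.len()),
        // Handling the error keeps the process alive. Calling .unwrap() here
        // instead, on the assumption that the file can never be oversized, is
        // the kind of shortcut that turns one bad input into a module-wide crash.
        Err(e) => eprintln!("rejected features file: {e}"),
    }
}
```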
The incident was difficult to diagnose because of how it presented. Since the database update was rolling out gradually, the system kept flipping between a “good” state and a “bad” state every few minutes. This erratic behavior initially convinced the engineering team that it was fighting a large-scale DDoS attack rather than an internal bug. The confusion deepened when Cloudflare’s external status page, which is hosted off Cloudflare’s own infrastructure, went down at the same time, a pure coincidence that led some to believe the support infrastructure was also being targeted.
One commenter on a Reddit thread observed:
You don’t realize how many websites use Cloudflare until Cloudflare stops working. Then you try to search how many websites use Cloudflare, but you can’t because all the Google results that would answer your question also use Cloudflare.
“The fact that there was a period of time when our network was unable to carry traffic is deeply painful for everyone on our team,” Prince wrote, noting that it was the largest outage the company has experienced since 2019.
As users grappled with the outage, Dicky Wong, CEO of Syber Couture, pointed to the incident as validation of multi-vendor strategies. Cloudflare offers a brilliant suite of tools, he commented, but “love is not the same as a marriage without a prenup.” Wong argues that sound risk management requires a shift toward active multi-vendor and hybrid strategies to avoid the “single physical point of failure” that defined this breakdown.
This sentiment was echoed by users on the r/webdev subreddit, where user crazyrebel123 highlighted the fragility of the current internet landscape:
The problem these days is that there are a few large companies that manage or own the majority of things on the Internet. So when one goes down, the entire internet goes down one way or the other. Most sites now run on AWS or some other form of cloud service.
Jonathan B., a senior technology manager, reinforced this view on LinkedIn, criticizing the tendency of organizations to rely on a single vendor for the sake of “simplicity”:
It’s simple, yes — until that provider becomes the outage everyone’s tweeting about… People call hybrid “old school,” but honestly? It’s just responsible engineering. It comes down to recognizing that outages happen no matter how big the logo is on the side of the cloud.
The service was eventually restored by manually inserting a known-good version of the configuration file into the distribution queue. Traffic flows normalized by 14:30 UTC, and the incident was fully resolved later in the afternoon. Cloudflare says it is now reviewing failure modes across all of its proxy modules to ensure that memory pre-allocation limits handle bad inputs more gracefully in the future.
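Both the manual fix and the planned hardening point at the same pattern: a bad configuration push should be rejected on arrival while the last known-good version keeps serving traffic. The sketch below is a minimal, hypothetical illustration of that idea (the ConfigSlot type and its validation closure are inventions for this example, not Cloudflare’s design):

```rust
/// Wraps a live configuration value; `current` is only ever replaced by a
/// candidate that has passed validation.
struct ConfigSlot<T> {
    current: T,
}

impl<T> ConfigSlot<T> {
    fn new(initial: T) -> Self {
        Self { current: initial }
    }

    /// Apply a new candidate only if validation succeeds; otherwise keep
    /// serving the last known-good value instead of failing hard.
    fn try_update<E>(
        &mut self,
        candidate: T,
        validate: impl Fn(&T) -> Result<(), E>,
    ) -> Result<(), E> {
        validate(&candidate)?;
        self.current = candidate;
        Ok(())
    }
}

fn main() {
    const MAX_FEATURES: usize = 200; // hard cap cited in the post-mortem

    // Start from a small, valid feature list.
    let mut slot = ConfigSlot::new(vec!["feature_0".to_string()]);

    // An oversized push (the doubled file) is rejected; the old list stays live.
    let oversized: Vec<String> = (0..260).map(|i| format!("feature_{i}")).collect();
    let result = slot.try_update(oversized, |features| {
        if features.len() > MAX_FEATURES {
            Err(format!(
                "{} features exceeds the limit of {}",
                features.len(),
                MAX_FEATURES
            ))
        } else {
            Ok(())
        }
    });

    match result {
        Ok(()) => println!("new features file applied"),
        Err(e) => println!(
            "kept last known-good file ({} entries): {}",
            slot.current.len(),
            e
        ),
    }
}
```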


