Cloudflare Reveals Cause of Massive Internet Outage

At 11:20 UTC on 18 November 2025, Cloudflare’s network began experiencing what the company described as “significant failures to deliver core network traffic”. The disruption quickly surfaced for customers and end-users as error pages—specifically HTTP 5XX status codes indicating server-side failure. The company emphasised that no cyber-attack or malicious activity triggered the incident.

The impact was wide-ranging and tangible: major services that rely on Cloudflare’s infrastructure, such as X (formerly Twitter) and ChatGPT/OpenAI, reported outages or degraded service.

Why did it matter so much? Because Cloudflare is deeply embedded in the backbone of the Internet: the company says its systems protect “websites, apps, APIs and AI workloads while accelerating performance,” and independent data suggest its services are used by over 20 per cent of all websites.

What Went Wrong: Technical Breakdown

Cloudflare’s own incident report and follow-up commentary give a detailed account of the failure and the recovery.

Permission change in database system: At around 11:05 UTC, a change was made to the permissions of one of Cloudflare’s database systems, which runs on ClickHouse. The change let users see metadata for the underlying shard tables (in the r0 database) as well as the distributed tables in the “default” database. As a result, a query that builds the “feature file” used by Cloudflare’s Bot Management system returned duplicate rows of column metadata.
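To see how a visibility change alone can double a query's output, here is a minimal Python sketch. Only the database names `default` and `r0` come from the incident description; the rows, the table name `features_table` and the helper functions are hypothetical stand-ins, not Cloudflare's schema or its actual query.

```python
from typing import NamedTuple


class ColumnRow(NamedTuple):
    database: str
    table: str
    name: str


def visible_columns(can_see_r0: bool) -> list[ColumnRow]:
    """Simulate the column metadata a user is allowed to see (hypothetical rows)."""
    names = ["feature_a", "feature_b", "feature_c"]
    rows = [ColumnRow("default", "features_table", n) for n in names]
    if can_see_r0:
        # After the permission change, the same columns also show up under r0.
        rows += [ColumnRow("r0", "features_table", n) for n in names]
    return rows


def feature_names_unfiltered(rows: list[ColumnRow]) -> list[str]:
    """Select by table name only, with no filter on the database."""
    return [r.name for r in rows if r.table == "features_table"]


def feature_names_filtered(rows: list[ColumnRow], database: str = "default") -> list[str]:
    """Select by table name and pin the database explicitly."""
    return [r.name for r in rows
            if r.table == "features_table" and r.database == database]


before = feature_names_unfiltered(visible_columns(can_see_r0=False))
after = feature_names_unfiltered(visible_columns(can_see_r0=True))
fixed = feature_names_filtered(visible_columns(can_see_r0=True))
print(len(before), len(after), len(fixed))  # 3 6 3
```

The unfiltered query did not change; only what it was allowed to see did. Pinning the database in the filter makes the result independent of that visibility.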

Feature file doubles in size: The feature-configuration file, which is delivered to every machine across Cloudflare’s network and consumed by the Bot Management module, grew to more than double its expected number of entries. The module normally used about 60 features and enforced a hard limit of around 200, so the inflated file exceeded that limit and the memory allocated for it, triggering a crash in parts of Cloudflare’s traffic-routing engine.
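The size assumption can be checked explicitly before a file is used. The Python sketch below rejects an oversized or duplicated feature list with an error the caller can handle, rather than crashing. The limit of 200 follows the figure quoted above; `MAX_FEATURES`, `FeatureFileError` and `validate_features` are hypothetical names, and this is not how Cloudflare's module is implemented.

```python
MAX_FEATURES = 200  # hard limit, following the figure quoted in the article


class FeatureFileError(ValueError):
    """Raised when a feature list fails validation."""


def validate_features(names: list[str], limit: int = MAX_FEATURES) -> list[str]:
    """Return a copy of the list if it is within the limit and has no duplicates."""
    if len(names) > limit:
        raise FeatureFileError(f"{len(names)} features exceeds limit of {limit}")
    if len(set(names)) != len(names):
        raise FeatureFileError("duplicate feature names")
    return list(names)


# A normal-sized file passes.
ok = validate_features([f"f{i}" for i in range(60)])

# A hypothetical oversized file is rejected with a recoverable error.
try:
    validate_features([f"f{i}" for i in range(250)])
    rejected = False
    reason = ""
except FeatureFileError as exc:
    rejected = True
    reason = str(exc)

print(len(ok), rejected, reason)
```

What the caller does with the error (keep the previous file, disable the module, alert an operator) is a separate policy decision.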

[Chart not reproduced: volume of HTTP 5xx status codes served by the Cloudflare network. The volume is normally very low, and it remained so right up until the start of the outage.]

Crashing components and cascading failures: The oversized file was propagated network-wide, including to machines running Cloudflare’s proxy services (FL, or “Frontline”, and its successor FL2). On the newer engine (FL2), the crash produced HTTP 5XX errors. On the older engine (FL), bot scores were returned as zero, which caused false positives for blocking rules; traffic did not fail outright, but behaviour was degraded.
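The false-positive effect on the older engine is easy to reproduce in miniature. The sketch below assumes a hypothetical rule that blocks any request scoring below 30; the threshold and function names are illustrative and are not Cloudflare's rule syntax.

```python
from typing import Optional

BLOCK_BELOW = 30  # hypothetical threshold for a customer blocking rule


def should_block(bot_score: int, threshold: int = BLOCK_BELOW) -> bool:
    """Block when the score falls under the threshold."""
    return bot_score < threshold


def should_block_unknown_aware(bot_score: Optional[int],
                               threshold: int = BLOCK_BELOW) -> bool:
    """Treat a missing score as 'unknown' and do not block on it."""
    if bot_score is None:
        return False
    return bot_score < threshold


# If a failed scorer reports 0, the rule blocks every request.
blocked_on_zero = should_block(0)

# If failure is represented as None, the rule can tell "no score" from "low score".
blocked_on_unknown = should_block_unknown_aware(None)

print(blocked_on_zero, blocked_on_unknown)  # True False
```

Whether an unknown score should pass or be blocked depends on the site; the point is that a failure value that looks like a real score removes that choice.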

Recovery and remediation: After initially suspecting a DDoS attack, Cloudflare identified the faulty file as the cause, stopped its propagation at around 14:24 UTC, rolled back to a known-good version and forced restarts of its core proxy services. By 14:30 UTC the main impact was resolved; by around 17:06 UTC all systems were reported fully restored.
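Rolling back to a known-good version was a manual step in this incident. As an illustration of the same idea applied automatically, the following Python sketch keeps the last configuration that passed validation and refuses to replace it with one that does not. `ConfigStore` and its methods are hypothetical names, and the size check stands in for whatever validation a real system would need.

```python
from typing import Callable, Optional


class ConfigStore:
    """Holds the active config and only swaps it for a candidate that validates."""

    def __init__(self, validate: Callable[[list[str]], bool]) -> None:
        self._validate = validate
        self._active: Optional[list[str]] = None
        self.rejected = 0  # count of candidates refused so far

    def offer(self, candidate: list[str]) -> bool:
        """Try to install a new config; keep the current one if it fails validation."""
        if self._validate(candidate):
            self._active = list(candidate)
            return True
        self.rejected += 1
        return False

    @property
    def active(self) -> Optional[list[str]]:
        return self._active


store = ConfigStore(validate=lambda features: len(features) <= 200)

good = [f"f{i}" for i in range(60)]
bad = [f"f{i}" for i in range(400)]  # hypothetical oversized file

accepted_good = store.offer(good)
accepted_bad = store.offer(bad)

print(accepted_good, accepted_bad, len(store.active), store.rejected)  # True False 60 1
```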

Wider Implications & Lessons

This is not merely a “bug” for technology teams to fix internally. It carries implications across three domains:

Reliance on major infrastructure providers: The outage highlighted how heavily the internet depends on a handful of large infrastructure firms. When Cloudflare falters, sites, apps and services worldwide feel it. One expert quoted by The Guardian called Cloudflare “a gatekeeper” of traffic flows.

Complexity breeds fragility: The root of the failure was not a malicious act but a well-intentioned permissions change that interacted unexpectedly with internal tooling, metadata queries and configuration-file generation pipelines. The chain from database metadata → feature file → module memory limit shows that even high-scale systems contain fragile points of failure.

Visibility & accountability for downtime: Cloudflare admitted this was “our worst outage since 2019”. The company’s public apology notes that, given its role in the Internet ecosystem, “any outage of any of our systems is unacceptable”. The transparency of the incident report and the commitment to follow-through matter for customers, regulators and the broader internet community.

What Happens Next?

Cloudflare outlined remediation steps including:

Hardening ingestion of automatically generated configuration files (treating these like user input).
Introducing global kill-switches for features.
Ensuring core dumps or error-reports cannot overwhelm system resources.
Reviewing failure modes for all core-proxy modules.
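A global kill-switch, one of the steps listed above, can be sketched in a few lines of Python. This is a simplified in-memory illustration under stated assumptions: `KillSwitches`, `handle_request` and the feature name `bot_management` are hypothetical, and a production switch would need to be distributed and to work even when the feature itself is failing.

```python
from typing import Callable, Optional


class KillSwitches:
    """In-memory registry of globally disabled features (hypothetical)."""

    def __init__(self) -> None:
        self._disabled: set[str] = set()

    def disable(self, feature: str) -> None:
        self._disabled.add(feature)

    def enable(self, feature: str) -> None:
        self._disabled.discard(feature)

    def is_enabled(self, feature: str) -> bool:
        return feature not in self._disabled


def handle_request(path: str, switches: KillSwitches,
                   score_fn: Callable[[str], int]) -> dict:
    """Serve a request, calling the scorer only while its feature is enabled."""
    score: Optional[int] = None
    if switches.is_enabled("bot_management"):
        score = score_fn(path)
    return {"path": path, "bot_score": score}


def broken_scorer(path: str) -> int:
    """Stand-in for a module that fails on a bad feature file."""
    raise RuntimeError("feature file too large")


switches = KillSwitches()
switches.disable("bot_management")

# With the switch thrown, the broken scorer is never called and the request is served.
response = handle_request("/index.html", switches, broken_scorer)
print(response)  # {'path': '/index.html', 'bot_score': None}
```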

These sound sensible — though the real test will be the next time something unexpected happens, and whether the system is resilient enough to avoid broad service disruption.

For Customers & End-Users: What To Watch

If you run a website, API or service depending on Cloudflare: audit your dependency, and consider fallback or multi-provider architecture for critical traffic.
Monitor your metrics for sudden error spikes or latency increases: they may signal upstream issues rather than your own code.
For users: when error pages appear (“500 Internal Server Error” etc), the cause may lie upstream of the site you’re visiting, even if it looks as though that site itself is down.
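For the monitoring advice above, a rolling 5xx error rate is a simple starting point. The Python sketch below tracks the most recent responses and flags a spike; the window size, the 5 per cent threshold and the class name `ErrorRateMonitor` are assumptions chosen for illustration, and real alerting would also need to separate upstream errors from those of your own origin.

```python
from collections import deque


class ErrorRateMonitor:
    """Tracks the share of 5xx responses over the most recent `window` requests."""

    def __init__(self, window: int = 100, threshold: float = 0.05) -> None:
        self._statuses: deque[int] = deque(maxlen=window)
        self._threshold = threshold

    def record(self, status: int) -> None:
        self._statuses.append(status)

    def error_rate(self) -> float:
        if not self._statuses:
            return 0.0
        errors = sum(1 for s in self._statuses if 500 <= s <= 599)
        return errors / len(self._statuses)

    def is_spiking(self) -> bool:
        return self.error_rate() > self._threshold


monitor = ErrorRateMonitor(window=100, threshold=0.05)

# Normal traffic: 98 successes and 2 server errors.
for status in [200] * 98 + [500] * 2:
    monitor.record(status)
normal_rate = monitor.error_rate()
normal_spike = monitor.is_spiking()

# An upstream incident: 50 consecutive 502 responses push older entries out.
for status in [502] * 50:
    monitor.record(status)
incident_rate = monitor.error_rate()
incident_spike = monitor.is_spiking()

print(normal_rate, normal_spike, incident_rate, incident_spike)  # 0.02 False 0.52 True
```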

Conclusion

The outage on 18 November 2025 serves as a reminder that the internet’s smooth surface masks deep complexity beneath. Even a “permission tweak” inside a database can ripple into a global outage. As websites, mobile apps, APIs and AI services become more interconnected, the margin for error shrinks. Infrastructure providers like Cloudflare must not only build for scale but also guard against the unexpected, minimise single points of failure and preserve trust.

WarMax356 Founder
