A history of the outages that broke shared internet infrastructure in ways the United States could feel
The internet just disappeared
In October 2021, Facebook disappeared because Facebook withdrew Facebook.
One maintenance operation touched Meta’s backbone routers. The routes vanished. WhatsApp stopped loading. Instagram stopped loading. Facebook itself stopped loading. Messenger went with it. To users, it looked like a familiar kind of internet failure. Apps would not open. Pages would not load. Login screens stalled. Retry did nothing.
Underneath that, the more interesting failure was already in progress. The engineers trying to fix the outage lost access to some of the same internal systems they needed to fix it. The company did not just break the public edge. It broke part of its own path back in.
The outage did not get harder because the apps were down. It got harder because the recovery path shared fate with the failure.
Unbelievable detail
Engineers ended up using physical key cards to get into rooms they had not needed to enter manually in years.
That kind of failure matters more now because more of daily life sits behind fewer shared layers than most people realize.
Not every outage belongs here. Not every service wobble. Not every afternoon where people post screenshots of broken apps and call it “the internet.” The incidents in this record are the ones where shared infrastructure failed in ways the United States could feel, where the blast radius escaped the original system, and where the postmortem revealed a dependency large enough to matter beyond one brand.
This piece is a record of those failures.
The hidden architecture
To understand why these outages spread the way they do, you have to see the internet the way operators do: as a few heavily loaded layers carrying much more than they look like they are carrying.
Naming
DNS looks boring until it fails. Then major sites that have nothing to do with each other start disappearing together because they all depend on the same provider to answer the same basic question: where is this thing? Names fail early, which makes everything downstream feel broken.
Routing
BGP, the internet’s routing protocol, is the map layer. Routing failures are strange because the machines often still exist. The path to them is what disappears.
Front doors
CDNs, edge networks, reverse proxies, and DDoS mitigation platforms made the web fast and quietly made it much more coupled. The sites look separate. The middleware is not.
Gates
Identity systems, OAuth, SSO, token signing, and certificate chains sit in front of everything else. A service can be healthy and still be functionally dead if nobody can authenticate through the layer in front of it.
Control planes
Metadata stores, orchestration, service discovery, internal dashboards, rollback systems, and status systems decide whether an outage is fixable in ten minutes or drags into the night. When they fail, the outage stops being a service problem and becomes a coordination problem.
Physical infrastructure
Buildings. Power. Cooling. Cable paths. Fuel. Carrier facilities. Emergency services sharing infrastructure with commercial traffic. Cloud abstraction hides geography. It does not remove it.
The internet is not one thing. It is a handful of heavily shared layers, each carrying far more than it appears to.
Takeaway
The internet breaks where names are translated, where routes are trusted, where front doors are shared, where gates decide access, where control planes coordinate recovery, and where physical chokepoints stop being invisible.
The propagation map
The usual way to tell outage history is by vendor. Cloud outage. DNS outage. CDN outage. That tells you who broke. It does not tell you how breaking works.
The more useful split is by propagation type.
| Type | What it feels like | What actually broke |
|---|---|---|
| Naming failure | The site is gone | DNS stopped turning names into destinations |
| Map failure | Nothing can reach the service | Routing stopped taking traffic to the right place |
| Gate failure | The service loads but nobody can use it | Identity or auth failed in front of everything |
| Front-door failure | Unrelated sites break together | Shared CDN or edge infrastructure failed |
| Cascade failure | The outage gets worse while recovery slows | The control plane and fix path got dragged in too |
The rest of the piece follows those five paths, plus the physical layer they all stand on. Most big outages are propagation stories. The initial fault matters less than the layer that amplified it.
Map failures: when the internet believes the wrong route
BGP tells the internet where things are. More precisely, it lets networks tell each other where things are. It is not a neutral map. It is a system of assertions. Local claims can become global facts very quickly if enough other networks accept them.
The service is fine. The servers are fine. The origin exists. Nothing reaches it anyway.
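Part of why a bad claim travels so well is mechanical: routers forward on the most specific prefix they know about. The sketch below is a hedged illustration of that selection rule using Python's ipaddress module. It is not a model of any real router, and the prefixes are documentation addresses, not anyone's actual address space.

```python
from ipaddress import ip_address, ip_network

# Toy routing table: prefix -> who announced it. Illustrative prefixes only.
routes = {
    ip_network("203.0.113.0/24"): "legitimate origin",
    ip_network("203.0.113.0/25"): "hijacker's more-specific announcement",
}

def best_route(destination: str) -> str:
    dest = ip_address(destination)
    # Longest-prefix match: the most specific covering prefix wins.
    matches = [net for net in routes if dest in net]
    winner = max(matches, key=lambda net: net.prefixlen)
    return routes[winner]

# Traffic to an address covered by both prefixes follows the /25,
# even though the /24 was there first and is the legitimate one.
print(best_route("203.0.113.10"))   # -> hijacker's more-specific announcement
print(best_route("203.0.113.200"))  # -> legitimate origin (outside the /25)
```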
Pakistan Telecom / YouTube (2008)
Pakistan Telecom was trying to block YouTube domestically. To do it, it announced a more specific route for YouTube’s address space, and that announcement leaked outward instead of staying inside Pakistan. Traffic headed for YouTube worldwide got redirected into a black hole.
Unbelievable detail
A local censorship action became a global routing fact.
What made this spread
Original fault: bad route announcement
Amplification layer: BGP accepted and propagated it
Why recovery was hard: the wrong route looked authoritative long enough to travel
Verizon BGP leak (2019)
A small provider in Pennsylvania advertised routes it did not own. Verizon accepted them and propagated them. Cloudflare, Facebook, Amazon, and others were affected. This was not some exotic attack. It was misconfiguration meeting inadequate filtering.
What made this spread
Original fault: route leak from a small provider
Amplification layer: tier-1 carrier propagation
Why recovery was hard: filtering was weaker than the blast radius
CenturyLink / Level 3 (2020)
One bad Flowspec rule, pushed through a mechanism built for filtering traffic at scale, ended up blocking BGP itself. Routers kept choking on the very updates needed to restore routing. The outage did not just break traffic. It made correction harder.
Unbelievable detail
The network started rejecting the information required to fix the network.
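The shape of that trap is easy to reproduce in miniature. The sketch below is not Flowspec syntax and not CenturyLink’s actual rule; it is a hedged stand-in for a filter broad enough to also drop the control traffic, BGP sessions on TCP port 179, that carries the correction.

```python
from dataclasses import dataclass

@dataclass
class Packet:
    protocol: str   # "tcp", "udp", ...
    dst_port: int
    description: str

def overly_broad_filter(pkt: Packet) -> str:
    # Hypothetical mitigation rule that matches far more than intended:
    # drop all TCP to "protect" the network.
    if pkt.protocol == "tcp":
        return "DROP"
    return "FORWARD"

traffic = [
    Packet("tcp", 443, "customer web traffic the rule was vaguely aimed at"),
    Packet("tcp", 179, "BGP session carrying the updates needed to recover"),
    Packet("udp", 53, "DNS query"),
]

for pkt in traffic:
    print(f"{overly_broad_filter(pkt):7s} {pkt.description}")

# The rule drops port 179 along with everything else, so the very updates
# that would withdraw the bad rule and restore routing cannot propagate.
```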
Meta / Facebook (2021)
Meta pushed the routing failure pattern into public consciousness. The routes disappeared. The names still existed. The servers still existed. But the path was gone, and the company’s own internal systems were too entangled with the failure to recover cleanly.
Takeaway
A routing error is a local claim with global consequences.
Naming failures: when the service still exists but the internet cannot find it
DNS failures hit early and spread wide. That is why they feel bigger than they “should.”
Users do not experience DNS as an elegant naming abstraction. They experience it as the difference between a website existing and not existing. If names stop resolving, the rest of the stack barely matters from the user’s position.
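That ordering is easy to see in miniature. The sketch below is an illustration, not taken from any incident postmortem: resolution is step one of every request, so when it fails the client never attempts a connection to the perfectly healthy server behind the name. The hostname is deliberately fake.

```python
import socket

def request_would_even_start(hostname: str) -> bool:
    """Return True if the name resolves; nothing after DNS runs otherwise."""
    try:
        # Step 1 of every web request: turn the name into addresses.
        addrs = socket.getaddrinfo(hostname, 443, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # The server may be perfectly healthy. Without an answer here,
        # the client has nowhere to send a connection attempt, so to the
        # user the site simply does not exist.
        return False
    return len(addrs) > 0

# A made-up name: the failure looks identical whether the name is wrong
# or the DNS provider behind a real name is down.
print(request_would_even_start("example-storefront.invalid"))
```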
Dyn / Mirai (2016)
A Mirai botnet made from compromised cameras, routers, DVRs, and other junk devices hammered Dyn’s managed DNS infrastructure hard enough that large parts of the U.S. web started failing together. Twitter, Reddit, GitHub, the New York Times, Netflix, Spotify, and others all went weird in clusters.
Unbelievable detail
The attack surface was every cheap internet-connected device someone never changed the password on.
Azure DNS (2021)
A naming-layer problem cascaded into broader Azure service failures, and parts of the status experience became unreliable too. Users did not just lose the service. They lost clean visibility into what had failed.
Akamai Edge DNS (2021)
One configuration change in one huge DNS provider made a large piece of U.S. commerce and public infrastructure feel absent. Airlines. Banks. Shipping companies. Consumer platforms. Government sites.
Takeaway
If names stop resolving, everything downstream feels broken whether it is or not.
Front-door failures: when unrelated sites break together
Users can go years without thinking about CDNs and edge networks. Then one of them fails and suddenly journalism, shopping, developer tools, gaming, and government all start returning the same kind of dead response at the same time. That is when people accidentally learn what shared middleware is.
Fastly (June 2021)
One customer pushed a valid configuration change. That change triggered a latent bug. Within minutes, major sites across multiple industries were returning errors together.
Unbelievable detail
One customer change, global blast radius.
Akamai Prolexic (July 2021)
A DDoS mitigation platform, the thing customers bought specifically to preserve availability, became the source of unavailability. The system meant to smooth over internet fragility became a fragility multiplier instead.
Takeaway
Once enough of the web is fronted by the same infrastructure, site diversity becomes surface decoration.
The modern web fails by dependency cluster.
Gate failures: when access breaks before infrastructure does
Gate failures are quieter on the network and louder in real life.
The service may still be up. The servers may still be reachable. DNS may still resolve. The problem is that everything useful sits behind an identity layer that just stopped letting people through.
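A hedged sketch of why that looks so strange from the outside: the handler below is healthy and would answer every request, but the verification step in front of it fails closed when the identity layer cannot be reached. The function names and the failure mode are invented for illustration, not taken from any vendor’s postmortem.

```python
# Sketch of an auth gate in front of a healthy handler. fetch_signing_keys
# is a hypothetical stand-in for pulling keys from an identity provider.

class IdentityProviderDown(Exception):
    pass

def fetch_signing_keys() -> dict:
    # Simulate the identity layer being unreachable or its keys being invalid.
    raise IdentityProviderDown("key endpoint unreachable")

def verify_token(token: str) -> bool:
    try:
        keys = fetch_signing_keys()
    except IdentityProviderDown:
        # Failing closed is the safe choice, but it means every user is
        # rejected even though nothing behind the gate is broken.
        return False
    return token in keys  # placeholder check for the sketch

def handle_request(token: str) -> str:
    if not verify_token(token):
        return "401 Unauthorized"   # what every user sees
    return "200 OK: your data"      # the healthy service nobody can reach

print(handle_request("a-perfectly-valid-session-token"))
```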
Google authentication / OAuth (2020)
Google’s outage affected both Google’s own products and third-party services using Google OAuth. Gmail, YouTube, Docs, and other services all inherited the same gate failure because identity sat in front of all of them.
Unbelievable detail
The identity service ran out of storage quota, and the gate failed with it.
Azure AD key-rotation (2021)
A routine signing-key problem became a broad authentication failure across Microsoft 365, Teams, Exchange Online, and Azure services. The failure was not loud in the way routing failures are loud. It was loud in the workflow. Offices stopped moving.
Takeaway
A service can be alive and still be unusable if the gate in front of it fails.
Cascade failures: when the control plane breaks recovery too
Cloud infrastructure is not just compute. It is metadata, orchestration, service discovery, status systems, rollback tools, dashboards, internal networking, and the consoles people use to understand what is failing. When those layers start failing with the public service, the outage changes shape. It is no longer just an availability problem. It becomes a recovery problem.
AWS S3 us-east-1 (2017)
A mistyped debugging command removed more capacity than intended. The memorable part is what that exposed about dependency design.
Unbelievable detail
The AWS status page could not report that S3 was down because the status page itself depended on S3.
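The design lesson compresses into one rule: a status system should not transitively depend on the thing it reports on. Below is a small hedged sketch of that check, with invented service names rather than AWS’s real internal graph.

```python
# Invented dependency graph: service -> what it needs to function.
dependencies = {
    "status_page": {"object_store"},
    "object_store": set(),
    "checkout_api": {"object_store"},
}

def reports_on_its_own_dependency(reporter: str, monitored: str) -> bool:
    """True if the reporter transitively depends on the thing it monitors."""
    seen, stack = set(), [reporter]
    while stack:
        svc = stack.pop()
        if svc == monitored and svc != reporter:
            return True
        if svc not in seen:
            seen.add(svc)
            stack.extend(dependencies.get(svc, ()))
    return False

# The status page cannot report an object-store outage if it needs the
# object store to render.
print(reports_on_its_own_dependency("status_page", "object_store"))  # True
```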
Google Cloud networking (2019)
A routine reliability change cascaded through internal coordination systems and created major traffic loss between regions. The root cause was boring. The propagation was not.
AWS us-east-1 (2021)
Internal DNS and networking trouble turned into a broad public outage across Alexa, Ring, logistics systems, and third-party services. Region concentration was part of the problem. The depth of internal dependency on that region was the rest.
CrowdStrike Falcon update (2024)
One bad Falcon update hit enough Windows machines, in enough critical sectors, that the distinction between software failure and infrastructure failure stopped mattering. Flights were grounded. Hospitals delayed procedures. Emergency services went manual. Recovery took days because millions of endpoints needed human attention.
Takeaway
Shared software at scale behaves like infrastructure.
Recovery gets slower when the system you need to fix the outage is part of the outage.
Physical failures: geography always mattered
The cloud still runs on buildings.
CenturyLink 911 outage (2018)
A faulty piece of network equipment, compounded by a configuration error, flooded CenturyLink’s transport network with malformed packets and disrupted 911 service across multiple states, affecting millions of customers. Emergency calling failed because public-safety systems shared a failure domain with commercial backbone infrastructure.
The Planet Houston datacenter failure (2008)
A power failure should have been survivable. Backup generators existed. The failover did not behave the way “backup” suggests backup should behave. Thousands of servers went dark and recovery dragged across days.
Unbelievable detail
The backup system did not back up the system.
The internet runs on diesel and electricity more often than cloud language would like to admit.
Hurricane Katrina telecom failures (2005)
Katrina wiped out fiber paths, towers, power, and fuel logistics together, which is the purest version of the argument: redundancy stops helping when the same event destroys the redundant paths too.
Takeaway
The cloud still depends on geography. It just hides it better.
The same outage keeps coming back with different logos
The initial triggers are usually boring.
A configuration change. A maintenance operation. A typo. A route policy mistake. A certificate issue. A software update. Human error keeps scaling because infrastructure scale keeps scaling faster than safeguards.
Concentration is the first multiplier. More of the internet now sits behind fewer providers, fewer naming systems, fewer front doors, fewer cloud regions, fewer identity gates. A small number of high-degree nodes carry a ridiculous amount of public life.
Propagation is the second multiplier. Blast radius depends on where the fault lands in the stack, not on how cinematic the root cause sounds. A mundane config error at the wrong layer will fail louder than a dramatic incident at the right one.
The recovery path is the third multiplier. The hardest incidents are the ones that damage the fix path while they damage the service. Meta in 2021. AWS S3 in 2017. CrowdStrike in 2024. These outages are revealing not because they were embarrassing but because they expose design choices that kept recovery coupled to failure.
Near-misses are part of the story too. A lot of apparent stability is operator skill, emergency rollback, manual failover, and someone noticing the weird thing early enough to stop it from becoming public. The real fragility of these systems is larger than the incident archive suggests.
Shared software is the final multiplier. Once enough organizations depend on the same agent, the same update mechanism, the same identity broker, or the same endpoint security stack, software failure starts behaving exactly like infrastructure failure. The blast radius does not care which label you put on it.
Takeaway
The same failures keep returning because the same dependency patterns keep surviving them.
What improved and what got worse
Some things improved.
Postmortems are better. They are longer, more specific, and more willing to admit ugly coupling. Rollback discipline improved in places that got burned hard enough to take it seriously. Routing hygiene is stronger than it used to be. More teams now treat identity and control-plane fragility as real engineering problems instead of weird edge cases.
What got worse matters more.
Concentration increased. Shared fate expanded. High-degree nodes got higher-degree. Regional dependence on a handful of cloud zones deepened. Software monoculture grew. Operator burden increased because the systems are more coupled and the humans holding them together are carrying more of the actual resilience load than the architecture deserves credit for.
Takeaway
The internet got better at recovering from fragility while continuing to accumulate more of it.
The atlas underneath this essay
This project only works if the essay is not pretending to be the whole thing.
The article uses a fixed set of eighteen incidents because narrative needs selection. The atlas underneath it is where the argument becomes testable. Every incident lives as a structured record: name, date, duration, root cause, what broke, U.S. impact, fix, unbelievable detail, what changed after, primary sources, related incidents, confidence notes.
The point of the archive is not decorative completeness. It is accountability. It forces the article’s claims back down onto specific cases, specific causes, specific blast radii, specific fixes.
The atlas lives at https://outage-archive.jonathanrreed.com/.
The structure is fragile, the dependence is concentrated, and the thing saving the system is usually not elegance. It is intervention.
The essay is the argument. The atlas is the evidence system underneath it.
References
Primary and official sources
- Meta Engineering. “More details about the October 4 outage.” https://engineering.fb.com/2021/10/05/networking-traffic/outage-details/
- AWS. “Summary of the Amazon S3 Service Disruption in the Northern Virginia (US-EAST-1) Region.” https://aws.amazon.com/message/41926/
- Google Cloud Status. “Google services outage incident report, December 2020.” https://status.cloud.google.com/incident/zall/20013
- Fastly. “June 8 outage postmortem.” https://www.fastly.com/blog/summary-of-june-8-outage
- Akamai. “Edge DNS service interruption, July 2021.” https://www.akamai.com/blog/news/akamai-service-disruption-july-22-2021
- Cloudflare. “How Verizon and a BGP optimizer knocked large parts of the Internet offline today.” https://blog.cloudflare.com/how-verizon-and-a-bgp-optimizer-knocked-large-parts-of-the-internet-offline-today/
- Cloudflare. “Analysis of today’s CenturyLink/Level 3 outage.” https://blog.cloudflare.com/analysis-of-todays-centurylink-level-3-outage/
- FCC. “CenturyLink 911 outage report and consent decree materials.” https://docs.fcc.gov/public/attachments/DOC-356196A1.pdf
- CISA. “Blue screen of death outage caused by CrowdStrike update.” https://www.cisa.gov/news-events/alerts/2024/07/19/blue-screen-death-outage-caused-crowdstrike-update
- CrowdStrike. “Preliminary post-incident review.” https://www.crowdstrike.com/blog/falcon-content-update-preliminary-post-incident-report/
Supporting analysis and measurement
- ThousandEyes. “Understanding the Dyn DDoS attack.” https://www.thousandeyes.com/blog/dyn-dns-ddos-attack
- RIPE NCC Labs. “YouTube hijacking: a RIPE NCC RIS case study.” https://labs.ripe.net/author/emileaben/youtube-hijacking-a-ripe-ncc-ris-case-study/
- Ars Technica. “How Pakistan knocked YouTube offline.” https://arstechnica.com/information-technology/2008/02/how-pakistan-knocked-youtube-offline-and-how-to-make-sure-it-never-happens-again/
- Reuters. “Akamai outage disrupts websites of banks, airlines, and others.” https://www.reuters.com/world/us/akamai-outage-disrupts-websites-banks-airlines-others-2021-07-22/
- Downdetector. Historical outage tracking. https://downdetector.com/