Most Outages Are Brownouts: Modeling Dependency Degradation

Why partial failures and latency issues cause more production incidents than complete outages.

When we think about failures, we imagine dramatic crashes — servers dying, databases exploding, networks severed. But the failures that actually hurt are subtle: the service that’s running but slow, the API that works 90% of the time, the database that occasionally times out.

These are brownouts. And they’re everywhere.

$14,056/min: average across organizations

+18%: cloud provider events in 2024

+18.7%: longer incidents in 2024

23%: of all impactful outages

Blackout vs. Brownout

A blackout is a complete failure. The service is down. Health checks fail. Load balancers route around it. Failover triggers. The blast radius is known, and recovery is straightforward: restart, failover, done.

A brownout is partial failure. The service is slow, or flaky, or intermittently timing out. It’s hard to detect because health checks pass. Failover logic gets confused — is it down or just slow? The blast radius cascades unpredictably. Recovery is unclear because the problem is unclear.

| Characteristic | Blackout (Complete Failure) | Brownout (Partial Failure) |
| --- | --- | --- |
| Service state | DOWN | SLOW or FLAKY |
| Detection difficulty | Easy: health checks fail | Hard: often passes health checks |
| Failover behavior | Triggers immediately | Confused: is it really down? |
| Blast radius | Known and contained | Cascading and unpredictable |
| Recovery path | Restart or failover | Often unclear |

Why Brownouts Are Worse

They Cascade

Consider a normal request path: Service A calls Service B, which queries a database. Under normal conditions, the entire chain completes in milliseconds. During a brownout in Service B, everything changes.

Service B starts responding slowly — say, 5 seconds instead of 50 milliseconds. Service A’s threads block waiting. Its connection pool exhausts. Service A starts timing out to its own callers. The cascade continues upstream until the entire request path degrades.

According to Parametrix’s 2024 Cloud Outage Report, critical cloud service interruptions lasted 18.7% longer in 2024 than 2023. In 2024 alone, six cloud outage events lasted more than ten hours each, totaling nearly 100 hours of downtime.

The October 2025 AWS Cascade

On October 20, 2025, AWS experienced a severe regional outage when a DynamoDB update contained a critical defect. This triggered cascading failures that affected 113 different AWS services for more than 15 hours. One slow service took down everything that depended on it.

They Exhaust Resources

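Under normal operation, a connection to Service B is used for 50 milliseconds and returned to the pool. During a brownout, that same connection is held for 5 seconds, 100x longer. At 10 requests per second, that means roughly 50 connections in flight at once instead of fewer than one. If your pool has 50 connections, every one of them is checked out within 5 seconds and there is no headroom left. From then on, anything else that needs the pool (or the threads blocked behind it) queues or fails, including requests bound for completely healthy services.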
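The arithmetic behind this is Little's law: connections in use equal arrival rate times hold time. Here is a minimal sketch using the numbers from this section. The function names are illustrative, and it assumes a steady arrival rate with no retries.

```python
from typing import Optional


def connections_in_use(requests_per_second: float, hold_seconds: float) -> float:
    """Little's law: average busy connections = arrival rate x hold time."""
    return requests_per_second * hold_seconds


def seconds_until_pool_full(pool_size: int, requests_per_second: float,
                            hold_seconds: float) -> Optional[float]:
    """Time until every connection is checked out, or None if the pool never fills.

    Assumes a steady arrival rate and that no connection is returned
    before hold_seconds has elapsed.
    """
    if connections_in_use(requests_per_second, hold_seconds) < pool_size:
        return None  # steady-state demand fits inside the pool
    return pool_size / requests_per_second


normal = connections_in_use(10, 0.050)   # healthy: 50 ms per call
brownout = connections_in_use(10, 5.0)   # brownout: 5 s per call
time_to_full = seconds_until_pool_full(50, 10, 5.0)
```

With these inputs the healthy case keeps about half a connection busy on average, while the brownout case needs all 50. Retries are not modeled here, and they only make the picture worse.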

This is why brownouts often cause more damage than outages. A complete failure affects one path. A brownout can exhaust shared resources that affect everything.

They Defeat Circuit Breakers

Traditional circuit breakers monitor success and failure rates. If failures exceed a threshold, the circuit opens and stops sending traffic. But brownouts succeed — just slowly. A request that takes 4.9 seconds instead of timing out at 5 seconds counts as a success. The circuit never opens, but the system is unusable.

The Circuit Breaker Blind Spot

Standard circuit breakers watch for failures. Brownouts technically succeed. Your circuit breaker configuration might require 50% failures to open, but if requests are completing (slowly), you’ll never hit that threshold. Meanwhile, thread pools exhaust and cascades begin.

The Latency vs. Availability Trap

Your monitoring dashboard shows green across the board: Service B at 100% availability, 0% error rate, health checks passing. But users experience 30-second page loads, timeout errors in their browsers, and abandoned transactions.

This is the latency-availability trap. Availability metrics measure whether responses come back. They don’t measure whether responses come back in time.

| Metric Type | What It Measures | What It Misses |
| --- | --- | --- |
| Availability (up/down) | Whether service responds | How long it takes |
| Error rate | 5xx responses | Slow successful responses |
| Health check | Basic connectivity | Actual request performance |
| P99 latency | Actual user experience | Nothing: this is what you need |

According to SRE best practices, teams should define latency SLOs with percentiles (e.g., P95 < 200ms, P99 < 500ms) and treat long-tail breaches separately from availability incidents. If your P99 exceeds its target, you are breaching your latency objective even if availability is 100%.
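To make the trap concrete, here is a small sketch that checks a batch of successful requests against percentile targets. The nearest-rank percentile method, the function names, and the sample data are assumptions for illustration, not a specific monitoring tool's API.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_slo_breaches(samples_ms, targets_ms):
    """Return {percentile: observed latency} for every percentile at or above its target."""
    breaches = {}
    for pct, target in targets_ms.items():
        observed = percentile(samples_ms, pct)
        if observed >= target:  # the SLO is written as "P99 < 500ms"
            breaches[pct] = observed
    return breaches


# 100 requests, all successful: 97 fast, 3 very slow. Error rate is 0%.
samples = [50.0] * 97 + [4900.0] * 3
breaches = latency_slo_breaches(samples, {95: 200.0, 99: 500.0})
```

In this made-up batch, availability and error rate look perfect and P95 is fine, yet the P99 target is blown. That is the signal an availability dashboard never shows.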

Retry Storms: The Brownout Amplifier

When a database is slow (not down), the first request times out. The client retries immediately. That retry also times out. Meanwhile, more requests arrive, each spawning its own retries. Within seconds, the database sees 10x normal load, which makes it even slower, causes more timeouts, and spawns more retries.

This is a retry storm. And it can turn a minor slowdown into complete collapse.

Real-world examples of retry storms abound. The Encore blog documents a 2018 incident where enabling a feature for all customers simultaneously caused a thundering herd that overwhelmed infrastructure. These patterns repeat whenever systems experience sudden load spikes without proper backoff mechanisms.

Breaking the Retry Storm

According to AWS best practices, retries are like a powerful medicine — useful in the right dose, but capable of significant damage when overused. The key strategies:

Exponential backoff: Each retry waits longer than the last. First retry after 1 second, then 2, then 4, capped at some maximum. This spreads load over time instead of hammering immediately.

Jitter: Add randomness to backoff intervals. Without jitter, thousands of clients that timed out simultaneously will all retry simultaneously. With jitter, retries spread across a time window.

Single retry point: In a multi-layer system, retry at exactly one layer, not at every hop. As AWS notes, if each layer in a five-deep stack retries three times, the attempts multiply at every hop (3 × 3 × 3 × 3 × 3), so a single database failure causes 243x amplification.

Latency-aware retry decisions: If the last request took 4.9 seconds before succeeding, don’t immediately retry the next timeout. The dependency is struggling — let it recover.
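The first two strategies combine naturally. Below is a minimal sketch of capped exponential backoff with "full jitter", one common variant in which each delay is drawn uniformly between zero and the exponential ceiling. The function names, defaults, and the injectable `sleep` and `rng` parameters are illustrative assumptions.

```python
import random
import time


def backoff_delay(attempt, base=1.0, cap=30.0, rng=None):
    """Full-jitter delay before retry number `attempt` (0-based).

    The ceiling doubles each attempt (1s, 2s, 4s, ...) up to `cap`;
    the actual delay is drawn uniformly from [0, ceiling].
    """
    rng = rng or random
    ceiling = min(cap, base * (2 ** attempt))
    return rng.uniform(0.0, ceiling)


def call_with_retries(operation, max_retries=3, sleep=time.sleep, **backoff_kwargs):
    """Call operation(); on an exception, back off and retry up to max_retries times."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the failure to the caller
            sleep(backoff_delay(attempt, **backoff_kwargs))
```

A wrapper like this belongs at one layer of the stack only. In real code you would also narrow `except Exception` to the errors that are actually safe to retry.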

Timeout Configuration: The Art of Giving Up

Timeouts are decisions about when to stop waiting and give up. Too short, and you abandon requests that would have succeeded. Too long, and you hold resources during brownouts.

Determining Good Timeouts

According to Gergely Orosz’s analysis on resiliency, base timeout duration on your acceptable false timeout rate. If you can tolerate 0.1% of eventually-successful requests timing out, set your timeout at the 99.9th percentile of the downstream service’s response time.
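As a rough sketch of that rule, the function below picks a timeout from observed latencies so that at most the tolerated fraction of them would have been cut off. The function name and the synthetic sample data are assumptions. In practice the samples should come from a healthy period, not from the middle of a brownout.

```python
import math


def timeout_for_false_timeout_rate(latencies, false_timeout_rate):
    """Return the observed latency that only `false_timeout_rate` of samples exceed."""
    if not latencies:
        raise ValueError("need at least one latency sample")
    if not 0 <= false_timeout_rate < 1:
        raise ValueError("false_timeout_rate must be in [0, 1)")
    ordered = sorted(latencies)
    # How many samples we accept cutting off. The small epsilon guards
    # against float rounding such as 0.001 * 1000 landing just below 1.
    allowed_over = math.floor(false_timeout_rate * len(ordered) + 1e-9)
    allowed_over = min(allowed_over, len(ordered) - 1)
    return ordered[len(ordered) - 1 - allowed_over]


# Synthetic data: 1000 requests taking 1 ms, 2 ms, ..., 1000 ms.
latencies_ms = list(range(1, 1001))
timeout_ms = timeout_for_false_timeout_rate(latencies_ms, 0.001)
```

With 0.1% tolerance on this synthetic data, exactly one of the 1000 requests would have been timed out.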

The Timeout Stack

In a multi-layer system, timeouts must be coordinated. The outer layer’s timeout must exceed the sum of inner timeouts, with room for retries and overhead. If your gateway times out at 8 seconds, but the full downstream chain needs 10 seconds, you’ll abort work that was about to complete.

| Timeout Anti-Pattern | Problem | Result |
| --- | --- | --- |
| Same timeout everywhere | Outer layers time out before inner layers complete | Wasted work, zombie requests |
| No timeout at all | Request hangs forever | Thread exhaustion, memory leak |
| Timeout too high | Resources held too long | Pool exhaustion during brownout |
| Timeout too low | Valid slow requests fail | False failures, retry storms |
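One way to catch a mismatched stack before production is to compute the worst case from configuration. This is a sketch under simple assumptions: hops are called sequentially, every attempt runs to its full timeout, and backoff waits are folded into `overhead_s`. The `Hop` type and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Hop:
    name: str
    timeout_s: float      # per-attempt timeout for this downstream call
    max_retries: int = 0  # retries after the first attempt


def worst_case_seconds(hops, overhead_s=0.0):
    """Worst case for sequential hops: every attempt of every hop runs to its timeout."""
    return sum(h.timeout_s * (1 + h.max_retries) for h in hops) + overhead_s


def outer_timeout_is_safe(outer_timeout_s, hops, overhead_s=0.0):
    """True if the outer layer waits longer than the inner chain can possibly take."""
    return outer_timeout_s > worst_case_seconds(hops, overhead_s)


# Hypothetical chain: one retry against Service B, a single database attempt.
chain = [Hop("service-b", timeout_s=4.0, max_retries=1), Hop("database", timeout_s=2.0)]
needed = worst_case_seconds(chain)
gateway_ok = outer_timeout_is_safe(8.0, chain)
```

This example chain can take up to 10 seconds, so the 8-second gateway from the paragraph above fails the check.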

Adaptive Timeouts

Static timeouts assume stable latency. Real systems don’t work that way. An adaptive timeout tracks recent latency and adjusts accordingly — perhaps setting timeout at 2x the current P99, with a floor and ceiling. When the dependency speeds up, timeouts tighten. When it slows down, timeouts relax (within bounds).
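A minimal sketch of that idea follows: keep a rolling window of recent latencies and derive the timeout from their P99, clamped to a floor and ceiling. The class name, defaults, and window size are assumptions, and sorting the window on every read is fine for a sketch but not for a hot path.

```python
import math
from collections import deque


class AdaptiveTimeout:
    """Timeout = multiplier x recent P99, clamped to [floor_s, ceiling_s]."""

    def __init__(self, floor_s=0.1, ceiling_s=5.0, multiplier=2.0, window=1000):
        self.floor_s = floor_s
        self.ceiling_s = ceiling_s
        self.multiplier = multiplier
        self._samples = deque(maxlen=window)  # oldest samples fall off automatically

    def record(self, latency_s):
        self._samples.append(latency_s)

    def current(self):
        if not self._samples:
            return self.ceiling_s  # no data yet: start lenient
        ordered = sorted(self._samples)
        rank = max(1, math.ceil(99 * len(ordered) / 100))  # nearest-rank P99
        p99 = ordered[rank - 1]
        return min(self.ceiling_s, max(self.floor_s, self.multiplier * p99))
```

The ceiling matters most during a brownout. Without it, the timeout would follow the slow dependency upward and hold resources for exactly as long as the problem lasts.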

Building Brownout Resilience

Deadline Propagation

Instead of each layer having independent timeouts, pass a deadline through the call chain. If the original request needs to complete in 10 seconds, downstream services receive “you have 8 seconds remaining.” If time runs out, abort early instead of doing work that can’t be returned in time.
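In code, this usually means passing an absolute expiry time rather than a duration, so every hop can work out what is left. The sketch below is one hypothetical shape for it. The `Deadline` class, `handle_request`, and the `query_database` callable are illustrative names, not an existing library.

```python
import time


class DeadlineExceeded(Exception):
    """Raised when the request budget is already spent."""


class Deadline:
    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s  # absolute point in time

    def remaining(self):
        """Seconds left in the budget, never negative."""
        return max(0.0, self._expires_at - self._clock())

    def check(self):
        if self.remaining() <= 0:
            raise DeadlineExceeded("no time left in the request budget")


def handle_request(deadline, query_database):
    deadline.check()  # abort before starting work that cannot finish in time
    # Hand the remaining budget downstream instead of a fixed per-hop timeout.
    return query_database(timeout_s=deadline.remaining())
```

Across process boundaries the remaining budget has to travel with the request, for example in a header, because the monotonic clocks of two machines are not comparable.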

Load Shedding

When latency exceeds acceptable thresholds, start rejecting some requests immediately. It’s better to fail fast for some users than to fail slowly for everyone. A simple approach: when current latency exceeds target, randomly reject 50% of incoming requests. This gives the system room to recover.
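The simple approach described above fits in a few lines. This sketch assumes the caller supplies a current latency measurement (for example, a recent P99). The class name and the fixed 50% shed fraction are illustrative.

```python
import random


class LoadShedder:
    """Reject a fixed fraction of requests while latency is above target."""

    def __init__(self, target_latency_s, shed_fraction=0.5, rng=None):
        self.target_latency_s = target_latency_s
        self.shed_fraction = shed_fraction
        self._rng = rng or random.Random()

    def should_reject(self, current_latency_s):
        if current_latency_s <= self.target_latency_s:
            return False  # healthy: admit everything
        # Over target: reject a random share so the rest can finish quickly.
        return self._rng.random() < self.shed_fraction
```

A rejected request should get an immediate, cheap error (such as HTTP 503) so the caller can back off. A more refined version might scale the shed fraction with how far latency is over target, or shed low-priority traffic first.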

Latency-Aware Circuit Breakers

Extend your circuit breaker beyond success/failure to include latency violations. Track both failure count and latency-threshold violations. If either exceeds the limit, open the circuit. This catches brownouts that succeed technically but fail practically.
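Here is a simplified sketch of such a breaker. It counts failures and slow successes over a sliding window of recent calls and opens if either count exceeds its limit. The class name, thresholds, and cool-down behavior are assumptions, and it omits the half-open probing state that production breakers usually have.

```python
import time
from collections import deque


class LatencyAwareBreaker:
    def __init__(self, window=20, max_failures=10, max_slow=10,
                 slow_threshold_s=1.0, cooldown_s=30.0, clock=time.monotonic):
        self.max_failures = max_failures
        self.max_slow = max_slow
        self.slow_threshold_s = slow_threshold_s
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._outcomes = deque(maxlen=window)  # "ok", "fail", or "slow"
        self._opened_at = None                 # None means the circuit is closed

    def allow_request(self):
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self.cooldown_s:
            # Cool-down over: close the circuit and start with a clean window.
            self._opened_at = None
            self._outcomes.clear()
            return True
        return False

    def record(self, succeeded, latency_s):
        if not succeeded:
            outcome = "fail"
        elif latency_s > self.slow_threshold_s:
            outcome = "slow"  # technically a success, practically a failure
        else:
            outcome = "ok"
        self._outcomes.append(outcome)
        too_many_failures = self._outcomes.count("fail") > self.max_failures
        too_many_slow = self._outcomes.count("slow") > self.max_slow
        if too_many_failures or too_many_slow:
            self._opened_at = self._clock()
```

With a breaker like this, the 4.9-second "successes" from earlier count against the dependency, so the circuit opens even though the error rate stays at zero.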

Why Timeouts Matter More Than Crashes

Consider two 10-minute incidents. In the first, a service crashes completely. The load balancer removes the instance within 30 seconds. Traffic routes to healthy instances. Users experience 30 seconds of failures, then normal operation.

In the second, a service enters brownout — 50% of requests are slow. The load balancer keeps routing traffic because the service is “healthy.” Half of all requests are slow. Thread pools exhaust. Cascades begin. Users experience 10 minutes of degraded service, getting worse as retry storms amplify.

Crash: 30 sec. Failover removes unhealthy instance.

Brownout: 10+ min. Cascading degradation continues.

The crash is dramatic but contained. The brownout is subtle but devastating. Understanding this difference is understanding real-world reliability.

This is part 1 of our “Dependency & Network Failure Intelligence” series. Next: Why Network Semantics Matter More Than Packet Loss.