Dependency & Network Failure Intelligence
Most Outages Are Brownouts: Modeling Dependency Degradation
Why partial failures and latency issues cause more production incidents than complete outages.
When we think about failures, we imagine dramatic crashes — servers dying, databases exploding, networks severed. But the failures that actually hurt are subtle: the service that’s running but slow, the API that works 90% of the time, the database that occasionally times out.
These are brownouts. And they’re everywhere.
[Statistics panel: downtime cost (average across organizations); critical outages (cloud provider events in 2024); duration increase (longer incidents in 2024); IT/network issues (share of all impactful outages)]
Blackout vs. Brownout
A blackout is a complete failure. The service is down. Health checks fail. Load balancers route around it. Failover triggers. The blast radius is known, and recovery is straightforward: restart, fail over, done.
A brownout is partial failure. The service is slow, or flaky, or intermittently timing out. It’s hard to detect because health checks pass. Failover logic gets confused — is it down or just slow? The blast radius cascades unpredictably. Recovery is unclear because the problem is unclear.
| Characteristic | Blackout (Complete Failure) | Brownout (Partial Failure) |
|---|---|---|
| Service state | DOWN | SLOW or FLAKY |
| Detection difficulty | Easy — health checks fail | Hard — often passes health checks |
| Failover behavior | Triggers immediately | Confused — is it really down? |
| Blast radius | Known and contained | Cascading and unpredictable |
| Recovery path | Restart or failover | Often unclear |
Why Brownouts Are Worse
They Cascade
Consider a normal request path: Service A calls Service B, which queries a database. Under normal conditions, the entire chain completes in milliseconds. During a brownout in Service B, everything changes.
Service B starts responding slowly — say, 5 seconds instead of 50 milliseconds. Service A’s threads block waiting. Its connection pool exhausts. Service A starts timing out to its own callers. The cascade continues upstream until the entire request path degrades.
According to Parametrix’s 2024 Cloud Outage Risk Report, critical cloud service interruptions lasted 18.7% longer in 2024 than in 2023. In 2024 alone, six cloud outage events lasted more than ten hours each, totaling nearly 100 hours of downtime.
The October 2025 AWS Cascade
On October 20, 2025, AWS experienced a severe regional outage when a DynamoDB update contained a critical defect. This triggered cascading failures that affected 113 different AWS services for more than 15 hours. One slow service took down everything that depended on it.
They Exhaust Resources
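Under normal operation, a connection to Service B is used for 50 milliseconds and returned to the pool. During a brownout, that same connection is held for 5 seconds, 100x longer. If your pool has 50 connections and you’re handling 10 requests per second, normal traffic keeps less than one connection busy on average. During the brownout, those 10 requests per second each hold a connection for 5 seconds, so all 50 are in use within 5 seconds and there is no headroom left. After that, any extra request queues or fails, and callers blocked waiting for a connection can starve requests bound for completely healthy services.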
This is why brownouts often cause more damage than outages. A complete failure affects one path. A brownout can exhaust shared resources that affect everything.
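The pool arithmetic above is an instance of Little's law (average concurrency equals arrival rate times hold time). The sketch below reuses the figures from the example; the function and constant names are illustrative, not taken from any library.

```python
def connections_in_use(requests_per_second: float, hold_seconds: float) -> float:
    """Little's law: average busy connections = arrival rate x hold time."""
    return requests_per_second * hold_seconds


POOL_SIZE = 50   # pool size from the example above
RATE = 10.0      # requests per second from the example above

normal = connections_in_use(RATE, 0.050)   # 50 ms per call
brownout = connections_in_use(RATE, 5.0)   # 5 s per call

print(f"normal:   {normal:.1f} of {POOL_SIZE} connections busy")
print(f"brownout: {brownout:.1f} of {POOL_SIZE} connections busy")
print("pool saturated:", brownout >= POOL_SIZE)
```

The same traffic that barely touches the pool in normal operation consumes all of it once each call holds its connection 100x longer.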
They Defeat Circuit Breakers
Traditional circuit breakers monitor success and failure rates. If failures exceed a threshold, the circuit opens and stops sending traffic. But brownouts succeed — just slowly. A request that takes 4.9 seconds instead of timing out at 5 seconds counts as a success. The circuit never opens, but the system is unusable.
The Circuit Breaker Blind Spot
Standard circuit breakers watch for failures. Brownouts technically succeed. Your circuit breaker configuration might require 50% failures to open, but if requests are completing (slowly), you’ll never hit that threshold. Meanwhile, thread pools exhaust and cascades begin.
The Latency vs. Availability Trap
Your monitoring dashboard shows green across the board: Service B at 100% availability, 0% error rate, health checks passing. But users experience 30-second page loads, timeout errors in their browsers, and abandoned transactions.
This is the latency-availability trap. Availability metrics measure whether responses come back. They don’t measure whether responses come back in time.
| Metric Type | What It Measures | What It Misses |
|---|---|---|
| Availability (up/down) | Whether service responds | How long it takes |
| Error rate | 5xx responses | Slow successful responses |
| Health check | Basic connectivity | Actual request performance |
| P99 latency | Tail latency that users actually experience | Only the slowest 1% of requests; this is the metric you need |
According to SRE best practices for latency SLOs, teams should define latency objectives as percentiles (e.g., P95 < 200ms, P99 < 500ms) and treat long-tail breaches separately from availability incidents. If your P99 exceeds its target, you’re missing your latency objective even if availability is 100%.
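A minimal sketch of checking percentile SLOs against a batch of latency samples. It uses the simple nearest-rank definition of a percentile; the sample data is invented to show how a dashboard can report 100% availability while P99 is badly breached, and the helper names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    ordered = sorted(samples)
    if not ordered:
        raise ValueError("no samples")
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def slo_breaches(latencies_ms, slos_ms):
    """Return {percentile: observed_ms} for each SLO whose observed value is not under its limit."""
    breaches = {}
    for pct, limit_ms in slos_ms.items():
        observed = percentile(latencies_ms, pct)
        if observed >= limit_ms:
            breaches[pct] = observed
    return breaches


# Invented sample: every request succeeds, but 3 out of 100 take 30 seconds.
latencies_ms = [40] * 97 + [30_000] * 3
slos_ms = {95: 200, 99: 500}   # the example targets from the text

print(slo_breaches(latencies_ms, slos_ms))
```

Here P95 looks healthy while P99 is far over its limit, which is why the text recommends tracking the tail separately.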
Retry Storms: The Brownout Amplifier
When a database is slow (not down), the first request times out. The client retries immediately. That retry also times out. Meanwhile, more requests arrive, each spawning their own retries. Within seconds, the database sees 10x normal load — making it even slower, causing more timeouts, spawning more retries.
This is a retry storm. And it can turn a minor slowdown into complete collapse.
[Figure: retry storm amplification]
Real-world examples of retry storms abound. The Encore blog documents a 2018 incident where enabling a feature for all customers simultaneously caused a thundering herd that overwhelmed infrastructure. These patterns repeat whenever systems experience sudden load spikes without proper backoff mechanisms.
Breaking the Retry Storm
According to AWS best practices, retries are like a powerful medicine — useful in the right dose, but capable of significant damage when overused. The key strategies:
Exponential backoff: Each retry waits longer than the last. First retry after 1 second, then 2, then 4, capped at some maximum. This spreads load over time instead of hammering immediately.
Jitter: Add randomness to backoff intervals. Without jitter, thousands of clients that timed out simultaneously will all retry simultaneously. With jitter, retries spread across a time window.
Single retry point: In a multi-layer system, retry at exactly one layer — not at every hop. As AWS notes, if each layer in a five-deep stack retries three times, a single database failure causes 243x amplification.
Latency-aware retry decisions: If the last request took 4.9 seconds before succeeding, don’t immediately retry the next timeout. The dependency is struggling — let it recover.
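The first three strategies can be sketched as follows. This is a minimal illustration, not a production client: it uses the jitter variant often called "full jitter" (a uniform draw between zero and the capped exponential delay), and the function names, the default cap of 30 seconds, and the choice to retry only on `TimeoutError` are assumptions made for the example.

```python
import random
import time


def backoff_delay(attempt, base_s=1.0, cap_s=30.0, rng=random.uniform):
    """Exponential backoff with full jitter.

    attempt 0 draws from [0, 1 s], attempt 1 from [0, 2 s], attempt 2 from [0, 4 s],
    and so on, never exceeding cap_s.
    """
    ceiling = min(cap_s, base_s * (2 ** attempt))
    return rng(0.0, ceiling)


def call_with_retries(operation, max_attempts=4, sleep=time.sleep):
    """Retry operation() on TimeoutError, sleeping a jittered backoff between tries."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt))


def retry_amplification(layers, attempts_per_layer):
    """Worst-case calls reaching the bottom layer when every layer retries independently."""
    return attempts_per_layer ** layers


print(retry_amplification(layers=5, attempts_per_layer=3))  # 243, the figure quoted above
print(retry_amplification(layers=1, attempts_per_layer=3))  # 3 when only one layer retries
```

Wrapping calls in `call_with_retries` at exactly one layer keeps the worst case linear in the attempt count instead of exponential in the stack depth.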
Timeout Configuration: The Art of Giving Up
Timeouts are decisions about when to stop waiting and give up. Too short, and you abandon requests that would have succeeded. Too long, and you hold resources during brownouts.
Determining Good Timeouts
According to Gergely Orosz’s analysis on resiliency, base timeout duration on your acceptable false timeout rate. If you can tolerate 0.1% of eventually-successful requests timing out, set your timeout at the 99.9th percentile of the downstream service’s response time.
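One way to turn that rule into a number, assuming you have a list of observed downstream latencies: pick the smallest timeout that only the tolerated fraction of samples would exceed. The function name and the synthetic samples are illustrative.

```python
def timeout_for_false_timeout_rate(latencies_ms, tolerated_rate=0.001):
    """Smallest observed latency that at most tolerated_rate of samples exceed."""
    ordered = sorted(latencies_ms)
    if not ordered:
        raise ValueError("no samples")
    # How many eventually-successful requests we accept cutting off.
    # The small epsilon guards against float rounding just below an integer.
    allowed_over = int(tolerated_rate * len(ordered) + 1e-9)
    allowed_over = min(allowed_over, len(ordered) - 1)
    return ordered[len(ordered) - 1 - allowed_over]


# Synthetic samples: 1000 requests taking 1 ms, 2 ms, ..., 1000 ms.
samples_ms = list(range(1, 1001))
timeout_ms = timeout_for_false_timeout_rate(samples_ms, tolerated_rate=0.001)
print(timeout_ms)  # 999: only the single slowest request would have timed out
```

With few samples the tail estimate is noisy, so in practice this would be computed over a large, recent window.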
The Timeout Stack
In a multi-layer system, timeouts must be coordinated. The outer layer’s timeout must exceed the sum of inner timeouts, with room for retries and overhead. If your gateway times out at 8 seconds, but the full downstream chain needs 10 seconds, you’ll abort work that was about to complete.
| Timeout Anti-Pattern | Problem | Result |
|---|---|---|
| Same timeout everywhere | Outer layers time out before inner layers complete | Wasted work, zombie requests |
| No timeout at all | Request hangs forever | Thread exhaustion, memory leak |
| Timeout too high | Resources held too long | Pool exhaustion during brownout |
| Timeout too low | Valid slow requests fail | False failures, retry storms |
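The coordination rule (the outer timeout must cover inner timeouts plus retries and overhead) can be checked with simple arithmetic. In this sketch the per-attempt timeout, attempt count, and backoff are hypothetical values chosen so the worst case matches the 10-second chain from the gateway example.

```python
def min_outer_timeout(inner_timeout_s, attempts, backoff_s=0.0, overhead_s=0.0):
    """Worst case for the caller: every attempt runs to its timeout,
    with a fixed backoff between attempts, plus local overhead."""
    return attempts * inner_timeout_s + (attempts - 1) * backoff_s + overhead_s


GATEWAY_TIMEOUT_S = 8.0   # from the example above

# Hypothetical downstream policy: 3 attempts of 3 s each, 0.5 s apart.
needed_s = min_outer_timeout(inner_timeout_s=3.0, attempts=3, backoff_s=0.5)

print(f"downstream worst case: {needed_s} s, gateway allows: {GATEWAY_TIMEOUT_S} s")
if needed_s > GATEWAY_TIMEOUT_S:
    print("misconfigured: the gateway gives up before the inner layers can finish")
```

A check like this can run in CI against service configuration so mismatched timeouts are caught before a brownout exposes them.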
Adaptive Timeouts
Static timeouts assume stable latency. Real systems don’t work that way. An adaptive timeout tracks recent latency and adjusts accordingly — perhaps setting timeout at 2x the current P99, with a floor and ceiling. When the dependency speeds up, timeouts tighten. When it slows down, timeouts relax (within bounds).
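A minimal sketch of such an adaptive timeout, assuming a sliding window of recent latencies and the "2x current P99, clamped" rule mentioned above. The class name, window size, and default floor and ceiling are hypothetical.

```python
import math
from collections import deque


class AdaptiveTimeout:
    """Timeout = multiplier x recent P99, clamped to [floor_s, ceiling_s]."""

    def __init__(self, floor_s=0.1, ceiling_s=5.0, multiplier=2.0, window=500):
        self.floor_s = floor_s
        self.ceiling_s = ceiling_s
        self.multiplier = multiplier
        self.samples = deque(maxlen=window)   # oldest samples fall off automatically

    def record(self, latency_s):
        self.samples.append(latency_s)

    def current(self):
        if not self.samples:
            return self.ceiling_s             # no data yet: be permissive
        ordered = sorted(self.samples)
        rank = max(1, math.ceil(99 * len(ordered) / 100))   # nearest-rank P99
        p99 = ordered[rank - 1]
        return min(self.ceiling_s, max(self.floor_s, self.multiplier * p99))


timeout = AdaptiveTimeout()
for _ in range(100):
    timeout.record(0.05)          # healthy dependency: 50 ms calls
fast = timeout.current()          # tightens toward 2 x 50 ms

for _ in range(500):
    timeout.record(4.0)           # brownout: 4 s calls fill the window
slow = timeout.current()          # relaxes, but only up to the ceiling
print(fast, slow)
```

The ceiling is what keeps this safe: without it, a brownout would teach the client to wait ever longer and hold resources for exactly the duration the article warns about.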
Building Brownout Resilience
Deadline Propagation
Instead of each layer having independent timeouts, pass a deadline through the call chain. If the original request needs to complete in 10 seconds, downstream services receive “you have 8 seconds remaining.” If time runs out, abort early instead of doing work that can’t be returned in time.
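Deadline propagation can be sketched with a small object that is created once at the edge and passed down the call chain. The `Deadline` class, `DeadlineExceeded` exception, and `call_downstream` helper are hypothetical names for illustration; the injectable clock exists only to make the example testable.

```python
import time


class DeadlineExceeded(Exception):
    pass


class Deadline:
    """Absolute point in time by which the whole request must finish."""

    def __init__(self, budget_s, clock=time.monotonic):
        self._clock = clock
        self._expires_at = clock() + budget_s

    def remaining(self):
        return max(0.0, self._expires_at - self._clock())


def call_downstream(deadline, work, reserve_s=0.0):
    """Give the downstream call only the time that is left, minus a local reserve.

    `work` is any callable accepting timeout_s; abort early if nothing is left.
    """
    budget = deadline.remaining() - reserve_s
    if budget <= 0:
        raise DeadlineExceeded("no time left; skipping downstream work")
    return work(timeout_s=budget)


# Simulated clock so the example is deterministic.
now = [0.0]
deadline = Deadline(budget_s=10.0, clock=lambda: now[0])

now[0] = 2.0   # two seconds spent in upstream layers
print(call_downstream(deadline, lambda timeout_s: f"you have {timeout_s} s remaining"))
```

Across process boundaries the remaining budget would travel in a request header or RPC metadata field; that transport detail is left out here.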
Load Shedding
When latency exceeds acceptable thresholds, start rejecting some requests immediately. It’s better to fail fast for some users than to fail slowly for everyone. A simple approach: when current latency exceeds target, randomly reject 50% of incoming requests. This gives the system room to recover.
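The simple approach described above fits in a few lines. This is a sketch under stated assumptions: the class name is hypothetical, "current latency" is whatever recent measurement you trust (for example a rolling P99), and the 50% shed fraction comes from the text.

```python
import random


class LoadShedder:
    """Reject a fixed fraction of requests while latency is over target."""

    def __init__(self, target_latency_s, shed_fraction=0.5, rng=random.random):
        self.target_latency_s = target_latency_s
        self.shed_fraction = shed_fraction
        self.rng = rng   # injectable so behavior can be tested deterministically

    def should_reject(self, current_latency_s):
        if current_latency_s <= self.target_latency_s:
            return False                          # healthy: accept everything
        return self.rng() < self.shed_fraction    # degraded: shed at random


shedder = LoadShedder(target_latency_s=0.2)
rejected = sum(shedder.should_reject(1.5) for _ in range(10_000))
print(f"rejected {rejected} of 10000 requests while over target")
```

Rejected requests should return immediately with a clear overload response so that callers can back off instead of retrying into the same congestion.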
Latency-Aware Circuit Breakers
Extend your circuit breaker beyond success/failure to include latency violations. Track both failure count and latency-threshold violations. If either exceeds the limit, open the circuit. This catches brownouts that succeed technically but fail practically.
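A minimal sketch of a breaker that opens on either failures or slow calls within a sliding window. The class name and thresholds are hypothetical, and recovery (a half-open state that probes the dependency) is deliberately omitted to keep the example short.

```python
from collections import deque


class LatencyAwareBreaker:
    """Opens when failures OR slow calls in the recent window reach their limit."""

    def __init__(self, latency_threshold_s=1.0, max_failures=5, max_slow_calls=5, window=20):
        self.latency_threshold_s = latency_threshold_s
        self.max_failures = max_failures
        self.max_slow_calls = max_slow_calls
        self.outcomes = deque(maxlen=window)   # (failed, slow) per recent call
        self.is_open = False

    def record(self, succeeded, latency_s):
        slow = latency_s > self.latency_threshold_s
        self.outcomes.append((not succeeded, slow))
        failures = sum(1 for failed, _ in self.outcomes if failed)
        slow_calls = sum(1 for _, was_slow in self.outcomes if was_slow)
        if failures >= self.max_failures or slow_calls >= self.max_slow_calls:
            self.is_open = True

    def allow_request(self):
        return not self.is_open


breaker = LatencyAwareBreaker()
for _ in range(5):
    breaker.record(succeeded=True, latency_s=4.9)   # "successful" but unusable

print("failures seen: 0, circuit open:", breaker.is_open)
```

A failure-only breaker would still be closed after these five calls, which is the blind spot described earlier in the article.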
Why Timeouts Matter More Than Crashes
Consider two 10-minute incidents. In the first, a service crashes completely. The load balancer removes the instance within 30 seconds. Traffic routes to healthy instances. Users experience 30 seconds of failures, then normal operation.
In the second, a service enters a brownout in which 50% of requests are slow. The load balancer keeps routing traffic because the service still counts as “healthy.” Thread pools exhaust. Cascades begin. Users experience 10 minutes of degraded service, getting worse as retry storms amplify.
Crash impact: failover removes the unhealthy instance.
Brownout impact: cascading degradation continues.
The crash is dramatic but contained. The brownout is subtle but devastating. Understanding this difference is understanding real-world reliability.
References
- Parametrix: 2024 Cloud Outage Risk Report
- AWS and Azure Outages of October 2025: Analysis and Lessons
- Distributed Systems Horror Stories: The Thundering Herd Problem
- AWS: Timeouts, Retries, and Backoff with Jitter
- Latency SLO Best Practices
- Resiliency in Distributed Systems
This is part 1 of our “Dependency & Network Failure Intelligence” series. Next: Why Network Semantics Matter More Than Packet Loss.