Failure Is a Feature
What Is a Failure Catalog? A Practical Taxonomy of Real-World Breakages
A structured approach to categorizing and understanding the failure modes that actually take down production systems.
If you want to build reliable systems, you need to understand how they fail. Not in abstract terms, but in specific, reproducible, testable ways. That’s what a failure catalog provides — and according to IEEE research on chaos engineering, organizations with structured approaches to failure testing identify 84.3% of potential failure modes before production impact.
[Stat cards: Failure Detection (identified before production impact); Incident Reduction (with systematic testing); MTTR Improvement (mean time to recovery); RTO Success (systems meet recovery targets)]
Defining the Failure Catalog
A failure catalog is a structured enumeration of failure modes relevant to your technology stack. The concept borrows from Failure Mode and Effects Analysis (FMEA), a methodology developed by the U.S. military in the 1940s that systematically identifies potential failures, their causes, and their effects. Applied to software systems, a failure catalog becomes a field guide to everything that can go wrong — categorized, prioritized, and testable.
The key insight: semantic failures cause more outages than physical failures. The server rarely crashes. Instead, the API returns the wrong status code. The scheduler makes a bad decision. The configuration causes unexpected behavior. The rate limiter silently throttles critical requests.
Physical vs. Semantic Failures
Physical failures happen when infrastructure literally stops working — servers crash, network cables get unplugged, data centers lose power, disks fail. These are dramatic but rare. Modern cloud infrastructure is designed to handle them automatically, with redundancy at every layer.
Semantic failures are different. The infrastructure is running perfectly, but the behavior is wrong. An API returns incorrect data. A scheduler places pods on the wrong nodes. Configuration drift causes unexpected side effects. Rate limits silently block legitimate traffic. Certificates expire without warning.
| Characteristic | Physical Failures | Semantic Failures |
|---|---|---|
| Infrastructure status | Dead | Healthy but misbehaving |
| Frequency | Rare | Common |
| Detection | Usually obvious | Often subtle |
| Built-in handling | Cloud providers handle automatically | Application must handle explicitly |
| Testing cost | Expensive (need real infrastructure) | Cheap (mocks and simulators) |
The Hidden Majority
According to industry data, semantic failures (configuration errors, dependency issues, control plane problems) account for roughly 90% of production outages. Physical hardware failures account for less than 10%. Yet most resilience efforts focus on the dramatic minority.
The Four Pillars of Failure
A comprehensive failure catalog organizes failures into domains that map to how modern distributed systems are structured.
Control-Plane Failures
The brain of your system misbehaves. In Kubernetes, this means API server throttling (returning 429 Too Many Requests under load), scheduler failures (unable to find suitable nodes for pods), and etcd issues (leader elections causing temporary inconsistency). In cloud environments, it includes IAM propagation delays (permissions granted but not yet effective) and metadata service throttling (instance identity temporarily unavailable).
Control-plane failures are particularly insidious because they affect the management layer, not the workload layer. Everything might look fine from the application’s perspective, but new deployments fail, autoscaling stops working, and configuration changes don’t propagate.
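Because a throttled control plane is a semantic failure, it can be reproduced without real infrastructure. The sketch below is illustrative only: `FakeApiServer`, `ThrottledError`, and `apply_with_backoff` are hypothetical names, not part of any Kubernetes client, and the backoff is deliberately deterministic (no jitter) so the test stays reproducible.

```python
class ThrottledError(Exception):
    """Raised by the fake API when it answers 429 Too Many Requests."""


class FakeApiServer:
    """Hypothetical stand-in for a control-plane API that throttles its first N calls."""

    def __init__(self, throttle_first: int) -> None:
        self.throttle_first = throttle_first
        self.calls = 0

    def apply(self, manifest: dict) -> int:
        self.calls += 1
        if self.calls <= self.throttle_first:
            raise ThrottledError("429 Too Many Requests")
        return 200


def apply_with_backoff(server, manifest, max_attempts=5, base_delay=0.1, sleep=None):
    """Retry throttled calls with exponential backoff.

    Returns (status, delays), where delays lists the waits that were requested.
    Pass sleep=time.sleep to actually wait; the default only records the delays.
    """
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    delays = []
    for attempt in range(max_attempts):
        try:
            return server.apply(manifest), delays
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the throttling to the caller
            delay = base_delay * (2 ** attempt)
            delays.append(delay)
            if sleep is not None:
                sleep(delay)


server = FakeApiServer(throttle_first=2)
status, delays = apply_with_backoff(server, {"kind": "Deployment"})
```

Here the deployment succeeds on the third call. Raising `throttle_first` above `max_attempts` exercises the other branch, where the caller has to decide what to do when the control plane stays throttled.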
Dependency Failures
The services you rely on misbehave. Databases hit connection limits, replicas fall behind with stale reads, or transactions deadlock under contention. Caches experience miss storms after restarts or evict critical data under memory pressure. External APIs rate-limit your requests, spike to multi-second latencies, or return errors for some percentage of calls.
Common Dependency Failure Modes
Dependency failures cascade. A database running slow puts back-pressure on your application, which queues requests, which exhausts memory, which triggers OOM kills, which creates more load on the remaining instances.
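One way to see where the cascade can be interrupted is to model the queueing step. The following is a toy simulation under stated assumptions, not a production pattern: `BoundedWorkQueue` and `simulate_slow_dependency` are hypothetical names, time advances in discrete ticks, and the slow database is represented only by a low completion rate.

```python
from collections import deque


class BoundedWorkQueue:
    """Hypothetical request queue that sheds load instead of growing without bound."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.rejected = 0
        self._items = deque()

    def offer(self, request) -> bool:
        """Accept the request if there is room; otherwise count it as shed."""
        if len(self._items) >= self.capacity:
            self.rejected += 1
            return False
        self._items.append(request)
        return True

    def take(self):
        """Remove and return the oldest queued request, or None if the queue is empty."""
        return self._items.popleft() if self._items else None

    def __len__(self) -> int:
        return len(self._items)


def simulate_slow_dependency(queue, arrivals_per_tick, completions_per_tick, ticks):
    """Feed the queue faster than a slow dependency drains it; return the peak depth."""
    peak = 0
    next_id = 0
    for _ in range(ticks):
        for _ in range(arrivals_per_tick):
            queue.offer(next_id)
            next_id += 1
        peak = max(peak, len(queue))
        for _ in range(completions_per_tick):
            queue.take()
    return peak


queue = BoundedWorkQueue(capacity=10)
peak = simulate_slow_dependency(queue, arrivals_per_tick=5, completions_per_tick=1, ticks=20)
```

With a capacity of 10 the queue depth never exceeds 10 and the excess requests are rejected explicitly. Without a bound, the same arrival and completion rates would grow the queue by four requests per tick, which is the memory-exhaustion step in the cascade.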
Configuration Failures
Settings that cause unexpected behavior. Resource limits set too low cause containers to get OOMKilled. Probe misconfiguration causes healthy containers to be terminated. Missing secrets cause applications to crash on startup. Service selectors that don’t match pod labels cause traffic to go nowhere. Ingress path conflicts route requests to the wrong backend. Network policies accidentally block legitimate traffic.
Configuration failures are particularly dangerous because they often pass validation. The YAML is syntactically correct. The values are within allowed ranges. But the combination produces behavior no one intended.
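The selector mismatch is a good example of a check that syntax validation cannot perform but a few lines of code can. This is a simplified sketch: it treats selectors and labels as plain dictionaries with equality-based matching, ignores namespaces and pod readiness, and uses hypothetical function names.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every key/value pair in the selector appears in the pod labels.

    An empty selector is treated as matching nothing, which is an assumption
    made for this check rather than a statement about cluster behavior.
    """
    if not selector:
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def find_orphaned_services(services: dict, pods: dict) -> list:
    """Return the sorted names of services whose selector matches no pod."""
    return sorted(
        name
        for name, selector in services.items()
        if not any(selector_matches(selector, labels) for labels in pods.values())
    )


services = {
    "web": {"app": "web"},
    "api": {"app": "api", "tier": "backend"},
}
pods = {
    "web-1": {"app": "web", "version": "v2"},
    "api-1": {"app": "api", "tier": "back-end"},  # typo: does not match "backend"
}
orphaned = find_orphaned_services(services, pods)
```

Both manifests would be syntactically valid, yet `orphaned` contains `"api"` because one label value differs by a hyphen.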
Observability Failures
You can’t see what’s happening. Metrics cardinality explosions cause Prometheus to run out of memory. Scrape failures create gaps in time-series data. Aggregation errors cause dashboards to show wrong values. Logging buffers overflow and drop entries. Trace sampling misses critical requests. Context propagation breaks and creates orphaned spans.
The Observability Paradox
When you most need visibility — during an incident — is exactly when observability systems are most likely to be overwhelmed. High load causes more logs, more metrics, more traces, which causes more load on observability infrastructure, which causes it to drop data or become unavailable.
Observability failures compound other failures. If you can’t see what’s happening, you can’t diagnose the problem. If you can’t diagnose, you can’t fix. Mean time to recovery stretches while engineers fly blind.
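A cardinality explosion is easy to estimate before it happens. As a rough sketch, the worst-case number of time series for one metric is the product of the distinct values of each label; the function names and the series budget below are assumptions chosen for illustration.

```python
from math import prod


def estimated_series(label_cardinalities: dict) -> int:
    """Upper bound on series for one metric: the product of distinct values per label."""
    return prod(label_cardinalities.values())


def exceeds_budget(label_cardinalities: dict, budget: int) -> bool:
    """True if the worst-case series count is above the allowed budget."""
    return estimated_series(label_cardinalities) > budget


# Hypothetical HTTP request counter with bounded labels.
bounded = {"method": 5, "status": 10, "endpoint": 50}

# The same metric after someone adds an unbounded label.
unbounded = {**bounded, "user_id": 10_000}

bounded_series = estimated_series(bounded)
unbounded_series = estimated_series(unbounded)
```

The bounded version tops out at 2,500 series, while adding a single `user_id` label raises the worst case to 25,000,000. Real counts are usually lower because not every label combination occurs, so this is best read as an upper bound.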
Why Catalogs Outperform Ad-Hoc Chaos
Ad-hoc chaos testing takes a random approach: break things and see what happens. Failure catalogs take a systematic approach: enumerate every known failure mode and verify handling for each one.
| Aspect | Chaos Engineering | Failure Catalogs |
|---|---|---|
| Coverage | Random/probabilistic | Systematic/complete |
| Reproducibility | Non-deterministic | Fully reproducible |
| Learning | Discover unknown unknowns | Verify known failure handling |
| Cost | Requires production-like infra | Can use simulators |
| Best for | Finding new failure modes | Ensuring coverage of known modes |
According to research on chaos engineering effectiveness, organizations with structured hypothesis-driven approaches experience 68.9% higher success rates in identifying system vulnerabilities. Systems tested through systematic chaos experiments meet their recovery time objectives 82.4% of the time, compared to 57.3% for untested systems.
The ideal approach combines both: use chaos engineering to discover new failure modes, then add them to your catalog for systematic coverage going forward.
Building Your First Catalog
Incident Mining
Review your last 20 incidents and categorize them. Google’s SRE team uses standardized postmortem templates to capture root causes and triggers, enabling trend analysis that targets systemic improvements. For each incident, identify whether it was physical or semantic, which pillar it falls under (control-plane, dependency, configuration, observability), and what the specific failure mode was.
Many teams find that a large share of their incidents, often around 80%, fall into a handful of recurring categories. Those are the first entries in your catalog.
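The tallying step can be sketched in a few lines. The incident data below is invented for illustration, and `top_categories` is a hypothetical helper that returns the most frequent failure modes until they cover a chosen share of all incidents.

```python
from collections import Counter


def top_categories(incidents, threshold=0.8):
    """Return the most frequent failure modes that together cover `threshold` of incidents.

    `incidents` is an iterable of (pillar, failure_mode) tuples.
    """
    counts = Counter(mode for _pillar, mode in incidents)
    total = sum(counts.values())
    selected = []
    covered = 0
    for mode, count in counts.most_common():
        if total and covered / total >= threshold:
            break
        selected.append((mode, count))
        covered += count
    return selected


# Made-up sample of 20 categorized incidents.
incidents = (
    [("dependency", "connection exhaustion")] * 8
    + [("configuration", "expired certificate")] * 5
    + [("configuration", "OOMKilled")] * 3
    + [("configuration", "selector mismatch")] * 2
    + [("observability", "scrape gap")]
    + [("control-plane", "etcd leader election")]
)

by_pillar = Counter(pillar for pillar, _mode in incidents)
first_entries = top_categories(incidents)
```

In this sample, three failure modes account for 16 of the 20 incidents, so they would be the first candidates for catalog entries.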
Stack Analysis
Map your technology stack to known failure modes. Kubernetes has well-documented failure patterns around control-plane throttling, scheduling deadlocks, and network policy issues. PostgreSQL has known issues with connection exhaustion, replication lag, and lock contention. Redis has failure modes around memory pressure, persistence, and cluster splits. Every technology in your stack has documented failure modes — the catalog collects them in one place.
Prioritize by Risk
Not all failures are equal. FMEA methodology suggests ranking by severity (how bad is it when it happens), frequency (how often does it happen), and detectability (how quickly can you find it). High severity, high frequency, low detectability failures get tested first.
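FMEA commonly combines these three ratings into a Risk Priority Number (RPN), the product of severity, occurrence, and detection, each scored from 1 to 10. Note that the detection score rises as a failure gets harder to detect, so "low detectability" maps to a high score. The sketch below uses hypothetical entries and made-up ratings.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int    # 1 (minor) to 10 (catastrophic)
    occurrence: int  # 1 (rare) to 10 (frequent)
    detection: int   # 1 (caught immediately) to 10 (almost never detected)

    def __post_init__(self):
        for field_name in ("severity", "occurrence", "detection"):
            value = getattr(self, field_name)
            if not 1 <= value <= 10:
                raise ValueError(f"{field_name} must be between 1 and 10, got {value}")

    @property
    def rpn(self) -> int:
        """Risk Priority Number: severity x occurrence x detection."""
        return self.severity * self.occurrence * self.detection


def prioritize(modes):
    """Return failure modes ordered from highest to lowest RPN."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Illustrative ratings only; real scores come from your own incident history.
catalog = [
    FailureMode("disk failure", severity=7, occurrence=2, detection=2),
    FailureMode("API server 429 under load", severity=5, occurrence=6, detection=4),
    FailureMode("expired certificate", severity=8, occurrence=3, detection=8),
]
ranked = prioritize(catalog)
```

With these ratings the expired certificate ranks first (RPN 192), ahead of API throttling (120) and disk failure (28), mainly because it is hard to detect before it causes an outage.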
[Figure: Failure Priority Matrix (Example)]
From Catalog to Action
A failure catalog isn’t documentation — it’s a testing framework. Each entry should specify how to simulate the failure, what behavior to expect from the system, and how to verify that behavior. Then you can run these scenarios in CI/CD, ensuring that every change is validated against your known failure modes.
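One possible shape for such an entry is sketched below. Everything here is an assumption made for illustration: `CatalogEntry`, `run_catalog`, and the toy `load_db_password` function are hypothetical, and a real harness would inject failures into actual services or simulators rather than a dictionary.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass
class CatalogEntry:
    name: str
    expected: str                         # human-readable expected behavior
    simulate: Callable[[], object]        # produces the failed condition
    exercise: Callable[[object], object]  # runs the system under test against it
    verify: Callable[[object], bool]      # checks the observed behavior


def run_catalog(entries: Iterable[CatalogEntry]) -> Dict[str, bool]:
    """Run every entry; an unexpected exception counts as a failed entry."""
    results = {}
    for entry in entries:
        try:
            outcome = entry.exercise(entry.simulate())
            results[entry.name] = bool(entry.verify(outcome))
        except Exception:
            results[entry.name] = False
    return results


def load_db_password(env: dict) -> dict:
    """Toy system under test: reports a clear error instead of crashing."""
    if "DB_PASSWORD" not in env:
        return {"ready": False, "reason": "missing secret DB_PASSWORD"}
    return {"ready": True, "reason": ""}


entries = [
    CatalogEntry(
        name="missing-secret",
        expected="service reports not-ready and names the missing secret",
        simulate=lambda: {},  # environment with the secret removed
        exercise=load_db_password,
        verify=lambda out: out["ready"] is False and "DB_PASSWORD" in out["reason"],
    ),
]
results = run_catalog(entries)
```

Because `run_catalog` returns a plain mapping of entry name to pass or fail, a CI job can fail the build whenever any value is `False`.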
The organizations that achieve high reliability aren’t the ones with the most redundant infrastructure. They’re the ones that systematically understand, test, and handle the failures that actually happen in production.
References
- IEEE Research: Chaos Engineering Effectiveness
- ACM Computing Surveys: Chaos Engineering Multi-Vocal Literature Review
- Google SRE Workbook: Postmortem Analysis
- ASQ: Failure Mode and Effects Analysis (FMEA)
- Wikipedia: Failure Mode and Effects Analysis
- Steadybit: What Is Chaos Engineering?
This is part 2 of our “Failure Is a Feature” series. Next: Semantic vs Physical Failures.