What Is a Failure Catalog? A Practical Taxonomy for SRE

If you want to build reliable systems, you need to understand how they fail. Not in abstract terms, but in specific, reproducible, testable ways. That’s what a failure catalog provides. According to IEEE research on chaos engineering, organizations with structured approaches to failure testing identify 84.3% of potential failure modes before production impact. A catalog supplies the systematic coverage that chaos engineering alone lacks; the Kubernetes failure catalog is one concrete example of the approach applied to a single stack.

Metric	Value	Context
Failure Detection	84%	Identified before production impact
Incident Reduction	43%	With systematic testing
MTTR Improvement	32%	Mean time to recovery
RTO Success	82%	Systems meet recovery targets

Defining the Failure Catalog

A failure catalog is a structured enumeration of failure modes relevant to your technology stack. The concept borrows from Failure Mode and Effects Analysis (FMEA), a methodology developed by the U.S. military in the 1940s that systematically identifies potential failures, their causes, and their effects. Applied to software systems, a failure catalog becomes a field guide to everything that can go wrong — categorized, prioritized, and testable.

The key insight: semantic failures cause more outages than physical failures. The server rarely crashes. Instead, the API returns the wrong status code. The scheduler makes a bad decision. The configuration causes unexpected behavior. The rate limiter silently throttles critical requests.

Physical vs. Semantic Failures

Physical failures happen when infrastructure literally stops working — servers crash, network cables get unplugged, data centers lose power, disks fail. These are dramatic but rare. Modern cloud infrastructure is designed to handle them automatically, with redundancy at every layer.

Semantic failures are different. The infrastructure is running perfectly, but the behavior is wrong. An API returns incorrect data. A scheduler places pods on the wrong nodes. Configuration drift causes unexpected side effects. Rate limits silently block legitimate traffic. Certificates expire without warning.

Characteristic	Physical Failures	Semantic Failures
Infrastructure status	Dead	Healthy but misbehaving
Frequency	Rare	Common
Detection	Usually obvious	Often subtle
Built-in handling	Cloud providers handle automatically	Application must handle explicitly
Testing cost	Expensive (need real infrastructure)	Cheap (mocks and simulators)

The Hidden Majority

According to industry data, semantic failures (configuration errors, dependency issues, control plane problems) account for roughly 90% of production outages. Physical hardware failures account for less than 10%. Yet most resilience efforts focus on the dramatic minority.

The Four Pillars of Failure

A comprehensive failure catalog organizes failures into domains that map to how modern distributed systems are structured.

Control-Plane Failures

The brain of your system misbehaves. In Kubernetes, this means API server throttling (returning 429 Too Many Requests under load), scheduler failures (unable to find suitable nodes for pods), and etcd issues (leader elections that briefly stall writes). In cloud environments, it includes IAM propagation delays (permissions granted but not yet effective) and metadata service throttling (instance identity temporarily unavailable).

Control-plane failures are particularly insidious because they affect the management layer, not the workload layer. Everything might look fine from the application’s perspective, but new deployments fail, autoscaling stops working, and configuration changes don’t propagate.
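As a rough illustration of testing one control-plane entry, the sketch below simulates an API server that answers with 429 and a client that retries with exponential backoff. `FakeApiServer`, `ThrottledError`, and `call_with_backoff` are hypothetical names invented for this example; they are not part of Kubernetes or any client library.

```python
import random
import time
from dataclasses import dataclass


class ThrottledError(Exception):
    """Stands in for an HTTP 429 Too Many Requests response."""


@dataclass
class FakeApiServer:
    """Hypothetical simulator: rejects the first `throttle_count` calls."""
    throttle_count: int
    calls: int = 0

    def create_deployment(self, name: str) -> str:
        self.calls += 1
        if self.calls <= self.throttle_count:
            raise ThrottledError("429 Too Many Requests")
        return f"deployment/{name} created"


def call_with_backoff(fn, max_attempts=5, base_delay=0.1, sleep=time.sleep):
    """Retry `fn` on throttling, doubling the wait each time and adding jitter."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except ThrottledError:
            if attempt == max_attempts - 1:
                raise  # give up: the caller must handle sustained throttling
            delay = base_delay * (2 ** attempt) + random.uniform(0, base_delay)
            sleep(delay)
```

With `FakeApiServer(throttle_count=3)`, the fourth call succeeds; with a throttle count above `max_attempts`, the error propagates, which is the behavior a catalog entry for sustained throttling would want to verify. Passing a no-op `sleep` keeps the test fast.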

Dependency Failures

The services you rely on misbehave. Databases hit connection limits, replicas fall behind with stale reads, or transactions deadlock under contention. Caches experience miss storms after restarts or evict critical data under memory pressure. External APIs rate-limit your requests, spike to multi-second latencies, or return errors for some percentage of calls.

Common Dependency Failure Modes

Dependency failures cascade. A database running slow puts back-pressure on your application, which queues requests, which exhausts memory, which triggers OOM kills, which creates more load on the remaining instances.
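One common way to interrupt this cascade is to bound the queue and reject excess work instead of letting memory grow. The sketch below is a toy simulation under that assumption; `BoundedWorkQueue` and `simulate` are hypothetical names, and the tick-based model is a simplification rather than a measurement of any real system.

```python
from collections import deque


class BoundedWorkQueue:
    """Rejects new work once `capacity` is reached instead of queueing without limit."""

    def __init__(self, capacity: int):
        self.capacity = capacity
        self.items = deque()
        self.rejected = 0

    def offer(self, request) -> bool:
        if len(self.items) >= self.capacity:
            self.rejected += 1  # shed load rather than exhaust memory
            return False
        self.items.append(request)
        return True

    def drain(self, n: int) -> list:
        """Simulate the database completing up to `n` queued requests."""
        done = []
        while self.items and len(done) < n:
            done.append(self.items.popleft())
        return done


def simulate(arrivals_per_tick, db_completions_per_tick, ticks, capacity):
    """Return (peak queue length, rejected request count) for a run."""
    queue = BoundedWorkQueue(capacity)
    peak = 0
    for tick in range(ticks):
        for i in range(arrivals_per_tick):
            queue.offer((tick, i))
        peak = max(peak, len(queue.items))
        queue.drain(db_completions_per_tick)
    return peak, queue.rejected
```

With a healthy database, `simulate(10, 10, 20, 50)` never rejects anything. With a slow one, `simulate(10, 2, 20, 50)` caps the queue at 50 and rejects 112 requests, so the failure shows up as explicit rejections rather than an OOM kill.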

Configuration Failures

Settings that cause unexpected behavior. Resource limits set too low cause containers to get OOMKilled. Probe misconfiguration causes healthy containers to be terminated. Missing secrets cause applications to crash on startup. Service selectors that don’t match pod labels cause traffic to go nowhere. Ingress path conflicts route requests to the wrong backend. Network policies accidentally block legitimate traffic.

Configuration failures are particularly dangerous because they often pass validation. The YAML is syntactically correct. The values are within allowed ranges. But the combination produces behavior no one intended.
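The selector mismatch case shows why syntactic validation is not enough: both objects are valid on their own, and only comparing them reveals the problem. The following sketch checks equality-based selectors against pod labels using plain dictionaries. The function names and the input shapes are assumptions for illustration, and it assumes every service has a non-empty selector.

```python
def selector_matches(selector: dict, labels: dict) -> bool:
    """True if every key/value pair in the selector appears in the pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def find_orphaned_services(services: dict, pods: dict) -> list:
    """Return names of services whose selector matches no pod.

    `services` maps service name -> selector dict.
    `pods` maps pod name -> labels dict.
    """
    orphaned = []
    for name, selector in services.items():
        if not any(selector_matches(selector, labels) for labels in pods.values()):
            orphaned.append(name)
    return sorted(orphaned)


# Hypothetical example data: "api" selects tier=backend, but the pod says back-end.
SERVICES = {
    "web": {"app": "web"},
    "api": {"app": "api", "tier": "backend"},
}
PODS = {
    "web-1": {"app": "web", "version": "v2"},
    "api-1": {"app": "api", "tier": "back-end"},
}
```

Here `find_orphaned_services(SERVICES, PODS)` returns `["api"]`: a one-character label difference that passes schema validation but sends traffic nowhere.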

Observability Failures

You can’t see what’s happening. Metrics cardinality explosions cause Prometheus to run out of memory. Scrape failures create gaps in time-series data. Aggregation errors cause dashboards to show wrong values. Logging buffers overflow and drop entries. Trace sampling misses critical requests. Context propagation breaks and creates orphaned spans.

The Observability Paradox

When you most need visibility — during an incident — is exactly when observability systems are most likely to be overwhelmed. High load causes more logs, more metrics, more traces, which causes more load on observability infrastructure, which causes it to drop data or become unavailable.

Observability failures compound other failures. If you can’t see what’s happening, you can’t diagnose the problem. If you can’t diagnose, you can’t fix. Mean time to recovery stretches while engineers fly blind.
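The cardinality explosion mentioned above can be demonstrated with simple counting. The sketch below treats each unique combination of metric name and label set as one time series, which is how label-based metrics systems such as Prometheus identify series. The metric name, label names, and helper functions are made up for this example.

```python
METHODS = ["GET", "POST", "PUT"]
STATUSES = ["200", "404", "429", "500"]


def series_count(samples) -> int:
    """Count distinct time series: one per unique (metric name, label set)."""
    return len({(name, frozenset(labels.items())) for name, labels in samples})


def make_samples(num_users: int, include_user_id: bool) -> list:
    """Generate one request sample per user, optionally labeled with the user id."""
    samples = []
    for user in range(num_users):
        labels = {
            "method": METHODS[user % len(METHODS)],
            "status": STATUSES[user % len(STATUSES)],
        }
        if include_user_id:
            labels["user_id"] = str(user)  # unbounded label value
        samples.append(("http_requests_total", labels))
    return samples
```

For 1,000 users, `series_count(make_samples(1000, False))` is 12 (3 methods times 4 statuses), while `series_count(make_samples(1000, True))` is 1,000 and keeps growing with every new user. A catalog entry for this failure could assert an upper bound on series per metric.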

Why Catalogs Outperform Ad-Hoc Chaos

Ad-hoc chaos engineering takes a random approach: break things and see what happens. Failure catalogs take a systematic approach: enumerate every known failure mode and verify handling for each one.

Aspect	Chaos Engineering	Failure Catalogs
Coverage	Random/probabilistic	Systematic/complete
Reproducibility	Non-deterministic	Fully reproducible
Learning	Discover unknown unknowns	Verify known failure handling
Cost	Requires production-like infra	Can use simulators
Best for	Finding new failure modes	Ensuring coverage of known modes

According to research on chaos engineering effectiveness, organizations with structured hypothesis-driven approaches experience 68.9% higher success rates in identifying system vulnerabilities. Systems tested through systematic chaos experiments meet their recovery time objectives 82.4% of the time, compared to 57.3% for untested systems.

The ideal approach combines both: use chaos engineering to discover new failure modes, then add them to your catalog for systematic coverage going forward.

Building Your First Catalog

Incident Mining

Review your last 20 incidents and categorize them. Google’s SRE team uses standardized postmortem templates to capture root causes and triggers, enabling trend analysis that targets systemic improvements. For each incident, identify whether it was physical or semantic, which pillar it falls under (control-plane, dependency, configuration, observability), and what the specific failure mode was.

Most teams discover that 80% of their incidents fall into a handful of recurring categories. Those are the first entries in your catalog.
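The categorization step can be as simple as tagging each incident and counting. This is a minimal sketch, assuming incidents are recorded with the fields described above; the `Incident` class and `top_failure_modes` function are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass

PILLARS = {"control-plane", "dependency", "configuration", "observability"}


@dataclass(frozen=True)
class Incident:
    id: str
    kind: str          # "physical" or "semantic"
    pillar: str        # one of PILLARS
    failure_mode: str  # the specific failure, e.g. "connection exhaustion"


def top_failure_modes(incidents, n=3):
    """Return the `n` most frequent (pillar, failure_mode) pairs with their counts."""
    for incident in incidents:
        if incident.pillar not in PILLARS:
            raise ValueError(f"{incident.id}: unknown pillar {incident.pillar!r}")
    counts = Counter((i.pillar, i.failure_mode) for i in incidents)
    return counts.most_common(n)
```

The pairs at the top of the result are candidates for the first catalog entries.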

Stack Analysis

Map your technology stack to known failure modes. Kubernetes has well-documented failure patterns around control-plane throttling, scheduling deadlocks, and network policy issues. PostgreSQL has known issues with connection exhaustion, replication lag, and lock contention. Redis has failure modes around memory pressure, persistence, and cluster splits. Every technology in your stack has documented failure modes — the catalog collects them in one place.

Prioritize by Risk

Not all failures are equal. FMEA methodology suggests ranking by severity (how bad is it when it happens), frequency (how often does it happen), and detectability (how quickly can you find it). High severity, high frequency, low detectability failures get tested first.

Failure Priority Matrix (Example)
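One way to turn the three rankings into an ordering is the FMEA risk priority number: the product of severity, occurrence, and detection scores. The sketch below assumes a 1 to 10 scale on which a higher detection score means the failure is harder to detect. The failure modes and their scores are invented purely for illustration, not taken from real data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FailureMode:
    name: str
    severity: int              # 1 (minor) to 10 (catastrophic)
    frequency: int             # 1 (rare) to 10 (constant)
    detection_difficulty: int  # 1 (obvious) to 10 (nearly invisible)

    def __post_init__(self):
        for score in (self.severity, self.frequency, self.detection_difficulty):
            if not 1 <= score <= 10:
                raise ValueError(f"{self.name}: scores must be between 1 and 10")

    @property
    def risk_priority(self) -> int:
        """Risk priority number: severity x frequency x detection difficulty."""
        return self.severity * self.frequency * self.detection_difficulty


def prioritize(modes):
    """Order failure modes so the highest-risk entries are tested first."""
    return sorted(modes, key=lambda m: m.risk_priority, reverse=True)


# Hypothetical scores for illustration only.
EXAMPLE_MODES = [
    FailureMode("Node hardware failure", 7, 2, 2),
    FailureMode("Database connection exhaustion", 8, 7, 4),
    FailureMode("Silent rate limiting of critical requests", 8, 6, 9),
]
```

With these example scores, `prioritize(EXAMPLE_MODES)` puts silent rate limiting first (432), then connection exhaustion (224), then node hardware failure (28), which matches the intuition that hard-to-detect semantic failures deserve early attention.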

From Catalog to Action

A failure catalog isn’t documentation — it’s a testing framework. Each entry should specify how to simulate the failure, what behavior to expect from the system, and how to verify that behavior. Then you can run these scenarios in CI/CD, ensuring that every change is validated against your known failure modes.
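A catalog entry with those three parts might be sketched as follows. `CatalogEntry`, `run_catalog`, and the `fetch_profile` function under test are hypothetical and exist only to show the shape; a real catalog would plug in its own simulators and checks.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CatalogEntry:
    name: str
    simulate: Callable[[], object]    # injects the failure and returns the outcome
    expected: str                     # human-readable expected behavior
    verify: Callable[[object], bool]  # checks the outcome against the expectation


def run_catalog(entries) -> dict:
    """Run every entry; an entry passes only if verify() accepts the outcome."""
    results = {}
    for entry in entries:
        try:
            results[entry.name] = bool(entry.verify(entry.simulate()))
        except Exception:
            results[entry.name] = False  # an unhandled failure counts as a miss
    return results


def fetch_profile(user_id, cache_get, db_get):
    """Hypothetical system under test: fall back to the database if the cache fails."""
    try:
        return cache_get(user_id)
    except ConnectionError:
        return db_get(user_id)


def broken_cache(user_id):
    raise ConnectionError("cache unavailable")


ENTRIES = [
    CatalogEntry(
        name="cache-outage",
        simulate=lambda: fetch_profile("u1", broken_cache, lambda uid: {"id": uid}),
        expected="request is served from the database",
        verify=lambda outcome: outcome == {"id": "u1"},
    ),
]
```

In a CI job, a wrapper could fail the build whenever any value in `run_catalog(ENTRIES)` is false, so a change that removes the fallback is caught before it ships.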

The organizations that achieve high reliability aren’t the ones with the most redundant infrastructure. They’re the ones that systematically understand, test, and handle the failures that actually happen in production.

This is part 2 of our “Failure Is a Feature” series. Next: Semantic vs Physical Failures.