The Kubernetes Failure Catalog: What Actually Breaks in Production

A comprehensive taxonomy of Kubernetes failure modes based on real production incidents and patterns.

Kubernetes is complex. With that complexity comes a rich taxonomy of failure modes that show up in production every day. According to Komodor’s 2025 Enterprise Kubernetes Report, 38% of companies experience high-impact outages weekly, and 62% estimate major downtime costs at over $1 million per hour. Understanding these failure patterns is the foundation of building resilient systems.

At a glance: 38% of companies report high-impact outages weekly; 62% of enterprises estimate downtime costs at over $1M per hour; median time to detect is 40 minutes; median time to resolve is 50 minutes.

The Five Failure Domains

Kubernetes failures fall into five distinct domains, each with different symptoms, detection methods, and remediation approaches. Control plane failures affect the brain of the cluster. Workload failures happen where applications meet the platform. Configuration failures are the silent killers — valid YAML that does the wrong thing. Networking failures break connectivity in ways that are often intermittent and hard to diagnose. And storage failures bring unique challenges for stateful workloads.

The data above represents relative complexity, not frequency. Configuration failures happen most often but are usually straightforward to diagnose. Control plane and networking failures are rarer but can take hours to resolve when they occur.

Control Plane Failures

The control plane is the brain of Kubernetes. When it misbehaves, everything suffers — deployments stall, autoscaling stops, and operators lose visibility into cluster state.

API Server Failures

The API server handles every request to the cluster. When it’s overwhelmed, you see slow kubectl responses, deployment timeouts, and webhook failures. The root causes are usually too many LIST requests from controllers watching large resource sets, controller storms where multiple operators fight over the same resources, or aggressive client polling from monitoring tools.

Detection is straightforward: watch the apiserver_request_total{code="429"} metric. Any sustained 429 responses indicate throttling. For latency issues, track apiserver_request_duration_seconds at the 99th percentile — anything over one second indicates problems.
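As a rough illustration of the latency check, the sketch below estimates a quantile from cumulative histogram buckets by linear interpolation, which is similar in spirit to what Prometheus does for histogram metrics. The bucket values are made up for the example, and the function names are hypothetical; only the one-second threshold comes from the text above.

```python
from typing import List, Tuple

def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate quantile q from cumulative (upper_bound_seconds, count) buckets."""
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return float("nan")
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if bound == float("inf"):
                # Cannot interpolate into the +Inf bucket; report the last finite bound.
                return prev_bound
            in_bucket = count - prev_count
            if in_bucket == 0:
                return bound
            # Linear interpolation inside the bucket that contains the target rank.
            return prev_bound + (bound - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = bound, count
    return prev_bound

def is_p99_slow(buckets: List[Tuple[float, float]], threshold_s: float = 1.0) -> bool:
    """True when the estimated p99 latency exceeds the threshold."""
    return histogram_quantile(0.99, buckets) > threshold_s

# Illustrative (made-up) cumulative bucket counts for request durations.
example_buckets = [(0.1, 900), (1.0, 980), (5.0, 1000), (float("inf"), 1000)]
print(histogram_quantile(0.99, example_buckets), is_p99_slow(example_buckets))
```

In practice the same calculation would normally live in an alerting rule rather than in application code; the sketch only shows the arithmetic behind the threshold.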

API Server Latency Chain

API server latency often cascades from etcd. If etcd is slow due to disk I/O, network latency to peers, or compaction backlog, every API request suffers. Always check etcd metrics before blaming the API server.

Scheduler Failures

When pods stay stuck in Pending, the usual reason is that the scheduler can’t find a suitable node, and it records a FailedScheduling event saying why. This happens with impossible affinity rules (requiring pods to co-locate with pods that don’t exist), all nodes having taints that pods don’t tolerate, or resource fragmentation where total cluster resources are sufficient but no single node can fit the pod.
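Resource fragmentation is easy to see with a small worked example. The sketch below uses hypothetical node capacities and a hypothetical pod size; it only compares the aggregate check against the per-node check and ignores everything else the scheduler considers.

```python
from typing import List, Tuple

# (free CPU in millicores, free memory in MiB); values are illustrative only.
Resources = Tuple[int, int]

def fits_in_aggregate(pod: Resources, nodes: List[Resources]) -> bool:
    """True if the cluster-wide free totals cover the pod's request."""
    total_cpu = sum(cpu for cpu, _ in nodes)
    total_mem = sum(mem for _, mem in nodes)
    return total_cpu >= pod[0] and total_mem >= pod[1]

def fits_on_single_node(pod: Resources, nodes: List[Resources]) -> bool:
    """True if at least one node can hold the whole pod."""
    return any(cpu >= pod[0] and mem >= pod[1] for cpu, mem in nodes)

free_per_node = [(1000, 2048), (1000, 2048), (1000, 2048)]
pod_request = (2000, 1024)

# 3000m of CPU is free in total, but no node has 2000m free on its own.
print(fits_in_aggregate(pod_request, free_per_node))    # True
print(fits_on_single_node(pod_request, free_per_node))  # False
```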

The scheduler failure that catches teams off guard is the logical deadlock: Pod A requires Pod B, and Pod B requires Pod A. Neither can schedule first. The solution is to use soft affinity (preferredDuringSchedulingIgnoredDuringExecution) instead of hard requirements (requiredDuringSchedulingIgnoredDuringExecution).
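A soft affinity rule might look like the following, written here as a Python dictionary that mirrors the pod spec's affinity stanza. The helper name, the label, and the default weight are assumptions for illustration; the field names follow the Kubernetes pod affinity API.

```python
import json

def soft_pod_affinity(label_key: str, label_value: str, weight: int = 50) -> dict:
    """Build a preferred (soft) pod affinity stanza; weight must be 1 to 100."""
    if not 1 <= weight <= 100:
        raise ValueError("weight must be between 1 and 100")
    return {
        "podAffinity": {
            "preferredDuringSchedulingIgnoredDuringExecution": [
                {
                    "weight": weight,
                    "podAffinityTerm": {
                        "labelSelector": {"matchLabels": {label_key: label_value}},
                        # Prefer the same node as the matching pods.
                        "topologyKey": "kubernetes.io/hostname",
                    },
                }
            ]
        }
    }

# Hypothetical label: prefer nodes that already run pods labeled app=pod-b.
print(json.dumps(soft_pod_affinity("app", "pod-b"), indent=2))
```

Because the rule is only a preference, the scheduler can still place Pod A when Pod B does not exist yet, which breaks the circular wait.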

etcd Failures

etcd is the source of truth for cluster state. Leader elections cause temporary unavailability — watch etcd_server_leader_changes_seen_total and alert if you see more than five changes in ten minutes. During elections, writes may fail and reads may return stale data. This is usually self-healing, but frequent elections indicate network instability or disk I/O problems on etcd nodes.
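The alert condition above (more than five leader changes in ten minutes) can be sketched as a windowed increase over counter samples. The sample format and function names are assumptions for this example, and the sketch ignores counter resets, which a real monitoring system would handle.

```python
from typing import List, Tuple

def leader_changes_in_window(samples: List[Tuple[float, int]], now: float,
                             window_s: float = 600.0) -> int:
    """Increase of the leader-change counter over the last window_s seconds.

    samples: (unix_timestamp, counter_value) pairs sorted by time.
    """
    recent = [value for ts, value in samples if now - window_s <= ts <= now]
    if len(recent) < 2:
        return 0
    return max(0, recent[-1] - recent[0])

def should_alert(samples: List[Tuple[float, int]], now: float,
                 max_changes: int = 5) -> bool:
    """True when more than max_changes leader changes were seen in the window."""
    return leader_changes_in_window(samples, now) > max_changes

# Made-up samples: the counter goes from 3 to 10 within ten minutes.
samples = [(0.0, 3), (300.0, 5), (600.0, 10)]
print(leader_changes_in_window(samples, now=600.0), should_alert(samples, now=600.0))
```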

Workload Failures

Where applications meet the platform. These are the failures operators deal with daily.

CrashLoopBackOff

One of the most common Kubernetes issues, CrashLoopBackOff happens when a container repeatedly crashes and Kubernetes keeps trying to restart it. The backoff is exponential — 10 seconds, then 20, then 40, up to a maximum of 5 minutes between restart attempts.
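The backoff schedule described above is a doubling delay with a cap. A minimal sketch, with a hypothetical function name and the 10-second base and 5-minute cap taken from the text:

```python
from typing import List

def crashloop_delays(restarts: int, base_s: int = 10, cap_s: int = 300) -> List[int]:
    """Delay in seconds before each of the first `restarts` restart attempts."""
    return [min(base_s * 2 ** attempt, cap_s) for attempt in range(restarts)]

print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

After the fifth doubling the delay would be 320 seconds, so the cap takes over and every later attempt waits the full five minutes.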

| Cause | Symptoms | Fix |
|---|---|---|
| Application crash | Exit code 1, error in logs | Fix application bug or config |
| Missing secrets/configs | Immediate crash, mount errors | Create missing ConfigMap/Secret |
| Failing liveness probe | Pod killed mid-startup | Increase initialDelaySeconds |
| OOMKilled | Exit code 137, no app logs | Increase memory limits |
| Port conflicts | Bind error in logs | Change port or remove conflict |

The debugging sequence: first, check kubectl logs <pod> --previous to see what the container printed before crashing. Then kubectl describe pod <pod> to see events, exit codes, and whether it was OOMKilled. Exit code 137 means the container was killed by the system (usually OOM). Exit code 1 means the application crashed.
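Exit codes above 128 follow the shell convention of 128 plus the signal number, so 137 is 128 + 9 (SIGKILL). The helper below is a hypothetical triage aid built on that convention; the signal table lists only a few common Linux signal numbers.

```python
# Common Linux signal numbers; not an exhaustive table.
SIGNALS = {6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

def describe_exit_code(code: int) -> str:
    """Give a rough interpretation of a container exit code."""
    if code == 0:
        return "exited cleanly"
    if code > 128:
        signum = code - 128
        name = SIGNALS.get(signum, f"signal {signum}")
        hint = " (often the OOM killer)" if signum == 9 else ""
        return f"killed by {name}{hint}"
    return "application error"

for code in (0, 1, 137, 143):
    print(code, describe_exit_code(code))
```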

Pending Pods

Pods stuck in Pending have passed admission but aren’t running yet, usually because the scheduler can’t place them. Check the Events section in kubectl describe pod: the FailedScheduling event summarizes why nodes were rejected. Common causes include insufficient resources (no node has enough CPU/memory available), unsatisfiable node selectors (the required label doesn’t exist), and unbound PVCs (storage isn’t ready). Image pull failures happen after scheduling but also leave the pod in the Pending phase.

OOMKilled

The container exits with code 137, often with no application logs because the kernel killed it suddenly. The usual causes are memory limits that are too low for the workload, a memory leak accumulating over time, or an unexpected traffic spike that pushes memory usage past its normal level. The fix is profiling actual memory usage, setting appropriate limits, and implementing graceful degradation for traffic spikes.

Configuration Failures

The silent killers. These manifest as valid YAML that Kubernetes accepts but that causes operational problems.

Resource Specification Errors

Resources with requests higher than limits are invalid and rejected. But there are subtler problems: requests set too low let the scheduler overpack nodes, so pods are starved of CPU or evicted under load; limits set too low cause CPU throttling and OOMKilled containers; and no limits set at all allows noisy neighbors to starve other workloads.

The Right Resource Formula

Set requests to what you typically need — this is what Kubernetes uses for scheduling. Set limits with headroom for spikes — this is the kill threshold. A common pattern is limits at 2x requests for memory and no CPU limits (CPU throttling is usually worse than allowing bursts).
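One way to encode this pattern is a small helper that derives a container's resources stanza from observed typical usage. The function name and the input units (millicores and MiB) are assumptions for illustration; the 2x memory headroom and the absence of a CPU limit follow the pattern described above.

```python
def resources_for(typical_cpu_m: int, typical_mem_mi: int,
                  mem_headroom: float = 2.0) -> dict:
    """Build a resources stanza: requests at typical usage, memory limit with headroom."""
    return {
        "requests": {
            "cpu": f"{typical_cpu_m}m",
            "memory": f"{typical_mem_mi}Mi",
        },
        # Deliberately no CPU limit, so the container can burst instead of being throttled.
        "limits": {
            "memory": f"{int(typical_mem_mi * mem_headroom)}Mi",
        },
    }

print(resources_for(250, 512))
```

Treat the multiplier as a starting point to tune against measured usage, not a rule.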

Probe Misconfiguration

Probes cause more unexpected failures than almost any other configuration. The most common mistake is initialDelaySeconds set too short — the container gets killed during slow startup because Kubernetes thinks it’s unhealthy. The fix is using a startup probe that runs first, giving the application time to initialize before liveness checks begin.

Other probe problems include timeoutSeconds too short (causing intermittent failures under load), wrong endpoint (probe path doesn’t exist), and confusing liveness with readiness. Liveness failures restart the container. Readiness failures remove it from service endpoints. Using liveness when you meant readiness causes unnecessary restarts.
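The startup probe's time budget is failureThreshold multiplied by periodSeconds, so the threshold can be derived from the slowest startup you expect. The sketch below builds the three probe stanzas as Python dictionaries; the helper name, the /healthz and /ready paths, and the port are hypothetical, while the field names follow the Kubernetes probe spec.

```python
import math

def build_probes(max_startup_s: int, period_s: int = 10, port: int = 8080) -> dict:
    """Build startup, liveness, and readiness probes for an HTTP service."""
    # Startup budget = failureThreshold * periodSeconds, rounded up to cover max_startup_s.
    startup_failures = math.ceil(max_startup_s / period_s)
    return {
        "startupProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "periodSeconds": period_s,
            "failureThreshold": startup_failures,
        },
        # Liveness: a failure restarts the container. It only runs once startup succeeds.
        "livenessProbe": {
            "httpGet": {"path": "/healthz", "port": port},
            "periodSeconds": period_s,
            "timeoutSeconds": 3,
            "failureThreshold": 3,
        },
        # Readiness: a failure removes the pod from service endpoints without restarting it.
        "readinessProbe": {
            "httpGet": {"path": "/ready", "port": port},
            "periodSeconds": period_s,
            "timeoutSeconds": 3,
        },
    }

probes = build_probes(max_startup_s=120)
print(probes["startupProbe"]["failureThreshold"])  # 12
```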

Networking Failures

Where complexity meets connectivity. Network failures are often intermittent, making them hard to reproduce and diagnose.

Service Discovery Failures

Connection refused to a service usually means no endpoints exist. Check kubectl get endpoints <service> — if it’s empty, either the selector doesn’t match any pod labels, all matching pods are not Ready, or the pods are in a different namespace. Compare the service selector with actual pod labels carefully; a typo in a single label key causes complete service failure.
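The selector comparison can be sketched as a simple equality check: every key and value in the service selector must appear in the pod's labels. The function name and the labels are made up for the example, and the sketch treats an empty selector as matching nothing, since a Service without a selector does not select pods on its own.

```python
from typing import Dict

def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """True if every selector key/value pair is present in the pod's labels."""
    if not selector:
        return False
    return all(pod_labels.get(key) == value for key, value in selector.items())

service_selector = {"app": "web"}
print(selector_matches(service_selector, {"app": "web", "tier": "frontend"}))  # True
# A typo in the label key ("aap") means the pod never becomes an endpoint.
print(selector_matches(service_selector, {"aap": "web", "tier": "frontend"}))  # False
```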

DNS Resolution Failures

“Name or service not known” errors point to DNS problems. CoreDNS might be overloaded, especially in large clusters with many services. The ndots configuration matters: with the default of 5, any name containing fewer than five dots is tried against each search-domain suffix before it is queried as written, which multiplies the load on DNS. Network policies might also be blocking UDP/TCP port 53 to CoreDNS.

Debug with kubectl exec <pod> -- nslookup <service> to verify resolution works from inside the cluster. Check CoreDNS logs for errors or high latency.
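To see why ndots amplifies DNS traffic, the sketch below lists the names a resolver would try for a given lookup, following the usual resolv.conf search-list behavior. The function name is hypothetical, and the search list assumes a pod in the default namespace with the cluster.local cluster domain.

```python
from typing import List

def dns_candidates(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Names a resolver would try, in order, for `name`."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified; no search list is applied.
        return [name]
    absolute = name + "."
    with_search = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        return [absolute] + with_search
    # Fewer dots than ndots: every search suffix is tried before the name as written.
    return with_search + [absolute]

search_list = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for candidate in dns_candidates("api.example.com", search_list):
    print(candidate)
```

With these assumptions, a lookup for an external name such as api.example.com produces three failed in-cluster queries before the real one. Adding a trailing dot to the name, or lowering ndots in the pod's dnsConfig, avoids the extra lookups.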

Ingress Failures

Ingress problems manifest as HTTP errors. 502 Bad Gateway means the backend is unhealthy or unreachable — check that backend pods are Ready and that the service endpoints exist. SSL handshake failures indicate missing or expired TLS secrets. Traffic routing to the wrong backend happens with path conflicts or annotation typos.

Storage Failures

Stateful workloads bring stateful problems. Storage failures often involve external systems that Kubernetes can’t directly observe.

PVC Binding Failures

PVCs stuck in Pending mean the storage provisioner can’t satisfy the request. Common causes include requesting a StorageClass that doesn’t exist, requesting more storage than the provisioner can allocate, zone mismatches (the volume exists in a different availability zone than the node), and quota exhaustion.

Check kubectl describe pvc <name> for events from the provisioner. Verify the StorageClass exists and that the provisioner pod is running.

Real Incident Patterns

Understanding how failures cascade is as important as understanding individual failure modes.

Cascading Scheduler Failure

A single node goes into maintenance. Pods reschedule to remaining nodes. This increases resource pressure on those nodes. New deployments start failing with Pending pods because there’s no room. Monitoring pods can’t schedule either. Visibility degrades just as the incident escalates. The whole sequence from maintenance start to visibility loss can happen in under 30 minutes.

The prevention is headroom. Don’t run clusters at 80%+ resource utilization. Reserve capacity for rescheduling during maintenance and incidents.
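A quick way to reason about headroom is to ask whether the remaining nodes could absorb all requested resources if any one node disappeared. The sketch below does that check for a single resource in aggregate; the function name and the node figures are hypothetical, and the check ignores fragmentation, so a passing result is necessary but not sufficient.

```python
from typing import List, Tuple

def can_absorb_node_loss(nodes: List[Tuple[int, int]]) -> bool:
    """nodes: (allocatable_cpu_m, requested_cpu_m) per node.

    True if total requests still fit after removing any single node.
    """
    total_requested = sum(requested for _, requested in nodes)
    for lost in range(len(nodes)):
        remaining = sum(alloc for i, (alloc, _) in enumerate(nodes) if i != lost)
        if total_requested > remaining:
            return False
    return True

# Three 4000m nodes at 85% requested: losing one leaves 8000m for 10200m of requests.
print(can_absorb_node_loss([(4000, 3400)] * 3))  # False
# The same nodes at 60% requested: 7200m of requests fit in the remaining 8000m.
print(can_absorb_node_loss([(4000, 2400)] * 3))  # True
```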

Certificate Expiration Cascade

A webhook serving certificate expires. All mutating webhook calls fail because TLS handshakes fail. All pod creations fail because they can’t pass admission. Rollouts fail, autoscaling stops, and the cluster is frozen. This takes 30 minutes to notice and another 30 to remediate because renewing certificates under pressure is error-prone.

Prevent Certificate Cascades

Use cert-manager or similar tools to automate certificate renewal. Alert when certificates are within 30 days of expiration. Never let a webhook certificate become a single point of failure.
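The 30-day alert threshold reduces to a date comparison once you have the certificate's expiry time. The sketch below assumes the notAfter timestamp has already been read from the certificate by other means; the function name and dates are made up for the example.

```python
from datetime import datetime, timedelta, timezone

def needs_renewal_alert(not_after: datetime, now: datetime, warn_days: int = 30) -> bool:
    """True when the certificate expires within warn_days (or has already expired)."""
    return not_after - now <= timedelta(days=warn_days)

now = datetime(2025, 1, 1, tzinfo=timezone.utc)
print(needs_renewal_alert(datetime(2025, 1, 20, tzinfo=timezone.utc), now))  # True
print(needs_renewal_alert(datetime(2025, 3, 1, tzinfo=timezone.utc), now))   # False
```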

Building Resilience

For each failure type, build four layers of defense. First, detection — how do you know it’s happening? Metrics, logs, and synthetic probes all play a role. Second, alerting — when should humans be notified? Not every blip deserves a page. Third, runbooks — what are the remediation steps? Document them before the incident. Fourth, automation — can you fix it without human intervention? Self-healing isn’t just a buzzword; it’s the only way to handle failures at scale.

Over 60% of ops teams’ time goes to troubleshooting, and only 20% of incidents are resolved without escalation. Building systematic failure intelligence is how teams escape the firefighting cycle.

This is part 1 of our “Kubernetes Failure Catalog” series. Next: Simulating Control-Plane Failures.