Dependency & Network Failure Intelligence
Why Network Semantics Matter More Than Packet Loss
Understanding semantic network failures like DNS issues, certificate expiry, and configuration errors that cause real outages.
When network engineers think about failures, they think packets: dropped packets, corrupted packets, delayed packets. When application engineers experience network failures, they see something different: “connection refused,” “no such host,” “certificate expired.”
This is the semantic layer of networking. And it’s where the real problems live.
Physical vs. Semantic Network Failures
Physical network failures — cable cuts, NIC failures, switch crashes, BGP misconfigurations — affect all traffic equally. When a link goes down, everything on that path stops.
Semantic network failures are different. DNS resolution fails, but only for specific domains. TLS handshakes fail, but only for services with expired certificates. Connection refused errors appear, but only for services that aren’t running. Network policies block traffic, but only between specific pod pairs.
The difference matters for how you test, monitor, and build resilience.
| Failure Type | Physical Failures | Semantic Failures |
|---|---|---|
| Scope | All traffic on path | Specific connections only |
| Detection | Network monitoring tools | Application-level observability |
| Examples | Cable cut, switch crash | DNS failure, cert expiry, selector mismatch |
| Chaos testing | tc netem packet loss | Requires semantic fault injection |
| Recovery | Hardware/routing fix | Configuration/cert/DNS fix |
Why Packet Loss Testing Is Overrated
Chaos engineering loves packet loss. It’s easy to simulate with a single command. But modern networks rarely experience significant packet loss — well-designed infrastructure runs at less than 0.01% loss, and TCP automatically retransmits.
The Chaos Testing Gap
Simulating 10% packet loss might cause slight latency increases and some retries. Your app probably handles it fine. But DNS resolution failing? Certificate expired? Service selector mismatch? Those cause 100% failure with immediate impact — and most chaos tests never touch them.
According to ThousandEyes’ 2024 analysis, configuration changes were behind many of the year’s major outages. The balance between cloud service provider (CSP) outages and ISP outages also shifted: CSP outages climbed from 17% to 27% of all outages.
Root Causes of Network Incidents
DNS Failures: The Silent Killer
DNS is the foundation everything else builds on. When it breaks, nothing works — and the errors often look like something else entirely.
AWS October 2025 DNS Outage
On October 20, 2025, AWS experienced a major disruption in US-EAST-1 stemming from a race condition in DynamoDB’s automated DNS management system. DNS records were incorrectly deleted, causing cascading failures across more than 70 AWS services for approximately 15 hours. Downdetector recorded over 6.5 million user reports globally.
According to DNS security research, 82% of businesses have suffered application outages as a direct result of DNS intrusions, with nearly 90% of organizations experiencing DNS attacks in the past year. The average cost of each attack: $950,000.
How DNS Breaks
Resolution failure: The DNS server returns NXDOMAIN (domain doesn’t exist) or SERVFAIL (server error). Complete service unavailability.
Stale cache: DNS cache wasn’t refreshed after a migration. Traffic routes to old, dead endpoints until the cache TTL expires.
TTL problems: TTL set too short combined with DNS latency causes intermittent resolution failures as caches constantly expire and re-query.
The Kubernetes ndots Problem
Kubernetes sets ndots:5 by default in pod DNS configuration. This means any domain with fewer than 5 dots gets the search domains appended before trying the absolute name.
When your pod tries to resolve api.external.com, it actually makes these queries in order:
1. api.external.com.my-namespace.svc.cluster.local (fail)
2. api.external.com.svc.cluster.local (fail)
3. api.external.com.cluster.local (fail)
4. api.external.com.us-east-1.compute.internal (fail)
5. api.external.com (success!)
That’s 5 DNS queries instead of 1 for every external domain lookup. According to Kubernetes DNS optimization guides, this can cause massive latency — once latency between pods and CoreDNS reaches one second or more, it impacts customer SLAs.
Fix the ndots Problem
Three approaches, in order of preference:
1. Enable NodeLocal DNSCache on your cluster.
2. Use fully qualified domain names with a trailing dot (api.external.com.).
3. Reduce ndots to 2 for specific pods that make heavy external calls.
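To see why the trailing dot and a lower ndots value help, here is a rough Python sketch of how a stub resolver orders its queries. It is a simplified model, not the resolver's actual code: the function name `candidate_queries` and the `SEARCH` list are illustrative, with the search domains taken from the example above.

```python
from typing import List

# Search domains from the example above (illustrative values).
SEARCH = [
    "my-namespace.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "us-east-1.compute.internal",
]


def candidate_queries(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Return the names a stub resolver would try, in order (simplified model)."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: the search list is skipped.
        return [name]
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search]
    if name.count(".") >= ndots:
        # Enough dots: the absolute name is tried first.
        return [absolute] + expanded
    # Too few dots: every search domain is tried before the absolute name.
    return expanded + [absolute]


if __name__ == "__main__":
    for query in candidate_queries("api.external.com", SEARCH):
        print(query)
```

With the default ndots:5, the real name is the fifth candidate. With a trailing dot it is the only candidate, and with ndots set to 2 it moves to the front of the list, so the lookup succeeds on the first query.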
Certificate Expiry: Surprisingly Common
You’d think certificate management would be a solved problem by now. It isn’t.
According to Keyfactor’s 2024 PKI Report, organizations experienced an average of three outages caused by expired certificates over a 24-month period. It takes 2.6 hours to identify the root cause and another 2.7 hours to remediate.
High-profile examples include Google Voice going down for 4+ hours due to an expired TLS certificate, Microsoft Teams leaving 20 million users without access for 3 hours from an expired auth certificate, and SpaceX’s Starlink going down globally due to an “expired ground station certificate.”
The 47-Day Future
Maximum TLS certificate validity is set to shrink to just 47 days by 2029, under an Apple-sponsored CA/Browser Forum ballot approved in 2025. If your organization struggles with annual certificate renewals, you need automated certificate lifecycle management; manual processes won’t scale.
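A first step toward automation is simply knowing how many days each certificate has left. The sketch below uses only Python's standard `ssl` and `socket` modules. The helper names and the idea of a warning threshold are assumptions for illustration, not part of any particular certificate management product.

```python
import socket
import ssl
import time
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[float] = None) -> float:
    """Days left before a certificate's notAfter time (negative if already expired).

    `not_after` uses the format returned by getpeercert(),
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires_at = ssl.cert_time_to_seconds(not_after)
    current = time.time() if now is None else now
    return (expires_at - current) / 86400


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Connect to host:port and return the server certificate's notAfter string.

    The default context verifies the certificate, so an already expired
    certificate raises ssl.SSLCertVerificationError instead of returning.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

A scheduled job could call `days_until_expiry(fetch_not_after("example.com"))` for each endpoint and alert when the result drops below a chosen threshold. With 47-day certificates that threshold would need to be far tighter than the usual 30 days.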
Selector Mismatches: The Configuration Trap
In Kubernetes, services find pods via label selectors. When they don’t match, the network “works” but nothing connects.
The scenario: You deploy a new version and update the pod labels to version: v2. But you forget to update the service selector, which still looks for version: v1. The service exists. The pods exist. But the endpoints list is empty. Every connection fails with “no endpoints available.”
This is invisible to traditional network monitoring. The network is fine. DNS resolves. Ports are open. But the semantic layer — the mapping from service name to actual pod IPs — is broken.
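The matching rule itself is simple: a pod backs a service only if every key and value in the service's selector appears in the pod's labels. Here is a minimal Python sketch of that rule applied to the scenario above. The pod names and the `app: web` label are made up for illustration, and the sketch models equality-based selectors only.

```python
from typing import Dict, List

Labels = Dict[str, str]


def selector_matches(selector: Labels, labels: Labels) -> bool:
    """True if every selector key/value pair is present in the pod's labels."""
    return all(labels.get(key) == value for key, value in selector.items())


def matching_pods(selector: Labels, pods: Dict[str, Labels]) -> List[str]:
    """Names of the pods that a service with this selector would route to."""
    return [name for name, labels in pods.items() if selector_matches(selector, labels)]


# Hypothetical state after the deploy: pods relabeled to v2, service still on v1.
PODS = {
    "web-6c9f-abcde": {"app": "web", "version": "v2"},
    "web-6c9f-fghij": {"app": "web", "version": "v2"},
}
STALE_SELECTOR = {"app": "web", "version": "v1"}
FIXED_SELECTOR = {"app": "web", "version": "v2"}

if __name__ == "__main__":
    print(matching_pods(STALE_SELECTOR, PODS))  # empty list: no endpoints
    print(matching_pods(FIXED_SELECTOR, PODS))  # both pods
```

A check like this can run in CI against rendered manifests, catching an empty match before the deploy rather than after.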
| Symptom | Possible Semantic Cause | Debug Command |
|---|---|---|
| Connection refused | No pods matching selector | kubectl get endpoints <service> |
| 502 Bad Gateway | Backend pods not ready | kubectl get pods -l <selector> |
| SSL certificate error | Cert doesn't match hostname | openssl s_client -connect host:443 |
| Intermittent failures | Some pods unhealthy | kubectl describe endpoints |
Network Policy Blocking
A default-deny network policy in Kubernetes blocks all traffic unless explicitly allowed. Everything looks correct (pods running, services configured, DNS resolving), but connections time out or get refused.
The symptom is indistinguishable from a dozen other problems. The only way to debug is checking network policies, which most application developers never think to look at.
Building Semantic Network Resilience
DNS Resilience
Cache DNS results locally where possible. Monitor DNS resolution time as a key metric. Alert on NXDOMAIN responses for services that should exist. Consider NodeLocal DNSCache in Kubernetes to reduce CoreDNS bottlenecks.
Connection Resilience
Different failure modes need different handling. Connection refused means the service isn’t listening — don’t hammer it with retries. DNS errors might be transient — wait and retry. Timeouts could be network congestion — use exponential backoff.
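One way to encode that guidance is a small function that maps the exception type to a wait time. The following Python sketch is one possible policy, not a recommendation: the constants `DNS_RETRY_DELAY` and `MAX_DNS_RETRIES` and the backoff parameters are assumed values chosen for illustration.

```python
import socket
from typing import Optional

DNS_RETRY_DELAY = 2.0   # assumed fixed wait before re-resolving, in seconds
MAX_DNS_RETRIES = 3     # assumed cap on DNS retries


def retry_delay(exc: BaseException, attempt: int,
                base: float = 0.5, cap: float = 30.0) -> Optional[float]:
    """Seconds to wait before retrying (attempt is 0-based), or None to give up."""
    if isinstance(exc, ConnectionRefusedError):
        # Nothing is listening on the port: retrying immediately will not help.
        return None
    if isinstance(exc, socket.gaierror):
        # DNS errors may be transient: wait a fixed interval, a limited number of times.
        return DNS_RETRY_DELAY if attempt < MAX_DNS_RETRIES else None
    if isinstance(exc, (TimeoutError, socket.timeout)):
        # Possible congestion: exponential backoff, capped.
        return min(cap, base * 2 ** attempt)
    # Anything else is left to the caller.
    return None
```

A caller would wrap its connection attempt in a loop, sleep for the returned delay, and stop when the function returns `None`. Adding random jitter to the backoff is a common refinement that this sketch leaves out to keep the behavior easy to read.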
Observability
The metrics that matter for semantic failures:
DNS: Resolution duration, failure counts by error type, cache hit ratio
Connections: Connection refused count, timeout count, reset count — broken down by destination
Kubernetes: Endpoint count per service (zero is bad), network policy denied connections
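As a rough illustration of the DNS metrics, the Python sketch below times a lookup and tallies outcomes by error type. The outcome labels and the injectable `resolver` parameter are assumptions made for this example. Note that `getaddrinfo` only exposes coarse error codes, so the mapping to NXDOMAIN or SERVFAIL is approximate.

```python
import socket
import time
from collections import Counter
from typing import Callable, Tuple

# Coarse mapping from getaddrinfo error codes to metric labels (approximate).
OUTCOME_LABELS = {
    socket.EAI_NONAME: "name_not_found",    # roughly NXDOMAIN
    socket.EAI_AGAIN: "temporary_failure",  # often SERVFAIL or a resolver timeout
}


def timed_resolve(host: str,
                  resolver: Callable = socket.getaddrinfo) -> Tuple[str, float]:
    """Resolve host and return (outcome label, duration in seconds)."""
    start = time.perf_counter()
    try:
        resolver(host, None)
        outcome = "ok"
    except socket.gaierror as exc:
        outcome = OUTCOME_LABELS.get(exc.errno, "other_error")
    return outcome, time.perf_counter() - start


def record(counts: Counter, host: str,
           resolver: Callable = socket.getaddrinfo) -> float:
    """Resolve host, increment the counter for (host, outcome), return the duration."""
    outcome, duration = timed_resolve(host, resolver)
    counts[(host, outcome)] += 1
    return duration
```

In a real service the counter and duration would feed whatever metrics library is already in use. The point is the shape of the data: duration per lookup, and failures labeled by type and destination.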
Why This Matters for AIOps
Traditional monitoring sees: “connection failed.”
Semantic understanding sees the difference between DNS resolution failure (check CoreDNS), connection refused (check if pods are running), TLS error (check certificate expiry), and empty endpoints (check label selectors).
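A toy version of that distinction can be sketched in Python as a rule table that maps error text to a category and a first thing to check. The message fragments below are common examples, not an exhaustive or authoritative list, and the `diagnose` function and its rule table are hypothetical.

```python
from typing import List, Tuple

# (message fragment, category, first thing to check); illustrative only.
RULES: List[Tuple[str, str, str]] = [
    ("no such host", "dns", "check CoreDNS"),
    ("name or service not known", "dns", "check CoreDNS"),
    ("no endpoints available", "empty_endpoints", "check label selectors"),
    ("certificate has expired", "tls", "check certificate expiry"),
    ("connection refused", "refused", "check if pods are running"),
]


def diagnose(message: str) -> Tuple[str, str]:
    """Map a raw error message to (category, suggested first check)."""
    lowered = message.lower()
    for fragment, category, next_step in RULES:
        if fragment in lowered:
            return category, next_step
    return "unknown", "no rule matched"


if __name__ == "__main__":
    print(diagnose("dial tcp: lookup api.internal: no such host"))
    print(diagnose("connect: connection refused"))
```

String matching like this is brittle, and a production system would work from structured error codes where they exist. Even so, the rule table makes the core idea concrete: the same “connection failed” alert splits into different causes with different owners.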
This is the intelligence layer that transforms alert noise into actionable diagnosis — understanding not just that something failed, but why it failed and what to fix.
References
- ThousandEyes: 2024 Outage Trends
- AWS October 2025 DNS Outage Analysis
- DNS Security Statistics 2025
- Kubernetes DNS ndots Performance Impact
- Keyfactor 2024 PKI and Digital Trust Report
- Parametrix: 2024 Cloud Outage Report
This is part 2 of our “Dependency & Network Failure Intelligence” series.