Why Network Semantics Matter More Than Packet Loss

When network engineers think about failures, they think packets: dropped packets, corrupted packets, delayed packets. When application engineers experience network failures, they see something different: “connection refused,” “no such host,” “certificate expired.”

This is the semantic layer of networking. And it’s where the real problems live.

Metric	Value	Context
Human Error	68%	Cause of cloud outages (2024)
DNS Attack Impact	82%	Businesses hit by DNS outages
Cert Outages	72%	Orgs hit in past year
Cloud Outages	+166%	Global increase in 2024

Physical vs. Semantic Network Failures

Physical network failures — cable cuts, NIC failures, switch crashes, BGP misconfigurations — affect all traffic equally. When a link goes down, everything on that path stops.

Semantic network failures are different. DNS resolution fails, but only for specific domains. TLS handshakes fail, but only for services with expired certificates. Connection refused errors appear, but only for services that aren’t running. Network policies block traffic, but only between specific pod pairs.

The difference matters for how you test, monitor, and build resilience.

Aspect	Physical Failures	Semantic Failures
Scope	All traffic on path	Specific connections only
Detection	Network monitoring tools	Application-level observability
Examples	Cable cut, switch crash	DNS failure, cert expiry, selector mismatch
Chaos testing	tc netem packet loss	Requires semantic fault injection
Recovery	Hardware/routing fix	Configuration/cert/DNS fix

Why Packet Loss Testing Is Overrated

Chaos engineering loves packet loss. It’s easy to simulate with a single command. But modern networks rarely experience significant packet loss — well-designed infrastructure runs at less than 0.01% loss, and TCP automatically retransmits.

The Chaos Testing Gap

Simulating 10% packet loss might cause slight latency increases and some retries. Your app probably handles it fine. But DNS resolution failing? Certificate expired? Service selector mismatch? Those cause 100% failure with immediate impact — and most chaos tests never touch them.

According to ThousandEyes’ 2024 analysis, configuration changes were behind many of the year’s major outages. The ratio of cloud service provider (CSP) outages to ISP outages also shifted dramatically: CSP outages climbed from 17% to 27% of all outages.

Root Causes of Network Incidents

DNS Failures: The Silent Killer

DNS is the foundation everything else builds on. When it breaks, nothing works — and the errors often look like something else entirely.

AWS October 2025 DNS Outage

On October 20, 2025, AWS experienced a major disruption in US-EAST-1 stemming from a race condition in DynamoDB’s automated DNS management system. DNS records were incorrectly deleted, causing cascading failures across more than 70 AWS services for approximately 15 hours. This is a canonical example of a cloud control-plane failure — the management layer itself becoming the dependency that fails.

According to DNS security research, 82% of businesses have suffered application outages as a direct result of DNS attacks, and nearly 90% of organizations experienced a DNS attack in the past year. The average cost of each attack: $950,000.

How DNS Breaks

Resolution failure: The DNS server returns NXDOMAIN (domain doesn’t exist) or SERVFAIL (server error). Complete service unavailability.

Stale cache: The DNS cache wasn’t refreshed after a migration. Traffic keeps routing to old, dead endpoints until the cached record’s TTL expires.

TTL problems: TTL set too short combined with DNS latency causes intermittent resolution failures as caches constantly expire and re-query.

The Kubernetes ndots Problem

Kubernetes sets ndots:5 by default in pod DNS configuration. This means any domain with fewer than 5 dots gets the search domains appended before trying the absolute name.

When your pod tries to resolve api.external.com, it actually makes these queries in order:

api.external.com.my-namespace.svc.cluster.local (fail)
api.external.com.svc.cluster.local (fail)
api.external.com.cluster.local (fail)
api.external.com.us-east-1.compute.internal (fail)
api.external.com (success!)

That’s 5 DNS queries instead of 1 for every external domain lookup. According to Kubernetes DNS optimization guides, this can cause massive latency — once latency between pods and CoreDNS reaches one second or more, it impacts customer SLAs.
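The expansion above follows the resolver's search-list rule, which fits in a few lines of Python. This is a simplified model for illustration, not the actual glibc or musl resolver; `candidate_queries` and `SEARCH` are hypothetical names, and the search domains are the ones from the example above.

```python
from typing import List


def candidate_queries(name: str, search: List[str], ndots: int = 5) -> List[str]:
    """Return the names a stub resolver would try, in order (simplified model)."""
    # A trailing dot marks the name as fully qualified: no search list is applied.
    if name.endswith("."):
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search]
    # Names with at least `ndots` dots are tried as-is first, then with suffixes.
    if name.count(".") >= ndots:
        return [name] + expanded
    # Otherwise the search domains come first and the bare name is the last resort.
    return expanded + [name]


# Search list assumed from the example above; real values depend on the cluster.
SEARCH = [
    "my-namespace.svc.cluster.local",
    "svc.cluster.local",
    "cluster.local",
    "us-east-1.compute.internal",
]

for query in candidate_queries("api.external.com", SEARCH):
    print(query)
```

Calling `candidate_queries("api.external.com.", SEARCH)` returns a single name, and with `ndots=2` the bare name moves to the front of the list. Since a resolver stops at the first successful answer, both changes remove the wasted lookups.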

Fix the ndots Problem

Three approaches, in order of preference: Enable NodeLocal DNSCache on your cluster. Use fully qualified domain names with a trailing dot (api.external.com.). Reduce ndots to 2 for specific pods that make heavy external calls.

Certificate Expiry: Surprisingly Common

You’d think certificate management would be a solved problem by now. It isn’t.

Metric	Value	Context
Cert Outages	72%	Orgs hit in past year
Avg Cost	$2.86M	Per certificate outage
Recovery Time	5.3 hrs	To identify + fix

According to Keyfactor’s 2024 PKI Report, organizations experienced an average of three outages caused by expired certificates over a 24-month period. It takes 2.6 hours to identify the root cause and another 2.7 hours to remediate.

High-profile examples include Google Voice going down for 4+ hours due to an expired TLS certificate, Microsoft Teams leaving 20 million users without access for 3 hours from an expired auth certificate, and SpaceX’s Starlink going down globally due to an “expired ground station certificate.”

The 47-Day Future

Acting on a proposal from Apple, the CA/Browser Forum has approved a schedule that shortens maximum TLS certificate validity to just 47 days by 2029. If your organization struggles with annual certificate renewals, you need automated certificate lifecycle management, because manual processes won’t scale.
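A first step toward automation is simply knowing how many days each certificate has left. The sketch below uses only the Python standard library; the function names and the 14-day threshold are assumptions for illustration, not a recommendation from any of the reports cited here.

```python
import socket
import ssl
from datetime import datetime, timezone
from typing import Optional


def days_until_expiry(not_after: str, now: Optional[datetime] = None) -> int:
    """Whole days until a certificate's notAfter time (negative once expired)."""
    # ssl.cert_time_to_seconds parses the "Jan  1 00:00:00 2030 GMT" format
    # returned by getpeercert() into seconds since the epoch (UTC).
    expires = datetime.fromtimestamp(ssl.cert_time_to_seconds(not_after), tz=timezone.utc)
    now = now or datetime.now(timezone.utc)
    return (expires - now).days


def needs_renewal(not_after: str, threshold_days: int = 14,
                  now: Optional[datetime] = None) -> bool:
    """True when the certificate is inside the renewal window (assumed 14 days)."""
    return days_until_expiry(not_after, now) <= threshold_days


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the certificate served at host:port."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        # The handshake raises ssl.SSLCertVerificationError if the certificate
        # has already expired, so this path only works as an early warning.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]
```

Running `needs_renewal(fetch_not_after("example.com"))` on a schedule for every endpoint you serve would be one way to turn expiry into an alert instead of an outage.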

Selector Mismatches: The Configuration Trap

In Kubernetes, services find pods via label selectors. When they don’t match, the network “works” but nothing connects.

The scenario: You deploy a new version and update the pod labels to version: v2. But you forget to update the service selector, which still looks for version: v1. The service exists. The pods exist. But the endpoints list is empty. Every connection fails with “no endpoints available.”

This is invisible to traditional network monitoring. The network is fine. DNS resolves. Ports are open. But the semantic layer — the mapping from service name to actual pod IPs — is broken.
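The matching rule behind this failure is small enough to model directly. The sketch below mimics equality-based selector matching for the scenario above; the pod names and the `app: checkout` label are hypothetical, and a real check would read selectors and labels from the Kubernetes API.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], pod_labels: Dict[str, str]) -> bool:
    """True if every key/value pair in the Service selector is on the pod."""
    return all(pod_labels.get(key) == value for key, value in selector.items())


def matching_pods(selector: Dict[str, str], pods: Dict[str, Dict[str, str]]) -> List[str]:
    """Names of the pods a Service with this selector would route to."""
    # Simplification: an empty selector is treated as selecting nothing, since
    # Services without a selector get no automatically managed endpoints.
    if not selector:
        return []
    return [name for name, labels in pods.items() if selector_matches(selector, labels)]


# The Service still selects v1, but the new pods are labeled v2.
service_selector = {"app": "checkout", "version": "v1"}
pods = {
    "checkout-abcde": {"app": "checkout", "version": "v2"},
    "checkout-fghij": {"app": "checkout", "version": "v2"},
}

print(matching_pods(service_selector, pods))  # prints []
```

A check like this in a deploy pipeline, run against the rendered manifests, is one possible way to catch an empty endpoints list before traffic does.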

Symptom	Possible Semantic Cause	Debug Command
Connection refused	No pods matching selector	kubectl get endpoints <service>
502 Bad Gateway	Backend pods not ready	kubectl get pods -l <selector>
SSL certificate error	Cert doesn't match hostname	openssl s_client -connect host:443
Intermittent failures	Some pods unhealthy	kubectl describe endpoints <service>

Network Policy Blocking

A default-deny network policy in Kubernetes blocks all traffic unless explicitly allowed. Everything looks correct (pods running, services configured, DNS resolving), but connections time out or get refused.

The symptom is indistinguishable from a dozen other problems. The only way to debug is checking network policies, which most application developers never think to look at.

Building Semantic Network Resilience

DNS Resilience

Cache DNS results locally where possible. Monitor DNS resolution time as a key metric. Alert on NXDOMAIN responses for services that should exist. Consider NodeLocal DNSCache in Kubernetes to reduce CoreDNS bottlenecks.
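Measuring resolution time and failure type from the application's side needs very little code. The following is a minimal sketch using the standard library; `timed_resolve` and its outcome labels are hypothetical, and the mapping from `getaddrinfo` error codes to DNS conditions is approximate and depends on the platform's resolver.

```python
import socket
import time
from typing import Callable, Tuple


def timed_resolve(host: str, resolver: Callable = socket.getaddrinfo) -> Tuple[float, str]:
    """Resolve `host` and return (seconds elapsed, outcome label)."""
    start = time.perf_counter()
    try:
        resolver(host, None)
        outcome = "ok"
    except socket.gaierror as err:
        # EAI_NONAME: the name did not resolve (typically NXDOMAIN).
        # EAI_AGAIN: temporary resolver failure (often a timeout or SERVFAIL).
        if err.errno == socket.EAI_NONAME:
            outcome = "not_found"
        elif err.errno == socket.EAI_AGAIN:
            outcome = "temporary_failure"
        else:
            outcome = "other_error"
    return time.perf_counter() - start, outcome
```

The `resolver` parameter exists so the function can be exercised without touching the network. In production you would feed the duration into a histogram and the outcome label into a counter, then alert on `not_found` for names that should exist.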

Connection Resilience

Different failure modes need different handling. Connection refused means the service isn’t listening — don’t hammer it with retries. DNS errors might be transient — wait and retry. Timeouts could be network congestion — use exponential backoff.
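One way to encode that policy is a single function that decides how long to wait based on the exception type. This is a sketch under stated assumptions: `retry_delay` is a hypothetical helper, the delay values are placeholders, and "don't hammer it" is interpreted here as "stop retrying and surface the error".

```python
import socket
from typing import Optional


def retry_delay(error: Exception, attempt: int, base: float = 0.5,
                cap: float = 30.0, dns_wait: float = 1.0) -> Optional[float]:
    """Seconds to wait before the next attempt, or None to stop retrying."""
    # Connection refused: nothing is listening, so retrying immediately is noise.
    if isinstance(error, ConnectionRefusedError):
        return None
    # DNS errors might be transient: wait a fixed interval and try again.
    if isinstance(error, socket.gaierror):
        return dns_wait
    # Timeouts may indicate congestion: back off exponentially, up to a cap.
    if isinstance(error, (TimeoutError, socket.timeout)):
        return min(cap, base * 2 ** attempt)
    # Anything else is not classified here; let the caller decide.
    return None
```

For example, `retry_delay(TimeoutError(), attempt=3)` returns 4.0 with the defaults above. A production version would likely add jitter and a maximum attempt count.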

Observability

The metrics that matter for semantic failures:

DNS: Resolution duration, failure counts by error type, cache hit ratio

Connections: Connection refused count, timeout count, reset count — broken down by destination

Kubernetes: Endpoint count per service (zero is bad), network policy denied connections

Why This Matters for AIOps

Traditional monitoring sees: “connection failed.”

Semantic understanding sees the difference between DNS resolution failure (check CoreDNS), connection refused (check if pods are running), TLS error (check certificate expiry), and empty endpoints (check label selectors).
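As a rough illustration of that mapping, the sketch below turns a Python exception plus an optional endpoint count into a first place to look. The `diagnose` function and its message strings are hypothetical; a real system would work from richer signals than a single exception.

```python
import socket
import ssl
from typing import Optional


def diagnose(error: Exception, endpoint_count: Optional[int] = None) -> str:
    """Map a failure to a first diagnostic step (illustrative only)."""
    # An empty endpoints list explains the failure regardless of the client error.
    if endpoint_count == 0:
        return "empty endpoints: check label selectors"
    # These exception types all derive from OSError, so test the specific ones.
    if isinstance(error, socket.gaierror):
        return "DNS resolution failure: check CoreDNS"
    if isinstance(error, ssl.SSLCertVerificationError):
        return "TLS error: check certificate expiry"
    if isinstance(error, ConnectionRefusedError):
        return "connection refused: check if pods are running"
    return "unclassified: inspect the error manually"
```

The point of the sketch is the shape of the output: each branch names a cause and a next step, where a bare "connection failed" names neither.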

This is the intelligence layer that transforms alert noise into actionable diagnosis — understanding not just that something failed, but why it failed and what to fix.

This is part 2 of our “Dependency & Network Failure Intelligence” series. See part 1: Most Outages Are Brownouts.