SLOs, SLIs, and Error Budgets: A Practical SRE Guide

Service Level Objectives (SLOs) are the foundation of modern reliability engineering. Yet according to Google’s SLO Adoption Survey, most organizations embrace SRE practices without fully engaging with SLOs. That leaves value on the table: organizations that manage SLOs and error budgets effectively report 20% improvements in reliability and 30% faster incident response. SLOs are also what give teeth to a failure catalog: without budgets, even known failure modes drift back into “nice to fix someday”.

Reliability gain: 20% (with effective error budgets)
Faster response: 30% (incident response improvement)
Outage source: 70% (caused by changes)
Common SLO: 99.9% (most common target)

The Three Concepts

SLI — Service Level Indicator

An SLI is a quantitative measure of service behavior. It answers the question: “How is the service performing right now?” The formula is simple: good events divided by total events, expressed as a percentage.
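As a minimal sketch of that formula, the function below turns raw event counts into a percentage. The function name and the sample counts are hypothetical, and treating a window with no traffic as 100% is an assumption you may want to handle differently.

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events divided by total events."""
    if total_events == 0:
        # Assumption: no traffic means nothing failed, so report 100%.
        return 100.0
    return 100.0 * good_events / total_events


# Hypothetical counts: 999,420 successful requests out of 1,000,000.
print(f"{sli_percent(999_420, 1_000_000):.3f}%")
```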

For a web service, an SLI might be the percentage of requests that complete successfully. For a data pipeline, it might be the percentage of records processed within the freshness window. For a payment system, it might be the percentage of transactions that settle correctly.

SLI Selection Matters

The best SLIs measure what users actually experience. Request success rate from the server’s perspective might look great while users experience failures due to client-side issues, network problems, or slow responses that they abandon. Measure from as close to the user as possible.

SLO — Service Level Objective

An SLO is a target value for an SLI, measured over a time window. It answers the question: “How reliable do we commit to being?”

When you set an SLO of 99.9% availability over a 28-day rolling window, you’re saying that 99.9% of requests should succeed. That sounds nearly perfect, but the math reveals the reality: expressed as time, 99.9% uptime means you can be down for about 43 minutes in a 30-day month (about 40 minutes in a 28-day window), or roughly 1.5 minutes per day.

SLO Target	Monthly Downtime	Yearly Downtime
99% (two nines)	~7 hours	~3.65 days
99.9% (three nines)	~43 minutes	~8.7 hours
99.99% (four nines)	~4.3 minutes	~52 minutes
99.999% (five nines)	~26 seconds	~5.2 minutes
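The figures in the table can be reproduced with a few lines of arithmetic. This is a sketch that assumes a 30-day month and a 365-day year, and that treats the SLO as time-based (full outage versus fully up); the function name is illustrative.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Full-outage time that fits inside the SLO for the given window."""
    budget_fraction = 1.0 - slo_percent / 100.0
    return window * budget_fraction


for target in (99.0, 99.9, 99.99, 99.999):
    monthly = allowed_downtime(target, timedelta(days=30))
    yearly = allowed_downtime(target, timedelta(days=365))
    print(f"{target}%  per month: {monthly}  per year: {yearly}")
```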

Each additional nine requires roughly 10x more engineering effort. Going from 99.9% to 99.99% isn’t just “a little better” — it’s a fundamentally different class of system that requires redundancy, automated failover, and rigorous testing that most teams can’t justify.

Error Budget

The error budget is the complement of the SLO: the acceptable amount of unreliability, or 100% minus the target. If your SLO is 99.9%, your error budget is 0.1%, which translates to about 43 minutes of downtime per month that you can “spend” without violating your commitment.

This reframing changes everything. Instead of reliability being an infinite obligation (“never have outages”), it becomes a finite resource to manage (“we have 43 minutes this month, let’s spend them wisely”).
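For a request-based SLO the same idea can be tracked in events rather than minutes. The sketch below computes how much of the budget is left; the function name and the request counts are made up for illustration.

```python
def budget_remaining(slo_percent: float, total_events: int, bad_events: int) -> float:
    """Fraction of the error budget still unspent (negative means overspent)."""
    allowed_bad = total_events * (1.0 - slo_percent / 100.0)
    if allowed_bad == 0:
        # Assumption: with no budget at all, any failure counts as exhaustion.
        return 1.0 if bad_events == 0 else 0.0
    return 1.0 - bad_events / allowed_bad


# Hypothetical window: 10,000,000 requests at a 99.9% SLO allows about
# 10,000 failures. With 2,500 failures so far, roughly 75% is left.
print(f"{budget_remaining(99.9, 10_000_000, 2_500):.1%}")
```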

Why Error Budgets Work

According to Google’s SRE book, error budgets eliminate structural tension between development and operations. When the budget is healthy, developers can ship features and take risks. When the budget is low, everyone prioritizes stability. The team becomes self-policing because they share a common, data-driven measure.

From User Journeys to SLOs

The most common mistake is setting SLOs based on what sounds impressive rather than what users actually need. A 99.99% SLO sounds better than 99.9%, but if you can’t achieve it, you’re setting yourself up for constant failure and eroded trust.

Start with critical user journeys. What are the top five actions users take with your service? For each journey, identify what “success” means in measurable terms. A dashboard load might need to complete in under 2 seconds. A payment might need to process without errors. A search query might need to return results within 500 milliseconds.
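One way to make “success in measurable terms” concrete is to write each journey down as a small record with a predicate that classifies an event as good or bad. The sketch below uses the three examples from the paragraph above; the class name, field names, and the choice to treat a latency exactly at the threshold as good are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JourneyCriteria:
    name: str
    max_latency_ms: Optional[float] = None  # None means no latency requirement
    require_no_error: bool = True

    def is_good(self, latency_ms: float, had_error: bool) -> bool:
        """Classify a single event as good (True) or bad (False)."""
        if self.require_no_error and had_error:
            return False
        if self.max_latency_ms is not None and latency_ms > self.max_latency_ms:
            return False
        return True


# Hypothetical journey definitions based on the examples in the text.
JOURNEYS = [
    JourneyCriteria("dashboard_load", max_latency_ms=2000),
    JourneyCriteria("payment"),
    JourneyCriteria("search", max_latency_ms=500),
]
```

Counting the events for which `is_good` returns true, divided by all events, gives the SLI for that journey.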

Then baseline your current performance. Query your logs or metrics to understand your actual P50, P90, and P99 latencies. Check your actual error rates over the past month. Set initial SLOs slightly below your current performance — this gives you room for normal variance without immediately burning budget.
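A rough sketch of the baselining step, using only the Python standard library: it computes P50, P90, and P99 from a list of latencies and proposes a starting SLO a little below observed performance. The helper names, the sample data, and the 0.05-point margin are hypothetical; choose a margin that matches your own variance.

```python
import statistics
from typing import Dict, Sequence


def latency_percentiles(latencies_ms: Sequence[float]) -> Dict[str, float]:
    """Return P50, P90, and P99 for a sample of at least two latencies."""
    cuts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    # quantiles() returns 99 cut points; index k-1 is the k-th percentile.
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


def suggest_initial_slo(observed_percent: float, margin_percent: float = 0.05) -> float:
    """Propose a target slightly below current performance (assumed margin)."""
    return observed_percent - margin_percent


sample = [float(ms) for ms in range(1, 101)]  # made-up latencies: 1..100 ms
print(latency_percentiles(sample))
print(suggest_initial_slo(99.95))
```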

Typical SLI Categories by Service Type

Availability (did it work?) and latency (was it fast?) are almost universally relevant. Throughput matters for batch systems. Correctness matters for data pipelines. Freshness matters for real-time dashboards and analytics. Pick 3-5 SLIs that cover what users care about — more than that creates noise rather than signal.

The Error Budget Policy

An SLO without consequences is just a dashboard. The error budget policy defines what happens as budget depletes.

Budget above 50% (Green): Normal operations. Feature development proceeds at full velocity. Teams can experiment and take calculated risks. This is where you want to be most of the time.

Budget between 20% and 50% (Yellow): Caution. Reliability work gets prioritized. Risky deployments require additional review. The team increases monitoring attention and may slow release cadence.

Budget below 20% (Red): Feature freeze. All engineering effort goes toward reliability improvements. Postmortems are mandatory for any new incidents. Leadership visibility increases.

Budget exhausted (Critical): Complete change freeze except for critical fixes. A war room is established to address the root causes. No new deployments until the service is back within SLO and the team understands what went wrong.
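The four tiers above map naturally onto a small lookup that a dashboard or deploy gate could call. This is a sketch only: the names are illustrative, and assigning the exact boundaries (20% and 50%) to Yellow is an assumption, since the policy text does not say which side they fall on.

```python
from enum import Enum


class BudgetState(Enum):
    GREEN = "normal operations"
    YELLOW = "caution"
    RED = "feature freeze"
    CRITICAL = "change freeze"


def budget_state(remaining_fraction: float) -> BudgetState:
    """Map the fraction of error budget remaining to a policy tier."""
    if remaining_fraction <= 0.0:
        return BudgetState.CRITICAL
    if remaining_fraction < 0.20:
        return BudgetState.RED
    if remaining_fraction <= 0.50:
        # Assumption: exactly 20% and exactly 50% count as Yellow.
        return BudgetState.YELLOW
    return BudgetState.GREEN


for remaining in (0.75, 0.35, 0.10, 0.0):
    print(f"{remaining:.0%} remaining -> {budget_state(remaining).name}")
```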

The Policy Must Have Teeth

The example error budget policy in Google’s SRE Workbook requires a postmortem, with at least one high-priority action item, whenever a single incident consumes more than 20% of the error budget. Without enforcement, error budget policies become suggestions that teams ignore under deadline pressure.
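To see what the 20% rule means in practice, the sketch below checks one incident against a time-based budget. With roughly 43 minutes of monthly budget, 20% is under 9 minutes, so a single 10-minute outage already qualifies. The function name and the sample durations are hypothetical.

```python
def needs_postmortem(
    incident_minutes: float,
    total_budget_minutes: float,
    threshold: float = 0.20,
) -> bool:
    """True when one incident consumed more than `threshold` of the budget."""
    return incident_minutes / total_budget_minutes > threshold


MONTHLY_BUDGET_MINUTES = 43.2  # 0.1% of a 30-day month

print(needs_postmortem(10, MONTHLY_BUDGET_MINUTES))  # 10 / 43.2 is about 23%
print(needs_postmortem(5, MONTHLY_BUDGET_MINUTES))   # 5 / 43.2 is about 12%
```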

Common Mistakes to Avoid

Too many SLOs. Start with 3-5 SLOs that cover critical user journeys. Fifty SLOs means no one pays attention to any of them.

Too aggressive targets. A 99.99% SLO sounds impressive, but if your historical performance is 99.5%, you’ll spend all your time in budget deficit. Set achievable targets and tighten them as you improve.

Internal metrics instead of user experience. CPU utilization and pod restarts matter to operators, but they don’t directly reflect user experience. Focus on what users actually see — most production pain comes from brownouts, not hard outages, and only user-facing SLIs catch those.

No action on budget burn. An SLO without a budget policy is just monitoring. Define what happens at each threshold and stick to it.

Measuring over the wrong windows. A 24-hour window is too sensitive to normal variance. A 90-day window is too slow to respond to real problems. 28-day rolling windows balance signal with actionability.
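The effect of window length is easy to see with a rolling computation over daily counts. The sketch below assumes you already have one (good, total) pair per day; the function name and the sample data are made up.

```python
from collections import deque
from typing import Deque, Iterable, Iterator, Tuple


def rolling_sli(
    daily_counts: Iterable[Tuple[int, int]], window_days: int = 28
) -> Iterator[float]:
    """Yield the SLI (%) over the trailing window after each day's counts."""
    window: Deque[Tuple[int, int]] = deque(maxlen=window_days)
    for good, total in daily_counts:
        window.append((good, total))  # the oldest day drops out automatically
        good_sum = sum(g for g, _ in window)
        total_sum = sum(t for _, t in window)
        yield 100.0 * good_sum / total_sum if total_sum else 100.0


# Hypothetical data: 27 clean days, then one day with 10% failures.
days = [(1000, 1000)] * 27 + [(900, 1000)]
print(list(rolling_sli(days, window_days=28))[-1])  # 28-day view of the bad day
print(list(rolling_sli(days, window_days=1))[-1])   # 1-day view of the same day
```

The same bad day barely moves the 28-day figure but dominates the 1-day figure, which is the sensitivity trade-off described above.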

Implementation Path

The implementation is iterative, not waterfall. Start by identifying your top three critical user journeys and defining SLIs for each. Use your existing observability stack to measure current performance over the past 28 days. Set initial SLOs at or slightly below current performance.

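Build dashboards that show current SLI values, error budget remaining, and burn rate. Burn rate is crucial: it measures how fast you are consuming budget relative to the rate that would use it up exactly at the end of the window. Knowing that you have 60% budget remaining is less actionable than knowing you’re burning at 3x that rate and, with a 28-day window, will exhaust the remaining budget in under six days.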
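Burn rate and time to exhaustion are both simple ratios. In this sketch a burn rate of 1.0 is taken to mean “spending the budget exactly over the full window”, which is a common definition but an assumption here; the function names and inputs are illustrative.

```python
def burn_rate(observed_error_rate: float, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    allowed_error_rate = 1.0 - slo_percent / 100.0
    return observed_error_rate / allowed_error_rate


def days_until_exhausted(
    remaining_fraction: float, rate: float, window_days: float = 28.0
) -> float:
    """Days until the remaining budget is gone if the burn rate holds."""
    if rate <= 0.0:
        return float("inf")  # not burning, so the budget never runs out
    return remaining_fraction * window_days / rate


# Hypothetical reading: 0.3% of requests failing against a 99.9% SLO,
# with 60% of a 28-day budget still unspent.
rate = burn_rate(0.003, 99.9)
print(f"burn rate: {rate:.1f}x")
print(f"days left: {days_until_exhausted(0.60, rate):.1f}")
```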

Set up alerts based on burn rate, not just SLI values. A brief dip below SLO might not matter if budget is healthy. A slow burn over days matters more than a spike that recovers quickly.
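One way to ignore spikes that have already recovered is to require a high burn rate over both a long and a short window before paging, in the spirit of the multiwindow burn-rate alerts described in Google’s SRE Workbook. The sketch below is a simplification with a hypothetical function name; the window lengths and the threshold are yours to choose.

```python
def should_page(
    long_window_burn: float, short_window_burn: float, threshold: float
) -> bool:
    """Page only if the burn is both sustained (long) and ongoing (short)."""
    return long_window_burn >= threshold and short_window_burn >= threshold


THRESHOLD = 2.0  # hypothetical: page at twice the sustainable burn rate

# A spike that has already recovered: the long window is still elevated,
# but the short window shows the burn has stopped.
print(should_page(long_window_burn=2.5, short_window_burn=0.5, threshold=THRESHOLD))

# A sustained burn: both windows are above the threshold.
print(should_page(long_window_burn=3.0, short_window_burn=3.2, threshold=THRESHOLD))
```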

Document your error budget policy with specific thresholds and actions. Get buy-in from product and engineering leadership. Then enforce it — the first time you freeze features due to budget exhaustion, the organization learns that SLOs are real.

Review monthly. Did you meet targets? Were there incidents? What did you learn? Should targets be tightened because you’re hitting them easily, or loosened because they’re causing constant friction?

The Bottom Line

SLOs aren’t about achieving perfection — they’re about making reliability a first-class engineering concern with explicit trade-offs. When you set a 99.9% SLO, you’re acknowledging that 0.1% unreliability is acceptable and focusing your energy on staying within that budget rather than chasing an impossible zero.

The error budget is what makes this work. It turns reliability from an open-ended obligation into a measurable resource. It gives product and engineering a shared language for discussing risk. And it creates a self-regulating system where teams naturally balance feature velocity with stability.

Start small. Pick your most critical service. Define one availability SLI and one latency SLI. Set achievable targets. Build the dashboard. Write the policy. Then iterate.

Part of our SRE Practices series. Next: Implementing On-Call that Doesn’t Suck.