SLOs, SLIs, and Error Budgets: A Practical Guide
Demystifying service level objectives and how to implement them effectively in your organization.
Service Level Objectives (SLOs) are the foundation of modern reliability engineering. But according to Google’s SLO Adoption Survey, most organizations embrace SRE practices but fail to engage fully with SLOs. That’s leaving value on the table — organizations that effectively manage SLOs and error budgets report 20% improvements in reliability and 30% faster incident response.
The Three Concepts
SLI — Service Level Indicator
An SLI is a quantitative measure of service behavior. It answers the question: “How is the service performing right now?” The formula is simple: good events divided by total events, expressed as a percentage.
For a web service, an SLI might be the percentage of requests that complete successfully. For a data pipeline, it might be the percentage of records processed within the freshness window. For a payment system, it might be the percentage of transactions that settle correctly.
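The good-over-total formula can be sketched in a few lines. This is a minimal illustration, not a production implementation: the function name `sli_percent` and the event counts are hypothetical, and treating an empty window as 100% is an assumption some teams would replace with "no data".

```python
def sli_percent(good_events: int, total_events: int) -> float:
    """Return the SLI as a percentage: good events / total events * 100."""
    if good_events < 0 or total_events < 0 or good_events > total_events:
        raise ValueError("expected 0 <= good_events <= total_events")
    if total_events == 0:
        # No traffic in the window. Reporting 100% is an assumption;
        # reporting "no data" is an equally reasonable choice.
        return 100.0
    return 100.0 * good_events / total_events


# Hypothetical counts for one measurement window of a web service.
availability = sli_percent(good_events=999_412, total_events=1_000_000)
print(f"availability SLI: {availability:.3f}%")
```

The same function covers the pipeline and payment examples: only the definition of a "good" event changes (processed within the freshness window, settled correctly).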
SLI Selection Matters
The best SLIs measure what users actually experience. Request success rate from the server’s perspective might look great while users experience failures due to client-side issues, network problems, or slow responses that they abandon. Measure from as close to the user as possible.
SLO — Service Level Objective
An SLO is a target value for an SLI, measured over a time window. It answers the question: “How reliable do we commit to being?”
When you set an SLO of 99.9% availability over a 28-day rolling window, you’re saying that 99.9% of requests should succeed. That sounds nearly perfect, but the math reveals the reality: expressed as time, 99.9% allows about 40 minutes of full downtime per 28-day window (about 43 minutes in a 30-day month), or roughly 1.5 minutes per day.
| SLO Target | Monthly Downtime | Yearly Downtime |
|---|---|---|
| 99% (two nines) | ~7.2 hours | ~3.65 days |
| 99.9% (three nines) | ~43 minutes | ~8.8 hours |
| 99.99% (four nines) | ~4.3 minutes | ~53 minutes |
| 99.999% (five nines) | ~26 seconds | ~5.3 minutes |
Each additional nine cuts the allowed downtime by a factor of ten, and a common rule of thumb is that it costs roughly 10x more engineering effort. Going from 99.9% to 99.99% isn’t just “a little better”; it’s a fundamentally different class of system that requires redundancy, automated failover, and rigorous testing that many teams can’t justify.
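The downtime figures for each target follow directly from the window length. The sketch below assumes a 30-day month and a 365-day year; the function name `allowed_downtime` is hypothetical, and it models the simplest case where the service is either fully up or fully down.

```python
from datetime import timedelta


def allowed_downtime(slo_percent: float, window: timedelta) -> timedelta:
    """Time a service may be fully down within `window` and still meet the SLO."""
    if not 0.0 <= slo_percent <= 100.0:
        raise ValueError("slo_percent must be between 0 and 100")
    return window * ((100.0 - slo_percent) / 100.0)


for target in (99.0, 99.9, 99.99, 99.999):
    month = allowed_downtime(target, timedelta(days=30))
    year = allowed_downtime(target, timedelta(days=365))
    print(
        f"{target}%: {month.total_seconds() / 60:.1f} min per 30 days, "
        f"{year.total_seconds() / 3600:.2f} h per year"
    )
```

Passing `timedelta(days=28)` instead gives the budget for a 28-day rolling window, which is slightly smaller than the monthly figure.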
Error Budget
The error budget is the complement of the SLO: 100% minus the target, which is the acceptable amount of unreliability. If your SLO is 99.9%, your error budget is 0.1%, which translates to about 43 minutes of downtime per 30-day month that you can “spend” without violating your commitment.
This reframing changes everything. Instead of reliability being an infinite obligation (“never have outages”), it becomes a finite resource to manage (“we have 43 minutes this month, let’s spend them wisely”).
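To make the "finite resource" idea concrete, here is a minimal sketch that reports how much of a request-based error budget is left. The function name `error_budget_remaining` and the request counts are hypothetical; a negative result means the budget is overspent.

```python
def error_budget_remaining(slo_percent: float, good: int, total: int) -> float:
    """Fraction of the error budget still unspent.

    1.0 means untouched, 0.0 means exactly exhausted, negative means overspent.
    """
    allowed_bad = total * (100.0 - slo_percent) / 100.0
    actual_bad = total - good
    if allowed_bad == 0:
        # Either no traffic or a 100% target: any failure overspends the budget.
        return 1.0 if actual_bad == 0 else float("-inf")
    return 1.0 - actual_bad / allowed_bad


# Hypothetical window: 1,000,000 requests, 400 failures, 99.9% SLO.
# The budget is about 1,000 failed requests, so roughly 60% is left.
remaining = error_budget_remaining(99.9, good=999_600, total=1_000_000)
print(f"budget remaining: {remaining:.0%}")
```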
Why Error Budgets Work
According to Google’s SRE book, error budgets eliminate structural tension between development and operations. When the budget is healthy, developers can ship features and take risks. When the budget is low, everyone prioritizes stability. The team becomes self-policing because they share a common, data-driven measure.
From User Journeys to SLOs
The most common mistake is setting SLOs based on what sounds impressive rather than what users actually need. A 99.99% SLO sounds better than 99.9%, but if you can’t achieve it, you’re setting yourself up for constant failure and eroded trust.
Start with critical user journeys. What are the top five actions users take with your service? For each journey, identify what “success” means in measurable terms. A dashboard load might need to complete in under 2 seconds. A payment might need to process without errors. A search query might need to return results within 500 milliseconds.
Then baseline your current performance. Query your logs or metrics to understand your actual P50, P90, and P99 latencies. Check your actual error rates over the past month. Set initial SLOs slightly below your current performance — this gives you room for normal variance without immediately burning budget.
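The baselining step can be sketched with the standard library alone. Everything here is illustrative: the sample latencies are synthetic, the function names are hypothetical, and the `headroom` factor of 1.5 is just one assumed way to encode "slightly below current performance" (it budgets for 1.5x the error rate you observed).

```python
import statistics


def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Return P50, P90, and P99 from raw latency samples (needs 2+ samples)."""
    cuts = statistics.quantiles(latencies_ms, n=100)  # 99 cut points
    return {"p50": cuts[49], "p90": cuts[89], "p99": cuts[98]}


def initial_slo_percent(good: int, total: int, headroom: float = 1.5) -> float:
    """Suggest a starting SLO slightly below observed performance."""
    if total <= 0:
        raise ValueError("need at least one event to baseline")
    observed_error_pct = 100.0 * (total - good) / total
    return 100.0 - observed_error_pct * headroom


# Synthetic data standing in for a month of measurements.
samples = [float(ms) for ms in range(1, 1001)]
print(latency_percentiles(samples))
print(f"suggested SLO: {initial_slo_percent(999_500, 1_000_000):.3f}%")
```

In practice you would round the suggested target to something memorable and revisit it at the first monthly review.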
Typical SLI Categories by Service Type
Availability (did it work?) and latency (was it fast?) are almost universally relevant. Throughput matters for batch systems. Correctness matters for data pipelines. Freshness matters for real-time dashboards and analytics. Pick 3-5 SLIs that cover what users care about — more than that creates noise rather than signal.
The Error Budget Policy
An SLO without consequences is just a dashboard. The error budget policy defines what happens as budget depletes.
Budget above 50% (Green): Normal operations. Feature development proceeds at full velocity. Teams can experiment and take calculated risks. This is where you want to be most of the time.
Budget between 20% and 50% (Yellow): Caution. Reliability work gets prioritized. Risky deployments require additional review. The team increases monitoring attention and may slow release cadence.
Budget below 20% (Red): Feature freeze. All engineering effort goes toward reliability improvements. Postmortems are mandatory for any new incidents. Leadership visibility increases.
Budget exhausted (Critical): Complete change freeze except for critical fixes. A war room is established to address the root causes. No new deployments until the service is back within SLO and the team understands what went wrong.
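The four tiers above map naturally onto a small lookup function that a dashboard or deploy gate could call. This is a sketch only: the names `BudgetState` and `budget_state` are hypothetical, and placing exactly 20% and exactly 50% in Yellow is an assumption about the boundaries.

```python
from enum import Enum


class BudgetState(Enum):
    GREEN = "normal operations"
    YELLOW = "caution: prioritize reliability work"
    RED = "feature freeze"
    CRITICAL = "change freeze except critical fixes"


def budget_state(remaining_fraction: float) -> BudgetState:
    """Map the unspent share of the error budget (1.0 = full) to a policy tier."""
    if remaining_fraction <= 0.0:
        return BudgetState.CRITICAL
    if remaining_fraction < 0.20:
        return BudgetState.RED
    if remaining_fraction <= 0.50:
        return BudgetState.YELLOW
    return BudgetState.GREEN


for remaining in (0.85, 0.35, 0.10, -0.05):
    state = budget_state(remaining)
    print(f"{remaining:>6.0%} remaining -> {state.name}: {state.value}")
```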
The Policy Must Have Teeth
The example error budget policy in Google’s SRE Workbook requires a postmortem whenever a single incident consumes more than 20% of the error budget over a four-week window, and that postmortem must include at least one top-priority action item. Without enforcement, error budget policies become suggestions that teams ignore under deadline pressure.
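A rule like the 20% postmortem trigger is easy to automate. The sketch below assumes a request-based budget; the function name `postmortem_required` and the incident numbers are hypothetical.

```python
def postmortem_required(
    incident_bad_events: int,
    window_total_events: int,
    slo_percent: float,
    threshold: float = 0.20,
) -> bool:
    """True if one incident consumed more than `threshold` of the window's budget."""
    budget_events = window_total_events * (100.0 - slo_percent) / 100.0
    if budget_events == 0:
        # No budget at all: any failed event exceeds it.
        return incident_bad_events > 0
    return incident_bad_events / budget_events > threshold


# Hypothetical four-week window: 1,000,000 requests at a 99.9% SLO,
# so the budget is about 1,000 failed requests.
print(postmortem_required(250, 1_000_000, 99.9))  # 25% of the budget
print(postmortem_required(150, 1_000_000, 99.9))  # 15% of the budget
```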
Common Mistakes to Avoid
Too many SLOs. Start with 3-5 SLOs that cover critical user journeys. Fifty SLOs means no one pays attention to any of them.
Too aggressive targets. A 99.99% SLO sounds impressive, but if your historical performance is 99.5%, you’ll spend all your time in budget deficit. Set achievable targets and tighten them as you improve.
Internal metrics instead of user experience. CPU utilization and pod restarts matter to operators, but they don’t directly reflect user experience. Focus on what users actually see.
No action on budget burn. An SLO without a budget policy is just monitoring. Define what happens at each threshold and stick to it.
Measuring over the wrong windows. A 24-hour window is too sensitive to normal variance. A 90-day window is too slow to respond to real problems. A 28-day rolling window balances signal with actionability, and because it spans exactly four weeks it always contains the same number of each weekday.
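As an illustration of the rolling-window point, here is a minimal sketch that keeps per-day counts and computes the SLI over the most recent 28 days. The class name `RollingSLI` is hypothetical, and daily granularity is an assumption; a real system would typically derive this from its metrics store.

```python
from collections import deque


class RollingSLI:
    """Track daily (good, total) counts and report the SLI over a rolling window."""

    def __init__(self, window_days: int = 28) -> None:
        # A bounded deque drops the oldest day automatically once full.
        self._days: deque[tuple[int, int]] = deque(maxlen=window_days)

    def add_day(self, good: int, total: int) -> None:
        self._days.append((good, total))

    def sli_percent(self) -> float:
        good = sum(g for g, _ in self._days)
        total = sum(t for _, t in self._days)
        return 100.0 if total == 0 else 100.0 * good / total


tracker = RollingSLI(window_days=28)
tracker.add_day(good=9_000, total=10_000)  # one bad day
for _ in range(27):
    tracker.add_day(good=10_000, total=10_000)
print(f"28-day SLI: {tracker.sli_percent():.3f}%")
```

Once the bad day ages out of the window, the SLI recovers without any manual reset, which is the practical difference between a rolling window and a calendar month.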
Implementation Path
The implementation is iterative, not waterfall. Start by identifying your top three critical user journeys and defining SLIs for each. Use your existing observability stack to measure current performance over the past 28 days. Set initial SLOs at or slightly below current performance.
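Build dashboards that show current SLI values, error budget remaining, and burn rate. Burn rate is the speed at which you consume budget relative to the pace that would use it up exactly at the end of the window, and it is crucial: knowing that you have 60% budget remaining is less actionable than knowing you’re burning at 3x the sustainable rate and, on a 28-day window, will exhaust the remaining budget in under six days.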
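Burn rate and time to exhaustion can be computed from the same counts a dashboard already has. In this sketch a burn rate of 1.0 means the budget would last exactly one full window; the function names and sample numbers are hypothetical.

```python
def burn_rate(bad: int, total: int, slo_percent: float) -> float:
    """Observed error rate divided by the error rate the SLO allows."""
    if total == 0:
        return 0.0  # no traffic, nothing burned
    allowed_error_rate = (100.0 - slo_percent) / 100.0
    if allowed_error_rate == 0:
        return float("inf") if bad else 0.0
    return (bad / total) / allowed_error_rate


def days_until_exhausted(
    remaining_fraction: float, rate: float, window_days: float = 28.0
) -> float:
    """Days until the remaining budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return max(remaining_fraction, 0.0) * window_days / rate


# Hypothetical: 3 failures per 1,000 requests against a 99.9% SLO is a 3x burn.
rate = burn_rate(bad=3, total=1_000, slo_percent=99.9)
print(f"burn rate: {rate:.1f}x")
print(f"days left with 60% remaining: {days_until_exhausted(0.6, rate):.1f}")
```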
Set up alerts based on burn rate, not just SLI values. A brief dip below SLO might not matter if budget is healthy. A slow burn over days matters more than a spike that recovers quickly.
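One common way to alert on burn rate is to require that both a long and a short lookback window exceed a threshold, similar in spirit to the multiwindow approach described in the Google SRE Workbook. The sketch below is illustrative: the function names are hypothetical, and the "2% of the budget in one hour" paging rule is an assumed example rather than a recommendation for your service.

```python
def burn_rate_threshold(
    budget_fraction: float, alert_window_hours: float, slo_window_days: float = 28.0
) -> float:
    """Burn rate that would spend `budget_fraction` of the budget in the alert window."""
    return budget_fraction * (slo_window_days * 24.0) / alert_window_hours


def should_page(long_window_rate: float, short_window_rate: float, threshold: float) -> bool:
    """Page only if the burn is both sustained (long window) and still happening (short)."""
    return long_window_rate >= threshold and short_window_rate >= threshold


# Assumed rule: page if 2% of a 28-day budget would be spent within one hour.
threshold = burn_rate_threshold(budget_fraction=0.02, alert_window_hours=1.0)
print(f"paging threshold: {threshold:.2f}x")

# A spike that has already recovered: high over the last hour, low over the last 5 minutes.
print(should_page(long_window_rate=15.0, short_window_rate=0.5, threshold=threshold))
# A burn that is still in progress.
print(should_page(long_window_rate=15.0, short_window_rate=16.0, threshold=threshold))
```

The short window is what keeps a recovered spike from paging anyone, while a slower rule with a lower threshold (for example, a larger budget fraction over several days) catches the gradual burns described above.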
Document your error budget policy with specific thresholds and actions. Get buy-in from product and engineering leadership. Then enforce it — the first time you freeze features due to budget exhaustion, the organization learns that SLOs are real.
Review monthly. Did you meet targets? Were there incidents? What did you learn? Should targets be tightened because you’re hitting them easily, or loosened because they’re causing constant friction?
The Bottom Line
SLOs aren’t about achieving perfection — they’re about making reliability a first-class engineering concern with explicit trade-offs. When you set a 99.9% SLO, you’re acknowledging that 0.1% unreliability is acceptable and focusing your energy on staying within that budget rather than chasing an impossible zero.
The error budget is what makes this work. It turns reliability from an open-ended obligation into a measurable resource. It gives product and engineering a shared language for discussing risk. And it creates a self-regulating system where teams naturally balance feature velocity with stability.
Start small. Pick your most critical service. Define one availability SLI and one latency SLI. Set achievable targets. Build the dashboard. Write the policy. Then iterate.
References
- Google SRE: Embracing Risk
- Google SRE Workbook: Error Budget Policy
- Google: SLO Adoption and Usage Survey
- Google Cloud: Understanding Error Budget Overspend
- Uptime Calculator: SLA Downtime Reference
- Harness: How to Use Error Budgets
- incident.io: SLOs, SLAs, and SLIs Guide
Part of our SRE Practices series. Next: Implementing On-Call that Doesn’t Suck.