
The GDS Way and its content are intended for internal use by the GDS community.

Make data-driven decisions with service level objectives

Service level objectives (SLOs) set out a target level for your service’s reliability. Using SLOs helps you make data-driven decisions about the opportunity cost of reliability work.

You can use SLOs and error budgets alongside service level indicators (SLIs) to help you make better decisions.

SLOs help you decide how to prioritise reliability work, such as bug fixes and architectural improvements, by stating the problem in a user-centric way. For example, ‘this bug affects 10% of users’ or ‘this will make the service 500ms faster for users’.

What you’ll need to know before setting your SLOs

Current SLI levels

Current performance based on SLIs is usually a good place to start, especially if you do not have any other information. It also helps to set a baseline that you can improve to reflect service objectives.

Your dependencies

You’ll need to know the SLOs for other services that your service relies on. For example, your hosting provider or any other components that are important to your users’ experience.
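
As a rough illustration of why dependency SLOs matter, the sketch below estimates the best availability your service could achieve given the targets of the services it relies on. It assumes that every dependency is needed for every request, that failures are independent, and that there is no caching or fallback. The dependency names and targets are made up for the example.

```python
from math import prod


def availability_ceiling(dependency_slos):
    """Upper bound on availability when every dependency is required.

    Assumes independent failures and no fallback or caching, so the
    individual availabilities simply multiply together.
    """
    return prod(dependency_slos)


# Hypothetical dependency targets, expressed as fractions.
dependencies = {
    "hosting": 0.9995,
    "database": 0.9995,
    "identity_provider": 0.999,
}

ceiling = availability_ceiling(dependencies.values())
print(f"Best achievable availability: {ceiling:.4%}")
```

Under these assumptions the ceiling is lower than the weakest single dependency, so an SLO target above it would not be achievable without architectural changes.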

User satisfaction

You’ll need to be aware of current user satisfaction levels. SLOs tell users what level of reliability they can expect from your service. Users are more likely to complain or stop using the service if it drops below the target level.

If your users are happy with the availability and latency of your service you can set your SLOs based on your current SLIs.

In other cases, your users’ expectations will drive a set of SLOs that the service’s current architecture may not achieve. You can reduce the gap between expectations and your SLO by using techniques like ratcheting, where you tighten the target in gradual steps as the service improves.

Setting your first SLOs

Use the SLIs as a reference

Your SLIs will give you a good idea of how your current platform is doing and give you a baseline to refer to.

Choose a time window

You should report SLOs over a set period of time. For most purposes, a 4-week rolling window works well.

A shorter time window lets you make decisions and iterate development more quickly, such as prioritising bug fixes. You could also set your time window over a longer period of time. This can be better for more strategic decisions and can align with your business calendar, such as in quarterly planning.
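
A rolling window can be sketched as follows. This is a minimal example that assumes you already collect daily counts of good and total requests. The function name and the sample data are hypothetical.

```python
from collections import deque


def rolling_sli(daily_counts, window_days=28):
    """Yield the SLI (good / total) over a trailing window, one value per day.

    daily_counts is an iterable of (good_requests, total_requests) tuples,
    one per day, oldest first.
    """
    window = deque(maxlen=window_days)  # old days drop off automatically
    for good, total in daily_counts:
        window.append((good, total))
        good_sum = sum(g for g, _ in window)
        total_sum = sum(t for _, t in window)
        # With no traffic there is nothing to measure, so report 1.0.
        yield good_sum / total_sum if total_sum else 1.0


# Hypothetical data, using a 2-day window to keep the example short.
sample = [(99, 100), (100, 100), (90, 100)]
for day, sli in enumerate(rolling_sli(sample, window_days=2), start=1):
    print(f"day {day}: {sli:.3f}")
```

The default of 28 days matches the 4-week rolling window suggested above. A longer window, such as a quarter, only needs a different `window_days` value.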

Choose an SLO target

When you set an SLO target, consider its technical, product and business implications.

Although you should take your current performance (SLIs) into account, do not set an SLO target based on current performance alone. If your service is currently overachieving, this can lock you into supporting a higher level of reliability than you need. Balance the SLO target with your product and business objectives.

Create error budgets

Error budgets help you know when to prioritise reliability work and when to prioritise new features. They give you a concrete number to measure against and help determine when you need to take action.

You can work out your error budget from your SLO. For example, if your SLO on availability is 99.9%, your error budget is 0.1% of the request volume. If your service had 1 million requests, your error budget would be 1,000 failed requests. If traffic is spread evenly, this works out at about 40 minutes of outage in your 4-week rolling window.

An error budget is usually expressed as a percentage rather than an absolute value. This can help you to focus on what the error means to your users. For example, if your service has 435 errors during the SLO period, it would have used up 435 / 1,000 × 100% = 43.5% of its error budget.

If the number of bad requests during the period is less than 1,000, you’ve stayed within your error budget.
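
The arithmetic above can be sketched in a few lines. This is an illustration only. The function names are hypothetical, and the downtime figure assumes requests are spread evenly across the window.

```python
def error_budget(slo_target, total_requests):
    """Number of failed requests allowed in the SLO window."""
    return (1 - slo_target) * total_requests


def budget_consumed(errors, slo_target, total_requests):
    """Fraction of the error budget used so far (1.0 means all of it)."""
    return errors / error_budget(slo_target, total_requests)


def allowed_downtime_minutes(slo_target, window_days=28):
    """Equivalent full-outage time, assuming evenly spread traffic."""
    return (1 - slo_target) * window_days * 24 * 60


slo = 0.999           # 99.9% availability
requests = 1_000_000  # requests in the 4-week window

print(f"Error budget: {error_budget(slo, requests):.0f} requests")
print(f"Budget used by 435 errors: {budget_consumed(435, slo, requests):.1%}")
print(f"Equivalent downtime: {allowed_downtime_minutes(slo):.1f} minutes")
```

The results are rounded when printed because floating-point subtraction such as `1 - 0.999` is not exact.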

SLO burn rate and alerts

The image below is a way of visualising your error budget against your error rate threshold. It tells you whether a service is experiencing abnormally high incident or error rates, and whether you need to take urgent action.

[Image: error budget against error rate threshold]

SLO burn rate is how fast, relative to the SLO, the service consumes the error budget. An alerting window defines the period over which errors accumulate before an alert fires.

Your team should capture both sudden, fast SLO burn, which may be caused by an incident or bug, and gradual, slow SLO burn, which may be caused by system degradation and scalability issues.
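
One way to capture both kinds of burn is to check the burn rate over a short and a long alerting window. The sketch below assumes a 4-week SLO window. The thresholds (2% of the budget in 1 hour for a page, 5% in 6 hours for a ticket) are illustrative assumptions rather than recommended values, and all names are hypothetical.

```python
def burn_rate(bad_requests, total_requests, slo_target):
    """Observed error rate divided by the error rate the SLO allows.

    A value of 1.0 means the budget would run out exactly at the end
    of the SLO window. Higher values mean it runs out sooner.
    """
    if total_requests == 0:
        return 0.0
    return (bad_requests / total_requests) / (1 - slo_target)


def burn_rate_threshold(budget_fraction, alert_window_hours,
                        slo_window_hours=28 * 24):
    """Burn rate that uses budget_fraction of the budget within the alerting window."""
    return budget_fraction * slo_window_hours / alert_window_hours


# Assumed thresholds for this example only.
FAST_BURN = burn_rate_threshold(0.02, alert_window_hours=1)  # about 13.44
SLOW_BURN = burn_rate_threshold(0.05, alert_window_hours=6)  # about 5.6


def burn_alerts(last_hour, last_six_hours, slo_target):
    """Return the alerts to raise, given (bad, total) counts per window."""
    alerts = []
    if burn_rate(*last_hour, slo_target) >= FAST_BURN:
        alerts.append("page: fast burn")
    if burn_rate(*last_six_hours, slo_target) >= SLOW_BURN:
        alerts.append("ticket: slow burn")
    return alerts


print(burn_alerts((20, 1000), (60, 10000), slo_target=0.999))
```

Deriving each threshold from a budget fraction keeps the alert tied to the SLO, so changing the target or the window length adjusts the alerts with it.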

Using error budgets to create policy

You can use error budgets to create policies, which are the actions you need to take when your service has used up, or nearly used up, its error budget for the period.

For example, your policy could state that you:

  • stop launching features until the SLO is met again
  • devote 80% of developer time to reliability-related bug fixing
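
A policy like this can be written down as a simple rule so that the whole team applies it consistently. The sketch below is illustrative. The 80% warning level and the 'nearly used up' action are assumptions, not part of any agreed policy.

```python
def policy_actions(budget_consumed, warn_at=0.8):
    """Return the agreed actions for a given fraction of error budget used.

    budget_consumed is a fraction, where 1.0 means the budget is exhausted.
    warn_at is a hypothetical level for 'nearly used up'.
    """
    if budget_consumed >= 1.0:
        return [
            "stop launching features until the SLO is met again",
            "devote 80% of developer time to reliability-related bug fixing",
        ]
    if budget_consumed >= warn_at:
        # Hypothetical early-warning action for this example.
        return ["review upcoming launches for reliability risk"]
    return []


for used in (0.435, 0.85, 1.2):
    print(f"{used:.0%} of budget used: {policy_actions(used) or 'no action needed'}")
```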

SLOs and error budgets can also support data-driven decision making.

This table is an example of how you can use a combination of your SLO and user satisfaction to decide what action to take.

SLOs   | Toil | Customer satisfaction | Action
-------|------|-----------------------|-------
Met    | Low  | High | Choose to (a) relax release and deployment processes and increase velocity, or (b) step back from the engagement and focus engineering time on services that need more reliability
Met    | Low  | Low  | Tighten SLO
Met    | High | High | If alerting is generating false positives, reduce sensitivity. Otherwise, temporarily loosen the SLOs (or offload toil) and fix product and/or improve automated fault mitigation
Met    | High | Low  | Tighten SLO
Missed | Low  | High | Loosen SLO
Missed | Low  | Low  | Increase alerting sensitivity
Missed | High | High | Loosen SLO
Missed | High | Low  | Offload toil and fix product and/or improve automated fault mitigation

Table credit: https://landing.google.com/sre/books/
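
If you want to apply the table in a review script or dashboard, it can be encoded as a lookup. This is a sketch only. The function name is hypothetical and the action text is abbreviated from the table.

```python
# Keys are (SLO status, toil, customer satisfaction), all lower case.
ACTIONS = {
    ("met", "low", "high"): "Relax release processes and increase velocity, "
                            "or step back and focus on services that need more reliability",
    ("met", "low", "low"): "Tighten SLO",
    ("met", "high", "high"): "Reduce alerting sensitivity if it generates false positives, "
                             "otherwise temporarily loosen the SLOs (or offload toil) "
                             "and fix product or improve automated fault mitigation",
    ("met", "high", "low"): "Tighten SLO",
    ("missed", "low", "high"): "Loosen SLO",
    ("missed", "low", "low"): "Increase alerting sensitivity",
    ("missed", "high", "high"): "Loosen SLO",
    ("missed", "high", "low"): "Offload toil and fix product or improve "
                               "automated fault mitigation",
}


def recommended_action(slo_status, toil, satisfaction):
    """Look up the suggested action for one row of the table."""
    key = (slo_status.lower(), toil.lower(), satisfaction.lower())
    return ACTIONS[key]


print(recommended_action("Missed", "Low", "Low"))
```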

Your team should agree on the SLOs and error budget policies. It’s useful to get agreement from the:

  • product manager
  • delivery manager
  • technical architect
  • developers
  • Tech Ops and SRE team members

Continuous improvement of SLOs

You can use your user research and outage records to improve your SLOs. Eventually you’ll be able to set your SLOs by user experience. For example, if your users are happy with the service’s performance, you can experiment with relaxing your SLOs and measure the resulting user satisfaction.


This page was last reviewed on 5 March 2020. It needs to be reviewed again on 5 September 2020 by the page owner #gds-way.