How to manage alerts

Your service should have a system in place to send automated alerts if its monitoring system detects a problem. Sending alerts helps services meet service level agreements (SLAs).

When to send alerts

Your service should send an alert when your service monitoring detects an issue that:

  • affects service users
  • requires action to fix
  • lasts for a sustained period of time

You should only send an alert for things that need action. This could be immediate action for critical or high priority alerts, or less urgent action for lower priority ones. Alert text should be specific and include actionable information. You should not include sensitive material in this text.

Create alerts based on events affecting users, such as slow service requests or unavailable webpages. You should also track precursor events like disks being full or almost full, or databases being down, as these can lead to more serious problems that do affect users.

You must set up your alerts to distinguish between something being wrong for a sustained period of time and a brief issue caused by temporary network conditions. Prometheus supports this through the for parameter in its alerting rules, which requires the condition to be true for a set period of time before the alert fires.
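
As a minimal sketch, the rules file below shows the for parameter in use for both a user-facing issue and a precursor event. The metric names (standard Prometheus client library and node_exporter conventions), thresholds and severity labels are illustrative assumptions, not part of this guidance:

```yaml
groups:
  - name: example-alerts
    rules:
      # User-facing issue: requests are slow. The metric name and
      # threshold are illustrative assumptions - substitute your own.
      - alert: SlowRequests
        expr: histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
        # Only fire once the condition has held for 10 minutes, so a
        # brief blip from temporary network conditions does not page anyone.
        for: 10m
        labels:
          severity: page
        annotations:
          summary: "95th percentile request time has been above 1s for 10 minutes"

      # Precursor event: a disk is over 90% full (node_exporter metrics).
      - alert: DiskAlmostFull
        expr: node_filesystem_avail_bytes / node_filesystem_size_bytes < 0.1
        for: 30m
        labels:
          severity: ticket
        annotations:
          summary: "Disk {{ $labels.mountpoint }} on {{ $labels.instance }} is over 90% full"
```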

If no action is needed, it should not be an alert. For example, an alert showing a system’s status is really a monitoring tool and you should use a dashboard to display this information instead. The GDS Way has more guidance on how to monitor your service.

Specific examples of issues that should not trigger an alert include an individual container instance dying or a single task invocation failing. This is because the orchestration systems running the workload, such as Amazon Elastic Container Service (Amazon ECS) or Kubernetes, will bring the instance back up, and the task will retry.

Situations where there is a sustained period with fewer instances than expected, or repeated task failures, should trigger alerts, as these indicate underlying problems.
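
As a sketch of this distinction, the rule below (which assumes kube-state-metrics is installed and scraped on a Kubernetes cluster, an assumption not made by this guidance) fires only when a deployment has run below its desired replica count for a sustained period, so individual pod restarts never page anyone:

```yaml
groups:
  - name: capacity-alerts
    rules:
      # Assumes kube-state-metrics. A single pod dying is ignored because
      # the orchestrator normally restores it well within the 15 minute
      # window; only a sustained shortfall in instances fires the alert.
      - alert: DeploymentBelowDesiredReplicas
        expr: kube_deployment_status_replicas_available < kube_deployment_spec_replicas
        for: 15m
        labels:
          severity: page
        annotations:
          summary: "{{ $labels.deployment }} has been below its desired replica count for 15 minutes"
```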

How to prioritise alerts

You must prioritise alerts based on whether they need an immediate fix. It can help to class issues as:

  • interrupting - need immediate investigation and resolution
  • non-interrupting - do not need immediate resolution

The Google Site Reliability Engineering (SRE) handbook classifies “interrupting” issues as “pages” and “non-interrupting” issues as “tickets”. Put non-interrupting alerts into a ticket queue for your support team to solve. Keep the ticket queue and team backlog separate to avoid confusion. You should specify an SLA for how quickly each type of alert must be resolved.
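
If you use Prometheus, one way to implement this split is in the Alertmanager routing tree. The sketch below routes alerts labelled severity: page to PagerDuty and everything else to a ticket queue; the receiver names, integration key and email address are placeholders, not recommendations:

```yaml
# Alertmanager configuration sketch - receiver names, the integration
# key and the email address are placeholders.
route:
  # Non-interrupting alerts default to the ticket queue.
  receiver: ticket-queue
  routes:
    # Interrupting alerts ("pages" in SRE terms) go straight to on-call.
    # 'matchers' needs Alertmanager 0.22 or later; older versions use 'match'.
    - matchers:
        - severity="page"
      receiver: on-call
receivers:
  - name: on-call
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
  - name: ticket-queue
    email_configs:
      - to: support-queue@example.gov.uk
```

Driving the route from a severity label set in the alerting rule keeps the decision about what interrupts a person in one reviewable place.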

You should manage alerts using a dedicated tool which will allow you to:

  • manage alerts in a queue
  • triage alerts for review
  • attach statuses to alerts, for example, in progress or resolved
  • gather data about alerts to help improve processes

Recommended tools are:

  • PagerDuty to send high-priority / interrupting alerts
  • Zendesk to manage non-interrupting alerts as tickets

You can also configure these tools to send alert notifications by email or Slack. However, you should only use email and Slack in addition to your primary alerting tool. If alerts only go to email or Slack, people may ignore or overlook them, filter them out, or treat them like spam.
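
For example, in Alertmanager you can attach a Slack notification to the same receiver as the primary PagerDuty integration, so Slack supplements rather than replaces it. The webhook URL and channel below are placeholders:

```yaml
receivers:
  - name: on-call
    # PagerDuty remains the primary, tracked alerting channel.
    pagerduty_configs:
      - routing_key: <pagerduty-integration-key>
    # Slack receives a secondary, informational copy of the same alerts.
    slack_configs:
      - api_url: <slack-incoming-webhook-url>
        channel: "#service-alerts"
```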

We recommend using dashboards and information radiators, also known as Big Visible Charts (BVCs), such as Smashing or BlinkenJS, in combination with alert management tools.

Further reading

For more information refer to the:

  • Google Site Reliability Engineering (SRE) handbook
