This document is current until 4 December 2018
Services at GDS need to be set up to send automated alerts to staff if the service’s monitoring detects issues.
Alerts can take many forms including emails, phone calls and status websites.
The service manual has some information on writing alerts.
User needs for alerting
- Support GDS’ goal of services being available to meet user needs
- Mitigate reputational damage to GDS
- Adhere to service level agreements (with reliant parties like other government departments), industry standards (like PCI compliance) and legal requirements (like the Code of Practice for Official Statistics)
Principles for setting up an alerting system
Alerts should be meaningful and actionable. An alert that only shows the status of a system is really a monitoring tool, not an alert. Make the text of your alerts specific: people will be responding to them in the middle of the night.
Don’t rely on emails as an automated alerting method for incident responders. Emails become noisy, which over time results in them being ignored or overlooked. If you absolutely have to use emails, template them so that the information is presented consistently.
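If you do have to send alert emails, one way to keep them consistent is to render every email from a single template. The following is a minimal sketch using Python's standard library. The field names (severity, service, summary, detected_at, action) and the example values are assumptions made for illustration, not a GDS standard.

```python
from string import Template

# Hypothetical field names: adapt them to whatever your monitoring emits.
ALERT_EMAIL = Template(
    "Subject: [$severity] $service: $summary\n"
    "\n"
    "What happened: $summary\n"
    "Service: $service\n"
    "Detected at: $detected_at\n"
    "What to do: $action\n"
)


def render_alert_email(alert: dict) -> str:
    """Render an alert as email text with the fields always in the same order."""
    # substitute() raises KeyError if a field is missing, so an incomplete
    # alert fails loudly instead of being sent half-filled.
    return ALERT_EMAIL.substitute(alert)


# Illustrative values only.
example = render_alert_email({
    "severity": "CRITICAL",
    "service": "example-service",
    "summary": "Error rate above 5% for 10 minutes",
    "detected_at": "2018-06-04T02:13:00Z",
    "action": "Follow the 'high error rate' runbook",
})
```

Because every email has the same layout, a responder can find the service name and the suggested action in the same place each time.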
Don’t include sensitive information in your alerts. Alerts are likely to be shared while they’re being worked on.
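As a backstop, alert text can be passed through a redaction step before it is sent. The sketch below is illustrative only: the patterns are assumptions and far from exhaustive, and pattern matching does not replace keeping sensitive data out of alerts in the first place.

```python
import re

# Hypothetical patterns: illustrative, not an exhaustive list.
SENSITIVE_PATTERNS = [
    # key=value pairs whose key suggests a credential
    (
        re.compile(r"\b(password|token|secret|api_key)=\S+", re.IGNORECASE),
        r"\1=[REDACTED]",
    ),
    # anything shaped like an email address
    (re.compile(r"[\w.+-]+@[\w-]+(\.[\w-]+)+"), "[REDACTED EMAIL]"),
]


def redact(text: str) -> str:
    """Replace anything matching a sensitive pattern before the alert is sent."""
    for pattern, replacement in SENSITIVE_PATTERNS:
        text = pattern.sub(replacement, text)
    return text


cleaned = redact("login failed for jo@example.com password=hunter2")
# cleaned == "login failed for [REDACTED EMAIL] password=[REDACTED]"
```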
Dashboards or information radiators can be useful to show the current status of alerts, but they should not be the primary alerting mechanism. If you use them, ensure you have one reliable source rather than multiple dashboards.
People respond better to their preferred alerting method. Use a tool that lets each on-call responder receive the same alert in the way they prefer.
The audience for your alerts should dictate which alerting mechanism to use. Sending an SMS to all of your users when you have an outage might be impractical, but setting up a status page is probably sensible.
Do end-to-end tests of your alerting pipeline regularly.
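One way to approach an end-to-end test is to send a uniquely tagged test alert in at the start of the pipeline and check that it arrives at the far end within a deadline. The sketch below assumes you can supply two functions of your own, here called send_test_alert and was_received (both hypothetical). An in-memory set stands in for a real pipeline so that the example runs on its own.

```python
import time
import uuid
from typing import Callable


def check_alert_pipeline(
    send_test_alert: Callable[[str], None],
    was_received: Callable[[str], bool],
    timeout_seconds: float = 60.0,
    poll_interval_seconds: float = 1.0,
) -> bool:
    """Send a uniquely tagged test alert and wait for it to arrive."""
    # A unique marker means an old test alert cannot be mistaken for this one.
    marker = f"alert-pipeline-test-{uuid.uuid4()}"
    send_test_alert(marker)
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        if was_received(marker):
            return True
        time.sleep(poll_interval_seconds)
    return False


# In-memory stand-in for a real pipeline, for demonstration only.
inbox = set()
pipeline_ok = check_alert_pipeline(
    inbox.add,
    inbox.__contains__,
    timeout_seconds=1.0,
    poll_interval_seconds=0.01,
)
```

In a real check, send_test_alert would raise the alert through your monitoring system and was_received would confirm that it reached whatever your responders actually see.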
As a team grows or becomes more mature, it becomes more important to track alerts, escalate them (automatically or manually), acknowledge them (as work in progress) and co-ordinate rotas.
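To illustrate how acknowledgement and automatic escalation fit together, here is a minimal sketch. It is not how any particular tool implements escalation: the Alert class, the rota list and the escalation interval are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Alert:
    summary: str
    raised_at: float  # seconds, on whatever clock the caller uses
    acknowledged_by: Optional[str] = None

    def acknowledge(self, responder: str) -> None:
        """Mark the alert as work in progress."""
        self.acknowledged_by = responder


def current_responder(
    alert: Alert, now: float, rota: List[str], escalate_after: float
) -> Optional[str]:
    """Return who should be notified now, or None once acknowledged."""
    if alert.acknowledged_by is not None:
        return None
    # Move one step up the rota for each full escalation period that passes
    # without an acknowledgement, stopping at the last person.
    level = max(0, int((now - alert.raised_at) // escalate_after))
    return rota[min(level, len(rota) - 1)]


# Hypothetical rota and a 15 minute (900 second) escalation interval.
rota = ["primary", "secondary", "manager"]
alert = Alert(summary="Error rate above 5%", raised_at=0.0)
first = current_responder(alert, now=60.0, rota=rota, escalate_after=900.0)
later = current_responder(alert, now=1000.0, rota=rota, escalate_after=900.0)
```

Here first is "primary" and later is "secondary"; once someone calls alert.acknowledge(...), the function returns None and escalation stops.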
Tools for alerting
Your monitoring system should use PagerDuty to send alerts about problems with your service.
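Many monitoring tools have a built-in PagerDuty integration, which is usually the simplest option. Where you need to send an alert from your own code, the sketch below shows one possible approach, assuming PagerDuty's Events API v2. The routing key, summary and source values are placeholders, and the function names are hypothetical.

```python
import json
import urllib.request

# Events API v2 endpoint (assumed integration type for this sketch).
PAGERDUTY_EVENTS_URL = "https://events.pagerduty.com/v2/enqueue"
VALID_SEVERITIES = {"critical", "error", "warning", "info"}


def build_trigger_event(routing_key, summary, source,
                        severity="critical", dedup_key=None):
    """Build the JSON body for a 'trigger' event."""
    if severity not in VALID_SEVERITIES:
        raise ValueError(f"severity must be one of {sorted(VALID_SEVERITIES)}")
    event = {
        "routing_key": routing_key,
        "event_action": "trigger",
        "payload": {
            "summary": summary,   # specific, actionable text
            "source": source,     # the affected system
            "severity": severity,
        },
    }
    if dedup_key is not None:
        # Events sharing a dedup_key are grouped into the same alert.
        event["dedup_key"] = dedup_key
    return event


def send_event(event, timeout=10):
    """POST the event to PagerDuty and return the HTTP status code."""
    request = urllib.request.Request(
        PAGERDUTY_EVENTS_URL,
        data=json.dumps(event).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.status


# Placeholder values: nothing is sent until send_event(event) is called.
event = build_trigger_event(
    routing_key="YOUR_INTEGRATION_KEY",
    summary="example-service: error rate above 5% for 10 minutes",
    source="example-service-production",
    dedup_key="example-service-error-rate",
)
```

Keep the routing key out of source control, and note that the summary follows the same rules as any other alert text: specific, actionable and free of sensitive information.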
We recommend using Pingdom as well, as a top-level check on your service’s availability.
We recommend using Atlassian StatusPage to provide updates on service status to users.