This document is current until 1 March 2018
Teams should run regular smoke tests to ensure that services are available. Several teams at GDS have used Selenium for this.
Teams should use a tool to ensure user journeys are working as expected.
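As a rough illustration of an availability smoke test, the sketch below requests a list of URLs and reports any that fail. It uses only the Python standard library rather than Selenium, so it checks that a page responds but does not drive a browser through a user journey. The function names and the example URL are hypothetical, and treating only HTTP 200 as healthy is an assumption.

```python
import urllib.error
import urllib.request


def check_endpoint(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return (ok, detail) for a single HTTP GET smoke check.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            status = response.status
    except (urllib.error.URLError, OSError) as exc:
        # Connection failures, timeouts and HTTP error statuses end up here.
        return False, f"{url}: {exc}"
    # Assumption: only HTTP 200 counts as healthy.
    return status == 200, f"{url}: HTTP {status}"


def run_smoke_tests(urls, **kwargs):
    """Check every URL and return detail strings for the ones that failed."""
    failures = []
    for url in urls:
        ok, detail = check_endpoint(url, **kwargs)
        if not ok:
            failures.append(detail)
    return failures


if __name__ == "__main__":
    # Hypothetical URL; replace with the service's real health or start page.
    for failure in run_smoke_tests(["https://service.example/healthcheck"]):
        print("FAILED", failure)
```

Run on a schedule, a script like this could raise an alert whenever the list of failures is not empty.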
Metrics-based monitoring is useful because it works with virtual machines, PaaS and containers. Collected metrics also support capacity planning and autoscaling.
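To show how a collected metric might feed an autoscaling decision, here is a minimal sketch of a proportional scaling rule. It is one simple approach, not a description of any particular platform's autoscaler. The function name, the use of CPU utilisation as the metric, and the default instance limits are all assumptions made for illustration.

```python
import math


def desired_instances(current_instances, observed_percent, target_percent,
                      min_instances=1, max_instances=10):
    """Scale the instance count in proportion to observed vs target utilisation.

    All names and default limits here are hypothetical.
    """
    if current_instances < 1:
        raise ValueError("current_instances must be at least 1")
    if target_percent <= 0:
        raise ValueError("target_percent must be positive")
    # Proportional rule: if utilisation is 1.5x the target, run 1.5x the instances.
    raw = math.ceil(current_instances * observed_percent / target_percent)
    # Clamp to the configured bounds so the rule cannot scale to zero or run away.
    return max(min_instances, min(max_instances, raw))
```

The same calculation, applied to historical peaks rather than live readings, gives a first estimate for capacity planning.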
Use Grafana to create dashboards for viewing infrastructure and application metrics.
Application error monitoring
Multiple teams at GDS are using or evaluating Sentry for application error monitoring.
Teams should use configuration management to set up monitoring reproducibly.