The GOV.UK Service Manual describes how you should design your service to maximise uptime. However, it is inevitable that your service will experience unscheduled downtime at some point, which could have a serious impact on users. You must plan your response for when this happens.
Planning for disaster recovery
In digital services, a disaster is a significant event that either disrupts your service or completely prevents you from providing it. A disaster can have various causes, including:
- a cloud provider suffering a serious regional outage
- accidental or malicious destruction
A disaster recovery plan details how you will recover from these serious incidents.
Understand risks and threats to your service
You should work with risk and service owners to plan for the worst-case scenarios. This is particularly important for your data, as loss or theft of data is disastrous for most services.
Make sure you consider how disruption to or unavailability of your dependencies can affect your service. These dependencies can include obvious systems, such as your hosting provider, but also less obvious systems such as your:
- Domain Name System (DNS) provider
- Content Delivery Network (CDN)
- any Virtual Private Networks (VPNs) your development team might use to access production systems
- build pipelines
- external library and container registries (for example, RubyGems, PyPI or Docker Hub)
- secrets management providers
Back up your data
You should back up your data. How often you back up will depend on your Recovery Point Objective (RPO). An RPO defines how much data loss you consider acceptable between the last backup and the point of data loss. For example, if you determine that up to 24 hours of data loss is an acceptable risk, you will only need to back up your data once daily. For lower RPOs (minutes or seconds), you should use the continuous backup and point-in-time recovery services that most modern cloud providers offer.
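As a rough illustration of how an RPO translates into a backup schedule, the sketch below checks whether a given backup interval keeps the worst-case data loss within the RPO. The function names and the `backup_duration` allowance are assumptions made for this example, not part of any GOV.UK tooling.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval: timedelta,
                         backup_duration: timedelta = timedelta(0)) -> timedelta:
    """Return the longest window of data you could lose.

    Assumption: a failure just before the next backup completes loses
    everything written since the previous backup started.
    """
    return backup_interval + backup_duration


def meets_rpo(rpo: timedelta, backup_interval: timedelta,
              backup_duration: timedelta = timedelta(0)) -> bool:
    """True if the schedule keeps worst-case data loss within the RPO."""
    return worst_case_data_loss(backup_interval, backup_duration) <= rpo


# A 24-hour RPO is met by a daily backup (ignoring how long the backup takes).
print(meets_rpo(timedelta(hours=24), timedelta(hours=24)))  # True
# A 15-minute RPO is not met by hourly backups.
print(meets_rpo(timedelta(minutes=15), timedelta(hours=1)))  # False
```

If the check fails for any realistic interval, that is a sign you may need continuous backups rather than a tighter schedule.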
Have cold or offline backups
A cold or offline backup is a copy of your backup, isolated from the rest of your production system. Offline backups help protect you from some disaster scenarios. If an attacker gains access to your live database service and your backups, you could experience a ransomware attack, or risk having your entire data estate deleted. You should implement offline backups so that even users with administrative access cannot alter or delete data.
Offline backups also help protect your data from accidental deletion. For example, copy your backups to a different account with limited access rights. Make sure:
- the backups are encrypted and versioned
- the account you are copying the backups from only has write permissions
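The write-only arrangement above could look something like the following on AWS. This is a minimal sketch that assumes the offline backups are stored in an Amazon S3 bucket. The bucket name `example-offline-backups` and the helper `write_only_backup_policy` are hypothetical, and other providers have their own equivalents. Encryption and versioning are configured on the bucket itself and are not shown here.

```python
import json


def write_only_backup_policy(bucket_name: str) -> dict:
    """Build an IAM policy document that only allows uploading objects.

    Assumption: backups are stored in an S3 bucket. The policy grants
    s3:PutObject and nothing else, so the source account cannot read or
    delete existing backups. With versioning enabled on the bucket, an
    upload to an existing key adds a new version rather than replacing
    the old one.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["s3:PutObject"],
                "Resource": f"arn:aws:s3:::{bucket_name}/*",
            }
        ],
    }


# Hypothetical bucket name, used only for illustration.
policy = write_only_backup_policy("example-offline-backups")
print(json.dumps(policy, indent=2))
```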
Decide where to store your backups
Where you store your backups is also important for disaster recovery. Most cloud storage solutions have durability guarantees. However, an outage could occur in the region containing both your live service and your backups. If your service has very high availability requirements, you may need copies of your data in a different region, or with a different service provider. Your conversations with information assurance (IA) and service owners can help you determine your risk tolerance for these hopefully rare scenarios.
The National Cyber Security Centre (NCSC) article 'Offline backups in an online world' further explains why offline backups are important and suggests some approaches to use in a cloud environment.
Decide on manual or automatic recovery in the event of a disaster
Setting a Recovery Time Objective (RTO) for your service can help you decide your approach to service recovery in the event of a disaster. For example, Digital Marketplace is built to automatically recover in various scenarios, including an outage in an availability zone. It is not built to withstand a serious outage of GOV.UK Platform as a Service (PaaS) or the Amazon Web Services (AWS) region the PaaS depends upon. However, Digital Marketplace can restore its service with another provider and/or in another region manually by redeploying its applications using infrastructure as code and databases from backups. Therefore, in the event of this hopefully low-risk scenario, it has set its RTO as one week.
In this example, the services Digital Marketplace depends upon can probably restore service in less than one week. The one-week RTO therefore applies only to the worst-case scenario.
An RTO of one week is not likely to be acceptable for services where there would be a serious user impact if they were unavailable. For very low RTOs (for example, under one hour), you will need to architect your service to support automatic disaster recovery. For example, GovWifi operates in two AWS regions: Dublin and London. In the event of a serious outage in London, essential aspects of the GovWifi service will still operate from the Dublin region.
Your disaster recovery strategy will depend on your specific service requirements. You may need different strategies for different parts of your service. The more immediate and more automatic the strategy, the greater the complexity and the cost of running your service is likely to be. The Plan for Disaster Recovery article from AWS covers some common recovery patterns.
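The reasoning in this section can be sketched as a simple decision helper. This is an illustration under stated assumptions, not a rule: the one-hour threshold mirrors the "under one hour" example above, and the function name, the return strings and the three-day manual recovery estimate are all hypothetical.

```python
from datetime import timedelta

# Assumption: the threshold mirrors the "under one hour" example in the
# text. Choose a value that fits your own service.
AUTOMATIC_RECOVERY_THRESHOLD = timedelta(hours=1)


def recovery_approach(rto: timedelta, estimated_manual_recovery: timedelta) -> str:
    """Suggest a recovery approach for one part of a service."""
    if rto < AUTOMATIC_RECOVERY_THRESHOLD:
        # Very low RTOs leave no time for people to redeploy by hand.
        return "automatic"
    if estimated_manual_recovery <= rto:
        return "manual"
    # Manual recovery would take longer than the agreed RTO.
    return "automate or revisit the RTO with risk and service owners"


# One-week RTO, with a hypothetical three-day manual redeployment.
print(recovery_approach(timedelta(weeks=1), timedelta(days=3)))  # manual
# Thirty-minute RTO: manual redeployment is not an option.
print(recovery_approach(timedelta(minutes=30), timedelta(days=3)))  # automatic
```

You could run a check like this separately for each part of your service, since different parts may justify different strategies.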
Agree your RPOs and RTOs with risk and service owners
Make sure you discuss and agree your RPOs and RTOs with risk and service owners. Not only is their expertise key to decision making, but it’s vital they understand what will happen and why in the event of a disaster.
Test your disaster recovery plans
Disasters rarely happen, meaning that your disaster recovery plans are likely to become ineffective unless you test them regularly.
You must regularly test that you can restore data from your live and offline backups. Testing the offline restore procedure also makes sure team members know how to access and use the backup.
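One small part of a restore test can be automated by comparing a restored file against a checksum recorded when the backup was taken. The sketch below assumes a file-based backup and uses a plain file copy as a stand-in for the real restore step. The function names and the throwaway `backup.sql` file are hypothetical.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def restore_and_verify(backup: Path, restore_dir: Path, expected_sha256: str) -> bool:
    """Copy a backup into restore_dir and check it matches the recorded digest."""
    restored = restore_dir / backup.name
    shutil.copy2(backup, restored)  # stand-in for a real restore step
    return sha256_of(restored) == expected_sha256


# Demonstration with a throwaway file standing in for a real backup.
with tempfile.TemporaryDirectory() as tmp:
    backup_file = Path(tmp) / "backup.sql"
    backup_file.write_text("CREATE TABLE example (id INTEGER);\n")
    recorded_digest = sha256_of(backup_file)  # normally recorded at backup time

    restore_dir = Path(tmp) / "restore"
    restore_dir.mkdir()
    restore_ok = restore_and_verify(backup_file, restore_dir, recorded_digest)
    print("restore verified:", restore_ok)
```

A matching checksum only shows that the file survived intact. A fuller test would load the restored data into a working system and check that the service can use it.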
You must regularly test your manual or automatic recovery processes so that you are confident they will work in an emergency situation. This also helps team members build familiarity with the relevant procedures.
You should organise game days to practise incident management and test disaster recovery procedures. Game days enable you to play out scenarios in a safe environment so you can test and learn from how you respond. These scenarios will help build familiarity and increase confidence that your emergency procedures will work when you most need them to.
You can also practise incidents or recovery from disaster scenarios in a more low-pressure way. The GOV.UK PaaS team regularly spin the Wheel of Misfortune to help new and existing team members gain experience and confidence working with the platform.