I have been working with a customer lately, helping them put together a set of security policies. We were making good progress until we got to disaster recovery. That is when things got a little… weird. Our first draft plan said that you should have all these other plans. But what do those plans say? That has led to our current effort. We want to build a solid disaster recovery plan. Here is what we have so far.
Keep Your Disaster Recovery Plan Flexible
There are disasters, and then there are disasters. (Side note: is 2020 not the year we all became disaster experts?) Likewise, there are plans that work for large organizations and plans that work for small ones. That is why there is no “one size fits all” disaster recovery plan.
Start with disasters. What kind of disaster could you imagine happening?
- Active shooter incident
- Terrorist action
What disaster is most common in your area? Use that as the organizing point for your disaster recovery plan.
Now, think about what would happen if your organization could not operate for an extended time. Who would be affected? How? When you understand the potential impact of your organization not being there, you will have a better sense of what needs to be returned to a working state. You will also have a better notion of how quickly things need to return to “normal.” These topics will define the objectives for your disaster recovery plan.
We have talked about potential impact. Now, we want to look at potential resources. How many people in your organization have a role in mitigating a disaster? It does not help to create seventeen project teams if you are the leader for each of them. Think also about what resources you can enlist to help with disaster recovery. This will include your support partners, your service providers, possibly peer organizations and perhaps government agencies. Now you have a clearer picture of the resources you will include in your disaster recovery plan.
The Big Questions: RTO and RPO
And let’s not forget R2-D2.
Here is where organizations get stuck. RTO (Recovery Time Objective) specifies how much time should elapse before you have restored the organization’s systems and services. Does the organization want to be up and running again in a week? A day? Longer? Shorter?
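RTO’s annoying twin, RPO (Recovery Point Objective), is usually defined as the point in time you need to be able to restore data to, which amounts to how much recent work the organization can afford to lose. For planning purposes, I also use it to specify how many systems and services the organization wants to see restored by a given time. (Partial credit is also allowed: the organization can say that they would like to see some of a given system or service restored.)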
It is not easy to specify these objectives, for two reasons.
- We are not used to thinking about how quickly we would like to return to work. Even more so, we are not used to thinking about what we really need to get our work done.
- There is no wrong answer. There are simply answers that cost more to implement and answers that cost less.
Use These Suggestions to Get Your Disaster Recovery Plan Going
Allow me to suggest that you want your Recovery Time Objective to be one week or less. (You might set a faster RTO for smaller disasters.) I can say (with no supporting evidence) that even in large disasters life begins to return to normal within a week.
What about a Recovery Point Objective? Cloud services should be recoverable within a day or two. I can say the same for email. It might take most of a week to recover any systems and services that are based onsite or in a colocation facility.
As for data, I will guess that you can get your work done if you have access to the last month’s worth of data.
Now that you have RTO and RPO specified, you have a sense of what investment you need to create standby services or systems. Be prepared to have a conversation with the CFO about the expense involved to make your disaster recovery plan happen. Expect to revisit the RTO and RPO targets.
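One way to prepare for that conversation is to write each target down next to a rough estimate of how long recovery would take today. The Python sketch below is illustrative only: the service names, the estimates, and the `RecoveryTarget` and `find_gaps` names are assumptions made up for this example, and the targets loosely follow the suggestions above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class RecoveryTarget:
    service: str
    target: timedelta     # how soon the plan says the service should be back
    estimated: timedelta  # rough guess at how long recovery would take today


def find_gaps(targets):
    """Return (service, shortfall) pairs where the estimate misses the target,
    largest shortfall first."""
    gaps = [
        (t.service, t.estimated - t.target)
        for t in targets
        if t.estimated > t.target
    ]
    return sorted(gaps, key=lambda gap: gap[1], reverse=True)


# Hypothetical services and estimates, for illustration only.
plan = [
    RecoveryTarget("email", timedelta(days=2), timedelta(days=1)),
    RecoveryTarget("cloud file storage", timedelta(days=2), timedelta(days=3)),
    RecoveryTarget("onsite accounting server", timedelta(days=7), timedelta(days=14)),
]

for service, shortfall in find_gaps(plan):
    print(f"{service}: misses target by {shortfall.days} day(s)")
```

Each gap in the list is a place where you either spend money on standby capacity or relax the target, which is the trade-off to bring to the CFO.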
React, Respond, Recover, Restart
(H/T to Suki O’Kane for sharing this framework with me.) Now we talk about the “doing” part of your disaster recovery plan.
In fact, before we get to that, gather some information. Do you have contact information (email, mobile, chat) for all your employees? Do you have a “go” kit with contact details (including website links) for your service providers? Do you know how to reach first responders? Finally, do you have a phone tree or some alerting system that can quickly notify all employees of an emergency?
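If your employee contact list lives in a spreadsheet or export, a short script can flag who is missing a channel before you need it. This is a minimal sketch, assuming the roster has been loaded into a dictionary; the names, addresses, and the `missing_contacts` helper are hypothetical.

```python
# Channels named in the checklist above.
REQUIRED_CHANNELS = ("email", "mobile", "chat")


def missing_contacts(roster):
    """Map each employee to the required channels that have no value on file."""
    gaps = {}
    for name, channels in roster.items():
        missing = [
            channel
            for channel in REQUIRED_CHANNELS
            if not (channels.get(channel) or "").strip()
        ]
        if missing:
            gaps[name] = missing
    return gaps


# Hypothetical roster, for illustration only.
roster = {
    "Alex": {"email": "alex@example.org", "mobile": "555-0100", "chat": "@alex"},
    "Sam": {"email": "sam@example.org", "mobile": "", "chat": "@sam"},
    "Jo": {"email": "jo@example.org"},
}

print(missing_contacts(roster))
```

In this example Sam has no mobile number and Jo has neither a mobile number nor a chat handle, so both would be unreachable by at least one channel in an emergency.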
If the disaster presents health and safety issues, start there. You need a “fast reaction” team that can escort people to safety, sweep the building for stranded staff, and take a census of who has been moved to safety. You need someone who will coordinate with police, fire, etc.
After the immediate threat has cleared, you need to respond. You will want to include IT staff and service providers on this team. Your first step will be to see what is working and what is not. Can you access the Internet? Are you able to ping your servers? Are cloud services operational? Will staff be able to work from home for now?
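The “what is working?” questions can be partly scripted ahead of time. The sketch below is one possible approach, not a complete monitoring tool: it tries a TCP connection to each target rather than a true ICMP ping, because raw ping usually needs elevated privileges. The host names and ports in `CHECKS` are placeholders to replace with your own.

```python
import socket


def is_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and failed name lookups.
        return False


def triage(checks, timeout=3.0):
    """Run every named check and report 'up' or 'down' for each one."""
    return {
        name: "up" if is_reachable(host, port, timeout) else "down"
        for name, (host, port) in checks.items()
    }


# Placeholder targets: substitute your own servers and services.
CHECKS = {
    "internet (public DNS)": ("1.1.1.1", 53),
    "file server": ("fileserver.example.internal", 445),
    "cloud email": ("outlook.office365.com", 443),
}
```

Calling `triage(CHECKS)` from a laptop on the office network, and again from home, gives the response team a quick first picture. A successful connection only shows that something is listening, so follow up by actually signing in to anything marked “up.”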
Take the answers to these questions, together with the RTO and RPO targets you have already worked out. Use these inputs to define your response and (possibly separate) recovery plan. You might activate the Azure Site Recovery setup that you created, so that applications and data can be restarted. Perhaps you need to activate a data restore to a temporary server or cloud instance. You might distribute some “loaner” laptops or mobile phones to staff who need them for now.
Congratulations. You have restored critical services, or at least put a workaround in place for them. Your recovery team will be focused on restoring services that are now being handled in some “workaround” fashion. This team will overlap quite a bit with the Response team. It may include some support providers that were not needed before. For instance, you may need to bring in someone to build a new server and connect it to the network.
Once you have recovered all systems and services from the disaster, it is time to ask, “What can we learn from this experience? How should we change our disaster recovery plan to account for these lessons learned?” Are there steps the organization can take that would reduce the impact of a similar disaster in the future? I know a few customers that are accelerating their adoption of ACH payments, as one example. And we have seen plenty of organizations respond to the pandemic by accelerating the migration of applications to the cloud.
Communications is the Glue
The Communications team is different from the earlier ones in that it operates through all phases of the disaster and response. Staff, stakeholders, peer organizations, even regulatory bodies all need to know what is happening. You must communicate status (where are we?) and plans (what are we doing?) frequently to assure everyone that things are going to be alright. You might think, “I can handle this on my own.” That may be true, but it certainly helps to offload the communications to another team while you focus on discovering and fixing problems.
This is a Test…
Your disaster recovery plan looks great. You have secured commitments to participate in the different teams from the right people and partners. Now you must answer this question: will it work? This is when you run a “tabletop exercise.” Gather the teams, outline the disaster scenario, and have the teams describe their response. Undoubtedly, you will uncover areas where responsibilities are unclear, or handoffs from one team to another have not been defined. Make the necessary changes to your disaster recovery plan. And feel better knowing that if the worst happens, you will be as prepared as you can be for it.