Disaster Recovery Testing Best Practices

Disaster has worked its way into our everyday lives. Whether it’s climate disaster, extreme weather, human error or cybersecurity concerns, individuals and businesses have to learn how to navigate around interruptions and outages instead of simply hoping the worst-case scenario won’t happen.

According to the World Economic Forum, some of the top global risks for 2023 that will impact society over the next 2-year period include natural disasters & extreme weather events, large-scale environmental damage incidents, and widespread cybercrime and cyber insecurity, among other disaster-related topics. All of these risks are also projected to have an ongoing impact in the decade to come.

The pandemic has had far-reaching effects, and cyber risk is no exception: remote working conditions and economic insecurity have contributed to an increase in cyberattacks. In the face of these natural and man-made challenges, businesses need to ensure their data, applications, and workloads are protected. The best way to do this is through disaster recovery testing, which shows whether your disaster recovery plan actually works before you need it.

What is a Disaster Recovery Plan?

A disaster recovery plan adds structure to the company’s response to various types of disaster scenarios as they relate to IT systems and infrastructure. It is often one component of a larger business continuity plan. Having a plan allows you to respond and recover quickly when everyone is running around in panic mode.

Disaster recovery planning falls squarely into IT’s domain, since IT is responsible for keeping the business’s information systems up and running. The primary metrics used when planning are the Recovery Time Objective and the Recovery Point Objective. Disaster recovery plan success is all about how well you meet those targets when the chips (and your systems) are down.

  • Recovery Time Objective (RTO): The targeted length of time from failure to restoration of business systems and services after a disaster.
  • Recovery Point Objective (RPO): The maximum amount of data loss the business deems acceptable following a disaster or failure.
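
To make the two definitions above concrete, here is a minimal sketch of how a drill’s results might be compared against them. The function name, the 4-hour RTO, the 6-hour RPO, and all timestamps are hypothetical values chosen for illustration, not recommendations.

```python
from datetime import datetime, timedelta


def measure_recovery(failure_at, restored_at, last_recoverable_at):
    """Return (recovery_time, data_loss) as timedeltas for one event or drill."""
    recovery_time = restored_at - failure_at      # compared against the RTO
    data_loss = failure_at - last_recoverable_at  # compared against the RPO
    return recovery_time, data_loss


# Hypothetical targets and drill timestamps, for illustration only.
rto = timedelta(hours=4)
rpo = timedelta(hours=6)

failure = datetime(2023, 5, 1, 9, 0)
restored = datetime(2023, 5, 1, 12, 30)
last_good_copy = datetime(2023, 5, 1, 2, 0)

recovery_time, data_loss = measure_recovery(failure, restored, last_good_copy)
rto_met = recovery_time <= rto
rpo_met = data_loss <= rpo

print(f"Recovery time: {recovery_time} (RTO met: {rto_met})")
print(f"Data loss window: {data_loss} (RPO met: {rpo_met})")
```

In this made-up drill, systems came back within the RTO, but the newest recoverable copy was older than the RPO allows, so the two targets are evaluated separately.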

In contrast to disaster recovery planning, the business continuity plan often involves decision-makers from other departments, including legal and public relations. The business continuity plan will answer questions such as: What is the company’s communication strategy to customers and the market in the event of a data breach? How do I continue business operations after a disaster event? How do I communicate with employees after a disaster?

Of course, there will be overlap, so it’s vital for IT and decision-makers from other functional groups to create a disaster recovery and business continuity plan that is aligned on all fronts.

Why should you perform disaster recovery testing? And when should you do it?

Imagine being in an actual disaster and your IT recovery plan falls flat. What could happen? Without disaster recovery rehearsal drills, you may end up with displaced workers, irreplaceable data loss, broken trust with clients, or even unsafe working conditions that could place someone in harm’s way. Testing your disaster recovery (DR) plans can ensure that when a disaster does occur, you’re not caught off-guard.

How often you test your DR plan will depend on how vulnerable your business is to different disasters, what mission-critical workloads you want to be sure to test, and how many scenarios you want to run. You may want to run lighter tests, like procedural walkthroughs, quarterly, while a full simulation might be conducted once a year. You may also want to run disaster recovery rehearsal drills more often right after you’ve built your plan to ensure it works as intended, and test it again after making adjustments. Let’s be honest – there’s no such thing as being too prepared.

What are the types of disaster recovery testing a business can perform?

When it comes to testing, there are three main ways you can determine whether your disaster recovery plan is up to snuff: isolated (bubble) rehearsal, non-isolated rehearsal, and live failover testing.

Isolated rehearsal

Also known as bubble testing or controlled testing, isolated rehearsal is done in a sandbox separate from the production environment. Multiple networks can also be within the test environment – it’s considered a bubble when there’s no connectivity with the internet or anything on the outside.

Once an isolated rehearsal has launched, customers are given connectivity to log in through methods like VPN. They then confirm that each server boots up in the test environment and that the data has been restored in a way that meets their recovery point objectives (RPO) and recovery time objectives (RTO). Depending on its environment, a business may also be able to test all of its applications in a bubble. However, if connectivity requirements make that impossible, a non-isolated rehearsal may be needed.

Non-isolated rehearsal

A non-isolated rehearsal offers the second tier of disaster recovery testing, which businesses might need for applications and functionality that rely on additional connectivity, such as to the internet or a SQL server database. This testing is still done with the goal of not affecting production, but a hole is opened into the bubble to allow for some kind of connection, oftentimes to simulate how a customer might be going into the environment.

TierPoint’s disaster recovery solutions include running an annual isolated rehearsal in a non-live environment. Companies looking for non-isolated rehearsals often opt for running them as additional tests, but this type of rehearsal is not as common and often requires discussion beforehand to determine whether it can be done.

Live failover

For organizations that want to test what a total shift from production to a recovery site would look like, a live failover is the way to go. This kind of test most closely resembles what an actual disaster looks like. A live failover allows a customer to completely recover on a separate site, operate there for several days, and revert to the production site afterwards. There are two types of live tests organizations can do. If they’re changing IP addressing during a live failover, they can do partial testing with applications or subsets of servers. If they’re not changing IP addressing, everything needs to be included in the failover.
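
The partial-versus-full rule described above can be written as a small check. This is only a sketch of that rule as stated in this article; the function name and server names are hypothetical.

```python
def failover_scope(all_servers, requested_subset, changing_ip_addressing):
    """Return the servers that must take part in a live failover test."""
    unknown = set(requested_subset) - set(all_servers)
    if unknown:
        raise ValueError(f"Not in the environment: {sorted(unknown)}")
    # Partial testing is only an option when IP addressing changes.
    if changing_ip_addressing and requested_subset:
        return sorted(requested_subset)
    # Otherwise everything has to be included in the failover.
    return sorted(all_servers)


# Hypothetical server names, for illustration only.
servers = ["app01", "db01", "web01"]
partial = failover_scope(servers, ["web01"], changing_ip_addressing=True)
full = failover_scope(servers, ["web01"], changing_ip_addressing=False)
print(partial, full)
```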

Depending on industry and compliance requirements, the type of disaster recovery test required of a business will vary. To meet needs for cyber insurance, a bubble test is often sufficient. However, for financial and insurance industries, there may be a greater need to perform a non-isolated rehearsal or a live failover. It’s also not unusual for a business to ask about additional disaster recovery testing options after performing an initial bubble test.

Why is disaster recovery testing important?

What’s your business tolerance for downtime? Organizations need to think about the cost of downtime not just in terms of revenue and hours, but also in terms of ability to service customers and reputation.

Being ready for any scenario that might arise means that you can keep running business as usual, or as close to usual as you can get. Improving your business continuity will help you build internal and external trust, retain business, keep employees safe and productive, and keep your revenue flowing with minimal hiccups.

If you end up facing an unexpected disaster, without DR testing, recovery can be rough. It may even be so detrimental to your business that you never fully recover. A successful rehearsal would include identifying problems that need to be fixed before an actual disaster occurs. Prevention and preparation are key here.

Types of disasters and their impact on your business

Disasters can be natural, man-made, external, or internal – understanding where they come from and how they can negatively impact your business is the first step to making a solid disaster recovery plan that can be tested against real scenarios.

Natural disasters

Tornadoes, hurricanes, earthquakes, and fires are all natural disasters, and weather-driven events in particular have been on the rise in recent years. Because many of these disasters put physical infrastructure in danger, including data centers and office buildings, it’s important to make plans to safeguard your data and your workers. Hosting your data in an area with a lower likelihood of extreme weather, having a backup site in a geographically distinct location from your main site, and enabling remote or offsite working for employees should natural disasters impact your headquarters are all factors that should be considered.

Pandemics and other biological disasters

We’ve seen first-hand how a global pandemic changed the way we work. Some changes that have come about – an increase in remote work, an increased need for cybersecurity measures – will never fall back down to pre-pandemic levels. Businesses that were able to adapt quickly, enabling their workforce to be remote and embrace video conferencing, for example, fared better early on. How will your business remain flexible in the face of future global health crises?

Man-made disasters

Even though there are malicious actors who can orchestrate an attack on your data, what is often just as dangerous to your business is an accidental action from a well-intentioned person on the inside. You need to be ready for both.

Hackers might infiltrate your organization through social engineering schemes, like spear phishing. They might send malware or perform a brute force attack on your systems. They could quietly infiltrate and later hold your data hostage for ransom.

Unfortunately, you don’t only have to worry about the “bad actors.” There are plenty of opportunities for well-meaning internal team members to create disasters. This could look like sending a reply to the wrong email address, neglecting to set up multi-factor authentication, misplacing a key card, or accidentally deleting data that isn’t currently being backed up. Without testing for both purposeful and accidental man-made disasters, it’s easy for an “oops” to be missed.

5 steps and best practices for disaster recovery testing

Embarking on DR testing can feel like a lot to take on, but by breaking it down into steps and revisiting your plan, it becomes more manageable.

Step 1: Map out every scenario you can imagine

The first step is the most involved, but also the most important. The more time you spend envisioning every disaster scenario that could befall your business, the more you’ll uncover pieces that need to be tested. For example, the parts of your business that stop functioning when a tornado hits will be different from the parts that stop functioning in a ransomware attack.

Step 2: Schedule your testing frequencies

Decide which types of tests should be run and at what frequencies. Choose a time annually to run a full rehearsal (at a minimum). Placing these tests on your calendar will also help you see what training programs you may need to run organization-wide to prepare for full rehearsal, or what onboarding pieces need to be added.

Step 3: Take detailed notes on your tests

Once you determine the frequency of the tests and run them, document your successes, failures, and lessons learned. What didn’t turn out as planned? What results look different from the defined standards? What are you going to do between now and the next DR test to ensure it goes better? Which departments need to be better prepared in a full rehearsal? Document all of your observations.

Step 4: Be clear about what “good” looks like

It’s great to strive for perfection, but it’s also reasonable to accept that certain results from your disaster recovery test may fall within an acceptable margin and still be considered good outcomes. Your notes should include an evaluation of whether what happened in the tests fell within allowable ranges. If your recovery point objective is 6 hours, meaning recovered data should be no more than 6 hours old, and the most recent data you could restore in the test was 8 hours old, is that good? You missed the target by 2 hours, so it’ll depend on the margin your defined standards allow.
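
One way to make “good” explicit is to write the allowable margin down next to the target. The sketch below reads the example above as a 6-hour RPO with the newest restorable data being 8 hours old; the `grade` function and the 2-hour margin are hypothetical, and your own defined standards would supply the real numbers.

```python
from datetime import timedelta


def grade(measured, target, margin=timedelta(0)):
    """Classify a measured value against its target and an agreed margin."""
    if measured <= target:
        return "met"
    if measured <= target + margin:
        return "within margin"
    return "missed"


rpo_target = timedelta(hours=6)
measured_data_age = timedelta(hours=8)

# With no margin defined, the 8-hour result is a miss.
strict = grade(measured_data_age, rpo_target)
# With a hypothetical 2-hour margin, the same result is acceptable.
lenient = grade(measured_data_age, rpo_target, margin=timedelta(hours=2))

print(strict, "/", lenient)
```

The same result grades differently depending on the margin, which is why the margin needs to be agreed on before the test rather than after it.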

Step 5: Refine and test as needed

Review your notes and adjust as necessary for your next rehearsal. Decide whether you need to test more frequently for a while to be sure you can meet your goals.

Disaster recovery testing factors to keep in mind

Along with the steps of your disaster recovery test, you’ll also want to consider the following factors and their importance to your overall disaster recovery plan.

Timing

Think about timing both in terms of when you run your tests as well as your recovery point objectives (RPO) and recovery time objectives (RTO).

The frequency of the tests should coincide with opportune times to conduct them. Do you experience seasonal surges in demand? Do you have periods of time when mission-critical workflows are even more critical? Plan your tests for before these periods, so that when it’s “go-time,” you haven’t left any stones unturned.
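
As a rough illustration of putting tests on the calendar ahead of a busy period, the sketch below places lighter walkthroughs at the start of each quarter and a full rehearsal a fixed number of weeks before a peak season begins. The function name, the six-week lead time, and the peak date are all assumptions made for the example.

```python
from datetime import date, timedelta


def plan_tests(year, peak_start, lead_weeks=6):
    """Return quarterly walkthrough dates and one full rehearsal date."""
    walkthroughs = [date(year, month, 1) for month in (1, 4, 7, 10)]
    # Schedule the full rehearsal ahead of the peak so there is time to fix findings.
    full_rehearsal = peak_start - timedelta(weeks=lead_weeks)
    return {"walkthroughs": walkthroughs, "full_rehearsal": full_rehearsal}


# Hypothetical peak season start, for illustration only.
plan = plan_tests(2024, peak_start=date(2024, 11, 25))
print(plan["full_rehearsal"])
for day in plan["walkthroughs"]:
    print(day)
```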

Recovery time objectives (RTO) define how quickly you need to restore your workloads, whereas recovery point objectives (RPO) define how old the most recent recoverable data is allowed to be. If you only run daily backups, you could lose up to 24 hours of data, so the tightest RPO you can realistically meet is about 24 hours. If you can lose a lot of business in an hour or two, you’ll want your RTO to reflect this, especially during working/operational hours.
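
The relationship between backup frequency and RPO is simple arithmetic, sketched below. It assumes each backup captures the state of the data at the moment the backup starts and only becomes usable once it finishes; the function names and the one-hour backup duration are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(backup_interval, backup_duration=timedelta(0)):
    """How old the newest usable backup can be at the moment of failure."""
    # Worst case: failure hits just before the next backup finishes.
    return backup_interval + backup_duration


def max_backup_interval(rpo, backup_duration=timedelta(0)):
    """Longest gap between backups that still keeps data loss within the RPO."""
    return rpo - backup_duration


daily = timedelta(hours=24)
loss_daily = worst_case_data_loss(daily)
loss_daily_slow = worst_case_data_loss(daily, backup_duration=timedelta(hours=1))
interval_for_6h_rpo = max_backup_interval(timedelta(hours=6), timedelta(hours=1))

print(loss_daily, loss_daily_slow, interval_for_6h_rpo)
```

Under these assumptions, daily backups cannot support a 6-hour RPO; backups would need to run at least every 5 hours if each one takes an hour to complete.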

Changes

What changes do you have to make to accommodate different disasters? As you work on your disaster recovery plan and subsequent tests, you may find some changes that need to be made. What’s the game plan for implementing these refinements? At what point are people involved in making changes?

Impact

You might consider the impact in a few different ways:

  • The impact disasters of different types have on different parts of your business and for how long
  • The impact end users can have on mitigating disaster
  • How testing can work to minimize the impact of disasters on your business

People

When a disaster occurs, what types of stakeholders or users will be affected? Who will you have to notify, move, or safeguard? To be covered from all angles, who needs to be involved in disaster recovery tests at different levels?

Disaster recovery testing checklist

When it’s time for your disaster recovery testing, follow these steps to ensure success:

  • Establish RPOs and RTOs: Before making any kind of plan, a business needs to have well-established recovery point objectives (RPOs) and recovery time objectives (RTOs). How long can you go without having your applications up and running? Which pieces are most critical to your operations? How old can recovered data be to still prove useful to you? These numbers may vary by team, so establishing specific objectives is critical.
  • Create a comprehensive runbook: Without a runbook, there’s no rehearsal. Core pieces for the runbook should include the order servers need to come up in the recovery environment, critical people who can declare an emergency, notes on whether IP addresses are changing, and what the rehearsal will look like. Key responsibilities should be stated so that there’s a clear path forward and everyone knows who oversees what on either side.
  • Run the rehearsal: Once everything’s replicated, then what? Ensure the production environment is isolated from recovery. Start bringing up the servers in relevant boot orders specified by the business (or boot them all at once). The business will come in, start logging into servers, and verify data & applications are working as intended.
  • Put notes in the runbook: Report your findings into the runbook at the end of the drill. If something went wrong – there was a problem with an application, a server came up corrupt, etc. – document the failure and what to fix. If there isn’t an immediate fix, document the workaround in the runbook.
  • Take down the environment: If the test is in a bubble, bring it down and clean everything up. Walk away with your findings and run the test again next year.
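
A runbook does not have to live only in a document. As a sketch of the core pieces listed above (boot order, who can declare an emergency, whether IP addresses change, and findings from the drill), the structure below derives a boot order from server dependencies using Python’s standard `graphlib` module. The `Runbook` class, its field names, and the server and role names are hypothetical and do not describe any particular product.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Runbook:
    depends_on: dict             # server -> set of servers that must be up first
    can_declare_emergency: list  # people or roles allowed to declare a disaster
    ip_addresses_change: bool    # whether IP addressing changes in recovery
    findings: list = field(default_factory=list)

    def boot_order(self):
        """Return servers in an order that respects every dependency."""
        return list(TopologicalSorter(self.depends_on).static_order())

    def record_finding(self, server, problem, action):
        """Log a failure plus its fix or workaround at the end of the drill."""
        self.findings.append({"server": server, "problem": problem, "action": action})


# Hypothetical environment, for illustration only.
runbook = Runbook(
    depends_on={
        "domain-controller": set(),
        "database": {"domain-controller"},
        "app-server": {"database"},
        "web-server": {"app-server"},
    },
    can_declare_emergency=["IT director", "COO"],
    ip_addresses_change=False,
)

order = runbook.boot_order()
runbook.record_finding("app-server", "service failed to start", "restart after database is up")
print(order)
```

Keeping dependencies rather than a hand-written list means the boot order stays consistent when servers are added, and a circular dependency raises an error instead of surfacing mid-rehearsal.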

Guarantee Disaster Recovery Success with TierPoint

Building a strong disaster recovery plan, and then testing that plan, requires extensive preparation and diligence. Missing one potential disaster scenario could stand in the way of your business continuity plans. At TierPoint, our disaster recovery as a service (DRaaS) offering, backup services, and other disaster recovery & business continuity services can help facilitate testing that ensures resiliency for your data, applications, and infrastructure.

Interested in learning more about what you can do to prevent and overcome potential business continuity challenges? Download The Ultimate Guide to Running Your Business Through Uncertainty and Disruption here.


