TechRepublic Resource Guide: Disaster Planning and Recovery
Infrastructure expert Rick Schiesser discusses the key steps involved in implementing a disaster
recovery strategy.
Debra Littlejohn Shinder, MCSE, MVP, outlines the essential elements that belong in your
disaster readiness plan.
From network connectivity and storage to systems performance and documentation, Rick Vanover
highlights critical considerations for developing comprehensive disaster recovery procedures.
Your disaster recovery plan can’t be limited to technology systems; you also have to take human
nature into account. Here Mike Talon underscores this vital aspect of disaster planning.
Use this handy guide when developing your own crisis communications policy.
Copyright ©2008 CBS Interactive, Inc. All rights reserved.
For more downloads and a free TechRepublic membership, please visit http://techrepublic.com.com/2001-6240-0.html
During the development of these applications, we began initial discussions about developing a disaster
recovery plan for these AS/400s and their critical applications. In February 1995, the effort got a major
jumpstart from an unlikely source. A distribution transformer that powered the AS/400 computer room
from outside the building short-circuited and exploded. The damage was so extensive that repairs were
estimated to take up to five days. With no formal recovery plan yet in place, IT personnel, suppliers, and
customers all scurried to minimize the impact of the outage.
With the help of one of the company’s key vendors, we quickly identified and activated a makeshift
disaster recovery site located 40 miles away. Within 24 hours, the studio’s AS/400 operating systems,
application software, and databases were all restored and operational. This makeshift solution met most
of the critical needs of the AS/400 customers during the six days that it eventually took to replace the
failed transformer.
Three important lessons learned
This incident accelerated the development of a formal disaster recovery plan. It also underscored three
important points about recovering from a disaster. The first point is that there are noteworthy
differences between the concept of disaster recovery and that of business resumption. I’m defining
business resumption as the ability to perform critical department processes as soon as possible after the
initial outage. Full recovery from the disaster usually occurs many days after the start of the business
resumption process.
In this case, we restored most of the company operations affected by the outage in less than a day after
the transformer exploded. It took nearly four days to replace all the damaged electrical equipment and
another two days to restore operations back to their normal state. Distinguishing between these two
concepts helped during the planning process for the formal disaster recovery program—it let us focus
on business resumption in meetings with key customers, while we focused on disaster recovery with key
suppliers.
The second point is that most computer center outages are caused by relatively small, localized
incidents like broken water mains, fires, smoke damage, or electrical equipment failures—not the flash
floods, powerful hurricanes, or devastating earthquakes frequently highlighted in the media.
This isn’t to say that you shouldn’t be prepared for such a major disaster. Infrastructures that plan and
test recovery strategies for smaller incidents are usually well on their way to developing a program to
handle any size of calamity. While major calamities do occur, they are far less likely and are often
overshadowed by the more widespread effects of the disaster on the community. What usually makes a
localized computer center disaster so challenging is that the rest of the company is normally operational
and desperately in need of the computer center services that have been disrupted.
The third point is that this extended outage prompted executive management to make a firm
commitment to a formal disaster recovery plan. In many ways disaster recovery is like an insurance
policy: You don’t really need it until you really need it. This commitment was the first important step
toward developing an effective disaster recovery process. A comprehensive program requires hardware,
software, budget, and the time and efforts of knowledgeable personnel. The support of executive
management is necessary to make these resources available.
Steps to developing an effective disaster recovery process
Another reason this support is important is that managers are typically the first to be notified when a
disaster occurs. This sets off a chain of events involving management decisions about deploying the IT
recovery team, declaring an emergency to the disaster recovery service provider, notifying facilities and
physical security, and taking whatever emergency preparedness actions may be necessary. By involving
management early in the design process, you secure their emotional and financial buy-in, thus
increasing the likelihood that management will understand and fulfill its role in the disaster recovery
process.
The executive sponsor has several other responsibilities. One is selecting a process owner. Another is
getting the support of other managers to ensure that participants are properly chosen and committed to
the program. These other managers may be direct reports, peers within IT, or, in the case of facilities,
outside of IT. Finally, the executive sponsor needs to demonstrate ongoing support by requesting and
reviewing frequent progress reports, offering suggestions for improvement, questioning unclear
elements of the plan, and resolving issues of conflict.
7. Choose participants and clarify their roles for the recovery team.
The cross-functional team chooses the individuals who will participate in the recovery activities after any
declared disaster. The recovery team may be similar to the cross-functional team but should not be
identical. Additional members should include the executive sponsor, key customer representatives, and
representatives from any outside service providers. Once the recovery team is selected, it’s imperative
that each individual’s role and responsibility be clearly defined, documented, and communicated.
distribution. Documentation of the plan must also include up-to-date configuration diagrams of the
hardware, software, and network components involved in the recovery.
Nightmare incidents
During many years of managing and consulting on IT infrastructures, I’ve encountered a number of
nightmarish disaster recovery incidents. Some are humorous, some are “head-scratching,” and some are
just plain bizarre. In all cases, they totally undermined what would have been a successful recovery from
either a real or simulated disaster. Fortunately, no single client or employer with whom I was associated
ever experienced more than any two of these, but in their eyes, even one was unacceptable. These
incidents, listed below, illustrate how critical planning, preparation, and performance are to a good
disaster recovery:
The first four incidents all involve the handling of the backup tapes required to restore copies of data
rendered inaccessible or damaged by a disaster. Verifying that the backup and, more importantly, the
restore process are completing successfully should be one of the first requirements of any disaster
recovery program. While most shops verify the backup portion of the process, more than a handful of
shops don’t test to verify that the restore process also works. Labels and locations can also cause
problems when tapes are marked or stored improperly.
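One way to test the restore side, and not only the backup side, is to restore to a scratch location and compare checksums against the original files. The following is a minimal Python sketch under that assumption; the function names (`file_digest`, `verify_restore`) and the directory layout are illustrative and do not come from any particular backup product.

```python
import hashlib
from pathlib import Path


def file_digest(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_restore(source_dir: Path, restored_dir: Path) -> list[str]:
    """Compare every file under source_dir with its restored copy.

    Returns the relative paths that are missing from the restore or whose
    contents differ. An empty list means the restore matched the source.
    """
    problems = []
    for src in source_dir.rglob("*"):
        if not src.is_file():
            continue
        rel = src.relative_to(source_dir)
        restored = restored_dir / rel
        if not restored.is_file() or file_digest(src) != file_digest(restored):
            problems.append(rel.as_posix())
    return sorted(problems)
```

Running a check like this after each scheduled test restore, and logging the result, gives the recovery team evidence that the tapes can actually be read back.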
Although a rare case, I do know of a client who was unable to retrieve a tape because the offsite tape
storage supplier hadn’t been paid in months. Fortunately, it was not during a critical recovery.
Communication to, documentation of, and training of all shifts on the proper recovery procedures are
critical. Third-shift graveyard operators often receive the least of these due to their off hours and
higher-than-normal turnover. These operators need to know whom to call and how to contact offsite
recovery services.
Classified environments can present their own brand of recovery nightmares. One of our classified
clients had applied for a security clearance for its offsite tape storage supplier and had begun using the
service prior to the clearance being granted. When the client’s military customer found out, the tapes
were confiscated. In a related issue, a separate defense contractor cleared its offsite vendor to a
secured program but failed to clear the one individual who worked nights when a tape was requested
for retrieval. The unclassified worker couldn’t retrieve the classified tape that night, delaying the
restoration of the data for at least a day.
The last two incidents involve tape canisters used during a full dry-run test of restoring and running
critical applications at a remote hot site 3,000 miles away. The airline in question had just changed its
carry-on baggage policy, which meant the recovery team couldn’t keep the tape canisters with them.
Making matters worse was the fact that the canisters were mislabeled, which cost over six hours of
restore time. There was much to talk about during the marathon postmortem session that followed this
incident.
Your company’s response to a disaster will depend on both the nature and the extent of the disaster.
Some threats, such as a tornado or flood, may physically destroy your IT infrastructure. Others, such as
pandemic disease, affect human resources while leaving buildings and machinery intact. A
cyberterrorism attack might bring down your network but not affect the functionality of the hardware
or your personnel. A bombing may destroy both human life and network components. A power outage
could render your equipment unusable, but do no lasting damage. Thus your plan should cover
contingencies for as many threat types as possible.
2. Areas of responsibility
A key component in any crisis management situation—which is what you have during and perhaps
immediately after the disaster—is assignment of areas of responsibility and establishment of a chain of
command. This is no time to have department heads squabbling about who has decision-making
authority. And remember that some types of disasters may result in loss of personnel (or some of your
staff may be on vacation or out sick when the event occurs), so be sure to assign alternates in case some
of the important players are not available.
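A chain of command with alternates can be written down as plain data so that nobody has to argue about it during the crisis. The sketch below is one hypothetical way to model it in Python; the `Role` class, the `who_is_in_charge` function, and the job titles in the usage note are assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Set


@dataclass
class Role:
    """A responsibility with a primary owner and an ordered list of alternates."""
    name: str
    primary: str
    alternates: List[str] = field(default_factory=list)


def who_is_in_charge(role: Role, unavailable: Set[str]) -> Optional[str]:
    """Return the first available person in the chain, or None if nobody is."""
    for person in [role.primary, *role.alternates]:
        if person not in unavailable:
            return person
    return None
```

For example, with `Role("Incident commander", "CIO", ["IT director", "Facilities manager"])` and the CIO on vacation, the function returns "IT director".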
Training of key personnel in disaster preparedness, incident management, and recovery should also be
addressed.
Your plan should include up-to-date contact information on people and entities that may need to be
contacted when a disaster occurs. This is no time to be scrambling for phone numbers. Information
should be included for both internal personnel (CEO, CIO, legal advisor, etc.) and external personnel and
services (police, fire, ambulance, security services, utility companies, building maintenance, etc.).
4. Recovery teams
It will take teamwork to manage the crisis itself and to put things back together once the immediate
crisis is over. The BCP should appoint members of a disaster recovery team (DRT) made up of specialists
with training and knowledge to handle various aspects of common disasters (safety specialist, IT
specialist, communications specialist, security specialist, personnel specialist, etc.). The DRT members
will work with emergency services during the disaster and should have access to equipment they’ll need
during an emergency (cell phones, flashlights, hard hats, protective clothing, etc.).
A business recovery team is responsible for reestablishment of normal operations after the crisis is over.
Any good business continuity plan will address restoration of your company’s important digital data if it
is destroyed. Too many organizations meticulously make backups of everything and then store those
backups in the server room. If a tornado, flood, or bomb destroys the building, that (often irreplaceable)
data is gone, too.
You should store copies of important data on removable media that’s kept at a different physical
location or back it up over the Internet to a remote server, or both. Just as important, key personnel
should know where it’s stored and have the keys, passwords, etc., to be able to restore it to get users
back to a productive state as soon as possible.
Many types of physical disasters can result in a loss of electrical power, or a power outage can, itself, be
the disaster. For continuity of business, your organization should plan for what to do in case of a long-
term outage (more than the hour or less that your uninterruptible power supplies will keep your
computers and network equipment running).
If you have backup generators in place, ensure that key personnel know how to switch to generator
power and know the fuel requirements for the generators (must they be fueled or do they run off the
natural gas line?), among other practical issues. Consider cost factors to determine when and for how
long the generators should be run. Providing full electrical power to a building with a generator can cost
much more than using the power grid, so the BCP should discuss in what situations it’s better to close
down operations and send everyone home rather than run on generator power, and it should define
who has the authority to make that decision.
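The generator decision can be framed as simple arithmetic: compare the hourly cost of running on generator power with the hourly cost of being closed, and check how long the fuel on hand will last. The Python sketch below is illustrative only; the function names and every figure in the usage note are hypothetical and should be replaced with your own numbers.

```python
def should_run_on_generator(generator_cost_per_hour: float,
                            downtime_cost_per_hour: float) -> bool:
    """True when staying open on generator power costs less per hour than closing."""
    return generator_cost_per_hour < downtime_cost_per_hour


def max_runtime_hours(fuel_on_hand: float, burn_rate_per_hour: float) -> float:
    """Hours of generator runtime available from the fuel already on site.

    Both arguments must use the same fuel unit (for example, gallons).
    """
    if burn_rate_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    return fuel_on_hand / burn_rate_per_hour
```

With a hypothetical generator cost of 150 per hour against a downtime cost of 400 per hour, staying open is cheaper; with 500 gallons on hand and a burn rate of 25 gallons per hour, the site has 20 hours before it needs a fuel delivery.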
If your company’s phones and/or Internet connection are down, how will you keep in touch with
customers and off-site employees, contact emergency services, and so on? Your BCP should note which
employees have cell phones and their numbers, as well as whether and where you have other methods
of communicating during a widespread disaster, such as ham radios. If you run your own e-mail servers,
do key employees have alternative e-mail addresses that they check regularly (home accounts or
accounts with Web-based e-mail services, etc.) and are these addresses known to other key personnel in
case they’re needed for emergency contact?
The BCP should also spell out a plan for setting up operations at an alternative location if the building is
destroyed or rendered unusable by a disaster. Best practice is to have ready access to an empty facility
that you can move into; a more practical (less expensive) alternative would be to move your operations
to a branch office if you have more than one physical site.
The BCP should also take into consideration the estimated costs of moving, setup, and ongoing
operations in the new facility.
In some cases, you may be able to recover essential equipment and move it to a new site. In others, it
may be destroyed or damaged and have to be replaced or repaired. The BCP should lay out how the
equipment or its functions will be replaced (for instance, you may switch to a Web hosting or e-mail
hosting service until you’re able to replace your servers and get them operational again).
The BCP should address the step-by-step process of recovering and reinstating the business operations
to a pre-disaster state, including assessing the damage, estimating recovery costs, working with
insurance companies, monitoring the progress of the recovery process, and transitioning the
management of the business operations from the recovery team back to the regular managers.
We all have to address disaster recovery (DR) at various levels, but we typically must apply the
technology to fit rigid parameters, such as lower cost or reduced functionality, instead of being able to do it right.
But what if you didn't have any limitations to hold you back? How would you create the perfect DR
model? Here are some things (however unrealistic) that might go into building the perfect environment
for meeting DR requirements.
Providing transparent network connectivity is our number one challenge in making the ideal DR
environment. If subnets for data center components were designed to be available across multiple
locations without reliance on one piece in another data center, DR failover would be a breeze. Sure, a lot
can be done to manage the use of a DR site through DNS and virtual switches—but if those could be
avoided for a more natural configuration, the process could be made easier and work in a more
transparent fashion.
Storage could arguably take the #1 spot on our list, since it's such a big pain in DR configurations.
Technologies are available to handle storage replication and set up storage grids, but how many of us
have the money to implement the functionality? The ideal DR storage system would also dispel any
performance limitations when you're running the entire enterprise from the DR configuration.
Limitations in performance may cause a selective DR, which makes for difficult decisions on what
systems are truly required in the DR environment.
How many times have you come across a system that began as a pilot or simple test, was promoted to a
live role, and is singular in nature and can't scale? These are DR plan inhibitors. If all systems are
designed with the DR concepts in mind, all systems can comply with the same DR requirements and make
for an easy transition for administrators.
This extends to the peripheral components as well—storage, data recovery, networking, and access to
the system should be created with DR in mind. But too many times, a system may have some but not all
of the DR components in place. "Mostly compliant" with the DR model is still noncompliant.
Have you ever been irritated by partial compliance with an enterprise DR policy? An example would be
when one application meets a different standard of DR—so maybe only a few clients can run the
application in the DR configuration. Wouldn't it be great if the standing policy for the organization was
to have full compatibility for the DR configuration? The ideal DR policy would provide funding and
enforce the requirements for the DR configuration across all systems and groups within IT.
How we get to a solid and robust DR configuration will vary widely by size and scope, but the perfect
conversion to the DR would be a quick and contained process that is identified in a few steps per system, or
a few steps for the entire environment. With the DR configuration so accessible, this would also be a
good opportunity to enforce regular intervals where the DR configuration is used.
An overly complex procedure to use a DR site can ruin the usability of the mechanism. The ideal DR
environment has consistent and clear documentation that is practiced regularly so there's no guessing in
switching to the DR model. In fact, regular use of the DR model can ensure that the remote DR site
works as expected, keeps staff familiar with the procedure, and extends the life of primary systems by
increasing idle time at the primary site.
The most challenging part of DR is the data recovery process. If a data recovery model is patched
together using various scripts, watchdog programs, or other solutions that are not native to a product’s
feature set, the risk of data corruption and DR failure goes up. The ideal DR model would have solutions
built into the product that consider all parts of a solution, as many products use more than just a
database to provide the overall application.
A comprehensive DR plan that meets all requirements from a design perspective yet can't handle the
load is worthless. You don't want to have to decide which applications and systems are available at the
DR site when you're in a DR situation. Limitations such as Internet connectivity, network bandwidth,
shared storage throughput, backup mechanism availability, and storage capacity are all factors in
gauging the overall performance for the DR site.
The perfect DR situation would be an exact inventory in the remote data center that models that of the
primary data center. However, maintaining an equipment inventory in lockstep with another data center
is nearly impossible. So the next-best solution would be a remote data center that meets or exceeds a
performance benchmark set by the primary data center in all relevant categories.
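The "meets or exceeds the benchmark in all relevant categories" test can be expressed as a category-by-category comparison. Here is a small Python sketch; the function name, the category names, and the numbers in the usage note are hypothetical, and it assumes every metric is one where a higher value is better.

```python
from typing import Dict


def dr_site_shortfalls(primary: Dict[str, float],
                       dr_site: Dict[str, float]) -> Dict[str, float]:
    """Return the categories where the DR site falls below the primary benchmark.

    Each value is the gap (primary minus DR site). A category missing from
    dr_site is treated as zero capacity. An empty result means the DR site
    meets or exceeds the primary in every category.
    """
    gaps = {}
    for category, required in primary.items():
        available = dr_site.get(category, 0.0)
        if available < required:
            gaps[category] = required - available
    return gaps
```

For example, a primary benchmark of `{"network_mbps": 1000, "storage_tb": 50}` against a DR site measured at `{"network_mbps": 600, "storage_tb": 60}` reports a single shortfall of 400 in `network_mbps`.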
9. The user experience in the change-over is nothing more than a reboot (if that)
Managing the transition to the remote data center is difficult enough on the data center. But the user
side of the transition should be made as seamless as possible. Strong DR plans and mechanisms
frequently base technology on DNS names (especially CNAME records) that can be easily switched to
reflect a new authoritative source for the business service. This can include standby application servers
and mirrored database servers, as well as migration to new versions with the simple DNS change.
Managing the refresh or the caching of the names can be a little tricky, but having clients either reboot
or run the ipconfig /flushdns command (on Windows clients) can usually clear any cached entries. The same
goes for server systems that are affected by a DR transition; they may need to refresh their own DNS
cache, so the same configuration steps may need to be followed on the server platform.
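After a DNS switch, it helps to confirm from a client or server that the service name now resolves to the DR site instead of a stale cached address. The Python sketch below uses the standard library's `socket.gethostbyname_ex`; the function names, the hostname, and the addresses in the usage note are hypothetical placeholders.

```python
import socket
from typing import Callable, Iterable, Set


def resolve_ipv4(hostname: str) -> Set[str]:
    """Resolve a hostname to its IPv4 addresses using the system resolver."""
    _name, _aliases, addresses = socket.gethostbyname_ex(hostname)
    return set(addresses)


def points_at_dr(hostname: str,
                 dr_addresses: Iterable[str],
                 resolve: Callable[[str], Set[str]] = resolve_ipv4) -> bool:
    """True when every address the name resolves to belongs to the DR site."""
    resolved = resolve(hostname)
    return bool(resolved) and resolved <= set(dr_addresses)
```

Calling `points_at_dr("app.example.com", {"198.51.100.10"})` on a machine that still returns the primary site's address is a hint that its DNS cache needs to be flushed or that the record's time to live has not yet expired. The `resolve` parameter exists so the check can be exercised without touching the network.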
10. All things are possible for the small environment, too
The more robust DR configurations tend to present themselves naturally to the large enterprise.
However, the small IT shops are at a resource disadvantage when it comes to architecting a
comprehensive DR plan. The ideal DR model would be applicable to big and small environments, and all
of the objectives could be reached with the small organization. Technologies such as virtualization have
really been a boon for small environments trying to achieve their DR objectives, and that frequently is the
cost justifier for the initial investments in storage and management software.
Many organizations view disaster recovery planning as primarily a technology discussion and forget to
take human nature into account. One of my clients went so far as to build a plan that relied on paying
someone to grab disk drives prior to leaving the building in the event of a disaster. The company
obviously left human nature out of its equation, and the entire DR plan was in serious jeopardy because
of it.
When planning DR solutions, companies must remember that the technology will only supplement the
efforts of humans in one way or another—not replace them. Whether the DR plan will keep systems
online for client access (such as Web sites, ATMs, and intranets) or keep back-office solutions online, the
main goal is to keep data available for people to use.
Human interaction impacts a DR plan on several different levels, and the IT staff is the most influential
part of the DR plan. These are the people who must bring the operations back online during an
emergency and keep them running. Never underestimate the role they play, even in automated DR
solutions.
Not everything will work smoothly, due to misconfiguration or simple problems. When things go wrong,
it will be up to your staff to fix the problems, as quickly as humanly possible.
When creating a DR plan, companies must be realistic, particularly about the people the plan involves.
The people who set up the systems may not be available to perform failover operations. They may have
left the organization, they may be unable to reach the systems in question, or they may be dead.
As horrible as the thought is, part of planning for disasters is planning for the worst-case scenario. While
rare, large-scale disasters such as earthquakes can cause fatalities. You must plan for as many
contingencies as you can—and hope that you haven't missed the one that actually happens when the
disaster strikes.
Internal end users are another important factor in a DR plan. Bringing data systems back online will be a
useless effort if no one can use them. Remember that you're planning for a multitude of possibilities,
from power outages to building loss.
Most of these disasters will cause end-user desktops and access points to fail for one reason or another.
While you can bring the data systems back online in another facility, you'll also need to find ways to get
your end users up and running again.
VPN systems, alternate workspace, and other methodologies can help mitigate this issue, but you must
plan for these options and set them up ahead of time. You also need to test them on a regular basis, and
this means bringing end users into the testing process. Once again, the human element becomes a huge
part of your DR planning.
Finally, don't forget that there's a good chance that the ultimate end users are not internal employees.
There could easily be a large portion of data consumers who exist beyond the corporate firewall.
So not only must you set up alternate access for your internal concerns, but you must also be ready to
reroute incoming and outgoing Internet connectivity as well. This may require DNS changes, additional
connectivity links, and even additional security constructs such as signing certificates.
Nearly every Internet user expects occasional outages, but make sure that even if you can't get the
original data center back online in a reasonable amount of time, you have some place for these clients
to connect to within a short timeframe.
Never underestimate the impact of the human factor in planning for disaster recovery. Ignoring this
element is a sure route to failure, and one that you can avoid by remembering that it still takes a human
being to plug in a machine.
Customer service
In-house customer service operations will resume at the hot site or cold site as soon as possible. As
stated above, only managers are to report to the hot site within the first 24 hours or to the cold site
after it is properly equipped. Customer service representatives will prioritize and address only the most
critical calls as designated by their managers. For other common calls and requests, customer service
representatives will send e-mails to customers that explain the current crisis and advise them on
resolving common issues.
Documentation
Managers will have hard copies of all relevant documentation at an off-site location to ensure proper
resumption of business activities as soon as possible. Managers will transport documentation to the hot
site/cold site when directed to do so by unit vice presidents.
Signed:
Date:
Disclaimer: This policy is not a substitute for legal advice. If you have legal questions related to this policy, see your lawyer.