By the 2008 AASCIF Information Technology Standing Committee
A functional Disaster Recovery (DR) site is an integral part of effective
IT management. Nearly every IT department today has some sort of
contingency plan in effect, ranging from merely keeping off-site
backups all the way to a fully replicated “hot site” that
can run all of a company’s IT operations in case of an emergency.
Regardless of the nature of the DR site, regular testing is an
absolutely vital part of any good DR plan.
New Mexico Mutual recently completed a full test of our DR site,
which is not yet a fully replicated “hot site.” About
80% of our company’s IT functions are available there, which
made the test a fairly wide-ranging one. Here are seven key things
we learned:
1. Set realistic expectations.
If you are relying on third-party vendors for any part of your plan,
you simply can’t control everything. In our case, it was
the phone vendor (who shall remain nameless) who failed to divert
phone traffic to the DR site in a timely manner. In addition,
there will be idle time for the business people involved in the
drill, so set that expectation up front as well. And since the
test involved re-routing network and phone functionality to the
DR site, it was also important to let those not participating
in the drill know what they should expect in terms of the availability
(or lack thereof) of those services.
2. Things are going to go wrong. And that’s a good
thing.
In fact, that is the entire reason for testing: to find the problems
now rather than finding them for the first time during an actual
emergency. Even things that seem like a failure at the time are
really a success, because you caught them in a test and not in
the sheer chaos of an actual emergency.
3. Have people dedicated to observing the drill and writing down
their observations.
It’s asking too much of the people actually doing the drill
to expect that they do this, too. We had 3 people charged with
writing things down as they occurred, and the information was invaluable
in the post-test debrief. Which leads us to...
4. Do a post-mortem as soon as is feasible after the drill itself.
It is crucially important to get everyone together after the drill
and compare notes. It was from this session that we gathered
and disseminated the lessons learned, which formed the basis of
the tweaks we made to the overall DR plan.
5. Get senior management involved.
We were very fortunate in that our entire senior management team
took the time to attend the drill. Their participation was absolutely
crucial in terms of ensuring participation from the business
side. With senior management clearly invested in, and engaged
with, the DR testing process, we were able to test all aspects
of our operations with the enthusiastic support of the business
unit owners.
6. Be prepared to modify your plan based on the results of the
test.
A famous military dictum says that no battle plan survives contact
with the enemy; in other words, no matter how good the plan is,
forces beyond your control are going to dictate changes to it.
As with #2 above, this is also a good thing, since you don’t want
to try to change your plan on the fly in the midst of a real emergency.
An important takeaway from our test was that many systems that
IT classified as “non-critical” were, in fact, critical.
With the buy-in from senior management, this meant more resources
at our disposal and a plan that grew in scope.
7. Stay positive.
I know this sounds trite, but even a simulated emergency can cause
tempers to flare and people to get cross with one another. One
of the upshots of senior management involvement was that the
IT department felt very much under pressure to deliver on the
things contained in the DR plan. Staying positive and focused
on the fact that the objective of the test is to find out (with
apologies to Donald Rumsfeld) the things “you don’t
know you don’t know” will ensure a successful DR
test all the way around!