Disaster Recovery Plan (DRP)
A disaster recovery plan (DRP) is a documented and resourced plan showing actions and responsibilities to be used in response to critical incidents.
- ensures that systems can recover from catastrophic events in a reasonable amount of time with minimal data loss
- critical incidents
- incidents that threaten the performance or security of a whole site
- subset of Business Continuity Plan (BCP)
- should accomplish the following:
- Identify scenarios for natural and non-natural disasters and options for protecting systems
- Identify tasks, resources, and responsibilities for responding to a disaster
- tasks:
- switching services to failover systems or sites
- restoring systems and data from backups
- Train staff in the disaster planning procedures and how to react well to adverse events
- E.g., evacuation routes posted on maps, signage indicating meeting places in case of evacuation
BC/DR Toolkit
A BC/DR toolkit is a container that holds all the documentation and tools needed to carry out a proper BC/DR response.
- should be secure, durable, and compact
- container may be physical or virtual
- hard copies or electronic copies
- recommend both versions
- have duplicate in at least one other location
- Contents:
- current copy of the plan
- with all appendices and addenda
- emergency and backup communication equipment
- copies of all appropriate network and infrastructure diagrams and architecture
- copies of all requisite software for creating a clean build of the critical systems
- with media containing appropriate updates and patches for the current versions
- emergency contact information
- documentation tools and equipment
- emergency essentials
- flashlight, water, rations, etc.
- fresh batteries for operating all equipment in the kit for at least 24 hours
- the kit must be maintained so that its contents stay current
Relocation
- organization may choose to evacuate and relocate personnel involved in critical operations to an alternate operating location
- relocation plan components:
- tasking and activities should include representatives from HR and finance
- requires travel arrangements and payments
- sufficient support for relocating dependents and family members
- relocation distance must be balanced against cost
- joint operating agreements and memoranda of agreement (MoAs) can be used to establish cost-effective relocation sites at facilities belonging to other operations in the local area
Power
- interruptions to normal power supply often result from events or disasters
- near-term emergency power is usually battery backups
- UPS systems
- failover should be close to immediate
- have appropriate line conditioning so that transition does not adversely affect the powered devices
- line conditioner function in a UPS often serves as an additional component of normal operations
- dampens surges and dips in utility power automatically
- generators that supply power when utility electricity is interrupted have automatic transfer switches
- transfer switches
- sense when the utility provision fails
- start the generator
- provide generator power to the facility
- generators are not a viable replacement for a UPS
- the two should be used in conjunction
- fuel is typically gasoline, diesel, natural gas, or propane
- appropriate storage, supply, and maintenance should be documented in the BC/DR plan
- fuel is a health and human safety hazard
- should have appropriate precautions and safety measures
- Uptime Institute recommends
- 12 hours' worth of fuel for a data center's critical functions
- refueling should be scheduled and performed within those 12 hours
- anticipate 72 hours of generator operation before alternatives are available
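The fuel figures above lend themselves to a quick planning calculation. The following is a minimal sketch, not a sizing method: the class name, the burn rate, and the tank capacity are hypothetical values chosen for illustration, and it assumes each refueling delivery tops the tank up completely before it runs dry.

```python
import math
from dataclasses import dataclass


@dataclass
class GeneratorFuelPlan:
    """Hypothetical helper for checking generator fuel planning figures."""

    burn_rate_l_per_hr: float      # assumed fuel consumption at expected load
    tank_capacity_l: float         # assumed usable on-site storage
    on_site_hours: float = 12.0    # on-site reserve target from the notes
    planned_hours: float = 72.0    # total generator operation to anticipate

    def on_site_fuel_needed(self) -> float:
        """Fuel required to cover the on-site reserve target."""
        return self.burn_rate_l_per_hr * self.on_site_hours

    def tank_is_sufficient(self) -> bool:
        """True if the tank can hold the on-site reserve."""
        return self.tank_capacity_l >= self.on_site_fuel_needed()

    def total_fuel_needed(self) -> float:
        """Fuel required for the whole anticipated operating period."""
        return self.burn_rate_l_per_hr * self.planned_hours

    def deliveries_needed(self) -> int:
        """Refueling deliveries needed beyond the initial full tank.

        Assumes every delivery refills the tank to capacity.
        """
        remaining = self.total_fuel_needed() - self.tank_capacity_l
        if remaining <= 0:
            return 0
        return math.ceil(remaining / self.tank_capacity_l)


if __name__ == "__main__":
    plan = GeneratorFuelPlan(burn_rate_l_per_hr=50.0, tank_capacity_l=800.0)
    print("On-site fuel needed (L):", plan.on_site_fuel_needed())
    print("Tank sufficient:", plan.tank_is_sufficient())
    print("Total fuel needed (L):", plan.total_fuel_needed())
    print("Deliveries needed:", plan.deliveries_needed())
```

With the illustrative numbers, an 800 L tank covers the 12-hour reserve (600 L), and sustaining 72 hours takes four further deliveries. Real planning would also need to account for load changes, delivery lead times, and the safety requirements for stored fuel.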
Testing
- crucial to test system resilience and incident response effectiveness
- benefits:
- identify potential vulnerabilities
- evaluate efficiency of recovery strategies
- improve overall preparedness
- tabletop exercises
- involve teams discussing and working through hypothetical scenarios to assess their response plans and decision-making processes
- help identify knowledge, communication, and coordination gaps
- e.g., a tabletop scenario might be an earthquake that destroys processing capability at the primary site, used to walk through failover to an alternate processing location
- validation tests/dry run
- involve performing simulations of failovers
- tests:
- that services can be restored using backup configurations and data
- metrics for recovery time
- can reveal unexpected problems, such as dependencies between services not being met during the failover process
- full test
- entire organization takes part in an unscheduled, unannounced practice scenario performing BC/DR activities
- includes facility evacuation and system failover
- best for detecting shortcomings in plan
- has greatest impact to productivity
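A validation test (dry run) as described above can be sketched in code: restore services in a planned order, time the whole run, and flag any service whose dependencies were not yet restored. This is only an illustrative sketch. The function name, the service names, and the `target_seconds` recovery-time target are all assumptions, and the restore callables stand in for whatever real restore and health-check procedures an organization uses.

```python
import time
from typing import Callable, Dict, List, Tuple


def run_failover_drill(
    steps: List[Tuple[str, Callable[[], bool]]],
    depends_on: Dict[str, List[str]],
    target_seconds: float,
) -> dict:
    """Run restore steps in order and report timing and problems.

    steps: (service name, restore callable) pairs in planned order;
           each callable returns True if the restore succeeded.
    depends_on: service name -> services that must be restored first.
    target_seconds: hypothetical recovery-time target for the drill.
    """
    restored = set()
    problems = []
    start = time.monotonic()

    for name, restore in steps:
        # A service cannot come up before the services it depends on.
        missing = [d for d in depends_on.get(name, []) if d not in restored]
        if missing:
            problems.append(f"{name}: unmet dependencies {missing}")
            continue
        if restore():
            restored.add(name)
        else:
            problems.append(f"{name}: restore failed")

    elapsed = time.monotonic() - start
    return {
        "restored": sorted(restored),
        "problems": problems,
        "elapsed_seconds": elapsed,
        "met_target": not problems and elapsed <= target_seconds,
    }


if __name__ == "__main__":
    # The plan below restores "app" before "database", which the drill flags.
    result = run_failover_drill(
        steps=[("app", lambda: True), ("database", lambda: True)],
        depends_on={"app": ["database"]},
        target_seconds=60.0,
    )
    print(result["problems"])
    print(result["restored"])
```

In the usage example the drill reports that `app` was attempted before `database`, which is the kind of ordering problem a dry run is meant to surface before a real failover.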