Disaster Recovery Plan (DRP)


A disaster recovery plan (DRP) is a documented and resourced plan showing actions and responsibilities to be used in response to critical incidents.

  • ensures that systems can recover from catastrophic events in a reasonable amount of time with minimal data loss
  • critical incidents
    • incidents that threaten the performance or security of a whole site
  • subset of Business Continuity Plan (BCP)
  • should accomplish the following:
    • Identify scenarios for natural and non-natural disasters and options for protecting systems
    • Identify tasks, resources, and responsibilities for responding to a disaster
      • tasks:
        • switching services to failover systems or sites
        • restoring systems and data from backups
    • Train staff in the disaster planning procedures and how to react well to adverse events
      • e.g., evacuation routes posted on maps and signage indicating meeting places in case of evacuation

BC/DR Toolkit

A BC/DR toolkit is a container that holds all the documentation and tools needed to carry out a proper BC/DR response.

  • should be secure, durable, and compact
  • container may be physical or virtual
    • hard copies or electronic copies
    • recommend both versions
  • have duplicate in at least one other location
  • Contents:
    • current copy of the plan
      • with all appendices and addenda
    • emergency and backup communication equipment
    • copies of all appropriate network and infrastructure diagrams and architecture
    • copies of all requisite software for creating a clean build of the critical systems
      • with media containing appropriate updates and patches for the current versioning
    • emergency contact information
    • documentation tools and equipment
    • emergency essentials
      • flashlight, water, rations, etc.
    • fresh batteries for operating all equipment in the kit for at least 24 hours
  • kit must be maintained so its contents stay current (e.g., plan copy, contact information, software updates, batteries)
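
As a rough illustration of kit maintenance, the contents list above can be treated as a checklist that is audited on a schedule. The sketch below is only one possible approach: the item names are paraphrased from the list, and the 90-day review interval, the KitItem type, and the audit_kit function are all hypothetical, not taken from any standard.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Item names paraphrased from the contents list in these notes.
REQUIRED_ITEMS = {
    "plan copy",
    "communication equipment",
    "network diagrams",
    "build software",
    "emergency contacts",
    "documentation tools",
    "emergency essentials",
    "batteries",
}

# Hypothetical review interval; the notes only say the kit needs maintenance.
REVIEW_INTERVAL = timedelta(days=90)


@dataclass
class KitItem:
    name: str
    last_checked: date


def audit_kit(items, today):
    """Return required items that are missing and items overdue for review."""
    present = {item.name for item in items}
    missing = sorted(REQUIRED_ITEMS - present)
    stale = sorted(
        item.name for item in items
        if today - item.last_checked > REVIEW_INTERVAL
    )
    return {"missing": missing, "stale": stale}
```

Running the same audit against the duplicate kit at the second location would keep both copies in step.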

Relocation

  • organization may choose to evacuate and relocate personnel involved in critical operations to an alternate operating location
  • relocation plan components:
    • tasking and activities should include representatives from HR and finance
      • requires travel arrangements and payments
    • sufficient support for relocating dependents and family members
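    • relocation distance is a balance: far enough to be unaffected by the disaster, but not so far that travel delay or cost becomes prohibitive
    • joint operating agreements and memoranda of agreement (MoAs) can be used to establish cost-effective relocation sites at facilities belonging to other operations in the local area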

Power

  • interruptions to normal power supply often result from events or disasters
  • near-term emergency power is usually battery backups
    • UPS systems
    • failover should be close to immediate
      • have appropriate line conditioning so that transition does not adversely affect the powered devices
    • line conditioner function in a UPS often serves as an additional component of normal operations
      • dampens surges and dips in utility power automatically
  • generators that supply power when utility electricity is interrupted have automatic transfer switches
    • transfer switches
      • sense when the utility provision fails
      • start the generator
      • provide generator power to the facility
    • generators are not a viable replacement for a UPS
      • the two should be used together: the UPS carries the load until the generator is supplying power
    • fuel is typically gasoline, diesel, natural gas, or propane
      • appropriate storage, supply, and maintenance should be documented in the BC/DR plan
      • fuel is a health and human safety hazard
        • should have appropriate precautions and safety measures
      • Uptime Institute recommends
        • at least 12 hours' worth of on-site fuel for the data center's critical functions
        • resupply of additional fuel should be scheduled and delivered within those 12 hours
    • anticipate 72 hours of generator operation before alternatives are available
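
The fuel figures above lend themselves to a quick arithmetic check. The following is a minimal sketch, assuming a constant burn rate and fuel measured in litres; the Generator and FuelPlan types, the field names, and the function fuel_plan are hypothetical, and only the 12-hour and 72-hour figures come from these notes.

```python
from dataclasses import dataclass

MIN_ONSITE_HOURS = 12.0   # on-site fuel figure from the notes
PLANNING_HOURS = 72.0     # generator operation the plan should anticipate


@dataclass
class Generator:
    tank_litres: float           # usable fuel stored on site (assumed unit)
    burn_litres_per_hour: float  # consumption at the expected load


@dataclass
class FuelPlan:
    runtime_hours: float
    meets_onsite_minimum: bool
    resupply_litres: float


def fuel_plan(gen):
    """Compare on-site runtime with the 12-hour and 72-hour figures."""
    if gen.burn_litres_per_hour <= 0:
        raise ValueError("burn rate must be positive")
    hours = gen.tank_litres / gen.burn_litres_per_hour
    # Hours still uncovered after the on-site fuel is used up.
    extra_hours = max(0.0, PLANNING_HOURS - hours)
    return FuelPlan(
        runtime_hours=hours,
        meets_onsite_minimum=hours >= MIN_ONSITE_HOURS,
        # Fuel that must be delivered to reach the 72-hour planning figure.
        resupply_litres=extra_hours * gen.burn_litres_per_hour,
    )
```

For instance, a 1,200-litre tank burning 80 litres per hour gives 15 hours of runtime, which clears the 12-hour figure but leaves 57 hours (4,560 litres) to be covered by scheduled deliveries.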

Testing

  • crucial to test system resilience and incident response effectiveness
    • benefits:
      • identify potential vulnerabilities
      • evaluate efficiency of recovery strategies
      • improve overall preparedness
    • tabletop exercises
      • involve teams discussing and working through hypothetical scenarios to assess their response plans and decision-making processes
      • help identify knowledge, communication, and coordination gaps
      • e.g., a tabletop scenario might be an earthquake that destroys processing capability at the primary site, with the exercise walking through failover to an alternate processing location
    • validation tests/dry run
      • involve performing simulations of failovers
      • test:
        • that services can be restored using backup configurations and data
        • that recovery time metrics are met
      • can reveal unexpected problems
        • such as dependencies between services not being met during the failover process
    • full test
      • entire organization takes part in an unscheduled, unannounced practice scenario performing BC/DR activities
      • includes facility evacuation and system failover
      • best for detecting shortcomings in plan
      • has the greatest impact on productivity
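
The validation-test checks above (restore order, service dependencies, and recovery time) can be sketched in a few lines. This is an illustrative sketch only: the RestoreStep type, the function names, and the idea of a single target in minutes are assumptions, and it assumes services are restored one after another rather than in parallel.

```python
from dataclasses import dataclass


@dataclass
class RestoreStep:
    service: str
    depends_on: tuple   # names of services that must already be restored
    minutes: float      # measured time to restore this service


def unmet_dependencies(steps):
    """Return (service, dependency) pairs that were restored out of order."""
    restored = set()
    problems = []
    for step in steps:
        for dep in step.depends_on:
            if dep not in restored:
                problems.append((step.service, dep))
        restored.add(step.service)
    return problems


def total_recovery_minutes(steps):
    """Total recovery time, assuming steps run sequentially."""
    return sum(step.minutes for step in steps)


def meets_target(steps, target_minutes):
    """True if the dry run had no ordering problems and finished in time."""
    on_time = total_recovery_minutes(steps) <= target_minutes
    return on_time and not unmet_dependencies(steps)
```

Feeding in the timings recorded during a dry run shows whether the failover sequence in the plan restores services in a workable order and within the hypothetical target.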