Fault Tolerance and Redundancy


Fault Tolerance

A fault is usually defined as an event that causes a service to become unavailable.

A fault-tolerant system is one that can experience failures in individual components and subsystems and continue to provide the same (or a similar) level of service.

  • often achieved by provisioning redundancy for critical components to eliminate single points of failure
  • MTBF/MTTF/MTTR KPIs
    • are used to:
      • measure the reliability and efficiency of systems, processes, and equipment
      • assess whether goals for maximum tolerable downtime (MTD), recovery time objective (RTO), and recovery point objective (RPO) can be met
      • guide decisions regarding
        • system design
        • maintenance practices
        • redundancy or failover requirements
          • important to risk management processes
      • provide measurable insight into potential risks and support risk mitigation strategies

Mean Time Between Failures

Mean time between failures (MTBF) is a metric for a device or component that predicts the expected time between failures.

  • applies to repairable items; often treated as an indicator of a product's expected lifetime
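  • calculated as
    • MTBF = total operational time / number of failures
  • e.g.,
    • 10 appliances that run for 50 hrs, 2 fail
    • MTBF = (10 × 50) / 2 = 250 hours per failure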
  • higher MTBF suggests greater reliability and longer intervals between failures
    • can affect
      • maintenance scheduling
      • spare part management
      • and overall system performance
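
The MTBF arithmetic for the appliance example above can be sketched in a few lines of Python. This is a minimal illustration that assumes the common definition (total operational time divided by number of failures) and credits every unit with the full 50 hours; the function name `mtbf` is hypothetical.

```python
def mtbf(total_operating_hours: float, failures: int) -> float:
    """Mean time between failures: total operating time / number of failures."""
    if failures <= 0:
        raise ValueError("MTBF is undefined without at least one observed failure")
    return total_operating_hours / failures


# Example from the notes: 10 appliances each run for 50 hours, 2 of them fail.
# Assumption: every unit is credited with the full 50 hours of operation.
units = 10
hours_per_unit = 50
failures = 2

total_hours = units * hours_per_unit
print(mtbf(total_hours, failures))  # 250.0 (hours per failure)
```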

Mean Time to Failure

  • mean time to failure (MTTF)
    • similar metric for non-repairable components
    • e.g., a hard drive would have an MTTF, a server would have an MTBF
    • calculated as
      • MTTF = total operational time / number of devices
    • e.g.,
      • two drives are installed in a RAID array; one fails after 10 years but is never replaced, and the other fails after 14 years, bringing the server down
      • MTTF = (10 + 14) / 2 = 12 years
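
The RAID example above works out as a simple average of the observed lifetimes. The sketch below assumes MTTF is taken as total operational time divided by the number of devices, all of which ran until they failed; the function name `mttf` is hypothetical.

```python
from typing import Sequence


def mttf(lifetimes: Sequence[float]) -> float:
    """Mean time to failure: average lifetime of non-repairable devices."""
    if not lifetimes:
        raise ValueError("at least one device lifetime is required")
    return sum(lifetimes) / len(lifetimes)


# Example from the notes: one drive failed after 10 years, the other after 14.
drive_lifetimes_years = [10, 14]
print(mttf(drive_lifetimes_years))  # 12.0 (years)
```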

Mean Time to Repair

Mean time to repair (MTTR) is a measure of the time taken to correct a fault so that the system is restored to full operation.

  • aka mean time to replace or recover
  • calculated as
    • MTTR = total hours of unplanned maintenance / number of failure incidents
  • can be used to estimate whether an RTO is achievable
  • lower MTTR indicates quicker restoration of functionality
    • reducing downtime and potential disruptions to operations
  • helps allocate resources, prioritize maintenance activities, and optimize repair processes
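
As a rough sketch of how MTTR might be compared against an RTO, the Python below averages repair durations and checks the result against a target. The repair times and the 4-hour RTO are made-up illustrative values, and the function names are hypothetical.

```python
from typing import Sequence


def mttr(repair_hours: Sequence[float]) -> float:
    """Mean time to repair: total unplanned repair time / number of incidents."""
    if not repair_hours:
        raise ValueError("at least one repair incident is required")
    return sum(repair_hours) / len(repair_hours)


def rto_looks_achievable(mttr_hours: float, rto_hours: float) -> bool:
    """True when the average repair time fits within the recovery time objective."""
    return mttr_hours <= rto_hours


# Illustrative values only: three incidents and a 4-hour RTO.
repair_hours = [2.0, 4.0, 3.0]
rto_hours = 4.0

average = mttr(repair_hours)
print(average)                                   # 3.0 (hours)
print(rto_looks_achievable(average, rto_hours))  # True
```

Because MTTR is an average, an individual repair can still exceed the RTO even when this check passes, so the result is an estimate rather than a guarantee.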

Redundancy

A redundant or failover component is one that is not essential to the normal function of a system but allows the system to recover from the failure of another component.

  • devices that provide fault tolerance:
    • redundant spares
      • components such as power supplies, network cards, drives (RAID), and cooling fans provide protection against hardware failures
      • a fully redundant server configuration has multiple components for each function
      • when a component fails, the system automatically fails over to the working one
    • network links
      • if there are multiple paths between switches and routers, traffic automatically fails over to a working path if a cable or network port is damaged
    • uninterruptible power supplies (UPS) and standby power supplies
      • provide power protection in the event of complete power failure and other building power issues
    • backup strategies
      • provide protection for data
    • cluster services
      • a means of ensuring that the total failure of a single server does not disrupt services
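
The failover behavior described above can be modeled very loosely in Python. This is a toy sketch, not how any particular hardware or clustering product works: the `Component` class, the `select_active` function, and the power supply names are all hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Component:
    name: str
    healthy: bool = True


def select_active(components: List[Component]) -> Optional[Component]:
    """Return the first healthy component, or None if every one has failed."""
    for component in components:
        if component.healthy:
            return component
    return None


# Two redundant power supplies: the second takes over when the first fails.
power_supplies = [Component("psu-a"), Component("psu-b")]
print(select_active(power_supplies).name)  # psu-a

power_supplies[0].healthy = False
print(select_active(power_supplies).name)  # psu-b

# With no spare left, the remaining unit is a single point of failure.
power_supplies[1].healthy = False
print(select_active(power_supplies))       # None
```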