Fault Tolerance and Redundancy

Fault Tolerance

A fault is usually defined as an event that causes a service to become unavailable.

A fault-tolerant system is one that can experience failures in individual components and subsystems and continue to provide the same (or similar) level of service.

Fault tolerance is often achieved by provisioning redundancy for critical components to eliminate single points of failure.

MTBF/MTTF/MTTR KPIs
- are used to:
  - measure the reliability and efficiency of systems, processes, and equipment
  - assess whether goals for maximum tolerable downtime (MTD), recovery time objective (RTO), and recovery point objective (RPO) can be met
  - guide decisions regarding
    - system design
    - maintenance practices
    - redundancy or failover requirements
  - support risk management processes by providing measurable insights into potential risks and informing risk mitigation strategies

Mean Time Between Failures

Mean time between failures (MTBF) is a metric for a device or component that predicts the expected time between failures.

represents the expected lifetime of a product
calculated as: MTBF = total operational time / number of failures
e.g.,
- 10 appliances that run for 50 hrs each, 2 fail: MTBF = (10 × 50) / 2 = 250 hours per failure
a higher MTBF suggests greater reliability and longer intervals between failures
- can affect
  - maintenance scheduling
  - spare part management
  - overall system performance
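
The MTBF arithmetic can be sketched in a few lines of Python. This is a minimal illustration that assumes MTBF is total operational time divided by the number of failures; the function name `mtbf` and its parameters are made up for this example and do not come from any library.

```python
def mtbf(total_operational_hours: float, failures: int) -> float:
    """Mean time between failures: total operational time / number of failures."""
    if failures <= 0:
        # With no observed failures the ratio is undefined.
        raise ValueError("MTBF is undefined when no failures have been observed")
    return total_operational_hours / failures


# Example from the notes: 10 appliances run for 50 hours each, 2 fail.
appliances = 10
hours_each = 50
print(mtbf(appliances * hours_each, failures=2))  # 250.0
```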

Mean Time to Failure

Mean time to failure (MTTF)
- similar metric for non-repairable components
- e.g., a hard drive would have an MTTF, while a server would have an MTBF
- calculated as: MTTF = total operational time of all devices / number of devices
- e.g.,
  - two drives installed in a RAID array; one failed after 10 years but was never replaced, and the other failed after 14 years, bringing the server down: MTTF = (10 + 14) / 2 = 12 years
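
The same kind of calculation for MTTF, as a hedged sketch: it assumes MTTF is the average time to failure across a set of non-repairable devices. The function name `mttf` is hypothetical and chosen only for this example.

```python
from typing import Sequence


def mttf(times_to_failure: Sequence[float]) -> float:
    """Mean time to failure: sum of each device's time to failure / number of devices."""
    if not times_to_failure:
        raise ValueError("at least one device lifetime is required")
    return sum(times_to_failure) / len(times_to_failure)


# Example from the notes: one drive failed after 10 years, the other after 14.
print(mttf([10, 14]))  # 12.0
```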

Mean Time to Repair

Mean time to repair (MTTR) is a measure of the time taken to correct a fault so that the system is restored to full operation.

aka mean time to replace or recover
calculated as: MTTR = total hours of unplanned maintenance / number of failure incidents
can be used to estimate whether an RTO is achievable
a lower MTTR indicates quicker restoration of functionality
- reducing downtime and potential disruptions to operations
helps allocate resources, prioritize maintenance activities, and optimize repair processes
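
A rough sketch of how MTTR might be compared against an RTO. It assumes MTTR is total unplanned repair time divided by the number of incidents. The incident durations and the 4-hour RTO are invented for illustration, and `mttr` and `rto_looks_achievable` are hypothetical helper names.

```python
from typing import Sequence


def mttr(repair_hours: Sequence[float]) -> float:
    """Mean time to repair: total unplanned repair time / number of incidents."""
    if not repair_hours:
        raise ValueError("at least one repair incident is required")
    return sum(repair_hours) / len(repair_hours)


def rto_looks_achievable(mttr_hours: float, rto_hours: float) -> bool:
    """Coarse check: the average repair time fits within the recovery time objective."""
    return mttr_hours <= rto_hours


# Hypothetical incident log: three outages that took 2, 4, and 3 hours to fix.
average = mttr([2.0, 4.0, 3.0])
print(average)                                        # 3.0
print(rto_looks_achievable(average, rto_hours=4.0))   # True
```

Because MTTR is an average, a single long repair can still exceed the RTO even when this check passes, so it is best treated as an estimate rather than a guarantee.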

Redundancy

A redundant or failover component is one that is not essential to the normal function of a system but allows the system to recover from the failure of another component.

devices and techniques that provide fault tolerance:
- redundant spares
  - components such as power supplies, network cards, drives (RAID), and cooling fans provide protection against hardware failures
  - a fully redundant server is configured with multiple components for each function
  - if a component fails, the system automatically fails over to the working one
- network links
  - if there are multiple paths between switches and routers, traffic automatically fails over to a working path if a cable or network port is damaged
- uninterruptible power supplies (UPS) and standby power supplies
  - provide power protection in the event of complete power failure and other building power issues
- backup strategies
  - provide protection for data
- cluster services
  - a means of ensuring that the total failure of a single server does not disrupt services as a whole
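
The failover idea can be illustrated with a small software analogy: try the primary component and, if it fails, fall back to a redundant one. This is only a sketch of the concept, not how hardware or cluster failover is actually implemented. All names here (`call_with_failover`, `primary`, `standby`) are hypothetical.

```python
from typing import Callable, List, Sequence


def call_with_failover(components: Sequence[Callable[[], str]]) -> str:
    """Try each redundant component in order and return the first successful result."""
    errors: List[Exception] = []
    for component in components:
        try:
            return component()
        except Exception as exc:
            # Record the failure and move on to the next redundant component.
            errors.append(exc)
    raise RuntimeError(f"all {len(components)} redundant components failed: {errors}")


def primary() -> str:
    # Simulated fault in the primary component.
    raise ConnectionError("primary component failed")


def standby() -> str:
    return "served by standby"


print(call_with_failover([primary, standby]))  # served by standby
```

With only one component in the list, a single fault takes the service down, which is the single point of failure that redundancy is meant to remove.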
