Fault Tolerance and Redundancy
Fault Tolerance
A fault is usually defined as an event that causes a service to become unavailable.
A fault-tolerant system is one that can experience failures in individual components and subsystems and continue to provide the same (or a similar) level of service.
- often achieved by provisioning redundancy for critical components to eliminate single points of failure
- MTBF/MTTF/MTTR KPIs
  - are used to:
    - measure the reliability and efficiency of systems, processes, and equipment
    - assess whether goals for maximum tolerable downtime (MTD), recovery time objective (RTO), and recovery point objective (RPO) can be met
    - guide decisions regarding
      - system design
      - maintenance practices
      - redundancy or failover requirements
  - are important to risk management processes
    - provide measurable insight into potential risks and support risk mitigation strategies
Mean Time Between Failures
Mean time between failures (MTBF) is a metric for a device or component that predicts the expected time between failures.
- represents the expected lifetime of a product
- calculated as: MTBF = total operational time / number of failures
  - e.g., 10 appliances run for 50 hrs and 2 fail: MTBF = (10 × 50) / 2 = 250 hrs per failure
- higher MTBF suggests greater reliability and longer intervals between failures
  - can affect
    - maintenance scheduling
    - spare part management
    - overall system performance
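The appliance example above can be checked with a few lines of Python. This is a minimal sketch that assumes the common definition of MTBF as total operational time divided by the number of failures; the function name `mtbf` and its parameters are illustrative, not part of any standard library.

```python
def mtbf(total_operational_hours: float, failures: int) -> float:
    """Mean time between failures: total operational time / number of failures."""
    if failures <= 0:
        # With no recorded failures the ratio is undefined.
        raise ValueError("MTBF needs at least one recorded failure")
    return total_operational_hours / failures


# Example from the notes: 10 appliances each run for 50 hours, and 2 fail.
appliances = 10
hours_each = 50
failed = 2

print(mtbf(appliances * hours_each, failed))  # 250.0 (hours per failure)
```

Note that the numerator counts operating hours across all appliances, not only those that failed.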
Mean Time to Failure
- mean time to failure (MTTF)
  - similar metric for non-repairable components
  - e.g., a hard drive would have an MTTF, whereas a repairable server would have an MTBF
- calculated as: MTTF = total operational time / number of devices
  - e.g., two drives are installed in a RAID array; one fails after 10 years but is never replaced, and the other fails after 14 years, bringing the server down: MTTF = (10 + 14) / 2 = 12 years
Mean Time to Repair
Mean time to repair (MTTR) is a measure of the time taken to correct a fault so that the system is restored to full operation.
- also known as mean time to replace or mean time to recover
- calculated as: MTTR = total time spent on unplanned repairs / number of failure incidents
- can be used to estimate whether an RTO is achievable
- lower MTTR indicates quicker restoration of functionality
  - reducing downtime and potential disruptions to operations
- helps allocate resources, prioritize maintenance activities, and optimize repair processes
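The sketch below computes MTTR and compares it with an RTO. It assumes MTTR is the total unplanned repair time divided by the number of failure incidents. The repair durations and the 4-hour RTO are hypothetical values chosen for illustration, and `rto_looks_achievable` is an assumed helper that gives only a rough first check, since an average can hide individual repairs that take much longer.

```python
def mttr(total_repair_hours: float, incidents: int) -> float:
    """Mean time to repair: total unplanned repair time / number of incidents."""
    if incidents <= 0:
        raise ValueError("MTTR needs at least one incident")
    return total_repair_hours / incidents


def rto_looks_achievable(mttr_hours: float, rto_hours: float) -> bool:
    """Rough check: the average repair time fits within the recovery time objective."""
    return mttr_hours <= rto_hours


# Hypothetical incident log: hours spent restoring service after each failure.
repair_hours = [2.0, 4.0, 3.0]

average = mttr(sum(repair_hours), len(repair_hours))
print(average)                                        # 3.0
print(rto_looks_achievable(average, rto_hours=4.0))   # True
```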
Redundancy
A redundant or failover component is one that is not essential to the normal function of a system but allows the system to recover from the failure of another component.
- devices that provide fault tolerance:
  - redundant spares
    - components such as power supplies, network cards, drives (RAID), and cooling fans provide protection against hardware failures
    - a fully redundant server is configured with multiple components for each function
      - if a component fails, the system automatically fails over to the working one
  - network links
    - if there are multiple paths between switches and routers, traffic automatically fails over to a working path when a cable or network port is damaged
  - uninterruptible power supplies (UPS) and standby power supplies
    - provide power protection in the event of complete power failure and other building power issues
  - backup strategies
    - provide protection for data
  - cluster services
    - a means of ensuring that the total failure of a server does not disrupt services
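The failover behaviour described for redundant spares can be illustrated with a toy model. This sketch assumes each component reports a simple healthy/failed flag; the `Component` class, the `select_active` function, and the PSU names are hypothetical and do not correspond to any real hardware or vendor API.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Component:
    name: str
    healthy: bool = True


def select_active(components: Sequence[Component]) -> Optional[Component]:
    """Return the first healthy component, or None if every spare has failed."""
    for component in components:
        if component.healthy:
            return component
    return None


# Two redundant power supplies: the second takes over when the first fails.
psus = [Component("psu-a"), Component("psu-b")]
print(select_active(psus).name)  # psu-a

psus[0].healthy = False          # simulate a hardware failure
print(select_active(psus).name)  # psu-b

psus[1].healthy = False          # no redundancy left
print(select_active(psus))       # None
```

Once both components have failed there is nothing left to fail over to, which is why the notes stress eliminating single points of failure.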