High Availability


High availability (HA) means ensuring that systems remain operational and accessible with minimal downtime.

  • involves designing and implementing infrastructure for fault tolerance and redundancy
  • availability is measured over a defined period (e.g., one year)
    • it can be expressed as:
      • Uptime
        • the amount of time client data and resources are available on the servers
      • Downtime
        • the amount of time, or the percentage of the period, that a system is unavailable
        • the maximum tolerable downtime (MTD) metric expresses the availability requirement for a particular business function
        • downtime is calculated as the sum of scheduled service intervals plus unplanned outages over the period
    • high availability is usually loosely described as
      • 24x7
      • 24x365
    • availability is often expressed as the number of nines in the uptime percentage, counting the digits before the decimal point
      • e.g., 99.999% uptime is stated as “five nines”
Nines Value    Availability    Annual Downtime (hh:mm:ss)
Six            99.9999%        00:00:32
Five           99.999%         00:05:15
Four           99.99%          00:52:34
Three          99.9%           08:45:36
Two            99%             87:36:00
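
The downtime figures in the table follow directly from the availability percentage. The sketch below is one way to reproduce them, assuming a 365-day year; the function names are illustrative and not taken from any standard library. It also shows availability computed from downtime as the sum of scheduled service intervals plus unplanned outages.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000; assumes a non-leap year


def annual_downtime_seconds(availability_percent: float) -> float:
    """Seconds of downtime per year permitted at the given availability."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def format_hms(seconds: float) -> str:
    """Render a duration as hh:mm:ss, rounded to the nearest second."""
    total = round(seconds)
    hours, remainder = divmod(total, 3600)
    minutes, secs = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}"


def availability_percent(scheduled_s: float, unplanned_s: float,
                         period_s: float = SECONDS_PER_YEAR) -> float:
    """Availability over a period, where downtime = scheduled + unplanned."""
    downtime = scheduled_s + unplanned_s
    return 100 * (period_s - downtime) / period_s


if __name__ == "__main__":
    levels = [("Six", 99.9999), ("Five", 99.999), ("Four", 99.99),
              ("Three", 99.9), ("Two", 99.0)]
    for name, pct in levels:
        print(name, f"{pct}%", format_hms(annual_downtime_seconds(pct)))
```

For example, one hour of scheduled maintenance plus 30 minutes of unplanned outage in a year gives `availability_percent(3600, 1800)`, which is a little over 99.98% and so falls between three and four nines.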

Scalability and Elasticity

Fault Tolerance and Redundancy

Fault tolerance is protection against system failure by providing extra (redundant) capacity.

  • fault tolerant systems identify and eliminate single points of failure

Redundancy is overprovisioning resources at the component, host, and/or site level so that there is failover to a working instance in the event of a problem.
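
Failover to a working instance can be sketched as a priority-ordered health check. This is a minimal illustration only: the `Instance` class, the `select_active` function, and the in-memory health flags are hypothetical, and a real deployment would normally rely on a load balancer or cluster manager rather than hand-written selection logic.

```python
from typing import Callable, Optional, Sequence


class Instance:
    """A hypothetical service instance with a name and a health-check callable."""

    def __init__(self, name: str, is_healthy: Callable[[], bool]) -> None:
        self.name = name
        self.is_healthy = is_healthy


def select_active(instances: Sequence[Instance]) -> Optional[Instance]:
    """Return the first healthy instance in priority order, or None if all are down."""
    for instance in instances:
        try:
            if instance.is_healthy():
                return instance
        except Exception:
            # A health check that raises is treated the same as a failed check.
            continue
    return None


if __name__ == "__main__":
    # Simulated health state; flipping a flag stands in for a real outage.
    state = {"primary": True, "secondary": True}
    pool = [
        Instance("primary", lambda: state["primary"]),
        Instance("secondary", lambda: state["secondary"]),
    ]
    print(select_active(pool).name)   # primary is healthy, so it is chosen
    state["primary"] = False          # simulate a primary failure
    print(select_active(pool).name)   # traffic fails over to secondary
```

With only one instance in the pool, `select_active` returns `None` as soon as that instance fails, which is the single point of failure that redundancy is meant to remove.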

Site Considerations

Cloud as Disaster Recovery (DR)

Testing Redundancy and High Availability

  • Load testing
    • incorporates specialized software tools to
      • validate a system’s performance under expected or peak loads
      • and identify bottlenecks or scalability issues
  • Failover testing
    • focuses on validating failover processes to ensure a seamless transition between primary and secondary infrastructure
  • Testing monitoring systems
    • validates that failures and performance issues are detected and responded to effectively
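
As a rough illustration of load testing, the sketch below drives a target function concurrently and reports the error rate and latency figures. It uses only the Python standard library; `run_load_test` and the placeholder `fake_request` target are hypothetical names, and dedicated load-testing tools offer far more than this.

```python
import math
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Tuple


def run_load_test(target: Callable[[], None], total_requests: int,
                  concurrency: int) -> Dict[str, float]:
    """Call `target` total_requests times using `concurrency` worker threads."""
    if total_requests < 1 or concurrency < 1:
        raise ValueError("total_requests and concurrency must be at least 1")

    def timed_call(_: int) -> Tuple[float, bool]:
        start = time.perf_counter()
        try:
            target()
            ok = True
        except Exception:
            ok = False  # any exception counts as a failed request
        return time.perf_counter() - start, ok

    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        results = list(pool.map(timed_call, range(total_requests)))

    latencies = sorted(latency for latency, _ in results)
    errors = sum(1 for _, ok in results if not ok)
    # Nearest-rank 95th percentile of the observed latencies.
    p95 = latencies[math.ceil(0.95 * len(latencies)) - 1]
    return {
        "requests": float(total_requests),
        "error_rate": errors / total_requests,
        "p95_latency_s": p95,
        "max_latency_s": latencies[-1],
    }


if __name__ == "__main__":
    def fake_request() -> None:
        time.sleep(0.01)  # placeholder for a real call to the system under test

    print(run_load_test(fake_request, total_requests=200, concurrency=20))
```

Raising `concurrency` toward the expected peak and watching how `p95_latency_s` and `error_rate` change is one simple way to spot a bottleneck.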