Availability Monitoring
An availability monitor triggers an alert or alarm if a host or service experiences an outage or other unscheduled downtime.
- also referred to as heartbeat monitors or uptime monitors
- work by sending a probe to the target service and checking for a non-error response
- e.g., HTTP service should return a 200 status code (OK) when available
- some monitors can check the expiry date of digital certificates
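The probe described above can be sketched in a few lines of Python. This is a minimal illustration using only the standard library, not how any particular monitoring product works. The function names, the 5-second timeout, and the choice to treat only status 200 as "available" are assumptions made for the example.

```python
import socket
import ssl
import urllib.error
import urllib.request
from datetime import datetime, timezone
from typing import Optional


def http_is_available(url: str, timeout: float = 5.0) -> bool:
    """Return True if the URL answers with HTTP 200 (OK) within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # HTTP error statuses (4xx/5xx), DNS failures, refused
        # connections and timeouts all count as "not available".
        return False


def fetch_cert_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Fetch the notAfter field of the server's TLS certificate."""
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_cert_expiry(not_after: str, now: Optional[datetime] = None) -> float:
    """Days left before a certificate's notAfter time (negative if expired)."""
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    now = now or datetime.now(timezone.utc)
    return (expires - now).total_seconds() / 86400
```

A monitor would call `http_is_available` on a schedule and raise an alert after one or more failed probes. It could also alert when `days_until_cert_expiry(fetch_cert_not_after(host))` drops below a chosen threshold, such as 14 days.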
Troubleshooting Unresponsive Services
- unresponsive services usually manifest with multiple clients unable to connect
- common underlying causes:
- application or OS hosting the service has crashed
- or there is a hardware or power problem
- server hosting the service is overloaded
- high CPU/memory/disk I/O utilization/disk space utilization
- consider throttling client connections until the server resources can be upgraded
- There is congestion in the network, either at the client or server end (or both)
- Use ping or traceroute to check the latency experienced over the link
- compare to a network performance baseline
- throttling connections or bandwidth may help to ease the congestion
- broadcast storm is causing loss of network bandwidth
- Switching loops cause broadcast and unknown unicast frames to circulate the network perpetually
- may quickly consume all link bandwidth and crash network appliances
- check for excessive CPU utilization on switches and hosts
- Spanning Tree Protocol (STP) is supposed to prevent such loops
- can fail if STP communications between switches do not work correctly
- either because of a fault in cabling or a port/transceiver, or because of a misconfiguration
- Ports can also be configured with storm control
- will start to drop broadcasts and unknown unicasts if they reach a certain level
- Network congestion or high host CPU/memory utilization
- may be a sign that the service is being subject to a denial of service (DoS) attack
- Look for unusual access patterns
- e.g., use GeoIP to graph source IP addresses by country and compare to baseline access patterns
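The access-pattern comparison mentioned above could look roughly like the following sketch. It assumes each source IP address has already been resolved to a country code by a GeoIP lookup, which is not shown here. The function names and the 10-percentage-point threshold are illustrative choices, not part of any standard tool.

```python
from collections import Counter


def country_shares(country_codes):
    """Map each country code to its fraction of the total requests."""
    counts = Counter(country_codes)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {country: n / total for country, n in counts.items()}


def unusual_countries(baseline_codes, current_codes, min_increase=0.10):
    """Return (country, increase) pairs whose share of traffic rose by
    more than min_increase compared to the baseline, largest rise first."""
    baseline = country_shares(baseline_codes)
    current = country_shares(current_codes)
    flagged = []
    for country, share in current.items():
        increase = share - baseline.get(country, 0.0)
        if increase > min_increase:
            flagged.append((country, increase))
    return sorted(flagged, key=lambda item: item[1], reverse=True)
```

A sudden rise in traffic share from a country that barely appears in the baseline is not proof of a DoS attack, but it is a reasonable cue to look more closely at those source addresses.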
Info
- if hosts on the LAN cannot connect to an external host
- use a site such as isitdownrightnow.com to test whether the issue lies with the local network or the service provider