Heartbeat
Reliability
A heartbeat is a periodic signal sent between distributed system components to indicate they are alive and functioning, used for failure detection and triggering recovery mechanisms when signals stop.
Heartbeats are the most common failure detection mechanism in distributed systems. A component periodically sends a small message ("I'm alive") to a monitor or peer. If the expected heartbeat is not received within a timeout period, the sender is presumed dead.
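The timeout check described above can be sketched in a few lines. This is a minimal illustration, not a production monitor: the class name `HeartbeatMonitor`, its method names, and the injectable `clock` parameter are all assumptions made for the example, and transport (how the "I'm alive" message actually arrives) is left out.

```python
import time


class HeartbeatMonitor:
    """Tracks the last heartbeat per node and flags nodes that exceed a timeout.

    Hypothetical example class; names and structure are illustrative only.
    """

    def __init__(self, timeout_s, clock=time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock          # injectable so the logic can be tested without sleeping
        self.last_seen = {}         # node_id -> time of most recent heartbeat

    def record(self, node_id):
        """Call this whenever an "I'm alive" message arrives from node_id."""
        self.last_seen[node_id] = self.clock()

    def is_alive(self, node_id):
        """A node is presumed alive only if it reported within the timeout."""
        last = self.last_seen.get(node_id)
        if last is None:
            return False            # never heard from this node
        return (self.clock() - last) <= self.timeout_s

    def presumed_dead(self):
        """Return the known nodes whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(n for n, t in self.last_seen.items() if now - t > self.timeout_s)
```

In use, the receiving side would call `record(node_id)` for each incoming heartbeat and poll `presumed_dead()` on a timer. A monotonic clock is used by default so that wall-clock adjustments do not trigger spurious timeouts.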
Heartbeat design involves a trade-off: shorter intervals detect failures faster but generate more network traffic, while longer intervals reduce overhead but delay detection. Many systems use intervals of 1-10 seconds and declare a failure after 2-3 consecutive missed heartbeats.
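To make the trade-off concrete, the sketch below estimates detection delay and message volume for a given interval and missed-heartbeat threshold. It assumes a simplified model with no network delay or jitter, where the monitor declares failure at the moment the last tolerated heartbeat fails to arrive; the function names are hypothetical.

```python
def detection_window(interval_s, missed_threshold):
    """Return (best, worst) seconds between a crash and its detection.

    Simplified model: ignores network delay and jitter.
    Worst case: the node crashes right after sending a heartbeat.
    Best case: the node crashes just before its next heartbeat was due.
    """
    worst = interval_s * missed_threshold
    best = interval_s * (missed_threshold - 1)
    return best, worst


def heartbeat_rate(num_nodes, interval_s):
    """Heartbeat messages per second arriving at a single monitor."""
    return num_nodes / interval_s
```

Under this model, 1,000 nodes with a 1-second interval and a threshold of 3 produce 1,000 messages per second and detect a crash within 2 to 3 seconds. Stretching the interval to 10 seconds cuts traffic to 100 messages per second but pushes detection out to 20 to 30 seconds.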
Heartbeats appear throughout distributed systems: load balancers health-check backend servers, cluster managers monitor worker nodes, database replicas confirm that the leader is alive, and service mesh sidecars report pod health.
False positives (declaring a healthy node dead due to network delay) are a common challenge. Techniques like adaptive timeouts, gossip protocols, and phi accrual failure detectors help reduce false positives in large-scale systems.
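As an illustration of how an accrual-style detector adapts to observed timing, here is a rough sketch of the phi calculation. Instead of a fixed timeout, it outputs a suspicion level that grows the longer a heartbeat is overdue relative to recent arrival history. Note the assumptions: this version models inter-arrival times as exponentially distributed, which is a simplification (the original phi accrual formulation fits a normal distribution), and the class name, window size, and threshold of 8 are illustrative choices rather than recommended values.

```python
import math
from collections import deque


class PhiAccrualDetector:
    """Simplified phi accrual failure detector (exponential approximation).

    Hypothetical example class; not a faithful copy of any library's implementation.
    """

    def __init__(self, window_size=100, threshold=8.0):
        self.intervals = deque(maxlen=window_size)  # recent inter-arrival times
        self.threshold = threshold                  # phi level at which a node is suspected
        self.last_arrival = None

    def heartbeat(self, now):
        """Record a heartbeat arrival at time `now` (seconds)."""
        if self.last_arrival is not None:
            self.intervals.append(now - self.last_arrival)
        self.last_arrival = now

    def phi(self, now):
        """Suspicion level: -log10 of the chance the heartbeat is merely late."""
        if not self.intervals:
            return 0.0                              # not enough history yet
        mean = sum(self.intervals) / len(self.intervals)
        if mean <= 0:
            return 0.0                              # degenerate history; avoid dividing by zero
        elapsed = now - self.last_arrival
        # P(later than elapsed) = exp(-elapsed / mean), so
        # phi = -log10(P) = elapsed / (mean * ln 10)
        return elapsed / (mean * math.log(10))

    def is_suspect(self, now):
        return self.phi(now) >= self.threshold
```

Because phi is scaled by the observed mean interval, a node that normally reports every 5 seconds is given more slack than one that reports every second, which is how this style of detector reduces false positives on slow or congested links.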