Latency

Also known as: response time, delay

Latency is the time delay between a user action (or request) and the system's response, typically measured in milliseconds. It is a critical non-functional requirement in system design that directly impacts user experience.

Latency encompasses the entire journey of a request: network transit time (client to server and back), processing time (server handling the request), and any waiting time (queuing, lock contention). Each hop in a distributed system adds latency.
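The end-to-end delay described above is what you actually measure in practice: wall-clock time from issuing the request to receiving the response. A minimal sketch in Python (the `handle_request` function is a hypothetical stand-in for real processing work):

```python
import time

def timed(fn, *args):
    """Measure wall-clock latency of a single call, in milliseconds."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed_ms = (time.perf_counter() - start) * 1000
    return result, elapsed_ms

# Hypothetical request handler; sleep stands in for processing + waiting time.
def handle_request(payload):
    time.sleep(0.005)  # ~5 ms of simulated work
    return payload.upper()

result, ms = timed(handle_request, "ping")
```

Note that this captures the whole journey (queuing, processing, and in a real client, network transit) as one number; separating the components requires instrumenting each stage.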

Common latency benchmarks: L1 cache reference (~1ns), RAM access (~100ns), SSD read (~100μs), network round trip within a data center (~0.5ms), cross-continent round trip (~150ms). Understanding these orders of magnitude helps make informed design decisions.
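The gap between these figures is easy to underestimate. A quick back-of-the-envelope calculation using the numbers above:

```python
# Latency figures from the text, expressed in nanoseconds.
L1_CACHE_NS = 1
RAM_NS = 100
SSD_READ_NS = 100_000             # ~100 microseconds
DC_ROUND_TRIP_NS = 500_000        # ~0.5 ms within a data center
CROSS_CONTINENT_NS = 150_000_000  # ~150 ms across continents

# How many RAM accesses fit in one cross-continent round trip?
ram_per_rtt = CROSS_CONTINENT_NS // RAM_NS  # 1,500,000

# How many in-data-center round trips fit in one cross-continent trip?
dc_per_rtt = CROSS_CONTINENT_NS // DC_ROUND_TRIP_NS  # 300
```

A single cross-continent round trip costs as much as 1.5 million RAM reads, which is why "avoid the network hop" is so often the first optimization.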

Strategies to reduce latency include caching (serve from memory instead of disk/network), CDNs (serve from geographically closer servers), connection pooling (avoid repeated handshake costs), async processing (return immediately, process in background), and data locality (co-locate data with compute).
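Caching, the first strategy above, can be sketched in a few lines. This is an illustrative example, not a production design: `get_user_profile` is a hypothetical function, and the sleep stands in for a slow database read.

```python
import functools
import time

@functools.lru_cache(maxsize=1024)
def get_user_profile(user_id):
    time.sleep(0.01)  # stand-in for a ~10 ms database query
    return {"id": user_id, "name": f"user-{user_id}"}

# First call pays the full cost; repeats are served from memory.
t0 = time.perf_counter(); get_user_profile(42); cold = time.perf_counter() - t0
t0 = time.perf_counter(); get_user_profile(42); warm = time.perf_counter() - t0
```

The same trade-off applies to every strategy listed: you pay with staleness (caching), infrastructure (CDNs), or weaker consistency (async processing) to cut milliseconds off the response.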

In system design interviews, latency requirements (e.g., "p99 latency under 200ms") drive key architectural decisions. Tail latency (p95, p99) matters more than average latency because even rare slow responses degrade user experience.
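Why the tail matters more than the average can be shown numerically. The sketch below uses a simple nearest-rank percentile and fabricated sample data for illustration:

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: smallest value >= p% of the samples."""
    ordered = sorted(samples)
    rank = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[rank]

# 100 requests: 98 fast responses and 2 slow outliers.
latencies_ms = [20] * 98 + [800, 950]

avg = sum(latencies_ms) / len(latencies_ms)  # ~37 ms: looks healthy
p95 = percentile(latencies_ms, 95)           # 20 ms: outliers still hidden
p99 = percentile(latencies_ms, 99)           # 800 ms: the tail is exposed
```

The average suggests the system meets a 200 ms target, while 2% of users are waiting nearly a second, which is exactly why SLOs are stated as "p99 under 200ms" rather than as a mean.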

Related Terms

Reliability
