Rate Limiting


Rate limiting controls the number of requests a client can make to a service within a given time window, protecting systems from abuse, preventing resource exhaustion, and ensuring fair usage across users.

Rate limiting is a defensive mechanism placed at API gateways, load balancers, or application endpoints. When a client exceeds the allowed request rate, the system responds with HTTP 429 (Too Many Requests) and typically includes a Retry-After header.
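The 429 response described above can be sketched as a small, framework-agnostic check. This is an illustrative sketch, not a specific gateway's API; the function name and parameters are assumptions for the example.

```python
import time

def check_rate_limit(request_count: int, limit: int, window_reset: float) -> tuple[int, dict]:
    """Decide whether to serve or reject a request.

    `request_count` is how many requests this client has already made in the
    current window; `window_reset` is the Unix time when the window resets.
    Names are illustrative, not from any particular framework.
    """
    if request_count >= limit:
        # Quota exceeded: reply 429 and tell the client when to retry.
        retry_after = max(0, int(window_reset - time.time()))
        return 429, {"Retry-After": str(retry_after)}
    return 200, {}
```

A gateway would call this before handing the request to the backend, returning the status and headers directly when the check fails.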

Common algorithms include:

- Fixed window: count requests per fixed time interval.
- Sliding window: smoother; counts requests over a rolling window.
- Token bucket: tokens replenish at a fixed rate, and each request consumes a token.
- Leaky bucket: requests are processed at a constant rate; excess is queued or dropped.
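Of these, the token bucket is the most commonly implemented. A minimal in-process sketch, assuming one bucket per client (class and parameter names are illustrative):

```python
import time

class TokenBucket:
    """Token bucket rate limiter: tokens refill at `rate` per second up to
    `capacity`, and each request consumes one token."""

    def __init__(self, capacity: float, rate: float):
        self.capacity = capacity
        self.rate = rate
        self.tokens = capacity          # start full, so bursts up to `capacity` pass
        self.last = time.monotonic()

    def allow(self) -> bool:
        now = time.monotonic()
        # Lazily replenish tokens for the time elapsed since the last call.
        self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The capacity controls burst size while the rate controls the sustained throughput, which is why token buckets are a common default: they tolerate short bursts without allowing a sustained overload.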

In distributed systems, rate limiting requires shared state (how many requests has this user made across all servers?). Solutions include centralized counters in Redis, distributed rate limiting with eventual consistency (each node tracks a local count and periodically synchronizes), or client-side rate limiting.

Rate limiting is essential in system design for protecting against DDoS attacks, preventing a single tenant from monopolizing shared resources, managing costs for expensive operations (e.g., AI API calls), and enforcing API usage tiers.
