Rate Limiting: Algorithms, Tradeoffs, and Design

Rate limiting caps how many requests a client can make in a time window to protect services from abuse, overload, and runaway costs. Core algorithms are token bucket, leaky bucket, fixed window, and sliding window. Distributed rate limiting typically uses Redis. Exceeded limits return HTTP 429 Too Many Requests with a Retry-After header.

Why Rate Limit

Rate limiting restricts the number of requests a user, IP, API key, or tenant can make in a given period. It protects backends from overload, defends against DoS and brute-force attacks, enforces fair usage across tenants, and controls cost (e.g., capping expensive LLM or third-party API calls).

When a limit is exceeded, servers respond with HTTP 429 Too Many Requests, usually with a Retry-After header and rate-limit headers (X-RateLimit-Limit, X-RateLimit-Remaining, X-RateLimit-Reset) so clients can back off. GitHub, Stripe, and Twitter/X all expose such headers.

Core Algorithms

Each algorithm trades burst tolerance against accuracy and memory. Token and leaky bucket shape flow rate; window-based counters bound counts per interval.

Token bucket: a bucket holds up to B tokens, refilled at rate R per second; each request consumes a token, requests with no token are rejected. Allows short bursts up to B while enforcing average rate R. Used by Stripe and AWS API Gateway.
Leaky bucket: requests enter a queue (bucket) and are processed at a fixed rate; overflow is dropped. Smooths bursts into a steady output but adds queueing delay.
Fixed window counter: count requests per fixed interval (e.g., 100/minute), reset each window. Simple but allows a 2x burst at the window boundary (90 requests at 0:59 + 100 at 1:00).
Sliding window log: store timestamps of each request and count those within the trailing window. Accurate but memory-heavy.
Sliding window counter: approximates the sliding log using weighted counts of the current and previous fixed windows, balancing accuracy and efficiency. Used by Cloudflare.

Algorithm Tradeoffs

For most APIs, token bucket is the pragmatic default: it allows reasonable bursts while bounding the long-run average, with O(1) state per client. Sliding window counter is the go-to when you need tighter accuracy without storing every timestamp.

Algorithm	Bursts	Accuracy	Memory
Token bucket	Allows up to B	Good	O(1) per key
Leaky bucket	Smooths out	Good	O(queue)
Fixed window	Boundary spike	Loose	O(1)
Sliding log	Strict	Exact	O(requests)
Sliding counter	Mostly strict	Approximate	O(1)

Distributed Rate Limiting

On a single server, an in-memory counter suffices. Across a fleet behind a load balancer, you need a shared store so the limit is global. Redis is the standard: atomic INCR with EXPIRE for fixed windows, or Lua scripts / the redis-cell module (token bucket via the CL.THROTTLE command) for atomicity without race conditions.

Challenges include the latency of a Redis round-trip on every request (mitigated by local approximations and periodic sync), clock skew across nodes, and hot keys for popular tenants. Many systems rate limit at the API gateway or edge (Kong, Envoy, Cloudflare, NGINX limit_req) so abusive traffic is dropped before reaching application servers.

ResuMax tailors your resume to each role, scores it like a recruiter, and preps you for interviews.

Practice with the interview coach

Frequently asked questions

Token bucket vs leaky bucket: what's the difference?

Token bucket allows bursts up to the bucket size while capping the average rate, accepting requests as long as tokens remain. Leaky bucket processes requests at a strictly constant rate via a queue, smoothing bursts but adding delay. Token bucket is more common for APIs.

Why does the fixed window algorithm allow bursts?

Because the counter resets at fixed boundaries, a client can send the full quota at the end of one window and again at the start of the next, effectively 2x the limit in a short span around the boundary. Sliding window algorithms fix this.

How do you rate limit across many servers?

Use a shared store like Redis with atomic operations (INCR + EXPIRE, Lua scripts, or redis-cell) so all servers see one global counter. To reduce per-request latency, combine local token buckets with periodic synchronization, or enforce limits at the gateway/edge.

What HTTP status code is used for rate limiting?

HTTP 429 Too Many Requests, typically with a Retry-After header telling the client how long to wait, plus X-RateLimit-Limit/Remaining/Reset headers to communicate the quota state.

All system design