Design a Task Queue
Design a distributed task queue for processing background jobs. Support delayed execution, retries with backoff, priority queues, dead letter queues, and exactly-once processing guarantees.
Functional Requirements
- Enqueue tasks with priority and optional delay
- Retry failed tasks with configurable policies
- Route permanently failed tasks to a dead letter queue
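As a rough illustration of the first requirement, the sketch below keeps two in-memory heaps: one for tasks that are ready to run and one for tasks still waiting out their delay. The class name `InMemoryTaskQueue`, the convention that a lower number means higher priority, and the explicit `now` argument are all assumptions made for the example. A real design would persist this state and distribute it across nodes.

```python
import heapq
import itertools


class InMemoryTaskQueue:
    """Single-process stand-in for the queue (hypothetical, not durable)."""

    def __init__(self):
        self._ready = []    # heap of (priority, seq, payload)
        self._delayed = []  # heap of (run_at, seq, priority, payload)
        self._seq = itertools.count()  # tie-breaker that preserves FIFO order

    def enqueue(self, payload, priority=1, delay=0.0, now=0.0):
        """Add a task; lower priority numbers are served first (assumed convention)."""
        seq = next(self._seq)
        if delay > 0:
            heapq.heappush(self._delayed, (now + delay, seq, priority, payload))
        else:
            heapq.heappush(self._ready, (priority, seq, payload))

    def dequeue(self, now):
        """Return the next runnable payload, or None if nothing is due yet."""
        # Promote every delayed task whose run time has arrived.
        while self._delayed and self._delayed[0][0] <= now:
            _, seq, priority, payload = heapq.heappop(self._delayed)
            heapq.heappush(self._ready, (priority, seq, payload))
        if not self._ready:
            return None
        return heapq.heappop(self._ready)[2]
```

Passing `now` explicitly keeps the sketch deterministic. A worker loop would typically pass `time.time()` instead.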
Non-Functional Requirements
- At-least-once delivery guarantee
- Sub-100ms scheduling delay for high-priority tasks
- Scale to 10K+ tasks per second
Frequently Asked Questions
What is the difference between at-least-once and exactly-once delivery?
At-least-once delivery guarantees that every task is processed, but a task may run more than once: if a worker crashes after finishing the work but before acknowledging it, the queue redelivers the task. Exactly-once processing means each task takes effect only once. It is typically achieved with idempotency keys: the task itself is still delivered at-least-once, but its side effects are deduplicated using a unique identifier.
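A minimal sketch of deduplicating side effects with an idempotency key follows. The `IdempotentExecutor` class and the in-memory dictionary are assumptions for illustration only. In a real system the key would live in a durable store, and recording the key would need to be atomic with the side effect (for example, in the same database transaction), otherwise a crash between the two steps can still produce a duplicate.

```python
from typing import Any, Callable, Dict


class IdempotentExecutor:
    """Runs a handler at most once per idempotency key (hypothetical sketch)."""

    def __init__(self) -> None:
        # Assumption: a dict stands in for a durable store with a unique key.
        self._results: Dict[str, Any] = {}

    def run(self, idempotency_key: str, handler: Callable[..., Any], *args: Any) -> Any:
        # A redelivered task finds its key and returns the stored result
        # without repeating the side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        result = handler(*args)
        self._results[idempotency_key] = result
        return result
```

With this in place, delivering the same task twice calls the handler only once; the second delivery gets the cached result.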
How do you implement task retries with exponential backoff?
After a failure, delay the retry by an increasing interval: e.g., 1s, 2s, 4s, 8s, up to a maximum. Add jitter (random variation) to prevent thundering herds when many tasks fail simultaneously. Track retry count per task and route to a dead letter queue after exceeding the maximum retries.
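The schedule above can be sketched as two small functions. The names `backoff_ceiling` and `backoff_delay`, the 60-second cap, and the choice of "full jitter" (a uniform draw between zero and the exponential ceiling) are assumptions for this example. Other jitter strategies fit the same description.

```python
import random


def backoff_ceiling(attempt, base=1.0, cap=60.0):
    """Un-jittered delay for a zero-based attempt: base * 2**attempt, capped."""
    return min(cap, base * (2 ** attempt))


def backoff_delay(attempt, base=1.0, cap=60.0, rng=random):
    """Full-jitter delay: a uniform draw in [0, backoff_ceiling(attempt)]."""
    return rng.uniform(0, backoff_ceiling(attempt, base, cap))
```

With the default `base=1.0`, attempts 0 through 3 have ceilings of 1, 2, 4, and 8 seconds. The random draw spreads retries out so that tasks failing at the same moment do not all retry at the same moment.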
What is a dead letter queue and when should you use one?
A dead letter queue (DLQ) stores tasks that have failed all retry attempts. Instead of losing failed tasks, the DLQ preserves them for debugging and manual intervention. Monitor DLQ depth as an operational metric: a growing DLQ indicates a systemic problem that retrying alone will not fix.
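One way to picture the routing decision is the sketch below. It uses in-memory deques in place of real queues, and the `Task` and `RetryRouter` names, the `last_error` field, and the default of three retries are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Task:
    task_id: str
    payload: str
    retry_count: int = 0
    last_error: str = ""


class RetryRouter:
    """Decides whether a failed task is retried or dead-lettered (sketch)."""

    def __init__(self, max_retries=3):
        self.max_retries = max_retries
        self.retry_queue = deque()
        self.dead_letter_queue = deque()

    def on_failure(self, task, error):
        task.retry_count += 1
        task.last_error = error  # kept so the DLQ entry is useful for debugging
        if task.retry_count > self.max_retries:
            self.dead_letter_queue.append(task)
            return "dead_letter"
        self.retry_queue.append(task)
        return "retry"

    def dlq_depth(self):
        """Value to export as the DLQ depth metric."""
        return len(self.dead_letter_queue)
```

Storing the last error alongside the task is a design choice in this sketch: it lets whoever inspects the DLQ see why the task failed without searching logs.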
How do you implement priority queues in a distributed system?
Use separate physical queues per priority level (high, normal, low) and have workers poll the high-priority queue first. Alternatively, use a single queue sorted by priority, though this is harder to scale. To prevent starvation, reserve a percentage of worker capacity for each level so that lower-priority tasks still make progress.
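The separate-queue approach with reserved capacity might look like the following sketch. The `PriorityPoller` class and the 6/3/1 split of poll slots are assumptions chosen for illustration. Each poll slot is reserved for one level, and a slot whose level has no work falls back to the highest-priority queue that does.

```python
from collections import deque
from itertools import cycle


class PriorityPoller:
    """One queue per level, with poll slots reserved per level (sketch)."""

    LEVELS = ("high", "normal", "low")

    def __init__(self, weights=None):
        # Assumed split: out of every 10 polls, 6 high, 3 normal, 1 low.
        weights = weights or {"high": 6, "normal": 3, "low": 1}
        self.queues = {level: deque() for level in self.LEVELS}
        schedule = [level for level in self.LEVELS for _ in range(weights[level])]
        self._slots = cycle(schedule)

    def enqueue(self, level, task):
        self.queues[level].append(task)

    def poll(self):
        """Serve the level that owns this slot, else the highest non-empty level."""
        reserved = next(self._slots)
        for level in (reserved,) + self.LEVELS:
            if self.queues[level]:
                return self.queues[level].popleft()
        return None
```

When every queue has work, low-priority tasks still get one poll in ten, and when the high-priority queue is empty its slots are not wasted.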