A Redis Job Queue Was Enough

#redis #infrastructure #distributed-systems #devops

When we started moving embeddings, chatbot replies, certificate generation, and translation jobs out of the request path and into background workers, the obvious choice (the one every architecture deck pushes) was Kafka. Durable log, replay, consumer groups, the whole story.

We didn’t pick it.

Numbers, not vibes

We measured. Peak load across all background work was roughly 40 jobs per second, with a long tail in job duration (LLM calls dominate the time). At peak, the backlog averaged under 500 jobs. Losing an occasional job is fine: if a certificate render fails, we re-queue it on the next request.
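For anyone who wants to run the same measurement, here is a minimal sketch of the arithmetic. The function names and the timestamp input format are illustrative assumptions, not our production code. The first function finds the busiest one-second bucket in a list of enqueue timestamps; the second applies Little's law (jobs in flight = arrival rate × mean job duration) to estimate how many concurrent worker slots a given rate needs.

```typescript
// Count the busiest one-second bucket in a list of enqueue timestamps.
// Input is milliseconds since the epoch (an assumed, hypothetical format).
function peakJobsPerSecond(enqueuedAtMs: number[]): number {
  const perSecond = new Map<number, number>();
  let peak = 0;
  for (const t of enqueuedAtMs) {
    const bucket = Math.floor(t / 1000);
    const count = (perSecond.get(bucket) ?? 0) + 1;
    perSecond.set(bucket, count);
    if (count > peak) peak = count;
  }
  return peak;
}

// Little's law: average jobs in flight = arrival rate * mean time per job.
// Rounds up because a fractional worker slot is not useful.
function workersNeeded(jobsPerSecond: number, meanJobSeconds: number): number {
  return Math.ceil(jobsPerSecond * meanJobSeconds);
}
```

As an illustration only: if the mean job took 2 seconds (a made-up figure, not one we measured), `workersNeeded(40, 2)` gives 80 concurrent slots, which is well within reach of a handful of worker processes.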

For that workload, Kafka is theatre. We don’t need replay. We don’t need to fan out to multiple consumer groups. We don’t need exactly-once semantics: at-least-once delivery is enough, because our jobs are idempotent and running one twice does no harm.
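To make "at-least-once plus idempotent" concrete, here is a minimal sketch of a handler that tolerates duplicate delivery. Everything in it is hypothetical: the `Job` shape, the `handleJob` name, and the in-memory `Set`, which stands in for whatever durable record (a database row, a Redis key) a real system would use.

```typescript
type Job = { id: string; payload: string };

// In-memory stand-in for a durable "already processed" record (assumption).
const completed = new Set<string>();

// Records each side effect so the example can show it happens only once.
const sideEffects: string[] = [];

function handleJob(job: Job): "done" | "skipped" {
  // A redelivered job is recognised by its id and skipped.
  if (completed.has(job.id)) return "skipped";

  // The actual work. In a real handler this write should itself be safe
  // to repeat (for example, an overwrite keyed by job id), because a crash
  // between this line and the next would cause the job to run again.
  sideEffects.push(`rendered:${job.payload}`);

  completed.add(job.id);
  return "done";
}
```

Delivering the same job twice, for example `handleJob({ id: "cert-1", payload: "alice" })` called two times, produces one side effect: the first call returns `"done"` and the second returns `"skipped"`.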

What we do need:

What Redis + BullMQ gave us

What we’d switch for

If we ever hit:

Then Kafka (or NATS JetStream, which we already run) earns its weight. Until then, the simplest thing that handles the load is the right thing.

The lesson

The gap between what a simple stack can handle and what most real companies actually need is wider than most architecture posts admit. Count your jobs per second before picking infrastructure.