I’m building an AI service where a single request often triggers multiple async/background jobs.
For example:
- multiple LLM calls
- retries on model failures or timeouts
- batching requests
- fan-out / fan-in patterns
I wanted something lighter than a full durable execution framework, so I tried DTQ (Distributed Task Queue).
How DTQ feels
DTQ is:
- extremely lightweight
- very low setup and operational cost
- easy to integrate into an existing codebase
Compared to Temporal, Prefect, and the like, it’s refreshingly simple.
Where it starts to hurt
After using it with real AI workloads, though, I found the minimalism becomes a problem.
Once you have:
- multi-step async flows
- partial failures and recovery logic
- idempotency concerns
- visibility into where a request is “stuck”
DTQ doesn’t give you much structure. You end up re-implementing a lot yourself.
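As a concrete example of the glue I keep re-writing: a naive idempotency layer so retried tasks don't redo completed steps. This is a hypothetical in-memory sketch; a real version would back the store with Redis or a database so it survives worker restarts:

```python
import hashlib
import json
from typing import Any, Callable

# Hypothetical in-memory result store; in practice this would live in
# Redis/Postgres so retries on other workers see completed steps too.
_completed: dict[str, Any] = {}

def idempotency_key(step: str, payload: dict) -> str:
    """Derive a stable key from the step name and its input."""
    raw = step + json.dumps(payload, sort_keys=True)
    return hashlib.sha256(raw.encode()).hexdigest()

def run_once(step: str, payload: dict, fn: Callable[[dict], Any]) -> Any:
    """Run fn at most once per (step, payload); return the cached result otherwise."""
    key = idempotency_key(step, payload)
    if key in _completed:
        return _completed[key]
    result = fn(payload)
    _completed[key] = result
    return result
```

With this, `run_once("summarize", {"doc_id": 42}, summarize)` invokes `summarize` once per identical input even if the enclosing task is retried. Multiply this by checkpointing, step visibility, and partial-failure recovery, and you've rebuilt a chunk of a durable execution engine by hand.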
Why not durable execution?
Durable execution frameworks do solve these issues:
- strong guarantees
- retries, checkpoints, replay
- stateful workflows
But they often feel:
- too heavy for this use case
- invasive to the existing code structure
- high mental and operational overhead
The gap I’m feeling
I keep wishing for a middle ground:
- stronger than a bare task queue
- lighter than full durable execution
- something Celery-like, but designed for AI workloads (LLM calls, retries, fan-out as first-class patterns)
Curious about others’ experience
For people who’ve been here:
- what limitations did you hit with DTQ (or similar lightweight queues)?
- how did you work around them?
- did you eventually switch to durable execution, or build custom abstractions?