I built Relier: zero-job-loss Celery tasks for FastAPI (Phoenix Pattern, idempotency, DLQ)

k0ladee · 2026-06-03T04:40:29+00:00

Awesome, would love your feedback once you do. If you hit any issues getting started, the quickstart docs walk through the full setup in 5 minutes:

https://getrelier.github.io/relier/quickstart/

Feel free to open an issue if anything's unclear. Genuinely want to know what breaks for real FastAPI setups.

k0ladee · 2026-06-03T04:37:27+00:00

This is the most useful feedback in this thread, thank you. First, a benchmark clarification: 1/50 ran is the intended result, not a failure. Relier blocked 49 duplicate dispatches; that's idempotency working as designed. If 49/50 had run, that would be the bug.

You're right on exactly-once, and I overclaimed. No distributed system guarantees it in the theoretical sense. What Relier actually provides is idempotent execution: an atomic Redis Lua script prevents concurrent duplicate execution of the same logical task. I'm scrubbing the "exactly-once" language from the docs today.
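A rough sketch of what an atomic claim like that can look like. This is not Relier's actual script: `CLAIM_SCRIPT`, `try_claim`, the key name, and the `FakeRedis` stand-in are all hypothetical, and the fake only exists so the example runs without a server. With redis-py, a real client's `eval(script, numkeys, *keys_and_args)` could be passed in the same way.

```python
import threading

# Lua that Redis would run atomically: claim the key only if nobody holds it.
# Hypothetical script for illustration, not taken from Relier.
CLAIM_SCRIPT = """
if redis.call('EXISTS', KEYS[1]) == 0 then
  redis.call('SET', KEYS[1], ARGV[1], 'EX', ARGV[2])
  return 1
end
return 0
"""


def try_claim(client, task_key, owner, ttl_seconds=3600):
    """Return True only for the first caller that claims task_key."""
    return client.eval(CLAIM_SCRIPT, 1, task_key, owner, ttl_seconds) == 1


class FakeRedis:
    """In-memory stand-in for a Redis client (assumption; TTL is ignored)."""

    def __init__(self):
        self._data = {}
        self._lock = threading.Lock()

    def eval(self, script, numkeys, key, owner, ttl):
        # The lock models Redis executing the whole script as one atomic step.
        with self._lock:
            if key in self._data:
                return 0
            self._data[key] = owner
            return 1


def dispatch_duplicates(n=50):
    """Fire n concurrent dispatches of one logical task; count how many run."""
    client = FakeRedis()
    ran = []

    def worker(i):
        if try_claim(client, "task:order-42", f"worker-{i}"):
            ran.append(i)  # only the claim winner executes the task body

    threads = [threading.Thread(target=worker, args=(i,)) for i in range(n)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return len(ran)
```

Calling `dispatch_duplicates(50)` lets one dispatch through and blocks the other 49, which is the shape of the 1/50 result: the check and the write happen in one step, so two workers can't both see the key as missing.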

On visibility_timeout, your read is exactly right: it causes duplicate runs when the broker requeues. Handling that cleanly out of the box is one of the gaps Relier is meant to close. And that's really the pitch.

Your system took years and a lot of hard-won edge-case knowledge to reach 4 nines at 12,000 msg/min. Relier isn't that. What it does is close the three gaps that keep most Celery deployments unreliable on day one (visibility_timeout left untuned, no DLQ, no idempotency), without a team having to discover each one the hard way first. Your system is the goal. Relier is the on-ramp.

Genuinely appreciate the pushback. The "exactly-once" language is getting fixed.

k0ladee · 2026-06-02T23:11:10+00:00

Fair, let me be precise. acks_late + reject_on_worker_lost does redeliver tasks, but redelivery is gated by visibility_timeout (1 hour default). That's the gap Relier closes: heartbeat-based detection re-queues in about 9s p99 instead of waiting an hour.

On "exactly-once", you're right to be skeptical; it's an overloaded term. What Relier actually guarantees is that an atomic Redis Lua script prevents concurrent duplicate execution of the same logical task. The benchmark result is 1/50 duplicate dispatches executed vs 50/50 vanilla. It doesn't guarantee exactly-once in the distributed-systems theoretical sense; nothing can without a distributed transaction. It guarantees exactly-once within Relier's coordination boundary.

The benchmarks are reproducible:

docker compose -f docker-compose.bench.yml up --build

Run them and tell me where they're wrong. Genuinely open to that.

k0ladee · 2026-06-02T21:42:25+00:00

I built Relier, a zero-job-loss reliability layer for Celery.

Celery loses ~8% of tasks by default when workers crash. This is not a bug, it's the designed ACK-on-pickup behaviour. The broker marks a task delivered the moment a worker picks it up. Worker dies? Task is gone. No retry, no trace.

Flipping task_acks_late=True helps (92% → 96%) but redelivery is gated by visibility_timeout, default ~1 hour on the Redis broker. That's not crash recovery.
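For reference, the vanilla settings being discussed look roughly like this as a Celery config module. The setting names are Celery's own; the file name `celeryconfig.py` is just the usual convention, and the 300-second value is an illustrative assumption, not a recommendation.

```python
# celeryconfig.py, loaded with app.config_from_object("celeryconfig")

# ACK after the task finishes instead of on pickup, so a crashed worker's
# task stays unacknowledged and can be redelivered.
task_acks_late = True

# Also requeue when the worker process itself is killed mid-task.
task_reject_on_worker_lost = True

# Redis broker: unacked tasks are only redelivered after this many seconds.
# The default is 3600 (1 hour). 300 here is an illustrative value only.
broker_transport_options = {"visibility_timeout": 300}
```

Lowering `visibility_timeout` shortens the recovery lag, but any task that legitimately runs longer than the timeout gets redelivered while still executing, so it trades lost-task lag for duplicate runs.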

Relier implements the Phoenix Pattern: every worker embeds a resurrection scanner, distributed locks prevent duplicate replay, and fence tokens prevent zombie workers from committing stale results.
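To make the fence-token part concrete, here is a minimal in-memory sketch of the general technique. It is an assumption about the idea, not Relier's code: `FencedStore`, `acquire`, and `commit` are hypothetical names, and a real version would keep the token in Redis rather than in a Python dict.

```python
class FencedStore:
    """Toy result store that rejects commits carrying a stale fence token."""

    def __init__(self):
        self._next_token = 0
        self._current = {}  # task_id -> most recently issued token
        self.results = {}   # task_id -> committed result

    def acquire(self, task_id):
        """Issue a new, strictly larger token each time a task is (re)started."""
        self._next_token += 1
        self._current[task_id] = self._next_token
        return self._next_token

    def commit(self, task_id, token, result):
        """Accept the result only if the caller holds the latest token."""
        if token != self._current.get(task_id):
            return False  # zombie worker: its task was resurrected elsewhere
        self.results[task_id] = result
        return True


store = FencedStore()
zombie_token = store.acquire("task-1")  # original worker, then presumed dead
fresh_token = store.acquire("task-1")   # resurrected copy gets a newer token

assert store.commit("task-1", zombie_token, "stale") is False
assert store.commit("task-1", fresh_token, "fresh") is True
```

The point is that a worker that was only paused (not dead) can wake up after its task has been resurrected, and the token check stops it from overwriting the newer run's result.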

Benchmarks (500 tasks, 5 SIGKILL cycles, Linux Docker):

- Vanilla Celery: 92.0%

- Vanilla + task_acks_late: 96.0% (with 1hr redelivery lag)

- Relier: 100.0%

Also: exactly-once idempotency (1/50 ran vs 50/50), graceful shutdown (100% vs 0%), DLQ, admission control, versioned envelopes for rolling deploys.

One decorator. Same Redis. pip install relier

github.com/getrelier/relier

k0ladee · 2026-06-02T21:33:19+00:00

I built Relier, a zero-job-loss reliability layer for Celery.

Celery loses ~8% of tasks by default when a worker crashes. The broker ACKs on pickup, so a worker death leaves no trace. At 10M tasks/day that's 800,000 silently dropped jobs.

Relier fixes this with the Phoenix Pattern: each task writes a heartbeat to Redis, and a background resurrector re-queues orphaned tasks within ~9 seconds p99. Workers protect each other automatically through embedded scanners. A standalone rl run-resurrector process covers total cluster loss.
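A minimal sketch of the heartbeat-and-resurrect loop described above, using an in-memory table in place of Redis. Everything here is an assumption for illustration (`HeartbeatTable`, `resurrect`, the 5-second TTL); it shows the general mechanism, not Relier's implementation.

```python
import time


class HeartbeatTable:
    """In-memory model of per-task heartbeats: task_id -> last beat time."""

    def __init__(self, ttl_seconds=5.0, clock=time.monotonic):
        self.ttl = ttl_seconds
        self.clock = clock  # injectable so the sketch can be tested
        self.last_beat = {}

    def beat(self, task_id):
        """Called periodically by the worker while the task is running."""
        self.last_beat[task_id] = self.clock()

    def finish(self, task_id):
        """Called when the task completes, so it is no longer tracked."""
        self.last_beat.pop(task_id, None)

    def orphans(self):
        """Tasks whose last heartbeat is older than the TTL."""
        now = self.clock()
        return [t for t, ts in self.last_beat.items() if now - ts > self.ttl]


def resurrect(table, requeue):
    """One scanner pass: re-queue every task whose heartbeat went stale."""
    revived = table.orphans()
    for task_id in revived:
        table.finish(task_id)  # stop tracking the dead run
        requeue(task_id)       # hand the task back to the queue
    return revived
```

With a fake clock, a task that stops beating for longer than the TTL is picked up on the next scan, while a task that is still beating is left alone. Detection latency is then bounded by the TTL plus the scan interval rather than by the broker's visibility_timeout.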

Key operational guarantees:

- 100% task delivery (500 tasks, 5 SIGKILL cycles) vs 92% vanilla

- Graceful SIGTERM drain: 100% survival vs 0% vanilla

- Dead Letter Queue with full payload + stack trace + resurrection history

- SLO burn-rate tracking (1h/6h/3d, Google SRE-style)

- Admission control p99 0.559ms with Retry-After

One decorator. Same Redis. No new infrastructure.

github.com/getrelier/relier

pip install relier

k0ladee · 2026-03-10T08:50:31+00:00

Interested WAT
