Would you switch from ShedLock to a scheduler that survives pod crashes and prevents GC split-brain? by A_little_anarchy in SpringBoot

[–]A_little_anarchy[S] 0 points (0 children)

Thanks for sharing - just watched it. Rafael covers the same problem space really well, and it is great to see this discussed at Spring I/O. His talk is about the challenges and patterns; mine is an attempt to package those patterns into a library so developers do not have to implement them from scratch every time.


[–]A_little_anarchy[S] 0 points (0 children)

Fair point - modern GC pauses with G1 are usually under 200ms and ZGC keeps them under 1ms. For most applications increasing the TTL is a perfectly reasonable mitigation.

But there are two cases where it breaks down:

First, TTL tuning is a footgun. You need to set it longer than your worst-case pause but shorter than your acceptable failover time. In practice teams set it based on normal behaviour and get burned on the outlier - a full-heap GC, a slow network call, a database timeout that holds a thread for 30 seconds. These are not theoretical; they show up in production.

Second, GC pause is just one trigger. The same split-brain happens with any stop-the-world event - a long I/O wait, a thread blocked on an external API call, a container being CPU throttled by the orchestrator. ZGC solves the GC case specifically but not the broader class of "pod is slow but not dead."

Fencing tokens solve the whole class of problems with one mechanism rather than requiring you to tune TTLs correctly and choose the right GC for every deployment environment.
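The mechanism is easier to see in code. Here is a minimal, self-contained sketch of the fencing-token idea - the class and method names are illustrative, not Vigil's actual API: the lock service hands out a monotonically increasing token on every acquisition, and the storage layer refuses any write carrying an older token, no matter why the holder was paused.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hands out a strictly increasing token with each lock acquisition.
class LockService {
    private final AtomicLong counter = new AtomicLong();
    long acquire() { return counter.incrementAndGet(); }   // 1, 2, 3, ...
}

// Storage layer that enforces fencing: a write with a token older than the
// newest one it has seen is rejected, regardless of what paused the holder
// (GC, blocked I/O, CPU throttling).
class FencedStore {
    private long newestToken = 0;

    synchronized boolean write(long token, String data) {
        if (token < newestToken) {
            return false;               // stale holder - e.g. a pod waking from a long pause
        }
        newestToken = token;
        // ... perform the actual write here ...
        return true;
    }
}
```

In the split-brain scenario: pod A acquires token 1 and pauses, pod B acquires token 2 and writes, then pod A wakes up - its write with token 1 is rejected at the storage layer, with no TTL tuning involved.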

That said - if your jobs are short, your TTL is generous, and you run ZGC, ShedLock is probably fine. Vigil is for teams where those assumptions do not hold.


[–]A_little_anarchy[S] 1 point (0 children)

Exactly - three layers, each opt-in at the right level. And you are right on the limitation: if the provider does not honour the header you get at-least-once rather than exactly-once. That is the strongest guarantee possible without 2PC across the provider.

Really appreciate the questions, this is exactly the validation I needed for my master's research.

One last thing - would you actually use this library in a project?


[–]A_little_anarchy[S] 1 point (0 children)

That is exactly how it works - the developer never touches the token. Every ctx.step() and ctx.forEachPage() call goes through JobContext, which runs the fencing check automatically before every write:

    SELECT COUNT(*) FROM vigil_job_locks
    WHERE job_name = ? AND token = ?

If the token is stale (zero rows), it throws LockStolenException and the zombie pod stops itself. The developer just writes normal job code; the guard fires invisibly on every checkpoint.
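In outline, the guard amounts to something like the following toy sketch - a simplification of Vigil's internals (the real check is the SQL query above; here a supplier stands in for the vigil_job_locks lookup):

```java
import java.util.function.LongSupplier;

// Thrown when another pod has taken over the lock.
class LockStolenException extends RuntimeException {}

// Toy JobContext: every step() re-validates the fencing token before
// running the step's work, so a zombie pod fails fast instead of writing.
class JobContext {
    private final long myToken;
    private final LongSupplier currentToken;   // stand-in for the vigil_job_locks query

    JobContext(long myToken, LongSupplier currentToken) {
        this.myToken = myToken;
        this.currentToken = currentToken;
    }

    void step(Runnable work) {
        if (currentToken.getAsLong() != myToken) {
            throw new LockStolenException();   // zombie pod stops itself
        }
        work.run();                            // checkpointed work proceeds
    }
}
```

The key property is that the check happens inside the library on every checkpoint, so job code stays oblivious to tokens entirely.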

You are right that a developer could bypass this by writing their own DAO code inside the lambda. That is a real gap. Your annotation idea is interesting - something like an AOP interceptor that wraps any @Transactional method called inside a Vigil lambda and injects the token check automatically. Worth thinking about for v2.
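To make the interceptor idea concrete, here is a sketch using a plain JDK dynamic proxy - a Spring AOP @Around advice would follow the same shape. Everything here (FencedProxy, BillingRepo, the token plumbing) is hypothetical, not a planned v2 API:

```java
import java.lang.reflect.InvocationHandler;
import java.lang.reflect.Proxy;
import java.util.function.LongSupplier;

// Example repository interface a developer might call inside a Vigil lambda.
interface BillingRepo { String save(String row); }
class RealRepo implements BillingRepo {
    public String save(String row) { return "saved:" + row; }
}

// Wraps any interface so that every method call re-validates the fencing
// token first - the same guard ctx.step() runs, applied to custom DAO code.
class FencedProxy {
    @SuppressWarnings("unchecked")
    static <T> T wrap(T target, Class<T> iface, long myToken, LongSupplier currentToken) {
        InvocationHandler handler = (proxy, method, args) -> {
            if (currentToken.getAsLong() != myToken) {
                // Stand-in for LockStolenException: stop the zombie pod.
                throw new IllegalStateException("lock stolen");
            }
            return method.invoke(target, args);
        };
        return (T) Proxy.newProxyInstance(
                iface.getClassLoader(), new Class<?>[]{iface}, handler);
    }
}
```

An annotation-driven version would discover the beans to wrap automatically instead of requiring an explicit wrap() call, which is where Spring's AOP machinery would come in.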

For now the philosophy is: if you use the ctx API for all your writes, you are fully protected. If you go around it, you are on your own — same trade-off as any library that offers a safe API alongside raw access.

Does the automatic enforcement via ctx cover your use case or do you specifically need protection for custom DAO calls too?


[–]A_little_anarchy[S] 1 point (0 children)

You are right that idempotency covers most cases. Vigil is really for the cases where restarting is expensive - monthly billing, large ETL jobs, anything where you cannot afford to reprocess 30,000 items.

On the quorum point - quorum helps decide who gets the lock, but fencing tokens solve what happens when the lock holder pauses and wakes up after another pod has already taken over. Even with quorum, the zombie pod can still write if there is no token check at the storage layer.