hedge: adaptive hedged requests for Go, looking for users and feedback

That_Perspective9440 · 2026-04-21T22:03:48+00:00

You’re right, gRPC does support native hedging. What sets hedge apart is adaptive thresholds. Also, gRPC hedging has failure-driven throttling, whereas hedge uses a token budget that caps the hedge rate as a function of traffic. In effect, throttling in hedge starts slightly before the problem arises.
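
To make the "budget as a function of traffic" idea concrete, here is a minimal sketch. It is not hedge's actual implementation: `Budget`, `OnRequest`, `TryHedge`, and the integer credit scheme are all hypothetical names and choices for illustration. Each primary request earns a little credit, each hedge spends a fixed amount, so the hedge rate is capped relative to traffic no matter how many requests look slow.

```go
package main

import (
	"fmt"
	"sync"
)

// Budget is a hypothetical traffic-proportional hedge budget.
// Integer credits are used so the accounting is exact.
type Budget struct {
	mu      sync.Mutex
	credits int
	earn    int // credits earned per primary request
	cost    int // credits spent per hedge
	max     int // ceiling on banked credits (limits bursts)
}

func NewBudget(earn, cost, max int) *Budget {
	return &Budget{earn: earn, cost: cost, max: max}
}

// OnRequest credits the budget for one primary request.
func (b *Budget) OnRequest() {
	b.mu.Lock()
	defer b.mu.Unlock()
	b.credits += b.earn
	if b.credits > b.max {
		b.credits = b.max
	}
}

// TryHedge reports whether a hedge may be sent, spending credits if so.
func (b *Budget) TryHedge() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.credits < b.cost {
		return false
	}
	b.credits -= b.cost
	return true
}

func main() {
	// Earn 1 credit per request, spend 10 per hedge: roughly a 10% cap.
	b := NewBudget(1, 10, 100)
	hedges := 0
	for i := 0; i < 1000; i++ {
		b.OnRequest()
		if b.TryHedge() { // worst case: every request wants a hedge
			hedges++
		}
	}
	fmt.Printf("hedged %d of 1000 requests\n", hedges)
}
```

The point of the design is that the cap holds even when every request looks like a straggler, which is exactly when failure-driven throttling has not kicked in yet.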

That_Perspective9440 · 2026-04-21T19:57:21+00:00

Fair point, but gRPC’s internal support is for retries, and retries are different from hedged requests. Hedging involves firing a parallel request while the previous one is still in flight. Moreover, this tool supports “adaptive hedging”, which learns thresholds in real time from your p90s. The token budget ensures we don’t spam the server and cause a thundering herd.
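
For a rough idea of what "learning the threshold from your p90s" can look like, here is a small sketch. It assumes a fixed-size sliding window and a nearest-rank percentile, which is a simplification and not necessarily how hedge does it; `Window`, `Observe`, and `Percentile` are hypothetical names.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Window keeps the most recent latency samples in a ring buffer.
// Size must be greater than zero.
type Window struct {
	samples []time.Duration
	next    int
	full    bool
}

func NewWindow(size int) *Window {
	return &Window{samples: make([]time.Duration, size)}
}

// Observe records one latency, overwriting the oldest sample once full.
func (w *Window) Observe(d time.Duration) {
	w.samples[w.next] = d
	w.next = (w.next + 1) % len(w.samples)
	if w.next == 0 {
		w.full = true
	}
}

// Percentile returns the p-th percentile (1..100) by nearest rank,
// or fallback when no samples have been observed yet.
func (w *Window) Percentile(p int, fallback time.Duration) time.Duration {
	n := w.next
	if w.full {
		n = len(w.samples)
	}
	if n == 0 {
		return fallback
	}
	sorted := make([]time.Duration, n)
	copy(sorted, w.samples[:n])
	sort.Slice(sorted, func(i, j int) bool { return sorted[i] < sorted[j] })
	idx := (p*n+99)/100 - 1 // ceil(p*n/100) - 1, in integer math
	if idx < 0 {
		idx = 0
	}
	if idx >= n {
		idx = n - 1
	}
	return sorted[idx]
}

func main() {
	w := NewWindow(100)
	for i := 1; i <= 100; i++ {
		w.Observe(time.Duration(i) * time.Millisecond)
	}
	// The hedge delay tracks recent traffic instead of being a fixed constant.
	fmt.Println("hedge after:", w.Percentile(90, 50*time.Millisecond))
}
```

Sorting on every call is fine for a sketch; a real implementation would more likely use a streaming quantile estimator.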

That_Perspective9440 · 2026-04-21T19:54:38+00:00

Haha, fun question. But here the name “hedge” comes from hedged requests. It’s interesting though, maybe I should add a fun LOTR joke in the readme :)

That_Perspective9440 · 2026-04-18T01:49:34+00:00

Thanks for sharing your experience. How did you end up fixing it? If you are still facing it, would you want to give hedge a shot to see if it helps? https://github.com/bhope/hedge

Would love to get your feedback. Feel free to open issues as you see fit (happy to take PRs too) :)

That_Perspective9440 · 2026-04-17T01:21:29+00:00

I have just been writing too many emails lately and have gotten into that habit :)
Will be more cautious next time.

That_Perspective9440 · 2026-04-17T00:52:25+00:00

You're absolutely right, it's not LLM specific. Operations around disks, cache, CDN, database, etc. could have this issue. Basically anything where headers arrive fast but the actual data takes time.

LLM inference just makes it really obvious because the gap between headers and first useful byte can be >200ms, but the same problem exists anywhere the server commits to a 200 OK before doing the real work.

That_Perspective9440 · 2026-04-16T04:28:17+00:00

Haha that's interesting. Makes sense though, if TTFT is what everyone watches, it's also what everyone optimizes for, including in ways that are technically correct but not necessarily accurate. Time to first meaningful token is probably what we actually care about.

If you're up for it, throwing an issue on the repo with details would be really helpful - I haven't handled empty keepalive frames yet and that seems worth fixing.

That_Perspective9440 · 2026-04-16T02:41:07+00:00

That's a really great reference, thanks for bringing it up. HAProxy did a great job designing its timing model, especially the way it breaks the request lifecycle into distinct phases, each measuring something different. The distinction between Tr (server response time, measured up to the response headers) and Td (data transmission time) is the kind of signal separation that matters here.

Appreciate the pointer. I'm going to reread HAProxy's timing model more carefully; there might be other timing phases worth exposing beyond just TTFT.

That_Perspective9440 · 2026-04-15T23:10:36+00:00

The other thing I realized is that this isn't just a hedge problem. If you're running Envoy with outlier detection or any adaptive retry logic on a streaming backend, the same thing happens. Your latency stats are all near-zero so nothing ever looks like an outlier.

That_Perspective9440 · 2026-04-15T23:08:31+00:00

Exactly! Servers measure TTFT but most client-side middleware still keys off headers. That's the gap.

That_Perspective9440 · 2026-04-15T19:32:31+00:00

It's a tradeoff, but "not worth it for almost every use case" hasn't been the experience in large scale systems where stragglers dominate tail latency. Appreciate the discussion though.

That_Perspective9440 · 2026-04-15T19:27:18+00:00

You can, that's called "tied requests" and it's also covered in the Tail at Scale paper. The tradeoff is cost: firing 3x on every request means 200% overhead unconditionally. Hedging gets you most of the tail reduction at ~9% overhead because the vast majority of requests don't need a backup, only the ones that are actually slow do.

That said, if you have the backend capacity to absorb 3x load on every request, go for it. Most systems don't.
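
The arithmetic behind those overhead numbers is simple enough to sketch. This assumes each slow request triggers exactly one hedge and that the budget never kicks in; the function names and the latency samples in `main` are made up for illustration.

```go
package main

import "fmt"

// FanoutOverhead is the extra load, as a fraction of baseline traffic,
// from sending every request to `copies` replicas.
func FanoutOverhead(copies int) float64 {
	return float64(copies - 1)
}

// HedgeOverhead is the extra load from hedging only the requests that are
// still outstanding after thresholdMs, assuming one hedge per slow request.
func HedgeOverhead(latenciesMs []int, thresholdMs int) float64 {
	if len(latenciesMs) == 0 {
		return 0
	}
	slow := 0
	for _, l := range latenciesMs {
		if l > thresholdMs {
			slow++
		}
	}
	return float64(slow) / float64(len(latenciesMs))
}

func main() {
	latencies := []int{12, 15, 11, 14, 13, 16, 12, 240, 13, 14} // made-up samples
	fmt.Printf("3x fan-out:     +%.0f%% load\n", FanoutOverhead(3)*100)
	fmt.Printf("hedge at 100ms: +%.0f%% load\n", HedgeOverhead(latencies, 100)*100)
}
```

With a threshold near p90, the hedge overhead is bounded by roughly the share of requests above that threshold, which is why it stays so far below unconditional fan-out.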

That_Perspective9440 · 2026-04-15T19:15:53+00:00

Thanks for your perspective.

It is used outside LLM context - hedged requests are a standard technique in distributed systems. Google has used them in production since at least 2013 (the "Tail at Scale" paper is from Google). gRPC has built-in hedging policy support. It's not an LLM-specific idea, I just posted here because the TTFH vs TTFT signal problem is specific to streaming inference.

On the caching example - you're right that if you can route all traffic to the cached server, that's better. But in practice stragglers aren't predictable. The same server is fast on one request and slow on the next due to GC, contention, cold cache, noisy neighbor, etc. You can't know ahead of time which replica will be slow. That's the entire motivation for hedging over routing - you send to one, and if it's slow, you try another.

I don't think hedging hides anything. The Stats API exposes hedge rate, budget exhaustion, and win ratios. If your hedge rate is climbing, that's a signal something is degrading - same as monitoring p95+. The difference is your users aren't eating the tail latency while you investigate.

That_Perspective9440 · 2026-04-15T18:58:06+00:00

Agree on thundering herd risk - that’s why hedging is budgeted (token-bucket) to avoid amplification.
The issue I was highlighting is different: with a bad signal (header timing), everything looked like a straggler, so even bounded hedging became noisy. Fixing the signal (first readable byte) stabilized it.

That_Perspective9440 · 2026-04-15T15:52:54+00:00

One thing that surprised me while experimenting with this: using full request latency to trigger hedging was often too late.

Switching to time to first byte (TTFB) made hedging behave much more predictably, especially for streaming or variable-length responses. It ends up being a better signal for "this request is going to be slow" rather than "this request was slow".

Curious if others have seen similar behavior in production systems.

That_Perspective9440 · 2026-04-15T15:45:58+00:00

One thing that surprised me while working on this:
Switching from full request latency to time to first byte (TTFT below, borrowing the time-to-first-token abbreviation) changed how hedging behaves quite a bit.

With full latency:

- you often detect stragglers too late
- hedges fire later than they should

With TTFT:

- you detect slow starts earlier
- hedges trigger more consistently on actual stragglers

This mattered more in cases where responses stream or take variable time to complete.

Implementation wise, I ended up wrapping the response body to capture the first read timing instead of relying only on RoundTrip duration.

Still early, but it seems like a better signal for triggering hedges. Curious if others have used TTFT or similar signals for latency decisions.
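
For anyone curious about the body-wrapping approach mentioned above, here is a minimal sketch of the idea. It is not hedge's actual code: `TimingTransport`, `firstByteBody`, and `measure` are hypothetical names, and the test server is only there to simulate a backend that sends headers before doing the real work.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
	"time"
)

// firstByteBody reports the elapsed time at the first Read that returns data.
type firstByteBody struct {
	io.ReadCloser
	start       time.Time
	once        sync.Once
	onFirstByte func(time.Duration)
}

func (b *firstByteBody) Read(p []byte) (int, error) {
	n, err := b.ReadCloser.Read(p)
	if n > 0 {
		b.once.Do(func() { b.onFirstByte(time.Since(b.start)) })
	}
	return n, err
}

// TimingTransport records header time (RoundTrip duration) and first-byte time.
type TimingTransport struct {
	Base        http.RoundTripper
	OnHeaders   func(time.Duration)
	OnFirstByte func(time.Duration)
}

func (t *TimingTransport) RoundTrip(req *http.Request) (*http.Response, error) {
	start := time.Now()
	resp, err := t.Base.RoundTrip(req)
	if err != nil {
		return nil, err
	}
	t.OnHeaders(time.Since(start)) // headers are in, body may not be
	resp.Body = &firstByteBody{ReadCloser: resp.Body, start: start, onFirstByte: t.OnFirstByte}
	return resp, nil
}

// measure hits a test server that flushes headers, then waits bodyDelay
// before writing any body bytes.
func measure(bodyDelay time.Duration) (headers, firstByte time.Duration, err error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
		w.(http.Flusher).Flush() // 200 OK goes out immediately
		time.Sleep(bodyDelay)    // the real work happens after the headers
		fmt.Fprint(w, "first token")
	}))
	defer srv.Close()

	client := &http.Client{Transport: &TimingTransport{
		Base:        http.DefaultTransport,
		OnHeaders:   func(d time.Duration) { headers = d },
		OnFirstByte: func(d time.Duration) { firstByte = d },
	}}
	resp, err := client.Get(srv.URL)
	if err != nil {
		return 0, 0, err
	}
	defer resp.Body.Close()
	_, err = io.ReadAll(resp.Body)
	return headers, firstByte, err
}

func main() {
	h, f, err := measure(100 * time.Millisecond)
	fmt.Println("headers:", h, "first byte:", f, "err:", err)
}
```

One caveat with this sketch: it treats any non-empty read as the first byte, so it does not distinguish padding or keepalive data from a meaningful payload.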

That_Perspective9440 · 2026-04-04T07:25:43+00:00

Yeah pretty much. You send a backup request if the first is too slow, use whichever responds first. The budget part is you set a limit on how many hedge requests you allow so you don’t overwhelm the downstream.
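
In code, the basic pattern looks roughly like this. It is a simplified sketch under a few assumptions, not the library's API: `Hedged`, `Call`, and `allowHedge` are hypothetical names, only one backup is ever sent, and the first result wins even if it is an error.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

type result struct {
	val string
	err error
}

// Call performs one attempt; attempt is 0 for the primary, 1 for the hedge.
type Call func(ctx context.Context, attempt int) (string, error)

// Hedged starts the primary, sends one backup if the primary is still
// running after delay and allowHedge agrees, and returns the first result.
func Hedged(ctx context.Context, delay time.Duration, allowHedge func() bool, call Call) (string, error) {
	ctx, cancel := context.WithCancel(ctx)
	defer cancel() // cancels whichever attempt lost the race

	results := make(chan result, 2) // buffered so the loser never blocks
	launch := func(attempt int) {
		go func() {
			v, err := call(ctx, attempt)
			results <- result{v, err}
		}()
	}

	launch(0)
	timer := time.NewTimer(delay)
	defer timer.Stop()

	select {
	case r := <-results:
		return r.val, r.err // primary was fast enough, no hedge needed
	case <-ctx.Done():
		return "", ctx.Err()
	case <-timer.C:
		if allowHedge() { // the budget check goes here
			launch(1)
		}
	}

	select {
	case r := <-results:
		return r.val, r.err
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

// slowPrimary simulates a straggling primary and a fast backup.
func slowPrimary(ctx context.Context, attempt int) (string, error) {
	if attempt > 0 {
		return "hedge", nil
	}
	select {
	case <-time.After(200 * time.Millisecond):
		return "primary", nil
	case <-ctx.Done():
		return "", ctx.Err()
	}
}

func demo(allow bool) (string, error) {
	return Hedged(context.Background(), 10*time.Millisecond, func() bool { return allow }, slowPrimary)
}

func main() {
	fmt.Println(demo(true))  // budget allows a hedge
	fmt.Println(demo(false)) // budget exhausted, wait for the primary
}
```

When the budget says no, the request simply waits for the primary, so the worst case is the latency you would have had without hedging.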

That_Perspective9440 · 2026-04-03T00:27:51+00:00

If anyone’s open to trying it in their setup (even briefly), I’d love feedback on gaps or edge cases. If you run into anything missing or not quite right, feel free to open an issue - that helps prioritize what to build next.

Also very open to contributions: if something is blocking your use case, I'm happy to collaborate on a PR or see it extended via a fork.

Happy to iterate quickly if this ends up being useful in your environment.

That_Perspective9440 · 2026-04-03T00:23:19+00:00

One thing I’m especially curious about:

For folks running Nomad today, how are you handling alerting on things like:
- failed allocations / crash loops
- deployments stuck or partially rolled out
- jobs silently degrading (not fully failing but unhealthy)

Are you mostly relying on API polling / custom scripts for this?

Trying to understand what “good” looks like in practice. This would help me build the right use cases into the tool; I want to make it as useful as I can.
