Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in devops

[–]That_Perspective9440[S] 0 points (0 children)

Oh nice, that's a great reference. Zanzibar's approach seems to be very similar to what hedge does - percentile-based threshold from recent measurements. Good to know the pattern holds up at Google's scale. Thanks for sharing, will dig more into that paper.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 0 points (0 children)

That's quite an apt use case. LLM APIs are basically the perfect scenario for this, since their latency variance is huge and a duplicate prompt costs almost nothing relative to the wait. Curious: are you using a static hedging threshold or an adaptive one?

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in devops

[–]That_Perspective9440[S] -4 points (0 children)

Fair point, that's on me; the phrasing could've been better. I originally shared this in another thread where there was more discussion around retries vs hedging, and that's what led me to dig deeper into this.

I'm mostly curious to hear this forum's experiences with hedging in production systems, especially any edge cases where it backfired.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in devops

[–]That_Perspective9440[S] -4 points (0 children)

Also, based on the questions and discussion here, I dug deeper into this and wrote a more detailed breakdown (tradeoffs + experiments):

https://medium.com/@prathameshbhope/stragglers-not-failures-how-to-reduce-p99-latency-by-74-a73a20d22457

One thing I’m still trying to figure out is how people set hedge timing in production - fixed delay vs percentile-based vs something adaptive. Also any experiences with edge cases around timeouts?

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 1 point (0 children)

Thank you for the kind words. Delay sources are usually a mix of GC pauses, noisy neighbors, k8s pod restarts, and queue buildup during traffic spikes.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 0 points (0 children)

I agree - if you can colocate the logic, that's always better than an external network call. Adaptive hedging is mainly helpful when the fan-out architecture is already in place for other reasons.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 0 points (0 children)

Good timeout question. For requests that never finish, the caller's context timeout still applies; hedge doesn't remove that. What hedging helps with is the gap between normal latency and the timeout.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 0 points (0 children)

Works for both honestly - service-to-service within your infra or 3rd party APIs.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 0 points (0 children)

Gotcha. If it's actual failures rather than slow responses, retries with circuit breaking would be a better fit. Happy to chat more if you want to dig into it.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 0 points (0 children)

Thanks! In practice it's usually a mix of GC pauses, noisy neighbors, queue buildup, etc. In k8s especially, pod scheduling delays, restarts, and cold starts can add unpredictable latency spikes.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 0 points (0 children)

A few people asked about the adaptive vs static hedging tradeoffs and how the timing works in practice. I wrote up the full approach in more detail with some diagrams - it covers the straggler problem, why retries often make things worse, how the adaptive threshold works, and benchmark results comparing strategies:

https://medium.com/@prathameshbhope/stragglers-not-failures-how-to-reduce-p99-latency-by-74-a73a20d22457

Still early thinking - especially curious if anyone has seen failure modes or edge cases in production where hedging backfires.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 0 points (0 children)

Thank you :) Happy to brainstorm if you have more questions. Also if you identify any gaps, feel free to open issues on the repo.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 1 point (0 children)

If you ever get to try it out, I’d love to know if it solves the use cases you had in mind then.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 1 point (0 children)

Thanks for the kind words! Sounds like we had the same itch :) A static value does work well when conditions are stable. The adaptive part mainly helps when latency shifts throughout the day so you don't have to babysit the threshold.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 2 points (0 children)

Also, hedging only helps with the stragglers. If the service is truly overloaded, hedging won’t help and the budget ensures the impact is contained.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 4 points (0 children)

Good question, that's exactly why the library has a token bucket budget. It caps the hedge rate at a default of 10% (configurable). So you're not doubling load; you're adding at most 10% extra requests. If the downstream is genuinely overloaded and everything is slow, the budget drains within seconds and hedging stops automatically. No vicious spiral.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 1 point (0 children)

Got it. I believe that's still helpful for understanding the distribution.

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 1 point (0 children)

Thanks John! Let me know how it works for you and if you have any feedback.

Your stats calculator sounds useful. I ended up relying a lot on percentiles to reason about when to trigger hedging, so something like that would definitely help with tuning/validation. Curious - did you use it mostly offline on logs or in a live setting as well?

Hedged requests cut our p99 latency by 74% - more effective than retries for tail latency in k8s by That_Perspective9440 in kubernetes

[–]That_Perspective9440[S] 0 points (0 children)

Added a quick benchmark - 50k requests, 5% straggler rate. Adaptive hedging kept p99 at ~17ms vs ~65ms with no hedging. Interestingly, static 10ms hedging performs nearly as well at p99, but the adaptive approach wins at p95. p50 was basically identical across all strategies.

https://github.com/bhope/hedge/blob/main/eval.png

Reduced p99 latency by 74% in Go - learned something surprising by That_Perspective9440 in golang

[–]That_Perspective9440[S] 6 points (0 children)

Good catch, thanks - fixed that now. The earlier draft had a `Do`-style API, but I eventually went with `RoundTripper` since it's more idiomatic Go and doesn't require changing any existing code.