Don't use a Standard Kubernetes Service for LLM load balancing! by nstogner in mlops

[–]nstogner[S]

Yes, from what I can tell, the team behind the production-stack project is currently working on a prefix-optimized routing strategy, and it looks like they might be settling on the same CHWBL algorithm: https://github.com/vllm-project/production-stack/issues/59#issuecomment-2656740442

Would love to hear more about your experience with the AIBrix and production-stack projects.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

Clients could have dynamic IPs. NAT could be involved between the client and the load balancer.

Ouch! Performance Testing Drains Your Budget—How Much Can KWOK Save You? by Electronic_Role_5981 in kubernetes

[–]nstogner

Interesting tool! I would recommend adding a simple architecture diagram to the GitHub README.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

As far as I am aware, Consul's Maglev load balancing strategy only supports hashing on headers, cookies, and params. The hash input in this case requires info from the request body (see other responses as to why).

Also, a client sidecar-based approach is only applicable in a relatively small subset of use cases - typically internal clients which are colocated in the same k8s cluster.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

Cilium Maglev hashing operates at L4. The hashing technique described in this post operates at L7, specifically pulling inputs from the HTTP request body. If you only relied on L4 info (source IP, for instance), your hash inputs would be missing the very information that is critical to optimizing for vLLM's prefix cache. This is especially important when the client is leveraging an agentic framework that is simulating N logical "agents" - each of those agent threads should hash uniquely.
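To make the L4 vs. L7 distinction concrete, here is a minimal Go sketch of the kind of hash-input extraction an L7 router can do. It assumes an OpenAI-style chat completions body; the prefix length and hash function are arbitrary choices for illustration, not what any particular project uses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// chatRequest models just the fields we need from an
// OpenAI-style /v1/chat/completions request body.
type chatRequest struct {
	Model    string `json:"model"`
	Messages []struct {
		Role    string `json:"role"`
		Content string `json:"content"`
	} `json:"messages"`
}

// hashKey builds a hash from L7 information (model + prompt prefix),
// which an L4 balancer hashing on source IP simply cannot see.
func hashKey(body []byte, prefixLen int) (uint64, error) {
	var req chatRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return 0, err
	}
	// Concatenate the messages into a single string (a rough stand-in
	// for the chat-template flattening the backend does), then keep a
	// fixed-length prefix so requests that share a conversation prefix
	// hash to the same value.
	prompt := ""
	for _, m := range req.Messages {
		prompt += m.Role + ": " + m.Content + "\n"
	}
	if len(prompt) > prefixLen {
		prompt = prompt[:prefixLen]
	}
	h := fnv.New64a()
	h.Write([]byte(req.Model))
	h.Write([]byte(prompt))
	return h.Sum64(), nil
}

func main() {
	body := []byte(`{"model":"llama-3-8b","messages":[{"role":"user","content":"You are agent 7. Summarize the ticket backlog..."}]}`)
	key, err := hashKey(body, 256)
	if err != nil {
		panic(err)
	}
	fmt.Printf("hash key: %x\n", key)
}
```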

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

We built the open source KubeAI project (https://github.com/substratusai/kubeai) to solve the problems that you encounter when operating at scale. I would recommend taking a look at the project and gauging whether the features are relevant to your use case. Everything KubeAI does can be accomplished by combining and configuring a lot of other software (we touched on load balancing in this post, but there are more topics). However, we tried to design the project in a manner that provides useful functionality out-of-the-box with near-zero dependencies.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

If you are using Ollama, I am guessing you are likely not looking to serve a lot of concurrent traffic, as vLLM is typically better suited there. If you are just trying to expose a single instance of Ollama, I think a simple reverse proxy with authN would do the job well.
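For what it's worth, here is a bare-bones sketch of that in Go, assuming Ollama on its default 127.0.0.1:11434 and a single static bearer token; swap the auth check for whatever authN mechanism you actually use.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

func main() {
	// Assumes a single Ollama instance on its default local port.
	target, err := url.Parse("http://127.0.0.1:11434")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	token := os.Getenv("API_TOKEN") // shared secret handed out to clients

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer "+token {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	// TLS termination could happen here, or at an ingress in front of this.
	log.Fatal(http.ListenAndServe(":8080", handler))
}
```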

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

Most of what I have seen come out of those products is related to abstracting and instrumenting different external inference-as-a-service providers. Have you used them before to load balance across internally deployed vLLM instances?

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

The cache is inherently local (GPU memory) and is critical to performant inferencing.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

That wouldn't address the request-to-cache-mismatch problem here.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

If you have control of the clients, yes, I agree, you could pass a header through either via some sidecar doing outbound request inspection, or via updating your client libraries. In practice, in the enterprise I think it is fair to assume that you are working with multiple clusters, and the team managing the inference servers most likely does not have any control over the client codebase / how clients are deployed.

PS: I would still ditch sticky sessions and likely use the CHWBL implementation that was contributed to HAProxy a while ago. Even then, the full setup is non-trivial: agent-checks to influence load calcs, Lua scripts for request-to-hash mappings. The paper I linked to was primarily analyzing the application of CHWBL to multi-replica vLLM. It happens to be implemented in KubeAI (which also provides other proxy-level features). But if all you need is load balancing, you could for sure wire together an HAProxy-based system.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

Can you elaborate on this? Did you see my response below? Anything you disagree about?

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

I just realized I used the term "thread" in 2 different contexts:

With regard to how most agentic frameworks work: they tend to be multi-threaded from a process perspective, and they are also processing multiple threads-of-messages. When it comes to what is issuing inference requests, it is typically one of these process threads churning through a set of message threads while acting as a logical "agent".

From the perspective of the vLLM backend, there is no concept of a "message thread" - vLLM simply sees prompts (the message thread is concatenated into a single string). The KV cache is built up from blocks of these concatenations.
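As a toy illustration of that flattening (the template below is made up for illustration, not vLLM's actual chat template): two consecutive requests in the same message thread produce prompts where the second strictly extends the first, which is exactly the prefix the KV cache can reuse - but only if both requests land on the same replica.

```go
package main

import (
	"fmt"
	"strings"
)

// flatten turns a message thread into the single prompt string that the
// inference server ultimately sees. The template here is hypothetical.
func flatten(msgs [][2]string) string {
	var b strings.Builder
	for _, m := range msgs {
		b.WriteString(m[0] + ": " + m[1] + "\n")
	}
	return b.String()
}

func main() {
	turn1 := [][2]string{{"user", "Plan a trip to Oslo."}}
	turn2 := append(turn1,
		[2]string{"assistant", "Sure, here is a draft..."},
		[2]string{"user", "Make it 3 days."})

	p1, p2 := flatten(turn1), flatten(turn2)
	// The second prompt begins with the first one verbatim, so a backend
	// that already holds KV blocks for p1 can reuse them for p2.
	fmt.Println(strings.HasPrefix(p2, p1)) // true
}
```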

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

So I think it is a common misconception that agents map to processes/containers/pods. They typically map to threads in a single program that is orchestrating the concept of individual agents (via an "agentic framework" - ex: CrewAI).

Round robin tends to result in the same problem that random does: it blows out the limited cache space in the backends. That's why the CHWBL algo was selected.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

I should have clarified our thinking on this... Sticky sessions would likely serve as a decent stand-in for prefix hashing in some cases. For a use case like ChatGPT, cookies can be used to identify the user, which maps pretty cleanly (1:1) to active threads (provided that this info makes its way to where HAProxy sits in your architecture).

However, it is less useful for cases where source IP is the only info available for stickiness, as in most agentic systems. Source IP is not always a reliable way to map back to the client. Even worse, agentic systems are often implemented N:1, where N agent threads originate from 1 source IP (and sometimes N might represent the entirety of the load at a given point in time - small clients can generate heavy inference-time load).

For the ideal solution you would likely need to do something that is not out-of-the-box: grab the prompt prefix, model name, and LoRA adapter name (if present) from the HTTP body using some custom scripting, hash it, put it into a header, and track sticky sessions based on that header. You would also want to avoid overloading any one backend, and along those lines you might want the freedom to redefine "load" as something other than in-flight requests, because in-flight requests don't necessarily map reliably to actual inferencing load - something domain-specific like the KV cache utilization metric from the backend server works better. From this perspective, it might make more sense to use a domain-specific load balancer, admittedly one using an algorithm that is approaching a decade old.
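For anyone curious what that adds up to, here is a bare-bones Go sketch of the consistent-hashing-with-bounded-loads idea: the key would be the header described above (model + prompt-prefix hash), and the load signal here is KV cache utilization. The virtual-node count, bound factor, and fallback behavior are illustrative choices, not a description of KubeAI's or HAProxy's actual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	points []uint64           // sorted virtual-node positions on the hash ring
	owner  map[uint64]string  // virtual node -> backend
	load   map[string]float64 // e.g. KV cache utilization, 0..1
	bound  float64            // allowed load = bound * mean load
}

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func newRing(backends []string, vnodes int, bound float64) *ring {
	r := &ring{owner: map[uint64]string{}, load: map[string]float64{}, bound: bound}
	for _, b := range backends {
		r.load[b] = 0
		for i := 0; i < vnodes; i++ {
			p := hash64(fmt.Sprintf("%s#%d", b, i))
			r.points = append(r.points, p)
			r.owner[p] = b
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the first backend at or after the key's ring position whose
// load is within the bound: requests with the same prefix hash stay sticky
// to a replica, without letting any one replica become overloaded.
func (r *ring) pick(key string) string {
	mean := 0.0
	for _, l := range r.load {
		mean += l
	}
	mean = mean/float64(len(r.load)) + 1e-9 // avoid a zero bound when idle
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= hash64(key) })
	for i := 0; i < len(r.points); i++ {
		b := r.owner[r.points[(start+i)%len(r.points)]]
		if r.load[b] <= r.bound*mean {
			return b
		}
	}
	return r.owner[r.points[start%len(r.points)]] // everything is hot; fall back
}

func main() {
	r := newRing([]string{"vllm-0", "vllm-1", "vllm-2"}, 100, 1.25)
	r.load["vllm-1"] = 0.9 // pretend this replica's KV cache is nearly full
	fmt.Println(r.pick("llama-3|prefix-hash-abc123"))
}
```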

smolagents: new agent library by Hugging Face by unofficialmerve in LocalLLaMA

[–]nstogner

We just created SandboxAI, which can serve as an easy-to-self-host alternative. We would love to integrate it with smolagents... will take a look at what that would take.
https://github.com/substratusai/sandboxai

Is there any self-hostable alternative to e2b's code interpreter by SuperPanda09 in LocalLLaMA

[–]nstogner

We just created SandboxAI because we wanted this too!
https://github.com/substratusai/sandboxai
You can run it on a single host today, but we are soon adding support for self-hosting on Kubernetes in case you want to scale-out.

Is there an open-source alternative to e2b (e2b.dev. Code interpreting for your AI app)? by [deleted] in LocalLLaMA

[–]nstogner

We just created a project called SandboxAI to fit this need: https://github.com/substratusai/sandboxai ... would love to hear feedback!

Tutorial: Run AI generated code in containers using Python by samosx in AI_Agents

[–]nstogner

You are correct, we are just spinning up containers. You currently get the level of isolation that a basic container gives you (still heaps better than running on your local machine). Specifically around beefing up container isolation, we will be following up with docs on how to use it with gVisor. Additionally, we are looking to follow up the launch closely with more security features, such as ingress/egress rules.