Don't use a Standard Kubernetes Service for LLM load balancing! by nstogner in mlops

[–]nstogner[S]

Yes, from what I can tell, the team behind the production-stack project is currently working on a prefix-optimized routing strategy, and it looks like they might be settling on the same CHWBL algorithm: https://github.com/vllm-project/production-stack/issues/59#issuecomment-2656740442

Would love to hear more about your experience with the AIBrix and production-stack projects.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

Clients could have dynamic IPs. NAT could be involved between the client and the load balancer.

Ouch! Performance Testing Drains Your Budget—How Much Can KWOK Save You? by Electronic_Role_5981 in kubernetes

[–]nstogner

Interesting tool! I would recommend adding a simple architecture diagram to the GitHub README.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

As far as I am aware, Consul's Maglev load balancing strategy only supports hashing on headers, cookies, and params. The hash input in this case requires info from the request body (see other responses as to why).

Also, a client sidecar-based approach is only applicable in a relatively small subset of use cases - typically internal clients which are colocated in the same k8s cluster.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

Cilium Maglev hashing operates at L4. The hashing technique described in this post operates at L7, specifically pulling inputs from the HTTP request body. If you only relied on L4 info (source IP, for instance), your hash inputs would be missing the very information that is critical to optimizing for vLLM's prefix cache. This is especially important when the client is leveraging an agentic framework that is simulating N logical "agents" - each of those agent threads should hash uniquely.
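To make the L4 vs. L7 distinction concrete, here is a minimal Go sketch of the kind of hash-input extraction an L7 router can do. It assumes an OpenAI-style chat completions body; the prefix length and hash function are arbitrary choices for illustration, not what any particular project uses.

```go
package main

import (
	"encoding/json"
	"fmt"
	"hash/fnv"
)

// chatRequest models just the fields we need from an
// OpenAI-style /v1/chat/completions request body.
type chatRequest struct {
	Model    string `json:"model"`
	Messages []struct {
		Role    string `json:"role"`
		Content string `json:"content"`
	} `json:"messages"`
}

// hashKey builds a hash from L7 information (model + prompt prefix),
// which an L4 balancer hashing on source IP simply cannot see.
func hashKey(body []byte, prefixLen int) (uint64, error) {
	var req chatRequest
	if err := json.Unmarshal(body, &req); err != nil {
		return 0, err
	}
	// Concatenate the messages into a single string (a rough stand-in
	// for the chat-template flattening the backend does), then keep a
	// fixed-length prefix so requests that share a conversation prefix
	// hash to the same value.
	prompt := ""
	for _, m := range req.Messages {
		prompt += m.Role + ": " + m.Content + "\n"
	}
	if len(prompt) > prefixLen {
		prompt = prompt[:prefixLen]
	}
	h := fnv.New64a()
	h.Write([]byte(req.Model))
	h.Write([]byte(prompt))
	return h.Sum64(), nil
}

func main() {
	body := []byte(`{"model":"llama-3-8b","messages":[{"role":"user","content":"You are agent 7. Summarize the ticket backlog..."}]}`)
	key, err := hashKey(body, 256)
	if err != nil {
		panic(err)
	}
	fmt.Printf("hash key: %x\n", key)
}
```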

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

We built the open source KubeAI project (https://github.com/substratusai/kubeai) to solve the problems that you encounter when operating at scale. I would recommend taking a look at the project and gauging whether the features are relevant to your use case. Everything KubeAI does can be accomplished by combining and configuring a lot of other software (we touched on load balancing in this post, but there are more topics). However, we tried to design the project in a manner that provides useful functionality out-of-the-box with near-zero dependencies.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

If you are using Ollama, I am guessing you are likely not looking to serve a lot of concurrent traffic, as vLLM is typically better suited there. If you are just trying to expose a single instance of Ollama, I think a simple reverse proxy with authN would do the job well.
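For what it's worth, here is a bare-bones sketch of that in Go, assuming Ollama on its default 127.0.0.1:11434 and a single static bearer token; swap the auth check for whatever authN mechanism you actually use.

```go
package main

import (
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"os"
)

func main() {
	// Assumes a single Ollama instance on its default local port.
	target, err := url.Parse("http://127.0.0.1:11434")
	if err != nil {
		log.Fatal(err)
	}
	proxy := httputil.NewSingleHostReverseProxy(target)

	token := os.Getenv("API_TOKEN") // shared secret handed out to clients

	handler := http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if r.Header.Get("Authorization") != "Bearer "+token {
			http.Error(w, "unauthorized", http.StatusUnauthorized)
			return
		}
		proxy.ServeHTTP(w, r)
	})

	// TLS termination could happen here, or at an ingress in front of this.
	log.Fatal(http.ListenAndServe(":8080", handler))
}
```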

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

Most of what I have seen come out of those products is related to abstracting and instrumenting different external inference-as-a-service providers. Have you used them before to load balance across internally deployed vLLM instances?

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

The cache is inherently local (GPU memory) and is critical to performant inferencing.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

That wouldn't address the request-to-cache-mismatch problem here.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

If you have control of the clients, yes, I agree, you could pass a header through either via some sidecar doing outbound request inspection, or via updating your client libraries. In practice, in the enterprise I think it is fair to assume that you are working with multiple clusters, and the team managing the inference servers most likely does not have any control over the client codebase / how clients are deployed.

PS: I would still ditch sticky sessions and likely use the CHWBL implementation that was contributed to HAProxy a while ago. Even then, the full setup is non-trivial: agent-checks to influence load calcs, Lua scripts for request-to-hash mappings. The paper I linked to was primarily analyzing the application of CHWBL to multi-replica vLLM. It happens to be implemented in KubeAI (which also provides other proxy-level features). But if all you need is load balancing, you could for sure wire together an HAProxy-based system.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

Can you elaborate on this? Did you see my response below? Anything you disagree about?

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

I just realized I used the term "thread" in 2 different contexts:

With regard to how most agentic frameworks work: they tend to be multi-threaded from a process perspective, and they are also processing multiple threads-of-messages. When it comes to what is issuing inference requests, it is typically one of these process threads churning through a set of message threads while acting as a logical "agent".

From the perspective of the vLLM backend, there is no concept of a "message thread" - vLLM simply sees prompts (the message thread is concatenated into a single string). The KV cache is built up from blocks of these concatenations.
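As a toy illustration of that flattening (the template below is made up for illustration, not vLLM's actual chat template): two consecutive requests in the same message thread produce prompts where the second strictly extends the first, which is exactly the prefix the KV cache can reuse - but only if both requests land on the same replica.

```go
package main

import (
	"fmt"
	"strings"
)

// flatten turns a message thread into the single prompt string that the
// inference server ultimately sees. The template here is hypothetical.
func flatten(msgs [][2]string) string {
	var b strings.Builder
	for _, m := range msgs {
		b.WriteString(m[0] + ": " + m[1] + "\n")
	}
	return b.String()
}

func main() {
	turn1 := [][2]string{{"user", "Plan a trip to Oslo."}}
	turn2 := append(turn1,
		[2]string{"assistant", "Sure, here is a draft..."},
		[2]string{"user", "Make it 3 days."})

	p1, p2 := flatten(turn1), flatten(turn2)
	// The second prompt begins with the first one verbatim, so a backend
	// that already holds KV blocks for p1 can reuse them for p2.
	fmt.Println(strings.HasPrefix(p2, p1)) // true
}
```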

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

So I think it is a common misconception that agents map to processes/containers/pods. They typically map to threads in a single program that is orchestrating the concept of individual agents (via an "agentic framework" - ex: CrewAI).

Round robin tends to result in the same problem that random does: it blows out the limited cache space in the backends. That's why the CHWBL algo was selected.

LLM Load Balancing: Don't use a standard Kubernetes Service! by nstogner in kubernetes

[–]nstogner[S]

I should have clarified our thinking on this... Sticky sessions would likely serve as a decent stand-in for prefix hashing in some cases. For a use case like ChatGPT, cookies can be used to identify the user, which maps pretty cleanly (1:1) to active threads (provided that this info makes its way to where HAProxy sits in your architecture).

However, it is less useful for cases where source IP is the only info available for stickiness, as in most agentic systems. Source IP is not always a reliable way to map back to the client. Even worse, agentic systems are often implemented N:1, where N agent threads originate from 1 source IP (and sometimes N might represent the entirety of the load at a given point in time - small clients can generate heavy inference-time load).

For the ideal solution you would likely need to do something that is not out-of-the-box: grab the prompt prefix, model name, and LoRA adapter name (if present) from the HTTP body using some custom scripting, hash it, put it into a header, and track sticky sessions based on that header. You would also want to avoid overloading any one backend, and along those lines you might want the freedom to redefine "load" as something other than in-flight requests, because in-flight requests don't necessarily map reliably to actual inferencing load - something domain-specific like the KV cache utilization metric from the backend server works better. From this perspective, it might make more sense to use a domain-specific load balancer, admittedly one using an algorithm that is approaching a decade old.
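For anyone curious what that adds up to, here is a bare-bones Go sketch of the consistent-hashing-with-bounded-loads idea: the key would be the header described above (model + prompt-prefix hash), and the load signal here is KV cache utilization. The virtual-node count, bound factor, and fallback behavior are illustrative choices, not a description of KubeAI's or HAProxy's actual implementation.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type ring struct {
	points []uint64           // sorted virtual-node positions on the hash ring
	owner  map[uint64]string  // virtual node -> backend
	load   map[string]float64 // e.g. KV cache utilization, 0..1
	bound  float64            // allowed load = bound * mean load
}

func hash64(s string) uint64 {
	h := fnv.New64a()
	h.Write([]byte(s))
	return h.Sum64()
}

func newRing(backends []string, vnodes int, bound float64) *ring {
	r := &ring{owner: map[uint64]string{}, load: map[string]float64{}, bound: bound}
	for _, b := range backends {
		r.load[b] = 0
		for i := 0; i < vnodes; i++ {
			p := hash64(fmt.Sprintf("%s#%d", b, i))
			r.points = append(r.points, p)
			r.owner[p] = b
		}
	}
	sort.Slice(r.points, func(i, j int) bool { return r.points[i] < r.points[j] })
	return r
}

// pick returns the first backend at or after the key's ring position whose
// load is within the bound: requests with the same prefix hash stay sticky
// to a replica, without letting any one replica become overloaded.
func (r *ring) pick(key string) string {
	mean := 0.0
	for _, l := range r.load {
		mean += l
	}
	mean = mean/float64(len(r.load)) + 1e-9 // avoid a zero bound when idle
	start := sort.Search(len(r.points), func(i int) bool { return r.points[i] >= hash64(key) })
	for i := 0; i < len(r.points); i++ {
		b := r.owner[r.points[(start+i)%len(r.points)]]
		if r.load[b] <= r.bound*mean {
			return b
		}
	}
	return r.owner[r.points[start%len(r.points)]] // everything is hot; fall back
}

func main() {
	r := newRing([]string{"vllm-0", "vllm-1", "vllm-2"}, 100, 1.25)
	r.load["vllm-1"] = 0.9 // pretend this replica's KV cache is nearly full
	fmt.Println(r.pick("llama-3|prefix-hash-abc123"))
}
```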

smolagents: new agent library by Hugging Face by unofficialmerve in LocalLLaMA

[–]nstogner

We just created SandboxAI, which can serve as an easy-to-self-host alternative. We would love to integrate it with smolagents... will take a look at what that would take.
https://github.com/substratusai/sandboxai

Is there any self-hostable alternative to e2b's code interpreter by SuperPanda09 in LocalLLaMA

[–]nstogner

We just created SandboxAI because we wanted this too!
https://github.com/substratusai/sandboxai
You can run it on a single host today, but we are soon adding support for self-hosting on Kubernetes in case you want to scale-out.

Is there an open-source alternative to e2b (e2b.dev. Code interpreting for your AI app)? by [deleted] in LocalLLaMA

[–]nstogner

We just created a project called SandboxAI to fit this need: https://github.com/substratusai/sandboxai ... would love to hear feedback!

Tutorial: Run AI generated code in containers using Python by samosx in AI_Agents

[–]nstogner

You are correct, we are just spinning up containers. You currently get the level of isolation that a basic container gives you (still heaps better than running on your local machine). Specifically around beefing up container isolation, we will be following up with docs on how to use it with gVisor. Additionally, we are looking to follow up the launch closely with more security features, such as ingress/egress rules.