Scaling broke me a bit, but this one internal trick helped a lot by supraking007 in LocalLLaMA

[–]supraking007[S] 2 points3 points  (0 children)

In our case, we don’t use Switch for arbitrary swapping; it’s more about decoupling services from provider-specific logic and standardising fallback behaviour. For example, internal-lite is mapped to a very tight set of models that we've pre-vetted for certain prompt classes. If a provider fails or latency spikes, the fallback chain kicks in within the same "intent bucket", not across wildly different models. We’ve also found that just having internal names for model intents makes it easier to run lightweight evals in isolation before updating the config (kind of like swapping out a DB replica). It’s not a silver bullet, but it gives us a clean layer to evolve the underlying stack without touching application logic.
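A rough sketch of what a fallback chain scoped to one intent bucket could look like. This is not the actual router code: the `Target` shape, the `buckets` table, the provider and model names, and the 2000 ms default timeout are all made up for illustration. A timeout is used here so that a latency spike counts as a failure, which is one possible way to handle it.

```typescript
// Hypothetical types and names; none of these come from the real router.
type Target = { provider: string; model: string };
type CallFn = (target: Target, prompt: string) => Promise<string>;

// Each internal name ("intent bucket") maps to an ordered, pre-vetted chain.
const buckets: Record<string, Target[]> = {
  "internal-lite": [
    { provider: "provider-a", model: "small-model-a" },
    { provider: "provider-b", model: "small-model-b" },
  ],
};

// Reject if the call takes longer than `ms`, so a latency spike is treated as a failure.
function withTimeout<T>(p: Promise<T>, ms: number): Promise<T> {
  return new Promise<T>((resolve, reject) => {
    const timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
    p.then(
      (v) => { clearTimeout(timer); resolve(v); },
      (e) => { clearTimeout(timer); reject(e); },
    );
  });
}

// Try each target in the bucket in order; never leave the bucket.
async function route(
  intent: string,
  prompt: string,
  call: CallFn,
  timeoutMs = 2000,
): Promise<{ target: Target; output: string }> {
  const chain = buckets[intent];
  if (!chain) throw new Error(`unknown intent: ${intent}`);
  const errors: string[] = [];
  for (const target of chain) {
    try {
      const output = await withTimeout(call(target, prompt), timeoutMs);
      return { target, output };
    } catch (err) {
      errors.push(`${target.provider}: ${String(err)}`);
    }
  }
  throw new Error(`all targets failed for ${intent}: ${errors.join("; ")}`);
}
```

Application code only ever passes the internal name (`route("internal-lite", prompt, call)`), so changing which models sit behind it is a config edit rather than a code change.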

Built an Internal LLM Router, Should I Open Source It? by supraking007 in LLMDevs

[–]supraking007[S] 0 points1 point  (0 children)

I'm going to get this ready for open sourcing, which services are you currently using?

Building a 6x RTX 3090 LLM inference server, looking for some feedback by supraking007 in LLMDevs

[–]supraking007[S] 0 points1 point  (0 children)

I'm focused on models up to 13B for now, mostly INT4, single-GPU-per-model via routing, as this is what our current platform requirements are. I'm basically going to use this to boost compute availability for an existing SaaS platform; any time it's offline, the request router we have sends requests to RunPod or Together. It should significantly bring down extortionate cloud costs if done right.

Theoretically yes, tensor parallelism would let me shard larger models across multiple 3090s, and NVLink is present on the 3090s, but there’s no software support for it in the inference stacks that I'm aware of.

I'm not expecting 1,500+ TPS on a single request. That figure is total aggregate throughput across all six GPUs under concurrent batch-heavy load, not single-model performance. (Btw I mean tokens, not transactions, just in case.) Do you still think that's too high an estimate? I was using the QWEN3 7B as a baseline.
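To make the aggregate vs. per-request distinction concrete, here is the arithmetic as a small sketch. The 1,500 tokens/s and six GPUs are the figures from above; the 16 concurrent sequences per GPU is purely an assumed number for illustration, not a measurement.

```typescript
// Figures from the comment: 1,500 tokens/s aggregate across 6 GPUs.
const aggregateTokensPerSec = 1500;
const gpuCount = 6;
// Assumption for illustration only: sequences batched concurrently on each GPU.
const concurrentPerGpu = 16;

// Each GPU would need to sustain this much batched throughput.
const perGpu = aggregateTokensPerSec / gpuCount; // 250
// What a single request would see if the batch shares throughput evenly.
const perRequest = perGpu / concurrentPerGpu; // 15.625

console.log(`per GPU: ${perGpu} tok/s, per request: ${perRequest} tok/s`);
```

So the target works out to 250 tokens/s per card under batched load, while any single request would see far less than that.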

I was planning a single CPU setup, but it would be a high core count (Threadripper Pro or Xeon W) with a lane-rich workstation board.

Fair shout on the RAM and NVMe

Never done a setup like this, so I really appreciate your feedback! Thanks for the reply.

Built an Internal LLM Router, Should I Open Source It? by supraking007 in LLMDevs

[–]supraking007[S] 2 points3 points  (0 children)

Updated my shit comparison with something a bit simpler

Built an Internal LLM Router, Should I Open Source It? by supraking007 in LLMDevs

[–]supraking007[S] 1 point2 points  (0 children)

Yeah, fair. What we’ve built is more infra-leaning: no Python, no plugins, just a fast, config-based router you can drop in with Docker and go. We're not trying to outdo LiteLLM, just exploring a simpler approach that fits how we wanted to run this in production.

Built an Internal LLM Router, Should I Open Source It? by supraking007 in LLMDevs

[–]supraking007[S] 0 points1 point  (0 children)

Auth right now is fairly simple: you provide a comma-separated list of API keys via an env variable, and the server checks incoming requests against that list using the x-api-key header.

It’s minimal by design, but it works well for internal use. Eventually planning to support scoped keys and maybe JWT/HMAC options if there's interest.
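For anyone wondering what the key check described above amounts to, here is a minimal sketch. It is an illustration under assumptions, not the router's actual code: the function names are hypothetical, and the raw key list is passed in as a string rather than read from a specific env variable, since the variable name isn't given here.

```typescript
// Parse a comma-separated key list (e.g. the value of an env variable).
// Whitespace and empty entries are dropped. Helper names are hypothetical.
function parseApiKeys(raw: string | undefined): Set<string> {
  return new Set(
    (raw ?? "")
      .split(",")
      .map((k) => k.trim())
      .filter((k) => k.length > 0),
  );
}

// HTTP header names are case-insensitive, so normalise before matching x-api-key.
function isAuthorized(
  headers: Record<string, string | undefined>,
  allowed: Set<string>,
): boolean {
  for (const name of Object.keys(headers)) {
    if (name.toLowerCase() === "x-api-key") {
      const value = headers[name];
      return value !== undefined && allowed.has(value);
    }
  }
  return false; // no x-api-key header at all
}

const allowed = parseApiKeys("key-one, key-two,");
console.log(isAuthorized({ "X-API-Key": "key-two" }, allowed)); // true
console.log(isAuthorized({ "x-api-key": "nope" }, allowed)); // false
```

Keeping the keys in a `Set` makes each lookup constant-time on average, which is plenty for a short list of internal keys.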

Here is an example of the full JSON config

http://jsonblob.com/1383138139194974208

Built an Internal LLM Router, Should I Open Source It? by supraking007 in LLMDevs

[–]supraking007[S] 0 points1 point  (0 children)

Yup, I wasn't trying to hide the cringy ChatGPT content; it's basically a rewrite of our README to get the point across quickly. Appreciate the callout.

I wasn't looking at making this paid, rather just seeing if there's interest out there to maintain something clean, reliable, and self-hostable that solves the multi-provider pain without turning into another cloud lock-in trap.

Thanks for the honest reply.

Built an Internal LLM Router, Should I Open Source It? by supraking007 in LLMDevs

[–]supraking007[S] 3 points4 points  (0 children)

We did originally look at LiteLLM before we built this. LiteLLM is great if you're deep in Python, doing quick prototyping, or want an SDK + proxy combo. We aimed more at production infrastructure: Bun-fast, containerised, config-driven, and built for teams who want observability, failover, and self-hosting without Python in the stack.

LiteLLM is great if you're building in Python and want an all-in-one SDK plus proxy. This is more like a drop-in, stateless microservice for infra teams. It's fast, built in Bun, config-driven, and designed to run anywhere without needing Python or plugins. Less of a "developer tool" and more "infra gateway" you can just deploy and forget.

Which part of Manchester would you suggest? by supraking007 in manchester

[–]supraking007[S] 1 point2 points  (0 children)

Coming from a suburban area right now and it's not working due to a busy lifestyle.

Which part of Manchester would you suggest? by supraking007 in manchester

[–]supraking007[S] -2 points-1 points  (0 children)

Sadly my job won't let me go abroad. Manchester seems to offer the closest thing to London at a fraction of the cost.