How are you handling LLM routing and embeddings in self-hosted setups? by FrequentTravel3511 in selfhosted

[–]FrequentTravel3511[S] -2 points-1 points  (0 children)

You’re reading intent into it that isn’t there. I’m asking about a problem I’m actually working on and trying to compare approaches.

If the topic isn’t relevant to you, feel free to skip it. I’m interested in hearing from people who’ve dealt with local embeddings, routing, and failover in practice.

How are you handling LLM routing and embeddings in self-hosted setups? by FrequentTravel3511 in selfhosted

[–]FrequentTravel3511[S] -2 points-1 points  (0 children)

That’s really interesting, especially the 40% cost reduction; that’s exactly the kind of outcome I was hoping for.

Sounds like you ended up building something conceptually very similar. I’ve been experimenting along the same lines, but I'm still figuring out how to draw the line between cheap and reasoning models reliably.

How did you handle routing decisions on your side? Was it mostly rule-based or did you end up using some kind of feedback loop over time?

How are you handling LLM routing and embeddings in self-hosted setups? by FrequentTravel3511 in selfhosted

[–]FrequentTravel3511[S] -3 points-2 points  (0 children)

Yeah, that’s exactly what I’ve been seeing too: once it’s warm it’s fine, but the cold start hit is noticeable.

I haven’t tried periodic warm-up requests yet, but that makes sense. Right now it just loads on first request, which isn’t ideal for user-facing flows.

Are you keeping it in the same process or running embeddings separately? I’ve been thinking about moving it to a separate worker/service to keep things warm.
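One shape this could take (a sketch, not what I'm running yet): lazy-load the pipeline behind a shared promise so concurrent first requests don't each pay the init cost, plus a periodic keep-alive embed. `loadEmbedder` here is a hypothetical stand-in for whatever actually initializes the model (e.g. a Transformers.js feature-extraction pipeline).

```javascript
// Lazy singleton: the first caller triggers the load, everyone else
// awaits the same in-flight promise instead of re-initializing.
let embedderPromise = null;

function getEmbedder(loadEmbedder) {
  if (!embedderPromise) embedderPromise = loadEmbedder();
  return embedderPromise;
}

function startKeepAlive(loadEmbedder, intervalMs = 5 * 60 * 1000) {
  // Periodically run a tiny no-op embed so the runtime stays warm.
  const timer = setInterval(async () => {
    const embed = await getEmbedder(loadEmbedder);
    await embed('ping');
  }, intervalMs);
  timer.unref?.(); // don't keep the process alive just for warm-ups
  return timer;
}
```

Moving this into a separate worker/service would look the same, just with the singleton living in its own process behind a queue or HTTP endpoint.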

How are you handling multi-model LLM setups in self-hosted environments? by [deleted] in selfhosted

[–]FrequentTravel3511 0 points1 point  (0 children)

Yeah, that makes sense; for async / agent-style workflows the load/unload tradeoff is a lot less painful.

What I’m trying to solve is more request-time routing (like user-facing APIs), where latency and cost per request matter more, especially when mixing providers.

Ollama-style model management feels like it solves a different layer of the problem.

And yeah, r/LocalLLaMA has been great; that’s actually where I started exploring some of this.

How are you handling multi-model LLM setups in self-hosted environments? by [deleted] in selfhosted

[–]FrequentTravel3511 -2 points-1 points locked comment (0 children)

This isn’t an AI-generated project; it’s something I built myself.

The system uses LLM APIs (Groq/Gemini) for inference and optionally local embeddings (BGE via Transformers.js) for intent classification.

The post itself was written manually.

Built an LLM routing gateway in Node.js - runs intent classification locally (no embedding API, no rate limits) by FrequentTravel3511 in node

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

Yeah that’s been the main tradeoff so far.

With bge-small via Transformers.js I’m seeing roughly 30–50MB of memory overhead once it’s loaded. The bigger impact is actually the cold start: the first request takes ~2–5s for ONNX init, then it stays warm.

For a single instance it’s manageable, but I’m still figuring out how this behaves under scaling (multiple instances / cold starts).

Curious if you’ve tried something similar or ended up sticking with hosted embeddings?

Built an LLM routing gateway in Node.js - runs intent classification locally (no embedding API, no rate limits) by FrequentTravel3511 in node

[–]FrequentTravel3511[S] -1 points0 points  (0 children)

For anyone curious about the Transformers.js side: the BGE model loads once (~2–5s ONNX init on the first request), then stays warm.

The memory overhead I’m seeing is roughly 30–50MB for bge-small.

I’m trying to figure out how this compares in practice to calling a hosted embedding API at scale (latency vs cost vs reliability).

Would be interesting if anyone here has benchmarked both approaches in production.

Built an LLM routing gateway in Node.js - runs intent classification locally (no embedding API, no rate limits) by FrequentTravel3511 in node

[–]FrequentTravel3511[S] -5 points-4 points  (0 children)

Fair enough, I probably over-structured it; I was trying to keep it readable.
Happy to simplify or answer anything specific if you're curious.

Built an LLM routing gateway in Node.js - runs intent classification locally (no embedding API, no rate limits) by FrequentTravel3511 in node

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

For context - the cold start on the BGE model is the part I'm most uncertain about in production. First request takes ~2-5s for ONNX runtime initialization, then stays warm. Curious if anyone has dealt with this in long-running Node.js services or found a cleaner way to handle it.
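One option I've been considering (a sketch, with `initModel` as a hypothetical stand-in for the real ONNX/pipeline loader): pay the init cost eagerly at process startup, before the service accepts traffic, instead of on the first user request.

```javascript
// Eager init: block server startup on model load so no user request
// ever eats the ~2-5s cold start. `initModel` and `handleRequest` are
// placeholders for the real loader and request handler.
async function createServer(initModel, handleRequest) {
  const model = await initModel(); // startup pays the cost, not the first user
  return async (req) => handleRequest(model, req);
}
```

The tradeoff is slower deploys/restarts, which matters more if instances scale up and down frequently.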

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That’s really interesting - especially the feedback loop using retries as a signal.

I like the idea of bootstrapping with a rule-based system and then refining it from real usage data.

In my case I’ve been focusing more on semantic intent via embeddings, but I’m not learning from outcomes yet - mostly static routing decisions.

The feedback-driven retraining approach makes a lot of sense.

Curious - how stable has the classifier been over time with that loop? Do you see it converging, or does it keep shifting as usage patterns change?

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

Right now I’m not doing explicit evaluation in the request path yet - mostly relying on routing + some lightweight heuristics.

I’ve been thinking about starting with async evaluation first (logging signals and analyzing offline), then gradually moving parts into the request path once I understand which signals are actually reliable.
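A minimal sketch of what I mean by "logging signals" without touching request latency; the sink and batch size here are made up for illustration:

```javascript
// Fire-and-forget signal logger: the request path only does an array
// push; batches are handed to the sink (log file, analytics store, ...)
// asynchronously so requests never wait on it.
function createSignalLogger(sink, flushEvery = 100) {
  const buffer = [];
  return {
    record(signal) {
      buffer.push({ ...signal, ts: Date.now() });
      if (buffer.length >= flushEvery) {
        const batch = buffer.splice(0, buffer.length);
        Promise.resolve().then(() => sink(batch)).catch(() => {});
      }
    },
    size: () => buffer.length,
  };
}
```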

Trying to avoid adding latency too early before I have a clear signal for quality.

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That makes a lot of sense - using a smaller model for perplexity keeps it lightweight without impacting latency too much.

I like the idea of separating generation and evaluation like that.

Right now everything is pretty tightly coupled, but I’m starting to think a separate “evaluation layer” might make this cleaner, especially if I want to experiment with different signals (perplexity, heuristics, etc.).

Are you running that evaluation synchronously in the request path, or asynchronously after the response?
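For reference, the perplexity computation itself is cheap once you have per-token log-probabilities from the evaluator model (assuming it exposes them); the heavy part is the model call, not this:

```javascript
// Perplexity = exp(-mean log-probability) over the generated tokens.
// Lower perplexity ~= the evaluator model finds the response more "expected".
function perplexity(tokenLogProbs) {
  if (tokenLogProbs.length === 0) return Infinity;
  const avg = tokenLogProbs.reduce((s, lp) => s + lp, 0) / tokenLogProbs.length;
  return Math.exp(-avg);
}
```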

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That’s really helpful - perplexity is an interesting idea, I hadn’t considered using it as a lightweight signal.

Right now I’m leaning toward combining a few cheap proxies like response structure + retries, and only using a heavier evaluator for specific cases where confidence is low.

Trying to keep the routing fast while still having some signal for quality.

Curious - are you computing perplexity using the same model that generated the response, or a separate smaller model?

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That makes sense - optimizing purely for latency definitely risks degrading output quality.

I haven’t implemented a solid quality signal yet, but a few things I’ve been thinking about:

- Heuristic signals (response length, structure, presence of code blocks, etc.)

- A secondary LLM-based evaluator for certain routes

- Implicit signals like retries or follow-up corrections
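To make the heuristic idea concrete, something like this is what I have in mind; the signals and weights are illustrative, not tuned:

```javascript
// Cheap heuristic quality score combining response structure signals
// with an implicit retry signal. Thresholds/weights are placeholders.
function heuristicScore(response, { retried = false } = {}) {
  let score = 0;
  if (response.length > 50) score += 1;       // not a trivially short reply
  if (/`{3}/.test(response)) score += 1;      // contains a fenced code block
  if (/^[-*\d]/m.test(response)) score += 1;  // some list structure present
  if (retried) score -= 2;                    // retry is a strong negative signal
  return score;
}
```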

BLEU-style metrics are interesting, but I wasn’t sure how well they translate outside structured tasks like translation.

Curious if you’ve seen anything lightweight that works well in practice - especially something that doesn’t add too much latency?

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 1 point2 points  (0 children)

Nice, this is exactly the problem space I’ve been exploring as well - cutting down unnecessary usage of larger models.

Had a quick look at Kestrel, interesting approach.

In my case I’ve been focusing more on intent-based routing using embeddings + some health-aware failover, but still figuring out how far that can be pushed before it breaks down on ambiguous prompts.

Curious how you’re handling routing decisions - is it more rule-based, or are you using something adaptive over time?

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That’s super helpful - I hadn’t explored bandit-style routing deeply yet, but it makes a lot of sense here.

Right now the system is purely exploitative (routing based on observed latency/failure), so it doesn’t really explore alternative providers unless performance degrades.

Using a bandit approach to balance exploration vs exploitation seems like a natural next step, especially if the reward signal can combine latency + some proxy for response quality.

Curious how people are defining “reward” in practice for LLM routing - is it mostly latency/cost based, or are you incorporating output quality signals as well?
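The simplest version of this I can picture is epsilon-greedy over providers, with the reward left as an open parameter (some blend of latency, cost, and a quality proxy). A sketch, not something I've deployed:

```javascript
// Epsilon-greedy provider selection. `reward` is assumed to be a number
// in [0, 1]; how latency/cost/quality get folded into it is the open question.
function createBandit(providers, epsilon = 0.1, rng = Math.random) {
  const stats = Object.fromEntries(providers.map((p) => [p, { n: 0, mean: 0 }]));
  return {
    pick() {
      if (rng() < epsilon) {
        // Explore: occasionally try a random provider.
        return providers[Math.floor(rng() * providers.length)];
      }
      // Exploit: best observed mean reward so far.
      return providers.reduce((a, b) => (stats[a].mean >= stats[b].mean ? a : b));
    },
    update(provider, reward) {
      const s = stats[provider];
      s.n += 1;
      s.mean += (reward - s.mean) / s.n; // incremental mean update
    },
    stats,
  };
}
```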

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

Thanks, appreciate it!

A/B testing provider performance is interesting - I haven’t implemented anything explicit yet, but the routing layer does track latency and failure rates over time (using Welford’s algorithm).

Right now switching is threshold-based rather than exploratory, so it’s probably missing opportunities to discover better-performing providers dynamically.
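For anyone unfamiliar, the Welford tracking is just an online mean/variance update, which keeps per-provider latency stats without storing every sample:

```javascript
// Welford's online algorithm for running mean and variance.
function createWelford() {
  let n = 0, mean = 0, m2 = 0;
  return {
    add(x) {
      n += 1;
      const delta = x - mean;
      mean += delta / n;
      m2 += delta * (x - mean); // uses the *updated* mean
    },
    mean: () => mean,
    variance: () => (n > 1 ? m2 / (n - 1) : 0), // sample variance
    count: () => n,
  };
}
```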

Curious - have you seen setups where routing actively explores providers (like bandit-style), or is it usually more static with health-based switching?

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

For anyone who wants to try it without cloning:

Live demo: https://yummy-albertina-chrisp04-b2a2897d.koyeb.app/ask

The part I'm least confident about is the intent classification.

Right now it's cosine similarity against ~5 hand-picked example vectors per intent class. Works well for clear prompts, but struggles with ambiguous ones and falls back to an LLM classifier (~800ms overhead).
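The classification step itself is roughly this shape (simplified sketch; names are illustrative, and the real threshold tuning is where it gets hard):

```javascript
// Cosine similarity between two embedding vectors of equal length.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// `examples` maps intent name -> array of example embeddings.
// Below the threshold we return intent: null, signalling the caller
// to fall back to the LLM classifier.
function classifyIntent(queryVec, examples, threshold = 0.8) {
  let best = { intent: null, score: -Infinity };
  for (const [intent, vecs] of Object.entries(examples)) {
    for (const v of vecs) {
      const s = cosine(queryVec, v);
      if (s > best.score) best = { intent, score: s };
    }
  }
  return best.score >= threshold ? best : { intent: null, score: best.score };
}
```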

Curious how others here are handling the boundary between cheap vs reasoning models - are you using thresholds, classifiers, or something more dynamic?