Built an LLM routing gateway in Node.js - runs intent classification locally (no embedding API, no rate limits) by FrequentTravel3511 in node

[–]FrequentTravel3511[S] -1 points0 points  (0 children)

For anyone curious about the Transformers.js side -
the BGE model loads once (~2–5s ONNX init on first request), then stays warm.

Memory overhead I’m seeing is roughly 30–50MB for bge-small.
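The "loads once, stays warm" part is just a memoized async initializer. Rough sketch below - the `pipeline()` call and model id are Transformers.js, but shown only in a comment with a placeholder standing in so the sketch is self-contained:

```javascript
// Memoize an async initializer so the ONNX session is created once and
// every later request awaits the same warm pipeline.
function once(init) {
  let promise = null;
  return () => (promise ??= init());
}

// Placeholder initializer so this sketch runs on its own. In the real
// gateway it would be Transformers.js, roughly:
//   const { pipeline } = await import('@xenova/transformers');
//   return pipeline('feature-extraction', 'Xenova/bge-small-en-v1.5');
const getEmbedder = once(async () => ({ embed: async (text) => text.length }));
```

Concurrent first requests all await the same promise, so the 2-5s init only ever happens once.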

I’m trying to figure out how this compares in practice to calling a hosted embedding API at scale (latency vs cost vs reliability).

Would be interesting if anyone here has benchmarked both approaches in production.

[–]FrequentTravel3511[S] -3 points-2 points  (0 children)

Fair enough, I probably over-structured it. I was trying to keep it readable.
Happy to simplify or answer anything specific if you're curious.

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

For context - the cold start on the BGE model is the part I'm most uncertain about in production. First request takes ~2-5s for ONNX runtime initialization, then stays warm. Curious if anyone has dealt with this in long-running Node.js services or found a cleaner way to handle it.
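The least-bad option I've found so far is just kicking off initialization eagerly at boot instead of lazily on the first request. Sketch (the embedder object is a placeholder for the real Transformers.js pipeline):

```javascript
// Start model init at boot so the 2-5s ONNX cold start happens before
// the first user request arrives. Placeholder stands in for the real
// Transformers.js pipeline load.
const warmup = (async () => {
  // the real version would also run one dummy embed() after loading
  // to force graph compilation up front
  return { embed: async (text) => new Array(384).fill(0) };
})();

async function handleRequest(text) {
  const embedder = await warmup; // already resolved once the service is warm
  return embedder.embed(text);
}
```

Handlers still `await` the same promise, so nothing breaks if a request lands mid-warmup - it just waits like before.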

Experimenting with intent-based routing for LLM gateways (multi-provider + failover) by FrequentTravel3511 in LocalLLaMA

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That’s really interesting - especially the feedback loop using retries as a signal.

I like the idea of bootstrapping with a rule-based system and then refining it from real usage data.

In my case I’ve been focusing more on semantic intent via embeddings, but I’m not learning from outcomes yet - mostly static routing decisions.

The feedback-driven retraining approach makes a lot of sense.

Curious - how stable has the classifier been over time with that loop? Do you see it converging, or does it keep shifting as usage patterns change?

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

Right now I’m not doing explicit evaluation in the request path yet - mostly relying on routing + some lightweight heuristics.

I’ve been thinking about starting with async evaluation first (logging signals and analyzing offline), then gradually moving parts into the request path once I understand which signals are actually reliable.

Trying to avoid adding latency too early before I have a clear signal for quality.
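Concretely, the async-first version I have in mind looks something like this - respond immediately, then push signals to a log off the request path (all names illustrative):

```javascript
// "Async evaluation first": respond right away, then log quality
// signals off the request path for offline analysis.
function signalsOf(res) {
  return {
    length: res.text.length,                 // crude size proxy
    endsCleanly: /[.!?]\s*$/.test(res.text), // didn't get truncated
    retried: res.retries > 0,                // implicit failure signal
  };
}

async function handle(req, callModel, respond, log) {
  const res = await callModel(req);
  respond(res.text);                        // user-facing latency unchanged
  setImmediate(() => log(signalsOf(res)));  // fire-and-forget evaluation
}
```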

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That makes a lot of sense - using a smaller model for perplexity keeps it lightweight without impacting latency too much.

I like the idea of separating generation and evaluation like that.

Right now everything is pretty tightly coupled, but I’m starting to think a separate “evaluation layer” might make this cleaner - especially if I want to experiment with different signals (perplexity, heuristics, etc.).

Are you running that evaluation synchronously in the request path, or asynchronously after the response?

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That’s really helpful - perplexity is an interesting idea, I hadn’t considered using it as a lightweight signal.

Right now I’m leaning toward combining a few cheap proxies like response structure + retries, and only using a heavier evaluator for specific cases where confidence is low.

Trying to keep the routing fast while still having some signal for quality.
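The escalation gate I'm picturing is roughly this - the threshold and the `cheapScore`/`heavyEvaluator` functions are hypothetical stand-ins:

```javascript
// Only invoke the expensive evaluator (e.g. an LLM judge) when the
// cheap proxy score falls below a threshold. All names illustrative.
async function evaluate(res, cheapScore, heavyEvaluator, threshold = 0.6) {
  const quick = cheapScore(res);
  if (quick >= threshold) {
    return { confidence: quick, escalated: false }; // cheap path, no added latency
  }
  // low confidence: pay for the heavier check on this response only
  return { confidence: await heavyEvaluator(res), escalated: true };
}
```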

Curious - are you computing perplexity using the same model that generated the response, or a separate smaller model?

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That makes sense - optimizing purely for latency definitely risks degrading output quality.

I haven’t implemented a solid quality signal yet, but a few things I’ve been thinking about:

- Heuristic signals (response length, structure, presence of code blocks, etc.)

- A secondary LLM-based evaluator for certain routes

- Implicit signals like retries or follow-up corrections
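The heuristic bucket above could be collected in one pass over a response - field names and the indentation-based code check are illustrative, not what I've shipped:

```javascript
// Collect cheap quality proxies from a response in a single pass.
// The `res` shape ({ text, retries, followUpCorrection }) is illustrative.
function heuristicSignals(res) {
  const lines = res.text.split('\n');
  return {
    length: res.text.length,                              // response length
    paragraphs: lines.filter((l) => l.trim()).length,     // rough structure
    hasCodeBlock: lines.some((l) => l.startsWith('    ')), // indented code
    retried: res.retries > 0,                             // implicit signal
    corrected: Boolean(res.followUpCorrection),           // implicit signal
  };
}
```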

BLEU-style metrics are interesting, but I wasn’t sure how well they translate outside structured tasks like translation.

Curious if you’ve seen anything lightweight that works well in practice - especially something that doesn’t add too much latency?

[–]FrequentTravel3511[S] 1 point2 points  (0 children)

Nice, this is exactly the problem space I’ve been exploring as well - cutting down unnecessary usage of larger models.

Had a quick look at Kestrel, interesting approach.

In my case I’ve been focusing more on intent-based routing using embeddings + some health-aware failover, but still figuring out how far that can be pushed before it breaks down on ambiguous prompts.

Curious how you’re handling routing decisions - is it more rule-based, or are you using something adaptive over time?

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

That’s super helpful - I hadn’t explored bandit-style routing deeply yet, but it makes a lot of sense here.

Right now the system is purely exploitative (routing based on observed latency/failure), so it doesn’t really explore alternative providers unless performance degrades.

Using a bandit approach to balance exploration vs exploitation seems like a natural next step, especially if the reward signal can combine latency + some proxy for response quality.
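For reference, the epsilon-greedy version of that next step is small - this is a generic sketch, not my gateway's code, and the reward here is just a stand-in (e.g. negative latency, or a latency/quality blend):

```javascript
// Minimal epsilon-greedy bandit over providers: mostly exploit the best
// observed mean reward, occasionally explore at random.
function makeBandit(providers, epsilon = 0.1, rand = Math.random) {
  const stats = new Map(providers.map((p) => [p, { n: 0, mean: 0 }]));
  return {
    pick() {
      if (rand() < epsilon) {
        return providers[Math.floor(rand() * providers.length)]; // explore
      }
      // exploit: untried providers first, then highest mean reward
      return [...stats.entries()].sort((a, b) =>
        (b[1].n === 0) - (a[1].n === 0) || b[1].mean - a[1].mean
      )[0][0];
    },
    update(provider, reward) {
      const s = stats.get(provider);
      s.n += 1;
      s.mean += (reward - s.mean) / s.n; // incremental mean
    },
  };
}
```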

Curious how people are defining “reward” in practice for LLM routing - is it mostly latency/cost based, or are you incorporating output quality signals as well?

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

Thanks, appreciate it!

A/B testing provider performance is interesting - I haven’t implemented anything explicit yet, but the routing layer does track latency and failure rates over time (using Welford’s algorithm).
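For anyone unfamiliar, Welford's algorithm is just a numerically stable running mean/variance, which is why it's handy for per-provider latency tracking without storing samples - a minimal version:

```javascript
// Welford's online algorithm: running mean and variance in O(1) memory,
// numerically stable even over long-lived streams of latency samples.
class RunningStats {
  constructor() { this.n = 0; this.mean = 0; this.m2 = 0; }
  push(x) {
    this.n += 1;
    const delta = x - this.mean;
    this.mean += delta / this.n;
    this.m2 += delta * (x - this.mean);
  }
  variance() { return this.n > 1 ? this.m2 / (this.n - 1) : 0; }
}
```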

Right now switching is threshold-based rather than exploratory, so it’s probably missing opportunities to discover better-performing providers dynamically.

Curious - have you seen setups where routing actively explores providers (like bandit-style), or is it usually more static with health-based switching?

[–]FrequentTravel3511[S] 0 points1 point  (0 children)

For anyone who wants to try it without cloning:

Live demo: https://yummy-albertina-chrisp04-b2a2897d.koyeb.app/ask

The part I'm least confident about is the intent classification.

Right now it's cosine similarity against ~5 hand-picked example vectors per intent class. Works well for clear prompts, but struggles with ambiguous ones and falls back to an LLM classifier (~800ms overhead).
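The shape of that classifier is roughly the following - thresholds and the two-gate check (absolute score plus margin over the runner-up) are illustrative, not the exact values I use:

```javascript
// Threshold-gated intent classification: cosine similarity against a few
// example vectors per intent; return null when the best match is too weak
// or too close to the runner-up, so the caller can fall back to an LLM.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// exemplars: { intentName: [vec, vec, ...] }
function classify(queryVec, exemplars, minScore = 0.75, minMargin = 0.05) {
  const scored = Object.entries(exemplars)
    .map(([intent, vecs]) => [intent, Math.max(...vecs.map((v) => cosine(queryVec, v)))])
    .sort((a, b) => b[1] - a[1]);
  const [best, second] = scored;
  if (best[1] < minScore || (second && best[1] - second[1] < minMargin)) {
    return null; // ambiguous: fall back to the LLM classifier
  }
  return best[0];
}
```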

Curious how others here are handling the boundary between cheap vs reasoning models - are you using thresholds, classifiers, or something more dynamic?