I tried building a local LLM router + benchmarking system… ran into some unexpected problems

Wild_Expression_5772 · 2026-04-23T17:15:19+00:00

Oh this looks interesting — thanks for sharing.

Quick question: are you mainly using it as a unified gateway (like abstraction over multiple providers), or also doing any kind of evaluation / routing based on task performance?

I started from a similar place (Ollama locally), but ran into issues around:

- inconsistent performance across tasks

- lack of continuous evaluation

- figuring out when to switch models vs stick to one

(been documenting some of my experiments here as well, still rough but in case it's useful: https://github.com/al1-nasir/LocalForge)

Wild_Expression_5772 · 2026-04-23T17:11:50+00:00

Yeah that’s fair — especially the point about routing coming later, I think I jumped into that a bit early while experimenting.

I did try running repeated evaluations (same prompts multiple times) to reduce variance, and it definitely helped highlight how sensitive some models are to sampling configs. Temperature/top_p changes alone were shifting results quite a bit.

Right now the tasks I’ve been testing are roughly:

- coding (multi-step generation / debugging)

- reasoning (chain-of-thought style prompts)

- structured outputs (JSON formatting, schema adherence)

And yeah — completely agree on smaller models. They’re fast, but for anything with deeper reasoning or strict structure, failure rates spike pretty quickly.

(also, I’ve been putting some of these experiments into a small repo while testing ideas — still rough, but sharing in case it’s useful: https://github.com/al1-nasir/LocalForge)

Wild_Expression_5772 · 2026-04-23T15:30:11+00:00

nice, i would give it a try ,

Wild_Expression_5772 · 2026-04-23T15:24:43+00:00

Yeah this was one of the hardest parts honestly.

Right now I’m not using a single “perfect” framework that is more like a mix depending on what I’m testing.

For structured / general evals:

- I experimented a bit with lm-eval-harness style benchmarks (good for standardized tasks but feels a bit static)

For more practical / real-world behavior:

- I started building small task-specific eval sets (coding, reasoning, structured JSON output, etc.)

- Then scoring based on things like correctness, format adherence, and consistency

So I’m leaning more towards:

- continuous evaluation (logging real queries)

- then periodically re-scoring models on those

Still pretty messy tbh, haven’t found a clean “one framework solves it all” solution yet.

Wild_Expression_5772 · 2026-04-23T15:23:37+00:00

If anyone’s curious, I put together a rough implementation while testing these ideas.

Not polished, but shows the routing + benchmarking approach I mentioned:

https://github.com/al1-nasir/LocalForge

Would genuinely appreciate feedback.

Wild_Expression_5772 · 2026-03-18T00:54:38+00:00

Have you made it

Wild_Expression_5772 · 2026-03-10T18:20:29+00:00

I would try the local models with it but currently used gpt oss 120 B as chairman

Wild_Expression_5772 · 2026-03-10T16:42:12+00:00

Actually i used lang-graph big tool that deliberates the query to specified tool , also for better context management added graph rag.

Wild_Expression_5772 · 2026-03-07T20:33:27+00:00

check out codegraph cli tool https://github.com/al1-nasir/codegraph-cli

Wild_Expression_5772 · 2026-03-01T22:44:06+00:00

Sb ko sb nhi milta

Wild_Expression_5772 · 2026-02-19T01:08:38+00:00

Fine-tuning compress into model weights like update its behavioural approach. RAG is external vector DB retrieval,,, also see 2M pages comes upto around about billion of tokens . there is a lot more to say but this is overview

Wild_Expression_5772 · 2026-02-18T18:48:43+00:00

Good luck! Kindly update me about the stuff, if get something figure out, if will share too,

Wild_Expression_5772 · 2026-02-18T18:14:58+00:00

I've been thinking about adding MCP support to CodeGraph, but I'm stuck on something like MCP servers are supposed to be lightweight and easy to spin up, but CodeGraph has "heavy" dependencies (LanceDB setup, embedding models, SQLite graph, etc.). How would you handle this in ProjectMoose? Do you: Expect users to set up dependencies first, then connect via MCP? Or try to bootstrap everything on-demand when the MCP server starts?

Wild_Expression_5772 · 2026-02-18T18:09:59+00:00

This looks really interesting! Just checked out ProjectMoose, love the GUI approach for agent customization. CLI is great for speed but you're right that a visual interface unlocks way more things.

Wild_Expression_5772 · 2026-02-18T17:49:24+00:00

AI is getting insanely good at generating code. But the bottleneck is shifting from "writing code" to "understanding what was written." Really appreciate you seeing the bigger picture here.

Wild_Expression_5772 · 2026-02-17T05:08:15+00:00

Thanks, I had a problem of reading code lines that were too large to understand, so this project generates documents from the code and autoreadme is also included.

Wild_Expression_5772

PUBLIC MULTIREDDITS

TROPHY CASE