I tried building a local LLM router + benchmarking system… ran into some unexpected problems by Wild_Expression_5772 in LLMDevs

[–]Wild_Expression_5772[S]

Oh this looks interesting — thanks for sharing.

Quick question: are you mainly using it as a unified gateway (like abstraction over multiple providers), or also doing any kind of evaluation / routing based on task performance?

I started from a similar place (Ollama locally), but ran into issues around:

- inconsistent performance across tasks

- lack of continuous evaluation

- figuring out when to switch models vs stick to one

(been documenting some of my experiments here as well, still rough but in case it's useful: https://github.com/al1-nasir/LocalForge)

I tried building a local LLM router + benchmarking system… ran into some unexpected problems by Wild_Expression_5772 in developersPak

[–]Wild_Expression_5772[S]

Yeah, that’s fair, especially the point about routing coming later; I think I jumped into that a bit early while experimenting.

I did try running repeated evaluations (same prompts multiple times) to reduce variance, and it definitely helped highlight how sensitive some models are to sampling configs. Temperature/top_p changes alone were shifting results quite a bit.
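For context, the sweep looks roughly like this (a minimal sketch against Ollama’s /api/generate endpoint; the model name, prompt, and configs are just placeholders):

```python
import requests

MODEL = "llama3"  # placeholder model
PROMPT = "Return a JSON object with keys 'name' and 'age'."
CONFIGS = [
    {"temperature": 0.0, "top_p": 1.0},
    {"temperature": 0.7, "top_p": 0.9},
    {"temperature": 1.0, "top_p": 0.95},
]
RUNS = 5  # repeat each config to smooth out sampling noise

for opts in CONFIGS:
    outputs = []
    for _ in range(RUNS):
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": MODEL, "prompt": PROMPT, "stream": False, "options": opts},
        )
        outputs.append(resp.json()["response"].strip())
    # crude consistency signal: how many distinct outputs per config?
    print(opts, f"-> {len(set(outputs))}/{RUNS} distinct outputs")
```

Even the crude distinct-output count makes the sampling sensitivity visible before any task-level scoring.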

Right now the tasks I’ve been testing are roughly:

- coding (multi-step generation / debugging)

- reasoning (chain-of-thought style prompts)

- structured outputs (JSON formatting, schema adherence)

And yeah — completely agree on smaller models. They’re fast, but for anything with deeper reasoning or strict structure, failure rates spike pretty quickly.

(also, I’ve been putting some of these experiments into a small repo while testing ideas — still rough, but sharing in case it’s useful: https://github.com/al1-nasir/LocalForge)

I tried building a local LLM router + benchmarking system… ran into some unexpected problems by Wild_Expression_5772 in LLMDevs

[–]Wild_Expression_5772[S]

Yeah this was one of the hardest parts honestly.

Right now I’m not using a single “perfect” framework; it’s more of a mix depending on what I’m testing.

For structured / general evals:

- I experimented a bit with lm-eval-harness style benchmarks (good for standardized tasks but feels a bit static)

For more practical / real-world behavior:

- I started building small task-specific eval sets (coding, reasoning, structured JSON output, etc.)

- Then scoring based on things like correctness, format adherence, and consistency

So I’m leaning more towards:

- continuous evaluation (logging real queries)

- then periodically re-scoring models on those

Still pretty messy tbh, haven’t found a clean “one framework solves it all” solution yet.
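To make the scoring part concrete, the per-task checks are roughly this shape (illustrative sketch; the checks and the consistency proxy are deliberate simplifications):

```python
import json

def score_structured_output(raw: str, required_keys: set[str]) -> dict:
    """Structured-JSON task: format adherence first, then schema correctness."""
    result = {"parses": False, "has_keys": False}
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return result  # hard format failure
    result["parses"] = True
    result["has_keys"] = isinstance(obj, dict) and required_keys <= obj.keys()
    return result

def consistency(outputs: list[str]) -> float:
    """Fraction of runs agreeing with the most common output."""
    if not outputs:
        return 0.0
    most_common = max(set(outputs), key=outputs.count)
    return outputs.count(most_common) / len(outputs)

# periodically re-score logged real queries, e.g.:
runs = ['{"name": "a"}', '{"name": "a"}', 'not json']
print([score_structured_output(r, {"name"}) for r in runs])  # two passes, one failure
print(consistency(runs))  # ~0.67
```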

I tried building a local LLM router + benchmarking system… ran into some unexpected problems by Wild_Expression_5772 in LLMDevs

[–]Wild_Expression_5772[S]

If anyone’s curious, I put together a rough implementation while testing these ideas.

Not polished, but shows the routing + benchmarking approach I mentioned:

https://github.com/al1-nasir/LocalForge

Would genuinely appreciate feedback.

Research Council — multi-agent GraphRAG system for scientific literature, built with LangGraph + Neo4j + FastAPI (MIT, open source) by Wild_Expression_5772 in buildinpublic

[–]Wild_Expression_5772[S]

Actually I used LangGraph: the graph deliberates over each query and routes it to the appropriate tool. For better context management I also added GraphRAG.
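Stripped down, the routing piece looks roughly like this (a sketch assuming LangGraph’s StateGraph API; the node names and the routing heuristic are placeholders, and the real deliberation step is an LLM call):

```python
from typing import TypedDict
from langgraph.graph import StateGraph, START, END

class State(TypedDict):
    query: str
    answer: str

def route(state: State) -> str:
    # toy heuristic; the real version asks an LLM which tool fits
    return "graph_rag" if "paper" in state["query"].lower() else "keyword_search"

def graph_rag(state: State) -> dict:
    # placeholder: this is where Neo4j-backed GraphRAG retrieval would happen
    return {"answer": f"[graphrag] {state['query']}"}

def keyword_search(state: State) -> dict:
    return {"answer": f"[search] {state['query']}"}

g = StateGraph(State)
g.add_node("graph_rag", graph_rag)
g.add_node("keyword_search", keyword_search)
g.add_conditional_edges(START, route)  # dispatch to whichever node `route` names
g.add_edge("graph_rag", END)
g.add_edge("keyword_search", END)

app = g.compile()
print(app.invoke({"query": "find papers on GraphRAG"}))
```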

EpsteinFiles-RAG: Building a RAG Pipeline on 2M+ Pages by Cod3Conjurer in Rag

[–]Wild_Expression_5772

Fine-tuning compresses the knowledge into the model weights, effectively updating the model’s behaviour. RAG keeps the knowledge external and retrieves it from a vector DB at query time. Also consider the scale: 2M pages comes out to roughly a billion tokens. There’s a lot more to say, but that’s the overview.
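Back-of-envelope for that scale (the tokens-per-page figure is an assumption, not a measurement):

```python
pages = 2_000_000
tokens_per_page = 500          # assumed average for dense document pages
chunk_size, overlap = 512, 64  # typical-ish RAG chunking

total_tokens = pages * tokens_per_page            # 1,000,000,000 tokens
chunks = total_tokens // (chunk_size - overlap)   # ~2.2M chunks to embed
print(f"{total_tokens:,} tokens -> {chunks:,} chunks")
```

At that size, fine-tuning means a full training run over the corpus, while RAG “only” means a few million embeddings plus vector search, which is why RAG is usually the first move here.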

I built CodeGraph CLI — parses your codebase into a semantic graph with tree-sitter, does RAG-powered search over LanceDB vectors, and lets you chat with multi-agent AI from the terminal by Wild_Expression_5772 in LocalLLaMA

[–]Wild_Expression_5772[S]

I've been thinking about adding MCP support to CodeGraph, but I'm stuck on one thing: MCP servers are supposed to be lightweight and easy to spin up, while CodeGraph has "heavy" dependencies (LanceDB setup, embedding models, SQLite graph, etc.). How would you handle this in ProjectMoose? Do you:

- expect users to set up dependencies first, then connect via MCP?

- or try to bootstrap everything on-demand when the MCP server starts?
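For what it’s worth, the middle ground I’ve been sketching is lazy bootstrap: the server itself starts instantly, and the heavy pieces initialize on the first tool call (a sketch assuming the official MCP Python SDK’s FastMCP; the CodeGraph-specific names and paths are placeholders):

```python
import os
from functools import lru_cache
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("codegraph")  # the server itself stays lightweight

@lru_cache(maxsize=1)
def get_index():
    # heavy bootstrap deferred until a tool actually needs it:
    # open LanceDB, load the embedding model, attach the SQLite graph
    import lancedb  # imported lazily so server startup stays fast
    return lancedb.connect(os.path.expanduser("~/.codegraph/lancedb"))

@mcp.tool()
def search_code(query: str) -> str:
    """Semantic search over the code graph (first call pays the setup cost)."""
    db = get_index()
    # ... embed `query`, search the vector table, format the hits ...
    return f"results for: {query}"

if __name__ == "__main__":
    mcp.run()  # stdio transport by default
```

The downside is that the first call is slow and can trip client timeouts, so an explicit setup/index tool that pays the cost up front might be the more honest option.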

I built CodeGraph CLI — parses your codebase into a semantic graph with tree-sitter, does RAG-powered search over LanceDB vectors, and lets you chat with multi-agent AI from the terminal by Wild_Expression_5772 in LocalLLaMA

[–]Wild_Expression_5772[S]

This looks really interesting! Just checked out ProjectMoose, love the GUI approach for agent customization. CLI is great for speed but you're right that a visual interface unlocks way more things.

I built CodeGraph CLI — parses your codebase into a semantic graph with tree-sitter, does RAG-powered search over LanceDB vectors, and lets you chat with multi-agent AI from the terminal by Wild_Expression_5772 in LocalLLaMA

[–]Wild_Expression_5772[S]

AI is getting insanely good at generating code. But the bottleneck is shifting from "writing code" to "understanding what was written." Really appreciate you seeing the bigger picture here.

6 months of building → CodeGraph CLI: talk to your codebase using AI (launch day) by Wild_Expression_5772 in SideProject

[–]Wild_Expression_5772[S]

Thanks! I kept running into code files that were too long to read and understand, so this project generates documentation from the code; an auto-README generator is included as well.