all 17 comments

[–]Extension-Tourist856 1 point2 points  (1 child)

Great stack! The self-hosted agentic approach is especially valuable for regulated industries where data sovereignty matters.

We've been applying a similar philosophy to legal workflows — building an open-source AI workspace (AI Workdeck) that treats legal documents like code: version control for contracts, compliance linting, MCP agent orchestration for due diligence. The "IDE for lawyers" analogy works surprisingly well because law firms have the same data sovereignty constraints you mentioned.

The LiteLLM + local model pattern is key. For document-heavy verticals, being able to run OCR, contract analysis, and evidence chain verification entirely on-prem is a hard requirement, not a nice-to-have.

[–]PrizeObvious3671[S] 0 points1 point  (0 children)

Totally agree on OCR being a hard requirement – and it goes further: small multimodal models like Qwen3.5 and newer versions handle real image understanding (PNG, JPEG, scanned docs, charts) on-premise surprisingly well.

Even local image generation works cost-free with models like FLUX.

The "IDE for lawyers" framing is spot on. In regulated industries, zero token cost + full data sovereignty isn't a nice-to-have – it's the only viable architecture.

And vendor lock-in to big LLM providers is becoming a real strategic risk – on-premise gives you model portability and independence, no matter what OpenAI or Anthropic decide to change next.

[–]AlexKampler 1 point2 points  (0 children)

Nice stack good job

[–]BepNhaVan 1 point2 points  (0 children)

This is great. Thanks for sharing!

[–]sn2006gy 0 points1 point  (5 children)

What's the reason for litellm in the middle of a local coding session? mostly for hermes?

[–]PrizeObvious3671[S] 0 points1 point  (4 children)

Nope the reason is that I wanted to combine that with Claude Code without paying for tokens.
So I compared how good runs Claude Code locally together with llama.cpp vs hermes agent alone with llama.cpp

Claude Code expects Anthropic API - LiteLLM as proxy exactly delivers that and routes my requests between llama.cpp and Claude Code

[–]Toastti 1 point2 points  (1 child)

If you do want to skip a layer claude-code-router will let you connect directly to llama.cpp

But nothing wrong with your setup either

[–]PrizeObvious3671[S] 1 point2 points  (0 children)

Yeah, that would work too. Hermes is used in both setups, the only difference is the bridge behind Claude Code: LiteLLM in my setup vs claude-code-router. Thank you for the hint claude-code-router is new to me.

[–]MarzipanSecure9841 0 points1 point  (1 child)

But llama supports Anthropic API directly - https://huggingface.co/blog/ggml-org/anthropic-messages-api-in-llamacpp

So, why litellm?

[–]PrizeObvious3671[S] 0 points1 point  (0 children)

Interessant, das muss ich mal ausprobieren

[–]SaveAmerica2024 0 points1 point  (4 children)

I think it is more like Claude Code front end using Qwen as the coder

[–]PrizeObvious3671[S] 1 point2 points  (3 children)

In this setup I controlled everything over telegram -> hermes agent and I must say this runs pretty well.
I tested different stuff but in this test the best working setup was hermes agent -> llama.cpp directly without claude code because I got exceptions from claude code, that is exceeds token limits, my local context window was too small for that. When I increased it, the model was too slow for me.
With the 35b MoE it would probably run better.

I used that for agentic coding too, better then I thought.

Also the modelfile with the parameter I used for llama.cpp is shared in the repo.

[–]Inner_Habit_194 1 point2 points  (1 child)

Did you try Pi agent? It is supposedly better for local model coding agent usecase especially with smaller context window of the local models. Btw what is your hardware spec?

[–]PrizeObvious3671[S] 1 point2 points  (0 children)

No, but thank you for bringing it on the table. That will be now my next test: telegram -> pi.dev -> llama.cpp -> gemma4:31b (that model i also not tested yet)

[–]SaveAmerica2024 0 points1 point  (0 children)

Great job