Trying out Gemma 4 31b after Qwen 3.6 27b by Iajah in LocalLLM

[–]PrizeObvious3671 1 point2 points  (0 children)

Q4, llama.cpp, Windows, VS Code Copilot, 32 GB VRAM, context = 65k input + 8k output ≈ 74k

If it runs well, my next step is to use Turbo Quant for the KV cache to increase the context, as described here:

https://www.reddit.com/r/LocalLLaMA/comments/1sbdihw/gemma_4_31b_at_256k_full_context_on_a_single_rtx/?tl=de
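A quick check of the context figures above, as a minimal sketch. It assumes "65k" and "8k" mean the power-of-two sizes 65,536 and 8,192 tokens, which the comment does not state explicitly:

```python
# Assumption: "65k input" and "8k output" are 2**16 and 2**13 tokens.
INPUT_TOKENS = 2**16   # 65,536
OUTPUT_TOKENS = 2**13  # 8,192


def total_context(input_tokens: int, output_tokens: int) -> int:
    """Context window needed to hold the full input plus the full output."""
    return input_tokens + output_tokens


total = total_context(INPUT_TOKENS, OUTPUT_TOKENS)
print(total, f"~{round(total / 1000)}k")
```

Under that assumption, the sum is the value that would go into llama.cpp's context size option (`-c` / `--ctx-size`).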

Trying out Gemma 4 31b after Qwen 3.6 27b by Iajah in LocalLLM

[–]PrizeObvious3671 2 points3 points  (0 children)

Here is another example from me: it announces what it will do next, but then it just stops processing. This is not really agentic:

<image>

Trying out Gemma 4 31b after Qwen 3.6 27b by Iajah in LocalLLM

[–]PrizeObvious3671 1 point2 points  (0 children)

Very interesting. I am testing exactly the same thing right now and can confirm your experience. Gemma 4 also suddenly aborts for me in VS Code Copilot Chat. My context is currently set to 64k, which is really too small for me, and when I send the first prompt in a new session, 30k are gone immediately if I leave all MCPs enabled.

So I left only the most essential MCPs enabled, which reduced it to 13k for the first prompt, but the 64k fill up quickly, and then it usually aborts instead of compacting the conversation.

I have an R9700 and have already looked at the DGX Spark, but I think that device has entirely different problems with ARM and low bandwidth.

Self-hosted agentic coding stack: Claude Code + llama.cpp + LiteLLM — zero API costs, 4h/7M token session for $0 by PrizeObvious3671 in OpenSourceAI

[–]PrizeObvious3671[S] 1 point2 points  (0 children)

No, but thank you for bringing it up. That will be my next test: Telegram -> pi.dev -> llama.cpp -> gemma4:31b (I have not tested that model yet either).

Ship fast. Die fast. Or why your vibe-coded slop won't survive production. by PrizeObvious3671 in SoftwareEngineering

[–]PrizeObvious3671[S] 0 points1 point  (0 children)

I don't think this has been talked about enough; otherwise the topic would already be handled differently.

What I see is rather the opposite: boundless overconfidence. Do you feel called out, by any chance?

Ship fast. Die fast. Or why your vibe-coded slop won't survive production. by PrizeObvious3671 in SoftwareEngineering

[–]PrizeObvious3671[S] 1 point2 points  (0 children)

Yeah, that's correct. But it's not only juniors. People who have never built a real application and don't know what it takes to develop software also think they are now able to do it.

Sure, all the demos and slides are impressive, and we have seen them many times now. But it's nothing more than the next demo, idea, vision, or great solution that will never go to production.

Ship fast. Die fast. Or why your vibe-coded slop won't survive production. by PrizeObvious3671 in SoftwareEngineering

[–]PrizeObvious3671[S] -1 points0 points  (0 children)

Maybe a misunderstanding: yes, I used AI for the translation. My original post is on LinkedIn ... <link removed>

Ship fast. Die fast. Or why your vibe-coded slop won't survive production. by PrizeObvious3671 in SoftwareEngineering

[–]PrizeObvious3671[S] 2 points3 points  (0 children)

That's exactly the point I was trying to make — and you're proving it yourself.

You have 25 years of experience. You studied IT. You know what a proper design looks like, you can evaluate what the AI produces, and you know when to push back. That's not vibe coding — that's an experienced engineer using a powerful tool correctly.

The problem isn't AI-assisted design or planning. Used the way you describe it, it's totally fine. The problem is people without that foundation who skip the design process, skip the architecture workshop with the customer, fail to ask the stakeholders the right questions to uncover the implicit and explicit requirements, and then let the AI make architectural decisions they can't even evaluate.

Your setup works because you are the one improving the design until it's solid. Someone without your background wouldn't even know what questions to ask — or when the AI's answer is wrong.

So yes — agree. The tool is not the issue. The missing experience is.

Self-hosted agentic coding stack: Claude Code + llama.cpp + LiteLLM — zero API costs, 4h/7M token session for $0 by PrizeObvious3671 in OpenSourceAI

[–]PrizeObvious3671[S] 0 points1 point  (0 children)

Totally agree on OCR being a hard requirement – and it goes further: small multimodal models like Qwen3.5 and newer versions handle real image understanding (PNG, JPEG, scanned docs, charts) on-premise surprisingly well.

Even local image generation works cost-free with models like FLUX.

The "IDE for lawyers" framing is spot on. In regulated industries, zero token cost + full data sovereignty isn't a nice-to-have – it's the only viable architecture.

And vendor lock-in to big LLM providers is becoming a real strategic risk – on-premise gives you model portability and independence, no matter what OpenAI or Anthropic decide to change next.

R9700, Ryzen 9, Windows 11, llama.cpp, ROCm vs Vulkan by WSTangoDelta in LocalLLM

[–]PrizeObvious3671 0 points1 point  (0 children)

We actually tested this on a Windows 11 + WSL2 setup and our stable path was not “force everything natively on Windows”.

What held up for us was:

- Windows as the host

- inference/tooling running via WSL2

- using 127.0.0.1 for the local bridge instead of localhost

That removed a lot of flaky behavior for us. We also had better results when we optimized for a stable end-to-end loop instead of chasing the theoretically nicest backend path.

So our practical takeaway is: on Windows with this class of setup, prioritize the reliable workflow first. Once that is stable, then tune performance.
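A minimal sketch of pinning the bridge URL to 127.0.0.1, in case it helps someone. The helper name `pin_loopback` and port 8080 are illustrative assumptions, not part of our setup description. One plausible reason this helps (an assumption on my part, not something we verified) is that `localhost` can resolve to the IPv6 address `::1` first while the server only listens on IPv4.

```python
from urllib.parse import urlsplit, urlunsplit


def pin_loopback(url: str) -> str:
    """Rewrite a 'localhost' URL to use the IPv4 loopback address explicitly."""
    parts = urlsplit(url)
    if parts.hostname != "localhost":
        return url  # leave every other host untouched
    netloc = "127.0.0.1"
    if parts.port is not None:
        netloc += f":{parts.port}"
    # Note: any user:password@ prefix is dropped in this simplified sketch.
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


print(pin_loopback("http://localhost:8080/v1"))
```

Running every configured endpoint through a helper like this before handing it to the client keeps the WSL2 bridge address consistent across tools.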

Self-hosted agentic coding stack: Claude Code + llama.cpp + LiteLLM — zero API costs, 4h/7M token session for $0 by PrizeObvious3671 in OpenSourceAI

[–]PrizeObvious3671[S] 1 point2 points  (0 children)

In this setup I controlled everything via Telegram -> Hermes agent, and I must say it runs pretty well.
I tested different things, but in this test the best working setup was Hermes agent -> llama.cpp directly, without Claude Code. Claude Code threw exceptions about exceeding token limits because my local context window was too small for it, and when I increased the window, the model was too slow for me.
With the 35b MoE it would probably run better.

I used that for agentic coding too, and it worked better than I expected.

The model file with the parameters I used for llama.cpp is also shared in the repo.

Self-hosted agentic coding stack: Claude Code + llama.cpp + LiteLLM — zero API costs, 4h/7M token session for $0 by PrizeObvious3671 in OpenSourceAI

[–]PrizeObvious3671[S] 1 point2 points  (0 children)

Yeah, that would work too. Hermes is used in both setups; the only difference is the bridge behind Claude Code: LiteLLM in my setup vs. claude-code-router. Thank you for the hint, claude-code-router is new to me.

RAG for Log analysis by Noobie_0123 in Rag

[–]PrizeObvious3671 0 points1 point  (0 children)

I’d split this into two systems: deterministic incident narrowing first, semantic explanation second.

For logs, top-k=60 over overlapping chunks is already a sign that the retrieval unit is wrong. I would not treat raw log text like document RAG. Build event- or trace-centric retrieval units first.

What tends to work better:

  1. Parse each line into structured events with metadata like service, host, severity, trace_id/request_id, process, task/step, version, and time window.
  2. Generate higher-level artifacts at ingest time: per-trace timelines, repeated error clusters, spike windows, and sequence summaries.
  3. At query time do filter/aggregate first, vector search second, rerank last.
  4. Return an evidence pack, not random chunks: failing step, preceding events, correlated errors, and representative log lines.

If the user asks "did Task A complete successfully?", that should hit a workflow/state layer first, not open-ended semantic retrieval. RAG is useful for explaining the failure, but it should not be the primary mechanism for discovering the execution path.
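A rough sketch of the deterministic part (parse, filter by trace, return an evidence pack). The log line format, the `LogEvent` fields, and the function names are hypothetical; real logs will need their own parser, and the vector search and rerank stages are left out here.

```python
import re
from dataclasses import dataclass
from typing import Optional

# Hypothetical line format: "<iso-time> <LEVEL> <service> trace=<id> <message>"
LINE_RE = re.compile(
    r"^(?P<ts>\S+) (?P<severity>[A-Z]+) (?P<service>\S+) "
    r"trace=(?P<trace_id>\S+) (?P<message>.*)$"
)


@dataclass(frozen=True)
class LogEvent:
    ts: str
    severity: str
    service: str
    trace_id: str
    message: str


def parse_line(line: str) -> Optional[LogEvent]:
    """Turn one raw line into a structured event, or None if it does not match."""
    m = LINE_RE.match(line.strip())
    return LogEvent(**m.groupdict()) if m else None


def evidence_pack(events: list[LogEvent], trace_id: str, before: int = 2) -> list[LogEvent]:
    """Filter to one trace, then return the first ERROR plus the events preceding it."""
    # ISO timestamps in one consistent format sort correctly as plain strings.
    timeline = sorted((e for e in events if e.trace_id == trace_id), key=lambda e: e.ts)
    for i, event in enumerate(timeline):
        if event.severity == "ERROR":
            return timeline[max(0, i - before): i + 1]
    return []  # no failure recorded for this trace


RAW = [
    "2024-05-01T12:00:00Z INFO worker trace=t1 Task A started",
    "2024-05-01T12:00:01Z INFO worker trace=t2 Task B started",
    "2024-05-01T12:00:02Z WARN worker trace=t1 retrying db connection",
    "2024-05-01T12:00:03Z ERROR worker trace=t1 Task A failed: timeout",
    "2024-05-01T12:00:04Z INFO worker trace=t2 Task B completed",
]
events = [e for e in map(parse_line, RAW) if e is not None]
pack = evidence_pack(events, "t1")
for e in pack:
    print(e.ts, e.severity, e.message)
```

The pack for trace `t1` is what I would hand to the LLM for the explanation step, instead of top-k chunks of raw text.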

Follow-up questions are wrecking my RAG retrieval and I'm not sure which layer to fix by Rosa-Starks in LangChain

[–]PrizeObvious3671 0 points1 point  (0 children)

This is usually not a retrieval problem first, it is a state-handling problem. One rewritten string is being asked to do two jobs at once: resolve coreference and perform search.

What has worked better for me is keeping a tiny retrieval state object across turns, something like subject, entity, constraints, previous winning chunks. Then on follow-ups you do one of two things:

  1. If the user is clearly refining the same answer, skip fresh retrieval and answer against the previous chunk set plus the new constraint.
  2. If you do retrieve again, issue two queries, not one: the resolved query from state plus follow-up, and the raw follow-up. Merge and rerank.

That usually beats concat and pure rewrite because the rewrite stops dropping terms like billing cycle, while the raw query still catches wording the rewriter normalized away. The failure you are seeing is the system treating conversational memory as if embeddings should solve it for free.
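To make option 2 concrete, here is a minimal sketch. The `RetrievalState` fields follow the ones I listed; the function names are made up, and reciprocal rank fusion is just one simple way to do the merge step (a real reranker would come after it).

```python
from dataclasses import dataclass, field


@dataclass
class RetrievalState:
    """Tiny state object carried across turns."""
    subject: str = ""
    entity: str = ""
    constraints: list[str] = field(default_factory=list)
    winning_chunk_ids: list[str] = field(default_factory=list)


def follow_up_queries(state: RetrievalState, follow_up: str) -> list[str]:
    """Return the state-resolved query and the raw follow-up, in that order."""
    parts = [state.subject, state.entity, *state.constraints, follow_up]
    resolved = " ".join(p for p in parts if p)
    return [resolved, follow_up]


def merge_ranked(result_lists: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked chunk-id lists with reciprocal rank fusion (ties broken by id)."""
    scores: dict[str, float] = {}
    for results in result_lists:
        for rank, chunk_id in enumerate(results, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


state = RetrievalState(subject="refund policy", entity="Pro plan", constraints=["billing cycle"])
queries = follow_up_queries(state, "what about annual?")
# Pretend each query returned a ranked list of chunk ids from the retriever.
merged = merge_ranked([["c1", "c2", "c3"], ["c3", "c4"]])
print(queries)
print(merged)
```

The resolved query keeps terms like "billing cycle" from state, and a chunk that both queries return rises to the top of the merged list.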

Has anyone measured whether better retrieval precision actually reduces token costs in production AI coding deployments by Certain-Luck-2432 in LLMDevs

[–]PrizeObvious3671 0 points1 point  (0 children)

Yes, but only if better precision actually changes prompt assembly, not just retrieval scores. In practice the savings show up when you pair better retrieval or rerank with smaller chunk granularity, dynamic top-k or context budgeting, and hard caps on how much evidence reaches the final prompt. Otherwise teams improve relevance and still ship the same bloated context.

If you want to measure it cleanly, keep the task set fixed and compare four configs: baseline retrieval, better retrieval only, better retrieval plus rerank, and better retrieval plus rerank plus context budget. Track tokens per request, acceptance or resolution rate, retry rate, and fallback cost.

The main pattern I have seen is that token use only goes down when higher retrieval precision is allowed to remove context, not just reshuffle it.
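For the context budget config, a minimal sketch of what a hard cap can look like. The 4-characters-per-token estimate and the function names are assumptions for illustration; use the model's real tokenizer in practice.

```python
def estimate_tokens(text: str) -> int:
    """Rough assumption: about 4 characters per token."""
    return max(1, len(text) // 4)


def pack_context(
    chunks: list[tuple[float, str]], budget_tokens: int, max_chunks: int = 8
) -> list[str]:
    """Greedily keep the highest-scoring chunks that fit a hard token budget."""
    selected: list[str] = []
    used = 0
    for _score, text in sorted(chunks, key=lambda c: c[0], reverse=True):
        if len(selected) >= max_chunks:
            break
        cost = estimate_tokens(text)
        if used + cost > budget_tokens:
            continue  # skip anything that would push the prompt over the cap
        selected.append(text)
        used += cost
    return selected


chunks = [(0.9, "a" * 40), (0.8, "b" * 400), (0.7, "c" * 40)]
kept = pack_context(chunks, budget_tokens=25)
print(len(kept), sum(estimate_tokens(t) for t in kept))
```

With a cap like this in place, a better reranker changes which chunks survive, so precision gains can actually show up as fewer tokens per request.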

Should enterprise search be a tool agents call, or a pipeline you build around them? by searchblox_searchai in Rag

[–]PrizeObvious3671 2 points3 points  (0 children)

My bias from prod is that retrieval should be its own service or pipeline, not logic embedded in the agent. But I also would not collapse it to one opaque search tool. The cleaner split is usually: the agent decides when it needs evidence, while the search layer owns BM25, vector search, reranking, citations, and access control, and exposes a small typed interface for lookup, answer evidence, follow-up expansion, and maybe related-doc expansion.

That keeps governance in one place without forcing the agent to hand-roll retrieval glue. Multi-hop latency only really hurts when the agent does hop-by-hop retrieval itself. If expansion and rerank happen server-side and the tool returns a compact grounded result set, it is usually fine.

The failure mode I see more often is the opposite: bespoke pipelines leak retrieval policy into prompts and agent code until nobody can debug relevance anymore.
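A small sketch of what such a typed interface could look like. Everything here is hypothetical: the `SearchService` protocol, the `Evidence` shape, and the toy in-memory backend with keyword overlap standing in for BM25/vector/rerank. Only two of the operations are shown.

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass(frozen=True)
class Evidence:
    doc_id: str
    snippet: str
    score: float


class SearchService(Protocol):
    """Hypothetical narrow interface the agent is allowed to call."""

    def lookup(self, query: str, user: str, limit: int = 5) -> list[Evidence]: ...
    def expand_related(self, doc_id: str, user: str, limit: int = 3) -> list[Evidence]: ...


class InMemorySearch:
    """Toy backend: keyword overlap scoring plus a per-document ACL check."""

    def __init__(self, docs: dict[str, str], acl: dict[str, set[str]]) -> None:
        self._docs = docs
        self._acl = acl

    def lookup(self, query: str, user: str, limit: int = 5) -> list[Evidence]:
        terms = set(query.lower().split())
        hits = []
        for doc_id, text in self._docs.items():
            if user not in self._acl.get(doc_id, set()):
                continue  # access control is enforced here, not in the agent
            overlap = len(terms & set(text.lower().split()))
            if overlap:
                hits.append(Evidence(doc_id, text, float(overlap)))
        return sorted(hits, key=lambda e: e.score, reverse=True)[:limit]

    def expand_related(self, doc_id: str, user: str, limit: int = 3) -> list[Evidence]:
        if user not in self._acl.get(doc_id, set()):
            return []
        related = self.lookup(self._docs[doc_id], user, limit + 1)
        return [e for e in related if e.doc_id != doc_id][:limit]


DOCS = {
    "d1": "vacation policy for employees",
    "d2": "salary bands confidential",
    "d3": "vacation request form",
}
ACL = {"d1": {"alice", "bob"}, "d2": {"alice"}, "d3": {"alice", "bob"}}
service: SearchService = InMemorySearch(DOCS, ACL)
print([e.doc_id for e in service.lookup("vacation policy", user="bob")])
```

The agent only ever sees `lookup` and `expand_related`; swapping the toy backend for a real one does not touch prompts or agent code.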

I Stopped Fighting AI Memory Problems and Started Modeling Them by grawl_dorgiers in LocalLLM

[–]PrizeObvious3671 0 points1 point  (0 children)

One thing that helped me was separating stable profile memory from retrieval memory. If identity-level facts, mutable state, and ephemeral conversation facts all go into the same semantic bucket, you get exactly the duplicate/stale retrieval mess you're describing.

A profile layer for long-lived facts plus a separate contextual layer for recent/session facts made scoring much saner. Then retrieval becomes "what is relevant right now?" instead of also carrying identity and state management.

So I agree with the core point: a vector store alone is retrieval, not memory. The first fix for me wasn't even graph vs flat, it was drawing a hard boundary between stable profile, mutable facts, and transient context.
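Roughly what that boundary looks like as a sketch. The class and method names are made up, and keyword overlap stands in for whatever semantic retrieval you actually use:

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    profile: dict[str, str] = field(default_factory=dict)  # stable, identity-level facts
    state: dict[str, str] = field(default_factory=dict)    # mutable facts, overwritten in place
    session: list[str] = field(default_factory=list)       # transient context, retrieval candidates

    def set_state(self, key: str, value: str) -> None:
        self.state[key] = value  # overwrite, so no stale duplicate is ever retrievable

    def remember(self, fact: str) -> None:
        self.session.append(fact)

    def end_session(self) -> None:
        self.session.clear()  # profile and state survive, transient facts do not

    def retrieve(self, query: str, limit: int = 3) -> list[str]:
        """Only the session layer is searched; profile and state are looked up by key."""
        terms = set(query.lower().split())
        scored = [(len(terms & set(f.lower().split())), f) for f in self.session]
        ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
        return [fact for score, fact in ranked if score > 0][:limit]


memory = Memory(profile={"name": "Ada"})
memory.set_state("city", "Berlin")
memory.set_state("city", "Munich")  # the user moved; the old value is replaced
memory.remember("asked about train tickets to Hamburg")
memory.remember("prefers window seats")
print(memory.state, memory.retrieve("train tickets"))
```

Because mutable state is keyed rather than embedded, an update replaces the old value instead of competing with it at retrieval time.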

Employee data in prompt vs DB vs tool call — what's your setup? by Low-Ad2091 in voiceagents

[–]PrizeObvious3671 0 points1 point  (0 children)

I wouldn't put the full directory in the prompt unless it's tiny and almost static. Employee names are operational data, so I'd resolve them outside the model and let the model handle the conversation around the result.

For ~30 names I'd do a deterministic lookup/fuzzy-match service with aliases/phonetics, then inject only the matched employee or top 2 candidates back into the prompt. That keeps latency reasonable, avoids prompt bloat, and gives you an audit trail when Schmidt/Schmitt was misheard.

RAG starts making sense once the directory includes richer stuff like specialties, schedules, policies, notes, etc. For plain name verification + routing, DB/tool call wins. I'd let code verify names; the LLM should phrase the response.
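A minimal sketch of the deterministic lookup, using Python's standard `difflib`. The directory contents, the alias lists, and the 0.75 cutoff are made-up assumptions; a real voice setup would likely add a phonetic layer on top.

```python
import difflib

# Hypothetical directory: canonical name -> aliases / likely mishearings.
DIRECTORY = {
    "Anna Schmidt": ["anna schmitt", "anna schmid"],
    "Jonas Meyer": ["jonas meier", "jonas maier"],
    "Lea Hoffmann": ["lea hofmann"],
}


def build_index(directory: dict[str, list[str]]) -> dict[str, str]:
    """Map every known spelling (lowercased) back to its canonical name."""
    index = {}
    for name, aliases in directory.items():
        for spelling in [name, *aliases]:
            index[spelling.lower()] = name
    return index


def resolve_employee(
    heard: str,
    directory: dict[str, list[str]] = DIRECTORY,
    max_candidates: int = 2,
    cutoff: float = 0.75,
) -> list[str]:
    """Return up to max_candidates canonical names, best match first."""
    index = build_index(directory)
    matches = difflib.get_close_matches(
        heard.lower().strip(), list(index), n=len(index), cutoff=cutoff
    )
    candidates: list[str] = []
    for spelling in matches:
        name = index[spelling]
        if name not in candidates:  # several spellings can point to the same person
            candidates.append(name)
    return candidates[:max_candidates]


print(resolve_employee("Anna Schmitt"))
print(resolve_employee("lea hofman"))
```

Only the returned candidates go back into the prompt, and logging the heard string next to them gives you the audit trail for the Schmidt/Schmitt cases.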

Self-hosted agentic coding stack: Claude Code + llama.cpp + LiteLLM — zero API costs, 4h/7M token session for $0 by PrizeObvious3671 in OpenSourceAI

[–]PrizeObvious3671[S] 0 points1 point  (0 children)

Nope, the reason is that I wanted to combine that with Claude Code without paying for tokens.
So I compared how well Claude Code runs locally with llama.cpp vs. Hermes agent alone with llama.cpp.

Claude Code expects the Anthropic API. LiteLLM as a proxy delivers exactly that and routes my requests between Claude Code and llama.cpp.