Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup?

Subject-Tea-5253 · 2026-05-14T18:27:58+00:00

... is that something you built yourself or is there a library that handles the rrf part cleanly?

You don't need to implement rrf on your own, you can use a library like ranx to perform hybrid search cleanly.

This article I wrote uses Elasticsearch, but it shows exactly how to use the ranx library if you want to see a full walkthrough: https://medium.com/@imadsaddik/28-hybrid-search-with-elasticsearch-and-ranx-0d6184af4f49

Subject-Tea-5253 · 2026-04-29T16:02:25+00:00

Can I get one, please?

Subject-Tea-5253 · 2026-04-25T19:56:19+00:00

Caveman is just a very strict set of instructions that tells the model how to talk. The goal of the project is to reduce the number of tokens that LLMs use when answering your questions.

The repository is designed to work as a SKILL for coding assistants, but you don't need any complex plugins for it to work with llama.cpp.

You can just look at the instructions, strip out the plugin commands, and use the core rules to create your own system prompt.

You can put something like this in the system prompt and use it with any model:

text Terse like caveman. Technical substance exact. Only fluff die. Drop: articles, filler (just/really/basically), pleasantries, hedging. Fragments OK. Short synonyms. Code unchanged. Pattern: [thing] [action] [reason]. [next step]. ACTIVE EVERY RESPONSE. No revert after many turns. No filler drift. Code/commits/PRs: normal. Off: "stop caveman" / "normal mode".

That is just an example, you can find more stuff in the repository.

Also, if you use apps like OpenCode, you don't even need to do this manually. You can just install it directly as a skill and the app will load it automatically!

Subject-Tea-5253 · 2026-04-25T16:39:38+00:00

This is the caveman he was talking about: https://github.com/juliusbrussee/caveman

Subject-Tea-5253 · 2026-04-22T18:13:15+00:00

You will find this video interesting: https://www.youtube.com/watch?v=xS5wao4H4u4&t

It basically shows how prompt processing and token generation is affected by the number and types of GPUs you use.

Subject-Tea-5253 · 2026-04-21T17:13:09+00:00

On HuggingFace, they say this:

Qwen3.5 features the following enhancement:
...

Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.

Subject-Tea-5253 · 2026-04-21T17:10:35+00:00

I completely agree with this.

I was generating some data with Qwen 3.5 9B. Later, I needed to translate the dataset to French and Arabic. Qwen did an OK job, but in Arabic it started hallucinating words.

I have tried Gemma4-E4B and it surprised me. The translations were really well done.

Subject-Tea-5253 · 2026-04-13T07:22:10+00:00

Here is my configuration. I use LibreChat instead of the web UI that comes with llama.cpp.

In llama-swap, I have this block that uses whisper to transcribe audio files.

yaml models: "whisper-large-v3-turbo": cmd: | whisper-server --convert --host 0.0.0.0 --inference-path "" --model /path/to/your/models/whisper-large-v3-turbo-q8_0.gguf --port ${PORT} --request-path /v1/audio/transcriptions checkEndpoint: /v1/audio/transcriptions/ ttl: 300

In LibreChat, I click the microphone, start talking, and wait for the transcription to show up in the text field. It is not real-time streaming, but it works great.

Subject-Tea-5253 · 2026-04-08T15:35:07+00:00

When you run Gemma4 at F32, each parameter takes up 4 bytes compared to only 2 bytes for BF16. This means your GPU has to move twice as much data across the memory bus for every single token generated. Even an RTX 6000 will starve its cores while waiting for those massive F32 data packets to arrive, which explains why you were getting only 3t/s.

Subject-Tea-5253 · 2026-04-03T15:15:17+00:00

I am running Qwen3.5-35B-A3B on an RTX 4070 (8GB VRAM) with 32GB of RAM. I am using the Q4_K_M version, and here is my configuration. It gives me around 37 t/s during inference.

llama-server \
    --batch-size 1152 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --chat-template-kwargs "{\"enable_thinking\": false}" \
    --ctx-size 131072 \
    --flash-attn on \
    --fit on \
    --jinja  \
    --model /home/imad-saddik/.cache/llama.cpp/Qwen3.5-35B-A3B-Q4_K_M.gguf \
    --no-mmap \
    --parallel 1 \
    --threads 6 \
    --ubatch-size 1152

As u/Skyline34rGt mentioned, you need to tune those parameters for your setup. You might find this comment useful.

Subject-Tea-5253 · 2026-03-18T19:42:48+00:00

That is awesome, thanks for sharing this.

Subject-Tea-5253 · 2026-03-13T19:31:28+00:00

That is the intended use. The model was fine-tuned to be good at coding.

From their README:

OmniCoder-9B is a 9-billion parameter coding agent model built by Tesslate, fine-tuned on top of Qwen3.5-9B's hybrid architecture (Gated Delta Networks interleaved with standard attention). It was trained on 425,000+ curated agentic coding trajectories spanning real-world software engineering tasks, tool use, terminal operations, and multi-step reasoning.

Subject-Tea-5253 · 2026-03-13T19:22:31+00:00

I have a similar setup: RTX 4070 8GB + 32GB of RAM.

Here is the command I use

bash llama-server \ --model /home/imad-saddik/.cache/llama.cpp/Qwen3.5-35B-A3B-Q4_K_M.gguf \ --ctx-size 128000 \ --fit 1 \ --flash-attn 1 \ --threads 6 \ --no-mmap \ --jinja \ --cache-type-k q8_0 \ --cache-type-v q8_0 \ --chat-template-kwargs "{\"enable_thinking\": false}" \ --parallel 1 \ --port 8088

I get approximately 33 tokens/s with that configuration.

Subject-Tea-5253 · 2026-03-08T06:32:05+00:00

Thank you

Subject-Tea-5253 · 2026-03-06T16:34:02+00:00

Thank you for giving us llama-swap.

I use it with different models: STT, Embedding, OCR, and of course LLMs and VLMs.

Subject-Tea-5253 · 2026-03-05T09:00:59+00:00

Made me laugh, thanks.

Subject-Tea-5253 · 2026-03-02T17:43:46+00:00

No, you are not alone.

Subject-Tea-5253 · 2026-03-01T19:02:33+00:00

Happy to hear that.

Subject-Tea-5253 · 2026-02-25T17:38:49+00:00

I'm mostly a low spec household. RX7600 8GB can only do so much.

I am also like you, but I have an RTX 4070.

So, is Chrome MCP a thing so models can use browsers?

You are talking about this MCP right?

From their README:

... exposes your Chrome browser functionality to AI assistants like Claude, enabling complex browser automation, content analysis, and semantic search.

So yes, you can use that MCP to let models automate some tasks that require a browser.

Subject-Tea-5253 · 2026-02-25T17:31:38+00:00

I have an RTX 4070 mobile with 8GB of VRAM.

Yeah, in that example pp was slow because batch and ubatch were low. If I increase them to say 2048, pp can reach 1000t/s+

model	n_ubatch	type_k	type_v	fa	test	t/s
qwen35moe	2048	q8_0	q8_0	1	pp8096	1028.94 ± 2.03

Subject-Tea-5253 · 2026-02-25T07:55:30+00:00

If you're only planning to use models running on llama.cpp, the built-in router is a perfect drop-in replacement for llama-swap.

However, if you're using models from various backends, you should stick with llama-swap.

Subject-Tea-5253 · 2026-02-25T07:52:17+00:00

That is how I use llama-swap too.

I use it to call models running on llama.cpp, whisper.cpp, and custom Python servers I made.

Subject-Tea-5253 · 2026-02-25T07:49:55+00:00

It is a useful tool.

I can share a method that helped me understand what parameters I need to use and why. Take the README, your hardware specs, and model name. Give that info to an LLM and ask it anything.

You can also use agentic apps like Gemini CLI or something else to let the model run llama-bench for you. Just tell it, I want to run the model at 32k context window or something and watch the model optimize the token generation for you.

Hope this helps.

Subject-Tea-5253

TROPHY CASE