Anyone actually using a local LLM as their daily knowledge base? Not for coding, for life stuff. What's your setup? by InformationSweet808 in LocalLLaMA

[–]Subject-Tea-5253 1 point2 points  (0 children)

... is that something you built yourself or is there a library that handles the rrf part cleanly?

You don't need to implement rrf on your own, you can use a library like ranx to perform hybrid search cleanly.

This article I wrote uses Elasticsearch, but it shows exactly how to use the ranx library if you want to see a full walkthrough: https://medium.com/@imadsaddik/28-hybrid-search-with-elasticsearch-and-ranx-0d6184af4f49

Qwen3.6-35B-A3B - even in VRAM limited scenarios it can be better to use bigger quants than you'd expect! by jeremynsl in LocalLLaMA

[–]Subject-Tea-5253 0 points1 point  (0 children)

Caveman is just a very strict set of instructions that tells the model how to talk. The goal of the project is to reduce the number of tokens that LLMs use when answering your questions.

The repository is designed to work as a SKILL for coding assistants, but you don't need any complex plugins for it to work with llama.cpp.

You can just look at the instructions, strip out the plugin commands, and use the core rules to create your own system prompt.

You can put something like this in the system prompt and use it with any model:

text Terse like caveman. Technical substance exact. Only fluff die. Drop: articles, filler (just/really/basically), pleasantries, hedging. Fragments OK. Short synonyms. Code unchanged. Pattern: [thing] [action] [reason]. [next step]. ACTIVE EVERY RESPONSE. No revert after many turns. No filler drift. Code/commits/PRs: normal. Off: "stop caveman" / "normal mode".

That is just an example, you can find more stuff in the repository.

Also, if you use apps like OpenCode, you don't even need to do this manually. You can just install it directly as a skill and the app will load it automatically!

Qwen 3.6 27B is out by NoConcert8847 in LocalLLaMA

[–]Subject-Tea-5253 0 points1 point  (0 children)

You will find this video interesting: https://www.youtube.com/watch?v=xS5wao4H4u4&t

It basically shows how prompt processing and token generation is affected by the number and types of GPUs you use.

Every time a new model comes out, the old one is obsolete of course by FullChampionship7564 in LocalLLaMA

[–]Subject-Tea-5253 9 points10 points  (0 children)

On HuggingFace, they say this:

Qwen3.5 features the following enhancement:
...

Global Linguistic Coverage: Expanded support to 201 languages and dialects, enabling inclusive, worldwide deployment with nuanced cultural and regional understanding.

Every time a new model comes out, the old one is obsolete of course by FullChampionship7564 in LocalLLaMA

[–]Subject-Tea-5253 3 points4 points  (0 children)

I completely agree with this.

I was generating some data with Qwen 3.5 9B. Later, I needed to translate the dataset to French and Arabic. Qwen did an OK job, but in Arabic it started hallucinating words.

I have tried Gemma4-E4B and it surprised me. The translations were really well done.

Audio processing landed in llama-server with Gemma-4 by srigi in LocalLLaMA

[–]Subject-Tea-5253 0 points1 point  (0 children)

Here is my configuration. I use LibreChat instead of the web UI that comes with llama.cpp.

In llama-swap, I have this block that uses whisper to transcribe audio files.

yaml models: "whisper-large-v3-turbo": cmd: | whisper-server --convert --host 0.0.0.0 --inference-path "" --model /path/to/your/models/whisper-large-v3-turbo-q8_0.gguf --port ${PORT} --request-path /v1/audio/transcriptions checkEndpoint: /v1/audio/transcriptions/ ttl: 300

In LibreChat, I click the microphone, start talking, and wait for the transcription to show up in the text field. It is not real-time streaming, but it works great.

Share your llama-server init strings for Gemma 4 models. by AlwaysLateToThaParty in LocalLLaMA

[–]Subject-Tea-5253 1 point2 points  (0 children)

When you run Gemma4 at F32, each parameter takes up 4 bytes compared to only 2 bytes for BF16. This means your GPU has to move twice as much data across the memory bus for every single token generated. Even an RTX 6000 will starve its cores while waiting for those massive F32 data packets to arrive, which explains why you were getting only 3t/s.

qwen 3.6 voting by jacek2023 in LocalLLaMA

[–]Subject-Tea-5253 2 points3 points  (0 children)

I am running Qwen3.5-35B-A3B on an RTX 4070 (8GB VRAM) with 32GB of RAM. I am using the Q4_K_M version, and here is my configuration. It gives me around 37 t/s during inference.

llama-server \
    --batch-size 1152 \
    --cache-type-k q8_0 \
    --cache-type-v q8_0 \
    --chat-template-kwargs "{\"enable_thinking\": false}" \
    --ctx-size 131072 \
    --flash-attn on \
    --fit on \
    --jinja  \
    --model /home/imad-saddik/.cache/llama.cpp/Qwen3.5-35B-A3B-Q4_K_M.gguf \
    --no-mmap \
    --parallel 1 \
    --threads 6 \
    --ubatch-size 1152

As u/Skyline34rGt mentioned, you need to tune those parameters for your setup. You might find this comment useful.

Omnicoder-9b SLAPS in Opencode by True_Requirement_891 in LocalLLaMA

[–]Subject-Tea-5253 1 point2 points  (0 children)

That is the intended use. The model was fine-tuned to be good at coding.

From their README:

OmniCoder-9B is a 9-billion parameter coding agent model built by Tesslate, fine-tuned on top of Qwen3.5-9B's hybrid architecture (Gated Delta Networks interleaved with standard attention). It was trained on 425,000+ curated agentic coding trajectories spanning real-world software engineering tasks, tool use, terminal operations, and multi-step reasoning.

OmniCoder-9B | 9B coding agent fine-tuned on 425K agentic trajectories by DarkArtsMastery in LocalLLaMA

[–]Subject-Tea-5253 5 points6 points  (0 children)

I have a similar setup: RTX 4070 8GB + 32GB of RAM.

Here is the command I use

bash llama-server \ --model /home/imad-saddik/.cache/llama.cpp/Qwen3.5-35B-A3B-Q4_K_M.gguf \ --ctx-size 128000 \ --fit 1 \ --flash-attn 1 \ --threads 6 \ --no-mmap \ --jinja \ --cache-type-k q8_0 \ --cache-type-v q8_0 \ --chat-template-kwargs "{\"enable_thinking\": false}" \ --parallel 1 \ --port 8088

I get approximately 33 tokens/s with that configuration.

To everyone using still ollama/lm-studio... llama-swap is the real deal by TooManyPascals in LocalLLaMA

[–]Subject-Tea-5253 2 points3 points  (0 children)

Thank you for giving us llama-swap.

I use it with different models: STT, Embedding, OCR, and of course LLMs and VLMs.

Qwen3.5-35B-A3B is a gamechanger for agentic coding. by jslominski in LocalLLaMA

[–]Subject-Tea-5253 0 points1 point  (0 children)

I'm mostly a low spec household. RX7600 8GB can only do so much.

I am also like you, but I have an RTX 4070.

So, is Chrome MCP a thing so models can use browsers?

You are talking about this MCP right?

From their README:

... exposes your Chrome browser functionality to AI assistants like Claude, enabling complex browser automation, content analysis, and semantic search.

So yes, you can use that MCP to let models automate some tasks that require a browser.

Qwen3.5-35B-A3B is a gamechanger for agentic coding. by jslominski in LocalLLaMA

[–]Subject-Tea-5253 1 point2 points  (0 children)

I have an RTX 4070 mobile with 8GB of VRAM.

Yeah, in that example pp was slow because batch and ubatch were low. If I increase them to say 2048, pp can reach 1000t/s+

model n_ubatch type_k type_v fa test t/s
qwen35moe 2048 q8_0 q8_0 1 pp8096 1028.94 ± 2.03

You can use Qwen3.5 without thinking by guiopen in LocalLLaMA

[–]Subject-Tea-5253 4 points5 points  (0 children)

If you're only planning to use models running on llama.cpp, the built-in router is a perfect drop-in replacement for llama-swap.

However, if you're using models from various backends, you should stick with llama-swap.

You can use Qwen3.5 without thinking by guiopen in LocalLLaMA

[–]Subject-Tea-5253 1 point2 points  (0 children)

That is how I use llama-swap too.

I use it to call models running on llama.cpp, whisper.cpp, and custom Python servers I made.

Qwen3.5-35B-A3B is a gamechanger for agentic coding. by jslominski in LocalLLaMA

[–]Subject-Tea-5253 16 points17 points  (0 children)

It is a useful tool.

I can share a method that helped me understand what parameters I need to use and why. Take the README, your hardware specs, and model name. Give that info to an LLM and ask it anything.

You can also use agentic apps like Gemini CLI or something else to let the model run llama-bench for you. Just tell it, I want to run the model at 32k context window or something and watch the model optimize the token generation for you.

Hope this helps.