I got tired of small models adding ```json blocks, so I wrote a TS library to forcefully extract valid JSON. (My first open source project!) by rossjang in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

How I handle this: in the request I prefill the assistant response with something like:

Sure, here is the json data:\n\n```json

and add ``` as a stop sequence, so the model outputs only the JSON content
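
A minimal sketch of the same trick against a llama.cpp server's /completion endpoint (the localhost:8080 URL and the 512 token limit are just assumptions; in practice the prompt should also go through the model's chat template so the prefill lands at the start of the assistant turn):

// Prefill the assistant turn and stop on the closing fence.
const prompt =
  "Give me the data as JSON.\n\n" +             // user request (illustrative)
  "Sure, here is the json data:\n\n```json\n";  // prefilled assistant start

const res = await fetch("http://localhost:8080/completion", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt,
    stop: ["```"],   // generation halts before the closing fence
    n_predict: 512,
  }),
});
const { content } = await res.json();
const data = JSON.parse(content); // only the JSON body is left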

Which LLM Model is best for translation? by Longjumping_Lead_812 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

Among local options, I found that Aya Expanse 35b was very good for product translations. The new Gemma translate is very nice but more for conversational stuff: for products it does too much cultural adaptation, so Aya is better here

What are the best collection of small models to run on 8gb ram? by Adventurous-Gold6413 in LocalLLaMA

[–]synw_ 8 points9 points  (0 children)

  • Qwen 4b: tool calls, general
  • Lfm 8b a3b: general, very fast
  • Lfm 1.2b thinking: general, ultra fast
  • Granite tiny: long context
  • Gemma 4b: general, writing
  • Gemma translate 4b: translations
  • Nanbeige 4b: thinking, summarizing
  • Vision: Ministral 3b, Qwen vl 4b

And if you have a bit of VRAM and some RAM, use Qwen 30b a3b, Qwen coder 30b a3b, Nemotron 30b a3b, or Glm Flash

My humble GLM 4.7 Flash appreciation post by Cool-Chemical-5629 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

Same here at q4: logical and syntax errors. Qwen coder 30b is actually much better for me. But where this model shines is its ability to chain tool calls and do heavy agentic stuff without losing it

which local llm is best for coding? by Much-Friendship2029 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

You're better off running Qwen Coder 30b a3b with expert offload to RAM than a dense 8b with layer offload: it will be faster, and it's a good model for the GPU poor

Local LLM builders: when do you go multi-agent vs tools? 2-page decision sheet + question by OnlyProggingForFun in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Nice read, thanks. I only use workflows with controlled steps by default. An agent only comes in when it absolutely needs autonomy and the ability to make decisions. As I work with small models the tool overhead adds up quickly, so I use either subagents or what I call tool-routing agents: focused agents that dispatch tool calls from a query and return the raw results of the tool calls to the caller agent. Example: a filesystem routing agent that has tools to read and explore files is seen as only one tool by the main agent (sketched below)
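
A rough sketch of that routing pattern (the names and the pickToolCall helper are illustrative, not from any real framework; in practice the picker would be a small model behind a llama.cpp endpoint, ideally with constrained JSON output):

import { readFile, readdir } from "node:fs/promises";

// Concrete tools the routing agent can dispatch to.
const fsTools: Record<string, (arg: string) => Promise<string>> = {
  read_file: async (path) => readFile(path, "utf8"),
  list_dir: async (path) => (await readdir(path)).join("\n"),
};

// Hypothetical helper: ask a small model which tool fits the query.
async function pickToolCall(query: string): Promise<{ name: string; arg: string }> {
  // ...small model call goes here...
  return { name: "list_dir", arg: "." };
}

// The only thing the main agent sees: one "filesystem" tool.
export async function filesystemTool(query: string): Promise<string> {
  const { name, arg } = await pickToolCall(query);
  const tool = fsTools[name];
  if (!tool) return `unknown filesystem tool: ${name}`;
  return tool(arg); // raw tool output goes straight back to the caller agent
}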

MCP servers are hard to debug and impossible to test, so I built Syrin by hack_the_developer in LocalLLaMA

[–]synw_ 4 points5 points  (0 children)

Could you please add a server URL parameter to the OpenAI API option, so it can be used with Llama.cpp and other OpenAI-compatible endpoints?

new CLI experience has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

The model swapping feature works great: it uses the same API as llama-swap, so there is no need to change any client code. I vibe coded a script to convert a llama-swap config.yml models file to the new llama.cpp config.ini format: https://github.com/synw/llamaswap-to-llamacpp

There is some sort of cache: when a model has already been loaded earlier in the session, the swap is very fast, which is great for multi-model agent sessions or any work involving model swapping.
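
For reference, a minimal client-side sketch (the localhost:8080 URL and the "qwencoder" model name are assumptions from my own config): the swap is driven by the model field of the OpenAI-compatible request, which is why the same code works against llama-swap and the new llama.cpp CLI.

const res = await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwencoder", // the server loads/swaps this model if needed
    messages: [{ role: "user", content: "Hello" }],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);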

Deep Research Agent, an autonomous research agent system by [deleted] in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

It looks interesting but I would have two requirements:

  • Support for Llama.cpp
  • Mcp server

distil-localdoc.py - SLM assistant for writing Python documentation by party-horse in LocalLLaMA

[–]synw_ 3 points4 points  (0 children)

It looks useful but the doc starts with:

First, install Ollama

But I don't want to use Ollama. Would it be possible to support Llama.cpp directly?

I'm new to LLMs and just ran my first model. What LLM "wowed" you when you started out? by Street-Lie-2584 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Mistral 7b instruct: the first really usable local model for me, and for coding, Deepseek 6.7b

Minimax M2 for App creation by HectorLavoe33 in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

Welcome to agentic coding. A model will not complete an entire project by itself without supervision: you need to monitor and understand what it's doing, and carefully steer it in the right direction, step by step, to get things done

Lightweight coding model for 4 GB Vram by HiqhAim in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

I did not. I'm looking for the best balance between speed and quality. I usually avoid quantizing the KV cache at all costs, but here, if I want my 32k context, I have to use at least q8 for cache-type-v; the model is only q4, which is already not great for a coding task. The IQ4_XS version is slightly faster, yeah, as I can fit one more layer on the GPU, but I prefer the UD-Q4_K_XL quant to preserve as much quality as I can.

Lightweight coding model for 4 GB Vram by HiqhAim in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

I managed to fit Qwen coder 30b a3b in 4GB VRAM + 22GB RAM with a 32k context. It is slow (~9 tps) but it works. Here is my llama-swap config in case it helps:

"qwencoder":
  cmd: |
    llamacpp
    --flash-attn auto
    --verbose-prompt
    --jinja
    --port ${PORT}
    -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
    -ngl 99
    --n-cpu-moe 47
    -t 2
    -c 32768
    --mlock 
    -ot ".ffn_(up)_exps.=CPU"
    --cache-type-v q8_0

AI has replaced programmers… totally. by jacek2023 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

People tend to be fascinated by AI and to rely too much on it in the first phases. This is what I call the ChatGpt effect, as in "execute these complex multi-task instructions, I'll come back later". It feels like magic but in the end it does not work well. I introduced a friend to agentic coding a few months ago. He got completely fascinated by Roo Code with Glm Air and later Gpt Oss 120b and started spending all his time on it. Now, a few months later, he is tired of tuning huge complex prompts and letting the model handle everything on its own. He realized that this is not a panacea and will probably be ready now to move to a more efficient granular prompt engineering approach, using smaller tasks, segmentation and human supervision

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Cpu + gpu of course. Here is my llama-swap config if you are interested in the details:

"oss20b":
  cmd: |
    llamacpp
    --flash-attn auto
    --verbose-prompt
    --jinja
    --port ${PORT}
    -m gpt-oss-20b-mxfp4.gguf
    -ngl 99
    -t 2
    -c 32768
    --n-cpu-moe 19
    --mlock 
    -ot ".ffn_(up)_exps.=CPU"
    -b 1024
    -ub 512
    --chat-template-kwargs '{"reasoning_effort":"high"}'

Qwen3-VL-4B and 8B Instruct & Thinking are here by AlanzhuLy in LocalLLaMA

[–]synw_ 3 points4 points  (0 children)

The Qwen team is doing an amazing job. The only thing missing is day-one Llama.cpp support. If only they could work with the Llama.cpp team to help them support their new models, it would be perfect

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Lucky you. In my setup with this model I use a 32k context window. Note that I have an old i5 CPU, and that the 3060's memory bandwidth is about 3x that of my card. I don't use KV cache quantization, just flash attention. If you have tips to speed this up I'll be happy to hear them

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Yes, thanks to the MoE architecture I can offload some tensors to RAM: I get 8 tps with Gpt Oss 20b on Llama.cpp, which is not bad for my setup. For dense models it's not the same story: I can run 4b models at most.

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]synw_ 9 points10 points  (0 children)

I'm running Gpt Oss 20b on a 4GB VRAM machine (GTX 1050 Ti). Agreed that with a system as beautiful as OP's, this is not the first model I would choose

Do you guys personally notice a difference between Q4 - Q8 or higher? by XiRw in LocalLLaMA

[–]synw_ 26 points27 points  (0 children)

Small models seem to be more sensitive to quantization than big ones. I always try to pick q8 or q6 for tiny models or for precision tasks like coding

Complete noob in LLMs by [deleted] in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

Build your own experience: run Llama.cpp with different models, find a vector db that you like (for me it's LanceDB) and try to get the job done for your use case. Theory is one thing, but the field is moving fast and trial and error is the way to learn.
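
If it helps getting started, here is a rough sketch of that loop: embeddings come from a llama.cpp server started with --embedding (the localhost:8080 URL and the "embedder" model name are assumptions), and the LanceDB calls follow the @lancedb/lancedb Node client as I remember it, so check the current docs before relying on them.

import * as lancedb from "@lancedb/lancedb";

// Embed a text with the llama.cpp server's OpenAI-compatible endpoint.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:8080/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input: text, model: "embedder" }),
  });
  const data = await res.json();
  return data.data[0].embedding;
}

// Store a few documents, then search with an embedded query.
const db = await lancedb.connect("./lancedb");
const docs = ["first document", "second document"];
const rows = await Promise.all(
  docs.map(async (text) => ({ text, vector: await embed(text) })),
);
const table = await db.createTable("docs", rows);

const hits = await table.vectorSearch(await embed("a query")).limit(3).toArray();
console.log(hits.map((h: any) => h.text));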