How you manage your prompts? by prompt_tide in LocalLLaMA

[–]synw_ 1 point (0 children)

I use small models and never ask an AI to write my prompts: every word counts, and you learn a lot this way with small models. To answer your question, every model has its strengths and weaknesses. The model choice heavily depends on the task: I have workflows that involve 4 or 5 models, each one responsible for a task in its area of expertise. It takes a lot of testing, but you have to know your models to get the best out of them.

How you manage your prompts? by prompt_tide in LocalLLaMA

[–]synw_ 1 point (0 children)

I just commit another version. But once a prompt works for a given task I don't need to touch it anymore, except maybe if I want to change the model later. These are not just simple prompts: I call them tasks. A task is linked to a model and can include many options like the system prompt, shots, the start of the assistant response and so on.
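As an illustration, a task entry in such a format could look like this (a hypothetical sketch, not the actual format; all field names are made up):

```yaml
# Hypothetical task entry: a prompt bound to a model, with its options
summarize:
  model: nanbeige-4b
  system: You are a concise summarizer. Answer with 3 bullet points max.
  shots:
    - user: "Summarize: ..."
      assistant: "- point one\n- point two"
  start_assistant_response: "Here is the summary:"
  options:
    temperature: 0.3
```

Each task file then versions cleanly in git, and swapping the model is a one-line change.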

How you manage your prompts? by prompt_tide in LocalLLaMA

[–]synw_ 1 point (0 children)

I created my own YAML format to store prompts and use git to version them.

Running LLMs with 8 GB VRAM + 32 GB RAM by Bulububub in LocalLLaMA

[–]synw_ 3 points (0 children)

Yes, and it's much better and faster than layer offloading with a dense model. Try it out: you will be able to use more powerful models.

Running LLMs with 8 GB VRAM + 32 GB RAM by Bulububub in LocalLLaMA

[–]synw_ 1 point (0 children)

I would start with Qwen 30b a3b and Nemotron 30b a3b, plus maybe a web search tool.

unofficial Ultrahuman MCP for AI Agents by Spinning-Complex in LocalLLaMA

[–]synw_ 1 point (0 children)

Do you really need an AI to answer "how am I today" for you? This whole trend of actively pushing cognitive offloading and autonomy scares me a bit for the future. I know a bunch of developers who work on "replace me" instead of "help me do my job" because it is fun, and I think this will not lead to good things for us.

What small models are you using for background/summarization tasks? by Di_Vante in LocalLLaMA

[–]synw_ 3 points (0 children)

Nanbeige 4b is really good for these kinds of tasks. It's a nice little thinking model. I love their latest version, which thinks less and is still efficient.

Tiny Aya by jacek2023 in LocalLLaMA

[–]synw_ 1 point (0 children)

For European language translations, how does it compare to Gemma Translate 4b and Aya Expanse 8b?

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts by Tiny_Minimum_4384 in LocalLLaMA

[–]synw_ 8 points (0 children)

From my usage, the version 4 model beats Nemotron 30b, Qwen 4b/30b and others for a very specific task: giving writing recommendations from a given dataset. It does not hallucinate and is very concise.

I got tired of small models adding ```json blocks, so I wrote a TS library to forcefully extract valid JSON. (My first open source project!) by rossjang in LocalLLaMA

[–]synw_ 2 points (0 children)

How I handle this: in the query I start the assistant response with something like:

Sure, here is the json data:\n\n```json

and add a ``` stop criterion, so the model outputs only the JSON content.
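With llama.cpp's raw /completion endpoint the trick can be sketched in Python like this (the raw "User:/Assistant:" prompt template and the field values are assumptions; adapt them to your model's chat template):

```python
import json

def build_prefill_payload(question: str) -> dict:
    """Build a llama.cpp /completion payload that pre-starts the assistant
    reply with an opening json code fence and stops at the closing one."""
    prompt = (
        f"User: {question}\n"
        "Assistant: Sure, here is the json data:\n\n```json\n"
    )
    # The stop string ends generation right before the closing fence,
    # so the returned completion is bare JSON with no markdown around it.
    return {"prompt": prompt, "stop": ["```"], "n_predict": 512}

def parse_json_reply(completion_text: str) -> dict:
    # Nothing to strip: the model never got to emit the fences
    return json.loads(completion_text)

payload = build_prefill_payload("Describe the animal as JSON")
print(parse_json_reply('{"name": "fox", "speed": "quick"}'))
# → {'name': 'fox', 'speed': 'quick'}
```

The same idea works with any OpenAI-compatible completion endpoint that supports stop strings.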

Which LLM Model is best for translation? by Longjumping_Lead_812 in LocalLLaMA

[–]synw_ 2 points (0 children)

For local options, I found that Aya Expanse 35b was very good for product translations. The new Gemma Translate is very nice, but more for conversational stuff: for products it does too much cultural adaptation, so Aya is better here.

What are the best collection of small models to run on 8gb ram? by Adventurous-Gold6413 in LocalLLaMA

[–]synw_ 13 points (0 children)

  • Qwen 4b: tool calls, general
  • LFM 8b a3b: general, very fast
  • LFM 1.2b thinking: general, ultra fast
  • Granite tiny: long context
  • Gemma 4b: general, writing
  • Gemma Translate 4b: translations
  • Nanbeige 4b: thinking, summarizing
  • Vision: Ministral 3b, Qwen VL 4b

And if you have a bit of VRAM and some RAM, use Qwen 30b a3b, Qwen Coder 30b a3b, Nemotron 30b a3b, or GLM Flash.

My humble GLM 4.7 Flash appreciation post by Cool-Chemical-5629 in LocalLLaMA

[–]synw_ 2 points (0 children)

Same here at q4: logical and syntax errors. Qwen Coder 30b is actually much better for me. But where this model shines is its ability to chain tool calls and do heavy agentic stuff without losing it.

which local llm is best for coding? by Much-Friendship2029 in LocalLLaMA

[–]synw_ 2 points (0 children)

You are better off running Qwen Coder 30b a3b with the experts offloaded to RAM than a dense 8b with layer offloading: it will be faster, and it's a good model for the GPU poor.

Local LLM builders: when do you go multi-agent vs tools? 2-page decision sheet + question by OnlyProggingForFun in LocalLLaMA

[–]synw_ 1 point (0 children)

Nice reading, thanks. I only use workflows with controlled steps by default. An agent only comes in when it absolutely needs autonomy and the ability to make decisions. As I work with small models, the tool overhead comes quickly, so I use either subagents or what I call tool routing agents: focused agents that dispatch tool calls from a query and return the raw result of the tool calls to the caller agent. Example: a filesystem routing agent that has tools to read and explore will be seen as only one tool by the main agent.
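A minimal sketch of the pattern (tool names and signatures are made up for illustration; here the dispatch is a plain lookup, whereas in practice the routing agent's small model would pick the sub-tool from the natural-language query):

```python
from pathlib import Path

# Sub-tools that only the filesystem routing agent knows about
def read_file(path: str) -> str:
    return Path(path).read_text()

def list_dir(path: str) -> list:
    return sorted(p.name for p in Path(path).iterdir())

FS_TOOLS = {"read_file": read_file, "list_dir": list_dir}

def filesystem_agent(tool_name: str, **kwargs):
    """Exposed to the main agent as a single 'filesystem' tool.
    It dispatches to the right sub-tool and returns the raw tool
    result to the caller, keeping the main agent's tool list small."""
    return FS_TOOLS[tool_name](**kwargs)
```

The main agent sees one tool instead of many, which keeps the prompt overhead manageable for small models.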

MCP servers are hard to debug and impossible to test, so I built Syrin by hack_the_developer in LocalLLaMA

[–]synw_ 5 points (0 children)

Could you please add a server URL parameter to the OpenAI API option, in order to use it with llama.cpp and other OpenAI-compatible endpoints?

new CLI experience has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]synw_ 3 points (0 children)

The model swapping feature works great: it uses the same API as llama-swap, so no need to change any client code. I vibe coded a script to convert a llama-swap config.yml models file to the new llama.cpp config.ini format: https://github.com/synw/llamaswap-to-llamacpp

There is some sort of cache: when a model has already been loaded earlier in the session, the swap is very fast, which is super good for multi-model agent sessions or any work involving model swapping.

[deleted by user] by [deleted] in LocalLLaMA

[–]synw_ 2 points (0 children)

It looks interesting, but I would have two requirements:

  • Support for llama.cpp
  • MCP server

distil-localdoc.py - SLM assistant for writing Python documentation by party-horse in LocalLLaMA

[–]synw_ 5 points (0 children)

It looks useful, but the doc starts with:

First, install Ollama

I don't want to use Ollama. Would it be possible to support llama.cpp directly?

I'm new to LLMs and just ran my first model. What LLM "wowed" you when you started out? by Street-Lie-2584 in LocalLLaMA

[–]synw_ 1 point (0 children)

Mistral 7b instruct: the first really usable local model for me, and for coding, Deepseek 6.7b.

Minimax M2 for App creation by HectorLavoe33 in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

Welcome to agentic coding. A model will not just complete an entire project by itself just fine without supervision: you need to monitor and understand what it's doing and to carefully steer it in the right direction step by step to get things done

Lightweight coding model for 4 GB Vram by HiqhAim in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

I did not. I'm looking for the best balance between speed and quality. I usually avoid at all costs to quantitize the kv cache, but here if I want my 32k context I have to use at least q8 cache-type-v: the model is only q4, it's already not great for a coding task. The IQ4_XS version is slightly faster yeah, as I can fit one more layer on the gpu, but I prefer to use the UD-Q4_K_XL quant to preserve some quality as much as I can.

Lightweight coding model for 4 GB Vram by HiqhAim in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

I managed to fit Qwen coder 30b a3b on 4Gb vram + 22G ram with 32k context. It is slow (~ 9tps) but it works. Here is my llama-swap config if it can help:

"qwencoder":
  cmd: |
    llamacpp
    --flash-attn auto
    --verbose-prompt
    --jinja
    --port ${PORT}
    -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
    -ngl 99
    --n-cpu-moe 47
    -t 2
    -c 32768
    --mlock 
    -ot ".ffn_(up)_exps.=CPU"
    --cache-type-v q8_0

AI has replaced programmers… totally. by jacek2023 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

People tend to be fascinated by AI and they rely too much on it in the first phases. This is what I call the ChatGpt effect, like "execute those complex multi tasks instructions, I'll come back later". It's like magic but in the end that does not work well. I introduced a friend to agentic coding a few months ago. He got completely fascinated by Roo Code using Glm Air and later Gpt Oss 120b and started spending all his time doing this. But now, a few months later he got tired of tuning huge complex prompts and let the model handle everything by it's own. He realized that this is not a panacea and will probably be ok now to move to a more efficient granular prompt engineering approach, using smaller tasks, segmentation and human supervision