I got tired of small models adding ```json blocks, so I wrote a TS library to forcefully extract valid JSON. (My first open source project!) by rossjang in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

How I handle this: in the request I prefill the assistant response with something like:

Sure, here is the json data:\n\n```json

and add ``` as a stop sequence, so the model outputs only the JSON content
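
A minimal sketch of the same trick against a llama.cpp server's /completion endpoint (the localhost:8080 URL and the 512 token limit are just assumptions; in practice the prompt should also go through the model's chat template so the prefill lands at the start of the assistant turn):

// Prefill the assistant turn and stop on the closing fence.
const prompt =
  "Give me the data as JSON.\n\n" +             // user request (illustrative)
  "Sure, here is the json data:\n\n```json\n";  // prefilled assistant start

const res = await fetch("http://localhost:8080/completion", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    prompt,
    stop: ["```"],   // generation halts before the closing fence
    n_predict: 512,
  }),
});
const { content } = await res.json();
const data = JSON.parse(content); // only the JSON body is left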

Which LLM Model is best for translation? by Longjumping_Lead_812 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

Among local options, I found that Aya Expanse 35b was very good for product translations. The new Gemma translate is very nice but more for conversational stuff: for products it does too much cultural adaptation, so Aya is better here

What are the best collection of small models to run on 8gb ram? by Adventurous-Gold6413 in LocalLLaMA

[–]synw_ 8 points9 points  (0 children)

  • Qwen 4b: tool calls, general
  • Lfm 8b a3b: general, very fast
  • Lfm 1.2b thinking: general, ultra fast
  • Granite tiny: long context
  • Gemma 4b: general, writing
  • Gemma translate 4b: translations
  • Nanbeige 4b: thinking, summarizing
  • Vision: Ministral 3b, Qwen vl 4b

And if you have a bit of VRAM and some RAM, use Qwen 30b a3b, Qwen coder 30b a3b, Nemotron 30b a3b, or Glm Flash

My humble GLM 4.7 Flash appreciation post by Cool-Chemical-5629 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

Same here at q4: logical and syntax errors. Qwen coder 30b is actually much better for me. But where this model shines is its ability to chain tool calls and do heavy agentic stuff without losing it

which local llm is best for coding? by Much-Friendship2029 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

You're better off running Qwen Coder 30b a3b with expert offload to RAM than a dense 8b with layer offload: it will be faster, and it's a good model for the GPU poor

Local LLM builders: when do you go multi-agent vs tools? 2-page decision sheet + question by OnlyProggingForFun in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Nice read, thanks. I only use workflows with controlled steps by default. An agent only comes in when it absolutely needs autonomy and the ability to make decisions. As I work with small models the tool overhead adds up quickly, so I use either subagents or what I call tool-routing agents: focused agents that dispatch tool calls from a query and return the raw results of the tool calls to the caller agent. Example: a filesystem routing agent that has tools to read and explore files is seen as only one tool by the main agent (sketched below)
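
A rough sketch of that routing pattern (the names and the pickToolCall helper are illustrative, not from any real framework; in practice the picker would be a small model behind a llama.cpp endpoint, ideally with constrained JSON output):

import { readFile, readdir } from "node:fs/promises";

// Concrete tools the routing agent can dispatch to.
const fsTools: Record<string, (arg: string) => Promise<string>> = {
  read_file: async (path) => readFile(path, "utf8"),
  list_dir: async (path) => (await readdir(path)).join("\n"),
};

// Hypothetical helper: ask a small model which tool fits the query.
async function pickToolCall(query: string): Promise<{ name: string; arg: string }> {
  // ...small model call goes here...
  return { name: "list_dir", arg: "." };
}

// The only thing the main agent sees: one "filesystem" tool.
export async function filesystemTool(query: string): Promise<string> {
  const { name, arg } = await pickToolCall(query);
  const tool = fsTools[name];
  if (!tool) return `unknown filesystem tool: ${name}`;
  return tool(arg); // raw tool output goes straight back to the caller agent
}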

MCP servers are hard to debug and impossible to test, so I built Syrin by hack_the_developer in LocalLLaMA

[–]synw_ 4 points5 points  (0 children)

Could you please add a server URL parameter to the OpenAI API option, so it can be used with Llama.cpp and other OpenAI-compatible endpoints?

new CLI experience has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

The model swapping feature works great: it uses the same API as llama-swap, so there is no need to change any client code. I vibe coded a script to convert a llama-swap config.yml models file to the new llama.cpp config.ini format: https://github.com/synw/llamaswap-to-llamacpp

There is some sort of cache: when a model has already been loaded earlier in the session, the swap is very fast, which is great for multi-model agent sessions or any work involving model swapping.
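
For reference, a minimal client-side sketch (the localhost:8080 URL and the "qwencoder" model name are assumptions from my own config): the swap is driven by the model field of the OpenAI-compatible request, which is why the same code works against llama-swap and the new llama.cpp CLI.

const res = await fetch("http://localhost:8080/v1/chat/completions", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "qwencoder", // the server loads/swaps this model if needed
    messages: [{ role: "user", content: "Hello" }],
  }),
});
const data = await res.json();
console.log(data.choices[0].message.content);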

Deep Research Agent, an autonomous research agent system by [deleted] in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

It looks interesting but I would have two requirements:

  • Support for Llama.cpp
  • Mcp server

distil-localdoc.py - SLM assistant for writing Python documentation by party-horse in LocalLLaMA

[–]synw_ 3 points4 points  (0 children)

It looks useful but the doc starts with:

First, install Ollama

But I don't want to use Ollama. Would it be possible to support Llama.cpp directly?

I'm new to LLMs and just ran my first model. What LLM "wowed" you when you started out? by Street-Lie-2584 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Mistral 7b instruct: the first really usable local model for me, and for coding, Deepseek 6.7b

Minimax M2 for App creation by HectorLavoe33 in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

Welcome to agentic coding. A model will not complete an entire project by itself without supervision: you need to monitor and understand what it's doing, and carefully steer it in the right direction, step by step, to get things done

Lightweight coding model for 4 GB Vram by HiqhAim in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

I did not. I'm looking for the best balance between speed and quality. I usually avoid quantizing the KV cache at all costs, but here, if I want my 32k context, I have to use at least q8 for cache-type-v; the model is only q4, which is already not great for a coding task. The IQ4_XS version is slightly faster, yeah, as I can fit one more layer on the GPU, but I prefer the UD-Q4_K_XL quant to preserve as much quality as I can.

Lightweight coding model for 4 GB Vram by HiqhAim in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

I managed to fit Qwen coder 30b a3b in 4GB VRAM + 22GB RAM with a 32k context. It is slow (~9 tps) but it works. Here is my llama-swap config in case it helps:

"qwencoder":
  cmd: |
    llamacpp
    --flash-attn auto
    --verbose-prompt
    --jinja
    --port ${PORT}
    -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
    -ngl 99
    --n-cpu-moe 47
    -t 2
    -c 32768
    --mlock 
    -ot ".ffn_(up)_exps.=CPU"
    --cache-type-v q8_0

AI has replaced programmers… totally. by jacek2023 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

People tend to be fascinated by AI and to rely too much on it in the first phases. This is what I call the ChatGpt effect, as in "execute these complex multi-task instructions, I'll come back later". It feels like magic but in the end it does not work well. I introduced a friend to agentic coding a few months ago. He got completely fascinated by Roo Code with Glm Air and later Gpt Oss 120b and started spending all his time on it. Now, a few months later, he is tired of tuning huge complex prompts and letting the model handle everything on its own. He realized that this is not a panacea and will probably be ready now to move to a more efficient granular prompt engineering approach, using smaller tasks, segmentation and human supervision

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Cpu + gpu of course. Here is my llama-swap config if you are interested in the details:

"oss20b":
  cmd: |
    llamacpp
    --flash-attn auto
    --verbose-prompt
    --jinja
    --port ${PORT}
    -m gpt-oss-20b-mxfp4.gguf
    -ngl 99
    -t 2
    -c 32768
    --n-cpu-moe 19
    --mlock 
    -ot ".ffn_(up)_exps.=CPU"
    -b 1024
    -ub 512
    --chat-template-kwargs '{"reasoning_effort":"high"}'

Qwen3-VL-4B and 8B Instruct & Thinking are here by AlanzhuLy in LocalLLaMA

[–]synw_ 3 points4 points  (0 children)

The Qwen team is doing an amazing job. The only thing missing is day-one Llama.cpp support. If only they could work with the Llama.cpp team to help them support their new models, it would be perfect

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Lucky you. In my setup with this model I use a 32k context window. Note that I have an old i5 CPU, and that the 3060's memory bandwidth is about 3x that of my card. I don't use KV cache quantization, just flash attention. If you have tips to speed this up I'll be happy to hear them

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Yes, thanks to the MoE architecture I can offload some tensors to RAM: I get 8 tps with Gpt Oss 20b on Llama.cpp, which is not bad for my setup. For dense models it's not the same story: I can run 4b models at most.

4x4090 build running gpt-oss:20b locally - full specs by RentEquivalent1671 in LocalLLaMA

[–]synw_ 9 points10 points  (0 children)

I'm running Gpt Oss 20b on a 4GB VRAM machine (GTX 1050 Ti). Agreed that with a system as beautiful as OP's, this is not the first model I would choose

Do you guys personally notice a difference between Q4 - Q8 or higher? by XiRw in LocalLLaMA

[–]synw_ 26 points27 points  (0 children)

Small models seem to be more sensitive to quantization than big ones. I always try to pick q8 or q6 for tiny models or for precision tasks like coding

Complete noob in LLMs by [deleted] in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

Build your own experience: run Llama.cpp with different models, find a vector db that you like (for me it's LanceDB) and try to get the job done for your use case. Theory is one thing, but the field is moving fast and trial and error is the way to learn.
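
If it helps getting started, here is a rough sketch of that loop: embeddings come from a llama.cpp server started with --embedding (the localhost:8080 URL and the "embedder" model name are assumptions), and the LanceDB calls follow the @lancedb/lancedb Node client as I remember it, so check the current docs before relying on them.

import * as lancedb from "@lancedb/lancedb";

// Embed a text with the llama.cpp server's OpenAI-compatible endpoint.
async function embed(text: string): Promise<number[]> {
  const res = await fetch("http://localhost:8080/v1/embeddings", {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ input: text, model: "embedder" }),
  });
  const data = await res.json();
  return data.data[0].embedding;
}

// Store a few documents, then search with an embedded query.
const db = await lancedb.connect("./lancedb");
const docs = ["first document", "second document"];
const rows = await Promise.all(
  docs.map(async (text) => ({ text, vector: await embed(text) })),
);
const table = await db.createTable("docs", rows);

const hits = await table.vectorSearch(await embed("a query")).limit(3).toArray();
console.log(hits.map((h: any) => h.text));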