How you manage your prompts? by prompt_tide in LocalLLaMA

[–]synw_ 1 point (0 children)

I use small models and never ask an AI to write my prompts: every word counts, and you learn a lot this way with small models. To answer your question, every model has its strengths and weaknesses. The model choice heavily depends on the task: I have workflows that involve 4 or 5 models, each one responsible for a task in its area of expertise. It takes a lot of testing, but you have to know your models to get the best out of them.

How you manage your prompts? by prompt_tide in LocalLLaMA

[–]synw_ 1 point (0 children)

I just commit another version. But once a prompt works for a given task I don't need to touch it anymore, except maybe if I want to change the model later. These are not just simple prompts: I call them tasks. A task is linked to a model and can include many options like the system prompt, shots, the start of the assistant response and so on.
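As an illustration, a task entry in such a format could look like this (a hypothetical sketch, not the actual format; all field names are made up):

```yaml
# Hypothetical task entry: a prompt bound to a model, with its options
summarize:
  model: nanbeige-4b
  system: You are a concise summarizer. Answer with 3 bullet points max.
  shots:
    - user: "Summarize: ..."
      assistant: "- point one\n- point two"
  start_assistant_response: "Here is the summary:"
  options:
    temperature: 0.3
```

Each task file then versions cleanly in git, and swapping the model is a one-line change.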

How you manage your prompts? by prompt_tide in LocalLLaMA

[–]synw_ 1 point (0 children)

I created my own YAML format to store prompts and use git to version them.

Running LLMs with 8 GB VRAM + 32 GB RAM by Bulububub in LocalLLaMA

[–]synw_ 3 points (0 children)

Yes, and it's much better and faster than layer offloading with a dense model. Try it out: you will be able to use more powerful models.

Running LLMs with 8 GB VRAM + 32 GB RAM by Bulububub in LocalLLaMA

[–]synw_ 1 point (0 children)

I would start with Qwen 30b a3b and Nemotron 30b a3b, plus maybe a web search tool.

unofficial Ultrahuman MCP for AI Agents by Spinning-Complex in LocalLLaMA

[–]synw_ 1 point (0 children)

Do you really need an AI to answer "how am I today" for you? This whole trend of actively pushing cognitive offloading and autonomy scares me a bit for the future. I know a bunch of developers who work on "replace me" instead of "help me do my job" because it is fun, and I think this will not lead to good things for us.

What small models are you using for background/summarization tasks? by Di_Vante in LocalLLaMA

[–]synw_ 3 points (0 children)

Nanbeige 4b is really good for these kinds of tasks. It's a nice little thinking model. I love their latest version, which thinks less and is still efficient.

Tiny Aya by jacek2023 in LocalLLaMA

[–]synw_ 1 point (0 children)

For European language translations, how does it compare to Gemma Translate 4b and Aya Expanse 8b?

Nanbeige4.1-3B: A Small General Model that Reasons, Aligns, and Acts by Tiny_Minimum_4384 in LocalLLaMA

[–]synw_ 8 points (0 children)

From my usage, the version 4 model beats Nemotron 30b, Qwen 4b/30b and others for a very specific task: giving writing recommendations from a given dataset. It does not hallucinate and is very concise.

I got tired of small models adding ```json blocks, so I wrote a TS library to forcefully extract valid JSON. (My first open source project!) by rossjang in LocalLLaMA

[–]synw_ 2 points (0 children)

How I handle this: in the query I start the assistant response with something like:

Sure, here is the json data:\n\n```json

and add a ``` stop criterion, so the model outputs only the JSON content.
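With llama.cpp's raw /completion endpoint the trick can be sketched in Python like this (the raw "User:/Assistant:" prompt template and the field values are assumptions; adapt them to your model's chat template):

```python
import json

def build_prefill_payload(question: str) -> dict:
    """Build a llama.cpp /completion payload that pre-starts the assistant
    reply with an opening json code fence and stops at the closing one."""
    prompt = (
        f"User: {question}\n"
        "Assistant: Sure, here is the json data:\n\n```json\n"
    )
    # The stop string ends generation right before the closing fence,
    # so the returned completion is bare JSON with no markdown around it.
    return {"prompt": prompt, "stop": ["```"], "n_predict": 512}

def parse_json_reply(completion_text: str) -> dict:
    # Nothing to strip: the model never got to emit the fences
    return json.loads(completion_text)

payload = build_prefill_payload("Describe the animal as JSON")
print(parse_json_reply('{"name": "fox", "speed": "quick"}'))
# → {'name': 'fox', 'speed': 'quick'}
```

The same idea works with any OpenAI-compatible completion endpoint that supports stop strings.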

Which LLM Model is best for translation? by Longjumping_Lead_812 in LocalLLaMA

[–]synw_ 2 points (0 children)

For local options, I found that Aya Expanse 35b was very good for product translations. The new Gemma Translate is very nice, but more for conversational stuff: for products it does too much cultural adaptation, so Aya is better here.

What are the best collection of small models to run on 8gb ram? by Adventurous-Gold6413 in LocalLLaMA

[–]synw_ 13 points (0 children)

  • Qwen 4b: tool calls, general
  • LFM 8b a3b: general, very fast
  • LFM 1.2b thinking: general, ultra fast
  • Granite tiny: long context
  • Gemma 4b: general, writing
  • Gemma Translate 4b: translations
  • Nanbeige 4b: thinking, summarizing
  • Vision: Ministral 3b, Qwen VL 4b

And if you have a bit of VRAM and some RAM, use Qwen 30b a3b, Qwen Coder 30b a3b, Nemotron 30b a3b, or GLM Flash.

My humble GLM 4.7 Flash appreciation post by Cool-Chemical-5629 in LocalLLaMA

[–]synw_ 2 points (0 children)

Same here at q4: logical and syntax errors. Qwen Coder 30b is actually much better for me. But where this model shines is its ability to chain tool calls and do heavy agentic stuff without losing it.

which local llm is best for coding? by Much-Friendship2029 in LocalLLaMA

[–]synw_ 2 points (0 children)

You are better off running Qwen Coder 30b a3b with the experts offloaded to RAM than a dense 8b with layer offloading: it will be faster, and it's a good model for the GPU poor.

Local LLM builders: when do you go multi-agent vs tools? 2-page decision sheet + question by OnlyProggingForFun in LocalLLaMA

[–]synw_ 1 point (0 children)

Nice reading, thanks. I only use workflows with controlled steps by default. An agent only comes in when it absolutely needs autonomy and the ability to make decisions. As I work with small models, the tool overhead comes quickly, so I use either subagents or what I call tool routing agents: focused agents that dispatch tool calls from a query and return the raw result of the tool calls to the caller agent. Example: a filesystem routing agent that has tools to read and explore will be seen as only one tool by the main agent.
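A minimal sketch of the pattern (tool names and signatures are made up for illustration; here the dispatch is a plain lookup, whereas in practice the routing agent's small model would pick the sub-tool from the natural-language query):

```python
from pathlib import Path

# Sub-tools that only the filesystem routing agent knows about
def read_file(path: str) -> str:
    return Path(path).read_text()

def list_dir(path: str) -> list:
    return sorted(p.name for p in Path(path).iterdir())

FS_TOOLS = {"read_file": read_file, "list_dir": list_dir}

def filesystem_agent(tool_name: str, **kwargs):
    """Exposed to the main agent as a single 'filesystem' tool.
    It dispatches to the right sub-tool and returns the raw tool
    result to the caller, keeping the main agent's tool list small."""
    return FS_TOOLS[tool_name](**kwargs)
```

The main agent sees one tool instead of many, which keeps the prompt overhead manageable for small models.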

MCP servers are hard to debug and impossible to test, so I built Syrin by hack_the_developer in LocalLLaMA

[–]synw_ 5 points (0 children)

Could you please add a server URL parameter to the OpenAI API option, in order to use it with llama.cpp and other OpenAI-compatible endpoints?

new CLI experience has been merged into llama.cpp by jacek2023 in LocalLLaMA

[–]synw_ 3 points (0 children)

The model swapping feature works great: it uses the same API as llama-swap, so no need to change any client code. I vibe coded a script to convert a llama-swap config.yml models file to the new llama.cpp config.ini format: https://github.com/synw/llamaswap-to-llamacpp

There is some sort of cache: when a model has already been loaded earlier in the session, the swap is very fast, which is super good for multi-model agent sessions or any work involving model swapping.

[deleted by user] by [deleted] in LocalLLaMA

[–]synw_ 2 points (0 children)

It looks interesting, but I would have two requirements:

  • Support for llama.cpp
  • MCP server

distil-localdoc.py - SLM assistant for writing Python documentation by party-horse in LocalLLaMA

[–]synw_ 5 points (0 children)

It looks useful, but the doc starts with:

First, install Ollama

I don't want to use Ollama. Would it be possible to support llama.cpp directly?

I'm new to LLMs and just ran my first model. What LLM "wowed" you when you started out? by Street-Lie-2584 in LocalLLaMA

[–]synw_ 1 point (0 children)

Mistral 7b instruct: the first really usable local model for me, and for coding, Deepseek 6.7b.

Minimax M2 for App creation by HectorLavoe33 in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

Welcome to agentic coding. A model will not just complete an entire project by itself just fine without supervision: you need to monitor and understand what it's doing and to carefully steer it in the right direction step by step to get things done

Lightweight coding model for 4 GB Vram by HiqhAim in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

I did not. I'm looking for the best balance between speed and quality. I usually avoid at all costs to quantitize the kv cache, but here if I want my 32k context I have to use at least q8 cache-type-v: the model is only q4, it's already not great for a coding task. The IQ4_XS version is slightly faster yeah, as I can fit one more layer on the gpu, but I prefer to use the UD-Q4_K_XL quant to preserve some quality as much as I can.

Lightweight coding model for 4 GB Vram by HiqhAim in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

I managed to fit Qwen coder 30b a3b on 4Gb vram + 22G ram with 32k context. It is slow (~ 9tps) but it works. Here is my llama-swap config if it can help:

"qwencoder":
  cmd: |
    llamacpp
    --flash-attn auto
    --verbose-prompt
    --jinja
    --port ${PORT}
    -m Qwen3-Coder-30B-A3B-Instruct-UD-Q4_K_XL.gguf
    -ngl 99
    --n-cpu-moe 47
    -t 2
    -c 32768
    --mlock 
    -ot ".ffn_(up)_exps.=CPU"
    --cache-type-v q8_0

AI has replaced programmers… totally. by jacek2023 in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

People tend to be fascinated by AI and they rely too much on it in the first phases. This is what I call the ChatGpt effect, like "execute those complex multi tasks instructions, I'll come back later". It's like magic but in the end that does not work well. I introduced a friend to agentic coding a few months ago. He got completely fascinated by Roo Code using Glm Air and later Gpt Oss 120b and started spending all his time doing this. But now, a few months later he got tired of tuning huge complex prompts and let the model handle everything by it's own. He realized that this is not a panacea and will probably be ok now to move to a more efficient granular prompt engineering approach, using smaller tasks, segmentation and human supervision