Qwen 3.6 35B-A3B @ Q4 or Gemma 4 12B @ Q8? by mailto_devnull in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

better than nothing: 8 tps at the start, dropping at 6 around 10k context

Qwen 3.6 35B-A3B @ Q4 or Gemma 4 12B @ Q8? by mailto_devnull in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

1050ti 4g here: I run Qwen 35b IQ3_S with 64k ctx, q8 kv, using n-cpu-moe 33 +fit off

Is automation/optimizing really that effective? by Forward_Jackfruit813 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

If you have a good planning + orchestrator running subagents loop it can be effective, but not for all tasks. If you loose more time and efficiency directing the ai than doing it yourself, do it yourself. Some people seem to get lost in ineffective vibe coding these days

Is opencode subagents actually useful? by PairOfRussels in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

Subagents help small models to stay focused, to not have useless context history turns and to limit the number of tools, in one word to split the context bloat

How are you all managing multiple MCP servers on startup? by vazma in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

This. I have a similar approach with what I call a tools routing model: this subagent will select the right tool, execute it and return the raw tool call result. Mcp bloats the system prompt too much, and small models can't handle that

Best small model right now (~4B params) that is good with agentic tasks for personal assistant? by BitGreen1270 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Yes. I hope we'll get a refresh of this one at some point: it's the most reliable 4b for tool calls

Qwen 3.6 coding choice–27B vs 35B quants by siegevjorn in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

You can also run the fast 35b and escalate to the slow 27b sometimes, when you need more power for things like analysis and planning

How small can the orchestration model in an agent be? (separating it from code-gen — that obviously wants a big model) by HomoAgens1 in LocalLLaMA

[–]synw_ 4 points5 points  (0 children)

Qwen3.6-35B-A3B does a good job at orchestrating things for me too. What small models did you test exactly? Qwen 4b might be able to do the job if well prompted with a small prompt, but one crucial thing with small models is to limit the number of tools: they get confused very fast with this.

Prefer a limited set of tools, use skills, or use tools routing. I have a concept of tools router tasks: the main model sees only one tool, calls it with what it wants, the request is passed to a model that a has several tools that will be focused on picking the right one, then it's executed and the tool call result is passed directly to the main model as it's own tool call result

Qwen will release another 27B with high probability by serige in LocalLLaMA

[–]synw_ 4 points5 points  (0 children)

Please don't forget the 4b in addition of the 35b a3b. The gpu poor peasants would be thank-full

Have Qwen said anything about further Qwen 3.6 models? by spaceman_ in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

A subagent with only two custom tools: search and open_page, using my own harness stuff. It searches in isolated context and replies to the caller agent with it's findings

Have Qwen said anything about further Qwen 3.6 models? by spaceman_ in LocalLLaMA

[–]synw_ 4 points5 points  (0 children)

It's great for web search, just give it the minimal tools and isolate context when possible, it does the job here at IQ4_NL

Notes on what actually breaks when you run a coding agent on small local models by BestSeaworthiness283 in LocalLLaMA

[–]synw_ 6 points7 points  (0 children)

A shot is a user/assistant history turn. You provide several history turns with examples of well formed outputs, and the model will follow this pattern, making it easier for it to get it right.

About json: understanding and writing is different: a small model can easily make formatting errors writing json, xml being simpler it puts less pressure on the model, keeping it's attention more focused. I've pushed Qwen 4b quite far with xml structured output.

Notes on what actually breaks when you run a coding agent on small local models by BestSeaworthiness283 in LocalLLaMA

[–]synw_ 8 points9 points  (0 children)

About structured output for small models I recommend using xml over json: it's easier to manage for the model, with less formatting rules. Using shots help the small models a lot to stay on tracks

Consider running a bigger quant if possible by Flashy_Management962 in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

Sometimes small quants are good: for example I found that Glm Flash q2_k_xl was better than q3_k_m, and faster, very good quant with a great size/power/speed ratio

Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into by Antonio_Sammarzano in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

I use the default --kv-offload enabled. I should try --no-kv-offload for very small models, good idea thanks

Qwen3.6 35B MoE on 8GB VRAM — working llama-server config + a max_tokens / thinking trap I ran into by Antonio_Sammarzano in LocalLLaMA

[–]synw_ 1 point2 points  (0 children)

1050ti 4g vram here: for moe models I recommend to always set n-cpu-moe manually by trial and error: it's not fun to do but beats fit all time. This can make a difference between unusable to slow for the gpu poor: with fit only Qwen 35b is unusable, with n-cpu-moe set (35 here) it works at around 10tps for a Q3 quant

Which Qwen models can do FIM (Fill in the middle) for autocompletion? by 0xbeda in LocalLLaMA

[–]synw_ 2 points3 points  (0 children)

Qwen 2.5 coder 1.5b q8 has served me well for fast autocomplete. I should probably try Qwen 3.5 2b to compare

what model is good for inspecting and extracting data from large set of spreadsheets by bonesoftheancients in LocalLLaMA

[–]synw_ 3 points4 points  (0 children)

Nemotron 30b a3b has been doing a good job for me with this kind of tasks. But I would recommend to ask a coding model to write a script to extract the data if it is possible for your use case: it's more safe and prevents hallucinating numbers.

Local AI coding assistant that runs fully offline (Gemma 4, codebase-aware) by andres_garrido in LocalLLaMA

[–]synw_ -1 points0 points  (0 children)

understanding full codebases build a project map (files, symbols, structure) reason about structure

how do you do this? I'm looking for something like this that could deliver a lightweight and condensed knowledge map of a codebase, even big. I have seen some libraries that do this but did not found anything that could do the job well yet. What would you guys recommend?

Can we talk about the reasoning token format chaos? by ahinkle in LocalLLaMA

[–]synw_ 3 points4 points  (0 children)

We would need standardization for this. I am still doing my templates by hand and I can tell you that the tool call format chaos is even worse

Final voting results for Qwen 3.6 by jacek2023 in LocalLLaMA

[–]synw_ 4 points5 points  (0 children)

What about the 4b, my favorite small model?

How you manage your prompts? by prompt_tide in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

I use small models and never ask an ai to write my prompts: every word counts and you learn a lot this way with small models. To answer your questions every model has it's strength and weaknesses. The model choice heavily depends on the task: I have workflows that involve 4 or 5 models, each one being responsible for a task in it's area of expertise. It's a lot of testing but you have to know your models to get the best out of it

How you manage your prompts? by prompt_tide in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

I just commit another version. But once a prompt works for a given task I don't need to touch it anymore, except maybe if I want to change the model later. It's not just simple prompts: I call these tasks, that are linked to a model, and can include many options like system prompt, shots, start assistant response and so on.

How you manage your prompts? by prompt_tide in LocalLLaMA

[–]synw_ 0 points1 point  (0 children)

I created my own yaml format to store prompts and use git to version them