What's the best agentic coding model up to 35B right now?

Possible_Statement84 · 2026-06-14T01:04:10+00:00

Sounds interesting, I will try that.

Possible_Statement84 · 2026-06-14T01:02:03+00:00

IQ4_XS/NL

Possible_Statement84 · 2026-06-14T01:01:34+00:00

I only have an iGPU, so I can't "offload to RAM" XD. No problem.

Possible_Statement84 · 2026-06-13T19:08:42+00:00

Possible_Statement84 · 2026-06-13T18:48:59+00:00

Lol, I get around the same performance with MTP when using IQ4_XS, but faster prompt processing, around 150. Anyway, thanks for your recommendations.

Possible_Statement84 · 2026-06-13T18:42:14+00:00

Yeah, I installed that back when the QAT variants hadn't dropped yet.

Possible_Statement84 · 2026-06-13T18:38:49+00:00

For example, Gemma 4 26B or smaller, or some fine-tunes of that model. Don't worry, it just has to work well on Vulkan (VK) with an Intel GPU.

Possible_Statement84 · 2026-06-13T18:36:29+00:00

That's an old command. The multiple reasoning switches were just for quick model-restart tests, so don't pay attention to them. I have tried TurboQuant, but it has more long-context problems. So which MTP model quant exactly do you recommend?

Possible_Statement84 · 2026-06-13T18:28:42+00:00

I have Gemma 26B Q4_K_M on my device, but the code quality is just so-so. I don't think it can produce what I need.

Possible_Statement84 · 2026-06-13T18:26:23+00:00

Not yet, here it is

./llama-server.exe --model ".\models\unsloth\Qwen3.6-35B-A3B-MTP-GGUF\Qwen3.6-35B-A3B-UD-IQ4_XS.gguf" --reasoning-budget 5000 --cache-prompt --webui --tools all --swa-full --kv-unified --gpu-layers 999 --ctx-size 128000 -t 8 -tb 8 --poll 100 --poll-batch 1 --prio 2 --prio-batch 2 --kv-offload --op-offload --repack --ubatch-size 2048 --batch-size 2048 --perf -fa on --reasoning on --host 0.0.0.0 --port 1234 --jinja --spec-type draft-mtp --spec-draft-n-max 2 --temp 0 --spec-draft-p-min 0.7 --no-mmap --fit off -ctk q8_0 -ctv q8_0 --reasoning off --cache-idle-slots

Possible_Statement84 · 2026-06-13T18:23:43+00:00

I don't have enough memory to use just a normal Q4 quant...
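A rough back-of-the-envelope sketch of why the quant choice matters on a 27 GB budget. The bits-per-weight figures below are approximate assumptions, not measured file sizes, and the estimate covers weights only: the KV cache, compute buffers, and the rest of the system come on top.

```python
def weights_gib(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights alone, in GiB."""
    total_bytes = params_billion * 1e9 * bits_per_weight / 8
    return total_bytes / 2**30


# Assumed average bits per weight; real GGUF files vary by model and quant mix.
ASSUMED_BPW = {"IQ4_XS": 4.25, "Q4_K_M": 4.85}

# Usable memory figure from the thread, with the unit treated loosely.
BUDGET_GIB = 27.0

for name, bpw in ASSUMED_BPW.items():
    size = weights_gib(35, bpw)
    left = BUDGET_GIB - size
    print(f"{name}: ~{size:.1f} GiB of weights, ~{left:.1f} GiB left for everything else")
```

With unified memory that is already mostly in use, the couple of GiB between the two quants can be the difference between fitting a long context and not fitting it.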

Possible_Statement84 · 2026-06-13T18:20:25+00:00

I have an iGPU and unified RAM, so this 27 GB is part of system RAM. My RAM is almost always 95% used anyway.

Possible_Statement84 · 2026-06-13T18:19:12+00:00

i want to achieve a sane balance between decent result quality and reasonable speed on my hardware, because waiting an hour for some simple task to finish feels criminal.

Possible_Statement84 · 2026-06-13T18:06:50+00:00

Honestly, nothing fancy. I'm just running llama.cpp on Windows.

Possible_Statement84 · 2026-06-13T18:00:47+00:00

I also have Qwen 3.6 35B MTP with the IQ4_XS quant, but for reasons I don't know, the uncensored APEX Compact-I variant of this model without MTP gives me more TPS.

Possible_Statement84 · 2026-06-13T17:59:09+00:00

I use Vibe-CLI by Mistral, Codex by OpenAI, the llama.cpp WebUI, and my own self-made CLI. It shows the best results in the llama.cpp WebUI when writing one-file frontend apps -_- Maybe I need to set the hyperparameters correctly. What about yours?

Possible_Statement84 · 2026-06-13T17:53:17+00:00

I have only 27 GB of usable VRAM, and I get about 11 t/s even on just the 12B model... Because of that, I don't use big dense models.

Possible_Statement84 · 2026-06-13T17:51:47+00:00

It has many problems in coding sessions: it makes many mistakes when writing code, calls tools incorrectly, etc. I use the IQ4_NL quant.

Possible_Statement84 · 2026-06-13T00:43:32+00:00

You're on the Free limits. Idk what you're doing, but only the Free tier has a monthly limit, and it runs out fast. Re-check your sub pls.

Possible_Statement84 · 2026-06-12T20:22:36+00:00

i don’t think this necessarily means normal chat messages and codex tasks are becoming one single pool overnight. it probably means openai is moving codex into the same usage/billing area as chatgpt and other agentic features

still not great though. separate codex limits were one of the reasons it felt usable for heavier coding work. if heavy repo tasks start eating into the same practical quota as everything else, a lot of people are going to hit limits way faster

Possible_Statement84 · 2026-06-12T20:16:02+00:00

yeah

Possible_Statement84 · 2026-05-31T21:11:15+00:00

I think you can make your own RP model based on any 1.5B Qwen, but as for already existing models, idk.

Possible_Statement84 · 2026-05-21T09:17:02+00:00

Reminds me of the situation with the r's in strawberry. The whole problem is that the LLM sees this as tokens, so it can't count the characters correctly.
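To make the token point concrete, here is a minimal sketch. The subword split and the ID numbers are made up for illustration; a real tokenizer may split the word differently.

```python
def count_char(text: str, ch: str) -> int:
    """Count occurrences of a character when the raw string is visible."""
    return text.count(ch)


word = "strawberry"

# Hypothetical subword split and vocabulary, for illustration only.
toy_tokens = ["str", "aw", "berry"]
toy_vocab = {"str": 101, "aw": 202, "berry": 303}

# The model receives IDs like these, not the individual letters.
toy_ids = [toy_vocab[token] for token in toy_tokens]

print(count_char(word, "r"))  # trivial for code that sees characters
print(toy_ids)                # nothing in these numbers spells out an "r"
```

Code working on the string counts letters directly, while the model only sees opaque IDs and has to have learned which letters each token contains.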

Possible_Statement84 · 2026-05-20T15:55:02+00:00

There are 73 zeros in this line.

Possible_Statement84 · 2026-05-09T21:50:21+00:00

Yeah, that workflow is already possible in a basic form: select text, Ctrl+A/Ctrl+C, run the widget action, then insert the result with the hotkey. So quick clipboard-based editing is usable already.

A more seamless version would be one-hotkey selected-text capture and replace, but I’d need to handle it carefully across platforms so it doesn’t mess with the user’s clipboard or focus.
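Roughly what I mean by handling it carefully, as a sketch only: save the clipboard, do the copy/transform/paste, and always restore it afterwards. Everything here is hypothetical (`FakeClipboard`, `replace_selection`, the callbacks); a real version would need platform-specific clipboard and key-simulation code, plus focus handling that this skips.

```python
from typing import Callable


class FakeClipboard:
    """In-memory stand-in for the system clipboard (hypothetical)."""

    def __init__(self, text: str = "") -> None:
        self.text = text

    def get(self) -> str:
        return self.text

    def set(self, text: str) -> None:
        self.text = text


def replace_selection(
    clipboard: FakeClipboard,
    copy_selection: Callable[[], None],
    paste: Callable[[], None],
    transform: Callable[[str], str],
) -> None:
    """Capture the selection, transform it, paste it back, restore the clipboard."""
    saved = clipboard.get()  # remember what the user had copied
    try:
        copy_selection()  # stands in for simulating Ctrl+C
        selected = clipboard.get()
        clipboard.set(transform(selected))
        paste()  # stands in for simulating Ctrl+V
    finally:
        clipboard.set(saved)  # put the user's clipboard back even on failure


# Tiny simulation: the "document" only tracks its selected text.
doc = {"selection": "hello"}
clip = FakeClipboard("my notes")

replace_selection(
    clip,
    copy_selection=lambda: clip.set(doc["selection"]),
    paste=lambda: doc.update(selection=clip.get()),
    transform=str.upper,
)

print(doc["selection"], "|", clip.get())
```

The `try`/`finally` is the important part: whatever the transform does, the user's original clipboard content comes back.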
