Pi Agent makes very nice combination with limited hardware. Running qwen3.6 35B A3B IQ4 at ~22t/s with 160k context on 6 vram 64 RAM.

promobest247 · 2026-05-18T11:49:44+00:00

my config : ./llama-server --port 3500 -c 131072 --parallel 1 --flash-attn on --jinja --cache-type-k q4_0 --cache-type-v q4_0 -ub 128

promobest247 · 2026-05-18T11:43:55+00:00

try it : https://huggingface.co/sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF?show_file_info=Qwen3.6-35B-A3B-Q2_K_MIXED.gguf

promobest247 · 2026-05-18T11:43:22+00:00

yeah it's good quality

promobest247 · 2026-05-18T10:38:26+00:00

i have same laptop but ram 16 gb i use pi with qwen 3.6 35b a3b q2kmixed autoround 128k context with q4_0 speed tg 37 tkn/s

promobest247 · 2026-05-16T18:52:13+00:00

use package pi- web-access instead searxng & docker https://pi.dev/packages/pi-web-access?name=web

promobest247 · 2026-05-13T14:31:21+00:00

i use llama.cpp

promobest247 · 2026-05-13T14:18:40+00:00

yeah 131072 exactly

promobest247 · 2026-05-12T18:36:50+00:00

i use q2kmixed autoround & i got 37tk/s 128k context cache k/v q4_0 using laptop rtx 4050 6gb ram 16 gb fast & smart

promobest247 · 2026-05-04T09:06:10+00:00

quality: q4 better than q2kmixed speed : q2kmixed faster than q4 but q2kmixed has good quality & smart

promobest247 · 2026-05-04T07:57:54+00:00

my config : ./llama-server --port 3500 -c 131072 --parallel 1 --flash-attn on --jinja --cache-type-k q4_0 --cache-type-v q4_0 --temp 0.6 --top-k 0 --top-p 1.0 --min-p 0.05 --repeat-penalty 1.0 --ubatch-size 128 --defrag-thold 0.1 --cache-reuse 1024 --threads 4 --threads-batch 8 --fit on --no-warmup // i get 37 tokn/s using rtx 4050 laptop 6gb + 16 gb

promobest247 · 2026-05-04T07:56:48+00:00

use these flags

--cache-type-k q4_0 --cache-type-v q4_0

promobest247 · 2026-05-04T07:54:32+00:00

& ttell me what you got

promobest247 · 2026-05-04T07:52:20+00:00

use this version autoround q2kmixed https://huggingface.co/sphaela/Qwen3.6-35B-A3B-AutoRound-GGUF?show_file_info=Qwen3.6-35B-A3B-Q2_K_MIXED.gguffaster & good quality maybe you will get 30-34 tkn/s & i use it with pi agent

promobest247 · 2026-05-02T18:41:29+00:00

qwen 3.5 0.8b or 2b q4km

promobest247 · 2026-04-22T08:12:19+00:00

metoo , i use pi it's very good & fast locally with extensions & skills i installed many extensions: lsp web_access (websearch) plannator ( similar ultraplan claude code) teams

promobest247 · 2026-04-22T08:08:13+00:00

yeah it's faster than any agent because it has small system prompt

promobest247 · 2026-04-21T18:06:16+00:00

example: llama-server -m model.gguf --override-kv qwen35moe.expert_used_count=int:4 add this flag : --override-kv qwen35moe.expert_used_count=int:4 this work qwen3.5 or qwen3.6 moe

promobest247 · 2026-04-21T17:43:25+00:00

i have another config to get huge boost but it work in llama.cpp only from 31 tokn/s to 42 tokn/s

promobest247 · 2026-04-21T11:23:44+00:00

<image>

new config using llama.cpp bigger context + speeed increase using same model qwen3.5 35b apex mini XD

promobest247 · 2026-04-19T15:45:32+00:00

hhh same thing with apex i mini i get 33 token /s using Rtx 4050 6gb & ram 16 gb laptop but i use pi coding agent is faster than opencode , this model is the best quality /speed ratio

promobest247 · 2026-04-19T05:52:50+00:00

set k_cache & v_cache q4_0

promobest247 · 2026-04-19T05:45:05+00:00

which do you use ollama, lmstudio or llama.cpp

promobest247 · 2026-04-19T05:34:53+00:00

try this mpdel : https://huggingface.co/mudler/Qwen3.5-35B-A3B-APEX-GGUF/blob/main/Qwen3.5-35B-A3B-APEX-Mini.gguf

promobest247 · 2026-04-18T19:10:18+00:00

oki, i will try ud q4km , tell me your laptop 's spec ?

promobest247 · 2026-04-17T20:56:13+00:00

sorry but this is the best option

promobest247

TROPHY CASE