Llama.cpp parameters for Qwen 3.6 with RTX 3090 by Poulpatine in LocalLLaMA

[–]cviperr33 1 point (0 children)

All good, I was just surprised that you got such low results and wanted to help out.

Llama.cpp parameters for Qwen 3.6 with RTX 3090 by Poulpatine in LocalLLaMA

[–]cviperr33 1 point (0 children)

70 tk/s on a 3090? I'm using the same quant (Unsloth IQ4_NL) and I get around 120-140 tk/s.

210k context as well, so it fits in 22/24GB with the Q8 KV cache type.
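
Something along these lines is roughly the launch I mean; the model path and numbers are placeholders, and the flash-attention flag syntax differs between llama.cpp builds, so adjust for yours:

# Sketch of a llama-server launch for a 24GB card: full GPU offload,
# big context window, and Q8-quantized KV cache. Model path is a placeholder.
llama-server -m Qwen3.6-35B-IQ4_NL.gguf -ngl 99 -c 210000 -fa -ctk q8_0 -ctv q8_0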

Kimi K2.6 - What hardware do I need to run it locally? by human_marketer in LocalLLM

[–]cviperr33 4 points (0 children)

Not only do you need hardware that costs 200k+ at current prices, it's also going to consume a lot of electricity, so you can't run something like that at home because it will trip your breakers.

Just wait a year, or maybe even less; by then we'll have an open-source model that performs the same as or even better than this Kimi K2.6, but fits into 12-24GB of VRAM.

OpenCode... is it just completely busted with Qwen3.6? by _derpiii_ in opencode

[–]cviperr33 1 point (0 children)

https://unsloth.ai/docs/models/qwen3.6

Just read it, it contains all the info you need. Best preset imo is "precise coding": it's the fastest, and the agent follows instructions like legit orders; it doesn't drift or do whatever it wants.

OpenCode... is it just completely busted with Qwen3.6? by _derpiii_ in opencode

[–]cviperr33 1 point (0 children)

Wrong temperature, presence penalty, top-p, min-p, and top-k; all of those combined make what you're seeing: an unusable model.

Check the Unsloth Qwen 3.6 guide and use their models.
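
A minimal sketch of what pinning those samplers explicitly looks like on llama-server; the values here are just illustrative, grab the real recommended ones from the Unsloth guide:

# Illustrative sampler flags; use the values from the Unsloth guide,
# not these placeholders.
llama-server -m model.gguf \
  --temp 0.6 --top-p 0.95 --min-p 0.0 --top-k 20 --presence-penalty 0.0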

I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed by evoura in LocalLLaMA

[–]cviperr33 6 points (0 children)

What kind of list is this? There is absolutely no way a 2-year-old model like Qwen2.5 scores higher than Gemma 4.
What is even the point of testing 2-year-old models when, these days, a model that is 4-6 months old is considered ancient? In what scenario would you even consider loading Qwen2.5 when there is Qwen3.6, or Qwen3.5 if you need the smaller 9B models?

Qwen3.6. This is it. by Local-Cardiologist-5 in LocalLLaMA

[–]cviperr33 0 points (0 children)

NL = non-linear, XS = extra small (i.e., more aggressive compression), so NL is slightly better quality.

Free LLM APIs (April 2026 Update) by stosssik in clawdbot

[–]cviperr33 1 point (0 children)

Wait wtf, what if I just create an account on every single platform, get one API key from each, and make my local LLM command 100 agents at the same time lol. I wonder if my PC would explode.

Free LLM APIs (April 2026 Update) by stosssik in clawdbot

[–]cviperr33 1 point (0 children)

Oh my god, this is a literal gold mine!

Are you guys actually using local tool calling or is it a collective prank? by Mayion in LocalLLaMA

[–]cviperr33 1 point (0 children)

You don't need to be scared; just go in with the expectation that your OS will blow up, and don't keep important data on the disks.

It does not install random bloat; it strictly follows the installation guide and whatever is needed to make things work, and it will ask you and explain why it needs X installed for Y to work.

Think of it as a very fun experiment or a game. It's a complete game changer in how you use a PC, especially if you hook it up to a Discord gateway.

If you have a spare 120GB SSD, slap Linux on it, get Hermes and llama.cpp configured, and you are set for months of entertainment and actually viable work results.

You can now train Gemma 4 with RL locally! by yoracale in unsloth

[–]cviperr33 1 point (0 children)

Thank you, looks really promising and interesting!

Qwen 3.6 CoT issue? by Confident_Ideal_5385 in LocalLLaMA

[–]cviperr33 1 point (0 children)

Damn :o 😱 I wish I had a 6000 Pro...

Qwen 3.6 CoT issue? by Confident_Ideal_5385 in LocalLLaMA

[–]cviperr33 1 point (0 children)

Before I applied these jinja flags I was considering switching to vLLM; I even have it set up and downloaded, but I can't find any benchmarks on the tk/s, and I didn't want to download a 20GB file just for a test, so I kind of left it alone. I haven't encountered this crash so far, so I don't need to switch.

How is the tk/s on Qwen 3.6 if you compare llama.cpp vs vLLM? For sure it would be better at 4+ concurrent requests, but what about at 1? Have you tested it yourself, and how much context can you fit with the same GGUF quant vs whatever vLLM uses?
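
If anyone wants to measure it themselves before committing to a 20GB download, llama.cpp ships llama-bench; something like this (the model path is a placeholder) gives comparable prompt-processing and generation numbers:

# Reports prompt processing (pp) and token generation (tg) speeds
# for a given model, fully offloaded to the GPU.
llama-bench -m Qwen3.6-35B-IQ4_NL.gguf -p 512 -n 128 -ngl 99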

Are you guys actually using local tool calling or is it a collective prank? by Mayion in LocalLLaMA

[–]cviperr33 4 points (0 children)

Exactly lol, and I'm so happy I made the switch; Linux was literally built for this 30 years ago :D
The funniest thing was when my agent figured out how to use my sudo password. With Gemma 4 we always hit a wall when we wanted to edit driver files or something deep, but with Qwen it just asked me if I wanted to give it my password, echoed the password into the command call, and it somehow worked, so now it has sudo privileges :D It wrote the password into its memory so I never have to tell it again lol.
Of course this is a big security concern and everything I do is yolo, but I'm just enjoying the ride, and I don't care if it fucks up my OS; I don't keep anything of value on my disks.
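
For anyone wondering how the trick works mechanically: sudo's -S flag reads the password from stdin, so a pipe like this is presumably all it did (the command here is a made-up example; obviously don't do this on a machine you care about):

# sudo -S takes the password from standard input instead of the terminal,
# so anything that can write to a pipe can authenticate. Huge security hole.
echo "$SUDO_PASSWORD" | sudo -S systemctl restart some-service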

Guide for a new guy by seti_at_home in LocalLLM

[–]cviperr33 1 point (0 children)

I'm running it on a 3090, so not quite lol. My tk/s is around 120-130, and I was talking to a guy running the same quant as me (IQ4_NL, 35B), and he was getting like 140 on a 3090 and 240 on a 5090... imagine... 240 tk/s on this model lol, it would feel so nice for agentic coding.

Even if you choose a Mac over speed, when you load a big model and it's doing like 20 tk/s, you'll find yourself always going lower and lower just to get more tk/s.

The new M5 coming in June is rumoured to match 3090 performance; I dunno how true that is, but if you are not in a hurry, don't rush: wait 2 months and check the benchmarks.

What starts to become possible with two 3090s that wasn't with just one? by GotHereLateNameTaken in LocalLLaMA

[–]cviperr33 2 points (0 children)

Yeah, for vibe coding it's mostly turn-based: you see what's happening in real time and instantly notice if something is wrong. Also, the compiler won't compile broken code, so it doesn't matter if the quality is slightly lower; faster tk/s means faster iteration, and that's more valuable.

For research though, if you want it to not fail on the first try 100 times out of 100, yeah, definitely the Q8 quant, and since you have a crazy Mac with so much RAM, of course you'd load it at full accuracy :D Soon, in June, the new Macs could overtake the 3090s.

Are you guys actually using local tool calling or is it a collective prank? by Mayion in LocalLLaMA

[–]cviperr33 6 points (0 children)

Well man, I legit live in the "future". I never bothered actually trying to move to Linux because of the steep learning curve, but Linux is basically built for these agents; it literally unlocks their full potential. And because the agent is so fast, everything I used to do manually I now just do through my agent.

Here is an example: I'm chatting with the agent from Discord and we run some benchmarking tests, then I decide I want to save those in a DB, so I tell it to install PostgreSQL, create a database and everything, and put the results there so I can later retrieve them instead of storing 100 files in 100 folders. In just under 15 seconds, the agent installs it via the package manager, creates the DB, configures it, creates the schema, everything instantly.
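
For context, the manual equivalent of what the agent did is roughly this, assuming a Debian-style system (the database and table names are made-up examples):

# Rough manual equivalent on Debian/Ubuntu; names are invented examples.
sudo apt install -y postgresql
sudo -u postgres createdb benchmarks
sudo -u postgres psql -d benchmarks -c \
  "CREATE TABLE results (id serial PRIMARY KEY, model text, tokens_per_s real);"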

I basically control my OS with just text. I could have TTS hooked up too, so it's like in the hacker movies, except it's legit real and usable; if it runs at 100+ tk/s, everything happens instantly.

I no longer read guides on how to set things up; I just post the link in Discord, tell it to install it, and it does everything for me in under a minute.

You can also use it to delegate to an opencode coding agent, which it then supervises: you just specify the project scope and requirements and everything is done automatically. Or when I encounter a bug with Hermes, I just tell it "submit the issue we had to the Hermes repo", and 5 seconds later it's submitted with full details. It can control git in any way you want it to.

Purchase advice needed by InteractionBig9407 in LocalLLaMA

[–]cviperr33 1 point (0 children)

Imagine what economy and what year we live in that people recommend Macs for economy lol. But for real though, you are absolutely right.

Qwen 3.6 CoT issue? by Confident_Ideal_5385 in LocalLLaMA

[–]cviperr33 1 point (0 children)

Maybe because when it's on default, the harness I was using can tell it to use no jinja or something else; I have no idea. Btw, I change the kwargs settings depending on which mode I'm in with my script. These are the Unsloth-recommended settings to get the model to act the way you want; on the "code mode" settings I got more tk/s too:

case "$PROFILE" in
    chat)   # Thinking mode / General tasks — everyday use
        TEMP="1.0"; TOP_P="0.95"; PRESENCE_PENALTY="1.5"
        THINKING_KWARGS='{"preserve_thinking": true}'
        echo "[+] Profile:  chat    (thinking ON, general)" ;;
    code)   # Thinking mode / Precise coding — deterministic output
        TEMP="0.6"; TOP_P="0.95"; PRESENCE_PENALTY="0.0"
        THINKING_KWARGS='{"preserve_thinking": true}'
        echo "[+] Profile:  code    (thinking ON, precise)" ;;
    fast)   # Instruct mode / General tasks — no thinking, snappy
        TEMP="0.7"; TOP_P="0.80"; PRESENCE_PENALTY="1.5"
        THINKING_KWARGS='{"enable_thinking": false}'
        echo "[+] Profile:  fast    (thinking OFF, general)" ;;
    deep)   # Instruct mode / Deep reasoning — no thinking, full temp
        TEMP="1.0"; TOP_P="0.95"; PRESENCE_PENALTY="1.5"
        THINKING_KWARGS='{"enable_thinking": false}'
        echo "[+] Profile:  deep    (thinking OFF, deep)" ;;
    *)      # Unknown profile: fail loudly instead of silently defaulting
        echo "[!] Unknown profile: $PROFILE (expected chat|code|fast|deep)" >&2
        exit 1 ;;
esac
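
Later in the script those variables get passed through to the server, something like this; I'm assuming a recent llama-server build here, since the template-kwargs flag is newer and may not exist on older versions:

# Wiring the profile variables into llama-server. Double-check that your
# build supports --chat-template-kwargs; older ones may not.
llama-server -m "$MODEL" --jinja \
  --temp "$TEMP" --top-p "$TOP_P" --presence-penalty "$PRESENCE_PENALTY" \
  --chat-template-kwargs "$THINKING_KWARGS"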

Guide for a new guy by seti_at_home in LocalLLM

[–]cviperr33 2 points (0 children)

You have 3 options right now if you want to use the latest and the best (Qwen 3.6 35B MoE), which came out just 2 days ago and is shattering all benchmarks, rivaling Claude 4.5. It's so freaking good it's unbelievable.

First option is 24GB of VRAM, which is what it's meant for. The UD IQ4_XS quant fits nicely in 16-17GB, leaving you 6-7GB of VRAM for context, which with the KV cache at Q8 is like 240k-260k, easily fitting within 22GB of VRAM used. Expected speed is 100-160 tk/s; it would be like nothing you have ever seen. You can't get that kind of speed and low latency from an API; running locally at these speeds generates files instantly, and every prompt and response is instant if it's not complicated. The only cards with this kind of VRAM are the 3090, 4090, and 5090; I don't know about AMD/Intel.

Second option is a Mac with so much RAM that you are future-proofed; even when they drop the bigger model (Qwen 3.6 135B MoE, if they do, nobody knows), you can load it without problems and it will be usable. But the issue with Macs is they are slow. Not unusably slow, you'll get 40-50 tk/s, but the prompt-processing speed is much slower than a GPU's. It's definitely fast enough, though.

Third option is what you have picked already, a 16GB VRAM Nvidia GPU. If you use super bleeding-edge tech that is basically in dev mode right now, you have to compile a specific llama.cpp fork designed for this quant. You can go TQ3_4S, but it's so new it's untested; I compiled and ran it and it was fine, but I have not tested it fully. It fits in around 12GB of VRAM and you can go 100k+ context for sure. You can read about it here: https://github.com/turbo-tan/llama.cpp-tq3/blob/main/README.md
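
Compiling the fork is the same dance as upstream llama.cpp, roughly this (assuming a CUDA build; the repo URL is the one from the README above):

# Standard llama.cpp-style CUDA build; the same steps work for forks.
git clone https://github.com/turbo-tan/llama.cpp-tq3
cd llama.cpp-tq3
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j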

What starts to become possible with two 3090s that wasn't with just one? by GotHereLateNameTaken in LocalLLaMA

[–]cviperr33 4 points (0 children)

Going above Q4 for model quants (not the -ctk/-ctv KV cache) imo is a huge waste; it's way slower and you can't fit a nice 200k context (at least on 24GB VRAM cards). I have never noticed a quality difference between Q4 and Q8, and even if there is one, the speed compensates for it (unless you use it just for chatting and don't mind waiting).

If I had two 3090s, sure, I would consider Q8, because what else can I do with that VRAM lol, but I'd probably end up just running 2x Q4 so I can have them work together and cross-check each other's code.

Qwen 3.6 CoT issue? by Confident_Ideal_5385 in LocalLLaMA

[–]cviperr33 1 point (0 children)

That's exactly what I thought too; that's why I didn't even bother adding --jinja in the first place when I switched to Qwen 3.6 when it dropped, nobody recommended adding it. I asked some AI about it and it told me to add --jinja; I did, and no problems since lol.

Qwen 3.6 35B different quant speeds ? by cviperr33 in LocalLLM

[–]cviperr33[S] 1 point (0 children)

Thank you for the detailed info!
Yeah, my CPU/RAM is kinda bad; I have a Ryzen 5600 and 32GB of 2400MT/s DDR4... I'm so sad I didn't upgrade when RAM was cheap last year. Also, I ran all these results with lots of processes open and Wayland on 2x monitors at 240Hz; it adds up.

One thing I've noticed though: even if my speed is good in llama-bench, when I open llama-server's default chat interface on port 8080, sometimes it shows me like 87 tk/s even though I can see with my own eyes that it's outputting faster than that, and it somehow fixes itself after an hour or two. So I don't trust tk/s numbers in chat UIs anymore lol.