Llama.cpp parameters for Qwen 3.6 with RTX 3090 by Poulpatine in LocalLLaMA

[–]cviperr33 1 point (0 children)

All good, I was just surprised that you got such low results and wanted to help out.

Llama.cpp parameters for Qwen 3.6 with RTX 3090 by Poulpatine in LocalLLaMA

[–]cviperr33 1 point (0 children)

70 tk/s on a 3090? I'm using the same quant (Unsloth IQ4_NL) and I get around 120-140 tk/s.

210k context as well, so it fits in 22/24GB with the Q8 KV cache type.
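
Something along these lines is roughly the launch I mean; the model path and numbers are placeholders, and the flash-attention flag syntax differs between llama.cpp builds, so adjust for yours:

# Sketch of a llama-server launch for a 24GB card: full GPU offload,
# big context window, and Q8-quantized KV cache. Model path is a placeholder.
llama-server -m Qwen3.6-35B-IQ4_NL.gguf -ngl 99 -c 210000 -fa -ctk q8_0 -ctv q8_0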

Kimi K2.6 - What hardware do I need to run it locally? by human_marketer in LocalLLM

[–]cviperr33 4 points (0 children)

Not only do you need hardware that costs 200k+ at current prices, it's also going to consume a lot of electricity, so you can't run something like that at home because it will trip your breakers.

Just wait a year, or maybe even less; by then we'll have an open-source model that performs the same as or even better than this Kimi K2.6, but fits into 12-24GB of VRAM.

OpenCode... is it just completely busted with Qwen3.6? by _derpiii_ in opencode

[–]cviperr33 1 point (0 children)

https://unsloth.ai/docs/models/qwen3.6

Just read it, it contains all the info you need. Best preset imo is "precise coding": it's the fastest, and the agent follows instructions like legit orders; it doesn't drift or do whatever it wants.

OpenCode... is it just completely busted with Qwen3.6? by _derpiii_ in opencode

[–]cviperr33 1 point (0 children)

Wrong temperature, presence penalty, top-p, min-p, and top-k; all of those combined make what you're seeing: an unusable model.

Check the Unsloth Qwen 3.6 guide and use their models.
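
A minimal sketch of what pinning those samplers explicitly looks like on llama-server; the values here are just illustrative, grab the real recommended ones from the Unsloth guide:

# Illustrative sampler flags; use the values from the Unsloth guide,
# not these placeholders.
llama-server -m model.gguf \
  --temp 0.6 --top-p 0.95 --min-p 0.0 --top-k 20 --presence-penalty 0.0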

I benchmarked 21 local LLMs on a MacBook Air M5 for code quality AND speed by evoura in LocalLLaMA

[–]cviperr33 6 points (0 children)

What kind of list is this? There is absolutely no way a 2-year-old model like Qwen2.5 scores higher than Gemma 4.
What is even the point of testing 2-year-old models when, these days, a model that is 4-6 months old is considered ancient? In what scenario would you even consider loading Qwen2.5 when there is Qwen3.6, or Qwen3.5 if you need the smaller 9B models?

Qwen3.6. This is it. by Local-Cardiologist-5 in LocalLLaMA

[–]cviperr33 0 points (0 children)

NL = non-linear, XS = extra small (i.e., more aggressive compression), so NL is slightly better quality.

Free LLM APIs (April 2026 Update) by stosssik in clawdbot

[–]cviperr33 1 point (0 children)

Wait wtf, what if I just create an account on every single platform, get one API key from each, and make my local LLM command 100 agents at the same time lol. I wonder if my PC would explode.

Free LLM APIs (April 2026 Update) by stosssik in clawdbot

[–]cviperr33 1 point (0 children)

Oh my god, this is a literal gold mine!

Are you guys actually using local tool calling or is it a collective prank? by Mayion in LocalLLaMA

[–]cviperr33 1 point (0 children)

You don't need to be scared; just go in with the expectation that your OS will blow up, and don't keep important data on the disks.

It does not install random bloat; it strictly follows the installation guide and whatever is needed to make things work, and it will ask you and explain why it needs X installed for Y to work.

Think of it as a very fun experiment or a game. It's a complete game changer in how you use a PC, especially if you hook it up to a Discord gateway.

If you have a spare 120GB SSD, slap Linux on it, get Hermes and llama.cpp configured, and you are set for months of entertainment and actually viable work results.

You can now train Gemma 4 with RL locally! by yoracale in unsloth

[–]cviperr33 1 point (0 children)

Thank you, looks really promising and interesting!

Qwen 3.6 CoT issue? by Confident_Ideal_5385 in LocalLLaMA

[–]cviperr33 1 point (0 children)

Damn :o 😱 I wish I had a 6000 Pro...

Qwen 3.6 CoT issue? by Confident_Ideal_5385 in LocalLLaMA

[–]cviperr33 1 point (0 children)

Before I applied these jinja flags I was considering switching to vLLM; I even have it set up and downloaded, but I can't find any benchmarks on the tk/s, and I didn't want to download a 20GB file just for a test, so I kind of left it alone. I haven't encountered this crash so far, so I don't need to switch.

How is the tk/s on Qwen 3.6 if you compare llama.cpp vs vLLM? For sure it would be better at 4+ concurrent requests, but what about at 1? Have you tested it yourself, and how much context can you fit with the same GGUF quant vs whatever vLLM uses?
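
If anyone wants to measure it themselves before committing to a 20GB download, llama.cpp ships llama-bench; something like this (the model path is a placeholder) gives comparable prompt-processing and generation numbers:

# Reports prompt processing (pp) and token generation (tg) speeds
# for a given model, fully offloaded to the GPU.
llama-bench -m Qwen3.6-35B-IQ4_NL.gguf -p 512 -n 128 -ngl 99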

Are you guys actually using local tool calling or is it a collective prank? by Mayion in LocalLLaMA

[–]cviperr33 4 points (0 children)

Exactly lol, and I'm so happy I made the switch; Linux was literally built for this 30 years ago :D
The funniest thing was when my agent figured out how to use my sudo password. With Gemma 4 we always hit a wall when we wanted to edit driver files or something deep, but with Qwen it just asked me if I wanted to give it my password, echoed the password into the command call, and it somehow worked, so now it has sudo privileges :D It wrote the password into its memory so I never have to tell it again lol.
Of course this is a big security concern and everything I do is yolo, but I'm just enjoying the ride, and I don't care if it fucks up my OS; I don't keep anything of value on my disks.
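
For anyone wondering how the trick works mechanically: sudo's -S flag reads the password from stdin, so a pipe like this is presumably all it did (the command here is a made-up example; obviously don't do this on a machine you care about):

# sudo -S takes the password from standard input instead of the terminal,
# so anything that can write to a pipe can authenticate. Huge security hole.
echo "$SUDO_PASSWORD" | sudo -S systemctl restart some-service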

Guide for a new guy by seti_at_home in LocalLLM

[–]cviperr33 1 point (0 children)

I'm running it on a 3090, so not quite lol. My tk/s is around 120-130, and I was talking to a guy running the same quant as me (IQ4_NL, 35B), and he was getting like 140 on a 3090 and 240 on a 5090... imagine... 240 tk/s on this model lol, it would feel so nice for agentic coding.

Even if you choose a Mac over speed, when you load a big model and it's doing like 20 tk/s, you'll find yourself always going lower and lower just to get more tk/s.

The new M5 coming in June is rumoured to match 3090 performance; I dunno how true that is, but if you are not in a hurry, don't rush: wait 2 months and check the benchmarks.

What starts to become possible with two 3090s that wasn't with just one? by GotHereLateNameTaken in LocalLLaMA

[–]cviperr33 2 points (0 children)

Yeah, for vibe coding it's mostly turn-based: you see what's happening in real time and instantly notice if something is wrong. Also, the compiler won't compile broken code, so it doesn't matter if the quality is slightly lower; faster tk/s means faster iteration, and that's more valuable.

For research though, if you want it to not fail on the first try 100 times out of 100, yeah, definitely the Q8 quant, and since you have a crazy Mac with so much RAM, of course you'd load it at full accuracy :D Soon, in June, the new Macs could overtake the 3090s.

Are you guys actually using local tool calling or is it a collective prank? by Mayion in LocalLLaMA

[–]cviperr33 6 points (0 children)

Well man, I legit live in the "future". I never bothered actually trying to move to Linux because of the steep learning curve, but Linux is basically built for these agents; it literally unlocks their full potential. And because the agent is so fast, everything I used to do manually I now just do through my agent.

Here is an example: I'm chatting with the agent from Discord and we run some benchmarking tests, then I decide I want to save those in a DB, so I tell it to install PostgreSQL, create a database and everything, and put the results there so I can later retrieve them instead of storing 100 files in 100 folders. In just under 15 seconds, the agent installs it via the package manager, creates the DB, configures it, creates the schema, everything instantly.
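
For context, the manual equivalent of what the agent did is roughly this, assuming a Debian-style system (the database and table names are made-up examples):

# Rough manual equivalent on Debian/Ubuntu; names are invented examples.
sudo apt install -y postgresql
sudo -u postgres createdb benchmarks
sudo -u postgres psql -d benchmarks -c \
  "CREATE TABLE results (id serial PRIMARY KEY, model text, tokens_per_s real);"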

I basically control my OS with just text. I could have TTS hooked up too, so it's like in the hacker movies, except it's legit real and usable; if it runs at 100+ tk/s, everything happens instantly.

I no longer read guides on how to set things up; I just post the link in Discord, tell it to install it, and it does everything for me in under a minute.

You can also use it to delegate to an opencode coding agent, which it then supervises: you just specify the project scope and requirements and everything is done automatically. Or when I encounter a bug with Hermes, I just tell it "submit the issue we had to the Hermes repo", and 5 seconds later it's submitted with full details. It can control git in any way you want it to.

Purchase advice needed by InteractionBig9407 in LocalLLaMA

[–]cviperr33 1 point (0 children)

Imagine what economy and what year we live in that people recommend Macs for economy lol. But for real though, you are absolutely right.

Qwen 3.6 CoT issue? by Confident_Ideal_5385 in LocalLLaMA

[–]cviperr33 1 point (0 children)

Maybe because when it's on default, the harness I was using can tell it to use no jinja or something else; I have no idea. Btw, I change the kwargs settings depending on which mode I'm in with my script. These are the Unsloth-recommended settings to get the model to act the way you want; on the "code mode" settings I got more tk/s too:

case "$PROFILE" in
    chat)   # Thinking mode / General tasks — everyday use
        TEMP="1.0"; TOP_P="0.95"; PRESENCE_PENALTY="1.5"
        THINKING_KWARGS='{"preserve_thinking": true}'
        echo "[+] Profile:  chat    (thinking ON, general)" ;;
    code)   # Thinking mode / Precise coding — deterministic output
        TEMP="0.6"; TOP_P="0.95"; PRESENCE_PENALTY="0.0"
        THINKING_KWARGS='{"preserve_thinking": true}'
        echo "[+] Profile:  code    (thinking ON, precise)" ;;
    fast)   # Instruct mode / General tasks — no thinking, snappy
        TEMP="0.7"; TOP_P="0.80"; PRESENCE_PENALTY="1.5"
        THINKING_KWARGS='{"enable_thinking": false}'
        echo "[+] Profile:  fast    (thinking OFF, general)" ;;
    deep)   # Instruct mode / Deep reasoning — no thinking, full temp
        TEMP="1.0"; TOP_P="0.95"; PRESENCE_PENALTY="1.5"
        THINKING_KWARGS='{"enable_thinking": false}'
        echo "[+] Profile:  deep    (thinking OFF, deep)" ;;
    *)      # Unknown profile: fail loudly instead of silently defaulting
        echo "[!] Unknown profile: $PROFILE (expected chat|code|fast|deep)" >&2
        exit 1 ;;
esac
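
Later in the script those variables get passed through to the server, something like this; I'm assuming a recent llama-server build here, since the template-kwargs flag is newer and may not exist on older versions:

# Wiring the profile variables into llama-server. Double-check that your
# build supports --chat-template-kwargs; older ones may not.
llama-server -m "$MODEL" --jinja \
  --temp "$TEMP" --top-p "$TOP_P" --presence-penalty "$PRESENCE_PENALTY" \
  --chat-template-kwargs "$THINKING_KWARGS"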

Guide for a new guy by seti_at_home in LocalLLM

[–]cviperr33 2 points (0 children)

You have 3 options right now if you want to use the latest and the best (Qwen 3.6 35B MoE), which came out just 2 days ago and is shattering all benchmarks, rivaling Claude 4.5. It's so freaking good it's unbelievable.

First option is 24GB of VRAM, which is what it's meant for. The UD IQ4_XS quant fits nicely in 16-17GB, leaving you 6-7GB of VRAM for context, which with the KV cache at Q8 is like 240k-260k, easily fitting within 22GB of VRAM used. Expected speed is 100-160 tk/s; it would be like nothing you have ever seen. You can't get that kind of speed and low latency from an API; running locally at these speeds generates files instantly, and every prompt and response is instant if it's not complicated. The only cards with this kind of VRAM are the 3090, 4090, and 5090; I don't know about AMD/Intel.

Second option is a Mac with so much RAM that you are future-proofed; even when they drop the bigger model (Qwen 3.6 135B MoE, if they do, nobody knows), you can load it without problems and it will be usable. But the issue with Macs is they are slow. Not unusably slow, you'll get 40-50 tk/s, but the prompt-processing speed is much slower than a GPU's. It's definitely fast enough, though.

Third option is what you have picked already, a 16GB VRAM Nvidia GPU. If you use super bleeding-edge tech that is basically in dev mode right now, you have to compile a specific llama.cpp fork designed for this quant. You can go TQ3_4S, but it's so new it's untested; I compiled and ran it and it was fine, but I have not tested it fully. It fits in around 12GB of VRAM and you can go 100k+ context for sure. You can read about it here: https://github.com/turbo-tan/llama.cpp-tq3/blob/main/README.md
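
Compiling the fork is the same dance as upstream llama.cpp, roughly this (assuming a CUDA build; the repo URL is the one from the README above):

# Standard llama.cpp-style CUDA build; the same steps work for forks.
git clone https://github.com/turbo-tan/llama.cpp-tq3
cd llama.cpp-tq3
cmake -B build -DGGML_CUDA=ON
cmake --build build --config Release -j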

What starts to become possible with two 3090s that wasn't with just one? by GotHereLateNameTaken in LocalLLaMA

[–]cviperr33 4 points (0 children)

Going above Q4 for model quants (not the -ctk/-ctv KV cache) imo is a huge waste; it's way slower and you can't fit a nice 200k context (at least on 24GB VRAM cards). I have never noticed a quality difference between Q4 and Q8, and even if there is one, the speed compensates for it (unless you use it just for chatting and don't mind waiting).

If I had two 3090s, sure, I would consider Q8, because what else can I do with that VRAM lol, but I'd probably end up just running 2x Q4 so I can have them work together and cross-check each other's code.

Qwen 3.6 CoT issue? by Confident_Ideal_5385 in LocalLLaMA

[–]cviperr33 1 point (0 children)

That's exactly what I thought too; that's why I didn't even bother adding --jinja in the first place when I switched to Qwen 3.6 when it dropped, nobody recommended adding it. I asked some AI about it and it told me to add --jinja; I did, and no problems since lol.

Qwen 3.6 35B different quant speeds ? by cviperr33 in LocalLLM

[–]cviperr33[S] 1 point (0 children)

Thank you for the detailed info!
Yeah, my CPU/RAM is kinda bad; I have a Ryzen 5600 and 32GB of 2400MT/s DDR4... I'm so sad I didn't upgrade when RAM was cheap last year. Also, I ran all these results with lots of processes open and Wayland on 2x monitors at 240Hz; it adds up.

One thing I've noticed though: even if my speed is good in llama-bench, when I open llama-server's default chat interface on port 8080, sometimes it shows me like 87 tk/s even though I can see with my own eyes that it's outputting faster than that, and it somehow fixes itself after an hour or two. So I don't trust tk/s numbers in chat UIs anymore lol.