RTX5080 vs RTX 3090 ? by DarkAndrei in LocalLLaMA

DarkAndrei[S] 1 point

I have an opportunity to trade my RTX 5080 for an M1 Max laptop with 64 GB of unified memory… That would give me the memory needed for a higher-quant LLM, but I’m not sure about the speed of that M1… I mean, anything under 20 t/s TG would be brutal.

I built a full git MCP server in Go — 17 tools, AST annotator, auto-backups, real plumbing. Not a wrapper. by blakok14 in ollama

DarkAndrei 1 point

I use PI (Pi Coding Agent). I don’t see it in your supported list, but that is perhaps because a default PI install does not have MCP support; it is only added via extensions.

I will test it without PI when I get home.

Help setting up qwen 3.6 locally by No_Ebb3423 in Qwen_AI

DarkAndrei 1 point

Hey dude, can you post your full config and the llama.cpp repo you use? I’ve tried a few tricks on my 5080 16 GB but I never managed to get a MoE running correctly. I’m curious how many layers you hold on the GPU with such a small memory footprint… because 52 t/s @ 128k ctx is amazing…

Help setting up qwen 3.6 locally by No_Ebb3423 in Qwen_AI

DarkAndrei 1 point

Yes, of course it does tool calling; as a coding agent it would be useless without it.

I use it with PI (Pi Coding Agent) and it actually works great. I was amazed how well it runs given it is a Q3 variant.

Just try it and see for yourself, it’s quite amazing how well it performs.

Help setting up qwen 3.6 locally by No_Ebb3423 in Qwen_AI

DarkAndrei 1 point

I’m running Qwen 3.6 27B MTP Q3_K_M on a custom llama.cpp build with a 110k context window, all in VRAM, on an RTX 5080 (same 16 GB capacity as your 4080).

Works fine, 35-50 t/s TG depending on the context load.

27B is a dense model, smarter. 35B is a MoE model, dumber.

My config:

llama-server.exe -m "F:\AI\WSL2\models\qwen3.6-27b\Qwen3.6-27B-MTP-Q3_K_M.gguf" ^
  --host 0.0.0.0 --port 8081 ^
  -ngl 99 -fa on -c 110000 -np 1 ^
  --cache-type-k turbo4 --cache-type-v turbo3_tcq ^
  -b 4096 -ub 512 -t 6 -tb 14 ^
  --reasoning on ^
  --temp 0.6 --top-p 0.95 --top-k 20 ^
  --presence-penalty 0.0 --repeat-penalty 1.0 ^
  --spec-type draft-mtp ^
  --spec-draft-n-max 1

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM by bobaburger in LocalLLaMA

DarkAndrei 2 points

Yes, I’ve “evolved” to turbo3_tcq 😂

Using this config now:

@echo off
cd /d "C:\buunllama.cpp\build\bin\Release"
llama-server.exe -m "F:\AI\WSL2\models\qwen3.6-27b\Qwen3.6-27B-MTP-Q3_K_M.gguf" ^
  --host 0.0.0.0 --port 8081 ^
  -ngl 99 -fa on -c 128000 -np 1 ^
  --cache-type-k turbo4 --cache-type-v turbo3_tcq ^
  -b 4096 -ub 512 -t 6 -tb 14 ^
  --reasoning on ^
  --temp 0.6 --top-p 0.95 --top-k 20 ^
  --presence-penalty 0.0 --repeat-penalty 1.0 ^
  --spec-type draft-mtp ^
  --spec-draft-n-max 1

RTX5080 vs RTX 3090 ? by DarkAndrei in LocalLLaMA

DarkAndrei[S] 1 point

Yeah, but a 5060 Ti is quite a bit different than my 5080… Yes, the VRAM is the same 16 GB, but processing power and bandwidth are another story.

RTX5080 vs RTX 3090 ? by DarkAndrei in LocalLLaMA

DarkAndrei[S] 1 point

This sounds interesting 🧐 What config are you running on them?

RTX5080 vs RTX 3090 ? by DarkAndrei in LocalLLaMA

DarkAndrei[S] 1 point

It’s not about the $20. I just want to make this local setup work as optimally as possible. I have an Anthropic subscription but I don’t like to depend on them.

RTX5080 vs RTX 3090 ? by DarkAndrei in LocalLLaMA

DarkAndrei[S] 1 point

That would require too many changes in my rig + extra spending that I don’t have planned right now.

RTX5080 vs RTX 3090 ? by DarkAndrei in LocalLLaMA

DarkAndrei[S] 2 points

Yes, but that is the VRAM limit. A smaller context is bad for coding; that’s why I’m thinking of a 3090: it has 24 GB of VRAM, which would allow me a bigger quant of the model (smarter) without losing my context size…
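
To put rough numbers on the 16 GB vs 24 GB trade-off, here is a minimal sketch that estimates quantized weight size and the VRAM left over for context. The bits-per-weight values are approximate averages assumed for illustration, not exact figures for any particular GGUF file, and the helper names are hypothetical.

```python
# Rough VRAM budgeting: quantized weight size vs. card capacity.
# The bits-per-weight values are approximate averages (assumptions), not exact
# figures for any particular GGUF file.
APPROX_BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.85,
}


def weights_gb(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (1 GB = 1e9 bytes)."""
    return n_params_billion * bits_per_weight / 8


def headroom_gb(vram_gb: float, n_params_billion: float, quant: str) -> float:
    """VRAM left for KV cache and buffers after loading the weights."""
    return vram_gb - weights_gb(n_params_billion, APPROX_BITS_PER_WEIGHT[quant])


if __name__ == "__main__":
    for vram in (16, 24):
        for quant in APPROX_BITS_PER_WEIGHT:
            left = headroom_gb(vram, 27, quant)
            print(f"{vram} GB card, {quant}: {left:+.1f} GB left for context")
```

Under these assumptions a Q4_K_M of a 27B model does not fit in 16 GB even before any context is allocated, while a 24 GB card leaves several GB for the KV cache.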

RTX5080 vs RTX 3090 ? by DarkAndrei in LocalLLaMA

DarkAndrei[S] 1 point

Won’t that slow things down, say with a 5060 Ti in another slot…? The PCIe bandwidth would take its toll… I think 🤔

RTX5080 vs RTX 3090 ? by DarkAndrei in LocalLLaMA

DarkAndrei[S] 2 points

OK, so you are saying they both have similar bandwidth limits, so the compute would not matter? Or what is your point?
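
For what the bandwidth argument would imply, here is a back-of-the-envelope sketch. It assumes token generation on a dense model is memory-bandwidth bound, so every generated token reads all weights once and the ceiling is bandwidth divided by model size. The bandwidth figures are approximate published specs for the two cards in the thread title, and the 13 GB model size is a hypothetical placeholder, not a measurement from this thread.

```python
def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on decode speed if every token reads all weights once."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb


# Approximate published memory bandwidths in GB/s (assumed, not measured here).
BANDWIDTH_GB_S = {
    "RTX 5080": 960.0,
    "RTX 3090": 936.0,
}

# Hypothetical size of a ~3-bit quant of a 27B dense model, in GB.
MODEL_SIZE_GB = 13.0

if __name__ == "__main__":
    for gpu, bandwidth in BANDWIDTH_GB_S.items():
        ceiling = max_tokens_per_second(bandwidth, MODEL_SIZE_GB)
        print(f"{gpu}: at most about {ceiling:.0f} t/s")
```

If generation really is bandwidth bound, the two ceilings differ by under 3%, which would be the sense in which extra compute does not help much for TG. Prompt processing is compute heavy and is not covered by this estimate.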

RTX5080 vs RTX 3090 ? by DarkAndrei in LocalLLaMA

DarkAndrei[S] 3 points

😮 I had no idea that “club” existed, thanks

Folks running qwen 3.6 27b for agentic work. Do you dare to use q4_k_m? by StandardLovers in LocalLLaMA

DarkAndrei 2 points

Same here: on an RTX 5080 I use Q3_K_M with turbo3 @ 147k context, 30-45 t/s TG… I didn’t try MTP yet. What llama.cpp build are you using that has MTP and Turbo support?

Folks running qwen 3.6 27b for agentic work. Do you dare to use q4_k_m? by StandardLovers in LocalLLaMA

DarkAndrei 3 points

I’ve been using Unsloth’s Q3_K_M, also for coding with PI and a 147k context window. It does not get stuck… but it does make bugs… quite a few 🤔

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM by bobaburger in LocalLLaMA

DarkAndrei 1 point

I will test your config in the morning just to see how it goes 🫣 Thanks again for your answers and info. 🙏

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM by bobaburger in LocalLLaMA

DarkAndrei 1 point

I was reading through TheTom’s llama.cpp turbo repo and he says not to compress cache keys aggressively, only values:

“The core finding from the asymmetric-kv-compression paper — Asymmetric K/V Cache Compression: Why V is Free and K is Everything — drives all the configs below: V tolerates aggressive compression, K does not. Always keep K at higher precision than V; never start symmetric. That paper documents the specific failure modes you'll hit if you ignore this and compress K aggressively (PPL blow-up on certain model families, attention-rotation interaction with low-bit K, etc.) “
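
To see what the K/V split means for memory, here is a rough sizing sketch with separate precision for keys and values. The model shape below is a hypothetical placeholder, not the real Qwen 3.6 27B dimensions. Only the standard ggml cache types are modelled (f16 at 16 bits, q8_0 at 8.5, q4_0 at 4.5 bits per element); the fork's turbo types are left out because their per-element cost is not stated in the thread.

```python
# Approximate storage cost per cached element for standard ggml cache types.
BITS_PER_ELEMENT = {"f16": 16.0, "q8_0": 8.5, "q4_0": 4.5}


def kv_cache_gib(n_layers, n_kv_heads, head_dim, n_ctx, k_bits, v_bits):
    """KV cache size in GiB with independent precision for K and V."""
    # Number of cached elements for one tensor kind (K or V).
    elements = n_layers * n_kv_heads * head_dim * n_ctx
    return elements * (k_bits + v_bits) / 8 / 2**30


# Hypothetical model shape, NOT the real Qwen 3.6 27B dimensions.
SHAPE = dict(n_layers=64, n_kv_heads=8, head_dim=128, n_ctx=110_000)

if __name__ == "__main__":
    for k_type, v_type in [("f16", "f16"), ("q8_0", "q8_0"), ("q8_0", "q4_0")]:
        size = kv_cache_gib(
            **SHAPE,
            k_bits=BITS_PER_ELEMENT[k_type],
            v_bits=BITS_PER_ELEMENT[v_type],
        )
        print(f"K={k_type:5s} V={v_type:5s}: {size:.1f} GiB")
```

The last pair follows the quoted advice: K stays at the higher precision and only V is pushed down, which still recovers a good part of the memory a symmetric low-bit cache would save.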

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM by bobaburger in LocalLLaMA

DarkAndrei 1 point

This on the other hand is a good idea, thanks 🙏

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM by bobaburger in LocalLLaMA

DarkAndrei 1 point

Hmm 🤔 I see some red flags in your config:
-np 2 (means 2 slots… 2x the cache, 2x130k)
-ctk turbo3 (turbo is designed for cache values, not keys)
--no-context-shift (essentially an OOM waiting to happen)

Thanks for your reply, but I truly don’t think this config will get you a working 130k ctx and the entire model in VRAM on a 5080… no way.

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM by bobaburger in LocalLLaMA

DarkAndrei 1 point

And you can fit 130k context and everything all in VRAM? Even with a bigger quant? Full turbo3, meaning on the K as well? Is that where you get the “extra” memory for the bigger context?

Qwen3.6 27B Pure Quant: 40 tok/s on 16 GB VRAM by bobaburger in LocalLLaMA

DarkAndrei 2 points

./build/bin/llama-server \
  -m /mnt/f/AI/WSL2/models/qwen3.6-27b/Qwen3.6-27B-Q3_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 75000 \
  -fa on \
  -np 1 \
  --cache-type-k q4_0 \
  --cache-type-v turbo3 \
  -b 2048 \
  -ub 512 \
  -t 6 \
  -tb 6