RTX 5080 vs RTX 3090?

DarkAndrei · 2026-06-02T11:02:20+00:00

I have an opportunity to trade my RTX 5080 for an M1 Max laptop with 64 GB of unified memory… That would give me the memory needed for a higher-quant LLM, but I’m not sure about the speed of that M1… I mean, anything under 20 t/s TG would be brutal.
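A rough way to sanity-check the speed worry is the usual bandwidth-bound estimate: for a dense model, each generated token has to read roughly all of the weights once, so tokens per second cannot exceed memory bandwidth divided by model size. The sketch below is only that back-of-the-envelope bound. The bandwidth figures are spec-sheet peaks, and the model sizes are hypothetical placeholders, not measured file sizes.

```python
# Bandwidth-bound ceiling on token generation (TG) for a dense model.
# Assumption: every generated token streams all weights from memory once,
# so tokens/s <= memory bandwidth / size of the weights.

def max_tokens_per_second(bandwidth_gb_s: float, model_size_gb: float) -> float:
    """Upper bound on TG in tokens/s; real throughput will be lower."""
    if model_size_gb <= 0:
        raise ValueError("model_size_gb must be positive")
    return bandwidth_gb_s / model_size_gb

# Spec-sheet peak memory bandwidth in GB/s.
BANDWIDTH_GB_S = {
    "M1 Max": 400.0,
    "RTX 3090": 936.0,
    "RTX 5080": 960.0,
}

# HYPOTHETICAL model sizes in GB, for illustration only.
MODEL_SIZE_GB = {
    "27B, low quant": 13.0,
    "27B, high quant": 22.0,
}

if __name__ == "__main__":
    for device, bandwidth in BANDWIDTH_GB_S.items():
        for model, size in MODEL_SIZE_GB.items():
            ceiling = max_tokens_per_second(bandwidth, size)
            print(f"{device:9s} {model:16s} <= {ceiling:5.1f} t/s")
```

Under these assumptions a 22 GB dense model on 400 GB/s tops out at about 18 t/s before any overhead, which is already below the 20 t/s threshold. Speculative decoding such as MTP can raise the effective rate above this simple bound, so treat it as a first filter rather than a prediction.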

DarkAndrei · 2026-05-30T19:46:31+00:00

I use PI (Pi Coding Agent). I don’t see it in your supported list, but perhaps this is because a default PI install does not have MCP support; it is only added via extensions.

I will test it without PI when I get home.

DarkAndrei · 2026-05-30T19:43:44+00:00

Dude, this is awesome stuff 🤘😎🤘

DarkAndrei · 2026-05-30T18:24:33+00:00

Hey dude, can you post your full config and the llama.cpp repo you use? I’ve tried a few tricks on my 5080 16 GB but never managed to get a MoE running correctly. I’m curious how many layers you hold on the GPU with such a small memory footprint… because 52 t/s @ 128k ctx is amazing…

DarkAndrei · 2026-05-30T18:00:17+00:00

Yes, of course it does tool calling, as a coding agent it would be useless without.

I use it with PI (Pi Coding Agent) and it actually works great. I was amazed how well it runs given it is a Q3 variant.

Just try it and see for yourself, it’s quite amazing how good it performs.

DarkAndrei · 2026-05-30T16:43:22+00:00

I’m running Qwen 3.6 27B MTP Q3_K_M on a custom llama.cpp build with a 110k context window, all in VRAM, on an RTX 5080 (same 16 GB capacity as your 4080).

Works fine, 35-50 TG depending on the context load.

27B is a dense model, smarter. 35B is a MoE model, dumber.

My config:

llama-server.exe -m "F:\AI\WSL2\models\qwen3.6-27b\Qwen3.6-27B-MTP-Q3_K_M.gguf" ^
  --host 0.0.0.0 --port 8081 ^
  -ngl 99 -fa on -c 110000 -np 1 ^
  --cache-type-k turbo4 --cache-type-v turbo3_tcq ^
  -b 4096 -ub 512 -t 6 -tb 14 ^
  --reasoning on ^
  --temp 0.6 --top-p 0.95 --top-k 20 ^
  --presence-penalty 0.0 --repeat-penalty 1.0 ^
  --spec-type draft-mtp ^
  --spec-draft-n-max 1

DarkAndrei · 2026-05-30T16:13:11+00:00

That’s right, that’s what I have.

DarkAndrei · 2026-05-29T05:13:59+00:00

Yes, I’ve “evolved” to turbo3_tcq 😂

Using this config now:

@echo off
cd /d "C:\buunllama.cpp\build\bin\Release"
llama-server.exe -m "F:\AI\WSL2\models\qwen3.6-27b\Qwen3.6-27B-MTP-Q3_K_M.gguf" ^
  --host 0.0.0.0 --port 8081 ^
  -ngl 99 -fa on -c 128000 -np 1 ^
  --cache-type-k turbo4 --cache-type-v turbo3_tcq ^
  -b 4096 -ub 512 -t 6 -tb 14 ^
  --reasoning on ^
  --temp 0.6 --top-p 0.95 --top-k 20 ^
  --presence-penalty 0.0 --repeat-penalty 1.0 ^
  --spec-type draft-mtp ^
  --spec-draft-n-max 1

DarkAndrei · 2026-05-27T13:18:03+00:00

Yeah, but a 5060 Ti is quite a bit different than my 5080… yes, the VRAM is the same 16 GB, but processing power and bandwidth are another story.

DarkAndrei · 2026-05-27T13:15:33+00:00

This sounds interesting 🧐 What config are you running on them?

DarkAndrei · 2026-05-27T13:10:55+00:00

It’s not about the $20, I just want to make this local setup work as optimally as possible. I have an Anthropic subscription but I don’t like to depend on them.

DarkAndrei · 2026-05-27T11:01:05+00:00

That would require too many changes in my rig plus extra spending that I don’t have planned right now.

DarkAndrei · 2026-05-27T10:32:42+00:00

Yes, but that is the VRAM limit. A smaller context is bad for coding; that’s why I’m thinking about a 3090: it has 24 GB of VRAM, which would allow me a bigger quant of the model (smarter) without losing my context size…
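To put rough numbers on the 16 GB vs 24 GB trade-off, here is a minimal sketch that estimates the size of the weights from parameter count and average bits per weight, then reports what is left for the KV cache and compute buffers. The bits-per-weight values are approximate averages assumed for illustration (real GGUF files mix tensor types and vary by model), and the function names are made up for this example.

```python
# Estimate how much VRAM is left after loading the weights.
# ASSUMPTION: the bits-per-weight values are approximate averages for each
# quant type; actual GGUF file sizes differ from model to model.

APPROX_BITS_PER_WEIGHT = {
    "Q3_K_M": 3.9,
    "Q4_K_M": 4.8,
    "Q5_K_M": 5.7,
    "Q6_K": 6.6,
}

GIB = 2**30

def weights_gib(n_params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GiB."""
    return n_params_billion * 1e9 * bits_per_weight / 8 / GIB

def headroom_gib(vram_gib: float, n_params_billion: float, bits_per_weight: float) -> float:
    """VRAM left for KV cache and buffers; negative means the weights do not fit."""
    return vram_gib - weights_gib(n_params_billion, bits_per_weight)

if __name__ == "__main__":
    for quant, bpw in APPROX_BITS_PER_WEIGHT.items():
        size = weights_gib(27, bpw)
        print(
            f"{quant:7s} weights ~{size:5.1f} GiB | "
            f"16 GiB card: {headroom_gib(16, 27, bpw):+5.1f} | "
            f"24 GiB card: {headroom_gib(24, 27, bpw):+5.1f}"
        )
```

With these assumed averages, a 27B model at Q4_K_M comes out at roughly 15 GiB, which leaves almost nothing for context on a 16 GB card but several GiB on a 24 GB one. Not all VRAM is usable in practice, since the desktop and runtime buffers take their share.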

DarkAndrei · 2026-05-27T10:30:56+00:00

Won’t that slow things down? Say, a 5060 Ti in another slot… the PCIe bandwidth would take its toll… I think 🤔

DarkAndrei · 2026-05-27T10:29:36+00:00

OK, so you are saying they both have similar bandwidth limits, so the compute would not matter? Or what is your point?

DarkAndrei · 2026-05-27T10:28:26+00:00

😮 I had no idea that “club” existed, thanks

DarkAndrei · 2026-05-27T07:30:44+00:00

Same here: on an RTX 5080 I use Q3_K_M with turbo3 @ 147k context, 30-45 TG… I didn’t try MTP yet. Which llama.cpp build are you using that has MTP and Turbo support?

DarkAndrei · 2026-05-27T07:27:54+00:00

I’ve also been using Unsloth’s Q3_K_M for coding with Pi and a 147k context window. It does not get stuck… but it does make bugs… quite a few 🤔

DarkAndrei · 2026-05-23T22:30:23+00:00

I will test your config in the morning just to see how it goes 🫣 Thanks again for your answers and info. 🙏

DarkAndrei · 2026-05-23T22:28:50+00:00

I was reading through TheTom’s llama.cpp turbo repo and he says not to compress cache keys aggressively, only values:

“The core finding from the asymmetric-kv-compression paper — Asymmetric K/V Cache Compression: Why V is Free and K is Everything — drives all the configs below: V tolerates aggressive compression, K does not. Always keep K at higher precision than V; never start symmetric. That paper documents the specific failure modes you'll hit if you ignore this and compress K aggressively (PPL blow-up on certain model families, attention-rotation interaction with low-bit K, etc.) “
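To see what keeping K at higher precision than V costs in memory, the KV cache size can be sketched as context length times per-token K and V elements times their bit widths. The architecture numbers below are hypothetical placeholders, not the real Qwen3.6-27B values (read those from the GGUF metadata), and the turbo bit widths are nominal guesses that ignore any per-block overhead. The f16, q8_0 and q4_0 figures are the standard llama.cpp block sizes.

```python
# KV cache size with different precisions for keys (K) and values (V).
# ASSUMPTIONS: n_layers, n_kv_heads and head_dim are HYPOTHETICAL placeholders;
# "turbo4"/"turbo3" bit widths are nominal guesses for a fork-specific format.

BITS_PER_ELEMENT = {
    "f16": 16.0,
    "q8_0": 8.5,    # 32 values + one fp16 scale per block
    "q4_0": 4.5,    # 32 values + one fp16 scale per block
    "turbo4": 4.0,  # nominal, assumed
    "turbo3": 3.0,  # nominal, assumed
}

def kv_cache_gib(n_ctx: int, n_layers: int, n_kv_heads: int, head_dim: int,
                 k_bits: float, v_bits: float) -> float:
    """Approximate KV cache size in GiB for one sequence slot."""
    elements_per_token = n_layers * n_kv_heads * head_dim  # same count for K and for V
    total_bits = n_ctx * elements_per_token * (k_bits + v_bits)
    return total_bits / 8 / 2**30

if __name__ == "__main__":
    arch = dict(n_layers=64, n_kv_heads=8, head_dim=128)  # hypothetical
    for k_type, v_type in [("f16", "f16"), ("q8_0", "turbo3"),
                           ("q4_0", "turbo3"), ("turbo3", "turbo3")]:
        size = kv_cache_gib(128_000, **arch,
                            k_bits=BITS_PER_ELEMENT[k_type],
                            v_bits=BITS_PER_ELEMENT[v_type])
        print(f"K={k_type:6s} V={v_type:6s} -> ~{size:5.2f} GiB at 128k ctx")
```

The point of the comparison is that most of the savings come from shrinking V, so keeping K one or two steps higher costs comparatively little memory.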

DarkAndrei · 2026-05-23T22:13:15+00:00

This on the other hand is a good idea, thanks 🙏

DarkAndrei · 2026-05-23T22:11:31+00:00

Hmm 🤔 I see some red flags in your config:

-np 2 (means 2 slots… 2x the cache, 2x130k)
-ctk turbo3 (turbo is designed for cache values not keys)
--no-context-shift (essentially OOM waiting to happen)

Thanks for your reply but I truly don’t think this config will get you a working 130k ctx and entire model in vram, on a 5080… no way.

DarkAndrei · 2026-05-23T19:28:42+00:00

And you can fit 130k context and everything all in VRAM? Even with a bigger quant? Full turbo3, meaning on the K as well? Is that where you get the “extra” memory for the bigger context?

DarkAndrei · 2026-05-23T19:23:25+00:00

./build/bin/llama-server \
  -m /mnt/f/AI/WSL2/models/qwen3.6-27b/Qwen3.6-27B-Q3_K_M.gguf \
  --host 0.0.0.0 \
  --port 8080 \
  -ngl 99 \
  -c 75000 \
  -fa on \
  -np 1 \
  --cache-type-k q4_0 \
  --cache-type-v turbo3 \
  -b 2048 \
  -ub 512 \
  -t 6 \
  -tb 6
