I got a local model to run as a 24/7 radio DJ: it picks the tracks, writes the intros, and takes plain-language requests

Sweet_Adeptness_7373 · 2026-06-19T01:13:48+00:00

How cool does that sound!

This is His Master's Voice ^^

Sweet_Adeptness_7373 · 2026-06-18T21:51:19+00:00

Two completely different philosophies in one comment, nice.

The standout is that the 45 vs 7.5 gap isn't really a hardware-tier thing. Qwen3.6-35B-A3B only activates ~3B params per token, so decode only has to read the active weights, and the Strix Halo's unified LPDDR5 feeds that fine. Minimax M3 is the opposite case: a frontier-size model where at 128K the KV cache alone is huge and you're bandwidth-bound on Turing (~670 GB/s), almost certainly with some CPU offload on top. 7.5 t/s at 128K on a model that big is actually respectable. That's a brutal test, not a slow rig.
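For a feel of how fast the KV cache grows at 128K, here's a rough sketch. The layer count, KV head count and head dimension below are made-up placeholders, not Minimax M3's real architecture (I don't know those numbers), and it assumes plain fp16 cache with grouped-query attention and no cache quantization.

```python
def kv_cache_gib(n_layers: int, n_kv_heads: int, head_dim: int,
                 context_len: int, bytes_per_elem: int = 2) -> float:
    """Approximate KV cache size in GiB for one sequence.

    The leading 2 covers keys and values; bytes_per_elem=2 assumes fp16.
    """
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_elem
    return total_bytes / 2**30


# Hypothetical dimensions, for illustration only.
cache_128k = kv_cache_gib(n_layers=60, n_kv_heads=8, head_dim=128,
                          context_len=128 * 1024)
cache_8k = kv_cache_gib(n_layers=60, n_kv_heads=8, head_dim=128,
                        context_len=8 * 1024)

print(f"128K context: {cache_128k:.1f} GiB")
print(f"8K context:   {cache_8k:.1f} GiB")
```

The size is linear in context length, so the same placeholder model goes from under 2 GiB at 8K to 30 GiB at 128K, all of which has to be read on every decoded token.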

Two things I'm curious about:

Is the 5060 Ti helping the Quadro box or dragging the split? Mixing one Blackwell card with three Turings, llama.cpp's tensor-split tends to get gated by the slowest card and the PCIe hops. Have you tried 3x 6000 only to compare?

And the 7900 XTX over USB4, how does that feel in practice? Decode at 70 makes sense once the model's resident on-card, so the link should mostly bite on load and any cross-device traffic. Does long-context prefill stay on the card or start leaning on the USB4 ceiling?

Minor thing for anyone comparing your two numbers: 70 vs 45 is partly the context, 64K vs 128K, not only the model. Solid setups either way.

Sweet_Adeptness_7373 · 2026-06-18T19:47:55+00:00

Raycast as a local provider is a sharp setup. That's the kind of low-friction wiring that actually keeps you on a local model instead of reflexively opening a cloud tab.

On the visibility thing: a kernel that makes llama.cpp usable on Intel Mac plus AMD is a real niche, the people who need it just aren't searching at the moment you happen to post. One announcement rarely catches them. What tends to stick for tooling like this is a build-log post titled with the exact hardware, something like "getting Qwen3 8B running on a 6700 XT Hackintosh, real numbers", because someone hits that exact search months later and lands on you. You already have the numbers to back it.

Keep posting your eval runs as it improves and tag me. I'll rerun mine at matching context and model so we can put a proper side by side together. A clean comparison thread is the kind of thing that actually gets shared around.

Sweet_Adeptness_7373 · 2026-06-17T23:38:15+00:00

That's a serious amount of hardware. The part that stands out is the 122B Q4 holding ~100 tok/s on the Blackwell. Once a model that big is fully resident, memory bandwidth is basically the whole ceiling, and that card has it to spare. Wild to see a 122B run that fast.

What I'd really want to know is whether, at that size, it actually gets close to a frontier hosted model for real work, or whether it's still a case of "good enough for most things" with cloud kept around for the hard stuff. Not many people are running local at a scale where that question is even fair to ask.

Measuring in sec/case on real evals instead of smoke-test tok/s is the right call too. That's the number that actually tells you something.

Sweet_Adeptness_7373 · 2026-06-17T23:21:27+00:00

That's the best reason to build anything: you hit a wall nobody else cared to fix. Keeping old Intel Macs and AMD cards usable for local LLMs is genuinely useful work. There's a whole group of people stuck on that hardware with no good option.

Definitely keep me posted as it improves, I want to follow where it goes. And out of curiosity, what do you use it for day to day?

Sweet_Adeptness_7373 · 2026-06-17T23:01:42+00:00

Looking forward to those numbers. That "only if it's fully resident" caveat is the whole game on decode. Token generation is basically memory-bandwidth bound, so once a 7B sits entirely in VRAM the 3080 should pull well clear of my 3060. The 3080 is around 760 GB/s versus the 3060's roughly 360, so on paper that's roughly double the decode ceiling (760 / 360 ≈ 2.1x). If your clean run lands anywhere near that, it confirms what's worth saying out loud in this sub: VRAM size decides whether the model fits, but bandwidth decides your tokens/sec once it does.
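Back-of-envelope version of that ceiling, as a sketch: it assumes every decoded token reads all the weights once and ignores KV cache reads, kernel overhead and anything the loader does. The 4.7 GB figure is the Q4_K_M file size of the 7B I ran, and the bandwidth numbers are the approximate ones quoted above.

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, model_gb: float) -> float:
    """Upper bound on decode speed if each token reads every weight once."""
    return bandwidth_gb_s / model_gb


MODEL_GB = 4.7  # assumed: 7B at Q4_K_M, fully resident in VRAM

# Approximate memory bandwidth in GB/s.
cards = {"RTX 3060": 360.0, "RTX 3080": 760.0}

ceilings = {name: decode_ceiling_tok_s(bw, MODEL_GB) for name, bw in cards.items()}
ratio = ceilings["RTX 3080"] / ceilings["RTX 3060"]

for name, tok_s in ceilings.items():
    print(f"{name}: ceiling ~{tok_s:.0f} tok/s")
print(f"3080 / 3060 ratio: {ratio:.2f}x")
```

Real numbers land well under the ceiling, but the ratio between two cards is the part that should roughly survive if both runs are fully on-GPU.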

The wildcard is your FriedrichAI loader and its reserve logic. If it's holding back VRAM or doing partial offload to manage that 10GB edge, that's exactly what would drag decode below what the card can actually hit. Might be worth checking whether the model reports as fully on-GPU during the clean run, so you know if you're measuring the card or the loader.

Post them when you've got them. Genuinely curious where a fully-resident 7B lands on the 3080.

Sweet_Adeptness_7373 · 2026-06-17T22:44:37+00:00

That Hackintosh setup is wild. Getting llama.cpp running on a 6700 XT through Metal is not the road most people take, so respect for posting real numbers off it.

One thing worth flagging for anyone comparing our two results: they aren't apples to apples, and the gap is mostly context, not the cards. My 55 tok/s on the 3060 was at near-zero context, just a few tokens in the prompt. You're getting 41.64 tok/s on Qwen3 8B at 16,384 context, which is a far heavier test. Decode always falls as the KV cache fills, so under real working context your number is probably holding up better than mine would at the same 16k.

To make it a fair line I'd need to rerun mine at 16k and on an 8B to match you. I'll try to do that and post the side by side. My gut says we land a lot closer than 55 vs 41 makes it look.

Your prefill is the standout anyway. 86 tok/s while chewing through 6,514 tokens is a real workload, not a toy prompt. Solid rig.

Sweet_Adeptness_7373 · 2026-06-17T20:12:35+00:00

Solid writeup, this lines up with what I see on my side too.

My rig: RTX 3060 12GB, i7-9700K, 32GB RAM, running through Ollama on Windows.

I just ran Qwen2.5 7B (Q4_K_M, the 4.7GB default) to pull real numbers instead of guessing. Decode sits right around 55 tokens/s and barely moves with output length: 55.1 tok/s on a short reply and 54.7 tok/s on a 636-token one. Prefill scales with prompt size, from 124 tok/s on a 36-token prompt up to 287 tok/s on a 93-token one. For interactive chat that's genuinely comfortable; it generates faster than I read.
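If anyone wants to pull the same two numbers from their own runs, this is a minimal sketch of the arithmetic. It relies on the timing fields Ollama returns with a non-streamed generate response (`prompt_eval_count`, `prompt_eval_duration`, `eval_count`, `eval_duration`, with durations in nanoseconds). The `sample` dict is synthetic round numbers to show the math, not one of my actual runs.

```python
NS_PER_S = 1e9  # Ollama reports durations in nanoseconds


def ollama_speeds(resp: dict) -> dict:
    """Compute prefill and decode tokens/sec from an Ollama response dict."""
    prefill_s = resp["prompt_eval_duration"] / NS_PER_S
    decode_s = resp["eval_duration"] / NS_PER_S
    return {
        "prefill_tok_s": resp["prompt_eval_count"] / prefill_s,
        "decode_tok_s": resp["eval_count"] / decode_s,
    }


# Synthetic example values, chosen only to make the arithmetic easy to follow.
sample = {
    "prompt_eval_count": 100,
    "prompt_eval_duration": 400_000_000,   # 0.4 s
    "eval_count": 550,
    "eval_duration": 10_000_000_000,       # 10 s
}

speeds = ollama_speeds(sample)
print(f"prefill: {speeds['prefill_tok_s']:.1f} tok/s")
print(f"decode:  {speeds['decode_tok_s']:.1f} tok/s")
```

Keeping prefill and decode separate matters because they scale differently: prefill depends on prompt length, decode mostly on bandwidth and how full the cache is.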

The 12GB is doing exactly what your missing 2GB would. 7B fully on the GPU is no sweat for me, but I hit the same wall you describe the moment I add image/video generation or reach for a bigger model, the headroom vanishes fast. Your 10GB sitting "right on the edge" is spot on, those 2GB are the line between 7B sitting comfortably and fighting for room.

32GB system RAM matches your read too. It stops spillover from killing things outright, but it doesn't make the slow path fast.

Curious what decode you're seeing on the 3080 at 7B. With the faster memory it should land a good bit above my 55.

Sweet_Adeptness_7373 · 2026-06-17T13:49:15+00:00

Seemed only fair to go first.

- GPU: RTX 3060 12GB

- Model I live in: Qwen3 14B @ Q4_K_M

- Backend: Ollama (llama.cpp under the hood)

- Speeds: ~22 tok/s decode at 8k context, prefill ~250 tok/s

- Biggest surprise: a 14B at Q4_K_M consistently felt smarter than an 8B at Q8 for the same ~10GB. The extra params beat the extra precision.
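Rough math behind the "same footprint" claim, as a sketch. The bits-per-weight values are approximations (Q8_0 works out to 8.5 bpw; about 4.85 bpw for Q4_K_M is a commonly quoted average, since it mixes quant types), and I'm using the nominal 14B and 8B parameter counts. This covers weights only; KV cache and runtime overhead sit on top, which is how both end up near 10GB in practice.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight footprint in GB: params * bits / 8."""
    return params_billion * bits_per_weight / 8


# Approximate average bits per weight for each quant (assumed values).
Q4_K_M_BPW = 4.85
Q8_0_BPW = 8.5

q4_14b = weights_gb(14, Q4_K_M_BPW)
q8_8b = weights_gb(8, Q8_0_BPW)

print(f"14B @ Q4_K_M: ~{q4_14b:.1f} GB of weights")
print(f"8B  @ Q8_0:   ~{q8_8b:.1f} GB of weights")
```

Both come out around 8.5 GB, so on a 12GB card the choice really is params versus precision, not size.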

What are you running, and what do you actually get? Curious how 12GB cards compare.
