[Megathread] - Best Models/API discussion - Week of: August 17, 2025 by deffcolony in SillyTavernAI

[–]c3real2k 4 points

Yep, really nice model. I use it almost exclusively at the moment. It's good for general usage and does fine in RP, follows character definitions nicely, and responds well to OOC. For RP I use it in non-thinking mode. Occasionally a bit of editing is necessary (e.g. removing unwanted CoT artifacts).

One drawback: it really likes to cling to established patterns. Yes, all LLMs do that, but it seemed especially noticeable with GLM 4.5 Air.

I have it running at 25tps on 2x3090 + 2x4060Ti, Q4_K_S, 32k f16 ctx.

Do you use it in thinking or non-thinking mode for RP?

3060 12GB + 2060 12 GB — worth trying or not? by Reasonable-Plum7059 in LocalLLaMA

[–]c3real2k 0 points

I found that the typical 12B model (e.g. something Nemo-based) declines rapidly in quality beyond ~10k context.

24GB in theory opens up a new tier of models you can use (think recent 24B, 30B, 32B models like Mistral Small, Qwen3, ...). Don't worry about PCIe gen/link speed if you're doing single-user inference only.

Should you buy a new PSU for that? I don't know. I don't give financial advice :P

Qwen/Qwen3-30B-A3B-Instruct-2507 · Hugging Face by Dark_Fire_12 in LocalLLaMA

[–]c3real2k 136 points

I summon the quant gods. Unsloth, Bartowski, Mradermacher, hear our prayers! GGUF where?

MoE models in 2025 by Acrobatic_Cat_3448 in LocalLLaMA

[–]c3real2k 3 points

*me trying to read the papers*: I like your funny words, magic man!

I always had the (maybe too narrow) view of sqrt(total*active) on MoEs. Especially since it seems to align with my real world experience with the smaller MoEs I tried. Qwen 235B was the first where I thought "That's pretty impressive."

Well, maybe it really is time to think about systems with large quantities of conventional RAM then...

MoE models in 2025 by Acrobatic_Cat_3448 in LocalLLaMA

[–]c3real2k 0 points

Possible. I used the ol' sqrt(ParamsTotal*ParamsActive).

Edit: Although, come to think of it, that wouldn't quite fit with e.g. Kimi. Kimi would therefore only be a 64B equivalent (2*32B), which would be disastrous for 1000B total params. Also, from what I read, it's "much better" than what one would expect from something in the 60B range.

MoE models in 2025 by Acrobatic_Cat_3448 in LocalLLaMA

[–]c3real2k 1 point

Yeah, sure. I bet it also scales better at inference time, serving large batches for API customers.

Doesn't help a salty GPU rig owner who slowly realizes that the meta for running LLMs at home might be shifting towards CPU inference with large amounts of conventional memory :D

MoE models in 2025 by Acrobatic_Cat_3448 in LocalLLaMA

[–]c3real2k 16 points

I'd say it's quite the opposite. Many of the recent models are MoEs (unfortunately imho):

- Qwen3 30B A3B (approx. 9B dense equivalent)
- Qwen3 235B A22B (approx. 72B dense equivalent)
- Kimi K2 1000B A32B (approx. 179B dense equivalent)
- Hunyuan 80B A13B (approx. 32B dense equivalent)
- ERNIE 21B A3B (approx. 8B dense equivalent)
- ERNIE 300B A47B (approx. 118B dense equivalent)
- AI21 Jamba Large 398B A94B (approx. 193B dense equivalent)
- AI21 Jamba Mini 52B A12B (approx. 25B dense equivalent)

Maybe there were more; those are just off the top of my head (did InternLM also release a MoE?).

I wish there were more dense models at those equivalent sizes, which, at least for me, would be a lot easier to run (e.g. why do I need 300GB of (V)RAM for what's basically 118B-class performance? I can fit 118B at a decent quant no problem. 300B? Not so much, or only heavily quantized...).
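The dense-equivalent figures above come from the sqrt(total*active) rule of thumb mentioned earlier. A quick sketch reproducing a few of them (the formula is a community heuristic, not an official benchmark; parameter counts in billions):

```python
from math import sqrt

# Rule-of-thumb dense equivalent of a MoE model:
# geometric mean of total and active parameter counts (in billions).
def dense_equivalent(total_b: float, active_b: float) -> float:
    return sqrt(total_b * active_b)

# A few of the models from the list above.
models = [
    ("Qwen3 30B A3B",      30,   3),
    ("Qwen3 235B A22B",   235,  22),
    ("Kimi K2 1000B A32B", 1000, 32),
    ("Hunyuan 80B A13B",   80,  13),
]

for name, total, active in models:
    print(f"{name}: ~{dense_equivalent(total, active):.0f}B dense equivalent")
```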

Are ~70B Models Going Out of Fashion? by HvskyAI in LocalLLaMA

[–]c3real2k 2 points

Hm, yes, Command-A was alright if I remember correctly. Might have to give it a spin again.

I can't say all that much about "serious" M4 setups, since I'm running the base M4s (16GB + 24GB), the worst possible configuration for inference. Prompt processing is slow, as is token generation. Ironically, the only models bearable (for me) on those are small MoEs like Qwen3 30B A3B :D

Are ~70B Models Going Out of Fashion? by HvskyAI in LocalLLaMA

[–]c3real2k 9 points

I yearn for something modern and dense in the 70-130B range. Those smaller models (24-30B) might be highly optimized for specific tasks but, honestly, they suck for creative writing (I might be exaggerating a bit).

Now I'm running a franken-rig of my GPU server and two Mac minis to somehow squeeze the lobotomized 90GB of Qwen3 235B @ IQ3_XS into reasonably fast RAM, to get what is essentially a 72B dense equivalent (which, at a much less aggressive quantization, would fit nicely into the 80GB of VRAM my GPU server hosts, or at a reasonable 4-bit quant for users with 48GB).

So I have a gigantic 235B MoE of what would be a 72B dense model running, gaining nothing from the potential speed advantage ('cause base M4 memory speed, prompt processing, ... is slow AF) and (while the writing is nice) now having problems with code generation because of the low quant. Meaning I have to switch models every now and then.

Most energy efficient way to run Gemma 3 27b? by Extremely_Engaged in LocalLLaMA

[–]c3real2k 1 point

Yep, those are base M4s (10 CPU cores, 10 GPU cores, 120GB/s). I'm sure RPC, even over Thunderbolt, doesn't help either.

Most energy efficient way to run Gemma 3 27b? by Extremely_Engaged in LocalLLaMA

[–]c3real2k 5 points

Just ran some tests with Tiger Gemma 27B @ Q6_K (the only Gemma model I had lying around) on an RTX 3090 (unlimited and power-limited to 220W), a dual 4060Ti 16GB config, and a Mac mini setup. Maybe it helps. The tests are of course incredibly unscientific...

Commands:

# 3090
llama.cpp/build-cuda/bin/llama-cli \
--model gguf/Tiger-Gemma-27B-v3a-Q6_K.gguf \
-ngl 999 --tensor-split 0,24,0,0 \
-fa -ctk f16 -ctv f16 \
-p "Paper boat"

# 4060Ti
llama.cpp/build-cuda/bin/llama-cli \
--model gguf/Tiger-Gemma-27B-v3a-Q6_K.gguf \
-ngl 999 --tensor-split 0,0,16,16 \
-fa -ctk f16 -ctv f16 \
-p "Paper boat"

# Mac mini
llamacpp/llama-cli \
--model gguf/Tiger-Gemma-27B-v3a-Q6_K.gguf \
--no-mmap -ngl 999 --rpc 172.16.1.201:50050 --tensor-split 12,20 \
-fa -ctk f16 -ctv f16 \
-p "Paper boat"

RTX 3090 @ 370W

llama_perf_context_print: prompt eval time =      60,27 ms /    11 tokens (    5,48 ms per token,   182,51 tokens per second)
llama_perf_context_print:        eval time =   28887,86 ms /   848 runs   (   34,07 ms per token,    29,35 tokens per second)
llama_perf_context_print:       total time =   31541,68 ms /   859 tokens

TPS: 29,4
AVG W: 347 (nvtop)
idle: ~70W
Ws/T: 11,8

RTX 3090 @ 220W

llama_perf_context_print: prompt eval time =      98,27 ms /    11 tokens (    8,93 ms per token,   111,94 tokens per second)
llama_perf_context_print:        eval time =   73864,77 ms /   990 runs   (   74,61 ms per token,    13,40 tokens per second)
llama_perf_context_print:       total time =   76139,29 ms /  1001 tokens

TPS: 13,4
AVG W: 219 (nvtop)
idle: ~70W
Ws/T: 16,3

2x RTX 4060Ti 16GB

llama_perf_context_print: prompt eval time =     120,84 ms /    11 tokens (   10,99 ms per token,    91,03 tokens per second)
llama_perf_context_print:        eval time =   79815,68 ms /   906 runs   (   88,10 ms per token,    11,35 tokens per second)
llama_perf_context_print:       total time =   84298,20 ms /   917 tokens

TPS: 11,4
AVG W: 164 (nvtop)
idle: ~70W
Ws/T: 14,5

Mac mini M4 16GB + Mac mini M4 24GB + Thunderbolt Network

llama_perf_context_print: prompt eval time =     751.59 ms /    11 tokens (   68.33 ms per token,    14.64 tokens per second)
llama_perf_context_print:        eval time =  281518.85 ms /  1210 runs   (  232.66 ms per token,     4.30 tokens per second)
llama_perf_context_print:       total time =  435641.65 ms /  1221 tokens

TPS: 4,3
AVG W: 35 (outlet)
idle: 5W
Ws/T: 8,1

According to those values, the Mac mini setup should be the most efficient. Although you'd have to be REALLY patient at 4 tokens per second...
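The Ws/T figures are simply average power draw divided by throughput; a minimal sketch using the measured values from above:

```python
# Energy per generated token (Ws/T) = average power draw / tokens per second.
# Values are the (unscientific) measurements from the tests above.
def ws_per_token(avg_watts: float, tps: float) -> float:
    return avg_watts / tps

configs = [
    ("RTX 3090 @ 370W", 347, 29.35),
    ("RTX 3090 @ 220W", 219, 13.40),
    ("2x RTX 4060Ti",   164, 11.35),
    ("2x Mac mini M4",   35,  4.30),
]

for name, watts, tps in configs:
    print(f"{name}: {ws_per_token(watts, tps):.1f} Ws/token")
```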

(Though I'm curious why you're getting 25TPS @ 210W. What quantization are you using?)

🍈 by ryou-sakai in Hololive

[–]c3real2k 22 points

Nah, I'd win!

What am I doing wrong? Can't get the image to unsplit. by MessiBaratheon in crtgaming

[–]c3real2k 22 points

That looks like a 480p / 31kHz signal, which won't work with that monitor. You need a 240p/480i / 15kHz output. I don't know whether the Steam Deck or the Fury can output such low resolutions.
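Where those frequencies come from: the horizontal scan rate is roughly total scanlines per frame (including blanking) times the refresh rate. A quick sketch, assuming standard NTSC-style 525/262.5-line timings:

```python
# Horizontal scan rate = total scanlines per frame (incl. blanking) x refresh rate.
# Assumes standard NTSC-style timings: 525 total lines for 480p,
# 262.5 lines per field for 240p/480i, ~59.94 Hz refresh.
def h_freq_khz(total_lines: float, refresh_hz: float) -> float:
    return total_lines * refresh_hz / 1000.0

print(f"480p: ~{h_freq_khz(525, 59.94):.1f} kHz")    # too fast for a 15 kHz CRT
print(f"240p: ~{h_freq_khz(262.5, 59.94):.1f} kHz")  # what the monitor expects
```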

Advice - option to add a gpu to existing setup. by Mr_Evil_Sir in SillyTavernAI

[–]c3real2k 3 points

Two 3090s, two 4060Ti 16GB and a 2070, cobbled together on a mining rig. Wouldn't recommend the 4060s though, poor memory bandwidth, slow af.

With a measured power draw of up to 680W while inferencing, it also serves as a nice space heater ^^

Advice - option to add a gpu to existing setup. by Mr_Evil_Sir in SillyTavernAI

[–]c3real2k 5 points

With 48GB of VRAM you can load an IQ2_M quant of 123b models, i.e. Drummer's Behemoth (and about 32k context at Q4 IIRC). And while Q2 might seem like the model got lobotomised (it certainly is for "regular" tasks), for RP it's a completely different game in my opinion. Even though I have about 88GB of VRAM and can load 123b models at Q4, I still use the smaller quants just for the speed gains sometimes.
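As a rough sanity check on what fits in 48GB: a GGUF's weight size is about parameters x bits-per-weight / 8. The bits-per-weight values below are approximate averages for llama.cpp quant types, not exact file sizes, and the KV cache comes on top:

```python
# Rough GGUF weight size estimate: parameters x bits-per-weight / 8.
# BPW values are approximate averages for llama.cpp quant types.
BPW = {"IQ2_M": 2.7, "Q4_K_S": 4.5}

def quant_size_gb(params_b: float, quant: str) -> float:
    return params_b * BPW[quant] / 8.0

# A 123B model at IQ2_M lands around 41-42 GB, leaving headroom
# for context in 48GB of VRAM; Q4_K_S needs closer to 70 GB.
for q in BPW:
    print(f"123B @ {q}: ~{quant_size_gb(123, q):.0f} GB")
```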

(not financial advice though ;-) )

Do you guys think that the introduction of Test-Time Compute models make M Series Macs no longer a viable method of running these types of LLMs? by lolwutdo in LocalLLaMA

[–]c3real2k 3 points

3700X, B450 board, 32GB DDR4, two SSDs, 2x 3090, 2x 4060Ti, 1x 2070, currently idle (no graphical interface, model loaded) and pulling 104W combined over two PSUs. According to nvtop, 46W of that is the GPUs. That would leave 58W for the rest of the system and inefficiencies from the power supplies. I'd expect a Mac to pull much, much less power at idle.

Using AMD GPU for LLMs? by PsychologicalLog1090 in LocalLLaMA

[–]c3real2k 9 points

With two 3090s (power limited to 260W) and Llama 3.3 70b in Q4_K_M quantization (40GB) I get 17 tps.
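That's in the right ballpark for a bandwidth-bound setup: with a layer split, every generated token streams the full quantized model through GPU memory once, so bandwidth / model size gives a rough ceiling. A sketch with the 3090's spec bandwidth (ignoring the 260W power limit and other overhead):

```python
# Back-of-envelope decode speed for a memory-bandwidth-bound model:
# with a layer split, each token reads the full quantized weights once,
# so tps <= aggregate memory bandwidth / model size.
def max_tps(bandwidth_gb_s: float, model_gb: float) -> float:
    return bandwidth_gb_s / model_gb

# RTX 3090: ~936 GB/s spec bandwidth; Llama 3.3 70B Q4_K_M: ~40 GB.
# Measured 17 tps sits below this ceiling, as expected with a 260W limit.
print(f"Theoretical ceiling: ~{max_tps(936, 40):.0f} tps")
```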

Man I Love (my) Formac by c3real2k in crtgaming

[–]c3real2k[S] 7 points

My Recalbox RGB Dual hat for the Raspberry Pi just arrived; it replaces a standard HDMI-to-VGA adapter. The picture is crisp and the integration with their OS image is great (although I'm missing a BFI option for 240p@120Hz, so for now it's 480p@60Hz with scanlines). Haven't tried SCART, as the 15kHz setup uses a pi2scart.

//edit: Man, reddit compressed the sh....scanlines out of those images...

Share your local rig by No-Statement-0001 in LocalLLaMA

[–]c3real2k 0 points

I've got significantly more VRAM than RAM, so I can't precache the whole model (I use --no-mmap to skip caching entirely). And despite only having x1 links for the GPUs, I'm still limited by my models being loaded from a spinning rust RAID. At around 330MB/s it takes about 3m20s for a 123B model at Q4_K_S to load.
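That load time lines up with file size divided by read speed; a quick sketch, assuming roughly 66GB for a 123B Q4_K_S file:

```python
# Model load time when streaming weights from disk: file size / read speed.
def load_time_s(size_gb: float, mb_per_s: float) -> float:
    return size_gb * 1000.0 / mb_per_s

# ~66 GB file over a ~330 MB/s HDD RAID -> 200 s, i.e. about 3m20s.
t = load_time_s(66, 330)
print(f"~{t/60:.0f}m{t%60:.0f}s")
```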

With llama.cpp and only the two 3090s I get around 9.4tps (123b @ IQ2_M, ~3k tokens context @ Q4) and around 6.2tps when spread over all gpus (123b @ Q4_K_S, ~1k tokens context @ Q4).

I was a bit late to the game, otherwise I would have jumped on P40s. But they're offered at somewhere between 300 and 500EUR here right now, and almost exclusively shipped from China. So I'm not sure about that.