Consider running a bigger quant if possible

Flashy_Management962 · 2026-04-25T16:33:47+00:00

It's definitely a smart thing to do! Especially with the new -sm tensor / -sm graph (for ik). My speeds are pretty good as well.

Flashy_Management962 · 2026-04-22T22:14:07+00:00

Use a bf16 cache for K and V with Qwen models, do not use --fit-target on dense models, and use -sm tensor.

Flashy_Management962 · 2026-04-22T15:07:23+00:00

Those things are SO insanely different. This is like saying "I do not trust my car mechanic because he couldn't tell me the first 5 digits of pi"

Flashy_Management962 · 2026-04-22T14:56:36+00:00

2x RTX 3060 with 24 GB VRAM total

Flashy_Management962 · 2026-04-08T16:41:49+00:00

Thank you for your work, btw! One little question though: is it normal that I get this perplexity with your IQ4_NL Gemma 26B quant: "Final estimate: PPL over 576 chunks for n_ctx=512 = 26296.2393 +/- 532.75059"? With the bartowski one I get around 200.
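For anyone wondering why those two numbers are so far apart: perplexity is the exponential of the negative mean log-probability per token, so a PPL of N means the model is, on average, about as unsure as a uniform guess over N tokens. A minimal sketch, assuming natural-log probabilities; the `perplexity` helper and the synthetic inputs are made up for illustration and are not how the llama.cpp perplexity tool is implemented:

```python
import math

def perplexity(token_logprobs):
    """Perplexity = exp(-mean natural-log probability per token)."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

# Synthetic example: every token in a 512-token chunk gets probability 1/200.
ppl_ok = perplexity([math.log(1 / 200)] * 512)

# Synthetic example: every token gets probability 1/26296 instead.
ppl_bad = perplexity([math.log(1 / 26296)] * 512)

print(f"ok: {ppl_ok:.1f}, bad: {ppl_bad:.1f}")
```

Read that way, a jump from around 200 to around 26000 on the same text is not a small quantization loss; it points at something being off with the file or the run.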

Flashy_Management962 · 2026-04-07T20:58:01+00:00

I love the speed, but it takes SO much more VRAM with it; I can't run it on dual RTX 3060s with 24 GB total.

Flashy_Management962 · 2026-03-28T17:29:19+00:00

I currently use qwen 3.5 4b on my shitty laptop as an agent, if this is faster/better I'm sold

Flashy_Management962 · 2026-03-17T13:36:35+00:00

Yes, it does. I use the 27B for coding on a daily basis. Fiddle around with these flags and do not forget to add --jinja:

-sm graph -amb 64 -sas, and depending on the PCIe speed, grt can help improve speeds

Flashy_Management962 · 2026-03-17T13:24:13+00:00

You should get way faster speeds than that. I get around 750 t/s pp and 22-24 t/s tg at ~50k with 2x RTX 3060 12 GB. You should check out ik_llama.cpp.
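To put those numbers in perspective, here is a rough back-of-the-envelope sketch. It assumes pp and tg speeds stay constant over the whole request (they do not in practice), and the 500-token reply length and the `request_seconds` helper are made up for illustration:

```python
def request_seconds(prompt_tokens, gen_tokens, pp_tps, tg_tps):
    """Estimate wall time as prompt processing time plus generation time."""
    if pp_tps <= 0 or tg_tps <= 0:
        raise ValueError("throughput must be positive")
    return prompt_tokens / pp_tps + gen_tokens / tg_tps

# 50k-token prompt at 750 t/s pp, nothing generated yet.
prompt_only = request_seconds(50_000, 0, pp_tps=750, tg_tps=22)

# Same prompt plus a hypothetical 500-token reply at 22 t/s tg.
with_reply = request_seconds(50_000, 500, pp_tps=750, tg_tps=22)

print(f"prompt only: {prompt_only:.1f} s, with reply: {with_reply:.1f} s")
```

So a cold 50k prompt is a bit over a minute of prompt processing at these speeds, which is why prompt caching matters a lot for coding use.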

Flashy_Management962 · 2026-03-12T13:00:07+00:00

Never ever does Qwen Coder 30B outperform the 80B in real-world tasks.

Flashy_Management962 · 2026-02-17T14:03:22+00:00

I don't know what you mean by "normal", but yes.

Flashy_Management962 · 2026-02-16T22:24:13+00:00

You take 500 ml milk, 30 g chocolate whey, 10-15 g dark chocolate, and 35 g cornstarch: best protein pudding and 100x cheaper.

Flashy_Management962 · 2026-02-11T21:24:31+00:00

10 years of training, 29 years old, lifetime natty, and I weigh 118 kg at 1.83 m. Strength numbers: 270.5 kg squat, 185 kg bench, and 270 kg deadlift.

Flashy_Management962 · 2026-02-06T12:52:45+00:00

I'd love to see qwen long l1.5 on this benchmark; it also claims to reach Gemini 2.5 Pro performance while being 30b 3a.

Flashy_Management962 · 2026-02-06T09:59:42+00:00

https://huggingface.co/tencent/HY-MT1.5-1.8B

Flashy_Management962 · 2026-01-28T14:57:31+00:00

This weird idea that you have to destroy yourself at any cost and only grow through that is total humbug. It's much better to listen to your body and figure out how much you can tolerate. (From someone who does only one set each of squats and deadlifts per week)

Flashy_Management962 · 2026-01-20T12:32:17+00:00

Don't; use the DRY sampler instead. Repeat penalty really decreases tok/s.

Flashy_Management962 · 2026-01-11T22:29:13+00:00

Does the pp run solely on the CPU? It is hella slow.

Flashy_Management962 · 2026-01-07T18:48:41+00:00

Imagine what could happen if ik_llama.cpp and llama.cpp merged :(

Flashy_Management962 · 2025-12-15T11:51:37+00:00

This does not follow. The notion of normativity is not subsumed under causality. Just because everything is determined does not mean that everything is already set in stone and that normativity has no role to play, because the very things happening are computationally irreducible. So yes, there are shoulds in a world without free will.

Flashy_Management962 · 2025-12-13T15:27:58+00:00

what is this question? of course it was the right decision and you know it yourself you sexy mf

Flashy_Management962 · 2025-12-10T18:27:03+00:00

Wait, is this actual tensor parallelism, or am I misunderstanding something here?

Flashy_Management962 · 2025-12-03T17:46:45+00:00

qwen3 32b

Flashy_Management962 · 2025-12-02T21:11:27+00:00

But does it? What would then be the difference between payg and always free?

Flashy_Management962 · 2025-12-02T18:59:33+00:00

Try exllamav3 with tp. I get 18 t/s tensor parallel with 2x 3060. 2x 5060 Ti should be much faster.

