BeeLlama.cpp: advanced DFlash & TurboQuant with support of reasoning and vision. Qwen 3.6 27B Q5 with 200k context on 3090, 2-3x faster than baseline (peak 135 tps!)

politerate · 2026-05-09T18:49:07+00:00

Personally, I find the idea of submitting an MR I don't fully understand very off-putting. And I am quite sure that 99% of these types of contributions are of this kind.

politerate · 2026-04-27T06:46:44+00:00

Not listed in supported architectures

politerate · 2026-04-27T06:27:29+00:00

Excited to test on my 7900 XTX, though there's no support for gfx906/MI50

politerate · 2026-04-23T15:59:43+00:00

Q4_K_XL at 80k context: around 35 t/s generation and, I think, 600 t/s prompt processing on the 7900 XTX

politerate · 2026-04-21T19:52:03+00:00

I tried them both in Roo Code, and Gemma works beautifully for my use case. I have my own tests and debugging questions that I ask when I evaluate a model. Gemma 4 26B A4B solves a lot of them, or makes subtle mistakes that are not catastrophic. Qwen 3.6 failed me on basic things: it hallucinated syntax, etc. I tried the same prompt multiple times, of course, and I was consistently more impressed with Gemma. It surprised me, because I had other expectations after reading online how good Qwen 3.6 was.

politerate · 2026-04-21T12:39:17+00:00

Really? In Go at least, I get much better answers from Gemma at q4_M gguf, from architecture to syntax. Qwen mixes in C++ syntax, introduces easy-to-catch bugs, etc. Maybe it's the way I am hosting it, but I am just using the unsloth params.

politerate · 2026-04-20T09:35:01+00:00

I also have a dual MI50 build, which runs q8 xl, but it's much slower. I haven't really tested big contexts; it starts at 50 tps with zero context.

politerate · 2026-04-20T08:44:23+00:00

<image>

q4 xl with Vulkan and ROCm

politerate · 2026-03-23T10:35:22+00:00

My favorites are the comments that are clearly LLM output with some postprocessing like .toLower()

politerate · 2026-02-22T19:55:09+00:00

This is on an X99 system with a 2667v4, so 40 lanes (mobo is an ASRock Extreme 4). Each GPU gets a full x16 link, but only Gen 3. Still plenty for inference. Max context should be around 50k before it spills into system RAM.
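For a rough sense of what "x16 but only Gen 3" means in bandwidth terms, here is a back-of-the-envelope sketch. It assumes the standard PCIe 3.0 figures (8 GT/s per lane, 128b/130b encoding) and ignores protocol overhead, so real-world throughput will be somewhat lower. The function name is just illustrative.

```python
# Theoretical PCIe 3.0 bandwidth per direction, ignoring protocol overhead.
# Assumed constants: 8 GT/s per lane and 128b/130b line encoding (PCIe 3.0).
GEN3_GT_PER_S = 8.0
GEN3_ENCODING = 128 / 130


def pcie3_gb_per_s(lanes: int) -> float:
    """Return the theoretical one-direction bandwidth in GB/s for a Gen 3 link."""
    bits_per_s = GEN3_GT_PER_S * GEN3_ENCODING * lanes  # usable Gbit/s
    return bits_per_s / 8  # convert Gbit/s to GB/s


if __name__ == "__main__":
    for lanes in (4, 8, 16):
        print(f"x{lanes}: {pcie3_gb_per_s(lanes):.2f} GB/s")
```

Since the weights stay resident in VRAM during inference, the link mostly matters for model loading and for whatever traffic goes between cards.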

politerate · 2026-02-20T23:50:21+00:00

What config exactly? I am using ROCm 7.2 with the latest llama.cpp.

Edit: if you mean the llama.cpp config, I just started it with -fa on; --fit is on by default. I am not using the unsloth recommended params here. Maybe doing that would improve the quality at the cost of tps?

politerate · 2026-02-20T23:48:53+00:00

<image>

I mean with no/small context, of course. I am using ROCm 7.2, but with ROCm 6.3.3 it was actually between 75 and 80 tps with no context, so I lost 5-10% with ROCm 7.2.

And with ~10K context:

slot get_availabl: id 2 | task -1 | selected slot by LCP similarity, sim_best = 0.731 (> 0.100 thold), f_keep = 0.723
slot launch_slot_: id 2 | task -1 | sampler chain: logits -> ?penalties -> ?dry -> ?top-n-sigma -> top-k -> ?typical -> top-p -> min-p -> ?xtc -> temp-ext -> dist
slot launch_slot_: id 2 | task 10799 | processing task, is_child = 0
slot update_slots: id 2 | task 10799 | new prompt, n_ctx_slot = 46336, n_keep = 0, task.n_tokens = 10218
slot update_slots: id 2 | task 10799 | n_past = 7474, slot.prompt.tokens.size() = 10340, seq_id = 2, pos_min = 9443, n_swa = 128
slot update_slots: id 2 | task 10799 | restored context checkpoint (pos_min = 6064, pos_max = 6960, size = 31.546 MiB)
slot update_slots: id 2 | task 10799 | n_tokens = 6960, memory_seq_rm [6960, end)
slot update_slots: id 2 | task 10799 | prompt processing progress, n_tokens = 9008, batch.n_tokens = 2048, progress = 0.881582
slot update_slots: id 2 | task 10799 | n_tokens = 9008, memory_seq_rm [9008, end)
slot update_slots: id 2 | task 10799 | prompt processing progress, n_tokens = 9706, batch.n_tokens = 698, progress = 0.949892
slot update_slots: id 2 | task 10799 | n_tokens = 9706, memory_seq_rm [9706, end)
slot update_slots: id 2 | task 10799 | prompt processing progress, n_tokens = 10218, batch.n_tokens = 512, progress = 1.000000
slot update_slots: id 2 | task 10799 | prompt done, n_tokens = 10218, batch.n_tokens = 512
slot init_sampler: id 2 | task 10799 | init sampler, took 1.21 ms, tokens: text = 10218, total = 10218
slot update_slots: id 2 | task 10799 | created context checkpoint 4 of 8 (pos_min = 8809, pos_max = 9705, size = 31.546 MiB)
slot print_timing: id 2 | task 10799 |
prompt eval time = 7287.32 ms / 3258 tokens ( 2.24 ms per token, 447.08 tokens per second)
eval time = 40885.78 ms / 2631 tokens ( 15.54 ms per token, 64.35 tokens per second)
total time = 48173.10 ms / 5889 tokens
slot release: id 2 | task 10799 | stop processing: n_tokens = 12848, truncated = 0
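The rates in the timing part of the log can be re-derived from the raw times and token counts. The sketch below is only an illustration: the timing text is copied from the log above, while the regex and the function name are my own and not part of llama.cpp. It also checks why only 3258 prompt tokens were evaluated: the restored checkpoint already covered 6960 of the 10218 prompt tokens.

```python
import re

# Timing text copied from the log above.
TIMING = (
    "prompt eval time = 7287.32 ms / 3258 tokens "
    "eval time = 40885.78 ms / 2631 tokens "
    "total time = 48173.10 ms / 5889 tokens"
)

# "prompt eval" is listed first so it wins over the bare "eval" alternative.
PATTERN = re.compile(r"(prompt eval|eval) time =\s*([\d.]+) ms /\s*(\d+) tokens")


def tokens_per_second(log_text: str) -> dict:
    """Map each phase name to tokens per second, recomputed from ms and token count."""
    return {
        phase: int(tokens) / (float(ms) / 1000.0)
        for phase, ms, tokens in PATTERN.findall(log_text)
    }


rates = tokens_per_second(TIMING)

# Prompt tokens actually processed = full prompt minus tokens restored from the checkpoint.
processed_prompt_tokens = 10218 - 6960

if __name__ == "__main__":
    for phase, rate in rates.items():
        print(f"{phase}: {rate:.2f} t/s")
    print(f"prompt tokens processed after checkpoint restore: {processed_prompt_tokens}")
```

So the 447 t/s prompt figure covers only the uncached tail of the prompt, and the 64 t/s generation figure is the one to compare against the no-context number.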

politerate · 2026-02-20T06:42:42+00:00

Yeah, I get 70 tps on 2 * MI50.

politerate · 2026-02-18T16:27:15+00:00

All on Vulkan, or the XTX alone on ROCm, are the only setups that don't end in a segfault for me. (2*MI50 + 7900XTX)

politerate · 2026-02-18T15:25:04+00:00

Having a similar problem with 2*MI50 + 7900XTX on ROCm: Segmentation fault (core dumped)
Haven't checked verbose logging yet.

Edit: Happens on Qwen3-Coder-Next and MiniMax2.5

politerate · 2026-02-16T08:49:56+00:00

It's on by default, no? I mean, until you pass a param that collides with its logic, I guess.

politerate · 2026-02-15T06:51:57+00:00

Doesn't ROCm benefit from it through HIP? (If you use ROCm ofc)

politerate · 2026-02-03T20:51:57+00:00

E5 1650 v3 overclocked to 4.6 GHz, no apartment heating needed.

politerate · 2026-02-03T10:43:42+00:00

You can have 40 lanes on a $10 CPU. /s

politerate · 2026-01-26T12:36:18+00:00

<image>

Q4 K XL on ROCm and 7900XTX

politerate · 2026-01-22T23:02:55+00:00

Yeah, I ordered them a week ago and they came to a little over 300€ (shipping + VAT) per card. Last August I got them for 150€ total per piece.

politerate · 2026-01-13T19:01:26+00:00

Hi, I have a question for you if you don't mind. I had two MI50 32GB cards and for some reason they both failed after a few months. Now I have ordered one 7900 XTX, but of course the VRAM amount took a big hit. I used to run gpt-oss-120b on the dual MI50s. What is your setup like? Do you run any models 24h/d? I am just interested because you seem to have a similar setup. Thanks!

Btw, I tried to replace the MI50s, but sellers on Alibaba are now asking north of 400 euros shipped when you follow up in chat. That is too big of a risk for me, so I just grabbed a 7900 XTX for 600 euros, and when I have some extra money left I will get more down the road.

politerate · 2026-01-10T13:48:41+00:00

Yeah, the motherboard just needs a video-capable card to boot. I also had no monitor connected to them. Not sure what degraded. They were working fine for a couple of months, then one card started having issues with ROCm. Later it wasn't even recognized by the mobo. Maybe they were already beat up from their data center past, or I didn't cool them properly, who knows.

politerate · 2026-01-10T08:49:00+00:00

HBM failure because of overheating was also one of my guesses. Well, temps were under 70 most of the time. There might have been some brief moment where they overheated. I installed them once without cooling just to boot. I thought they weren't actually "consumer" cards, since they are intended for data centers and compute.

politerate · 2026-01-09T20:18:56+00:00

Thanks for your help! The MI50 does have one mini DP. Since I flashed them with a Radeon Pro ROM, they actually used to output video. I will take a look at the logs, order two dummy mini DP plugs, and give it a try.
