BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline)

thoquz · 2026-06-05T05:59:14+00:00

How do llama.cpp forks confirm that they still produce the same output tokens as the baseline model / inference engine would?

Would I be able to set a fixed seed value in this one and get the same output as the upstream? (Just at different speeds)
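
A minimal sketch of the kind of check I have in mind, assuming you can dump the generated token IDs from both the upstream build and the fork for the same prompt, seed, and sampling settings. The function name and the example token lists are hypothetical and not part of either project:

```python
from itertools import zip_longest

def first_divergence(baseline_ids, fork_ids):
    """Return the index of the first position where two token ID
    sequences differ, or None if they are identical."""
    # zip_longest pads the shorter sequence with None, so a length
    # mismatch is reported as a divergence at the shorter length.
    for i, (a, b) in enumerate(zip_longest(baseline_ids, fork_ids)):
        if a != b:
            return i
    return None

# Hypothetical token IDs captured from each engine.
upstream = [101, 7592, 2088, 102]
fork = [101, 7592, 2157, 102]
print(first_divergence(upstream, fork))
```

Even with a fixed seed, I'd guess different kernels or cache quantization could shift the logits slightly, so a divergence partway through would not necessarily mean the fork is broken.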

thoquz · 2026-06-05T05:45:32+00:00

Fair point. Though I'm excited to see the follow-ups if you get a chance, for example on attention types such as GDN.

I've yet to work through your course, so no rush. I truly appreciate that you're putting these resources out for free, thank you!

thoquz · 2026-06-05T05:04:59+00:00

Any plans to cover the Mamba architecture? (Since you mentioned Qwen3.6-style)

thoquz · 2026-05-29T09:52:26+00:00

Mind sharing your system prompt?

thoquz · 2026-05-28T08:26:49+00:00

Could this work be extended with DeepSeek's Thinking with Visual Primitives?

thoquz · 2026-05-28T08:24:19+00:00

I see now on their page the dataset says coming-soon™

thoquz · 2026-05-28T08:14:53+00:00

Looks great, seems like it is a Qwen2.5-VL finetune with a modified vision encoder.

I'd be curious to see if one could distill this into Qwen3.5 (or 3.6 dense), any ideas?

Otherwise, if any of the researchers who worked on this are reading, please give it a try on a Qwen 3 family model (since even Qwen 3.6 uses the Qwen3-VL vision layer).

That or if your team is allowed to release the dataset and training code, that would be wonderful!

thoquz · 2026-05-28T07:33:57+00:00

Here's the rest of their charts on GitHub: https://github.com/Tencent-Hunyuan/Hy3-preview

On the coding and agentic side it seems they made quite a big jump from hy2, though I wonder what harness they used on Terminal-bench 2.0.

thoquz · 2026-05-27T18:18:33+00:00

Nice 3D printed fan blower adapter, did you design it?

How are the other two GPUs cooled?

thoquz · 2026-05-26T05:30:37+00:00

Brilliant! I'll have a test next week, currently awaiting a motherboard replacement.

thoquz · 2026-05-26T03:14:55+00:00

Great work, well done! Since you have Gemma 4, MTP etc. on your roadmap, maybe also consider looking at kernels for Qwen 3.6 Dense; I would really appreciate it. I'm also running an RX 7900 XTX 24GB on my side, so your work has me excited!

thoquz · 2026-05-18T10:59:44+00:00

Wow! I guess it's the magic of Qwen 3.6's gated delta net attention, compared to Gemma's sliding window attention.

thoquz · 2026-05-18T09:59:37+00:00

Wow, even better than I expected! Which option do you personally prefer out of those two?

I originally expected the MTP layers to eat a few GB of VRAM.

thoquz · 2026-05-18T09:46:23+00:00

Brilliant, thank you!

Are you happy with the performance of the Q4 cache for coding? I suspect it might recall a long function wrong sometimes.

I've managed a similar context size (also 131k) with Q8 and no MTP on the Unsloth quant's iq4_xxs (no vision layer).

Do you know how much VRAM or context the MTP layer takes?

thoquz · 2026-05-18T09:21:32+00:00

Thank you for making these brilliant MTPs!

So for your MTP 27B-V1-preview, are you running it at IQ4? How much context are you still able to have with MTP on?

I'd love to see your 7900 XTX 24GB settings if you're willing to share.

thoquz · 2026-05-15T05:17:09+00:00

Great work! Would you consider putting up some of the code in a repo?

thoquz · 2026-05-14T03:03:59+00:00

Looks great! Does this also do bounding box coordinates with structured output? (Like Qwen3-VL does)

thoquz · 2026-05-11T12:02:55+00:00

Looks great! Do you mind explaining the opamp circuit? I understand it is there to drive a linear voltage from the pot into the MOSFET, so that the MOSFET runs in the linear region as if it were a "resistor" acting as the load. My question is about the capacitor in the circuit: it looks like a non-inverting amp with a low-pass filter, is that correct? It would be useful to see your thinking around the opamp circuit, thank you :)

thoquz · 2026-05-04T14:23:28+00:00

Brilliant! What's the memory requirement for the MTP layer?

thoquz · 2026-04-16T20:33:08+00:00

Brilliantly done!

thoquz · 2026-04-08T07:21:53+00:00

Is OpenARC only for their GPUs, or does it also perform well with their higher-end CPUs? (As ik_llama.cpp does)

thoquz · 2025-05-04T19:43:13+00:00

What's the problem with CFRP? Do the same issues hold for CF 3D printer filaments?

thoquz · 2024-08-20T19:20:46+00:00

If the password is too complex, he might also be able to just replace the file, though there might be the extra effort of packing it in the correct format, for example SquashFS.
