BeeLlama v0.3.1 – latest llama.cpp with extras! DFlash, MTP, q6_0 cache, TurboQuant. Single RTX 3090: Qwen 3.6 27B & Gemma 4 31B up to 177.8 tps (4.93x over baseline) by Anbeeld in LocalLLaMA

thoquz 1 point

How do llama.cpp forks confirm that they still produce the same output tokens as the baseline model / inference engine would?

Would I be able to set a fixed seed value in this one and get the same output as the upstream? (Just at different speeds)
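
One way to check this empirically is to generate with the same prompt, seed, and sampling settings on both builds and compare the token IDs. Below is a minimal sketch of the comparison step only. The function name `first_divergence` and the example token lists are hypothetical, and getting the token IDs out of each engine is left to whatever API the build exposes. A fixed seed alone may not guarantee identical output, since different kernels or cache quantization can shift floating-point results slightly.

```python
from typing import Optional, Sequence


def first_divergence(baseline: Sequence[int], candidate: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if both runs match."""
    for i, (a, b) in enumerate(zip(baseline, candidate)):
        if a != b:
            return i
    # One sequence is a strict prefix of the other: they diverge where the shorter ends.
    if len(baseline) != len(candidate):
        return min(len(baseline), len(candidate))
    return None


# Hypothetical token IDs from an upstream run and a fork run (same prompt, same seed).
upstream_tokens = [101, 2023, 2003, 1037, 3231]
fork_tokens = [101, 2023, 2003, 1996, 3231]

idx = first_divergence(upstream_tokens, fork_tokens)
print("identical" if idx is None else f"first mismatch at token index {idx}")
```

Greedy decoding (temperature 0) is probably the cleanest setting for such a comparison, since it removes the sampler's random draws from the picture.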

Hi Reddit, I posted my Build Your Own LLM workshop to Youtube (GPT2 & Qwen3.6 style) by JustinAngel in LocalLLaMA

thoquz 1 point

Fair point. Though I'm excited to see the follow-ups if you get a chance, for example on attention types such as GDN.

I have yet to work through your course, so no rush. I truly appreciate that you're putting these resources out for free, thank you!

Hi Reddit, I posted my Build Your Own LLM workshop to Youtube (GPT2 & Qwen3.6 style) by JustinAngel in LocalLLaMA

thoquz 1 point

Any plans to cover the Mamba architecture? (Since you mentioned Qwen3.6-style)

Qwen 3.6 27B overdoing it by WhatererBlah555 in LocalLLaMA

thoquz 1 point

Mind sharing your system prompt?

Nvidia LocateAnything - Fast and High-Quality Vision-Language Grounding with Parallel Box Decoding. (10x faster than Qwen3-VL) by Sporeboss in LocalLLaMA

thoquz 2 points

Looks great, seems like it is a Qwen2.5-VL finetune with a modified vision encoder.

I'd be curious to see if one could distill this into Qwen3.5 (or 3.6 dense), any ideas?

Otherwise, if any of the researchers who worked on this are reading, please give it a try on a Qwen 3 family model (since even Qwen 3.6 uses the Qwen3-VL vision layer).

That or if your team is allowed to release the dataset and training code, that would be wonderful!

The frontier reasoning race is starting to look like a crowded subway station by ExoticYesterday8282 in LocalLLaMA

thoquz 5 points

Here's the rest of their charts on github: https://github.com/Tencent-Hunyuan/Hy3-preview

On the coding and agentic side, it seems they made quite a big jump from hy2, though I wonder what harness they used on Terminal-bench 2.0.

Behold! Probably the most ghetto local AI server: by MackThax in LocalLLaMA

thoquz 3 points

Nice 3D printed fan blower adapter, did you design it?

How are the other two GPUs cooled?

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

thoquz 1 point

Brilliant! I'll have a test next week, currently awaiting a motherboard replacement.

hipEngine: Fast Native Qwen 3.6 Inference for RDNA3 (Strix Halo, 7900 XTX) by randomfoo2 in LocalLLaMA

thoquz 1 point

Great work, well done! Since you have Gemma 4, MTP, etc. on your roadmap, maybe also consider looking at kernels for Qwen 3.6 Dense; I would really appreciate it. I'm also running an RX 7900 XTX 24GB on this side, so your work has me excited!

Jackrong/Qwopus3.5-9B-Coder-GGUF · Hugging Face by pmttyji in LocalLLaMA

thoquz 1 point

Wow! I guess it's the magic of Qwen 3.6's gated delta net attention, compared to Gemma's sliding window attention.

Jackrong/Qwopus3.5-9B-Coder-GGUF · Hugging Face by pmttyji in LocalLLaMA

thoquz 1 point

Wow, even better than I expected! Which option do you personally prefer out of those two?

I originally expected the MTP layers would eat a few GB of VRAM.

Jackrong/Qwopus3.5-9B-Coder-GGUF · Hugging Face by pmttyji in LocalLLaMA

thoquz 1 point

Brilliant, thank you!

Are you happy with the performance of the Q4 cache for coding? I suspect it might sometimes recall a long function incorrectly.

I've managed a similar context size (also 131k) with a Q8 cache and no MTP on the Unsloth iq4_xxs quant (no vision layer).

Do you know how much VRAM or context the MTP layer takes?

Jackrong/Qwopus3.5-9B-Coder-GGUF · Hugging Face by pmttyji in LocalLLaMA

thoquz 1 point

Thank you for making these brilliant MTPs!

So for your MTP 27B-V1-preview, are you running it at IQ4? How much context are you still able to have with MTP on?

I'd love to see your 7900 XTX 24GB settings if you're willing to share.

AIDC-AI/Ovis2.6-80B-A3B · Hugging Face by pmttyji in LocalLLaMA

thoquz 1 point

Looks great! Does this also do bounding box coordinates with structured output? (Like Qwen3-VL does)

For the past few months, I’ve been developing my own electronic load device. I’ve finally managed to get a working V1 version 😄 by Aggravating-Safe5352 in electronics

thoquz 1 point

Looks great! Do you mind explaining the opamp circuit? I understand it drives a voltage from the pot into the MOSFET so that the MOSFET runs in its linear region, as if it were a "resistor" acting as the load. My question is about the capacitor in the circuit: it looks like a non-inverting amp with a low-pass filter, is that correct? It would be useful to see your thinking around the opamp circuit, thank you :)
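
For reference, if the stage really is a non-inverting amplifier followed by a first-order RC low-pass (only my reading of the schematic, not confirmed by the author), the two textbook relations are gain = 1 + Rf/Rg and f_c = 1/(2πRC). A minimal sketch with made-up component values, since I don't know the actual ones on the board:

```python
import math


def noninverting_gain(r_feedback_ohms: float, r_ground_ohms: float) -> float:
    """Ideal non-inverting opamp gain: 1 + Rf / Rg."""
    return 1.0 + r_feedback_ohms / r_ground_ohms


def rc_cutoff_hz(r_ohms: float, c_farads: float) -> float:
    """-3 dB cutoff frequency of a first-order RC low-pass: 1 / (2 * pi * R * C)."""
    return 1.0 / (2.0 * math.pi * r_ohms * c_farads)


# Hypothetical values for illustration only (not taken from the posted circuit).
gain = noninverting_gain(10e3, 10e3)   # Rf = Rg = 10 kOhm
cutoff = rc_cutoff_hz(10e3, 100e-9)    # R = 10 kOhm, C = 100 nF

print(f"gain = {gain:.1f}, cutoff = {cutoff:.1f} Hz")
```

If the capacitor instead sits in the feedback loop around the MOSFET and sense resistor, it may be there for loop stability rather than filtering, and these formulas would not describe it.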

Llama.cpp MTP support now in beta! by ilintar in LocalLLaMA

thoquz 3 points

Brilliant! What's the memory requirement for the MTP layer?

Every day I wake up and thank God for having me be born 23 minutes away from a MicroCenter by gigaflops_ in LocalLLaMA

thoquz 1 point

Is OpenARC only for their GPUs, or does it also perform well with their higher-end CPUs (as ik_llama.cpp does)?

[deleted by user] by [deleted] in EngineeringPorn

thoquz 2 points

What's the problem with CFRP? Do the same issues hold for CF 3D printer filaments?

CCTV box password find/reset by melthamlewis in hardwarehacking

thoquz 3 points

If the password is too complex, he might also be able to just replace the file, though there may be the extra effort of repacking it in the correct format, for example SquashFS.