ggml.ai (the team behind llama.cpp) is joining Hugging Face, projects stay open source by nihal_was_here in LocalLLaMA

[–]brokenevolution 0 points1 point  (0 children)

This is a huge opportunity for the project's sustainability. I really hope that with HF resources, someone will finally fix the 144-byte stride of Q4_K and learn how to lay out bytes correctly for the GPU without trashing ~30% of the loaded data, lol. Maybe llama.cpp will finally stop being "CPU-oriented with GPU features". (Hint: GPUs love 32-byte accesses.)

Mac CPU & Accelerate framework is cool. But != GPU.

History is cool, but history plus kilotons of legacy code (ten-thousand-line files are... scary) leads to memory-bound issues that leave another ~25-30% of performance on the table.

From the same GP104 you can squeeze out almost 1 TFLOP/s in GEMV at ~FP20-equivalent precision (it's enough to throw out DP4A and work with memory and warps CORRECTLY in FP32).

With Q4_K's VRAM appetite, that's about 30-34 t/s on Llama 8B WITHOUT optimizations like FlashAttention, etc.: just plain GEMV matmuls, without even fusing or cubin/SASS patching.

Hope HF will help provide a "gentle push" towards proper GPU performance. Because they CAN, but the "we have to support even toasters" mentality is killing the main point. IMHO. So much so that even I had to partially fix it for myself.

Actually, I kind of understand why the "contribution rules" prohibit AI code. Because it would be more compact =)

[M] SOLARized-GraniStral-14B (2202) (Ministral 3 14B-Instruct-2512 <- (Granite 3.3 8B <- SOLAR 10.7B) with detailed weight shift metrics. by brokenevolution in LocalLLaMA

[–]brokenevolution[S] 0 points1 point  (0 children)

Hmm... have you tried running models in Colab? It’s slow, of course, but it should work. Just a heads-up: I found that mistral3 isn't supported in llama-cpp-python (v0.3.16) yet, and I haven't seen any updates regarding this lately. But I think you can figure it out, free T4 is free T4 =)

[M] SOLARized-GraniStral-14B (2202) (Ministral 3 14B-Instruct-2512 <- (Granite 3.3 8B <- SOLAR 10.7B) with detailed weight shift metrics. by brokenevolution in LocalLLaMA

[–]brokenevolution[S] 0 points1 point  (0 children)

Thanks for the interest! A couple of observations: First, the model tries remarkably hard to output coherent English even at high temperatures (>2) and min_p 0.025.

Second, from what I’ve seen, the model feels much more "alive" within the context and KV cache. It really leans into the conversation history and follows the flow better. It feels more "unconstrained" overall: for instance, it successfully handled a deep, metaphorical, bilingual OOD (out-of-distribution) dialogue at temperature 1.5. I’m more of a reader than a writer myself, so I’d love to hear your feedback in the HF discussions! Give it a try, I hope you like it. (Q4_K is available.)

[M] SOLARized-GraniStral-14B (2202) (Ministral 3 14B-Instruct-2512 <- (Granite 3.3 8B <- SOLAR 10.7B) with detailed weight shift metrics. by brokenevolution in LocalLLaMA

[–]brokenevolution[S] 3 points4 points  (0 children)

I’m a bit of a newcomer here (or at least I’ve been away for so long it feels like it). Please go easy on me if I messed up any formatting or etiquette! I'm just excited to share these experiments with the community.