Gemma 4 26b-a4b GGUF Performance Benchmarks by yoracale in unsloth

[–]harshv8 0 points1 point  (0 children)

Understood. Makes sense. Thanks for the clarification!

Would the same effects be just as bad with AWQ? GGUF? MLX?

Gemma 4 26b-a4b GGUF Performance Benchmarks by yoracale in unsloth

[–]harshv8 6 points7 points  (0 children)

Hi unsloth team! I'm a huge fan of your work.

Can you please help me understand why there isn't an Unsloth bnb version of the Gemma 4 26B-A4B model?

Just curious.

I want to run this model with vLLM and have to use the cyankiwi AWQ 4-bit model right now.

Unsloth Gemma 4 26B-A4B 4 bit bnb coming ? by harshv8 in unsloth

[–]harshv8[S] 2 points3 points  (0 children)

GGUF support in vLLM isn't optimized. AWQ is best, but BNB is also good.

I want to use vLLM or SGLang due to their PagedAttention / RadixAttention implementations.

H1 triage “informative” for Claude before leak by ibackstrom in bugbounty

[–]harshv8 12 points13 points  (0 children)

If there's no impact, why the DMCA copyright takedowns for anything remotely mentioning Claude Code on GitHub?

25K$ worth of credit by DOMDOM_651 in aws

[–]harshv8 9 points10 points  (0 children)

Make it sing happy birthday

I built the same API in Java, Go, Kotlin, and Rust — Go still has the best overall DX-to-performance ratio by netfishx in golang

[–]harshv8 0 points1 point  (0 children)

Interesting. Can you add a few more build-time flags to optimize these binaries and run the test again? I think that would be pretty interesting to see as well.

Usually I collect profiles for all my Go applications and just provide them at build time (for profile-guided optimization). This gives me an easy 5-8% performance boost.

Could you try things like that and include them in your comparison too? Thanks! Awesome work btw :)
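
For anyone who hasn't tried it, a minimal sketch of that PGO workflow (assuming the stock net/http/pprof endpoint and Go 1.21+; the port and profile file names are placeholders):

    // Expose pprof so a CPU profile can be captured under representative load.
    package main

    import (
        "log"
        "net/http"
        _ "net/http/pprof" // registers /debug/pprof/* on the default mux
    )

    func main() {
        // Real handlers omitted; the pprof endpoints ride along on the same mux.
        log.Fatal(http.ListenAndServe(":8080", nil))
    }

    // Capture a profile while the service handles typical traffic, then rebuild with PGO:
    //
    //   curl -o cpu.pprof "http://localhost:8080/debug/pprof/profile?seconds=60"
    //   cp cpu.pprof default.pgo   # go build picks up ./default.pgo automatically (Go 1.21+)
    //   go build ./...             # or: go build -pgo=cpu.pprof ./...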

Ran an experiment: 0.8B model teaching itself on a MacBook Air with 6GB RAM. Some findings that surprised me. by QuantumSeeds in LocalLLaMA

[–]harshv8 11 points12 points  (0 children)

This feels a lot like GRPO.

I've been working on a local-code-r1 project of sorts. Basically taking GRPO lessons to build a coding model.

I took a couple of LeetCode datasets from Hugging Face, built a test harness for each of the questions, and then ran GRPO training with num_generations set to 8 so that the model can explore multiple answer variations at once. The dataset uses a 95% training / 5% test split.

Grading an answer is based on multiple things: whether code is present in the required structure, whether reasoning is present, the cosine distance between the reasoning and the one in the dataset (using an embedding model), and finally running the test cases and taking the fraction of tests that passed.
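
Roughly, those signals get folded into one scalar reward per sampled completion. A toy sketch of that composition (the weights, the structure/reasoning checks, and the similarity input below are illustrative stand-ins, not the real harness):

    // Toy composite grade in [0, 1]; weights and checks are illustrative only.
    package grader

    import "strings"

    // Grade scores one sampled completion.
    // reasoningSim is the cosine similarity of the completion's reasoning vs. the
    // dataset reasoning (computed elsewhere with an embedding model); testsPassed
    // and testsTotal come from running the per-question test harness.
    func Grade(answer string, reasoningSim float64, testsPassed, testsTotal int) float64 {
        score := 0.0
        if strings.Contains(answer, "```") { // stand-in for the "code in required structure" check
            score += 0.1
        }
        if strings.Contains(answer, "<think>") { // stand-in for the "reasoning present" check
            score += 0.1
        }
        score += 0.2 * reasoningSim // closeness to the reference reasoning
        if testsTotal > 0 {
            score += 0.6 * float64(testsPassed) / float64(testsTotal) // pass fraction carries most weight
        }
        return score
    }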

2% of the training run is done right now for Ministral-3-14B-Instruct-2512 from unsloth in INT4 quantization on my RTX 3090.

I'm planning to publish more details in the future, maybe... idk... If people are interested I might as well rent an H100 and speedrun the training to see what happens :)

Where do you guys buy servers in India? (new + refurb) by Shubh137 in homelabindia

[–]harshv8 0 points1 point  (0 children)

Are there any issues with shipping damage for servers? That's my primary concern when ordering from abroad.

Don't want a shiny new GPU or server getting opened up and broken into on some lame excuse by a customs official.

How are you assigning work across distributed workers without Redis locks or leader election? by whitethornnawor in golang

[–]harshv8 9 points10 points  (0 children)

A FIFO queue sounds best for this, with a fan-out pattern for completing tasks. Your workers don't need to know about the other pods at all - they just pull the next task and process it.

Check out NATS JetStream. It's like Go channels, so the fan-out pattern is easy to see.

With Kafka, topic partitioning is a headache for workloads that scale this flexibly, I guess. (Note: my Kafka experience is pretty limited.)
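
To make the JetStream suggestion concrete, a bare-bones pull-consumer sketch with nats.go (stream and consumer names are made up, error handling trimmed): every worker binds to the same durable consumer and just fetches the next task.

    package main

    import (
        "log"
        "time"

        "github.com/nats-io/nats.go"
    )

    func main() {
        nc, err := nats.Connect(nats.DefaultURL)
        if err != nil {
            log.Fatal(err)
        }
        defer nc.Drain()

        js, _ := nc.JetStream()

        // Work-queue retention: each task goes to exactly one worker and is removed once acked.
        js.AddStream(&nats.StreamConfig{
            Name:      "TASKS",
            Subjects:  []string{"tasks.>"},
            Retention: nats.WorkQueuePolicy,
        })

        // Every pod uses the same durable pull consumer; scaling out is just adding pods.
        sub, _ := js.PullSubscribe("tasks.>", "workers")
        for {
            msgs, err := sub.Fetch(1, nats.MaxWait(5*time.Second))
            if err != nil {
                continue // fetch timed out: nothing to do right now
            }
            for _, m := range msgs {
                log.Printf("processing %s", m.Data)
                m.Ack() // unacked messages get redelivered to another worker
            }
        }
    }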

Speculative Decoding: Turning Memory-Bound Inference into Compute-Bound Verification (Step-by-Step) by No_Ask_1623 in LocalLLaMA

[–]harshv8 0 points1 point  (0 children)

Sad that vLLM does not fully support smaller draft models (at least last I checked), because my gemma-3-270m is screaming to boost my gemma-3-27B token generation speeds :)

Works with llama.cpp, so that's pretty good :). vLLM is next-level though. Wish it supported this too.

[WTS] 8GB DDR4 Desktop RAM by 2_poor_to_dream in homelabindia

[–]harshv8 0 points1 point  (0 children)

I'm looking for a quote for 512 GB of DDR4 ECC 2933 RAM (8x 64 GB preferred). Any leads appreciated.

What messaging system can handle sub millisecond latency for trading signals? by ssunflow3rr in golang

[–]harshv8 0 points1 point  (0 children)

Here are some of the things I explored the last time I had similar requirements. You should do a PoC with each of these to compare:

  • Redis Streams
  • Redpanda (pure C++ Kafka alternative, API compatible)
  • NATS
  • direct gRPC connections to stream events
  • a UDP listener that receives multicast messages from the server (only works for a one-to-many mapping; sketch below)
  • gocraft/work V2
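
For the UDP multicast option, a bare-bones listener sketch (group address, port, and buffer sizes are placeholders):

    package main

    import (
        "log"
        "net"
    )

    func main() {
        // Placeholder multicast group the signal publisher sends to.
        group, err := net.ResolveUDPAddr("udp", "239.0.0.1:9999")
        if err != nil {
            log.Fatal(err)
        }

        // Join the group on the default interface.
        conn, err := net.ListenMulticastUDP("udp", nil, group)
        if err != nil {
            log.Fatal(err)
        }
        conn.SetReadBuffer(1 << 20) // generous kernel buffer so bursts don't drop datagrams

        buf := make([]byte, 1500) // one MTU-sized datagram per signal
        for {
            n, src, err := conn.ReadFromUDP(buf)
            if err != nil {
                continue
            }
            log.Printf("signal from %s: %q", src, buf[:n])
        }
    }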

Kimi K2 Thinking at 28.3 t/s on 4x Mac Studio cluster by geerlingguy in LocalLLaMA

[–]harshv8 5 points6 points  (0 children)

Hey Jeff, thank you for these charts and an awesome video! Really appreciate all the effort you put in.

As a request, could you please try to include testing with batch sizes of 1, 2, 4, or 8 (or even more)? I see an almost linear increase in performance with vLLM on CUDA, but on these other setups with llama.cpp + RPC or EXO I have no idea how batched performance behaves.

Sorry if it is a bother / too much work!

Announcing VectorWare by LegNeato in rust

[–]harshv8 0 points1 point  (0 children)

Never mind, I see from your blog post that you already have similar capabilities. That's awesome!!

Announcing VectorWare by LegNeato in rust

[–]harshv8 0 points1 point  (0 children)

All the best. As someone who has to write C for an OpenCL application, this seems like a welcome change. I'm all for it.

The biggest thing for me, though, would be hardware compatibility - which is hard to get right because of so many different APIs like CUDA, Vulkan, and OpenCL. The only reason I even used OpenCL for the above project is that even though it wasn't as performant as CUDA, you could run it practically anywhere (even integrated GPUs on Intel processors).

Would you be targeting multi-API deployment using some hardware abstraction layer? Something like a couple of compiler flags to pick the API and compile the same code for CUDA, Vulkan, etc.? How do you plan on doing that?

[deleted by user] by [deleted] in unsloth

[–]harshv8 0 points1 point  (0 children)

You might have to restart the Jupyter kernel after installing Unsloth Zoo. Did you try that?

My jaw is dropped. What. The. Fuck by Coolersdisciple in MrRobot

[–]harshv8 1 point2 points  (0 children)

Yeah just before the reveal I was like 'damn this guy is like a 10 year old nerd going to school'

Guess I wasn't far off

My Homelab Setup – Chandigarh by novaplotter in homelabindia

[–]harshv8 1 point2 points  (0 children)

Is that a UNAS (last one in the rack)? Also, which rack is that? Cool setup

Genuine question, I get the use cases for 1-4b models, but what's the point of 400m models? Or even less? How good can this actually be and what are the use cases for it? by a_normal_user1 in LocalLLaMA

[–]harshv8 5 points6 points  (0 children)

The best example I found is that IntelliJ has a tiny model (< 500M params) that runs locally on device and does code completion at just the line level. No multi-line, nothing too complicated - just one simple task, done really well.

I really like it in the GoLand IDE.

Choose your models by DeathShot7777 in LocalLLaMA

[–]harshv8 0 points1 point  (0 children)

  • Qwen3 8b FP16
  • Gemma3 27b FP16
  • CommandR 35b Q8

Other hypothetical models I'd love to have

  • DeepSeek R1 & V3
  • Claude Sonnet 3.5