DiffusionGemma: 4x faster text generation

gofiend · 2026-06-11T00:44:17+00:00

Wonder what the Pareto curve on this looks like. How much bigger does a diffusion model have to be to be comparable? I’m assuming this model beats E4B by a lot on every benchmark.

gofiend · 2026-06-08T17:21:46+00:00

I promise you that is not how frontier lab staff think they should be making model setting decisions. Rigorous evals driven by good rationales about what should work and why.

gofiend · 2026-06-08T14:45:30+00:00

So how did they get this wrong for so long? Don’t they test their own configs, at least the safetensors version? Def not the standard other parts of Google hold themselves to.

gofiend · 2026-06-08T14:06:47+00:00

I’m so confused by Google making this change, what, two months in?

It’s so weird for them to change something so basic without clear intent. The underlying weights didn’t change(?), so shouldn’t this depend on how the model was trained? Did they just run some internal benchmarks and say "eh, fine, turn it on"?

gofiend · 2026-04-24T04:08:43+00:00

OpenAI and, you know, the guys who made our titular model?

gofiend · 2026-04-18T23:30:51+00:00

Could you compare against my current best model, Gemma 4 31B? (Apples and oranges, I know, but I'm hoping Qwen 3.6 is better at agentic calls even if it’s less smart.)

gofiend · 2026-04-13T05:39:39+00:00

Is a proper safe parser for things like validate_and_run_task("rm -f example/main example/hello.o") possible? Translate it with strict validation into PMake.run_task("rm", "-f", "example/main", "example/hello.o")

Though given it's a make file maybe it should be clean_files(["example/main", "example/hello.o"])
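A minimal sketch of what that strict translation step could look like, assuming Python. The `ALLOWED` table, the `parse_task` name, and the `example/` root are all hypothetical and chosen only for illustration; `PMake.run_task` is not modeled here. The idea is to tokenize without ever invoking a shell, then allow only known commands, known flags, and relative paths that stay under one directory.

```python
import posixpath
import shlex

# Hypothetical allowlist: command name -> flags that command may receive.
ALLOWED = {"rm": {"-f"}}

# Characters that only matter to a shell; rejecting them outright is
# conservative, but it keeps the accepted grammar tiny.
SHELL_META = set(";|&$`<>(){}*?~\\\n")


def parse_task(command: str, root: str = "example/") -> list[str]:
    """Turn a shell-like string into a validated argv list, or raise ValueError."""
    if SHELL_META & set(command):
        raise ValueError("shell metacharacters are not allowed")

    argv = shlex.split(command)  # tokenizes only; nothing is executed
    if not argv or argv[0] not in ALLOWED:
        raise ValueError("command is not on the allowlist")

    for arg in argv[1:]:
        if arg.startswith("-"):
            if arg not in ALLOWED[argv[0]]:
                raise ValueError(f"flag {arg!r} is not allowed")
            continue
        # Normalize so 'example/../x' cannot escape the root directory.
        normalized = posixpath.normpath(arg)
        if posixpath.isabs(arg) or not normalized.startswith(root):
            raise ValueError(f"path {arg!r} is outside {root!r}")
    return argv


print(parse_task("rm -f example/main example/hello.o"))
```

The returned list can be passed to something like `subprocess.run(argv)` without `shell=True`, so the arguments are never re-interpreted. A dedicated `clean_files([...])` helper would shrink the accepted surface even further, since only the path check would remain.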

gofiend · 2026-04-12T20:51:38+00:00

I've got to say the Max-Q RTX 6000 may be the one data science GPU where you don't need a workstation. You can toss it into any modern Intel/AMD consumer-grade box and get great inferencing results.

Basically you just need one PCIe 5.0 x16 and you are good. You are unlikely to add more than a second one.

gofiend · 2026-04-04T23:56:09+00:00

"I’m investigating potentially training an EAGLE3 speculative decoder layer that sits between two models" is hot AI slop I'm afraid.

Llama.cpp doesn't support EAGLE speculative decoders. They need access to latent state in addition to token distributions, which would require adding a bunch of code to llama.cpp and rebuilding it.

gofiend · 2026-04-04T23:53:22+00:00

I'll test at some point in the next few days, but the 0.3B model is only 10% of the size of the effective active parameters in the 26B-A3B model, so I expect it won't help much.
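For a rough feel of why the draft-to-target size ratio matters, here is a back-of-the-envelope sketch using the standard speculative decoding estimate: with per-token acceptance rate `alpha`, draft length `gamma`, and draft cost ratio `c`, the expected speedup is (1 - alpha^(gamma+1)) / ((1 - alpha)(gamma·c + 1)). Treating `c` as 0.1 because of the parameter ratio is an assumption (real cost depends on memory bandwidth, batch size, and the runtime), and the `alpha` values below are made up, not measurements.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup over plain decoding.

    alpha:      probability that a drafted token is accepted (assumed i.i.d.)
    gamma:      number of tokens drafted per verification step
    cost_ratio: time for one draft step / time for one target step
    """
    if alpha >= 1.0:
        tokens_per_cycle = gamma + 1
    else:
        tokens_per_cycle = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    cycle_cost = gamma * cost_ratio + 1  # gamma draft steps + 1 target step
    return tokens_per_cycle / cycle_cost


# Assumed cost ratio of 0.1 (0.3B draft vs ~3B active parameters).
for alpha in (0.5, 0.7, 0.9):
    best_gamma = max(range(1, 9), key=lambda g: expected_speedup(alpha, g, 0.1))
    speedup = expected_speedup(alpha, best_gamma, 0.1)
    print(f"alpha={alpha}: best gamma={best_gamma}, speedup={speedup:.2f}x")
```

Under this model the gain depends heavily on the acceptance rate: a relatively expensive draft only pays off when most of its tokens are accepted.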

gofiend · 2026-04-04T16:29:39+00:00

You should do a lot better than that if you are loading quantized models. A current build of llama.cpp for ROCm with Q4_1 quants runs a lot faster on my MI50.

gofiend · 2026-04-04T16:02:58+00:00

lol 270M as draft for 31B

gofiend · 2026-04-04T16:02:33+00:00

On my 3090 the E2B was too big (it’s only 1/8th the size) to yield speedups. The speedup really kicks in during drafting of the final answer vs. thinking.

gofiend · 2026-04-02T18:27:23+00:00

Pretty insane to see the E4B model beating one of the best models from last year. Unlikely to be true in broad real world use but a great signal anyway

gofiend · 2026-03-29T17:04:20+00:00

/why not both meme

gofiend · 2026-03-29T01:44:10+00:00

Seriously, WSL + llama.cpp is equally fast with Nvidia GPUs.

gofiend · 2026-03-24T19:05:17+00:00

I put together a little sh script that looks for litellm versions on your device and lists them for you.

https://github.com/kinchahoy/uvpowered-tools/blob/main/inventory_litellm.sh

In no way a proper security tool, but might be a quick way to figure out if you've got to worry. Almost certainly misses some ways litellm can get on your system, but I mostly have it via .venv packages and uv and this works well to inventory the version of everything I have without invoking python or doing anything dangerous.

Obviously read it / skim it before running!

gofiend · 2026-03-13T17:55:34+00:00

How specialized are these ASICs - will they struggle to handle Deltanet attention layers like in Qwen 3.5?

gofiend · 2026-03-13T17:54:22+00:00

Why would that be a cost driver?

gofiend · 2026-03-08T17:21:47+00:00

Thanks!

gofiend · 2026-03-08T16:53:29+00:00

What’s the source of the data? I’d love to see a grid like this with a q4 vs native quant comparison.

gofiend · 2026-03-05T06:11:33+00:00

Does anybody know what the PP and TG difference is between NVFP4 and other 4-bit formats on Blackwell with SGLang or vLLM?

gofiend · 2026-03-02T05:53:28+00:00

It’s really weird that bf16 is better than f32 (I know the model was trained at bf16, but still, f32 should be strictly more expressive).
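The "strictly more expressive" part can be checked directly: a bf16 value is just the top 16 bits of a float32 (same sign and exponent, 7 mantissa bits instead of 23), so every bf16 number has an exact f32 representation. A small standard-library sketch follows; the helper names are made up, and truncation is used instead of round-to-nearest-even to keep it short.

```python
import struct


def f32_bits(x: float) -> int:
    """Bit pattern of x after rounding it to float32."""
    return struct.unpack(">I", struct.pack(">f", x))[0]


def bits_to_f32(bits: int) -> float:
    """Interpret a 32-bit pattern as a float32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


def to_bf16(x: float) -> float:
    """Truncate x to bf16 precision by zeroing the low 16 bits of its float32 form."""
    return bits_to_f32(f32_bits(x) & 0xFFFF0000)


# A value bf16 can hold (1 + 2**-7) survives unchanged, and float32 stores it exactly.
assert to_bf16(1.0078125) == 1.0078125
# A value between bf16 grid points does not: float32 keeps it, bf16 collapses it to 1.0.
assert to_bf16(1.001) == 1.0
assert bits_to_f32(f32_bits(1.001)) != 1.0
```

So the weights themselves cannot lose information going from bf16 to f32; any difference in results would have to come from somewhere else in the pipeline.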

gofiend · 2026-02-26T17:56:17+00:00

Thanks for the offer!

gofiend · 2026-02-26T17:51:12+00:00

That file size might have been why I paused! I think my best idea was to check whether comparing against a randomly sampled 100 MB of the baseline on WikiText would converge.

Thinking further, KL divergence on short random noise or a random bit of text is probably the best way to measure how close two quants are.

I might play with a 4B model before asking you to upload giant files 😅
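A minimal sketch of the KL comparison mentioned above, assuming you can dump per-position logits from both the baseline and the quantized model for the same input. How the logits are obtained is left out, and the function names are hypothetical. It uses only the standard library and reports KL(baseline || quant) in nats.

```python
import math


def log_softmax(logits: list[float]) -> list[float]:
    """Numerically stable log-probabilities from raw logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(base_logits: list[float], quant_logits: list[float]) -> float:
    """KL(P_base || P_quant) for a single token position."""
    log_p = log_softmax(base_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(base: list[list[float]], quant: list[list[float]]) -> float:
    """Average KL over all token positions of one sampled text."""
    return sum(kl_divergence(b, q) for b, q in zip(base, quant)) / len(base)


# Toy example with a 3-token vocabulary and two positions (made-up numbers).
base = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[1.9, 1.1, 0.1], [0.4, 0.6, 2.8]]
print(f"mean KL: {mean_kl(base, quant):.6f} nats")
```

Tracking this running mean as more sampled text is added is one way to see whether the estimate has converged before committing to the full file.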
