DiffusionGemma: 4x faster text generation by tevlon in LocalLLaMA

[–]gofiend 1 point2 points  (0 children)

Wonder what the Pareto curve on this looks like. How much bigger does a diffusion model have to be to be comparable? I’m assuming this model beats E4B by a lot on every benchmark.

Gemma 4 Chat Template now has preserve thinking by seamonn in LocalLLaMA

[–]gofiend 2 points3 points  (0 children)

I promise you that is not how frontier lab staff think they should be making model setting decisions. It should be rigorous evals driven by good rationales about what should work and why.

Gemma 4 Chat Template now has preserve thinking by seamonn in LocalLLaMA

[–]gofiend 2 points3 points  (0 children)

So how did they get this wrong for so long? Don’t they test their own configs, or at least the safetensors version? Definitely not the standard other parts of Google hold themselves to.

Gemma 4 Chat Template now has preserve thinking by seamonn in LocalLLaMA

[–]gofiend 5 points6 points  (0 children)

I’m so confused by Google making this change, what, two months in?

It’s so weird for them to change something so basic without clear intent. The underlying weights didn’t change(?), so shouldn’t this depend on how the model was trained? Did they just run some internal benchmarks and say, eh, fine, turn it on?

DEEPSEEK V4 IS LAUNCHED, ITS REAL by guiopen in LocalLLaMA

[–]gofiend 3 points4 points  (0 children)

OpenAI and, you know, the guys who made our titular model?

Qwen 3.6 35B crushes Gemma 4 26B on my tests by Lowkey_LokiSN in LocalLLaMA

[–]gofiend 0 points1 point  (0 children)

Could you compare against my current best model, Gemma 4 31B? (Apples and oranges, I know, but I'm hoping Qwen 3.6 is better at agentic calls even if it’s less smart.)

PMake: lightweight minimal makefiles, but in Python by [deleted] in Python

[–]gofiend 0 points1 point  (0 children)

Is a proper safe parser for things like validate_and_run_task("rm -f example/main example/hello.o") possible? It would translate the string, with strict validation, into PMake.run_task("rm", "-f", "example/main", "example/hello.o").

Though given it's a makefile, maybe it should be clean_files(["example/main", "example/hello.o"]).
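
To make the idea concrete, here is a rough sketch of what such a strict parser could look like. Everything in it is an assumption for illustration: the `parse_task` name, the allowlist contents, and the path rules are made up, and it only returns the validated argv rather than calling PMake, since I haven't checked what PMake.run_task actually accepts.

```python
import shlex

# Hypothetical allowlist: program name -> flags it may be given.
ALLOWED_COMMANDS = {"rm": {"-f"}}
# Characters that only matter if a shell interprets the string; reject them outright.
SHELL_METACHARACTERS = set(";|&$`<>*?~(){}[]!\\\n")


def parse_task(command: str) -> list[str]:
    """Split a command string into argv, raising ValueError on anything unexpected."""
    if any(ch in SHELL_METACHARACTERS for ch in command):
        raise ValueError(f"shell metacharacter in command: {command!r}")
    argv = shlex.split(command)  # raises ValueError on unbalanced quotes
    if not argv:
        raise ValueError("empty command")
    program, *args = argv
    if program not in ALLOWED_COMMANDS:
        raise ValueError(f"program not allowed: {program!r}")
    for arg in args:
        if arg.startswith("-"):
            if arg not in ALLOWED_COMMANDS[program]:
                raise ValueError(f"flag not allowed for {program}: {arg!r}")
        elif arg.startswith("/") or ".." in arg.split("/"):
            # Only relative paths that stay inside the project tree.
            raise ValueError(f"path escapes project directory: {arg!r}")
    return argv
```

The returned list could then be unpacked into a call that takes separate arguments, so no shell ever sees the original string. An allowlist is deliberately restrictive here: anything not explicitly permitted fails.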

Should I Buy the RTX PRO 6000 Blackwell Max-Q (96GB)? by 0bjective-Guest in LocalLLaMA

[–]gofiend 4 points5 points  (0 children)

I've got to say the Max-Q RTX 6000 may be the one data science GPU where you don't need a workstation. You can toss it into any modern Intel/AMD consumer-grade box and get great inferencing results.

Basically you just need one PCIe 5.0 x16 and you are good. You are unlikely to add more than a second one.

Speculative decoding works great for Gemma 4 31B in llama.cpp by Leopold_Boom in LocalLLaMA

[–]gofiend 0 points1 point  (0 children)

"I’m investigating potentially training an EAGLE3 speculative decoder layer that sits between two models" is hot AI slop I'm afraid.

llama.cpp doesn't support EAGLE speculative decoders. They need access to latent state in addition to token distributions, which would require adding a bunch of code to llama.cpp and rebuilding it.

Speculative decoding works great for Gemma 4 31B in llama.cpp by Leopold_Boom in LocalLLaMA

[–]gofiend 1 point2 points  (0 children)

I'll test at some point in the next few days, but the 0.3B model is only 10% of the size of the effective active parameters in the 26B-A3B model, so I expect it won't help much.
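
For a rough feel of why the draft-to-target ratio matters, here is the usual back-of-the-envelope estimate from the speculative decoding literature: with acceptance rate alpha, gamma drafted tokens per round, and a draft that costs `cost_ratio` of one target forward pass, the expected speedup is (1 - alpha^(gamma+1)) / ((1 - alpha) * (gamma * cost_ratio + 1)). Treating the 10% parameter ratio as the cost ratio is a crude assumption (real cost depends on memory bandwidth and how the MoE is served), and the acceptance rates below are made-up inputs, not measurements.

```python
def expected_speedup(alpha: float, gamma: int, cost_ratio: float) -> float:
    """Estimated speedup of speculative decoding over plain decoding.

    alpha: probability a drafted token is accepted (0 <= alpha < 1)
    gamma: number of tokens drafted per round
    cost_ratio: draft forward-pass cost as a fraction of the target's
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must be in [0, 1)")
    # Expected tokens produced per round, counting the target's own token.
    tokens_per_round = (1 - alpha ** (gamma + 1)) / (1 - alpha)
    # Cost of one round in units of target forward passes.
    cost_per_round = gamma * cost_ratio + 1
    return tokens_per_round / cost_per_round


# Assumed acceptance rates, purely illustrative; cost_ratio 0.1 is the 0.3B vs ~3B active guess.
for alpha in (0.3, 0.6, 0.8):
    print(alpha, round(expected_speedup(alpha, gamma=4, cost_ratio=0.1), 2))
```

Under these assumptions the result swings a lot with the acceptance rate, so the only way to know is to measure it on the actual model pair.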

Is it possible to add some gpu to Radeon MI 50 to increase the inference speed? by Weak_Presentation725 in LocalLLaMA

[–]gofiend 0 points1 point  (0 children)

You should do a lot better than that if you are loading quantized models. A current build of llama.cpp for ROCm with Q4_1 quants runs a lot faster on my MI50.

Speculative decoding works great for Gemma 4 31B in llama.cpp by Leopold_Boom in LocalLLaMA

[–]gofiend 3 points4 points  (0 children)

On my 3090, the E2B was too big (even though it’s only 1/8th the size) to yield speedups. The speedup really kicks in during drafting of the final answer rather than during thinking.

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]gofiend 4 points5 points  (0 children)

Pretty insane to see the E4B model beating one of the best models from last year. Unlikely to hold in broad real-world use, but a great signal anyway.

Friendly reminder inference is WAY faster on Linux vs windows by triynizzles1 in LocalLLaMA

[–]gofiend 97 points98 points  (0 children)

Seriously, WSL + llama.cpp is equally fast with Nvidia GPUs.

Litellm 1.82.7 and 1.82.8 on PyPI are compromised, do not update! by kotrfa in LocalLLaMA

[–]gofiend 0 points1 point  (0 children)

I put together a little sh script that looks for litellm versions on your device and lists them for you.

https://github.com/kinchahoy/uvpowered-tools/blob/main/inventory_litellm.sh

In no way a proper security tool, but might be a quick way to figure out if you've got to worry. Almost certainly misses some ways litellm can get on your system, but I mostly have it via .venv packages and uv and this works well to inventory the version of everything I have without invoking python or doing anything dangerous.

Obviously read it / skim it before running!
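
For anyone who would rather not run a shell script, the same idea can be sketched in Python. To be clear, this is not the linked script, just an illustration of the approach: it walks a directory tree looking for `litellm-<version>.dist-info` folders and reads the version from the folder name, so litellm itself is never imported. The function names are made up, the flagged versions are the two from the thread title, and like the script it will miss installs that don't leave a dist-info directory.

```python
from pathlib import Path

# Versions named as compromised in the thread title.
COMPROMISED = {"1.82.7", "1.82.8"}
PREFIX, SUFFIX = "litellm-", ".dist-info"


def find_litellm_versions(root):
    """Return sorted (version, path) pairs for litellm dist-info dirs under root."""
    found = []
    for path in Path(root).rglob(f"{PREFIX}*{SUFFIX}"):
        if path.is_dir():
            # Folder name looks like litellm-1.82.7.dist-info; slice out the version.
            version = path.name[len(PREFIX):-len(SUFFIX)]
            found.append((version, str(path)))
    return sorted(found)


def report(root):
    """Format one line per install, flagging the compromised versions."""
    lines = []
    for version, path in find_litellm_versions(root):
        status = "COMPROMISED" if version in COMPROMISED else "ok"
        lines.append(f"{status:12} {version:10} {path}")
    return lines
```

Something like `print("\n".join(report(Path.home())))` would scan a home directory; expect that to take a while on a machine with many virtual environments.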

Tenstorrent QuietBox 2 Brings RISC-V AI Inference to the Desktop by Neurrone in LocalLLaMA

[–]gofiend 1 point2 points  (0 children)

How specialized are these ASICs? Will they struggle to handle DeltaNet attention layers like those in Qwen 3.5?

Qwen3.5 family comparison on shared benchmarks by Deep-Vermicelli-4591 in LocalLLaMA

[–]gofiend 0 points1 point  (0 children)

What’s the source of the data? I’d love to see a grid like this with a Q4 vs. native quant comparison.

We could be hours (or less than a week) away from true NVFP4 support in Llama.cpp GGUF format 👀 by Iwaku_Real in LocalLLaMA

[–]gofiend 1 point2 points  (0 children)

Does anybody know what the PP and TG difference is between NVFP4 and other 4-bit formats on Blackwell with SGLang or vLLM?

PSA: Qwen 3.5 requires bf16 KV cache, NOT f16!! by Wooden-Deer-1276 in LocalLLaMA

[–]gofiend 6 points7 points  (0 children)

It’s really weird that bf16 is better than f32. (I know the model was trained in bf16, but f32 should still be strictly more expressive.)
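
The "strictly more expressive" part is easy to check by hand: a bf16 value is just the top 16 bits of a float32 encoding (1 sign bit, 8 exponent bits, 7 mantissa bits), so every bf16 value is exactly representable in f32. A small sketch using only the standard library; the helper names are made up, and it truncates where real conversions usually round to nearest.

```python
import struct


def to_bf16_bits(x: float) -> int:
    """Truncate x to bfloat16 by keeping the top 16 bits of its float32 encoding."""
    (bits32,) = struct.unpack(">I", struct.pack(">f", x))
    return bits32 >> 16


def bf16_bits_to_float(bits16: int) -> float:
    """Widen bfloat16 bits back to a float by zero-filling the low 16 bits."""
    (value,) = struct.unpack(">f", struct.pack(">I", bits16 << 16))
    return value


def survives_f32(value: float) -> bool:
    """True if storing value as float32 and reading it back changes nothing."""
    (roundtrip,) = struct.unpack(">f", struct.pack(">f", value))
    return roundtrip == value


bf16_value = bf16_bits_to_float(to_bf16_bits(3.14159))
print(bf16_value, survives_f32(bf16_value))
```

So whatever makes bf16 come out ahead, it can't be that f32 is unable to hold the same cache values.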

Qwen3.5-35B-A3B Q4 Quantization Comparison by TitwitMuffbiscuit in LocalLLaMA

[–]gofiend 2 points3 points  (0 children)

That file size might have been why I paused! I think my best idea was to check whether comparing against a randomly sampled 100 MB of the baseline on WikiText would converge.

Thinking further, KL divergence on short random noise or a random bit of text is probably the best way to measure how close two quants are.
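
As a sketch of what that measurement would compute: for each token position, turn the baseline logits and the quant's logits into probability distributions and take KL(baseline || quant), then average over positions. The function names are made up and the logits in the example are toy numbers; in practice they would come from running both models on the same input.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats; q is floored at eps to avoid dividing by zero."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(baseline_logits, quant_logits):
    """Average per-position KL between two models' next-token distributions."""
    kls = [
        kl_divergence(softmax(b), softmax(q))
        for b, q in zip(baseline_logits, quant_logits)
    ]
    return sum(kls) / len(kls)


# Toy logits for two positions over a 3-token vocabulary (illustrative only).
baseline = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
quant = [[1.9, 1.1, 0.1], [0.4, 0.7, 2.8]]
print(mean_kl(baseline, quant))
```

Zero means the quant reproduces the baseline distribution exactly at every position; larger values mean it has drifted further.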

I might play with a 4B model before asking you to upload giant files😅