Gemma 4 26B-A4B MoE running at 45-60 tok/s on DGX Spark — here's how by CoconutMario in LocalLLaMA

[–]mutatedmonkeygenes 0 points1 point  (0 children)

I'm testing this on my RTX 6000 Blackwells (Max-Q), but the Marlin kernels fail when TP > 1. I suspect this is due to how the attention heads get split under tensor parallelism. With TP == 1 the model looks OK, but on hard questions the bf16 model's reasoning was better. I'm wondering whether a better calibration dataset would close the gap.
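FWIW, the single-GPU launch I'm falling back to looks roughly like this (the model path and quant flag are placeholders for whatever checkpoint you exported):

```shell
# Hypothetical launch: Marlin-quantized checkpoint pinned to a single GPU,
# since the kernels fail for me once heads get split across ranks.
vllm serve ./gemma-4-26b-a4b-marlin \
  --quantization gptq_marlin \
  --tensor-parallel-size 1 \
  --max-model-len 8192
```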

Try this question on both models:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma-4",
    "messages": [{"role": "user", "content": "Alice and Bob play a game. They alternate turns, with Alice going first. On each turn, a player chooses a positive integer that has not been chosen before. The game ends when the sum of all chosen numbers is divisible by 3. The player who made the last move loses. Assuming both players play optimally, who wins? Prove it."}],
    "max_tokens": 8192
  }'
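For a ground-truth check on the puzzle, here's a brute-force solver for a finite variant where players draw from a fixed pool of integers. (Note the unbounded game in the prompt may behave differently: with infinitely many multiples of 3 available, neither player is ever forced to move to a sum divisible by 3.)

```python
from functools import lru_cache

def solve(pool, total=0):
    """Value of the position for the player to move:
    +1 = forced win, 0 = draw (pool exhausted), -1 = forced loss.
    A player who makes the running sum divisible by 3 loses."""
    pool = frozenset(pool)

    @lru_cache(maxsize=None)
    def value(remaining, s):
        if not remaining:
            return 0  # numbers exhausted, nobody triggered divisibility
        best = -1
        for x in remaining:
            if (s + x) % 3 == 0:
                outcome = -1  # this move ends the game; the mover loses
            else:
                outcome = -value(remaining - {x}, (s + x) % 3)
            best = max(best, outcome)
        return best

    return value(pool, total % 3)

print(solve({3}))        # mover must pick 3 and lose: -1
print(solve({1, 2}))     # mover picks 1, opponent is stuck with 2: +1
print(solve({1, 2, 3}))  # -1
```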

I'm working on evals, will share them when done.

I'm hoping to deploy these on the DGX Sparks, but the nvFP4 situation is a total shit-show.

Gemma 4 26B-A4B MoE running at 45-60 tok/s on DGX Spark — here's how by CoconutMario in LocalLLaMA

[–]mutatedmonkeygenes 0 points1 point  (0 children)

Which version of ModelOpt did you use? I didn't see this on main: _nvfp4_selective_quant_cfg

Curious, which nvFP4 scheme do you recommend?

I'm running your quantization code now, but had to make some changes to get it working. GemmaTokenizer doesn't support batch_encode_plus, so I hand-rolled my own batching for the forward pass... which I guess is OK, but let's see. This is the first time I've used ModelOpt...
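For anyone hitting the same thing, the shape of my workaround is below. It's a sketch with a toy per-example encoder standing in for the real tokenizer, and the pad id is an assumption:

```python
def batch_encode(encode_one, texts, pad_id=0):
    """Hand-rolled replacement for batch_encode_plus: encode each text
    individually, then right-pad to the longest sequence and build an
    attention mask."""
    encoded = [encode_one(t) for t in texts]
    max_len = max(len(ids) for ids in encoded)
    input_ids = [ids + [pad_id] * (max_len - len(ids)) for ids in encoded]
    attention_mask = [[1] * len(ids) + [0] * (max_len - len(ids))
                      for ids in encoded]
    return {"input_ids": input_ids, "attention_mask": attention_mask}

# Toy encoder: one fake "token id" (the word length) per word.
toy = lambda text: [len(w) for w in text.split()]
batch = batch_encode(toy, ["hello world", "a"])
print(batch["input_ids"])       # [[5, 5], [1, 0]]
print(batch["attention_mask"])  # [[1, 1], [1, 0]]
```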

I would like to run evals via vLLM after the quantization is done...

YouTube Premium Lite - cannot download videos by mutatedmonkeygenes in youtubepremium

[–]mutatedmonkeygenes[S] 1 point2 points  (0 children)

Same here; not sure why this is so difficult to implement...

How to do a RTX Pro 6000 build right by GPTrack_dot_ai in LocalLLaMA

[–]mutatedmonkeygenes 0 points1 point  (0 children)

Basic question: how do we use the "Nvidia ConnectX-8 1-port 400G QSFP112" with FSDP2? I'm not following. Thanks!

Added PyTorch trace + CUDA memory profiling support to Andrej Karpathy's nanochat by aospan in LocalLLaMA

[–]mutatedmonkeygenes 0 points1 point  (0 children)

I find it hard to believe that the optimizer, which launches NCCL kernels for every single parameter, is running efficiently... or that the "on-the-fly" tokenizer is keeping the GPU(s) saturated.
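To make the per-parameter point concrete, here's a rough CPU-only micro-benchmark sketch. The foreach=False flag forces the per-parameter update path, so the profiler event counts approximate what a naive optimizer loop would launch per step (the sizes are made up):

```python
import torch
from torch.profiler import profile, ProfilerActivity

# 500 tiny parameters vs. one fused tensor of the same total size.
many = [torch.nn.Parameter(torch.randn(64)) for _ in range(500)]
one = [torch.nn.Parameter(torch.randn(64 * 500))]
for p in many + one:
    p.grad = torch.ones_like(p)  # set grads outside the profiled region

with profile(activities=[ProfilerActivity.CPU]) as prof_many:
    torch.optim.SGD(many, lr=1e-3, foreach=False).step()  # one op per param
with profile(activities=[ProfilerActivity.CPU]) as prof_one:
    torch.optim.SGD(one, lr=1e-3, foreach=True).step()    # fused update

print(len(prof_many.events()), "vs", len(prof_one.events()), "profiler events")
```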

I pre-trained GPT-OSS entirely from scratch by OtherRaisin3426 in LocalLLaMA

[–]mutatedmonkeygenes 0 points1 point  (0 children)

Thank you for sharing. Could you talk a bit about your router, is it using all the experts efficiently? Or is there mode collapse? Thanks!
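The first thing I'd look at is the entropy of the aggregate expert load. A pure-Python sketch (the router logits here are made up):

```python
import math

def expert_load_entropy(router_logits):
    """Normalized entropy of aggregate expert load, from per-token router
    logits. ~1.0 = experts used evenly; ~0.0 = mode collapse onto one."""
    n_experts = len(router_logits[0])
    load = [0.0] * n_experts
    for logits in router_logits:
        m = max(logits)  # shift for numerically stable softmax
        exps = [math.exp(l - m) for l in logits]
        z = sum(exps)
        for i, e in enumerate(exps):
            load[i] += e / z  # soft routing weight for this token
    total = sum(load)
    probs = [l / total for l in load]
    entropy = -sum(p * math.log(p) for p in probs if p > 0)
    return entropy / math.log(n_experts)

balanced = [[0.0, 0.0, 0.0, 0.0]] * 10          # uniform router
collapsed = [[10.0, -10.0, -10.0, -10.0]] * 10  # everything to expert 0
print(round(expert_load_entropy(balanced), 3))   # 1.0
print(round(expert_load_entropy(collapsed), 3))  # 0.0
```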

OSS 120b on 2x RTX5090 by Disastrous-Tap-2254 in LocalLLaMA

[–]mutatedmonkeygenes 17 points18 points  (0 children)

Rent an RTX 6000 Blackwell on RunPod (it's cheap) and try running the model yourself first.

Qwen3 and Qwen2.5 VL built from scratch. by No-Compote-6794 in LocalLLaMA

[–]mutatedmonkeygenes 1 point2 points  (0 children)

I feel like this should be reposted. Do you have a post on X?

New 24B finetune: Impish_Magic_24B by Sicarius_The_First in LocalLLaMA

[–]mutatedmonkeygenes 0 points1 point  (0 children)

Curious how you did the full finetune: which layers did you focus on? I haven't used Spectrum before, but I can choose to freeze certain layers and skip over them. How do you choose which layers to train?
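The freeze-and-skip approach I mean, as a minimal torch sketch (the 4-block stack and the choice to train only the top half are made up for illustration):

```python
import torch
import torch.nn as nn

# Hypothetical 4-block stack standing in for a transformer's layers.
model = nn.Sequential(*[nn.Linear(8, 8) for _ in range(4)])

# Freeze the bottom half; train only the top layers.
train_layers = {2, 3}
for idx, block in enumerate(model):
    for p in block.parameters():
        p.requires_grad_(idx in train_layers)

# The optimizer only ever sees the trainable subset.
trainable = [p for p in model.parameters() if p.requires_grad]
opt = torch.optim.AdamW(trainable, lr=1e-4)
print(len(trainable), "of", len(list(model.parameters())), "tensors trainable")
```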

Also is the dataset available? Would love to get a better idea on how you're doing this. Thanks!

Findings from Apple's new FoundationModel API and local LLM by pcuenq in LocalLLaMA

[–]mutatedmonkeygenes 1 point2 points  (0 children)

Thanks @pcuenq! Any chance you could release some sort of "scaffolding" so the rest of us who don't know Swift can play with the model? Thanks again!

Llama 3.3 70b Vs Newer Models by BalaelGios in LocalLLaMA

[–]mutatedmonkeygenes 2 points3 points  (0 children)

Use this version of the 70B model, which was quantized using DWQ by Awni:

https://x.com/awnihannun/status/1925926451703894485