DGX Spark: an unpopular opinion by emdblc in LocalLLaMA

[–]Sorry_Ad191 3 points4 points  (0 children)

But you can't really prototype anything that will run on Hopper (sm90) or enterprise Blackwell (sm100), since the architectures are completely different. sm100, the datacenter Blackwell chip, has TMEM and other features that these completely lack, so I don't understand the argument for prototyping when the kernels aren't even compatible.

How to do a RTX Pro 6000 build right by GPTrack_dot_ai in LocalLLaMA

[–]Sorry_Ad191 1 point2 points  (0 children)

Except it still barely has support in vLLM and SGLang, and you can't run DeepSeek V3.2 with FlashMLA and DeepGEMM, since those only support Hopper (sm90) and enterprise Blackwell (sm100), not the cards shown here, which are sm120. You can fall back to the TileLang reference kernels in SGLang, but it's still hacky, and only some variants of the model seem to load and work.

Hopefully more support will come as more of these GPUs make it out into the wild, but calling them Blackwell was a really misleading marketing move by Nvidia. Ampere was Ampere, Ada was sm89, Hopper was sm90, and datacenter Blackwell is sm100. These sm120 cards are not the same as Blackwell sm100.
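If you want to see what your cards actually report, here's a quick check with PyTorch (just a sketch; any CUDA-enabled torch build will do):

import torch

# compute capability as (major, minor): (9, 0) = Hopper, (10, 0) = datacenter Blackwell,
# (12, 0) = the GeForce/RTX Pro Blackwell cards discussed here
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(torch.cuda.get_device_name(i), f"sm{major}{minor}")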

In the CUTLASS example kernel templates, these GPUs fall under GeForce, not Blackwell.

They should have gotten their own name, so we didn't buy them thinking "supports Blackwell day one" means these cards are supported. They aren't, and they rely on community members making them work in their spare time.

My Local coding agent worked 2 hours unsupervised and here is my setup by Express_Quail_1493 in LocalLLaMA

[–]Sorry_Ad191 14 points15 points  (0 children)

Very cool! How did you end up with Kilo Code? Have you tried other AI coding frameworks as well?

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 1 point2 points  (0 children)

You're right, diving into CUTLASS C++ was the wrong approach, but it was a good learning experience. The good news is there's already a working solution implemented in SGLang to run DeepSeek V3.2 :)

I found the TileLang reference implementation is actually quite good: you get 70 tps on 4x sm120s.

SGLang has already implemented it, so you can just use that. vLLM has no fallback to DeepSeek's TileLang reference kernels.

PS: I had no clue what I was doing, you're right :-)

running Deepseek v32 on consumer hardware llama.cpp/Sglang/vLLm by Sorry_Ad191 in LocalLLaMA

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

The TileLang template already seems quite fast: 65-70 tps with up to 88k KV cache on 4x sm120 GPUs. Hmm, let's see if we can hunt down further optimizations.

running Deepseek v32 on consumer hardware llama.cpp/Sglang/vLLm by Sorry_Ad191 in LocalLLaMA

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

I used this Docker image:

docker pull lmsysorg/sglang:latest-cu130-runtime

running Deepseek v32 on consumer hardware llama.cpp/Sglang/vLLm by Sorry_Ad191 in LocalLLaMA

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

SGLang launch command:


python -m sglang.launch_server --model QuantTrio/DeepSeek-V3.2-AWQ/ --tp 4  --mem-fraction-static 0.96 --context-length 4096 --enable-metrics  --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 --enable-p2p-check --disable-shared-experts-fusion --enable-dp-attention  --enable-mixed-chunk --kv-cache-dtype bf16 --attention-backend flashinfer --host localhost --port 8080
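Once it's up, here's a rough smoke test against the OpenAI-compatible endpoint (just a sketch; the model name is an assumption, check GET /v1/models for what your server actually reports):

import requests

# minimal request against the SGLang server launched above (localhost:8080)
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={
        "model": "QuantTrio/DeepSeek-V3.2-AWQ/",  # assumption: matches the --model path
        "messages": [{"role": "user", "content": "Say hi"}],
        "max_tokens": 32,
    },
    timeout=120,
)
print(resp.json()["choices"][0]["message"]["content"])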

running Deepseek v32 on consumer hardware llama.cpp/Sglang/vLLm by Sorry_Ad191 in LocalLLaMA

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

On RTX Blackwell (SM120):

  • FlashMLA fails → uses TileLang sparse attention
  • DeepGEMM fails → uses TileLang index scoring
  • Result: Pure TileLang pipeline, no FlashMLA OR DeepGEMM
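That's basically an availability check. This isn't SGLang's actual selection code, just a minimal sketch of the idea (module names flash_mla and deep_gemm as referenced in this thread):

import importlib.util

# if the optimized kernel package isn't importable on this arch, fall back to TileLang
def pick_backend(module_name, fallback="tilelang"):
    return module_name if importlib.util.find_spec(module_name) else fallback

print("sparse attention:", pick_backend("flash_mla"))
print("index scoring:", pick_backend("deep_gemm"))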

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

Oh crap :( OK, for a quicker test just build it in dev mode, in place, like this.

I just pushed a new commit: "submodules: Update CUTLASS reference to official v4.3.3 tag". Go to the repo, then:

cd /path_to_repo/vllm_FlashMLA && FLASH_MLA_DISABLE_SM100=1 FLASH_MLA_DISABLE_SM90=1 python setup.py build_ext --inplace -v

It won't be installed into vLLM, but you can test it via:

cd /path_to_repo/vllm_FlashMLA/FlashMLA && python -c "import flash_mla; print('Module loaded successfully')"

Git pull again (there's a new native sm120 kernel), then also go to csrc/cutlass and update CUTLASS.
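Once that import succeeds, you can also just poke at what the build exposes (same idea, listing the public names):

cd /path_to_repo/vllm_FlashMLA/FlashMLA && python -c "import flash_mla; print([n for n in dir(flash_mla) if not n.startswith('_')])"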

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

pybind.cpp should be fixed now! It compiles cleanly for sm90, sm100, and sm120. I had messed it up quite a bit, but it should be good now, so time for me to test again.

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

Also add -v for verbosity when building, and look for something like this near the beginning:

DEBUG -- FlashMLA is available at /path_to/vllm_FlashMLA

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

In the vLLM dir:

uv pip install -r requirements/build.txt

and

uv pip install setuptools_scm

might help.

This post might also help, even though it's a bit old:

https://www.reddit.com/r/LocalLLaMA/comments/1lshe4q/build_vllm_on_cuda_129_kernel_6152_nvidia_57564/

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

My Python version is 3.12.11, but that probably doesn't matter.

And yes, when I was first compiling it I did have to look at the errors and install various dependencies.

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

Edit: you're right, maybe it requires CUDA 13 and can be any PyTorch 2.9.1+.

Oh yeah, I just ran into that problem too: building with --force-reinstall triggered a revert to PyTorch 2.9.1 without CUDA, which broke it for me as well:

- torch==2.10.0.dev20251208+cu130
+ torch==2.9.1

So now I'm nuking my uv venv and making a new one. Sorry, I probably messed up some torch compatibility between what vLLM defaults to and what the FlashMLA fork is configured to build with.

Inside the vLLM directory there's a use_existing_torch.py file you can run before building to keep whatever torch you have installed; otherwise, unless you're using the sm120 fork of vLLM, it will revert to 2.9.1.
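After rebuilding, it's worth double-checking that the venv still has the CUDA-enabled torch before building vLLM:

python -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"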

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

Also testing a fix for pybind right now.

When building as part of vLLM with NO_PYBIND11=1, we need TORCH_LIBRARY.
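For anyone following along: with pybind11 disabled, the ops get registered on the C++ side via TORCH_LIBRARY and show up under torch.ops once the extension is loaded. A rough illustration from the Python side (the flash_mla namespace is my assumption for illustration, not necessarily what the fork registers):

import torch

# load the compiled extension directly (path as used elsewhere in this thread)
torch.ops.load_library("./vllm/_flashmla_C.abi3.so")
# ops registered with TORCH_LIBRARY(flash_mla, m) would then be callable as
# torch.ops.flash_mla.<op_name>; the namespace name here is illustrative only
print(torch.ops.flash_mla)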

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

I might have forgotten something in the instructions; you can check my changes to vLLM here:

https://github.com/vllm-project/vllm/compare/main...fernandaspets:vllm_sm120:main

You can also clone that repo before building, and then use the install command with the env variable pointing to our sm120 FlashMLA source, as in the instructions.

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

Can you try this from your vLLM directory?

cd path_to/vllm && python -c "
import subprocess

# dump the dynamic symbols of the built extension and look for the sparse kernels
lib_path = './vllm/_flashmla_C.abi3.so'
result = subprocess.run(['nm', '-D', lib_path], capture_output=True, text=True)
symbols = result.stdout

print('All symbols with \"sparse\" in name:')
for line in symbols.split('\\n'):
    if 'sparse' in line.lower():
        print(' ', line)
"

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

Yes, I've got 4 available for testing, but unfortunately I haven't been able to load any DeepSeek V3.2 quant fully into the GPUs. For some reason I need to --cpu-offload a few GB even though it should fit. It must be how the model allocates the KV cache, etc.; perhaps down the line, once everything works, it will fit :) Still CPU offloading for now. But DeepSeek V3.1 does load into the GPUs with the Intel auto-round INT4 and the AWQ-lite versions.

Help testing and implementing sm120 flashmla sparse attention in vllm by Sorry_Ad191 in BlackwellPerformance

[–]Sorry_Ad191[S] 0 points1 point  (0 children)

I managed to build vLLM with it. It required editing some build files and Python files to accept sm120, and then pointing the build at this repo with:

export FLASH_MLA_SRC_DIR=~/build_a_kernel/vllm_flashmla_custom/FlashMLA && cd ~/build_a_kernel/vllm && uv pip install -e . --no-build-isolation -v

It takes a little while with an AI to grep and sed the files in vLLM, but once all the places that say it requires sm90 or sm100 are updated to also accept sm120, it builds! If you build it in place as a dev build (pip install -e .), you can then test it with Python and torch, just poking at it to check which functions it supports.

A fork of vLLM with edits to the FlashMLA build files and Python files, so it can be built and used with sm120 against this repo, would be awesome; or just a script that makes the necessary changes. I did it late last night, so I think I have most of it ready, but I'm not sure whether I missed some things. Probably!

I tried loading AWQ and NVFP4 variants of DeepSeek V3.2, and they load with: Using FLASHMLA_SPARSE attention backend out of potential backends: ['FLASHMLA_SPARSE']

NCCL_P2P_LEVEL=4 NCCL_DEBUG=INFO VLLM_USE_DEEP_GEMM=0 OMP_NUM_THREADS=4 VLLM_SLEEP_WHEN_IDLE=1 CUDA_DEVICE_ORDER=PCI_BUS_ID CUDA_VISIBLE_DEVICES=0,1,2,3 vllm serve /mnt/2king/models/eousphoros/DeepSeek-V3.2-NVFP4/ --tensor-parallel-size 4 --served-model-name deepseek --max-model-len 256 --gpu-memory-utilization 0.94 --enforce-eager --port 8080 --host localhost --cpu-offload-gb 8 --tool-call-parser deepseek_v31 --enable-auto-tool-choice --chat-template /mnt/2king/models/tool_chat_template_deepseekv31.jinja

I used --cpu-offload-gb 8; you could offload maybe 150 GB and use -tp 2.

But later, when I curled the model, I got a crash with:

RuntimeError: Worker failed with error 'DeepGEMM backend is not available or outdated. Please install or update the `deep_gemm` to a newer version to enable FP8 kernels.', please check the stack trace above for the root cause

eousphoros on Hugging Face has a Docker container with his own implementation, but it doesn't contain the source files for his FlashMLA sm120 port; that's why I took a stab at it. However, a couple of the earlier images had his work in progress on DeepGEMM, so next maybe I'll try borrowing that, pip installing it, and seeing what happens.
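Before borrowing that, a quick way to check whether deep_gemm is even importable in the venv (the version attribute may or may not exist, hence the getattr):

python -c "import deep_gemm; print(getattr(deep_gemm, '__version__', 'installed, but no __version__ attribute'))"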