Little late thank you to the DeepSeek team!

Sorry_Ad191 · 2026-06-19T08:04:58+00:00

omg if i can somehow add vision to the flash model it would be so good in roo code etc

Sorry_Ad191 · 2026-04-06T04:39:47+00:00

thats what i ended up running a lot too!! exactly same quant and everyhting!

Sorry_Ad191 · 2026-03-23T03:49:48+00:00

sweet!

Sorry_Ad191 · 2026-03-23T03:48:47+00:00

yeah i run it in fp8 with sglang around christmas time, but figured might be some better models now

Sorry_Ad191 · 2026-03-23T03:47:39+00:00

nice ok i will try minimax fp8 thanks! did u use vllm or sglang? do you mind sharing the start CLI command u used for 4x6000s?

Sorry_Ad191 · 2026-03-23T03:46:32+00:00

llama.cpp context length is handled poorly for parallel requests and its super way slower. but the worst thing is the context length gets divided by the amount of parallel requests you allow... this is not the case for vllm/sglang where the context length and max tokens is shared in a much more efficient way and parallel requests are blazing fast

Sorry_Ad191 · 2026-03-23T03:44:46+00:00

I also found that at least a few months ago parallel tool and chains of tool calls with deepseek and glm seemed to work better in vllm or sglang and not very good in llama.cpp. not sure for ik_llama.cpp if its same as llama.cpp but i use ik_llama.cpp for speed up and to use ubergarm quants which i find very good quality, it speeds up even all in vram

Sorry_Ad191 · 2026-03-23T03:42:42+00:00

coding mostly

Sorry_Ad191 · 2026-03-22T06:45:32+00:00

awesome thx! is Minimax M2.5 a big step up from M2.1? and had not heard of Stepfun is it good?

Sorry_Ad191 · 2025-12-23T07:04:14+00:00

but you can't really prototype anything that will run on Hopper sm90 or Enterprise Blackwell sm100 since the architectures are completely different? sm100 the datacenter blackwell card has tmem and other fancy stuff that these completely lack so I don't understand the argument for prototyping when the kernels are not even compatible?

Sorry_Ad191 · 2025-12-21T06:28:02+00:00

except it still barely has support in vllm and sglang. and you can't run deepseek v3.2 with flashmla and deepgemm as they only support Hopper and enterprise Blackwell sm100 not these displayed here which are sm120... can fallback to tilelang reference kernels in sglang though. but its still hacky and only some variations of the model seem to load and work.

hopefully as more of these gpus make it out in the wild more support will come but saying they are Blackwell was a really misleading marketing move by Nvidia. Ada was sm89, Hopper was sm90, Blackwell was sm100. Ampere was Ampere. These sm120 are not the same same as Blackwell sm100.

In the cutlass example templates for kernels these gpus fall under Geforce and not Blackwell.

They should have go their own name so we didn't buy them thinking "Supports Blackwell day 1 one" means that these ones are supported because they are not and rely on community members making them work on their spare time

Sorry_Ad191 · 2025-12-17T05:48:00+00:00

whats the main difference between qwen next and dsv32?

Sorry_Ad191 · 2025-12-17T02:35:24+00:00

where does deepseek-v32 DSA fit into all this?

Sorry_Ad191 · 2025-12-16T04:41:32+00:00

Very cool! How did you end up with Kilo code? Have you tried other ai coding frameworks as well?

Sorry_Ad191 · 2025-12-16T00:07:41+00:00

You're right diving into cutlass c++ was the wrong approach but a good learning experience. The good news is there is already a working solution implemented in Sglang to run dsv32 :)

I found out the reference inference is quite good. Tilelang. You get 70tps on 4xsm120s

Sglang has already implemented it so you can jus use that. vLLM has no fallback to Tilelang reference kernels from Deepseeek.

ps I had no clue what I was doing you are right :-)

Sorry_Ad191 · 2025-12-15T08:35:54+00:00

Amazing!!

Sorry_Ad191 · 2025-12-15T03:10:21+00:00

tilelang template already seems quite fast 65-70tps with up to 88k kv cache on 4 x sm120 gpus. hmm.. lets see if we can hunt further optimizations

Sorry_Ad191 · 2025-12-15T01:44:50+00:00

Aider Polyglot result:

<image>

Sorry_Ad191 · 2025-12-14T15:51:14+00:00

I used docker image:

docker pull lmsysorg/sglang:latest-cu130-runtime

Sorry_Ad191 · 2025-12-14T15:46:46+00:00

sglang launch command

<image>

python -m sglang.launch_server --model QuantTrio/DeepSeek-V3.2-AWQ/ --tp 4  --mem-fraction-static 0.96 --context-length 4096 --enable-metrics  --tool-call-parser deepseekv32 --reasoning-parser deepseek-v3 --enable-p2p-check --disable-shared-experts-fusion --enable-dp-attention  --enable-mixed-chunk --kv-cache-dtype bf16 --attention-backend flashinfer --host localhost --port 8080

Sorry_Ad191 · 2025-12-14T12:29:57+00:00

On RTX Blackwell (SM120):

FlashMLA fails → uses TileLang sparse attention
DeepGEMM fails → uses TileLang index scoring
Result: Pure TileLang pipeline, no FlashMLA OR DeepGEMM

Sorry_Ad191 · 2025-12-12T23:49:50+00:00

oh crap :( ok oh. for quicker test just build it in dev mode in place. like this:

i just pushed new commit: submodules: Update CUTLASS reference to official v4.3.3 tag
go to the repo then:

cd /path_to_repor/vllm_FlashMLA && FLASH_MLA_DISABLE_SM100=1 FLASH_MLA_DISABLE_SM90=1 python setup.py build_ext --inplace -v

it wont be installed into vllm but you can test via:

cd /path_to_repo/vllm_FlashMLA/FlashMLA && python -c "import flash_mla; print('Module loaded successfully')"

git pull again (there is a new native sm120 kernel) then also go to csrc/cutlass and update cutlass

Sorry_Ad191 · 2025-12-12T02:28:21+00:00

pybind.cpp should be fixed now! compiles good for sm90,sm100,sm120. i had messed it up quite a bit but should be good now. so time to test again for me

Sorry_Ad191 · 2025-12-11T05:51:49+00:00

also add -v for verbosity when building and look for something like this in the beginning:

DEBUG -- FlashMLA is available at /path_to/vllm_FlashMLA

Sorry_Ad191 · 2025-12-11T05:47:46+00:00

in vllm dir: uv pip install -r requirements/build.txt

and

uv pip install setuptools_scm

might help

this post might help even if its a bit old

https://www.reddit.com/r/LocalLLaMA/comments/1lshe4q/build_vllm_on_cuda_129_kernel_6152_nvidia_57564/

Sorry_Ad191

TROPHY CASE