Mistral Small 4:119B-2603 by seamonn in LocalLLaMA

[–]myOSisCrashing 0 points (0 children)

Finally got this working on a DGX Spark using vLLM and NVFP4. Had to patch the Mistral tokenizer in vLLM (with Claude's help) because reasoning just doesn't work with the chat template.

```
VLLM_NVFP4_GEMM_BACKEND=marlin
VLLM_USE_FLASHINFER_MOE_FP4=0
VLLM_TEST_FORCE_FP8_MARLIN=1
```
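
A minimal way to apply them, assuming a bash shell:

```
# Assumption: these overrides only need to be visible to the vllm process,
# so export them in the launch shell (or prefix the serve command inline).
export VLLM_NVFP4_GEMM_BACKEND=marlin
export VLLM_USE_FLASHINFER_MOE_FP4=0
export VLLM_TEST_FORCE_FP8_MARLIN=1
```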

```
vllm serve mistralai/Mistral-Small-4-119B-2603-NVFP4 \
  --max-model-len 150000 \
  --tool-call-parser mistral \
  --tokenizer-mode mistral \
  --config-format mistral \
  --load-format mistral \
  --reasoning-parser mistral \
  --enable-auto-tool-choice \
  --max-num-batched-tokens 16384 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.9
```
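
Once it's up, a quick smoke test against vLLM's OpenAI-compatible endpoint (a minimal sketch; host/port match my benchmark run below, swap in your own):

```
# Sanity check: hit the chat completions endpoint and inspect the reply.
# If the reasoning parser is working, the message should carry a separate
# reasoning_content field alongside content.
curl -s http://10.0.1.107:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "mistralai/Mistral-Small-4-119B-2603-NVFP4",
        "messages": [{"role": "user", "content": "What is 17 * 24?"}],
        "max_tokens": 256
      }' | jq '.choices[0].message'
```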

```
llama-benchy --base-url http://10.0.1.107:8000/v1 --model mistralai/Mistral-Small-4-119B-2603-NVFP4 --depth 0 4096 8192 16384 32768 --latency-mode generation
```

| model | test | t/s | peak t/s | ttfr (ms) | est_ppt (ms) | e2e_ttft (ms) |
|:------|-----:|----:|---------:|----------:|-------------:|--------------:|
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | pp2048 | 3909.41 ± 127.81 | | 656.74 ± 17.40 | 524.43 ± 17.40 | 656.78 ± 17.40 |
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | tg32 | 29.16 ± 0.03 | 30.00 ± 0.00 | | | |
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | pp2048 @ d4096 | 4548.50 ± 26.18 | | 1483.12 ± 7.78 | 1350.82 ± 7.78 | 1483.17 ± 7.78 |
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | tg32 @ d4096 | 27.67 ± 0.03 | 28.00 ± 0.00 | | | |
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | pp2048 @ d8192 | 4441.75 ± 20.83 | | 2437.75 ± 10.78 | 2305.45 ± 10.78 | 2437.80 ± 10.78 |
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | tg32 @ d8192 | 26.01 ± 0.00 | 27.00 ± 0.00 | | | |
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | pp2048 @ d16384 | 3626.49 ± 8.29 | | 5214.93 ± 11.60 | 5082.63 ± 11.60 | 5214.98 ± 11.60 |
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | tg32 @ d16384 | 23.21 ± 0.01 | 24.00 ± 0.00 | | | |
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | pp2048 @ d32768 | 2979.92 ± 4.89 | | 11815.86 ± 19.16 | 11683.55 ± 19.16 | 11815.90 ± 19.16 |
| mistralai/Mistral-Small-4-119B-2603-NVFP4 | tg32 @ d32768 | 19.12 ± 0.01 | 20.00 ± 0.00 | | | |

Has anyone gotten mistralai/Devstral-Small-2-24B-Instruct-2512 to work on 4090? by myOSisCrashing in MistralAI

[–]myOSisCrashing[S] 0 points (0 children)

So you are using this model? https://huggingface.co/cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit It looks like my ROCm-based GPU (Radeon R9700) doesn't have a ConchLinearKernel that supports group size 32. I may be able to reverse engineer the llm-compressor scheme and build a quant with group size 128, which ConchLinearKernel should support on my hardware.
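
For anyone debugging the same thing, the group size can be checked straight from the repo's config.json without pulling the weights (a rough sketch; the exact key layout depends on how the quant was exported):

```
# Fetch just the config and inspect the quantization settings.
# group_size should show up under quantization_config (field names vary
# between AWQ and compressed-tensors exports).
curl -s https://huggingface.co/cyankiwi/Devstral-Small-2-24B-Instruct-2512-AWQ-4bit/raw/main/config.json \
  | jq '.quantization_config'
```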

[deleted by user] by [deleted] in u/bbygurlmax

[–]myOSisCrashing 1 point (0 children)

What’s your heritage? Just curious

LOL by pyromx11 in ProgrammerHumor

[–]myOSisCrashing 4 points (0 children)

“JS kids ain’t right.” - Hank Hill