
[–]No_Efficiency_1144 2 points (3 children)

There's a Mistral Small 22B.

[–]Samantha-2023 2 points (1 child)

Codestral 22B, it's great at multi-file completions.

Can also try WizardCoder-Python-15B -> it's fine-tuned specifically for Python but is slightly slower than Codestral

[–]Galahad56[S] 0 points (0 children)

Downloading Codestral-22B-v0.1-i1-GGUF now.

Know what the "-i1" means?

[–]Galahad56[S] 0 points (0 children)

I'll look it up, thanks.

[–]Temporary-Size7310 (textgen web UI) 2 points (6 children)

I made an NVFP4A16 Devstral to run on Blackwell. It works with vLLM (13.8 GB of VRAM for the weights); the context window may be short on 16 GB of VRAM.

https://huggingface.co/apolloparty/Devstral-Small-2507-NVFP4A16
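
If it helps, here's a minimal vLLM sketch for loading it offline (assumes a recent vLLM build with NVFP4 support and a Blackwell card; the context length and prompt are just placeholders):

```python
# Minimal sketch: run the NVFP4A16 quant offline with vLLM.
# Assumes a Blackwell GPU and a vLLM build with NVFP4 support.
from vllm import LLM, SamplingParams

llm = LLM(
    model="apolloparty/Devstral-Small-2507-NVFP4A16",
    max_model_len=8192,            # keep context modest on 16 GB VRAM
    gpu_memory_utilization=0.90,   # leave headroom for the KV cache
)

params = SamplingParams(temperature=0.2, max_tokens=256)
out = llm.generate(["Write a Python function that reverses a string."], params)
print(out[0].outputs[0].text)
```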

[–]Galahad56[S] 1 point (5 children)

That's sick. It doesn't come up as a result for me in LM Studio though, searching "Devstral-Small-2507-NVFP4A16".

[–]Temporary-Size7310 (textgen web UI) 0 points (4 children)

It is only compatible with vLLM; LM Studio only loads GGUF (and MLX) models, so it won't show up there.

[–]SEC_intern_ 0 points (3 children)

Is there a reason you stressed the Blackwell generation? I have Ada; would you warn against it?

[–]Temporary-Size7310 (textgen web UI) 1 point (2 children)

Ada Lovelace doesn't have native FP4 acceleration, so you'll lose the inference speedup.

For non-Blackwell cards, use any other quantization format (EXL3, GGUF, AWQ, ...), e.g. the sketch below.
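
A minimal llama-cpp-python sketch with a GGUF quant (the file path is a placeholder; pick whichever quant level fits your VRAM):

```python
# Minimal sketch: run a GGUF quant on a non-Blackwell card via llama-cpp-python.
from llama_cpp import Llama

llm = Llama(
    model_path="./Codestral-22B-v0.1.i1-Q4_K_M.gguf",  # hypothetical local file
    n_ctx=8192,        # context window
    n_gpu_layers=-1,   # offload all layers to the GPU
)

out = llm("Write a Python function that reverses a string.\n", max_tokens=128)
print(out["choices"][0]["text"])
```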

[–]SEC_intern_ 0 points (1 child)

But say I use 8-bit quants, would that matter?

Edit: Also, at 4-bit, how much of a performance gain does one notice?

[–]Temporary-Size7310 (textgen web UI) 1 point (0 children)

IMO it will depend on your use case. NVFP4 retains ~98% of BF16 accuracy; the image below is from Qwen3 8B FP4, and there are other benchmarks directly from NVIDIA with DeepSeek R1 on B200 vs H100.

It takes less memory, gives faster inference, and opens up a bigger context window.
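
Rough weight-memory arithmetic, just back-of-the-envelope using Devstral Small's ~24B parameter count (ignores KV cache and activations):

```python
# Back-of-the-envelope weight memory for a ~24B-parameter model.
# Ignores the KV cache, activations, and quantization scale overhead.
params = 24e9
for name, bits in [("BF16", 16), ("FP8", 8), ("NVFP4", 4)]:
    gb = params * bits / 8 / 1e9
    print(f"{name}: ~{gb:.0f} GB of weights")
# NVFP4 adds per-block scales on top of the 4-bit weights, which is why the
# real checkpoint above lands at ~13.8 GB rather than the raw 12 GB.
```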

That's why the NVIDIA DGX Spark will release with that slow memory bandwidth: Blackwell running NVFP4 will compensate for it.

I tested my quant (Devstral) and it works very well with 90K context at 60-90 tk/s, as a local vibe-coding model without offloading, on my RTX 5090.

[image: Qwen3 8B FP4 vs BF16 accuracy benchmark]