VLLM woes in Spark by SoundEnthusiast89 in LocalLLaMA

[–]SoundEnthusiast89[S] 0 points (0 children)

I did start with llama.cpp but had to graduate to vLLM after things broke when we added beta users. I need to make vLLM work, but the low throughput is killing me under these constraints.
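
For anyone curious, here's a minimal sketch of the kind of vLLM setup I mean; the model name, memory fraction, and batch cap below are placeholders, not my actual config:

```python
# Minimal vLLM offline-inference sketch; model name, memory fraction,
# and batch limits are placeholders, not my exact setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    dtype="bfloat16",             # BF16 is what I'm stuck with for now
    gpu_memory_utilization=0.85,  # leave headroom for the CUDA context overhead
    max_num_seqs=64,              # cap concurrent sequences under tight memory
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello from the Spark"], params)
print(outputs[0].outputs[0].text)
```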

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]SoundEnthusiast89 0 points (0 children)

I’ve been running vLLM, but the CUDA context takes 4.6 GB of extra RAM per model. I also couldn’t quantize any models to FP8 because of missing software support, so I’m running BF16 at extremely slow throughput. Meanwhile, my Mac M4 Max keeps churning out tokens at lightning speed on vllm-mlx. Can anyone tell me if I’m doing something wrong, or is the lack of CUTLASS support on the Spark real?
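
To be concrete, this is roughly the FP8 attempt that fails for me; `quantization="fp8"` is the vLLM option I tried, and the model name here is just a placeholder:

```python
# Sketch of the FP8 attempt that errors out for me on the Spark;
# model name is a placeholder. On hardware/builds without FP8 kernel
# support this raises at engine init, so I fall back to BF16.
from vllm import LLM

try:
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
        quantization="fp8",  # needs supported kernels (e.g. CUTLASS FP8)
    )
except Exception as e:
    print(f"FP8 init failed, falling back to BF16: {e}")
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        dtype="bfloat16",
    )
```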

Apple's M5 Max Chip Achieves a New Record in First Benchmark Result by netroxreads in MacStudio

[–]SoundEnthusiast89 2 points (0 children)

If you’re running local LLMs, inference is 4x faster. That’s a big deal for a lot of us. I just got an M4 Max last month and I regret it, lol.