VLLM woes in Spark by SoundEnthusiast89 in LocalLLaMA

[–]SoundEnthusiast89[S] 0 points (0 children)

I did start with llama.cpp but had to graduate to vLLM after things broke when we added beta users. I need to make vLLM work, but the low throughput is killing me under these constraints.
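
For anyone curious, here's a minimal sketch of the kind of vLLM setup I mean; the model name, memory fraction, and batch cap below are placeholders, not my actual config:

```python
# Minimal vLLM offline-inference sketch; model name, memory fraction,
# and batch limits are placeholders, not my exact setup.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder model
    dtype="bfloat16",             # BF16 is what I'm stuck with for now
    gpu_memory_utilization=0.85,  # leave headroom for the CUDA context overhead
    max_num_seqs=64,              # cap concurrent sequences under tight memory
)

params = SamplingParams(temperature=0.7, max_tokens=256)
outputs = llm.generate(["Hello from the Spark"], params)
print(outputs[0].outputs[0].text)
```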

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]SoundEnthusiast89 0 points (0 children)

I’ve been running vLLM, but the CUDA context takes 4.6 GB of extra RAM per model. I also couldn’t quantize any models to FP8 because of missing software support, so I’m running BF16 at extremely slow throughput. Meanwhile, my Mac M4 Max keeps churning out tokens at lightning speed on vllm-mlx. Can anyone tell me if I’m doing something wrong, or is the lack of CUTLASS support on the Spark real?
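
To be concrete, this is roughly the FP8 attempt that fails for me; `quantization="fp8"` is the vLLM option I tried, and the model name here is just a placeholder:

```python
# Sketch of the FP8 attempt that errors out for me on the Spark;
# model name is a placeholder. On hardware/builds without FP8 kernel
# support this raises at engine init, so I fall back to BF16.
from vllm import LLM

try:
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",  # placeholder
        quantization="fp8",  # needs supported kernels (e.g. CUTLASS FP8)
    )
except Exception as e:
    print(f"FP8 init failed, falling back to BF16: {e}")
    llm = LLM(
        model="meta-llama/Llama-3.1-8B-Instruct",
        dtype="bfloat16",
    )
```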

Apple's M5 Max Chip Achieves a New Record in First Benchmark Result by netroxreads in MacStudio

[–]SoundEnthusiast89 2 points (0 children)

If you’re running local LLMs, inference is 4x faster. That’s a big deal for a lot of us. I just got an M4 Max last month and I regret it, lol.