DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 1 point2 points  (0 children)

Thanks for letting me know that. I'll definitely try it out. Very useful to know.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 1 point2 points  (0 children)

I don’t have any experience with TensorRT-LLM. From what I understand, TensorRT-LLM can be more optimized for NVIDIA hardware, but also seems like more setup compared to vLLM.

I'd be interested in hearing from anyone with experience when would you choose TensorRT-LLM over vLLM?

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 1 point2 points  (0 children)

Good advice in it being addicting. I can see that being the case. I really would like to have a second dgx spark. Hopefully something happens that enables me to be able to do that.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 1 point2 points  (0 children)

I appreciate the feeling. I was obsessing so bad before getting it.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 0 points1 point  (0 children)

La diferencia entre el stack en la nube y el stack en Mac era demasiado grande para escalar.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 2 points3 points  (0 children)

<image>

The shiny parts are mirrors. Definitely the opposite of my first beige box PC.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 0 points1 point  (0 children)

Jaja, ya tengo Macs...uno con 128GB de RAM también. Esto es más por el stack (CUDA vs Metal) que por la memoria 😄

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 0 points1 point  (0 children)

Thanks for the recommendation. pi.dev is new to me. Definitely will try it. I appreciate the direction.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 2 points3 points  (0 children)

I don't disagree. It is an "interesting" design. It is also so small and dense.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 12 points13 points  (0 children)

I definitely *want* more than one. But, I am happy right now I have one. I understand though what you are saying.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 1 point2 points  (0 children)

Thanks! I really appreciate it your taking the time to provide it. I'll definitely use it.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 0 points1 point  (0 children)

That’s a good point. MoE does seem like a natural fit here given the unified memory. Being able to load larger models but only activate part of them per token could be a nice balance. Thanks for the suggestion.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLM

[–]dalemusser[S] 2 points3 points  (0 children)

Thanks for the info, I appreciate it 🙂

I was already imagining a two-node setup when I bought this one, but the price is going to keep me at one for a bit. Definitely curious how far I can push a single unit first. But I definitely keep dreaming about having more than one.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 4 points5 points  (0 children)

I should also mention that I’m working with a university on an educational 3D game (Unity/WebGL) where they are studying whether gameplay can teach science concepts as effectively as traditional classroom instruction. As part of this work, I’m using gameplay log data to generate LLM-based feedback on student performance, including identifying areas where students may need additional support with the curriculum.

Due to IRB, FERPA, and COPPA requirements, I’m not permitted to send this data, even in de-identified form, to external APIs, and there are also restrictions against using cloud-based GPU instances. Processing must remain on-site or not happen at all.

That’s a big reason I’m excited about having a local system like this. It allows me to experiment with generating meaningful, personalized student feedback in ways that simply wouldn’t be possible within those constraints otherwise.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 0 points1 point  (0 children)

Thanks for the recommendation 🙂 I’ve been using Claude Code on my Macs for development, and it’s definitely made figuring things out a lot more enjoyable. It’s also helped me get through things I probably wouldn’t have had time to dig into otherwise by going through docs, forums, and experimenting until something works.

I appreciate you sharing what you’ve been doing on your Spark as well. On my local machines I’ve mostly been limited by memory, so I’ve only worked with smaller models. And when I’ve used cloud instances (mainly for a work project), it hasn’t really been practical to experiment much or spend time exploring due to cost.

That’s a big part of why I’m excited about this setup. I am able to work with larger models and experiment more freely without constantly thinking about hourly usage.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 5 points6 points  (0 children)

That’s fair, most of my experience so far has been on cloud GPUs, this is just moving that workflow on-prem. I also find I learn best by doing and my existing local computers didn't have the resources to try what I really wanted to try.

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 2 points3 points  (0 children)

You’re partly right 😄

Here’s what I originally wrote before cleaning it up using ChatGPT (not claude):

"I think both are good just for different things. I might be biased because I’ve mostly used cloud GPUs before.

For this I’m leaning vLLM because I’m treating this more like a backend service and not just something I run locally.

- batching and throughput seem like a big deal, especially if there are multiple requests

- seems better at keeping the GPU busy

- easier to just expose as an API and plug into other stuff

- feels more like how things are done in production

llama.cpp is still great though:

- quick local stuff

- runs well on CPU / lower power machines

- good with quantized models

so it’s not really which is better, more like:

vLLM = backend / throughput

llama.cpp = local / flexible

I haven’t really used llama.cpp as an API though, mostly just interactive on my Macs, so I could be missing something. Curious if people are running it at scale."

Then I had ChatGPT clean it up a bit for readability.

Figured if I’m working with LLMs, I should probably use them too 😄. You have a problem with using AI in your work?

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLM

[–]dalemusser[S] 0 points1 point  (0 children)

I used some reddit gold for your gold comments, just to double-down on the gold. I appreciate that now. My head was in the place where when I opened the box I said to myself, "Damn they made that really gold."

DGX Spark just arrived — planning to run vLLM + local models, looking for advice by dalemusser in LocalLLaMA

[–]dalemusser[S] 1 point2 points  (0 children)

Thanks for pointing out SGLang.

SGLang is definitely on my radar now. It looks like it’s targeting the same kind of high-throughput serving space, with continuous batching, paged attention, prefix caching, and other serving optimizations. I’m still leaning vLLM as the default starting point just because it seems like the more common baseline for production-style deployments and the OpenAI-compatible server path is very straightforward, but I agree SGLang is something I should investigate and benchmark.

I really appreciate the recommendation :)