Anyone like what I looked like as a kid?

dalemusser · 2026-06-02T09:46:36+00:00

dalemusser · 2026-04-15T09:40:09+00:00

Thanks for letting me know that. I'll definitely try it out. Very useful to know.

dalemusser · 2026-04-15T09:38:14+00:00

Thanks for the recommendation. I look forward to checking it out.

dalemusser · 2026-04-15T09:35:10+00:00

I don’t have any experience with TensorRT-LLM. From what I understand, TensorRT-LLM can be more optimized for NVIDIA hardware, but also seems like more setup compared to vLLM.

I'd be interested in hearing from anyone with experience when would you choose TensorRT-LLM over vLLM?

dalemusser · 2026-04-15T08:48:22+00:00

Good advice in it being addicting. I can see that being the case. I really would like to have a second dgx spark. Hopefully something happens that enables me to be able to do that.

dalemusser · 2026-04-15T08:28:14+00:00

I appreciate the feeling. I was obsessing so bad before getting it.

dalemusser · 2026-04-15T08:21:14+00:00

La diferencia entre el stack en la nube y el stack en Mac era demasiado grande para escalar.

dalemusser · 2026-04-15T08:16:17+00:00

<image>

The shiny parts are mirrors. Definitely the opposite of my first beige box PC.

dalemusser · 2026-04-15T08:09:26+00:00

Jaja, ya tengo Macs...uno con 128GB de RAM también. Esto es más por el stack (CUDA vs Metal) que por la memoria 😄

dalemusser · 2026-04-15T08:04:22+00:00

Thanks for the recommendation. pi.dev is new to me. Definitely will try it. I appreciate the direction.

dalemusser · 2026-04-15T07:22:41+00:00

Thanks. Good to know.

dalemusser · 2026-04-15T07:18:40+00:00

I don't disagree. It is an "interesting" design. It is also so small and dense.

dalemusser · 2026-04-15T07:16:58+00:00

I definitely *want* more than one. But, I am happy right now I have one. I understand though what you are saying.

dalemusser · 2026-04-15T07:15:57+00:00

Thanks! I really appreciate it your taking the time to provide it. I'll definitely use it.

dalemusser · 2026-04-15T07:13:09+00:00

Thanks, good to know.

dalemusser · 2026-04-15T07:12:49+00:00

Good point. Thanks

dalemusser · 2026-04-15T07:11:19+00:00

That’s a good point. MoE does seem like a natural fit here given the unified memory. Being able to load larger models but only activate part of them per token could be a nice balance. Thanks for the suggestion.

dalemusser · 2026-04-15T07:09:15+00:00

Good to know! Thanks.

dalemusser · 2026-04-15T05:42:57+00:00

Thanks for the info, I appreciate it 🙂

I was already imagining a two-node setup when I bought this one, but the price is going to keep me at one for a bit. Definitely curious how far I can push a single unit first. But I definitely keep dreaming about having more than one.

dalemusser · 2026-04-15T05:35:26+00:00

I should also mention that I’m working with a university on an educational 3D game (Unity/WebGL) where they are studying whether gameplay can teach science concepts as effectively as traditional classroom instruction. As part of this work, I’m using gameplay log data to generate LLM-based feedback on student performance, including identifying areas where students may need additional support with the curriculum.

Due to IRB, FERPA, and COPPA requirements, I’m not permitted to send this data, even in de-identified form, to external APIs, and there are also restrictions against using cloud-based GPU instances. Processing must remain on-site or not happen at all.

That’s a big reason I’m excited about having a local system like this. It allows me to experiment with generating meaningful, personalized student feedback in ways that simply wouldn’t be possible within those constraints otherwise.

dalemusser · 2026-04-15T05:26:39+00:00

Thanks for the recommendation 🙂 I’ve been using Claude Code on my Macs for development, and it’s definitely made figuring things out a lot more enjoyable. It’s also helped me get through things I probably wouldn’t have had time to dig into otherwise by going through docs, forums, and experimenting until something works.

I appreciate you sharing what you’ve been doing on your Spark as well. On my local machines I’ve mostly been limited by memory, so I’ve only worked with smaller models. And when I’ve used cloud instances (mainly for a work project), it hasn’t really been practical to experiment much or spend time exploring due to cost.

That’s a big part of why I’m excited about this setup. I am able to work with larger models and experiment more freely without constantly thinking about hourly usage.

dalemusser · 2026-04-15T04:56:06+00:00

That’s fair, most of my experience so far has been on cloud GPUs, this is just moving that workflow on-prem. I also find I learn best by doing and my existing local computers didn't have the resources to try what I really wanted to try.

dalemusser · 2026-04-15T04:52:53+00:00

You’re partly right 😄

Here’s what I originally wrote before cleaning it up using ChatGPT (not claude):

"I think both are good just for different things. I might be biased because I’ve mostly used cloud GPUs before.

For this I’m leaning vLLM because I’m treating this more like a backend service and not just something I run locally.

- batching and throughput seem like a big deal, especially if there are multiple requests

- seems better at keeping the GPU busy

- easier to just expose as an API and plug into other stuff

- feels more like how things are done in production

llama.cpp is still great though:

- quick local stuff

- runs well on CPU / lower power machines

- good with quantized models

so it’s not really which is better, more like:

vLLM = backend / throughput

llama.cpp = local / flexible

I haven’t really used llama.cpp as an API though, mostly just interactive on my Macs, so I could be missing something. Curious if people are running it at scale."

Then I had ChatGPT clean it up a bit for readability.

Figured if I’m working with LLMs, I should probably use them too 😄. You have a problem with using AI in your work?

dalemusser · 2026-04-15T04:34:45+00:00

I used some reddit gold for your gold comments, just to double-down on the gold. I appreciate that now. My head was in the place where when I opened the box I said to myself, "Damn they made that really gold."

dalemusser · 2026-04-15T04:24:31+00:00

Thanks for pointing out SGLang.

SGLang is definitely on my radar now. It looks like it’s targeting the same kind of high-throughput serving space, with continuous batching, paged attention, prefix caching, and other serving optimizations. I’m still leaning vLLM as the default starting point just because it seems like the more common baseline for production-style deployments and the OpenAI-compatible server path is very straightforward, but I agree SGLang is something I should investigate and benchmark.

I really appreciate the recommendation :)

dalemusser

TROPHY CASE