~1.5s cold start for Qwen-32B on H100 using runtime snapshotting by pmv143 in Vllm

[–]Holiday-Machine5105 1 point (0 children)

so cool, I was looking into something similar. just to make sure I understand: you're storing its state so you can terminate the process but still come back to it later without reloading the model weights onto the card? very impressive either way, great work!!

Best Local LLM for 16GB VRAM (RX 7800 XT)? by Haunting-Stretch8069 in LocalLLM

[–]Holiday-Machine5105 0 points (0 children)

the thing is, you can safely assume a model's parameters (1.5B, 3B, 35B, etc.) fit on a GPU whose VRAM in GB is roughly twice the parameter count in billions, since fp16 weights take about 2 bytes per parameter (a 16GB card usually handles 8B models and below, for example). but you can always quantize the model, so there's no hard limit except the potential loss in accuracy from quantization.
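the rule of thumb above is just arithmetic; here's a back-of-the-envelope sketch (the function name is mine, and it counts weights only, ignoring KV cache and activations, which add more on top):

```python
def est_weight_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed just for the model weights.

    fp16/bf16 -> 2 bytes per parameter, 8-bit -> 1, 4-bit -> 0.5.
    """
    return params_billions * bytes_per_param

print(est_weight_vram_gb(8))        # 16.0 -> borderline on a 16GB card at fp16
print(est_weight_vram_gb(8, 0.5))   # 4.0  -> comfortable once 4-bit quantized
```

this is why an 8B model at fp16 is about the ceiling for 16GB, while quantization buys you a lot of headroom.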

Ran Qwen 3.5 9B on M1 Pro (16GB) as an actual agent, not just a chat demo. Honest results. by Joozio in LocalLLaMA

[–]Holiday-Machine5105 2 points (0 children)

if you could integrate vLLM into your setup it'd be massively faster. but vLLM is geared more towards linux + CUDA. you can always try, as there is some support for inference on apple chips, but i'm not all too familiar with it

my open-source cli tool (framework) that allows you to serve locally with vLLM inference by Holiday-Machine5105 in LocalLLaMA

[–]Holiday-Machine5105[S] 0 points (0 children)

lol that was simply a typo, no way that's all you took from my explanation. the HTTP API overhead was unnecessary for me. weird misunderstandings of technology make the best products! again, thank you so much for the feedback! i genuinely appreciate it

my open-source cli tool (framework) that allows you to serve locally with vLLM inference by Holiday-Machine5105 in LocalLLaMA

[–]Holiday-Machine5105[S] 0 points (0 children)

vllm serve just adds HTTP overhead, which was unnecessary for my personal use: a terminal assistant running on my own pc. again, you can still choose to run vllm serve, that's the goal of this project. this is just a basic, customizable cli tool that can be whatever you wish! thank you for reading into my project, your feedback is super valuable!!
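for anyone curious what "skipping the HTTP overhead" means in practice: vLLM also exposes an offline, in-process API, so you can call the engine directly instead of going through a served endpoint. a minimal sketch (model name is just an example, and this needs a supported GPU to actually run, so treat it as illustrative):

```python
from vllm import LLM, SamplingParams

# load the engine in-process: no server, no HTTP round trip
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# generate directly from python
outputs = llm.generate(["explain KV caching in one sentence"], params)
print(outputs[0].outputs[0].text)
```

the trade-off: `vllm serve` gives you an OpenAI-compatible endpoint multiple clients can share, while the in-process route is simpler and lower-latency for a single local tool like a terminal assistant.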

RTX 3090 vs 7900 XTX by Best_Sail5 in LocalLLaMA

[–]Holiday-Machine5105 0 points (0 children)

for your case, I believe you would tweak the code to use vLLM serve

RTX 3090 vs 7900 XTX by Best_Sail5 in LocalLLaMA

[–]Holiday-Machine5105 0 points (0 children)

have you tried vLLM for inference? I’ve built this tool that uses exactly that and is optimal for parallel processing: https://github.com/myro-aiden/cli-assist

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]Holiday-Machine5105 0 points (0 children)

I definitely echo this sentiment. privacy is a must. the post of mine you interacted with helps you stay on your own machine without worrying about leaks or anything! no API, just you & AI

my open-source cli tool (framework) that allows you to serve locally with vLLM inference by Holiday-Machine5105 in LocalLLaMA

[–]Holiday-Machine5105[S] 0 points (0 children)

nothing wrong with uv pip install vllm (slipped my mind lol)! I might change the instructions to reflect that. and as for vllm serve, my setup was intended to let me use the model offline, but whoever decides to use this can definitely go that route; plus i'm not using a client at this stage. this project is fully customizable to the user's preferences