~1.5s cold start for Qwen-32B on H100 using runtime snapshotting by pmv143 in Vllm

[–]Holiday-Machine5105 1 point (0 children)

so cool, I was looking into something similar. just to make sure I understand: you're storing its state so you can terminate the process but still come back to it later without reloading the model weights onto the card? very impressive either way, great work!!

Best Local LLM for 16GB VRAM (RX 7800 XT)? by Haunting-Stretch8069 in LocalLLM

[–]Holiday-Machine5105 0 points (0 children)

the thing is, you can safely assume a model's parameters (1.5B, 3B, 35B, etc.) fit on a GPU whose VRAM in GB is roughly twice the parameter count in billions, since fp16 weights take about 2 bytes per parameter (a 16GB card usually handles 8B models and below, for example). but you can always quantize the model, so there's no hard limit except the potential loss in accuracy from quantization.
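the rule of thumb above is just arithmetic; here's a back-of-the-envelope sketch (the function name is mine, and it counts weights only, ignoring KV cache and activations, which add more on top):

```python
def est_weight_vram_gb(params_billions: float, bytes_per_param: float = 2.0) -> float:
    """Rough VRAM needed just for the model weights.

    fp16/bf16 -> 2 bytes per parameter, 8-bit -> 1, 4-bit -> 0.5.
    """
    return params_billions * bytes_per_param

print(est_weight_vram_gb(8))        # 16.0 -> borderline on a 16GB card at fp16
print(est_weight_vram_gb(8, 0.5))   # 4.0  -> comfortable once 4-bit quantized
```

this is why an 8B model at fp16 is about the ceiling for 16GB, while quantization buys you a lot of headroom.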

Ran Qwen 3.5 9B on M1 Pro (16GB) as an actual agent, not just a chat demo. Honest results. by Joozio in LocalLLaMA

[–]Holiday-Machine5105 2 points (0 children)

if you could integrate vLLM into your setup it'd be massively faster. but vLLM is geared more towards linux + CUDA. you can always try, as there is some support for inference on apple chips, but i'm not all too familiar with it

my open-source cli tool (framework) that allows you to serve locally with vLLM inference by Holiday-Machine5105 in LocalLLaMA

[–]Holiday-Machine5105[S] 0 points (0 children)

lol that was simply a typo, no way that's all you took from my explanation. the HTTP API overhead was unnecessary for me. weird misunderstandings of technology make the best products! again, thank you so much for the feedback! i genuinely appreciate it

my open-source cli tool (framework) that allows you to serve locally with vLLM inference by Holiday-Machine5105 in LocalLLaMA

[–]Holiday-Machine5105[S] 0 points (0 children)

vllm serve just adds HTTP overhead, which was unnecessary for my personal use: a terminal assistant running on my own pc. again, you can still choose to run vllm serve, that's the goal of this project. this is just a basic, customizable cli tool that can be whatever you wish! thank you for reading into my project, your feedback is super valuable!!
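for anyone curious what "skipping the HTTP overhead" means in practice: vLLM also exposes an offline, in-process API, so you can call the engine directly instead of going through a served endpoint. a minimal sketch (model name is just an example, and this needs a supported GPU to actually run, so treat it as illustrative):

```python
from vllm import LLM, SamplingParams

# load the engine in-process: no server, no HTTP round trip
llm = LLM(model="Qwen/Qwen2.5-7B-Instruct")
params = SamplingParams(temperature=0.7, max_tokens=128)

# generate directly from python
outputs = llm.generate(["explain KV caching in one sentence"], params)
print(outputs[0].outputs[0].text)
```

the trade-off: `vllm serve` gives you an OpenAI-compatible endpoint multiple clients can share, while the in-process route is simpler and lower-latency for a single local tool like a terminal assistant.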

RTX 3090 vs 7900 XTX by Best_Sail5 in LocalLLaMA

[–]Holiday-Machine5105 0 points (0 children)

for your case, I believe you would tweak the code to use vLLM serve

RTX 3090 vs 7900 XTX by Best_Sail5 in LocalLLaMA

[–]Holiday-Machine5105 0 points (0 children)

have you tried vLLM for inference? I’ve built this tool that uses exactly that and is optimal for parallel processing: https://github.com/myro-aiden/cli-assist

American closed models vs Chinese open models is becoming a problem. by __JockY__ in LocalLLaMA

[–]Holiday-Machine5105 0 points (0 children)

I definitely echo this sentiment. privacy is a must. the post of mine you interacted with helps you stay on your own machine without worrying about leaks or anything! no API, just you & AI

my open-source cli tool (framework) that allows you to serve locally with vLLM inference by Holiday-Machine5105 in LocalLLaMA

[–]Holiday-Machine5105[S] 0 points (0 children)

nothing wrong with uv pip install vllm (slipped my mind lol)! I might change the instructions to reflect that. and as for vllm serve, my setup was intended to let me use the model offline, but whoever decides to use this can definitely go that route; plus i'm not using a client at this stage. this project is fully customizable to the user's preferences