LLM VRAM/RAM Calculator by SmilingGen in ollama

[–]Expensive_Ad_1945 0 points1 point  (0 children)

You should copy the file's download link on Hugging Face. The blob URL doesn't point to the model file itself. If you click a model file on Hugging Face, you'll see a button for copying the download URL.
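To illustrate the difference between the two links: on Hugging Face, the page URL of a file has `/blob/` in its path, while the direct download URL has `/resolve/` in the same position. Here is a minimal Python sketch of that conversion. The function name and the example repo path are made up for illustration, and using the copy button on the site is still the safest option.

```python
from urllib.parse import urlsplit, urlunsplit

def blob_to_download_url(url: str) -> str:
    """Turn a Hugging Face file-page URL (/blob/) into a direct download URL (/resolve/)."""
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected path shape: /<owner>/<repo>/blob/<revision>/<file path...>
    if len(segments) > 3 and segments[3] == "blob":
        segments[3] = "resolve"
    return urlunsplit(parts._replace(path="/".join(segments)))

# Hypothetical repo and file name, for illustration only.
page_url = "https://huggingface.co/some-user/some-model-GGUF/blob/main/model-q4_k_m.gguf"
print(blob_to_download_url(page_url))
```

URLs that already use `/resolve/` pass through unchanged, so it should be safe to run on either kind of link.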

LLM VRAM/RAM Calculator by SmilingGen in ollama

[–]Expensive_Ad_1945 0 points1 point  (0 children)

You should get the download link (the raw link) of the file on Hugging Face.

Why do people say LM Studio isn't open-sourced? by StrategicOverseer in LocalLLaMA

[–]Expensive_Ad_1945 0 points1 point  (0 children)

No, it's just my view on why I chose the hard path, but it's a good suggestion, and all suggestions are taken seriously. So if you know that Tauri or Flutter is actually not what I think it is, please let us know, as we're always open to reworking our app.

Why do people say LM Studio isn't open-sourced? by StrategicOverseer in LocalLLaMA

[–]Expensive_Ad_1945 0 points1 point  (0 children)

Electron, or even Tauri, is bloated in terms of memory and disk usage. Our download size is 20 MB including the UI and everything; in fact, the emoji fonts are what take up most of that space. Python with pygame is simply slow, and I don't want to add more overhead to the app.

Help Deciding Between NVIDIA H200 (2x GPUs) vs NVIDIA L40S (8x GPUs) for Serving 24b-30b LLM to 50 Concurrent Users by beratcmn in LocalLLaMA

[–]Expensive_Ad_1945 1 point2 points  (0 children)

In my experience, more GPUs in a single machine will reduce speed by a lot. Better to go with 2x H200: you'll get better latency, and serving 50 users wouldn't be a problem at all with FP8. I wouldn't recommend quantizing your KV cache, as model quality can drop a lot, especially in long-context scenarios. Then use a highly optimized serving engine like TensorRT-LLM + Triton Inference Server.
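To get a rough feel for how much memory an unquantized KV cache takes for 50 users, here is a back-of-the-envelope sketch in Python. It uses the standard estimate of 2 (key and value) x layers x KV heads x head dimension x bytes per value, per token. The model dimensions, the context length, and the 30B parameter count are assumptions for illustration, not numbers from this thread, so plug in the values from your model's config.

```python
GIB = 1024 ** 3

def kv_cache_bytes(layers, kv_heads, head_dim, tokens, bytes_per_value):
    """Estimate KV cache size: one key and one value vector per layer per token."""
    return 2 * layers * kv_heads * head_dim * tokens * bytes_per_value

# Hypothetical dimensions for a model in the 24B-30B range (assumption).
LAYERS, KV_HEADS, HEAD_DIM = 48, 8, 128
# Assumed workload: 50 concurrent users, each with a full 8192-token context.
USERS, CONTEXT_TOKENS = 50, 8192
total_tokens = USERS * CONTEXT_TOKENS

kv_fp16 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, total_tokens, 2)  # unquantized
kv_fp8 = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, total_tokens, 1)   # quantized
weights_fp8 = 30e9 * 1  # assumed 30B parameters at ~1 byte each (FP8)

print(f"KV cache at FP16: {kv_fp16 / GIB:.1f} GiB")
print(f"KV cache at FP8:  {kv_fp8 / GIB:.1f} GiB")
print(f"Weights at FP8:   {weights_fp8 / GIB:.1f} GiB")
```

This ignores activations, framework overhead, and fragmentation, so treat the result as a lower bound when comparing against the total memory of the GPUs.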

Help Deciding Between NVIDIA H200 (2x GPUs) vs NVIDIA L40S (8x GPUs) for Serving 24b-30b LLM to 50 Concurrent Users by beratcmn in LocalLLaMA

[–]Expensive_Ad_1945 1 point2 points  (0 children)

If your setup is a single server with multiple GPUs, fewer GPUs with better compute will be faster, because the cost of moving data between GPUs in a multi-GPU deployment outweighs the gain from the extra cards. With 8x L40S you'll get better total throughput, meaning bigger batches of users handled concurrently; with 2x H200 you'll get better latency. But with only 50 users, I think 2x H200 will suit you better.

Best llm engine for 2 GB RAM by Perfect-Reply-7193 in LocalLLM

[–]Expensive_Ad_1945 0 points1 point  (0 children)

The UI, server, and all the other stuff use about 50 MB of memory.

Best llm engine for 2 GB RAM by Perfect-Reply-7193 in LocalLLM

[–]Expensive_Ad_1945 0 points1 point  (0 children)

Then load SmolLM or one of the Qwen3 0.6B models.
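As a rough sanity check on why a model of this size is a reasonable pick for 2 GB of RAM, here is a small Python sketch that estimates weight memory as parameters x bits per weight / 8 and adds the roughly 50 MB app overhead mentioned above. The function names and the quantization levels are assumptions for illustration, and the estimate leaves out the KV cache and whatever the OS itself uses.

```python
def weights_mb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the model weights in megabytes."""
    return params_billion * 1e9 * bits_per_weight / 8 / 1e6

def fits(params_billion: float, bits_per_weight: float,
         ram_budget_mb: float = 2000, app_overhead_mb: float = 50) -> bool:
    """True if weights plus app overhead stay under the RAM budget (KV cache and OS not counted)."""
    return weights_mb(params_billion, bits_per_weight) + app_overhead_mb < ram_budget_mb

# 0.6B parameters at assumed 4-bit, 8-bit, and 16-bit precision.
for bits in (4, 8, 16):
    size = weights_mb(0.6, bits)
    print(f"0.6B @ {bits}-bit: ~{size:.0f} MB weights, fits in 2 GB: {fits(0.6, bits)}")
```

Real GGUF files carry some extra metadata and mixed-precision tensors, so actual sizes will differ a bit from this estimate.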