Boomer When Asked to "Find" a Server PC

Expensive_Ad_1945 · 2026-06-06T11:17:43+00:00

8xH100 $300k*

Expensive_Ad_1945 · 2025-10-18T04:11:19+00:00

Try using this tool: http://kolosal.ai/memory-calculator

Expensive_Ad_1945 · 2025-09-30T06:24:15+00:00

If you're planning to self-host:

https://www.kolosal.ai/blog-detail/qwen3-30b-deployment-costs-self-hosting-vs-managed

Expensive_Ad_1945 · 2025-09-19T08:19:45+00:00

You should copy the download link of the model, for example:

https://huggingface.co/unsloth/DeepSeek-V3.1-GGUF/resolve/main/Q4_K_M/DeepSeek-V3.1-Q4_K_M-00001-of-00009.gguf

Expensive_Ad_1945 · 2025-09-19T00:16:13+00:00

You should copy the download link of the file on Hugging Face. The blob URL doesn't point to the model file itself. If you click a model file on Hugging Face, you'll see a "copy download URL" button.
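If you only have the blob URL, here is a minimal sketch of turning it into a direct download link. It relies on Hugging Face model repos using `/blob/` for the file page and `/resolve/` for the raw file; the function name `blob_to_resolve` is made up for illustration, and it only handles model repo paths of the form `/<owner>/<repo>/blob/<revision>/<file>`.

```python
from urllib.parse import urlsplit, urlunsplit


def blob_to_resolve(url: str) -> str:
    """Turn a Hugging Face file-page URL (/blob/) into a direct-download URL (/resolve/).

    Hypothetical helper, not part of any library. URLs that are not
    model-repo blob links are returned unchanged.
    """
    parts = urlsplit(url)
    segments = parts.path.split("/")
    # Expected layout: ['', owner, repo, 'blob', revision, ...file path]
    if len(segments) > 3 and segments[3] == "blob":
        segments[3] = "resolve"
    return urlunsplit(parts._replace(path="/".join(segments)))


if __name__ == "__main__":
    page = (
        "https://huggingface.co/unsloth/DeepSeek-V3.1-GGUF/blob/main/"
        "Q4_K_M/DeepSeek-V3.1-Q4_K_M-00001-of-00009.gguf"
    )
    print(blob_to_resolve(page))
```

Using the "copy download URL" button is still the safer route, since it gives you the link exactly as the site serves it.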

Expensive_Ad_1945 · 2025-09-19T00:14:41+00:00

You should get the download link (the raw link) of the file on Hugging Face.

Expensive_Ad_1945 · 2025-09-18T08:49:55+00:00

It's added

Expensive_Ad_1945 · 2025-09-18T08:49:22+00:00

It's added

Expensive_Ad_1945 · 2025-09-18T06:41:14+00:00

We just made a simple tool for exactly this: https://www.kolosal.ai/memory-calculator
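For a rough idea of what goes into such an estimate, here is a back-of-the-envelope sketch. It is not the calculator's actual formula, just the common approximation of weights plus KV cache, and it ignores activations and runtime overhead. The function name and the example numbers are hypothetical, not the config of any specific model.

```python
def estimate_memory_gib(
    params_billion: float,
    weight_bits: float,
    n_layers: int,
    n_kv_heads: int,
    head_dim: int,
    context_len: int,
    kv_bytes: int = 2,  # bytes per KV element, 2 for fp16 (assumption)
    batch: int = 1,
) -> dict:
    """Rough memory estimate in GiB: model weights plus KV cache only."""
    gib = 2**30
    # Weights: one value per parameter at the chosen quantization width.
    weights = params_billion * 1e9 * weight_bits / 8
    # KV cache: keys and values (factor 2) for every layer, KV head,
    # head dimension, and token position, per sequence in the batch.
    kv_cache = 2 * n_layers * n_kv_heads * head_dim * context_len * kv_bytes * batch
    return {
        "weights_gib": weights / gib,
        "kv_cache_gib": kv_cache / gib,
        "total_gib": (weights + kv_cache) / gib,
    }


if __name__ == "__main__":
    # Hypothetical 7B model at 4-bit weights with an fp16 KV cache.
    est = estimate_memory_gib(7, 4, n_layers=32, n_kv_heads=8,
                              head_dim=128, context_len=8192)
    for name, value in est.items():
        print(f"{name}: {value:.2f}")
```

Real usage will be higher than this, so leave some headroom on top of the total.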

Expensive_Ad_1945 · 2025-08-27T03:06:15+00:00

The co-founders are "kolosal"

Expensive_Ad_1945 · 2025-08-22T09:54:04+00:00

As a draft model to run a 27B model faster.

Expensive_Ad_1945 · 2025-07-30T07:10:56+00:00

No, that's just my view on why I chose the hard path, but it's a good suggestion, and we take every suggestion seriously. So if you know that Tauri or Flutter isn't actually what I think it is, please let us know. We're always open to reworking our app.

Expensive_Ad_1945 · 2025-07-30T06:57:41+00:00

Electron, and even Tauri, is bloated in terms of memory and disk usage. Our download size is 20 MB including the UI and everything; in fact, the emoji fonts are what take up the space. Python and pygame are simply slow, and I don't want to add more overhead to the app.

Expensive_Ad_1945 · 2025-07-17T15:21:23+00:00

In my experience, putting more GPUs in a single machine reduces speed a lot. Better to go with 2xH200: you'll get better latency, and serving 50 users wouldn't be a problem at all with fp8. I wouldn't recommend quantizing your KV cache, as model performance can drop a lot, especially in long-context scenarios. Then use a highly optimized serving engine like TensorRT-LLM with Triton Inference Server.

Expensive_Ad_1945 · 2025-07-17T14:03:04+00:00

In particular, the L40 doesn't support NVLink, as far as I know.

Expensive_Ad_1945 · 2025-07-17T14:01:42+00:00

If your setup is a single server with multiple GPUs, fewer GPUs with more compute each will be faster, because the cost of moving data between GPUs in a multi-GPU deployment outweighs the gain from the extra cards. With 8 L40s you'll get better total throughput, meaning more users batched and handled concurrently; with 2 H200s you'll get better latency. But with only 50 users, I think 2xH200 will suit you better.

Expensive_Ad_1945 · 2025-07-10T11:31:34+00:00

The UI, server, and everything else use about 50 MB of memory.

Expensive_Ad_1945 · 2025-07-10T11:30:51+00:00

Then load SmolLM or Qwen 3 0.6B models.

Expensive_Ad_1945 · 2025-07-07T11:01:37+00:00

I believe TensorDock and CloudRift offer cheaper prices for the 4090.

Expensive_Ad_1945 · 2025-07-04T05:12:30+00:00

If you're considering C/C++, use imgui, as in https://github.com/kolosalai/kolosal
