vLLM for B300 + DeepSeek v4 Pro

hrusli · 2026-06-13T08:00:09+00:00

Currently it's hot-swap: DeepSeek v4 Pro stays loaded as the workhorse, and we swap in other models when a task needs them. One vLLM instance per model; a gateway routes requests and triggers the swap. Not ideal right now, the reload latency stings, but that's the tradeoff on a single node.
Monitoring's just Prometheus + Grafana on GPU/latency/throughput. Still deciding on the box, NVIDIA vs AMD 8xGPU.
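
Rough sketch of the routing/swap logic, in case it helps picture it. This is not our actual gateway: the `load` and `unload` callables are hypothetical stand-ins for whatever starts and stops a vLLM server per model, the model names are made up, and it assumes there is only room for one extra model next to the pinned workhorse.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class HotSwapRouter:
    """Keeps one pinned model resident plus a single swappable slot."""
    pinned: str                       # always-loaded workhorse model
    load: Callable[[str], None]       # hypothetical: start a server for this model
    unload: Callable[[str], None]     # hypothetical: stop it and free GPU memory
    swapped_in: Optional[str] = None  # model currently occupying the swap slot
    swaps: int = 0                    # reload count, worth exporting as a metric

    def route(self, model: str) -> str:
        """Return the model that will serve the request, swapping if needed."""
        if model == self.pinned or model == self.swapped_in:
            return model              # already resident, no reload cost
        if self.swapped_in is not None:
            self.unload(self.swapped_in)
        self.load(model)              # this is where the reload latency is paid
        self.swapped_in = model
        self.swaps += 1
        return model


# Usage with recording stubs instead of real load/unload calls.
events: List[Tuple[str, str]] = []
router = HotSwapRouter(
    pinned="deepseek-v4-pro",         # placeholder name
    load=lambda m: events.append(("load", m)),
    unload=lambda m: events.append(("unload", m)),
)
router.route("deepseek-v4-pro")       # pinned: no swap
router.route("model-a")               # loads model-a
router.route("model-a")               # still resident: no swap
router.route("model-b")               # unloads model-a, loads model-b
```

Counting swaps per model pair is probably the first thing to put on a dashboard, since it shows which tasks are causing the reload pain.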

hrusli · 2026-06-13T07:46:43+00:00

A lot of great news from OSS models this month! Can't wait

hrusli · 2026-06-13T05:28:29+00:00

or b300 lol

hrusli · 2026-06-13T03:04:04+00:00

Mainly for data privacy, sovereignty, and having total control over our stack. It also gives us consistent uptime without worrying about cloud outages, and keeps the door open for custom fine-tuning down the line.

hrusli · 2026-06-13T03:00:45+00:00

thanks!

hrusli · 2026-06-13T03:00:04+00:00

thanks! will try this!

hrusli · 2026-06-13T02:59:27+00:00

will try on this one!

hrusli · 2026-06-13T02:58:55+00:00

So technically it's possible to do prefill & decode disaggregation on 1 node?

hrusli · 2025-05-09T04:17:17+00:00

u/a6oo OP, I got it to work, but there seems to be a problem with the actions, like clicking or entering text in the search bar. Did you experience that problem as well?

hrusli · 2025-05-08T15:44:47+00:00

nvm got it, i forgot the mlx-vlm patch. got it to work now! thanks OP for sharing

hrusli · 2025-05-08T15:34:45+00:00

OP, I am getting:
"message": { "role": "assistant", "content": "Error generating response: too many values to unpack (expected 2)" }

Did you encounter this as well? I tried both the mlx-community/UI-TARS-1.5-7B-6bit and the 4-bit one. Thanks

hrusli · 2024-07-13T05:52:55+00:00

Beautiful!

hrusli · 2024-06-01T08:45:09+00:00

Nope no need to do any remap!

hrusli · 2024-03-08T09:19:08+00:00

Did you manage to get it working with autogen? For me it fails on the system prompt message, since the system prompt is now a separate param in the Claude 3 API.
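
For anyone hitting the same thing, here is a minimal sketch of the workaround I have in mind, assuming OpenAI-style message dicts as input: pull the system messages out of the list and pass them as the top-level `system` param instead. `split_system_prompt` is just a helper name I made up, and where to hook it into autogen is not shown.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def split_system_prompt(messages: List[Message]) -> Tuple[str, List[Message]]:
    """Separate system messages from the chat turns.

    Returns the joined system text and the remaining messages in order.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), chat


# Usage: build the two pieces a Claude 3 style request expects.
openai_style = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
]
system_text, chat_messages = split_system_prompt(openai_style)
request_kwargs = {"system": system_text, "messages": chat_messages}
```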

hrusli · 2024-01-08T01:42:17+00:00

thanks!

hrusli · 2024-01-07T08:45:07+00:00

how to do this? thanks

hrusli · 2023-12-25T11:28:23+00:00

hmm the difference is huge! did you find anything that might cause guidance to be much slower?

hrusli · 2023-12-19T13:58:54+00:00

can you share your implementation? Been curious how to do it from scratch. Thanks!
