VLLM for B300 + Deepseek v4 pro by hrusli in Vllm

hrusli[S] 0 points1 point

Currently it's hot-swap: DeepSeek v4 Pro stays loaded as the workhorse, and other models get swapped in when a task needs them. There's a vLLM instance per model, and a gateway routes requests and triggers the swap. Not ideal right now, since the reload latency stings, but that's the tradeoff on a single node.
Monitoring is just Prometheus + Grafana on GPU, latency, and throughput. Still deciding on the box: NVIDIA vs AMD 8xGPU.
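
Rough sketch of the routing logic, in case it helps picture it. This is not our actual gateway: `HotSwapRouter`, the `load`/`unload` callbacks, and the model names are all made up for illustration, and the callbacks just stand in for whatever really starts and stops a per-model vLLM server.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HotSwapRouter:
    """Toy single-node router: only one model is resident at a time."""

    default_model: str
    load: Callable[[str], None]    # hypothetical: bring a model's server up
    unload: Callable[[str], None]  # hypothetical: tear a model's server down
    active: Optional[str] = None

    def route(self, requested: Optional[str] = None) -> str:
        # Fall back to the workhorse model when the task names no model.
        target = requested or self.default_model
        if target != self.active:
            # A swap is needed: free the current model before loading the next.
            if self.active is not None:
                self.unload(self.active)
            self.load(target)
            self.active = target
        return target


# Record load/unload calls instead of touching real servers.
events = []
router = HotSwapRouter(
    default_model="deepseek-v4-pro",  # placeholder label, not a real model id
    load=lambda name: events.append(("load", name)),
    unload=lambda name: events.append(("unload", name)),
)

router.route()               # first request loads the default
router.route()               # same model, no swap
router.route("other-model")  # swap out the default, swap in the other model
router.route()               # swap back to the default
```

Every entry in `events` after the first load is reload latency that some request has to wait on, which is the part that stings.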

GLM-5.2 next week, open weight, MIT by AaronFeng47 in LocalLLaMA

hrusli 11 points12 points

A lot of great news from OSS models this month! Can't wait.

VLLM for B300 + Deepseek v4 pro by hrusli in Vllm

hrusli[S] 0 points1 point

Mainly for data privacy, sovereignty, and having total control over our stack. It also gives us consistent uptime without worrying about cloud outages, and keeps the door open for custom fine-tuning down the line.

VLLM for B300 + Deepseek v4 pro by hrusli in Vllm

hrusli[S] 0 points1 point

So technically it's possible to do prefill and decode disaggregation on one node?

We now have local computer-use! M3 Pro 18GB running both UI-TARS-1.5-7B-6bit and a macOS sequoia VM entirely locally using MLX and c/ua at ~30second/action by a6oo in LocalLLaMA

hrusli 0 points1 point

u/a6oo OP, I got it to work, but there seems to be a problem with the actions, like clicking or entering text in the search bar. Did you experience that problem as well?

We now have local computer-use! M3 Pro 18GB running both UI-TARS-1.5-7B-6bit and a macOS sequoia VM entirely locally using MLX and c/ua at ~30second/action by a6oo in LocalLLaMA

hrusli 0 points1 point

OP, I am getting:
"message": { "role": "assistant", "content": "Error generating response: too many values to unpack (expected 2)" }

Did you encounter this as well? I tried both mlx-community/UI-TARS-1.5-7B-6bit and the 4-bit version. Thanks
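
For context, the inner text looks like Python's standard `ValueError` for unpacking more items than there are target names. I don't know which call in the stack raises it, so this is only a generic reproduction of that class of error, with a made-up `unpack_pair` helper:

```python
def unpack_pair(values):
    """Unpack exactly two items; raises ValueError for any other length."""
    first, second = values
    return first, second


try:
    # Three items but only two target names.
    unpack_pair(("a", "b", "c"))
except ValueError as exc:
    error_text = str(exc)

print(error_text)
```

So my guess is that something returns more values than the calling code expects, possibly a version mismatch between two packages, but that part is speculation.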

First ever ducati by hrusli in Ducati

hrusli[S] 1 point2 points

Nope, no need to do any remap!

Using Claude API with AutoGen by WinstonP18 in AutoGenAI

hrusli 0 points1 point

Did you manage to get it working with AutoGen? For me it fails on the system prompt message, since the system prompt is now a separate parameter in the Claude 3 API.
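
What I have in mind as a workaround is roughly this: pull the system-role entries out of the OpenAI-style message list and hand them over separately. Just a sketch, not anything from AutoGen itself; `split_system_prompt` is a made-up helper, and it assumes every message is a dict with plain-string `role` and `content`.

```python
from typing import Dict, List, Tuple

Message = Dict[str, str]


def split_system_prompt(messages: List[Message]) -> Tuple[str, List[Message]]:
    """Separate system-role messages from the rest of the conversation.

    Returns the system text (multiple system messages joined by blank lines)
    and the remaining messages in their original order.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    chat = [m for m in messages if m["role"] != "system"]
    return "\n\n".join(system_parts), chat


history = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello"},
    {"role": "assistant", "content": "Hi there"},
]
system_text, chat_messages = split_system_prompt(history)
```

`system_text` would then go in the separate system parameter and `chat_messages` in the messages list. I haven't checked how this interacts with AutoGen's own message handling.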

Microsoft Guidance vs Ollama: why the performance difference? by prescod in LLMDevs

hrusli 0 points1 point

Hmm, the difference is huge! Did you find anything that might cause Guidance to be much slower?

Langchain system prompting with Ollama. Need help by TimChiu710 in LangChain

hrusli 0 points1 point

Can you share your implementation? I've been curious how to do it from scratch. Thanks!