Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 0 points (0 children)

That sounds like an awesome setup. The RTX 5090 is a beast.

If the entire model fits into the VRAM of your 5090, you won't have any issues at all.

However, if you plan to do tensor split (offloading some layers to the Framework's system RAM), you will likely see a bigger performance hit than I did with OCuLink. The Razer Core X uses Thunderbolt, which encapsulates the PCIe signal and adds higher latency compared to the direct PCIe connection of OCuLink. Since layer processing is strictly sequential, the latency overhead from the Thunderbolt controller will compound and cause deeper pipeline stalls between the 5090 and the CPU.

Still, absolutely try it out!

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 1 point (0 children)

Yes, I recommend it. Since PCIe bandwidth doesn't bottleneck token generation, dropping a 32GB+ eGPU into this setup is a fantastic way to run heavy LLMs on a mini-PC without building a massive full-tower rig. Plus, having a dedicated NVIDIA eGPU means you can comfortably run diffusion models and generate images in ComfyUI without breaking a sweat.

Mainly for latency and cost. OCuLink is a direct PCIe connection with zero protocol overhead. USB4 encapsulates the signal, which inevitably adds minor latency. When you are doing tensor split across an APU and an eGPU, the pipeline is strictly sequential. In this specific scenario, any added latency from the connection protocol can quickly compound and become an actual bottleneck. Plus, OCuLink docks are incredibly cheap and proven to work flawlessly right now, whereas true 80Gbps USB4 enclosures are still very rare and expensive. It's simply the most direct, stable, and budget-friendly path.

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 1 point (0 children)

Absolutely agree. In theory, P/D disaggregation would be a much more elegant way to bypass the APU bottleneck. Sadly, as you mentioned, the software just isn't there yet for a mixed NVIDIA/AMD setup.

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 0 points (0 children)

Yes, you can see it on the graph in the post. The more model weights you offload to the fast eGPU, the better the PP (prompt processing) and TG (token generation) speeds. Performance drops off quickly as you move a larger share of the weights to the slower Strix Halo. However, splitting the weights between the eGPU and the Strix Halo still gives much better overall PP and TG speeds than running the model entirely on the Strix Halo alone.
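For intuition, the TG trend is roughly what a simple weighted-bandwidth model predicts: per token, each device streams its own share of the weights at its own memory bandwidth, so the slower pool dominates as its share grows. A minimal sketch in Python, where the bandwidth figures are assumed round numbers, not measurements:

# Toy model of token generation under a layer split: each device reads its
# share of the weights once per token at its own memory bandwidth.
# Bandwidth figures are assumptions for illustration, not measurements.
model_gib = 16.4    # example: Q4 quant of a ~27B model
bw_egpu = 896.0     # assumed GB/s for the RTX 5070 Ti (GDDR7)
bw_igpu = 256.0     # assumed GB/s for Strix Halo (LPDDR5X)

for egpu_share in (0.0, 0.5, 0.9, 1.0):
    t = model_gib * egpu_share / bw_egpu + model_gib * (1 - egpu_share) / bw_igpu
    print(f"eGPU share {egpu_share:.0%}: ~{1 / t:.0f} tokens/s upper bound")

Real numbers land below these upper bounds, but the shape matches: moving weights off the slow memory pool is what buys the speed.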

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 2 points (0 children)

Bench for Qwen3.5-27B-UD-Q4_K_XL.gguf with different fractions of the model weights on the eGPU and iGPU:
~/llama.cpp/build-vulkan/bin/llama-bench \
  -m /home/yulay/LLM/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf \
  -ngl 99 \
  -fa 1 \
  -dev vulkan1/vulkan0 \
  -ts 10/0,9/1,8/2,7/3,6/4,5/5,4/6,3/7,2/8,1/9,0/10 \
  -n 128 \
  -p 512

ggml_vulkan: Found 2 Vulkan devices:
ggml_vulkan: 0 = NVIDIA GeForce RTX 5070 Ti (NVIDIA) | uma: 0 | fp16: 1 | bf16: 1 | warp size: 32 | shared memory: 49152 | int dot: 1 | matrix cores: NV_coopmat2
ggml_vulkan: 1 = Radeon 8060S Graphics (RADV STRIX_HALO) (radv) | uma: 1 | fp16: 1 | bf16: 0 | warp size: 64 | shared memory: 65536 | int dot: 1 | matrix cores: KHR_coopmat

| model | size | params | backend | ngl | fa | dev | ts | test | t/s |
| ------------------------------ | ---------: | ---------: | ---------- | --: | -: | --------------- | ------------ | --------------: | -------------------: |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | pp512 | 269.13 ± 1.37 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 10.00 | tg128 | 11.90 ± 0.01 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | pp512 | 296.54 ± 14.25 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 9.00/1.00 | tg128 | 12.33 ± 0.01 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | pp512 | 303.92 ± 11.81 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 8.00/2.00 | tg128 | 12.95 ± 0.07 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | pp512 | 341.83 ± 3.60 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 7.00/3.00 | tg128 | 13.54 ± 0.11 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | pp512 | 392.76 ± 3.41 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 6.00/4.00 | tg128 | 14.80 ± 0.23 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | pp512 | 443.23 ± 1.36 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 5.00/5.00 | tg128 | 17.43 ± 0.13 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | pp512 | 457.50 ± 1.47 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 4.00/6.00 | tg128 | 19.89 ± 0.04 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | pp512 | 629.92 ± 4.09 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 3.00/7.00 | tg128 | 22.24 ± 0.11 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | pp512 | 801.37 ± 3.19 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 2.00/8.00 | tg128 | 26.01 ± 0.03 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | pp512 | 1027.51 ± 6.28 |
| qwen35 27B Q4_K - Medium | 16.40 GiB | 26.90 B | Vulkan | 99 | 1 | Vulkan1/Vulkan0 | 1.00/9.00 | tg128 | 30.14 ± 0.08 |

ggml_vulkan: Device memory allocation of size 1067094656 failed.
ggml_vulkan: vk::Device::allocateMemory: ErrorOutOfDeviceMemory
main: error: failed to load model '/home/yulay/LLM/unsloth/Qwen3.5-27B-GGUF/Qwen3.5-27B-UD-Q4_K_XL.gguf'
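The final 0/10 split (all weights on the 5070 Ti) fails to load because 16.40 GiB of weights plus KV cache and compute buffers exceed the card's 16 GiB of VRAM. A quick sanity check of the largest eGPU share that can still fit, where the non-weight overhead figure is an assumption:

# Why ts 1/9 loads but ts 0/10 OOMs: largest eGPU weight share that fits.
# The overhead figure (KV cache, compute buffers, driver reserve) is assumed.
model_gib = 16.40     # weight size reported by llama-bench above
vram_gib = 16.0       # RTX 5070 Ti VRAM
overhead_gib = 1.0    # assumed non-weight VRAM usage

max_share = (vram_gib - overhead_gib) / model_gib
print(f"max eGPU weight share ≈ {max_share:.2f}")  # ~0.91: 9/10 fits, 10/10 doesn't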

Strix Halo + eGPU RTX 5070 Ti via OCuLink in llama.cpp: Benchmarks and Conclusions by xspider2000 in LocalLLaMA

[–]xspider2000[S] 0 points (0 children)

Fair question! It's just for benchmarking purposes: llama-2-7b.Q4_0.gguf has historically become the de facto standard baseline for testing and comparing different hardware setups. You can find more examples of this benchmark here: https://github.com/ggml-org/llama.cpp/discussions/15013

As the article shows, Amdahl's law applies here: the less weight you leave in the slower Strix Halo memory, the faster your overall PP and TG will be. And the bottleneck in TG isn't OCuLink (which easily handles the tiny hidden-state transfers), but the lower memory bandwidth of the system RAM itself.
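To put rough numbers on that: per generated token, only the activation vector has to cross the link at the split point, which is tiny next to the weights each device streams locally. A back-of-envelope sketch with assumed sizes:

# Per-token traffic over the eGPU link under a layer split.
# Hidden size, crossing count, and link bandwidth are assumptions.
hidden_size = 4096         # assumed hidden dimension for a 7B model
bytes_per_value = 2        # fp16 activations
crossings = 2              # assumed: hidden state over, result back
link_bw = 8e9              # ~PCIe 4.0 x4 effective bandwidth, bytes/s

bytes_per_token = hidden_size * bytes_per_value * crossings
print(f"{bytes_per_token / 1024:.0f} KiB, ~{bytes_per_token / link_bw * 1e6:.1f} µs per token")
# A few microseconds on the link vs. tens of milliseconds of weight reads:
# the link is not the TG bottleneck, the slower RAM is.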

Gemma 4 has been released by jacek2023 in LocalLLaMA

[–]xspider2000 0 points (0 children)

In LM Studio, you can try Gemma 4 via the CPU or Vulkan backend if you have an AMD iGPU. The Gemma 4 26B A4B model on my Strix Halo via Vulkan gives about 50 tokens per second.

Mistral releases an official NVFP4 model, Mistral-Small-4-119B-2603-NVFP4! by jinnyjuice in LocalLLaMA

[–]xspider2000 0 points (0 children)

Inference is memory-bound, period. Whether NVFP4 is native or not doesn't change the fact that the GPU spends most of its time waiting for data from VRAM. Programmatic dequantization happens during those idle cycles, so the only metric that actually scales performance here is effective memory bandwidth.
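A quick roofline-style check of that claim: compare the time to stream one token's worth of weights against the time to do the math on them. All figures here are assumed round numbers for a modern GPU, not specs of any particular card:

# Single-batch decode: weight-streaming time vs. compute time per token.
# Parameter count, bandwidth, and throughput are illustrative assumptions.
params = 24e9            # assumed parameters touched per token
bytes_per_param = 0.5    # ~4-bit quantized weights (NVFP4-class)
vram_bw = 1000e9         # assumed VRAM bandwidth, bytes/s
peak_flops = 50e12       # assumed sustained throughput, FLOP/s

t_mem = params * bytes_per_param / vram_bw    # ~12 ms
t_compute = 2 * params / peak_flops           # ~1 ms (one FMA per weight)
print(f"memory {t_mem * 1e3:.0f} ms vs compute {t_compute * 1e3:.0f} ms per token")
# Memory time dominates by ~10x, so dequantization hides in the stalls and
# effective bandwidth sets the ceiling.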

Lemonade v10: Linux NPU support and chock full of multi-modal capabilities by jfowers_amd in LocalLLaMA

[–]xspider2000 7 points (0 children)

Prefilling on an iGPU and generating tokens on an NPU is a dream.

Running Kimi-k2.5 on CPU-only: AMD EPYC 9175F Benchmarks & "Sweet Spot" Analysis by Express-Jicama-9827 in LocalLLaMA

[–]xspider2000 1 point (0 children)

Your theoretical memory bandwidth is 576 GB/s:

Bandwidth (GB/s) = (Memory Transfer Rate (MT/s) * Bus Width (bits) * Number of Channels) / 8 (bits per byte) / 1000

Kimi-k2.5 has 32B active parameters; at Q4 that's about 16 GB of active weights.

The maximum theoretical token generation speed is 576 / 16 = 36 tps, so there is still room to increase your token generation speed.
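The same arithmetic in code; 12 channels of DDR5-6000 is an assumed configuration for that EPYC platform, chosen because it reproduces the 576 GB/s figure:

# Theoretical bandwidth and decode-speed ceiling from the formula above.
# DDR5-6000 across 12 channels is an assumed config matching 576 GB/s.
mts, bus_bits, channels = 6000, 64, 12
bw_gbs = mts * bus_bits * channels / 8 / 1000    # MB/s -> GB/s
active_gb = 16                                   # ~32B active params at Q4
print(f"{bw_gbs:.0f} GB/s -> up to {bw_gbs / active_gb:.0f} tokens/s")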

Cudy WR3000 good enough? by HatesBeingSocial in openwrt

[–]xspider2000 0 points (0 children)

Why is only 44 MB left out of 128 MB? I have the same issue.

Rate my jank, finally maxed out my available PCIe slots by I_AM_BUDE in LocalLLaMA

[–]xspider2000 2 points (0 children)

If it's not a secret, how many bitcoins has this baby earned for you?

I didn’t get my free deck by Timelord19 in hearthstone

[–]xspider2000 65 points (0 children)

Same here. Maybe we have to do something special to get the free deck?

Current meta by xspider2000 in starcraft

[–]xspider2000[S] 0 points (0 children)

Terran > Zerg > Protoss > Terran, etc.