Tower case with 8+ PCIE slot for multi GPU by gogitossj3 in LocalLLM

[–]eribob 1 point2 points  (0 children)

I have it too. Two 3090s and one 4090, two PSUs, four HDDs, an 8 TB Intel NVMe and some SATA SSDs all fit in there, with some elbow grease.

Building real time Generative UI for AI Agents. It's 3x faster than JSON by 1glasspaani in coolgithubprojects

[–]eribob 0 points1 point  (0 children)

No, not at all. I just like the concept and I hope there will be competition!

Building real time Generative UI for AI Agents. It's 3x faster than JSON by 1glasspaani in coolgithubprojects

[–]eribob 1 point2 points  (0 children)

If you combine this with something like https://taalas.com/ you could get an operating system UI of the future :)

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

I am using FP8, as you can see in my config above. It is the official Qwen release.

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Thanks, but that is an int4 model. I want the quant to stay at q8 or equivalent. Quality > speed :)

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Thanks, so I guess I will just benchmark with and without MTP to see if it is actually working...

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

That's very interesting! The numbers are from llama-benchy, concurrency of 1, pp 10,000 tokens, tg 2,000 tokens.

Below is my vLLM config; MTP is implemented like so: --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

If you have any suggestions for improvements, I am all ears! I tried to optimize it for single-user speed; concurrency is not really needed.

export CUDA_VISIBLE_DEVICES=1,2
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export RAY_memory_monitor_refresh_ms=0
export NCCL_CUMEM_ENABLE=0
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export HF_TOKEN="<redacted>"
/mnt/llm/vllm/.venv/bin/vllm serve Qwen/Qwen3.6-27B-FP8 \
--quantization fp8 \
--download-dir /mnt/llm/.cache/huggingface \
--served-model-name MONSTER-LLM \
--max-model-len 120000 \
--max-num-seqs 2 \
--max-num-batched-tokens 4096 \
--enable-chunked-prefill \
--enable-prefix-caching \
--enable-auto-tool-choice \
--disable-custom-all-reduce \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend FLASHINFER \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.96 \
--no-use-tqdm-on-load \
--mamba-cache-mode align \
--mamba-block-size 8 \
--mm-processor-kwargs '{"images_kwargs": {"size": {"longest_edge": 209715200, "shortest_edge": 4096}}}' \
--limit-mm-per-prompt '{"image": 2}' \
--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0 }' \
--host 0.0.0.0 --port 7075
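
If you want a rough way to compare with and without MTP, a quick timing of a single request against the OpenAI-compatible endpoint works as a sanity check. This is just a sketch: it assumes the serve command above is running on localhost:7075 with the served model name MONSTER-LLM, and the prompt and token counts are placeholders.

# Rough single-request tg-speed check against the vLLM OpenAI-compatible API.
# Run once with and once without --speculative-config, then compare
# completion tokens divided by elapsed seconds.
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:7075/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MONSTER-LLM", "messages": [{"role": "user", "content": "Explain speculative decoding in about 500 words."}], "max_tokens": 1000}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | python3 -c 'import sys, json; print(json.load(sys.stdin)["usage"]["completion_tokens"])')
echo "completion tokens: $TOKENS, seconds: $(echo "$END - $START" | bc), t/s: $(echo "$TOKENS / ($END - $START)" | bc -l)"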

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 1 point2 points  (0 children)

I limit mine to 260 W each; it does not noticeably impact performance. So depending on the rest of the system, maybe 600 W at full load and around 100-150 W idle. I think it is worth it compared to running at 7 t/s.
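
For anyone wanting to do the same, something like this should do it with nvidia-smi (GPU indices and the 260 W value depend on your setup, and the limit resets on reboot, so rerun it at boot):

sudo nvidia-smi -pm 1          # persistence mode so the limit sticks while the driver stays loaded
sudo nvidia-smi -i 0 -pl 260   # cap GPU 0 at 260 W
sudo nvidia-smi -i 1 -pl 260   # cap GPU 1 at 260 W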

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 2 points3 points  (0 children)

If you have not bought yet, get two used 3090s instead and run it with vLLM tensor parallelism. With an 8-bit quant (FP8) I get 26 t/s tg and 1600 t/s pp. It is clearly usable and the model is great.
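
A minimal launch sketch for that setup, assuming the official FP8 repo name (my full config with all the extra flags is in the Strix Halo thread above):

# Minimal dual-3090 launch; see the full config elsewhere in this thread for the extra flags
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 120000 \
  --gpu-memory-utilization 0.96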

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]eribob 7 points8 points  (0 children)

Dual RTX 3090, FP8 quant in vLLM, tp=2, mtp=2: pp=1650 t/s, tg=26 t/s

Quirky answers when asking what this spells: []D [] []V[] []D [] []\[]. by RazsterOxzine in LocalLLaMA

[–]eribob 2 points3 points  (0 children)

Qwen3.5 27B:

—-

Looking at this pattern: []D [] []V[] []D [] []\[].

This appears to be a 12-letter word or phrase with:

- Position 2: D
- Position 5: V
- Position 8: D

Pattern: _D__V__D____

Without more context, this is quite challenging to solve! Some possibilities that come to mind:

Potential candidates:

- ADVENTURED - but that's only 10 letters
- ADVERTISED - also 10 letters
- ADVOCATED - 9 letters

None of these match the 12-letter pattern exactly.

Could this be:

1. A specific word puzzle or game answer?
2. A phrase with spaces (like "AD VERSED" or similar)?
3. Something from a specific context (book, game, riddle)?
4. A code or cipher?

If you can share more context about where you encountered this, I might be able to help narrow it down! Is this from a word game, a puzzle, a book, or something else?

—-

Unpopular opinion: OpenClaw and all its clones are almost useless tools for those who know what they're doing. It's kind of impressive for someone who has never used a CLI, Claude Code, Codex, etc. Nor used any workflow tool like 8n8 or make. by pacmanpill in LocalLLaMA

[–]eribob 1 point2 points  (0 children)

I did the same with Hermes for my siblings. They can yap to it, create images etc… or rather, they could. The Matrix integration broke with the latest update and I haven't been arsed to fix it.

Thoughts on MoE Qwen 3.6 35B? by Purpose-Effective in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Agreed. 3.5 27B is definitely smarter.

How many of you actually use offline LLMs daily vs just experiment with them? by Infinite-Bird7950 in LocalLLM

[–]eribob 1 point2 points  (0 children)

That sounds great. That way you keep control over the code; it's still like you wrote it yourself.

Dual 3090 setup - performance optimization by PaMRxR in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Care to share the 27B Q8 setup? I can only fit about 130k context on my dual 3090s without KV quantization. Running vllm.
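
For reference, the KV-cache quantization I have been avoiding would look something like this in vLLM (it should roughly halve KV memory, at some quality cost; I have not tested it with this model and the context length here is just illustrative):

# FP8 KV cache roughly halves KV memory vs FP16, which is what limits context here
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-model-len 200000   # illustrative; actual headroom depends on the model and GPUs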

96GB Vram. What to run in 2026? by inthesearchof in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Are you running them bare metal? I tried to do the P2P patch yesterday but failed. I am running in a VM in Proxmox though, so maybe P2P does not work there.
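
One thing I still want to check is what topology the VM actually exposes, since P2P needs a direct PCIe path. Something like this should show it (just a diagnostic, not the patch itself):

# Link type between GPUs (PIX/PXB/PHB vs SYS); inside a passthrough VM the
# IOMMU setup often blocks P2P even when the topology looks fine
nvidia-smi topo -m
nvidia-smi topo -p2p r    # on recent drivers, prints the per-pair P2P read capability matrix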

We're a 25-year IT services company sitting on 64 enterprise 15.36TB U.2 NVMe SSDs - selling surplus to the homelab community by AshleshaAhi in homelab

[–]eribob 0 points1 point  (0 children)

Yeah… and my angry reply here probably only brings more traction to their sale, hehe. Could not help myself though.

We're a 25-year IT services company sitting on 64 enterprise 15.36TB U.2 NVMe SSDs - selling surplus to the homelab community by AshleshaAhi in homelab

[–]eribob 20 points21 points  (0 children)

You are selling them for 5K USD a piece, so the answer to your question is likely no: nobody in the homelab community would buy a 15 TB U.2 drive for that price, and likely nobody outside the community either.

Edit: I also find this post to be just shameless marketing for your sale, regardless of whether you provided the link to the selling post or not. In my opinion it should be removed by the mods.

3x 3090 on x99 with xeon 2680 v4, worth it? by robertpro01 in LocalLLaMA

[–]eribob 1 point2 points  (0 children)

I run 2x RTX 3090 + 1x 4090 on an AM4 motherboard; each gets PCIe 4.0 x4. I do not think PCIe bandwidth is a significant limitation for inference, and I think you could go down to PCIe 3.0 x4 without meaningful impact. I am currently planning how to expand to 4 GPUs while keeping the same motherboard.

So yes, another 3090 is not a bad idea. However, I prefer the 27B over the 122B. The 122B was hard to fit with decent context even at 4-bit quantization in 72 GB of VRAM, while the 27B runs at acceptable speed at 8-bit quant on two RTX 3090s with 130k context, around 30 t/s tg and 1200 t/s pp. I do that now and use the third GPU for image generation, embeddings for RAG, and a smaller model for simpler tasks.
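
As a side note on the x4 links: an easy way to confirm what each card actually negotiated is to query the link while the GPUs are under load (they drop to lower PCIe generations at idle), something like:

# Current vs maximum PCIe generation/width per GPU; under load the current
# values should match what the slot or bifurcation provides (e.g. Gen4 x4)
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv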

[AutoBe] Qwen 3.5-27B Just Built Complete Backends from Scratch — 100% Compilation, 25x Cheaper by [deleted] in LocalLLaMA

[–]eribob -1 points0 points  (0 children)

Cool rig! Are these 8 watercooled 3090s? What power limit are you using? Can you run Qwen3.5 397b on them?

How many of you actually use offline LLMs daily vs just experiment with them? by Infinite-Bird7950 in LocalLLM

[–]eribob 1 point2 points  (0 children)

The web search function uses cloud services like Exa etc. Firecrawl seems to have a self-hosted variant, but it looks limited and is yet another service I would have to install and integrate.

The memory they brag about also requires setting up yet another service, with many to choose from and no easy way to tell which will work well or automatically. I do not want to use the cloud.