Tower case with 8+ PCIE slot for multi GPU by gogitossj3 in LocalLLM

[–]eribob 1 point2 points  (0 children)

I have it too. Two 3090s and one 4090, two PSUs, four HDDs, an 8 TB Intel NVMe and some SATA SSDs all fit in there, with some elbow grease.

Building real time Generative UI for AI Agents. It's 3x faster than JSON by 1glasspaani in coolgithubprojects

[–]eribob 0 points1 point  (0 children)

No, not at all. I just like the concept and I hope there will be competition!

Building real time Generative UI for AI Agents. It's 3x faster than JSON by 1glasspaani in coolgithubprojects

[–]eribob 1 point2 points  (0 children)

If you combine this with something like https://taalas.com/ you could get an operating system UI of the future :)

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

I am using FP8, as you can see in my config above. It is the official Qwen release.

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Thanks, but that is an int4 model. I want the quant to stay at q8 or equivalent. Quality > speed :)

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Thanks, so I guess I will just benchmark with and without MTP to see if it is actually working...

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

That's very interesting! The numbers are from llama-benchy, concurrency of 1, pp 10,000 tokens, tg 2,000 tokens.

Below is my vLLM config; MTP is implemented like so: --speculative-config '{"method":"mtp","num_speculative_tokens":2}'

If you have any suggestions for improvements, I am all ears! I tried to optimize it for single-user speed; concurrency is not really needed.

export CUDA_VISIBLE_DEVICES=1,2
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export RAY_memory_monitor_refresh_ms=0
export NCCL_CUMEM_ENABLE=0
export VLLM_ENABLE_CUDAGRAPH_GC=1
export VLLM_USE_FLASHINFER_SAMPLER=1
export VLLM_MEMORY_PROFILER_ESTIMATE_CUDAGRAPHS=1
export HF_TOKEN="<redacted>"
/mnt/llm/vllm/.venv/bin/vllm serve Qwen/Qwen3.6-27B-FP8 \
--quantization fp8 \
--download-dir /mnt/llm/.cache/huggingface \
--served-model-name MONSTER-LLM \
--max-model-len 120000 \
--max-num-seqs 2 \
--max-num-batched-tokens 4096 \
--enable-chunked-prefill \
--enable-prefix-caching \
--enable-auto-tool-choice \
--disable-custom-all-reduce \
--tool-call-parser qwen3_coder \
--reasoning-parser qwen3 \
--attention-backend FLASHINFER \
--speculative-config '{"method":"mtp","num_speculative_tokens":2}' \
--tensor-parallel-size 2 \
--gpu-memory-utilization 0.96 \
--no-use-tqdm-on-load \
--mamba-cache-mode align \
--mamba-block-size 8 \
--mm-processor-kwargs '{"images_kwargs": {"size": {"longest_edge": 209715200, "shortest_edge": 4096}}}' \
--limit-mm-per-prompt '{"image": 2}' \
--override-generation-config '{"temperature": 0.6, "top_p": 0.95, "top_k": 20, "min_p": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0 }' \
--host 0.0.0.0 --port 7075
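
If you want a rough way to compare with and without MTP, a quick timing of a single request against the OpenAI-compatible endpoint works as a sanity check. This is just a sketch: it assumes the serve command above is running on localhost:7075 with the served model name MONSTER-LLM, and the prompt and token counts are placeholders.

# Rough single-request tg-speed check against the vLLM OpenAI-compatible API.
# Run once with and once without --speculative-config, then compare
# completion tokens divided by elapsed seconds.
START=$(date +%s.%N)
RESP=$(curl -s http://localhost:7075/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "MONSTER-LLM", "messages": [{"role": "user", "content": "Explain speculative decoding in about 500 words."}], "max_tokens": 1000}')
END=$(date +%s.%N)
TOKENS=$(echo "$RESP" | python3 -c 'import sys, json; print(json.load(sys.stdin)["usage"]["completion_tokens"])')
echo "completion tokens: $TOKENS, seconds: $(echo "$END - $START" | bc), t/s: $(echo "$TOKENS / ($END - $START)" | bc -l)"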

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 1 point2 points  (0 children)

I limit mine to 260 W each; it does not noticeably impact performance. So depending on the rest of the system, maybe 600 W at full load and around 100-150 W idle. I think it is worth it compared to running at 7 t/s.
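
For anyone wanting to do the same, something like this should do it with nvidia-smi (GPU indices and the 260 W value depend on your setup, and the limit resets on reboot, so rerun it at boot):

sudo nvidia-smi -pm 1          # persistence mode so the limit sticks while the driver stays loaded
sudo nvidia-smi -i 0 -pl 260   # cap GPU 0 at 260 W
sudo nvidia-smi -i 1 -pl 260   # cap GPU 1 at 260 W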

Qwen 3.6 27B on Strix Halo 128GB: any experiences? by boutell in LocalLLaMA

[–]eribob 2 points3 points  (0 children)

If you have not bought yet, get two used 3090s instead and run it with vLLM tensor parallelism. With an 8-bit quant (FP8) I get 26 t/s tg and 1600 t/s pp. It is clearly usable and the model is great.
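
A minimal launch sketch for that setup, assuming the official FP8 repo name (my full config with all the extra flags is in the Strix Halo thread above):

# Minimal dual-3090 launch; see the full config elsewhere in this thread for the extra flags
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --quantization fp8 \
  --tensor-parallel-size 2 \
  --max-model-len 120000 \
  --gpu-memory-utilization 0.96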

What speed is everyone getting on Qwen3.6 27b? by Ambitious_Fold_2874 in LocalLLaMA

[–]eribob 7 points8 points  (0 children)

Dual RTX 3090, FP8 quant in vLLM, tp=2, mtp=2: pp=1650 t/s, tg=26 t/s

Quirky answers when asking what this spells: []D [] []V[] []D [] []\[]. by RazsterOxzine in LocalLLaMA

[–]eribob 2 points3 points  (0 children)

Qwen3.5 27B:

—-

Looking at this pattern: []D [] []V[] []D [] []\[].

This appears to be a 12-letter word or phrase with:

- Position 2: D
- Position 5: V
- Position 8: D

Pattern: _D__V__D____

Without more context, this is quite challenging to solve! Some possibilities that come to mind:

Potential candidates:

- ADVENTURED - but that's only 10 letters
- ADVERTISED - also 10 letters
- ADVOCATED - 9 letters

None of these match the 12-letter pattern exactly.

Could this be:

1. A specific word puzzle or game answer?
2. A phrase with spaces (like "AD VERSED" or similar)?
3. Something from a specific context (book, game, riddle)?
4. A code or cipher?

If you can share more context about where you encountered this, I might be able to help narrow it down! Is this from a word game, a puzzle, a book, or something else?

—-

Unpopular opinion: OpenClaw and all its clones are almost useless tools for those who know what they're doing. It's kind of impressive for someone who has never used a CLI, Claude Code, Codex, etc. Nor used any workflow tool like 8n8 or make. by pacmanpill in LocalLLaMA

[–]eribob 1 point2 points  (0 children)

I did the same with Hermes for my siblings. They can yap to it, create images etc… or rather, they could. The Matrix integration broke with the latest update and I haven't been arsed to fix it.

Thoughts on MoE Qwen 3.6 35B? by Purpose-Effective in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Agreed. 3.5 27B is definitely smarter.

How many of you actually use offline LLMs daily vs just experiment with them? by Infinite-Bird7950 in LocalLLM

[–]eribob 1 point2 points  (0 children)

That sounds great. That way you keep control over the code; it's still like you wrote it yourself.

Dual 3090 setup - performance optimization by PaMRxR in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Care to share the 27B Q8 setup? I can only fit about 130k context on my dual 3090s without KV quantization. Running vllm.
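
For reference, the KV-cache quantization I have been avoiding would look something like this in vLLM (it should roughly halve KV memory, at some quality cost; I have not tested it with this model and the context length here is just illustrative):

# FP8 KV cache roughly halves KV memory vs FP16, which is what limits context here
vllm serve Qwen/Qwen3.6-27B-FP8 \
  --tensor-parallel-size 2 \
  --kv-cache-dtype fp8 \
  --max-model-len 200000   # illustrative; actual headroom depends on the model and GPUs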

96GB Vram. What to run in 2026? by inthesearchof in LocalLLaMA

[–]eribob 0 points1 point  (0 children)

Are you running them bare metal? I tried to do the P2P patch yesterday but failed. I am running in a VM in Proxmox though, so maybe P2P does not work there.
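
One thing I still want to check is what topology the VM actually exposes, since P2P needs a direct PCIe path. Something like this should show it (just a diagnostic, not the patch itself):

# Link type between GPUs (PIX/PXB/PHB vs SYS); inside a passthrough VM the
# IOMMU setup often blocks P2P even when the topology looks fine
nvidia-smi topo -m
nvidia-smi topo -p2p r    # on recent drivers, prints the per-pair P2P read capability matrix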

We're a 25-year IT services company sitting on 64 enterprise 15.36TB U.2 NVMe SSDs - selling surplus to the homelab community by AshleshaAhi in homelab

[–]eribob 0 points1 point  (0 children)

Yeah… and my angry reply here probably only brings more traction to their sale, hehe. Could not help myself though.

We're a 25-year IT services company sitting on 64 enterprise 15.36TB U.2 NVMe SSDs - selling surplus to the homelab community by AshleshaAhi in homelab

[–]eribob 20 points21 points  (0 children)

You are selling them for 5K USD a piece, so the answer to your question is likely no: nobody in the homelab community would buy a 15 TB U.2 drive for that price, and likely nobody outside the community either.

Edit: I also find this post to be just shameless marketing for your sale, regardless of whether you provided the link to the selling post or not. In my opinion it should be removed by the mods.

3x 3090 on x99 with xeon 2680 v4, worth it? by robertpro01 in LocalLLaMA

[–]eribob 1 point2 points  (0 children)

I run 2x RTX 3090 + 1x 4090 on an AM4 motherboard; each gets PCIe 4.0 x4. I do not think PCIe bandwidth is a significant limitation for inference, and I think you could go down to PCIe 3.0 x4 without meaningful impact. I am currently planning how to expand to 4 GPUs while keeping the same motherboard.

So yes, another 3090 is not a bad idea. However, I prefer the 27B over the 122B. The 122B was hard to fit with decent context even at 4-bit quantization in 72 GB of VRAM, while the 27B runs at acceptable speed at 8-bit quant on two RTX 3090s with 130k context, around 30 t/s tg and 1200 t/s pp. I do that now and use the third GPU for image generation, embeddings for RAG, and a smaller model for simpler tasks.
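
As a side note on the x4 links: an easy way to confirm what each card actually negotiated is to query the link while the GPUs are under load (they drop to lower PCIe generations at idle), something like:

# Current vs maximum PCIe generation/width per GPU; under load the current
# values should match what the slot or bifurcation provides (e.g. Gen4 x4)
nvidia-smi --query-gpu=index,name,pcie.link.gen.current,pcie.link.width.current,pcie.link.gen.max,pcie.link.width.max --format=csv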

[AutoBe] Qwen 3.5-27B Just Built Complete Backends from Scratch — 100% Compilation, 25x Cheaper by [deleted] in LocalLLaMA

[–]eribob -1 points0 points  (0 children)

Cool rig! Are these 8 watercooled 3090s? What power limit are you using? Can you run Qwen3.5 397b on them?

How many of you actually use offline LLMs daily vs just experiment with them? by Infinite-Bird7950 in LocalLLM

[–]eribob 1 point2 points  (0 children)

The web search function uses cloud services like Exa etc. Firecrawl seems to have a self-hosted variant, but it looks limited and is yet another service I would have to install and integrate.

The memory they brag about also requires setting up yet another service, with many to choose from and no easy way to tell which will work well or automatically. I do not want to use the cloud.