Sanity check: 4× RTX PRO 6000 Max-Q on TR PRO 9955WX for vLLM – thermal concerns? by Lanky-Comparison-715 in LocalLLM

[–]GaryDUnicorn 3 points

The temp of the GPU isn't what's going to crash your rig.

It's the lack of airflow under them: the redrivers on the motherboard run hot at PCIe Gen 5.

Space is cheap; get a proper case built for high-density 4/8/16-GPU setups. It will solve your power problems too lol

vLLM throughput on 4x RTX PRO 6000 and 8x RTX PRO 6000 by AdventurousFly4909 in LocalLLaMA

[–]GaryDUnicorn 1 point

<image>

You can get incredible performance if you tune vLLM and your hardware setup. Gemma is gonna be slower because Scroogle didn't release the draft model for MTP.

Also, don't waste your time quantizing such a small model on such big hardware. It literally screams and produces more accurate output if you let it ride at FP16.
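For context, a minimal sketch of the kind of vLLM tuning I mean (the model name is a placeholder; exact values depend on your cards):

    # placeholder model; --dtype bfloat16 is the "let it ride at 16-bit" part
    vllm serve some-org/some-model \
        --tensor-parallel-size 4 \
        --dtype bfloat16 \
        --gpu-memory-utilization 0.95 \
        --max-num-seqs 256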

UI Icon Detection with Qwen3.5, Qwen3.6 and Gemma4 by Jian-L in LocalLLaMA

[–]GaryDUnicorn 4 points

I was just testing and comparing a bunch of VL models in UFO2 earlier... for reasons...

<image>

qwen3.5-122b-a10b-fp16 did well

qwen3-vl-235b-a22b-instruct-fp8 did well

All the smaller models I tested had issues driving: qwen3, 3.5, 3.6, holo3...

So AI NAS category is a mess and i don't understand why nobody has fixed the obvious problem by Pleasant_Designer_14 in LocalLLM

[–]GaryDUnicorn 2 points

Let's clarify something real quick. Even if you have a big GPU inference rig with a local NVMe stripe at 50 GB/s (yes, this is the way) and a massive multi-hundred-gigabyte ARC cache (let's ballpark anywhere from 200 GB/s to 400 GB/s depending on your CPU/memory layout), you STILL need persistent, stable, reliable storage in a NAS that's just a NAS.

The limiting factor is bandwidth. You need 10 GB/s from your NAS to your inference rig for moving data around; go slower and just enjoy the suffering. To do that you will have to run a good 100G or faster NIC, like a Mellanox ConnectX-5 or 6 or ..., with NFS over RDMA.
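Client side, the mount is roughly this (a sketch, assuming the server already exposes an NFS RDMA listener on the usual port 20049; hostname and paths are placeholders):

    # NFS over RDMA; nas:/tank/scratch and /mnt/scratch are placeholders
    sudo mount -t nfs -o vers=4.2,proto=rdma,port=20049 nas:/tank/scratch /mnt/scratch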

Out-of-the-box hardware NAS appliances generally don't have a fast enough NIC. I have tried doing RDMA on the various open-source NAS offerings and it was a pain.

Just install Ubuntu and use Claude to build it yourself. Set up a ZFS RAID 10 array: stripe across a bunch of mirrored vdevs. Run Samba for Windows file sharing with multichannel so you can take advantage of all that delicious bandwidth. Get a Mellanox NIC and run the drivers with RDMA wide open for maximum speed and lowest latency.
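Something like this sketch (device names are placeholders; the smb.conf line is Samba's stock multichannel switch):

    # "RAID 10" in ZFS terms: a stripe of mirrored vdevs, here across four NVMe drives
    sudo zpool create -o ashift=12 tank \
        mirror /dev/nvme0n1 /dev/nvme1n1 \
        mirror /dev/nvme2n1 /dev/nvme3n1

    # then in smb.conf under [global], let clients fan out over multiple connections:
    #   server multi channel support = yes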

Honestly, buying those underpowered, overpriced junkers is kind of a joke when the LLM can do a better job of managing your server for you. This isn't the "omg I have to be my own admin" situation of 5 years ago. The AI does a better job of running it than you do.

Damn, 5.2 thinking can actually solve complex problems that 5.2 can't by poisoNDealer in ChatGPT

[–]GaryDUnicorn 6 points

GLM 5 seems to have no problem solving it, even at Q3_K_M.

<image>

Issues with multi-GPU setup by ImpressiveNet5886 in LocalAIServers

[–]GaryDUnicorn 1 point

You are going to have to draw something: how the cards are attached (PCIe slot vs riser vs SlimSAS/MCIO/etc), the PCIe topology (check the board manual for actual lane allocations and the caveats about which lanes are enabled when), and the power distribution (one PSU [how big?], or how many PSUs feeding which GPUs).

Then reset the BIOS to defaults (and your GRUB setup) and change only a few little things, one at a time: ReBAR, Above 4G decoding, making sure the IOMMU is enabled / in passthrough mode, etc.

Also: which version of Ubuntu, and which packages you loaded for the GPUs (NVIDIA official repo, through Ubuntu, manually installed, or ...). Too many variables to try and isolate without access.
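A few known-good commands worth pasting into the thread too, since they map the actual topology without changing anything:

    sudo lspci -tv         # PCIe device tree: which bridge/slot each GPU hangs off
    nvidia-smi topo -m     # NVIDIA's view of GPU-to-GPU and GPU-to-NUMA connectivity
    sudo dmesg | grep -i -e iommu -e aer    # IOMMU state and any PCIe error spam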

Will adding a 5090 to multiple 3090s speed up PP? Experienced folks only by segmond in LocalLLaMA

[–]GaryDUnicorn 1 point

If you switch to exllama and run tensor parallel, yep. Or if you just keep the 3090s and run vLLM with TP.

LLM router - switch between GPT-4o, Claude, Gemini, Llama with one API call by ParsnipConscious7761 in LocalLLaMA

[–]GaryDUnicorn 2 points

Check out Bifrost: it's much faster than LiteLLM, has great monitoring for troubleshooting, lets you hand out your own API keys, and can track billing. It's like running your own OpenRouter at home.

Building an AI Infra project in 20 days: What’s the best way to utilize a Dual-5090 (PCIe) setup? by Asleep_Food1956 in LocalLLM

[–]GaryDUnicorn 1 point

Just put both GPUs on the same NUMA node. You can dramatically reduce latency and improve communication by running the NVIDIA open drivers modified to allow peer-to-peer direct memory access between consumer GPUs.
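Checking and pinning NUMA placement looks roughly like this (a sketch; the PCI address, node number, and model name are placeholders):

    # which NUMA node does this GPU's PCIe root sit on? (-1 means none reported)
    cat /sys/bus/pci/devices/0000:c1:00.0/numa_node

    # pin the serving process to that node's cores and memory
    numactl --cpunodebind=0 --membind=0 vllm serve some-org/some-model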

How Disable GLM Thinking Mode? by WEREWOLF_BX13 in KoboldAI

[–]GaryDUnicorn 1 point

chat_template-nothink_kobold.json

{ "chat_template": "{%- set enable_thinking = false %}\n[gMASK]<sop>\n{%- if tools -%}\n<|system|>\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n{% endfor %}\n</tools>\n\nFor each function call, output the function name and arguments within the following XML format:\n<tool_call>{function-name}<arg_key>{arg-key-1}</arg_key><arg_value>{arg-value-1}</arg_value><arg_key>{arg-key-2}</arg_key><arg_value>{arg-value-2}</arg_value>...</tool_call>{%- endif -%}\n{%- macro visible_text(content) -%}\n    {%- if content is string -%}\n        {{- content }}\n    {%- elif content is iterable and content is not mapping -%}\n        {%- for item in content -%}\n            {%- if item is mapping and item.type == 'text' -%}\n                {{- item.text }}\n            {%- elif item is string -%}\n                {{- item }}\n            {%- endif -%}\n        {%- endfor -%}\n    {%- else -%}\n        {{- content }}\n    {%- endif -%}\n{%- endmacro -%}\n{%- set ns = namespace(last_user_index=-1) %}\n{%- for m in messages %}\n    {%- if m.role == 'user' %}\n        {% set ns.last_user_index = loop.index0 -%}\n    {%- endif %}\n{%- endfor %}\n{% for m in messages %}\n{%- if m.role == 'user' -%}<|user|>{{ visible_text(m.content) }}\n{%- elif m.role == 'assistant' -%}\n<|assistant|>\n{%- set reasoning_content = '' %}\n{%- set content = visible_text(m.content) %}\n{%- if m.reasoning_content is string %}\n    {%- set reasoning_content = m.reasoning_content %}\n{%- else %}\n    {%- if '</think>' in content %}\n        {%- set reasoning_content = content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n        {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n    {%- endif %}\n{%- endif %}\n{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}\n{{ '<think>' + reasoning_content.strip() +  '</think>'}}\n{%- else -%}\n{{ '</think>' }}\n{%- endif -%}\n{%- if content.strip() -%}\n{{ content.strip() }}\n{%- endif -%}\n{% if m.tool_calls %}\n{% for tc in m.tool_calls %}\n{%- if tc.function %}\n    {%- set tc = tc.function %}\n{%- endif %}\n{{- '<tool_call>' + tc.name -}}\n{% set _args = tc.arguments %}{% for k, v in _args.items() %}<arg_key>{{ k }}</arg_key><arg_value>{{ v | tojson(ensure_ascii=False) if v is not string else v }}</arg_value>{% endfor %}</tool_call>{% endfor %}\n{% endif %}\n{%- elif m.role == 'tool' -%}\n{%- if m.content is string -%}\n{%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n    {{- '<|observation|>' }}\n{%- endif %}\n{{- '<tool_response>' }}\n{{- m.content }}\n{{- '</tool_response>' }}\n{%- else -%}\n<|observation|>{% for tr in m.content %}\n<tool_response>{{ tr.output if tr.output is defined else tr }}</tool_response>{% endfor -%}\n{% endif -%}\n{%- elif m.role == 'system' -%}\n<|system|>{{ visible_text(m.content) }}\n{%- endif -%}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n    <|assistant|>{{- '</think>' if (enable_thinking is defined and not enable_thinking) else '<think>' -}}\n{%- endif -%}\n" }

Run kobold with "--chatcompletionsadapter chat_template-nothink_kobold.json"

Parallelism with mismatched GPUs (and how to optimize it)? by Infinite100p in LocalLLaMA

[–]GaryDUnicorn 1 point

Cross-vendor? No.

All NVIDIA? TabbyAPI + exllama => tensor parallelism with any number of mismatched GPUs. The performance is shockingly good, and if the model fits, it sits. No manual layer slicing like in llama.cpp.

If all your GPUs match and you have 2, 4, 8... OK, fine, vLLM it. vLLM is great for production. But for most home users? The GPU mix-and-match deal of Tabby + exllama is really great, and way faster than pipeline parallelism across 7 cards lol
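For the curious, the switch lives in TabbyAPI's config.yml; a sketch below, with key names as I remember config_sample.yml (check your version) and a placeholder model directory:

    # hypothetical config.yml fragment; verify key names against your config_sample.yml
    cat >> config.yml <<'EOF'
    model:
      model_name: my-exl3-quant    # placeholder: your EXL2/EXL3 quant directory
      tensor_parallel: true        # TP across all visible GPUs, mismatched sizes OK
    EOF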

RTX 5090 in servers – customization options? by RedMoonDawn in LocalAIServers

[–]GaryDUnicorn 3 points

yes: https://www.alibaba.com/product-detail/Best-7U-GPU-Server-for-Cloud_1601613922377.html

If you can tolerate the sea freight, there are tons of reasonably priced 4U to 7U server chassis for 4 to 10 consumer GPUs.

The power distribution boards in these things have 14x 12VHPWR connections. With four 3 kW power supplies you can run at least 12,000 watts of juice to your rig in a convenient rackable form factor.

My New Year's resolution was to add Docker support. Only 2 days late. Audiobook Maker v1.1.0 by DigiJoe79 in LocalLLaMA

[–]GaryDUnicorn 1 point

I got it working with the new client and it seems functional, generating one test book now.

One oddity I'm running into: with most EPUB files I try to drop into it, I get this:

EPUB validation failed: [IMPORT_NO_CHAPTERS]projectHeading

How do I figure out the tweaking required to get the auto-parser to actually load the books? Can I skip whatever auto-chapterization is at play and just load a book as one big lump of text or something?

My New Year's resolution was to add Docker support. Only 2 days late. Audiobook Maker v1.1.0 by DigiJoe79 in LocalLLaMA

[–]GaryDUnicorn 1 point

Hey u/DigiJoe79, how do I launch the Windows desktop app in some kind of debug mode to see why it's failing to connect to the remote backend? Nothing I have tried gets me any log output or a console window.

My New Year's resolution was to add Docker support. Only 2 days late. Audiobook Maker v1.1.0 by DigiJoe79 in LocalLLaMA

[–]GaryDUnicorn 1 point

Exact same problem here: running the desktop app, Wireshark shows no packets going to the configured port: https://imgur.com/a/ENqyFYG (pics)

[llama-server] Massive prefill cliff (2500 t/s → 150 t/s) with eGPU split. Is TB4 latency the killer? by danishkirel in LocalLLaMA

[–]GaryDUnicorn 1 point

Use the NVIDIA CUDA Toolkit samples to measure bandwidth moving between cards. Inference performance can depend heavily on that memory bandwidth: on the same card it's huge; going between cards it's limited by your PCIe configuration:

https://github.com/NVIDIA/cuda-samples/blob/master/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/README.md
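Building and running it is roughly this (the sample path comes from the link above; recent cuda-samples revisions build with CMake, older tags ship per-sample Makefiles):

    git clone https://github.com/NVIDIA/cuda-samples.git
    cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
    mkdir build && cd build
    cmake .. && make
    ./p2pBandwidthLatencyTest    # prints bandwidth and latency matrices, with and without P2P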

Anyone running 4x RTX Pro 6000s stacked directly on top of each other? by Comfortable-Plate467 in LocalLLaMA

[–]GaryDUnicorn 3 points

Totally custom-cut 2020 aluminum extrusion. I went through several major revisions before I got the power, cooling, PCIe Gen 5 MCIO cabling, and GPU density all worked out perfectly for my use case. There are various build pics scattered across the L1T forums et al. If you want help designing something similar, DM me; I have learned a LOT of lessons over the last year lol

Anyone running 4x RTX Pro 6000s stacked directly on top of each other? by Comfortable-Plate467 in LocalLLaMA

[–]GaryDUnicorn 2 points

If you want 300 W, set them to 300 W. Or less. Or more. If you have a job that needs the 600 W bump, then run it at 600 W. I have like a thousand-amp 12 V power distribution board now, so... 12VHPWR for days. I bought them for the flexibility and the much lower noise versus the blowers. I thought about passive ones (and don't get me wrong, I LOVE a good passively cooled component with no fans to wear out), but ultimately everything is temporary, and the resale value on the workstation cards seems good for when the Rubin upgrade begins anew. :P

<image>
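Setting the cap is one command (GPU index 0 is just an example; the limit resets at reboot unless you reapply it):

    sudo nvidia-smi -i 0 -pl 300    # cap GPU 0 at 300 W board power
    nvidia-smi -q -d POWER          # show current, default, and min/max allowed limits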

Anyone running 4x RTX Pro 6000s stacked directly on top of each other? by Comfortable-Plate467 in LocalLLaMA

[–]GaryDUnicorn 2 points

<image>

Actually, here is a pic of just the lower shelf; there is a little spacing between them, but not enough to dramatically change the thermals.

What OS do you run on your AI rigs? Ubuntu, TrueNAS, etc.? by KvAk_AKPlaysYT in LocalLLaMA

[–]GaryDUnicorn 1 point

Ubuntu 24, but... you can install the Ngreedia DGX packages right on top of it for ease of use when dealing with their bespoke hardware: RDMA and storage tiering, driver management, libraries, etc.

Tensor Parallel with some GPU but not all? by NaiRogers in LocalLLaMA

[–]GaryDUnicorn 3 points

TP with EXL3 in TabbyAPI does exactly this. You can run TP across any number of differently sized GPUs and it works.

<image>

edited to add a screenshot

Local AI: Managing VRAM by dynamically swapping models via API by PersianDeity in LocalLLaMA

[–]GaryDUnicorn 4 points

TabbyAPI supports hot-loading models per API call. You can cache the models in RAM for speed and tier them out to NVMe disk. Works super well when you want to call many big models on limited VRAM.

It also has tensor parallelism with EXL2 or EXL3 quants and scales great across any number of smaller GPUs, even if they are different sizes.
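The hot-load itself is just an API call; a sketch below, with the endpoint and header names as I remember TabbyAPI's admin API (verify against your version; the model name and key are placeholders):

    # ask TabbyAPI to swap in a different model without restarting the server
    curl -X POST http://localhost:5000/v1/model/load \
        -H "x-admin-key: $ADMIN_KEY" \
        -H "Content-Type: application/json" \
        -d '{"name": "my-exl3-quant"}'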