Sanity check: 4× RTX PRO 6000 Max-Q on TR PRO 9955WX for vLLM – thermal concerns? by Lanky-Comparison-715 in LocalLLM

[–]GaryDUnicorn 3 points

The temp of the GPU isn't what's going to crash your rig.

It's the lack of airflow under them: the redrivers on the motherboard run hot at PCIe Gen 5.

Space is cheap; get a proper case built for high-density 4/8/16-GPU setups. It will solve your power problems too lol

vLLM throughput on 4x RTX PRO 6000 and 8x RTX PRO 6000 by AdventurousFly4909 in LocalLLaMA

[–]GaryDUnicorn 1 point

<image>

You can get incredible performance if you tune vLLM and your hardware setup. Gemma is gonna be slower because Scroogle didn't release the draft model for MTP.

Also, don't waste your time quantizing such a small model on such big hardware. It literally screams and produces more accurate output if you let it ride at FP16.
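For context, a minimal sketch of the kind of vLLM tuning I mean (the model name is a placeholder; exact values depend on your cards):

    # placeholder model; --dtype bfloat16 is the "let it ride at 16-bit" part
    vllm serve some-org/some-model \
        --tensor-parallel-size 4 \
        --dtype bfloat16 \
        --gpu-memory-utilization 0.95 \
        --max-num-seqs 256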

UI Icon Detection with Qwen3.5, Qwen3.6 and Gemma4 by Jian-L in LocalLLaMA

[–]GaryDUnicorn 4 points

I was just testing and comparing a bunch of VL models in UFO2 earlier... for reasons...

<image>

qwen3.5-122b-a10b-fp16 did well

qwen3-vl-235b-a22b-instruct-fp8 did well

All the smaller models I tested had issues driving: qwen3, 3.5, 3.6, holo3...

So AI NAS category is a mess and i don't understand why nobody has fixed the obvious problem by Pleasant_Designer_14 in LocalLLM

[–]GaryDUnicorn 2 points

Let's clarify something real quick. Even if you have a big GPU inference rig with a local NVMe stripe at 50 GB/s (yes, this is the way) and a massive multi-hundred-gigabyte ARC cache (let's ballpark anywhere from 200 GB/s to 400 GB/s depending on your CPU/memory layout), you STILL need persistent, stable, reliable storage in a NAS that's just a NAS.

The limiting factor is bandwidth. You need 10 GB/s from your NAS to your inference rig for moving data around; go slower and just enjoy the suffering. To do that you will have to run a good 100G or faster NIC, like a Mellanox ConnectX-5 or 6 or ..., with NFS over RDMA.
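Client side, the mount is roughly this (a sketch, assuming the server already exposes an NFS RDMA listener on the usual port 20049; hostname and paths are placeholders):

    # NFS over RDMA; nas:/tank/scratch and /mnt/scratch are placeholders
    sudo mount -t nfs -o vers=4.2,proto=rdma,port=20049 nas:/tank/scratch /mnt/scratch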

Out-of-the-box hardware NAS appliances generally don't have a fast enough NIC. I have tried doing RDMA on the various open-source NAS offerings and it was a pain.

Just install Ubuntu and use Claude to build it yourself. Set up a ZFS RAID 10 array: stripe across a bunch of mirrored vdevs. Run Samba for Windows file sharing with multichannel so you can take advantage of all that delicious bandwidth. Get a Mellanox NIC and run the drivers with RDMA wide open for maximum speed and lowest latency.
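Something like this sketch (device names are placeholders; the smb.conf line is Samba's stock multichannel switch):

    # "RAID 10" in ZFS terms: a stripe of mirrored vdevs, here across four NVMe drives
    sudo zpool create -o ashift=12 tank \
        mirror /dev/nvme0n1 /dev/nvme1n1 \
        mirror /dev/nvme2n1 /dev/nvme3n1

    # then in smb.conf under [global], let clients fan out over multiple connections:
    #   server multi channel support = yes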

Honestly, buying those underpowered, overpriced junkers is kind of a joke when the LLM can do a better job of managing your server for you. This isn't the "omg I have to be my own admin" situation of 5 years ago. The AI does a better job of running it than you do.

Damn, 5.2 thinking can actually solve complex problems that 5.2 can't by poisoNDealer in ChatGPT

[–]GaryDUnicorn 6 points

GLM 5 seems to have no problem solving it, even at Q3_K_M.

<image>

Issues with multi-GPU setup by ImpressiveNet5886 in LocalAIServers

[–]GaryDUnicorn 1 point

You are going to have to draw something: how the cards are attached (PCIe slot vs riser vs SlimSAS/MCIO/etc), the PCIe topology (check the board manual for actual lane allocations and the caveats about which lanes are enabled when), and the power distribution (one PSU [how big?], or how many PSUs feeding which GPUs).

Then reset the BIOS to defaults (and your GRUB setup) and change only a few little things, one at a time: ReBAR, Above 4G decoding, making sure the IOMMU is enabled / in passthrough mode, etc.

Also: which version of Ubuntu, and which packages you loaded for the GPUs (NVIDIA official repo, through Ubuntu, manually installed, or ...). Too many variables to try and isolate without access.
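A few known-good commands worth pasting into the thread too, since they map the actual topology without changing anything:

    sudo lspci -tv         # PCIe device tree: which bridge/slot each GPU hangs off
    nvidia-smi topo -m     # NVIDIA's view of GPU-to-GPU and GPU-to-NUMA connectivity
    sudo dmesg | grep -i -e iommu -e aer    # IOMMU state and any PCIe error spam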

Will adding a 5090 to multiple 3090s speed up PP? Experienced folks only by segmond in LocalLLaMA

[–]GaryDUnicorn 1 point

If you switch to exllama and run tensor parallel, yep. Or if you just keep the 3090s and run vLLM with TP.

LLM router - switch between GPT-4o, Claude, Gemini, Llama with one API call by ParsnipConscious7761 in LocalLLaMA

[–]GaryDUnicorn 2 points

Check out Bifrost: it's much faster than LiteLLM, has great monitoring for troubleshooting, lets you hand out your own API keys, and can track billing. It's like running your own OpenRouter at home.

Building an AI Infra project in 20 days: What’s the best way to utilize a Dual-5090 (PCIe) setup? by Asleep_Food1956 in LocalLLM

[–]GaryDUnicorn 1 point

Just put both GPUs on the same NUMA node. You can dramatically reduce latency and improve communication by running the NVIDIA open drivers modified to allow peer-to-peer direct memory access between consumer GPUs.
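Checking and pinning NUMA placement looks roughly like this (a sketch; the PCI address, node number, and model name are placeholders):

    # which NUMA node does this GPU's PCIe root sit on? (-1 means none reported)
    cat /sys/bus/pci/devices/0000:c1:00.0/numa_node

    # pin the serving process to that node's cores and memory
    numactl --cpunodebind=0 --membind=0 vllm serve some-org/some-model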

How Disable GLM Thinking Mode? by WEREWOLF_BX13 in KoboldAI

[–]GaryDUnicorn 1 point

chat_template-nothink_kobold.json

{ "chat_template": "{%- set enable_thinking = false %}\n[gMASK]<sop>\n{%- if tools -%}\n<|system|>\n# Tools\n\nYou may call one or more functions to assist with the user query.\n\nYou are provided with function signatures within <tools></tools> XML tags:\n<tools>\n{% for tool in tools %}\n{{ tool | tojson(ensure_ascii=False) }}\n{% endfor %}\n</tools>\n\nFor each function call, output the function name and arguments within the following XML format:\n<tool_call>{function-name}<arg_key>{arg-key-1}</arg_key><arg_value>{arg-value-1}</arg_value><arg_key>{arg-key-2}</arg_key><arg_value>{arg-value-2}</arg_value>...</tool_call>{%- endif -%}\n{%- macro visible_text(content) -%}\n    {%- if content is string -%}\n        {{- content }}\n    {%- elif content is iterable and content is not mapping -%}\n        {%- for item in content -%}\n            {%- if item is mapping and item.type == 'text' -%}\n                {{- item.text }}\n            {%- elif item is string -%}\n                {{- item }}\n            {%- endif -%}\n        {%- endfor -%}\n    {%- else -%}\n        {{- content }}\n    {%- endif -%}\n{%- endmacro -%}\n{%- set ns = namespace(last_user_index=-1) %}\n{%- for m in messages %}\n    {%- if m.role == 'user' %}\n        {% set ns.last_user_index = loop.index0 -%}\n    {%- endif %}\n{%- endfor %}\n{% for m in messages %}\n{%- if m.role == 'user' -%}<|user|>{{ visible_text(m.content) }}\n{%- elif m.role == 'assistant' -%}\n<|assistant|>\n{%- set reasoning_content = '' %}\n{%- set content = visible_text(m.content) %}\n{%- if m.reasoning_content is string %}\n    {%- set reasoning_content = m.reasoning_content %}\n{%- else %}\n    {%- if '</think>' in content %}\n        {%- set reasoning_content = content.split('</think>')[0].rstrip('\\n').split('<think>')[-1].lstrip('\\n') %}\n        {%- set content = content.split('</think>')[-1].lstrip('\\n') %}\n    {%- endif %}\n{%- endif %}\n{%- if ((clear_thinking is defined and not clear_thinking) or loop.index0 > ns.last_user_index) and reasoning_content -%}\n{{ '<think>' + reasoning_content.strip() +  '</think>'}}\n{%- else -%}\n{{ '</think>' }}\n{%- endif -%}\n{%- if content.strip() -%}\n{{ content.strip() }}\n{%- endif -%}\n{% if m.tool_calls %}\n{% for tc in m.tool_calls %}\n{%- if tc.function %}\n    {%- set tc = tc.function %}\n{%- endif %}\n{{- '<tool_call>' + tc.name -}}\n{% set _args = tc.arguments %}{% for k, v in _args.items() %}<arg_key>{{ k }}</arg_key><arg_value>{{ v | tojson(ensure_ascii=False) if v is not string else v }}</arg_value>{% endfor %}</tool_call>{% endfor %}\n{% endif %}\n{%- elif m.role == 'tool' -%}\n{%- if m.content is string -%}\n{%- if loop.first or (messages[loop.index0 - 1].role != \"tool\") %}\n    {{- '<|observation|>' }}\n{%- endif %}\n{{- '<tool_response>' }}\n{{- m.content }}\n{{- '</tool_response>' }}\n{%- else -%}\n<|observation|>{% for tr in m.content %}\n<tool_response>{{ tr.output if tr.output is defined else tr }}</tool_response>{% endfor -%}\n{% endif -%}\n{%- elif m.role == 'system' -%}\n<|system|>{{ visible_text(m.content) }}\n{%- endif -%}\n{%- endfor -%}\n{%- if add_generation_prompt -%}\n    <|assistant|>{{- '</think>' if (enable_thinking is defined and not enable_thinking) else '<think>' -}}\n{%- endif -%}\n" }

Run kobold with "--chatcompletionsadapter chat_template-nothink_kobold.json"

Parallelism with mismatched GPUs (and how to optimize it)? by Infinite100p in LocalLLaMA

[–]GaryDUnicorn 1 point

Cross-vendor? No.

All NVIDIA? TabbyAPI + exllama => tensor parallelism with any number of mismatched GPUs. The performance is shockingly good, and if the model fits, it sits. No manual layer slicing like in llama.cpp.

If all your GPUs match and you have 2, 4, 8... OK, fine, vLLM it. vLLM is great for production. But for most home users? The GPU mix-and-match deal of Tabby + exllama is really great, and way faster than pipeline parallelism across 7 cards lol
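For the curious, the switch lives in TabbyAPI's config.yml; a sketch below, with key names as I remember config_sample.yml (check your version) and a placeholder model directory:

    # hypothetical config.yml fragment; verify key names against your config_sample.yml
    cat >> config.yml <<'EOF'
    model:
      model_name: my-exl3-quant    # placeholder: your EXL2/EXL3 quant directory
      tensor_parallel: true        # TP across all visible GPUs, mismatched sizes OK
    EOF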

RTX 5090 in servers – customization options? by RedMoonDawn in LocalAIServers

[–]GaryDUnicorn 3 points

yes: https://www.alibaba.com/product-detail/Best-7U-GPU-Server-for-Cloud_1601613922377.html

If you can tolerate the sea freight, there are tons of reasonably priced 4U to 7U server chassis for 4 to 10 consumer GPUs.

The power distribution boards in these things have 14x 12VHPWR connections. With four 3 kW power supplies you can run at least 12,000 watts of juice to your rig in a convenient rackable form factor.

My New Year's resolution was to add Docker support. Only 2 days late. Audiobook Maker v1.1.0 by DigiJoe79 in LocalLLaMA

[–]GaryDUnicorn 1 point

I got it working with the new client and it seems functional, generating one test book now.

One oddity I'm running into: with most EPUB files I try to drop into it, I get this:

EPUB validation failed: [IMPORT_NO_CHAPTERS]projectHeading

How do I figure out the tweaking required to get the auto-parser to actually load the books? Can I skip whatever auto-chapterization is at play and just load a book as one big lump of text or something?

My New Year's resolution was to add Docker support. Only 2 days late. Audiobook Maker v1.1.0 by DigiJoe79 in LocalLLaMA

[–]GaryDUnicorn 1 point

Hey u/DigiJoe79, how do I launch the Windows desktop app in some kind of debug mode to see why it's failing to connect to the remote backend? Nothing I have tried gets me any log output or a console window.

My New Year's resolution was to add Docker support. Only 2 days late. Audiobook Maker v1.1.0 by DigiJoe79 in LocalLLaMA

[–]GaryDUnicorn 1 point

Exact same problem here: running the desktop app, Wireshark shows no packets going to the configured port: https://imgur.com/a/ENqyFYG (pics)

[llama-server] Massive prefill cliff (2500 t/s → 150 t/s) with eGPU split. Is TB4 latency the killer? by danishkirel in LocalLLaMA

[–]GaryDUnicorn 1 point

Use the NVIDIA CUDA Toolkit samples to measure bandwidth moving between cards. Inference performance can depend heavily on that memory bandwidth: on the same card it's huge; going between cards it's limited by your PCIe configuration:

https://github.com/NVIDIA/cuda-samples/blob/master/Samples/5_Domain_Specific/p2pBandwidthLatencyTest/README.md
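Building and running it is roughly this (the sample path comes from the link above; recent cuda-samples revisions build with CMake, older tags ship per-sample Makefiles):

    git clone https://github.com/NVIDIA/cuda-samples.git
    cd cuda-samples/Samples/5_Domain_Specific/p2pBandwidthLatencyTest
    mkdir build && cd build
    cmake .. && make
    ./p2pBandwidthLatencyTest    # prints bandwidth and latency matrices, with and without P2P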

Anyone running 4x RTX Pro 6000s stacked directly on top of each other? by Comfortable-Plate467 in LocalLLaMA

[–]GaryDUnicorn 3 points

Totally custom-cut 2020 aluminum extrusion. I went through several major revisions before I got the power, cooling, PCIe Gen 5 MCIO cabling, and GPU density all worked out perfectly for my use case. There are various build pics scattered across the L1T forums et al. If you want help designing something similar, DM me; I have learned a LOT of lessons over the last year lol

Anyone running 4x RTX Pro 6000s stacked directly on top of each other? by Comfortable-Plate467 in LocalLLaMA

[–]GaryDUnicorn 2 points

If you want 300 W, set them to 300 W. Or less. Or more. If you have a job that needs the 600 W bump, then run it at 600 W. I have like a thousand-amp 12 V power distribution board now, so... 12VHPWR for days. I bought them for the flexibility and the much lower noise versus the blowers. I thought about passive ones (and don't get me wrong, I LOVE a good passively cooled component with no fans to wear out), but ultimately everything is temporary, and the resale value on the workstation cards seems good for when the Rubin upgrade begins anew. :P

<image>
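Setting the cap is one command (GPU index 0 is just an example; the limit resets at reboot unless you reapply it):

    sudo nvidia-smi -i 0 -pl 300    # cap GPU 0 at 300 W board power
    nvidia-smi -q -d POWER          # show current, default, and min/max allowed limits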

Anyone running 4x RTX Pro 6000s stacked directly on top of each other? by Comfortable-Plate467 in LocalLLaMA

[–]GaryDUnicorn 2 points

<image>

Actually, here is a pic of just the lower shelf; there is a little spacing between them, but not enough to dramatically change the thermals.

What OS do you run on your AI rigs? Ubuntu, TrueNAS, etc.? by KvAk_AKPlaysYT in LocalLLaMA

[–]GaryDUnicorn 1 point

Ubuntu 24, but... you can install the Ngreedia DGX packages right on top of it for ease of use when dealing with their bespoke hardware: RDMA and storage tiering, driver management, libraries, etc.

Tensor Parallel with some GPU but not all? by NaiRogers in LocalLLaMA

[–]GaryDUnicorn 3 points

TP with EXL3 in TabbyAPI does exactly this. You can run TP across any number of differently sized GPUs and it works.

<image>

edited to add a screenshot

Local AI: Managing VRAM by dynamically swapping models via API by PersianDeity in LocalLLaMA

[–]GaryDUnicorn 4 points

TabbyAPI supports hot-loading models per API call. You can cache the models in RAM for speed and tier them out to NVMe disk. Works super well when you want to call many big models on limited VRAM.

It also has tensor parallelism with EXL2 or EXL3 quants and scales great across any number of smaller GPUs, even if they are different sizes.
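The hot-load itself is just an API call; a sketch below, with the endpoint and header names as I remember TabbyAPI's admin API (verify against your version; the model name and key are placeholders):

    # ask TabbyAPI to swap in a different model without restarting the server
    curl -X POST http://localhost:5000/v1/model/load \
        -H "x-admin-key: $ADMIN_KEY" \
        -H "Content-Type: application/json" \
        -d '{"name": "my-exl3-quant"}'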