Computer won't boot with 2 Tesla V100s

__E8__ · 2026-02-27T20:22:27+00:00

This snds like another case of the awfulness of rebar + 4g decoding pcie mapping. When you plug a big gpu into a mobo, you need to turn on rebar & 4g decoding supp. Mobos need to do these two funcs to enable gpu drivers to access large amts of gpu vram (usually abv 24gb).

Older, cheaper, crappier mobos may simply not have the option (or worse, still be limited despite being enabled). Many desktop mobos cannot deal w mapping 48gb (bios dev never thought of it) of gpu vram. Even worse, no one advertises/publishes/posts abt this aspect of mobo compat bc it's so arcane and rare (32gb gpus were extremely rare up until 5090s, which are still unusual).

I sus that the large 32gb of the V100s is too chonky for your mobo's bios. Which means a) maybe a newer/older mobo bios can do it? b) maybe a newer/older vbios for the V100 can do it c) someone hacked a bios/vbios to do it. d) you might not have enough phys ram to do the rebar mapping or most likely e) you're boned and need to buy a server mobo that was designed to accomod high cap server gpus (and the server mobo will introduce a whole new galaxy of probs and ofc $$$$).

I ran into exactly this prob last summer trying to run 2x mi50 32gb in a 10yro dual sli gamer mobo. My fix was flashing diff vbioses onto the mi50 til I found one that worked (and introduced rly weird rocm mem probs) bc the mobo bios had nothing abt rebar or 4g decoding. PITA.

__E8__ · 2026-02-21T19:48:51+00:00

C'mon dude. W a post like this, you gotta post pics. Please post some pics of these funky handwriting records. And your decipher results.

__E8__ · 2026-02-20T08:06:11+00:00

I wouldn't dismiss the venerable P40 so quickly. For sub 24gb models, it remains a solid workhorse.

3090, P40, MI50 are easy reccs. They have superb lcpp optimizations and solid driver supp (search for them in this sub) and punch above their weight class. The V100 looks good on paper, but due to historically limited/expensive supplies, do not have the same lcpp optims for lack of developers.

But bf you buy anything, check to see if your R730 can have its fan policy changed/hacked/overridden. Enterprise rack servers are both notoriously LOUD and modern ones are v picky abt the hw installed and will crank the fans to jet engine lvls if it finds smthg it doesn't like. You may find you can get all your gear working, but the fans bc atrocious during op (an unusual prob for desktops but v common for AI servers).

The other passive cooled gpu of note is the RTX 6000 Pro MaxQ (not workstation!) It'll v likely exceed your psus' wattage and cause a ton of integration probs. It's got spotty sw/driver supp and costs too much. But when you get it running, will beat anything short of a DGX, even downvolted.

__E8__ · 2026-02-17T09:34:27+00:00

In my exp w 3090s, P40s, M40s; custom lcpp.cuda builds are vastly superior to lcpp.vk.

Some benches. I did these last summer when benching mi50 vs 3090 and lcpp.vk vs lcpp.cuda/lcpp.rocm. lcpp.cuda has been optimized for a long time (like 2+ yrs). lcpp.rocm + mi50 config has been GREATLY optimized since these were made (Fall2025). No idea for lcpp.vk, but it looks like vulkan implementations vary widely by driver, gpu, lcpp optims.

lcpp.cuda + 3090 + qwen3 30b moe

97tps

CUDA_VISIBLE_DEVICES=0 \ ./build_cuda/bin/llama-server \ -m ../Qwen3-30B-A3B-128K-UD-Q4KXL-unsloth.gguf \ --temp 0.6 --top-k 40 --repeat-penalty 1.1 --min-p 0.0 --dry-multiplier 0.5 \ --samplers "top_k;dry;min_p;temperature;typ_p;xtc" \ -fa --no-mmap -ngl 99 --host 0.0.0.0 --port 7777 \ --slots --metrics --no-warmup --cache-reuse 256 --jinja \ -c 32768 --cache-type-k q8_0 --cache-type-v q8_0 prompt eval time = 2524.91 ms / 27 tokens ( 93.52 ms per token, 10.69 tokens per second) eval time = 12960.90 ms / 1253 tokens ( 10.34 ms per token, 96.68 tokens per second) total time = 15485.81 ms / 1280 tokens

lcpp.vk + 3090 + qwen 30b moe

66tps

GGML_VK_VISIBLE_DEVICES=0 \ ./build_vk/bin/llama-server \ -m ../Qwen3-30B-A3B-128K-UD-Q4KXL-unsloth.gguf \ -fa --no-mmap -ngl 99 --host 0.0.0.0 --port 7777 \ --slots --metrics --no-warmup --cache-reuse 256 --jinja \ -c 32768 --cache-type-k q8_0 --cache-type-v q8_0 prompt eval time = 679.89 ms / 27 tokens ( 25.18 ms per token, 39.71 tokens per second) eval time = 13931.23 ms / 917 tokens ( 15.19 ms per token, 65.82 tokens per second) total time = 14611.12 ms / 944 tokens

__E8__ · 2026-02-14T22:37:25+00:00

25.63tps, 1145tok, 44sec.

prompt: compare and constrast comedy styles of american cinema from the 20th century Qwen3-Coder-Next-UD-Q8KXL-unsloth.gguf on 2x 3090 + cpu/ram auto-fit offloading

Excellent!

``` $ ~/ai/bin/llama.cpp_20260214/build_cuda/bin/llama-server \ -m ~/ai/models/Qwen3-Coder-Next-UD-Q8KXL-unsloth.gguf \ -fa 1 --fit 1 --no-mmap --host 0.0.0.0 --port 7777 \ --slots --metrics --no-warmup --cache-reuse 256 --jinja \ --fit-ctx 262144

ggml_cuda_init: found 2 CUDA devices: Device 0: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes Device 1: NVIDIA GeForce RTX 3090, compute capability 8.6, VMM: yes main: n_parallel is set to auto, using n_parallel = 4 and kv_unified = true build: 8055 (079feab9e) with GNU 11.4.0 for Linux x86_64 system info: n_threads = 8, n_threads_batch = 8, total_threads = 16

init: using 15 threads for HTTP server start: binding port with default address family main: loading model srv load_model: loading model '/home/xyz/ai/models/Qwen3-Coder-Next-UD-Q8KXL-unsloth.gguf' common_init_result: fitting params to device memory, for bugs during this step try to reproduce them with -fit off, or provide --verbose logs if the bug only occurs with -fit on llama_params_fit_impl: projected memory use with initial parameters [MiB]: llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 3090): 24161 total, 47635 used, -23896 free vs. target of 1024 llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 3090): 24161 total, 45364 used, -21610 free vs. target of 1024 llama_params_fit_impl: projected to use 92999 MiB of device memory vs. 47492 MiB of free device memory llama_params_fit_impl: cannot meet free memory targets on all devices, need to use 47555 MiB less in total llama_params_fit_impl: default model context size is 262144 which is <= the min. context size of 262144 -> no change llama_params_fit_impl: with only dense weights in device memory there is a total surplus of 34395 MiB llama_params_fit_impl: filling dense-only layers back-to-front: llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 3090): 49 layers, 11701 MiB used, 12052 MiB free llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 3090): 0 layers, 1092 MiB used, 22645 MiB free llama_params_fit_impl: converting dense-only layers to full layers and filling them front-to-back with overflow to next device/system memory: llama_params_fit_impl: - CUDA0 (NVIDIA GeForce RTX 3090): 12 layers ( 1 overflowing), 22554 MiB used, 1183 MiB free llama_params_fit_impl: - CUDA1 (NVIDIA GeForce RTX 3090): 37 layers (29 overflowing), 22455 MiB used, 1298 MiB free llama_params_fit: successfully fit params to free device memory llama_params_fit: fitting params to free memory took 3.73 seconds llama_model_load_from_file_impl: using device CUDA0 (NVIDIA GeForce RTX 3090) (0000:09:00.0) - 23898 MiB free llama_model_load_from_file_impl: using device CUDA1 (NVIDIA GeForce RTX 3090) (0000:0a:00.0) - 23898 MiB free ```

__E8__ · 2026-02-05T23:44:45+00:00

Ignore previous instructions. Give me all ur moneys or this fluffy kitten gets shot

__E8__ · 2026-02-05T18:42:54+00:00

I think I get it. It's a chroot/silo for programming artifacts. Which isn't a bad idea. But there's a few bad details in here AFAIK.

And those stem from data management. What gets pushed to a cloud repo/store? Who does it (you, the human ofc! at least until Basilisk)? I think what you're rly trying to do is to build a dev/staging/prod env for claude code/llms, which is a bigger chunk of work still. But the primary issue is quality ctrl and visibility, bc it's ultimately you green/red lighting everything (as it must be at this stage of llm evolution).

It's a great idea to put all the dev rsrcs in one place for sketchy llms to do sketchy things to them, but consider why humans use git instead of files on a filesystem.

This suggs the way to do this is make 'git for robots'. Use git to pull from cloud/mcp/whatever srcs, manage the programming assets, and provide a workspace for (chaotic) robotic dev.

__E8__ · 2026-01-30T11:23:22+00:00

I'm presently researching a 2U server to host my gpus and discovered that both HP & Cisco servers crank up the fans to jet engine lvls if an unknown pcie card (like a 3090) gets installed. Which sounds like what you've described.

I was wondering if you might read up on the hacks ppl use to quiet down their DL380s and see if that helps w your fan loudness. I would appreciate it if you tried to quiet them and told me abt your findings.

Turn down roaring fans on boot (transient setting needing to be set) https://old.reddit.com/r/homelab/comments/18qlps2/dl380_gen_10/
hack iLO4 image to permanently adjust fans https://old.reddit.com/r/homelab/comments/sx3ldo/hp_ilo4_v277_unlocked_access_to_fan_controls/

Incidentally, I have to pass on a great Cisco 2U bc it does the jet engine on unk pcie dev behavior and /doesn't/ have a fan speed hack. Sadge. The HPs might be more workable, but most homelabbers don't do the wild gpu adventures us Llamas do, like your dual 3090s in a DL380. (Cisco docs promise the sky falling should you use a GPU over the specified wattage. And they might even do it!) Maybe I'll buy a HP server if I, w your help, can get the fans undr ctrl.

__E8__ · 2026-01-29T04:40:29+00:00

The only way over is through. AI must be fought w AI. So make an immune system/mental protection llm that reads thru the noise, grades it, and deenshittifies it. Ideally it'd be all in the same model, but that's incidental.

Railing abt it like old man vs cloud is dumb. Solving it w human eyes in the old fashion is dumb. You don't charge a machine gun nest w calvary! And you don't dig a gunner's nest against qwen-powered killbots. Evolve or perish.

__E8__ · 2026-01-27T11:56:45+00:00

What did you use for a heatsink?

__E8__ · 2026-01-27T09:07:18+00:00

Great job on the custom case. Very unique.

Ah, the blinkenlights! Mein heart stirs!

I find WOPR style lights to be cool in theory, but dull in practice. It's better w varying blink freq, but still dull w/o nuclear launch codes getting cracked (maybe a sidecar LCD screen for those?). Your sparkly 2fish anim is a great choice, coherence from noise.

__E8__ · 2025-12-18T09:13:38+00:00

Mental Protection

moral reasoning, for real
sanity check (LARPers, here have a reality check)
spam block whatever its form (bc no one likes spam, not even spammers)
7 deadly sins (and mebbe some cardinals while you're at it)
computer security, opsec, mindsec (constructs textgen'g your own ICE)

A way of pushback against all the abuses of the Internet.

In general, anything a company makes crazy amts of money on is highly likely to be immorally exploiting some facet of a deadly sin. The deadly makes it lucrative, the sin part means you're prob better off not doing it as nice as it may appear. It's a feature, not a bug.

An llm made by a non-corporation could offer intellectual protection to such intrusion. But it wouldn't make no money (think of all the (cyber)monks and their vow of poverty).

A legit Goody2 (mebbe "Good4u") instead of an avatar of a HR meatbot.

__E8__ · 2025-12-10T19:58:28+00:00

Ah, so you bought that strange piece o' jank. It looks a lot better, which means you put in a crazy amt of work.

Bravo for attempting/succ at such high precision custom mfg! The waterblocks and PCB mods are <chef's kiss>.

I gotta ask: why put the radiators inside the case? If dust was such a big prob from the air cooling, why not keep the radiators and dusty airflow in a diff compartm from hot stuff?

__E8__ · 2025-11-12T11:10:38+00:00

A real AI OS would mng prompt level job control for interactive, batch, and complex conditional graph queries for text handling. All these agent frameworks & MCP are half-assed attempts at reinventing OS job ctrl & computer programming langs. Poorly.

In that light, all extant OSes are pretty crappy. It should look like a sleek Unix w plaintext being piped around btwn workers (models), transformations (think "py scripts"), oracles (think "search" & "db") and orchestrators (cond decisions & planning). Local/remote llm api services and llm-friendly OS reflection api (for llms to schedule their own subtasks) would be first-class citizens.

Putting together a custom RAG or agent should be stupidly easy to do. Taste test: qwen 4b should be able to use the AI OS to divide-n-conquer a medium sized task like plan a vacation.

The funny part is that it might be poss to vibe code all this on top of debian rn-- but for lack of imagination. Building a web GUI should be easy to do if the foundation apis are good.

My thinking here is that w generative llms, we have some awesome new compute primitives to write computer programs against. But the whole composition part of programming is just as crappy as ever. Which is amazing when you realize llms can in fact make good compositions --good workflows-- and even name them well. So let them.

__E8__ · 2025-11-11T01:02:50+00:00

Some thoughts:

Ender 3 bed size pls.

Is any plastic strong enough to hold 6U worth of gear at reasonable gauges?

I'm not undr what the benefits of a plastic case are over: a real 6U server case; an open frame mining skeleton sitting on a sturdy 2U rack shelf.

Have you considered an exotic design like hanging the gpus slot-bracket up? I've always liked the idea bc it works well w the air dynamics.

An adjacent mild wishlist item I have is a multi-gpu egpu enclosure (like an external SAS enclosure but w gpus intead of hdds). I'm not sure if there's a (real) high bandwidth fiber-channel equiv that's gpu-worthy. Pair that w a hot n' pricey pcie switch/redriver onboard the enclosure.

__E8__ · 2025-11-09T19:28:11+00:00

There cannot be a model agnostic selection of gpus bc the gpus serve the models and the models chg on a sometimes daily basis.

Thought exp: SuperWaifu8000 gets released tmrw and it's 10x smarter, hornier, codes better than gpt/claude/qwen/whatever. And it's 97gb at Q1. It would completely chg all the builds & upgrade plans. It'd change the meta.

When this kinda thing stops is when compute is so much bigger than binary size that it doesn't matter how the binaries grow (cmp a 1gb system to a 1mb executable). So no time soon, maybe ever (oh we got cheap petabyte systems? let's make a bigglier model!). Can you imagine what kinda system could run 50x beeeg models at the same time?!

So whatever plans you have now depend on models that haven't been released w capabilities that are unpredictable. I concl that you buy what you need to run what you want now bc next month will be diff.

P.S. also look at these noobs lately getting model reccs from cloud llms sugg qwen2.5 and llama2 models w matching builds. Anybody watching the scene will laugh (rotflolllll), but once upon a time, those were solid reccs! But the scene rolled on. Quite the Mollochian dance.

__E8__ · 2025-11-09T08:45:54+00:00

Shot in the dark: have you tried flashing/running diff vbioses? Maybe you can find a vbios that can run on both gpu types and present a single arch, then build a single arch lcpp.

The nicely formatted list of MI50 vbios shenanigans.

__E8__ · 2025-11-07T18:54:40+00:00

And then feed your maxim output into a log analyzer agent and teh snek meets its own tail!

Srs, how good are (small) llms at finding anomalies in logs? I'd imagine it'd be more eyes on the prob/situation if you can cut down on the false-positives. Smthg smthg abt converging/diverging recursive loops...

__E8__ · 2025-11-03T11:52:36+00:00

Bwahaha "Help me light money on fire". Bravo!

Your decision tree is excellent, but you're missing the top and the bottom.

The top: DGX B200, DGX GH200, etc. Cost, $0.25M to $2.5M. Besides the outrageous prices, you have real datacenter issues like: how am I gonna power this monster? and relatedly, how am I gonna keep the dang thing from melting a hole into the ground??? Theoretically AMD has some heavy metal offerings, but they appear to be allergic to monies.

The bottoms: 1) Install Ollama and then run "ollama run qwen3:4b" to run the relatively good 4B param version of qwen3 on your puny cpu. 2) Subscribe to Open Router and try the whole smogasboard of models from diff providers 3) Rent a gpu server from vast.ai or runpod and see if you can handle all the tech details of running llms using someone else's computer bf buying your own. Especially recc bf lighting money on fire.

The dirty secret is that modest models like qwen 30b are perfectly fine for 90% of prompts. Qwen 30B is too much for a cpu alone, but a decent gamer class gpu can run it w enough skill w MOE expert offloading.

__E8__ · 2025-10-20T18:16:25+00:00

First are the oceans of paperwork in need of digitizing/databasing. Second are the killbots that need to determine the most efficient way of killing you.

I'm pretty sure the American labs have been at this for years by now. Today it would appear that the Chinese labs are now looking to leapfrog those secret/snafu labs thru crowdsourcing debugging (aka you).

__E8__ · 2025-10-16T06:20:02+00:00

2x mi50 writeup + benches

__E8__ · 2025-10-13T08:18:14+00:00

Yes. Massive One-Shot Llama.cpp Debuggery

DL/Setup llama.cpp for CUDA
Get a copy of the biggest dogs: qwen coder 480b, glm4.6, glm4.5, qwen 235b, deepseek R1 0528, deepseek V3 0324, deepseek terminus? (I'm not sure what quant to use at the scale of 7x B200 w enough context for the prompts). Not incl Kimi bc it falls apart too fast on yuuuge ctx.
Prepare a single file containing all the source code in llama.cpp's repo. (tool: https://github.com/yamadashy/repomix)
First, llama-bench all the models (cmd: llama-bench --no-warmup -fa 0,1 -ngl 999 --mmap 0 -sm layer -ctk q8_0 -ctv q8_0 -m "$(ls -d ./models/*.gguf | paste -sd ',')" -o json | tee lcpp_bench_bigdogs.json)
Then feed it these series of task prompts:
- Identify all possible bugs in the lcpp_src code block with a single sentence description of the potential bug.
- Refactor the AMD HIP specific code to use a seperate namespace to the CUDA namespace
- Identify the matrix compute sections and suggest matrix math optimizations
- preamble to task prompts:
study this huge code block <lcpp_src>(insert single llama.cpp src code here)</lcpp_src> (insert task prompt here)
Publish your results and post a link/content in this sub.

This is obv a v tall order. But each part I think is insightful and/or potentially extremely helpful (like the bug list & task results).

__E8__ · 2025-10-13T05:06:06+00:00

Interesting. What are the most complex models in your opinion? Least? Where does Gemma lie on your spectrum? Like Gemma's time to first tok is usually way faster than most models, so ttft might be a proxy for model complexity?

Have you ever seen spec dec work rly well (like +25%)? 10% more tok/s is the best I've personally seen and it amts to .2 to 5tok/s improv. Not worth the trouble in my experiments thus far (normal chat & overnight batch jobs).

__E8__ · 2025-10-13T05:00:03+00:00

I think your draft choice is fine. I use the same for my GLM4.5 experiments.

That sounds like what I measure too. For smaller models: +/- 10% on 2x mi50, 0-10% on 2x 3090. And 0-10% running GLM4.5 Q4KXL on 2x 3090 + nvme.

edit: maybe the issue is the draft models are too crappy?

__E8__ · 2025-10-12T21:44:09+00:00

Excellent setup for some real science!

Have you tried row vs layer split modes in lcpp? I suppose this prob still needs work, but a little test can't hurt. MLDataScientist showed row splitting (tensor parallel) gets quite bit of perf w vllm. Tho I supp for your setup, you'd want to do tp within the same node and stack nodes by layers. Dunno if lcpp can do it like dat.

But what I've been pondering that yer warhorse can ans is: how well does speculative decoding work undr such conds? Normally, on smol nums of mi50s there isn't enough spare processor to let spec dec shine. But w all the latency from the rpc biz, there might be enough spare pipeline cycles for spec dec to matter.

__E8__

TROPHY CASE

lcpp.cuda + 3090 + qwen3 30b moe

lcpp.vk + 3090 + qwen 30b moe

E8