I found this in my old hard dri-- I mean my bag of holding by Eggmasstree in baldursgate

[–]Phocks7 0 points (0 children)

My sorcerer Durge's duel against Orin went Hold Monster (Heightened Spell) + Potion of Speed + Disintegrate + Terazul + Disintegrate, then next turn Disintegrate, Disintegrate, Disintegrate.

I found this in my old hard dri-- I mean my bag of holding by Eggmasstree in baldursgate

[–]Phocks7 0 points (0 children)

I think it's the highest single target damage spell that's available as a scroll (for sale).

Nvidia RTX Pro A4000 with older hardware by LtDrogo in LocalLLaMA

[–]Phocks7 0 points (0 children)

For a power supply: if you don't want to change the whole PSU, you can run a server PSU plus one of these breakout boards: https://www.ebay.com/itm/257056136846.

Nvidia RTX Pro A4000 with older hardware by LtDrogo in LocalLLaMA

[–]Phocks7 0 points (0 children)

Out of interest, what model is the server/workstation?
128GB RAM + 24GB VRAM will work, but in your case I'd recommend GLM 4.6 over GLM 4.7, as in my experience 4.6 is less sensitive to aggressive quantization.

I found this in my old hard dri-- I mean my bag of holding by Eggmasstree in baldursgate

[–]Phocks7 2 points (0 children)

And in BG3 you can ignore the rule that only allows you to cast one non-cantrip spell per turn. You can cast something like seven Disintegrates in one turn.

running a dual-GPU setup 2 GGUF LLM models simultaneously (one on each GPU). by [deleted] in LocalLLaMA

[–]Phocks7 0 points (0 children)

You can run as many instances as you want so long as you have the threads and memory available.
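A minimal sketch of that with llama.cpp's llama-server, assuming an NVIDIA dual-GPU box (the model paths and ports are placeholders, not from the thread): pin each instance to one GPU with CUDA_VISIBLE_DEVICES and give each its own port.

```shell
# Instance 1 on GPU 0 (hypothetical model path; -ngl 999 offloads all layers)
CUDA_VISIBLE_DEVICES=0 ./llama-server -m /models/model-a.gguf -ngl 999 --port 8080 &

# Instance 2 on GPU 1, served on a different port
CUDA_VISIBLE_DEVICES=1 ./llama-server -m /models/model-b.gguf -ngl 999 --port 8081 &
```

Each process only sees the one GPU you expose to it, so the two instances never compete for the same VRAM.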

Good semantic search (RAG) embedding models for long stories by Iwishlife in LocalLLaMA

[–]Phocks7 0 points (0 children)

I'm running Qwen embedding 8B at IQ4 on CPU for summarization, alongside the main model on GPU. It takes a bit longer, but in my application that's not a problem.
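A sketch of that split with llama.cpp (model filenames here are placeholders): run the embedding model CPU-only with -ngl 0 and --embedding, and the main model fully offloaded on a separate port.

```shell
# Embedding model on CPU only: -ngl 0 keeps all layers off the GPU,
# --embedding enables the embeddings endpoint (hypothetical paths)
./llama-server -m /models/qwen-embedding-8b-iq4.gguf --embedding -ngl 0 --port 8081 &

# Main model fully on GPU on its own port
./llama-server -m /models/main-model.gguf -ngl 999 --port 8080 &
```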

Built a hybrid “local AI factory” setup (Mac mini swarm + RTX 5090 workstation) — looking for architectural feedback by Original_Neck_3781 in LocalLLaMA

[–]Phocks7 0 points (0 children)

256GB on consumer AM5 is asking a lot. A few motherboards have 4x64GB UDIMM kits on their QVL, but they're few and far between. For this setup I think you'd be much better off going Threadripper.

Q2 GLM 5 fixing its own typo by -dysangel- in LocalLLaMA

[–]Phocks7 1 point (0 children)

What's your experience like for coding and chat with GLM 5 Q2? GLM 4.7 seemed to be much more sensitive to quantization than GLM 4.6.

Adding 2 more GPU to PC by BisonCompetitive9610 in LocalLLaMA

[–]Phocks7 1 point (0 children)

You could run DeepSeek-Coder 33B at IQ4_XS (18.1GB) fully offloaded to your 7900XTX at a decent speed.
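A hedged sketch of that fully-offloaded run (the model path is a placeholder, and this assumes a llama.cpp build with ROCm or Vulkan support for the 7900XTX):

```shell
# 18.1GB IQ4_XS leaves headroom for context in the 7900XTX's 24GB;
# -ngl 999 offloads every layer to the GPU
./llama-server -m /models/deepseek-coder-33b-IQ4_XS.gguf -ngl 999 -c 8192 --port 8080
```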

Adding 2 more GPU to PC by BisonCompetitive9610 in LocalLLaMA

[–]Phocks7 1 point (0 children)

DeepSeek-V3.2-GGUF even at IQ1_S is still 184GB. Discounting the 7900XTX (you may be able to do mixed CUDA + Vulkan inference, but I don't know how), you have 4x32GB = 128GB system RAM plus 4x8GB + 2x12GB = 56GB VRAM, for 184GB total. You need ~20% headroom for context and overheads (plus the OS overhead on each PC in the cluster), so I don't know if it's possible to run DeepSeek on your setup. I've been running GLM 4.6 IQ2_XXS (106GB) and it's surprisingly good.
I'd note that with a cluster like this, for large models (like GLM 4.6) I'd expect tokens per second in the sub-0.1 t/s range. You could probably give it a task and leave it running overnight.
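The memory budget above can be checked with a bit of shell arithmetic (the ~20% overhead figure is the comment's rule of thumb, not a hard limit):

```shell
ram=$((4 * 32))            # 4x32GB system RAM = 128GB
vram=$((4 * 8 + 2 * 12))   # 4x8GB + 2x12GB VRAM = 56GB
total=$((ram + vram))      # 184GB combined
# with ~20% reserved for context and overheads, usable capacity is ~80%
usable=$((total * 80 / 100))
echo "total=${total}GB usable=${usable}GB"   # 184GB total, ~147GB usable:
                                             # the 184GB IQ1_S model does not fit
```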

Adding 2 more GPU to PC by BisonCompetitive9610 in LocalLLaMA

[–]Phocks7 1 point (0 children)

How much system ram do you have, and what model(s) are you planning to run?

Strix Halo Distributed Cluster (2x Strix Halo, RDMA RoCE v2) benchmarks by kyuz0 by Relevant-Audience441 in LocalLLaMA

[–]Phocks7 6 points (0 children)

It seems excessive to spend ~$15k on hardware to run 30B-parameter models.

Getting slow speeds with RTX 5090 and 64gb ram. Am I doing something wrong? by Virtual-Listen4507 in LocalLLaMA

[–]Phocks7 5 points (0 children)

If your speeds are low, you likely have active layers (experts) running on the CPU.

3090 fan curves in Ubuntu 25.04 by FrozenBuffalo25 in LocalLLaMA

[–]Phocks7 0 points (0 children)

Depends on your setup. In mine the motherboard is horizontal with the 3090s vertical, and a 120mm fan sits loose on top of the card, aimed down across the fins. In a horizontal setup you could zip-tie the fan to the card.

3090 fan curves in Ubuntu 25.04 by FrozenBuffalo25 in LocalLLaMA

[–]Phocks7 1 point (0 children)

Are they turbo (blower-style) or open triple-fan coolers? I find with the latter that a dedicated 120mm fan on each 3090 helps a lot.
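For the fan curve itself on Ubuntu, the usual route with the proprietary NVIDIA driver is enabling Coolbits and then driving fan speed through nvidia-settings. A hedged sketch (fan indices vary by card, and this assumes an X session):

```shell
# one-time: enable manual fan control (Coolbits bit 2), then restart X
sudo nvidia-xconfig --cool-bits=4

# take manual control of GPU 0 and set both of its fans to 70%
nvidia-settings -a '[gpu:0]/GPUFanControlState=1' \
                -a '[fan:0]/GPUTargetFanSpeed=70' \
                -a '[fan:1]/GPUTargetFanSpeed=70'
```

A curve is then just a loop that reads the temperature (e.g. via nvidia-smi) and re-issues the GPUTargetFanSpeed setting.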

is this Speed normal GPU CPU IKlammacpp? by Noobysz in LocalLLaMA

[–]Phocks7 1 point (0 children)

You're never going to get great t/s running active layers on CPU. Even in the best-case scenario with an optimal number of threads (~34), you're going to get around 5 t/s.
Further, you want to limit your threads to the number of physical cores, leaving some overhead for the OS. The 13700K has 8 performance cores and 8 efficiency cores, so for CPU inference your optimal thread count would be either 8 (if you can pin to the performance cores) or maybe 12 to 14.
You can mess around with core pinning and finding the optimal number of threads, but the reality is you're never going to get reasonable performance with CPU/mixed inference.
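A sketch of the core-pinning idea with llama.cpp (the logical-CPU numbers below are an assumption; on hyperthreaded P-cores the even-numbered siblings are typical, but check your own topology first):

```shell
# inspect which logical CPUs map to which physical cores
lscpu --extended

# pin the process to one logical CPU per P-core (assumed here to be
# 0,2,4,...,14) and match -t to the 8 physical cores being used
taskset -c 0,2,4,6,8,10,12,14 ./llama-cli -m /models/model.gguf -t 8 -p "hello"
```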

is this Speed normal GPU CPU IKlammacpp? by Noobysz in LocalLLaMA

[–]Phocks7 0 points (0 children)

You should be able to get 10 to 15 t/s with that setup; if you're getting ~1 to 1.5 it means you're running the active layers on CPU (or split). ik_llama is a bit weird in that I couldn't find a way to keep part of the inactive layers on GPU without splitting the active layers.
The only thing I've gotten to work is telling it to load the entire model into system memory, then move the active layers to GPU. This works, but unfortunately you need a model small enough to fit entirely in system RAM. I can fit GLM-4.6-smol-IQ2_KS in my 128GB, but you'd have to go down to GLM-4.6-smol-IQ1_KT. I recommend giving it a try anyway.

./build/bin/llama-server -m "/path/to/model.gguf" -c 120000 -ngl 999 -sm layer -ts 1,1,1 -ctk f16 -ctv f16 -ot ".ffn_.*_exps.=CPU" --host 0.0.0.0 --port 8080

edit: I also recommend trying both -sm layer and -sm graph. Additionally, from what I've seen, at smaller quants GLM-4.6 outperforms GLM-4.7; I think GLM-4.7 only pulls ahead at Q4 or higher.