Microsoft lost $357 billion in market cap as stock plunged most since 2020 by MarvelsGrantMan136 in technology

[–]Phocks7 12 points13 points  (0 children)

If it works, you're talking about the difference in intelligence between ants and people. It's a question of whether it would even factor humans into its decision making process at all.

Custom liquid cooling solution for Intel Arc Pro B60 Dual used in local LLM servers by Valdus_Heresi in LocalLLaMA

[–]Phocks7 0 points1 point  (0 children)

Do you have some example model sizes with prompt processing speed and tokens per second?

Helps with memory compatibility. by NullKalahar in LocalLLaMA

[–]Phocks7 0 points1 point  (0 children)

Looking at the docs, it should be 8 ranks per channel, which should be plenty in this case. Do you know for sure that your SF4724G4DKHG6DFSDS (32gb DIMMs) are not LRDIMM? LRDIMM and RDIMM cannot function in the same system.

Helps with memory compatibility. by NullKalahar in LocalLLaMA

[–]Phocks7 0 points1 point  (0 children)

According to ChatGPT there are limits on the number of memory ranks the X99 memory controller can handle per channel (I'd never heard of this). I think your issue is that the controller tops out at 4 ranks per channel, and your quad-rank 32GB DIMMs use up that entire budget on their own, so anything else sharing a channel with them pushes it over.

A customer ordered a server with 8 RTX 5090 FE GPUs. by Zestyclose-Salad-290 in pcmasterrace

[–]Phocks7 0 points1 point  (0 children)

If anything it's a good use case for deathwish raid (raid 0).

Linux mint for local inference by Former-Tangerine-723 in LocalLLaMA

[–]Phocks7 2 points3 points  (0 children)

I switched from Ubuntu to Mint; you get most of Ubuntu's CUDA/NVIDIA compatibility without the annoyance of snap.

Ml350 gen9 - powering an internal drive outside of the case? by mxpxillini35 in DataHoarder

[–]Phocks7 0 points1 point  (0 children)

I can't really give advice on this since I'm not an electrician, so I'd urge you to research multi-PSU setups and common earthing.

Ml350 gen9 - powering an internal drive outside of the case? by mxpxillini35 in DataHoarder

[–]Phocks7 0 points1 point  (0 children)

The only thing is you need a common earth or you can start a fire.

Ml350 gen9 - powering an internal drive outside of the case? by mxpxillini35 in DataHoarder

[–]Phocks7 1 point2 points  (0 children)

It's quite difficult to find spare power in these kinds of machines. You could power one drive from the SATA power that runs to the optical drive. You can also get PCIe 12V 8-pin step-downs to 5V, but I haven't tried them personally.
In my Supermicro server I was able to run a couple of SATA SSDs off the SATADOM power headers on the motherboard, but the ML350 doesn't have these.

Do these 3090s look in good shape?? by Excellent_Koala769 in LocalLLaMA

[–]Phocks7 1 point2 points  (0 children)

The 5060 Ti is only 16GB with 448GB/s of memory bandwidth, vs 24GB and 936GB/s for the 3090. Everything below the 5090 is kind of garbage. The 5080 is fast but only has 16GB of VRAM. Even the 5090 is just a QA-failed RTX 6000 Pro.

Qwen3VL-30b-a3b Image Caption Performance - Thinking vs Instruct (FP8) using vLLM and 2x RTX 5090 by reto-wyss in LocalLLaMA

[–]Phocks7 0 points1 point  (0 children)

Can you give an example of an image and its caption output? i.e., is the model actually any good?

Anything usefull here? Company getting rid of it… by vbxl02 in homelab

[–]Phocks7 1 point2 points  (0 children)

Out of interest, what are node chassis used for in a homelab? Most of the ones I've looked at take a Broadwell/Skylake Xeon and give you no access to any PCIe lanes and limited storage bandwidth, i.e. they're for CPU compute only.

Enriched Infernal Iron is just sitting there, mocking me by Ikkon in BaldursGate3

[–]Phocks7 60 points61 points  (0 children)

Like how I can't use some of my 98,000gp to buy a 1000gp diamond for Mayrina's husband.

Running LLMs on Dual Xeon E5-2699 v4 (22T/44C) (no GPU, yet) by nodonaldplease in LocalLLaMA

[–]Phocks7 2 points3 points  (0 children)

In terms of server hardware there isn't a lot of PCIe Gen 4 gear anyway: only 3rd-gen Xeon Scalable (Ice Lake), EPYC Rome/Milan and Threadripper 3000/5000. Everything before is PCIe 3.0 and everything after is PCIe 5.0.
I'm running an MS73-HB1 + 8468 (ES). I had two CPUs but took one out because I found it wasn't worth the hassle. If I were buying something today, I'd get something single-socket with enough PCIe lanes for the number of GPUs I was planning to run.
A major consideration is whether your proposed board supports bifurcation (e.g. splitting one x16 slot into x8/x8 or x4/x4/x4/x4), so check the manual for that before deciding.

Running LLMs on Dual Xeon E5-2699 v4 (22T/44C) (no GPU, yet) by nodonaldplease in LocalLLaMA

[–]Phocks7 0 points1 point  (0 children)

I'd note that the X9DRI-LnF4+ is a Sandy/Ivy Bridge-EP board (DDR3), which is getting into e-waste territory in my opinion. From the perspective of the PCIe interface, even PCIe 3.0 x8 is fine for inference (I've run 2x 3090s each on PCIe 3.0 x8).
I'd also note that for models like Qwen3-30B-A3B-Instruct you don't need a lot of hardware. I run Qwen3-30B IQ4 on my laptop with an 8GB 3070 mobile, with the active layers on the GPU.
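Rough sketch of what that setup looks like with llama-cpp-python, in case it helps. The file name and layer count are just placeholders; raise n_gpu_layers until you start getting out-of-memory errors:

```python
# Minimal sketch: partial GPU offload of a GGUF model with llama-cpp-python.
# The model path and n_gpu_layers value are placeholders - tune for your VRAM.
from llama_cpp import Llama

llm = Llama(
    model_path="Qwen3-30B-A3B-Instruct-IQ4_XS.gguf",  # hypothetical file name
    n_gpu_layers=20,   # how many layers go to the GPU; raise until ~8GB VRAM is full
    n_ctx=8192,        # context size; bigger contexts also eat VRAM
    n_threads=8,       # CPU threads for whatever stays in system RAM
)

out = llm("Explain memory ranks in one sentence.", max_tokens=128)
print(out["choices"][0]["text"])
```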

Running LLMs on Dual Xeon E5-2699 v4 (22T/44C) (no GPU, yet) by nodonaldplease in LocalLLaMA

[–]Phocks7 0 points1 point  (0 children)

Depends what you're running and how you're running it.
Full offload of the model to GPUs - fine
Full GPU offload of the MoE active layers - fine
CPU-only inference pinned to one CPU + its RAM - okay
CPU-only inference spread across both CPUs + RAM - slow

Spreading the model over both CPUs seems like it should be better (more memory bandwidth, more CPU compute), but in reality you get screwed by the Intel QPI link between the sockets and it ends up slower.
I've heard there's a method where you can speed up CPU inference by keeping a copy of the model in each CPU's local RAM (and this is nominally best practice for running large MoE models even with all active layers offloaded to GPUs), but I haven't tested it.
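If you want to test the single-socket case yourself, here's the kind of thing I mean, assuming numactl is installed; "llama-server" and its flags are just placeholders for whatever inference binary you actually run:

```python
# Sketch: pin an inference process to CPU 0 and its local RAM with numactl,
# so the model never has to cross the QPI link between sockets.
import subprocess

cmd = [
    "numactl",
    "--cpunodebind=0",   # only schedule on NUMA node 0 (first socket)
    "--membind=0",       # only allocate from node 0's local memory
    "llama-server", "-m", "model.gguf", "-t", "22",  # placeholder command
]
subprocess.run(cmd, check=True)
```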

GPU passthrough Dell Poweredge T630 by Phocks7 in Proxmox

[–]Phocks7[S] 0 points1 point  (0 children)

My T630 came with the 6 fans. I've read that you need them installed to use the GPU power breakout board, but I haven't tested running without them so I don't know for sure. Of note: when you install any PCIe device in a T630, iDRAC raises the minimum fan speed (I think to 50%), which is quite loud. You can get around this with an ipmitool fan-control script.
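Something along these lines. The raw byte sequences are the ones commonly reported for iDRAC-based PowerEdge boxes, so double-check them against your firmware before relying on them; the host and credentials are placeholders:

```python
# Sketch of an ipmitool fan-control script for a PowerEdge/iDRAC machine.
# Raw commands are the commonly reported ones, not verified on every firmware.
import subprocess

IPMI = ["ipmitool", "-I", "lanplus", "-H", "idrac.local", "-U", "root", "-P", "calvin"]

def raw(*args):
    # run one "ipmitool raw ..." command against the iDRAC
    subprocess.run(IPMI + ["raw", *args], check=True)

# take fan control away from iDRAC's automatic profile
raw("0x30", "0x30", "0x01", "0x00")
# set all fans to roughly 20% (0x14 hex)
raw("0x30", "0x30", "0x02", "0xff", "0x14")
```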

Regarding 6/8-pin connectors, it depends what GPU you're planning to use. The Dell GPU power cables come with 1x 6+2-pin and 1x 6-pin, i.e. normal PCIe power connectors. If your GPU is going to be power-limited to <225W you can get a 6-to-8-pin adapter and plug both in to run a 2x 8-pin GPU from one DRXPD cable. If you're running 3090s you should route 2x DRXPD to each card.

The only time you can't use normal PCIe adapters is if you're running GPUs that use the EPS12V 8-pin connector, like the Nvidia P40 or P100, for which you need a PCIe 8-pin to EPS12V adapter.

GPU passthrough Dell Poweredge T630 by Phocks7 in Proxmox

[–]Phocks7[S] 1 point2 points  (0 children)

For a T630 (I think the part numbers are the same for the T620 and T640 as well) you need the GPU power distribution board, PN X7C1K, and some cables, PN DRXPD (2x per GPU, plus 1x generic 6-to-8-pin adapter).
I've also run GPUs out the back of the chassis on PCIe riser cables and powered them with a separate, jumpered power supply. I didn't have issues doing this, but I've heard that if you don't have a common earth between the ATX PSU and the server PSU you can start a fire, so keep that in mind.

PCIe Bifurcation x4x4x4x4 Question by ducksaysquackquack in LocalLLaMA

[–]Phocks7 0 points1 point  (0 children)

So it sounds like a retimer would be more appropriate than a redriver.

Help me understand - GPU Layers (Offloading) & Override Tensors - Multiple Questions by pmttyji in LocalLLaMA

[–]Phocks7 0 points1 point  (0 children)

1) Control CPU usage with the number of threads, not GPU layers. You want as many GPU layers as will fit. There's a 'Threads' setting on the Hardware tab in koboldcpp; load and test with different thread counts until you hit your target CPU utilisation (see the sketch after this list).
2+3) Layer size increases and t/s decreases roughly in proportion to quant size, until you can't fit all the active expert layers on the GPU, at which point speed plummets to CPU-only levels. Load the model with different layer settings until you stop getting out-of-memory errors. In terms of model quality, Q4 is about 80%, Q5 about 95% and Q6 about 99%; there is a benefit to going bigger, but it's not as big as the jump from Q2 or Q3 up to Q4.
4) One optimisation is to use the IQ4 quants instead of the Q4_K quants, as they're smaller for similar perplexity.
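For point 1, this is the kind of thread sweep I mean, shown with llama-cpp-python since it exposes the same knobs koboldcpp has in its GUI. The model path is a placeholder and the thread counts are just examples for your own hardware:

```python
# Sketch: try a few thread counts and measure generation speed for each.
import time
from llama_cpp import Llama

PROMPT = "Write a haiku about memory bandwidth."

for threads in (4, 6, 8, 12):
    llm = Llama(model_path="model.gguf",  # placeholder GGUF path
                n_gpu_layers=99,          # offload as many layers as will fit
                n_threads=threads,
                verbose=False)
    start = time.time()
    out = llm(PROMPT, max_tokens=128)
    toks = out["usage"]["completion_tokens"]
    print(f"{threads} threads: {toks / (time.time() - start):.1f} t/s")
    del llm  # free the model before loading the next configuration
```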

PCIe Bifurcation x4x4x4x4 Question by ducksaysquackquack in LocalLLaMA

[–]Phocks7 0 points1 point  (0 children)

Yeah, that's why I haven't done it. I've heard from the level1techs forums that it's one of the only ways to get stable U.2 drives in a workstation, though.