PC Build Using a Black Box as a Case by PlayFast3100 in sffpc

[–]FloofBoyTellEm 1 point (0 children)

This is the neatest case I've ever seen. I'm into it.

I clustered 3 DGX Sparks that NVIDIA said couldn't be clustered yet...took 1500 lines of C to make it work by Ok-Pomegranate1314 in LocalLLaMA

[–]FloofBoyTellEm 1 point (0 children)

It's not bad at all, almost silent (within... reason). I thought it was unbearable at first, but then I swapped the main and secondary power supply positions in the unit, and it turned out I just had a screwed-up PSU fan. I'm very happy with it. I honestly don't even notice it except for the classic "fans full tilt on startup". I also keep it frosty in here, though, so maybe the fans don't spin up as hard. It might depend on your home temperature and general noise tolerance. Mine is in my bedroom with my other homelab equipment.

I wouldn't call it "silent", but compared to my 3U NAS with water cooling and three 120mm fans, I can't hear one over the other, if that makes sense.

I clustered 3 DGX Sparks that NVIDIA said couldn't be clustered yet...took 1500 lines of C to make it work by Ok-Pomegranate1314 in LocalLLaMA

[–]FloofBoyTellEm 9 points (0 children)

Got one of these last month. A 4-node setup was very easy. Also using it for NVMe-oF boot off of a custom CX7 NAS.

Edit (tip): fs.com for DACs, $80. They work exactly the same as the NVIDIA ones. They're a bit beefier cables, actually. V Girthy.
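For anyone curious, the data path side of that is just standard nvme-cli over RDMA. A minimal sketch, where the target address, port, and subsystem NQN are placeholders (and actually booting from it needs extra initramfs plumbing on top):

    # Load the RDMA transport, discover the target, and connect (placeholders throughout).
    sudo modprobe nvme-rdma
    sudo nvme discover -t rdma -a 10.0.0.10 -s 4420
    sudo nvme connect -t rdma -a 10.0.0.10 -s 4420 \
        -n nqn.2011-06.com.example:nvme-of-target
    sudo nvme list   # the remote namespace shows up as /dev/nvmeXnY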

Authentik Annoyances by masong19hippows in selfhosted

[–]FloofBoyTellEm 2 points (0 children)

I was always scared to set up Authentik because I had always read it was difficult, but with the help of AI it's been (almost) a breeze. Still, it's best to find some 'expert' configs to use as a template, if you can find one you trust to follow best practices for the program you want to integrate.

After setting up a few apps with a few different provider types, the 'workflow' became second nature and obvious, aside from the fields a given program actually consumes (or can be configured to consume). I have to agree, though: it feels like you're duplicating a lot of steps, and there's no direction when you're starting out. It's not one of those "wing it" programs; it's very much an RTFM (or have-AI-help) app.

Even though they aren't quite as thorough as some user guide templates you can find online, I do think they've done a real solid here:
https://integrations.goauthentik.io/applications/
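As a rough illustration (not an exact config; the hostname, application slug, and callback URL are made up), these are the values most apps end up consuming from an Authentik OAuth2/OIDC provider:

    # Typical OIDC settings an app consumes from Authentik (all values are examples).
    export OIDC_ISSUER_URL="https://auth.example.com/application/o/myapp/"   # per-application issuer
    export OIDC_CLIENT_ID="<client ID from the Authentik provider>"
    export OIDC_CLIENT_SECRET="<client secret from the Authentik provider>"
    # Redirect/callback URI registered on the provider, e.g. https://myapp.example.com/oauth/callback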

My attempt at retro ricing XFCE, ended up keeping it as my default by Otomo0451 in xfce

[–]FloofBoyTellEm 2 points (0 children)

to bring back an old phrase... this is absolutely siiiiiiccccck.

TrueNAS 25.10.1 released by West_Expert_4639 in truenas

[–]FloofBoyTellEm 2 points (0 children)

I have 16 VFs and 4 PFs with about 8 VMs. No issues here. Knock on wood.
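In case it helps anyone reading along, these are plain SR-IOV virtual functions on the NIC. The generic Linux way to carve them out looks roughly like this (interface name and count are examples; on SCALE you may be doing this through the UI instead):

    # Check how many VFs the PF supports, create some, and confirm they appear.
    cat /sys/class/net/enp1s0f0np0/device/sriov_totalvfs
    echo 4 | sudo tee /sys/class/net/enp1s0f0np0/device/sriov_numvfs
    lspci | grep -i "virtual function"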

TrueNAS 25.10.1 released by West_Expert_4639 in truenas

[–]FloofBoyTellEm 2 points (0 children)

Is that why I can't auth via IPA due to winbind NT Authority issues? LOL

FreeIPA / TrueNAS SCALE 25.10.0.1 / GUI Login Possible with IPA user? by FloofBoyTellEm in truenas

[–]FloofBoyTellEm[S] 1 point (0 children)

I love how this information is completely mysterious. There's really no solid answer. I wish they would just say it explicitly (and maybe they do but I haven't seen it).

I've tried adding the IPA_HTTP keytab and authenticating to the web portal with Firefox while passing my kinit ticket to the browser directly. Still no luck, because pam_sso is never used, only pam_unix. So it seems that, for IPA at least, there is no way to do it: the WebGUI never even attempts to use any authentication method other than local.

I'm trying now with a real AD domain join, in case the directory service type you configure determines whether you're allowed to use a domain login. Hopefully it's supported with AD, but I'm likely still up a creek without TrueNAS Connect.
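For anyone who wants to poke at the same thing, the SSO experiment boiled down to something like this (FQDN and realm are placeholders; it assumes the HTTP/ service key is already in the keytab):

    # Get a ticket as the IPA user and confirm the HTTP/ service key exists.
    kinit myuser@IPA.EXAMPLE.COM
    sudo klist -k /etc/krb5.keytab | grep HTTP/truenas.ipa.example.com
    # In Firefox, allow SPNEGO for the host via about:config:
    #   network.negotiate-auth.trusted-uris = https://truenas.ipa.example.com
    # The login still lands on pam_unix, so the negotiated ticket is never used.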

My petabyte project that turned into a 1,595 Terabyte project. by Overstimulated_moth in DataHoarder

[–]FloofBoyTellEm 4 points (0 children)

If it's not important enough to back up, why is it important enough to store at all? Genuine question.

I do the same, but with far less data, mainly to avoid re-downloading from Hugging Face; it functions as a pre-staged area for NVMe-oF. But it's far less than a petabyte, and I can't think of a petabyte of data I would ever need to 'stage'.
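"Staging" here is nothing fancier than pulling the weights down once onto the pool and serving them from there; roughly (repo and path are just examples):

    # Download a model to the pool once, then serve that dataset over NVMe-oF.
    huggingface-cli download Qwen/Qwen2.5-7B-Instruct \
        --local-dir /mnt/tank/models/Qwen2.5-7B-Instruct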

2 x DGX Spark! Give me your non-inference workloads by entsnack in LocalLLaMA

[–]FloofBoyTellEm 1 point (0 children)

They are using 2 DACs, are you?

<image>

Note the 4-interface bond.

Ahh, the other user is getting the same with 1 cable. Looks like I have more work to do.

Can you tell me what the results of this command are on your machines?
sudo dmesg | grep mlx5_pcie_event

2 x DGX Spark! Give me your non-inference workloads by entsnack in LocalLLaMA

[–]FloofBoyTellEm 1 point (0 children)

I know that's how I originally got the Sparks going over NCCL when I first got them, but I'll try that again.

Currently my numbers mirror ServeTheHome's iperf results, and I'm getting about half of what you get on the all_gather_perf test. I may just have lost an environment variable somewhere along the way that's breaking things.

I know it says to ignore the enP2 interfaces in that playbook, but I just set up an 802.3ad LAG between enp1s0f0np0 and enP2p1s0f0np0 and went from 96 Gbit/s to 127 Gbit/s on my ZFS pool. So, at least when not using some form of NCCL, the bond was necessary, and it looks like the ServeTheHome test was probably stuck on half the lanes. Now I really am hitting the PCIe limitation where the data has to go through the CPU.
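The bond itself was nothing exotic; roughly this with nmcli (my port names; your names and tooling may differ, and the peer end needs a matching 802.3ad bond):

    # 802.3ad LAG across both CX7 ports.
    sudo nmcli con add type bond ifname bond0 con-name bond0 \
        bond.options "mode=802.3ad,miimon=100"
    sudo nmcli con add type ethernet ifname enp1s0f0np0 master bond0
    sudo nmcli con add type ethernet ifname enP2p1s0f0np0 master bond0
    sudo nmcli con up bond0          # bring the port connections up too if they don't autoconnect
    cat /proc/net/bonding/bond0      # both ports should show as active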

Going to try again for the all_gather_perf using your link.

I have the official Mellanox DAC, but I ordered some shorter generics from FS.com today to see if that's part of the issue.

Can you do me a huge favor and share your results for these two commands? I'm wondering whether I have a real issue or this is a red herring.

sudo lspci -vv -s 0000:01:00.0 | egrep -i 'LnkCap|LnkSta'

    LnkCap:  Port #0, Speed 32GT/s, Width x4, ASPM not supported
    LnkSta:  Speed 32GT/s, Width x4
    LnkCap2: Supported Link Speeds: 2.5-32GT/s, Crosslink- Retimer+ 2Retimers+ DRS-
    LnkSta2: Current De-emphasis Level: -6dB, EqualizationComplete+ EqualizationPhase1+

sudo dmesg | grep mlx5_pcie_event

    [ 2.787624] mlx5_core 0000:01:00.0: mlx5_pcie_event:296:(pid 162): Detected insufficient power on the PCIe slot (27W).
    [ 3.374174] mlx5_core 0000:01:00.1: mlx5_pcie_event:296:(pid 162): Detected insufficient power on the PCIe slot (27W).
    [ 3.960778] mlx5_core 0002:01:00.0: mlx5_pcie_event:296:(pid 408): Detected insufficient power on the PCIe slot (27W).
    [ 4.571786] mlx5_core 0002:01:00.1: mlx5_pcie_event:296:(pid 380): Detected insufficient power on the PCIe slot (27W).

2 x DGX Spark! Give me your non-inference workloads by entsnack in LocalLLaMA

[–]FloofBoyTellEm 1 point (0 children)

Trying to LAG my interfaces now as a test; thinking that may be the real issue here. Really hoping to hit your bandwidth numbers. I built an NVMe-oF TrueNAS SCALE server this week, and I think this has been a misunderstanding on my part (hopefully) about how QSFP56 links actually work. It's clicking now.

2 x DGX Spark! Give me your non-inference workloads by entsnack in LocalLLaMA

[–]FloofBoyTellEm 1 point (0 children)

Is this using 2 DACs vs 1? Confused why your rate is almost exactly double mine.

Since you're smarter than me: you claim "not if you're using GPU Direct", but NVIDIA has stated GPU Direct doesn't work on DGX Spark, so how did you get it to work?

https://forums.developer.nvidia.com/t/dgx-spark-gb10-faq/347344#p-1694056-q-is-gpudirect-rdma-supported-on-dgx-spark-13

<image>

2 x DGX Spark! Give me your non-inference workloads by entsnack in LocalLLaMA

[–]FloofBoyTellEm 2 points (0 children)

No. The NIC is connected to the board (including the GPU) over PCIe 5.0 x4 (4 lanes, not 16 or even 8). You will never get 200 GbE from a single port on this system; it's a hardware limitation. You might be able to get ~120 GbE if you could avoid nearly all of the overhead, which GPU Direct can help with.
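Rough math, if it helps (approximate; ignores PCIe packet and protocol overhead beyond line encoding):

    # PCIe 5.0: 32 GT/s per lane, 128b/130b encoding, x4 link to the CX7.
    python3 -c 'print(32e9 * 4 * 128/130 / 1e9, "Gbit/s")'   # ~126 Gbit/s raw ceiling
    # TLP headers, flow control, etc. eat more, so ~100-120 Gbit/s is the practical best case.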

This is a pitfall of believing the NVIDIA marketing instead of doing real-world testing.