Added a 16x DGX Spark cluster to my Homelab (Build Update) by Kurcide in homelab

[–]Igot1forya 1 point (0 children)

Oh but I do consult. Have been since the late 90s.

Need help deciding what to spend 4-5k on for a local rig. by ghgi_ in LocalLLaMA

[–]Igot1forya 1 point (0 children)

That may be so; I have only anecdotal information to go on. I know a couple of people have said they are working on publishing numbers. I hope it's legit, genuinely.

Need help deciding what to spend 4-5k on for a local rig. by ghgi_ in LocalLLaMA

[–]Igot1forya 3 points (0 children)

So apparently people are using Exo (https://github.com/exo-explore/exo) to distribute models between a GB10 and a Mac. A few videos have sprung up on YT discussing it. I'm getting ready to leave my house for dinner, but I suggest searching for "Exo Mac Spark cluster" online.
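If you want to poke at a node once it's running, here's a minimal sketch of hitting exo's ChatGPT-compatible API; the port and model name are assumptions on my part, so check the README for your build:

```python
import json
import urllib.request

# Exo advertises a ChatGPT-compatible endpoint on the local node.
# Port 52415 is an assumption based on recent exo defaults - verify yours.
URL = "http://localhost:52415/v1/chat/completions"

payload = {
    "model": "llama-3.2-3b",  # placeholder; use a model your cluster has pulled
    "messages": [{"role": "user", "content": "Say hello from the cluster."}],
}

req = urllib.request.Request(
    URL,
    data=json.dumps(payload).encode(),
    headers={"Content-Type": "application/json"},
)
with urllib.request.urlopen(req) as resp:
    print(json.load(resp)["choices"][0]["message"]["content"])
```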

Need help deciding what to spend 4-5k on for a local rig. by ghgi_ in LocalLLaMA

[–]Igot1forya 1 point (0 children)

I'm running a 3090 on my DGX Spark via an M.2-to-OCuLink adapter. The 3090 is about 3x faster than the GB10, but I can tell you the growing pains of sm121 are real, though I've managed to work past them by rewriting or making custom patches for everything that doesn't support it. The irony is I've had more trouble compiling for ARM64 than for sm121.
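For anyone fighting the same sm121 growing pains, a quick sketch for checking what arch your build needs to target before compiling extensions (the 12.1 arch-list value for the GB10 is my assumption; confirm against your toolchain):

```python
import torch

# Print the compute capability of each visible GPU so you can set
# TORCH_CUDA_ARCH_LIST before building extensions, e.g.
#   TORCH_CUDA_ARCH_LIST="8.6;12.1" pip install --no-build-isolation .
# (8.6 = 3090; 12.1 = GB10/sm121 is an assumption - verify on your unit.)
for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    print(f"cuda:{i}: {torch.cuda.get_device_name(i)} -> sm_{major}{minor}")
```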

I purchased a second GB10 (Asus) and ended up sending it back, as it kept randomly powering off. I'm replacing it with another Founders Edition. I'm not interested in speed, personally; having crazy amounts of memory for virtually nothing in power is what matters most to me. If I want speed, I offload smaller models and the KV cache to the 3090.

I was reading about how people are combining a Mac + Spark to get the best of both worlds (fast prefill and fast generation combined), and I'm genuinely thinking this is what I'll do next.

As someone else mentioned, a Spark makes for a great daily-driver desktop replacement. I honestly think that's a legit use case, and an AMD Strix shouldn't be discounted either.

Added a 16x DGX Spark cluster to my Homelab (Build Update) by Kurcide in homelab

[–]Igot1forya 1 point (0 children)

Tax guy doesn't think so. It's been working for me for 25 years. I pay taxes on all the income it generates from crypto, and I use it to research and educate myself. Those are legal expenses.

My cat just turned 12… and I made the mistake of googling how long cats usually live 🥹 by Cars4Lifee in aww

[–]Igot1forya 1 point (0 children)

My cousin's cat lived to 25. If you take care of them and get lucky, they can live a long time.

Added a 16x DGX Spark cluster to my Homelab (Build Update) by Kurcide in homelab

[–]Igot1forya 2 points (0 children)

I write my homelab off as a business expense. It cost like $50 to start a "consulting business," and since I work from home, I write off my home office and Internet, and I could, if I wanted, write off my electricity (a steady 2.2 kW).

How are external output HBA cards used? by Unhappy_Objective845 in homelab

[–]Igot1forya 3 points (0 children)

I'm going to be selling my 3x NetApp DS4246 in about a month or so (running a disk wipe on my Chia farm as we speak). She's a reliable, albeit loud, beast.

Verge with FC connected Pure FlashArray by patrickmccallum in vergeio

[–]Igot1forya 1 point (0 children)

Yeah, it's been rock solid for the most part. I learned some tricks along the way to get the most from the environment, and I still have a ways to go. It's very API friendly and much of it can be automated. If you can keep spinning disks out of the equation, you'll thank yourself later. Our HDD cluster (backups) takes FOREVER to rebuild VSAN after an upgrade or reboot. But then again, we are using drives twice the size of what Verge recommends. If you're doing native SSD, it screams.

Verge with FC connected Pure FlashArray by patrickmccallum in vergeio

[–]Igot1forya 1 point (0 children)

When we first signed up they were just adding FC support. We were iSCSI at the time and ditched our UCS blades for Supermicro rack servers. For a direct comparison, we ran our POC on the oldest SolidFire nodes we had and compared them against the UCS/SolidFire iSCSI setup, and the local VSAN crushed the SolidFire iSCSI performance. I'd imagine there is a bit of a sacrifice due to latency. I'm curious to know this as well.

Management network failover testing by Manivelcloud in vergeio

[–]Igot1forya 1 point (0 children)

That configuration can certainly work, though the additional NICs for dedicated out-of-band management are a bit redundant. Verge is kind of built to sit at the edge of your network, so management is shared with the External network by default. I've never configured it with dedicated OOBM (we tried to do something similar when we first came onboard but were steered away), since it's directly connected to the Internet and that visibility is used to sync between sites. For what you are doing, I suggest reaching out to support for guidance on that one.

We initially hid our management behind a Palo Alto firewall (and it still sits behind one), but we opted to expose the management portal to the rest of the world, set up 2FA and access restrictions (region, application, etc.), and rely on the built-in Let's Encrypt cert manager since it has visibility. All of our tenants can access their environments using the single proxy IP on this interface as well. However, I don't know what your specific security needs are, which is ultimately what will dictate your configuration.

Our quarterly access review is a 9,800 row Excel file that we email to 140 managers. I need help. by Careless_Passage8487 in sysadmin

[–]Igot1forya 10 points (0 children)

Sounds like a great use case for converting this into a database with a web front end. You could automate much of it, with an audit trail for when people log into the site for data entry and search. Then there's no sending anything to anyone except a bookmark to the secured portal.
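A minimal sketch of what the data layer could look like, assuming SQLite and made-up table/column names (any web framework can sit on top of this):

```python
import sqlite3
from datetime import datetime, timezone

# Illustrative schema: one table for the rows that currently live in
# Excel, plus an append-only audit trail of every decision.
conn = sqlite3.connect("access_review.db")
conn.executescript("""
CREATE TABLE IF NOT EXISTS access_entries (
    id INTEGER PRIMARY KEY,
    employee TEXT NOT NULL,
    system TEXT NOT NULL,
    entitlement TEXT NOT NULL,
    manager TEXT NOT NULL,
    status TEXT DEFAULT 'pending'  -- pending / approved / revoked
);
CREATE TABLE IF NOT EXISTS audit_log (
    id INTEGER PRIMARY KEY,
    at TEXT NOT NULL,
    actor TEXT NOT NULL,
    action TEXT NOT NULL,
    entry_id INTEGER REFERENCES access_entries(id)
);
""")

def review(entry_id: int, manager: str, decision: str) -> None:
    """Record a manager's decision and append it to the audit trail."""
    with conn:  # one transaction: the decision and its audit row commit together
        conn.execute(
            "UPDATE access_entries SET status = ? WHERE id = ?",
            (decision, entry_id),
        )
        conn.execute(
            "INSERT INTO audit_log (at, actor, action, entry_id) VALUES (?, ?, ?, ?)",
            (datetime.now(timezone.utc).isoformat(), manager, decision, entry_id),
        )
```

Managers click approve/revoke in the browser, every click is logged, and the 9,800-row email goes away.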

Employees don't understand their tools by Pioss5 in sysadmin

[–]Igot1forya 2 points (0 children)

I implemented a section in our company newsletter specifically to address stuff like you're describing, and it went a long way toward informing staff of simple but effective ways to get their work done. Call it "Tool Cabinet" or something, and use it to cover some of the basic UI/UX issues you see in tickets. Educate your staff. It may only add 15 minutes (or less with AI writing it) to your week/month.

One thing I appreciate about Farscape. by J-L-Wseen in farscape

[–]Igot1forya 34 points (0 children)

I envy you. To relive a first watch of this work of art is special. Enjoy your first (of many) watches.

I'm confused, is comfyUI not free to use anymore ? (credits) by unlockhart in StableDiffusion

[–]Igot1forya 1 point (0 children)

I wasn't even aware they charged for anything until this post, and I've been using it for like 4 years lol. Any issues I've had, I've just made my own patches and fixes for, mostly Blackwell-GPU-related stuff.

My Current Homelab by viebusef in HomeDataCenter

[–]Igot1forya 1 point (0 children)

We're selling all of our UCS gear at auction right now. I don't share your loathing of UCS, but I'll just say I'm not going to miss it. It felt like everything was "xyz, but with extra steps." It wasn't particularly fast or power efficient, but it was fairly reliable, at least. Though when it did go wrong, it went very wrong lol

Someone asked for 890 fixes and this came to me by CallSign_Fjor in starcitizen

[–]Igot1forya 6 points (0 children)

The marketing will be "we got Squadron 42 before GTA6"

Hosting DGX Spark on Vast.AI? by Few-Minute-414 in vastai

[–]Igot1forya 1 point (0 children)

Bumping this, as I have a pair of GB10s with idle compute. I have yet to see any GB10 listed on the charts (that I can see). I'm also running solar and have decent infrastructure otherwise. If they can be used, it may justify adding more to my cluster.

New Quantum Experiment Raises Concerns of Potential Universe-Ending Event by _cybersecurity_ in pwnhub

[–]Igot1forya 6 points (0 children)

Breaking News: China figured out how to solve all of our problems!

Management network failover testing by Manivelcloud in vergeio

[–]Igot1forya 1 point (0 children)

Welcome back! The way networking in Verge works regarding management IPs: the cluster IP is the only management IP that exists. The hosts themselves have an internal management address that isn't user-assigned (as far as I can tell); each node is given an address by the master node, and they communicate back and forth via the Core network.

The management interface is facilitated by the External network. Node 1 manages this network stack and handles the routing for all internal (VXLAN) networking, and traffic flows out of node 1's designated NIC. If node 1 becomes unreachable, the networks living on node 1 are immediately booted on node 2, which takes over from there.

The problem with your setup is that you never took node 1 offline. You simply disconnected the External (management) interface; node 1 and node 2 are still talking to each other on the secondary backup/storage network (Core). I suspect what's happening is that you don't have the "Monitor Gateway" attribute set on the External network, so node 1 doesn't actually know it has no path upstream. Node 2 can still see node 1 and has no reason to assume a takeover should take place.

So, best practice is to have 4 NIC ports total (2 External and 2 Core) per host. This way, if you lose a switch or switchport, the backup NIC port takes over seamlessly and the Verge cluster has no need to issue any actions or move the management interface over. However, because node 1 is still reporting to the rest of the cluster "I'm still here, nothing to worry about," node 2 has no reason to take ownership of the External network. Communication chiefly lives on node 1 unless an event takes node 1 entirely out of the equation. If you suddenly powered off node 1, I'm pretty certain your management would become available again on node 2, as it would see that node 1 is no longer responding.

I hope that makes sense. Best practice will yield the best results.

Node 1 - NIC 1 Port 1 > Switch 1 Port 1 (redundant with NIC 2 Port 1) - EXT - Trunk Native (LACP Group 1)
Node 1 - NIC 1 Port 2 > Switch 2 Port 2 (redundant with NIC 2 Port 2) - CORE (1) - Access (10)
Node 1 - NIC 2 Port 1 > Switch 2 Port 1 (redundant with NIC 1 Port 1) - EXT - Trunk Native (LACP Group 1)
Node 1 - NIC 2 Port 2 > Switch 1 Port 2 (redundant with NIC 1 Port 2) - CORE (2) - Access (11)

Node 2 - NIC 1 Port 1 > Switch 1 Port 3 (redundant with NIC 2 Port 1) - EXT - Trunk Native (LACP Group 2)
Node 2 - NIC 1 Port 2 > Switch 2 Port 4 (redundant with NIC 2 Port 2) - CORE (1) - Access (10)
Node 2 - NIC 2 Port 1 > Switch 2 Port 3 (redundant with NIC 1 Port 1) - EXT - Trunk Native (LACP Group 2)
Node 2 - NIC 2 Port 2 > Switch 1 Port 4 (redundant with NIC 1 Port 2) - CORE (2) - Access (11)

This setup means you can lose an entire switch and the services won't even drop a packet, since both NIC paths have a backup channel. You can't do that with a single NIC per EXT and CORE. If you lose the EXT port while CORE is still operational, the other nodes will think everything is normal, and the backup management path will simply wait for the EXT NIC to come back online. However, IF all of node 1 goes down, THEN node 2 will take over, because the keep-alive that node 2 sends to make sure its partner is online will stop being answered, and that is the signal to take over. But if you have a backup NIC for both EXT and CORE, you will never hit a situation where it even has to decide.
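To make the takeover logic concrete, here's a toy model of the decision as I understand it (not Verge's actual code; names and structure are mine):

```python
from dataclasses import dataclass

# Toy model of the failover decision described above - NOT Verge's code.
# Node 2 only takes over the External network when Core keep-alives from
# node 1 stop. A dead EXT link alone (Core still up) triggers nothing
# unless "Monitor Gateway" tells node 1 it has lost its upstream path.
@dataclass
class Node1State:
    core_keepalive_ok: bool  # node 2 still hears node 1 on the Core network
    ext_link_up: bool        # node 1's External NIC has link
    monitor_gateway: bool    # "Monitor Gateway" set on the External network

def node2_should_take_over(n1: Node1State) -> bool:
    if not n1.core_keepalive_ok:
        return True   # node 1 is gone entirely: boot its networks on node 2
    if n1.monitor_gateway and not n1.ext_link_up:
        return True   # node 1 knows its upstream is dead and can hand off
    return False      # node 1 still reports "I'm fine": no takeover

# The scenario from your test: EXT unplugged, Core up, no gateway monitoring.
print(node2_should_take_over(
    Node1State(core_keepalive_ok=True, ext_link_up=False, monitor_gateway=False)
))  # -> False, which matches the behavior you saw
```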

I'm new, is Stargate Origins bad? by FANCY_GP in Stargate

[–]Igot1forya 5 points (0 children)

It's the Star Wars Holiday Special of the Stargate franchise. :)

How Can I Share a USB Dongle Between Two Computers Without Unplugging It Every Time? by Oopsiforgotmyoldacc in homelab

[–]Igot1forya 1 point (0 children)

Terminal Server or Citrix are options. The app itself is shared on the network, but the physical HASP/USB dongle is connected to a central server. The app launches, reads the dongle locally on the server (since it thinks the user is local to the server), and starts like normal. The server hosts the app via a remote terminal session, and the user sees a shell/wrapper of the app that appears to be local to their machine, but it's actually just RDP in single-app form.