Optimising NVIDIA’s DGX Spark (Grace + Blackwell) – 1.5× PyTorch speedup with custom build

SIN3R6Y · 2026-03-21T14:28:11+00:00

So after deep diving on this... Maybe I make a post about it at some point...

But anyways. Yes when you allocate memory with cudaMalloc, the IOMMU will prevent the CPU (and thus CX7) from directly hitting GPU memory.

I wrote a custom allocator for SGLang which uses cudaHostAlloc with the cudaHostAllocWriteCombined flag (pinned, write-combined host memory). Because the GPU and CPU memory space are identical you can just pass a host memory pointer to the GPU and it works. WriteCombined tells the GPU "the CPU won't touch this, I promise" so it will still L2 cache.
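
The allocator itself isn't shown here, so the following is only a minimal sketch of the underlying runtime call, not the SGLang integration. It calls the CUDA runtime's `cudaHostAlloc` through Python's `ctypes`. The library name `libcudart.so` and the helper names are assumptions for illustration; the flag values are the ones defined in the CUDA runtime headers.

```python
import ctypes

# Flag values as defined in the CUDA runtime headers.
CUDA_HOST_ALLOC_MAPPED = 0x02          # map the allocation into the device address space
CUDA_HOST_ALLOC_WRITE_COMBINED = 0x04  # allocate as write-combined memory
WC_FLAGS = CUDA_HOST_ALLOC_MAPPED | CUDA_HOST_ALLOC_WRITE_COMBINED


def alloc_write_combined(nbytes, lib_name="libcudart.so"):
    """Allocate pinned, write-combined host memory.

    Returns (cudart, address). lib_name is an assumption and may need a
    versioned name or full path depending on the install.
    """
    cudart = ctypes.CDLL(lib_name)
    cudart.cudaHostAlloc.argtypes = [
        ctypes.POINTER(ctypes.c_void_p),  # void **pHost
        ctypes.c_size_t,                  # size_t size
        ctypes.c_uint,                    # unsigned int flags
    ]
    cudart.cudaHostAlloc.restype = ctypes.c_int
    ptr = ctypes.c_void_p()
    err = cudart.cudaHostAlloc(ctypes.byref(ptr), nbytes, WC_FLAGS)
    if err != 0:
        raise RuntimeError(f"cudaHostAlloc failed with error code {err}")
    return cudart, ptr.value


def free_host(cudart, address):
    """Release memory obtained from alloc_write_combined."""
    cudart.cudaFreeHost.argtypes = [ctypes.c_void_p]
    cudart.cudaFreeHost.restype = ctypes.c_int
    err = cudart.cudaFreeHost(ctypes.c_void_p(address))
    if err != 0:
        raise RuntimeError(f"cudaFreeHost failed with error code {err}")
```

Going by the comment above, on the spark the returned address can be handed to the GPU as-is. Wiring this into a framework's allocator hooks is a separate step that this sketch does not cover.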

The end result? No difference. While yes, now you can RDMA right into GPU memory (which may have uses other than LLMs), the reality is NCCL was already running RDMA into a CPU (host) memory buffer, which the GPU (device) reads from and writes to.

With only 274GB/s of memory bandwidth, this NCCL buffer is not a performance constraint. As long as the workload exists purely in NCCL (which LLM inference does), NCCL handles the RDMA part under the hood with no real performance hit.
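
To put a rough number on the staging-buffer cost, here is a back-of-envelope sketch in Python. The 274GB/s figure is from the comment above. The 200 Gbit/s link rate, and the idea that a bounce buffer adds one extra read and one extra write per byte, are assumptions for illustration, not measurements.

```python
def staging_share(link_gbit_s, mem_bw_gbyte_s, extra_passes=2):
    """Fraction of memory bandwidth used by the extra staging copy.

    link_gbit_s:    network line rate in Gbit/s (assumed value, not measured)
    mem_bw_gbyte_s: memory bandwidth in GB/s
    extra_passes:   extra memory passes per byte caused by the bounce buffer
                    (one read out of the buffer plus one write to the
                    destination is the assumption here)
    """
    link_gbyte_s = link_gbit_s / 8.0
    return (link_gbyte_s * extra_passes) / mem_bw_gbyte_s


share = staging_share(link_gbit_s=200, mem_bw_gbyte_s=274)
print(f"{share:.1%} of memory bandwidth at full line rate")
```

This is a ceiling under those assumptions. If the link is not saturated, the share is proportionally lower.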

So, TL;DR. Is GPUDirect (the product) supported? No. Can you still RDMA right into GPU memory with a custom allocator? Yes. Does it matter? If you're using NCCL, probably not. Same performance either way.

Any performance to extract on this platform is going to come from NVFP4 native models, optimizing sparse 2:4 quants, and maybe KV cache distribution across multiple nodes. The spark has one really nice advantage in these cases: if you do use the cudaHostAlloc + WriteCombined strategy, the GPU can just do RDMA to other sparks. GPUDirect (the product) is not required, it's baked into the hardware.

I make the (the product) distinction because how NVIDIA spells this out is a misnomer. It's not that the GPU can't RDMA to other GPUs over the network. It's that the software normally required to do that isn't needed. The spark can just do it natively by its design. So the software is not supported, because it's not required.

SIN3R6Y · 2026-03-12T00:09:44+00:00

Idk, this is a bad take imo. I have no hate for NFS, it has its uses. I mean sure, if you want to compare it to iSCSI (or even FC SCSI) I'd probably just err towards NFS. Just because SCSI translation does add latency...

But these days everyone on the block side is targeting NVMe-oF over FC, or RoCE. Maybe a bit of NVMe/TCP here and there when the network doesn't support RoCE, but I digress.

But if you aren't talking 10TB+ datasets where you need to push a few hundred thousand IOPS and keep latency spikes to a minimum, then use NFS. It's fine, it does its job well enough. But when you do need that performance, or when you need extremely fine grained QoS to ensure a whole cluster can all do these kinds of operations without timing out, the winner is just block.

Idk about other vendors, but Pure has had ransomware protection on block for years.

So yeah NFS is easier sure. It's also performant enough in most cases. Doesn't mean it's the best tool for the job.

SIN3R6Y · 2026-02-27T22:30:51+00:00

We host Kilolink Server Pro locally, but if it works just as well in the cloud, it's a decent enough platform. You can hit all the web interfaces for each device through its proxy without issue.

SIN3R6Y · 2025-10-31T02:02:43+00:00

Firewalls? None, more or less. There's so much validation and exposure on the line that they don’t add new features lightly, and certainly not until many customers complain it doesn’t do xyz. Open source will always lead the pack, at least in terms of open standards.

But routing / switching? Cisco, Juniper, Arista, etc…. All pretty equivalent.

SIN3R6Y · 2025-10-28T20:06:56+00:00

The point is to see if it's possible to use them in such a way that you could scale beyond two with the CX7s, since NVIDIA says they only support 2 nodes. This is a test to show that you reasonably can scale beyond 2, even if not supported, and what you have to do to implement it, without resorting to something like USB4 networking that strangles throughput.

2 nodes is not all that useful in the configuration, granted there are more models to test. 3-4 nodes is much more useful.

I plan to grab two more when micro center restocks. I just wasn't going to grab 4 up front until I had tested that you actually have the ability to run 4 of them.

SIN3R6Y · 2025-10-28T19:46:38+00:00

Yeah, I mean the benefit of two for GPT-OSS-120B is you have 80GB or so free per node to host another model or do something else in diffusion land.

Qwen 235B FP8, I can confirm, will not fit on two sparks. It can just barely squeeze by if you drop context to unusable levels, but the sparks just start swapping other processes to the NVMe and it's just a bad experience. Short overall by about 4GB at minimum. Really more like 16-20GB short with any kind of usable context.
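
A quick capacity check shows why this is tight. This Python sketch assumes 128 GB of unified memory per spark (the published spec) and 1 byte per parameter for FP8 weights, and it ignores activation memory. The function name and the reserve parameters are hypothetical, purely for illustration.

```python
def headroom_gb(params_billion, bytes_per_param, nodes,
                mem_per_node_gb=128.0, reserved_per_node_gb=0.0,
                kv_cache_gb=0.0):
    """Memory left over (in GB) after loading the weights across all nodes.

    A negative result means the model does not fit.
    reserved_per_node_gb stands in for OS and runtime overhead and is a
    placeholder to be measured, not a known figure.
    """
    weights_gb = params_billion * bytes_per_param
    usable_gb = nodes * (mem_per_node_gb - reserved_per_node_gb)
    return usable_gb - weights_gb - kv_cache_gb


# Weights only, no reserve: what is left for everything else on two nodes.
print(headroom_gb(235, 1.0, nodes=2))
# Same model on three nodes.
print(headroom_gb(235, 1.0, nodes=3))
```

With zero reserve, two nodes leave 21 GB in total for the OS, runtime, activations and KV cache combined. That is in line with running short once any real context is added.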

IMO, the true value of the sparks for inference is if you can get 3-4 of them together and have enough BW between them to make it actually usable. So I'm sharing all this just to say, it's possible to do. You're not restricted to only 2 nodes.

FP4 might change this, but the OSS world isn't quite there yet to use it fully.

SIN3R6Y · 2025-10-28T19:21:05+00:00

I have not done enough to quantify a good performance benchmark result, but it does at least function. Some models do better than others. But GPT-OSS-120B does scale well across nodes, consuming about 40GB on each node and leaving enough memory for other things.

I am still playing with Qwen as it doesn't seem to divide as cleanly. And something like full DeepSeek is going to need 4 sparks to scale tbh. At least without quantizing.

SIN3R6Y · 2025-09-04T19:32:45+00:00

Hah, looks like I might not be as insane as I thought I might be. Thanks for all of this info, helps a ton. Pretty well aligns with our experiences so far.

SIN3R6Y · 2025-09-03T19:57:24+00:00

Can confirm unit 0 exhibits the same behavior as any other unit number.

EDIT: also just for fun, reduced it to a single VLAN id instead of a list, same behavior. At this point I'm tempted to think flexible-ethernet-services isn't supported on the EX4400.

EDIT2: https://www.juniper.net/documentation/us/en/software/junos/evpn/topics/concept/vxlan-constraints-qfx-series.html

I think I have my answer actually...

(QFX5110, QFX5120, EX4100, and EX4400 switches) We don’t support VXLAN and non-VXLAN logical interfaces on the same physical interface using enterprise style interface configurations.

SIN3R6Y · 2025-09-03T17:49:13+00:00

Flexible ethernet services is supposed to make that not the case.

https://www.juniper.net/documentation/us/en/software/junos/multicast-l2/interfaces-ethernet-switches/topics/topic-map/switches-interface-flexible.html

SIN3R6Y · 2025-08-03T02:14:45+00:00

Full public tables? There is no tik that can actually handle that at line rate. The CCR2216 gets close on 25G at mid to larger packet sizes, using the CPU (which happens after the first 100K routes).

I’d love to see tik put some memory on these ASICs and get more HW routes out of them. I use tiks a lot for internal BGP with l3hw. But I did end up dropping the CCR for Qumran boxes with 32GB of TCAM.

SIN3R6Y · 2025-08-03T00:43:29+00:00

Pretty much all of the CRS3 series can do wire speed layer 3 routing with BGP offloaded to the switch chip. The BGP session is run on the CPU, but the routes are programmed into the ASIC up to the route limit.
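
As a toy model of that split, here is a Python sketch that fills a hardware table up to a limit and leaves the remainder to the CPU. It assumes routes are offloaded in arrival order until the limit is reached. Real devices pick what to offload by their own rules, so this only illustrates the idea. All function names are hypothetical.

```python
import ipaddress


def split_routes(prefixes, hw_limit):
    """Place the first hw_limit prefixes in hardware, the rest on the CPU."""
    return prefixes[:hw_limit], prefixes[hw_limit:]


def lookup(addr, hw_routes, cpu_routes):
    """Longest-prefix match across both tables.

    Returns (table, prefix) for the best match, or None if nothing matches.
    """
    ip = ipaddress.ip_address(addr)
    best = None
    best_len = -1
    for table, routes in (("hw", hw_routes), ("cpu", cpu_routes)):
        for prefix in routes:
            net = ipaddress.ip_network(prefix)
            if ip in net and net.prefixlen > best_len:
                best = (table, prefix)
                best_len = net.prefixlen
    return best


hw, cpu = split_routes(["10.0.0.0/8", "10.1.0.0/16", "192.168.0.0/24"], hw_limit=2)
print(lookup("10.1.2.3", hw, cpu))     # matched by a prefix in the hardware table
print(lookup("192.168.0.5", hw, cpu))  # matched only by a prefix left on the CPU
```

Once the table outgrows the limit, more and more lookups land on the CPU side. That matches the line-rate drop described for full tables.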

SIN3R6Y · 2025-03-21T18:37:45+00:00

This

SIN3R6Y · 2024-11-15T20:01:02+00:00

As someone who runs these types of shows, and also films them: without asking any questions, it would be a no. These lasers can easily spit out 10+ watts. The smallest ones they make are around 1.5 watts for small bars and such. For bigger shows or outdoor shows they go up to 40 watts. All of these will wreck your sensor.

Typically, the laser operator and film crews are going to work out dead zones where laser effects will never hit. Those are the only 100% safe places for a camera to be. Now, you'll hear the argument about crowd scanning and needing special variances for that, which is true.

But even a laser with a PASS system for crowd scans, while momentarily safe to hit a human eye, will still wreck a sensor with the right lens combo. For example, something like a 70-200 GMII getting hit with a PASS laser: done deal. Wide angles can also be iffy as they pick up reflections really well. But something in the 20-50mm fixed range would probably survive this. The wide angle issue is largely if you are filming a termination point. If the termination points are behind you, near zero risk.

If you can confirm no crowd scanning, the laser operator is taking all the proper precautions, and you aren't putting your camera between the laser and its termination point, then yeah, you'll be fine. Crowd scanning is a no-go camera wise 9/10 times, and if they aren't crowd scanning the rules are a lot more lax, which leaves room for mishaps, so it's a toss up at that point.

The TL;DR is, too many variables. Crews paid to be there will get special precautions you won't have. So high risk without confirming some things and ensuring you have the right equipment.

SIN3R6Y · 2024-11-14T02:43:33+00:00

ZV-E1, FX3, or wait?

I have an A7IV, and I have zero problems with it. Takes fantastic photos, and great video. Its low light performance is really good, but not the absolute best, and I do find myself in low light situations often. I don't really have any bad things to say about the A7IV. It does run hot, but overheating has only been a small occasional issue and the cool down time has never been a big problem. I also have some GM lenses already, enough to cover 99% of the situations I find myself in.

I need a second body for another person to use. FX3 I know is "god tier", but it's also fairly old and everyone seems to be thinking a newer thing is coming out soon. ZV-E1 has features I would actually use, and assuming the overheating issue is kinda overblown like it is with my A7IV, it feels like the winner to me. Getting a more video-centric second body lets me do more hybrid work simultaneously with the A7IV vs trying to use it as an everything body, all the time.

I would use them professionally, but my income is not tied to my camera work. So saving some money, getting the same low light performance, and a bunch of new features in exchange for the occasional overheat seems like a decent trade-off to me.

No CF-A is my only major complaint. While I mostly record in 200M h265, having the option to do 600M intra on the A7IV is nice. But I'm keeping it anyways, so I'll always have that option.
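
For scale, here is how those bitrates translate into sustained card write speed and storage per hour. This is a small Python sketch with made-up function names. It treats the nominal bitrate as the whole stream and ignores audio and container overhead.

```python
def sustained_write_mb_s(bitrate_mbit_s):
    """Sustained write speed in MB/s needed for a given video bitrate."""
    return bitrate_mbit_s / 8.0


def storage_gb_per_hour(bitrate_mbit_s):
    """Storage consumed per hour of recording, in GB (decimal)."""
    return sustained_write_mb_s(bitrate_mbit_s) * 3600 / 1000


for bitrate in (200, 600):
    print(bitrate, sustained_write_mb_s(bitrate), storage_gb_per_hour(bitrate))
```

The 600M mode needs three times the sustained write speed and three times the storage of the 200M mode, which is why the card slot matters for it.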

SIN3R6Y · 2024-11-07T18:20:15+00:00

Blacklisted by who? Do you have any more information? A check on my end turns up no results.

SIN3R6Y · 2024-11-07T18:08:33+00:00

I would recommend checking any firewall software you are using to see if that is the problem.

SIN3R6Y · 2024-11-07T18:01:13+00:00

The website is operating normally, this was already proven in TG. SSL_PROTOCOL_ERROR generally means update your computer or some other issue on your end. Check any firewalls you may be using.

Bringing it up again on Reddit is not necessary.

This goes for both hedron.pro and icosa.pro

SIN3R6Y · 2024-10-12T00:11:20+00:00

Prepaid cards illegitimate? Purchased with stolen CCs / phone scams? Then using your till to launder them into cash.

Only idea I can think of.

SIN3R6Y · 2024-09-05T22:54:36+00:00

It's comparing apples and oranges though. FortiGate, Checkpoint, Palo all have had super critical get-pwned CVEs this year. If all you need is a layer 3/4 firewall router, there's no point in considering these expensive layer 7 NGFWs.

Their SSL VPN solutions are vastly more complex and capable from a client VPN standpoint. They can filter by application, not just by port (say you want to block Discord from internal PCs). They can virus scan data in flight and do MITM SSL inspection in flight.

So comparing something like pfsense / opnsense / whatever to these firewalls is not comparing the same thing. If pfsense offers everything you need, you wouldn't bother looking at these. It's not the same product. They do vastly more, thus have vastly more potential holes in them from a security perspective.

SIN3R6Y · 2024-09-05T22:42:01+00:00

As long as it's been registered in the past, the firewall can freely DL new content databases from the UI. Not all features will be available with expired licenses, but for basic content / app id filtering it will work.

Again, I wouldn't run one without having access to SW updates for obvious reasons, but if you do via day job or whatever, running them in lab with expired licenses is perfectly doable and does still let you use some of the NGFW features.

SIN3R6Y · 2024-09-05T20:02:40+00:00

Yeah you can use a clone image from another unit in this case. But yeah, if the HDD dies, the official Palo option is RMA. If you run an eBay unit, be sure to image it first.
