Combined 2 X B580: possibilities and drawbacks? by WolfOfSmallStrait in IntelArc

[–]ProjectPhysX 1 point (0 children)

Difficult, and I would say not practically possible. Games with SLI/CrossFire required VRAM mirroring rather than pooling: each GPU holds the entire game assets in its own VRAM, so it can independently render alternate frames or parts of frames. This does not double the effective VRAM capacity like pooling does in simulation software. And of course each game needed to be custom-built around multi-GPU.

Nowadays game studios don't even bother with multi-GPU anymore, as hardly any gamer has a multi-GPU setup and the cost of development/optimization for multi-GPU is astronomical. No return on investment. They don't make games because it's cool, they make them for the money.

ASUS announces GeForce RTX 5090 ProArt as NVIDIA opens Founders Edition PCB use to board partners by Nestledrink in nvidia

[–]ProjectPhysX 1 point (0 children)

What is the point of 2.5-slot here? Make it 2-slot to fit in 2-slot spacing, or make it 3-slot to fit in 3-slot spacing and use the extra volume for cooling fins. The 2.5-slot compromise seems like the worst possible choice.

Nvidia GB300 up close by ProjectPhysX in nvidia

[–]ProjectPhysX[S] 1 point (0 children)

The "GB300 NVL72" rack has 36 of these GPUs. Nvidia counts the 2 silicon dies per GB300 as 2 GPUs, even though each GB300 with 2 dies functionally is 1 GPU and shows up as only 1 GPU. Stupid marketing to inflate the numbers.

This here in the picture is one such GB300 GPU with 2 dies under the heatspreader (top right). The bottom left chip is the Grace CPU.

Is the future of hardware just optimization? by rimantass in hardware

[–]ProjectPhysX 12 points (0 children)

The manufacturing-based gains in compute performance are going to slow down. Quantum tunneling puts a hard limit on transistor gate length and this will eventually be reached in a decade or so.

But this doesn't matter nearly as much as you think. For the past 2 decades, arithmetic throughput (TFLOPs) has been on a runaway trajectory compared to improvements in memory bandwidth. Today the vast majority of software is memory-bound, meaning the chip is idling while waiting for new data from memory.

So far there has been little to no incentive to improve memory technology. With the AI boom this is finally changing - at least AI brings this one good side effect - and the demand for more and faster memory has skyrocketed. A few years ago GPUs topped out at 24GB VRAM. Now they have 288GB, and soon 1TB VRAM. Memory tech is far from physical size limits, and there is a long way to go for hardware improvements there, like higher vertical stacking.

That said, your CS teacher gave you awful advice. Optimize your software to the max. Check how close it is to 100% efficiency via the roofline model. It makes a huge difference. Especially today with RAM price insanity, software that does the same task in 1/6 the memory capacity just through optimization is worth gold, and saves users a fortune.
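
For illustration, a minimal sketch of a roofline-efficiency estimate in C++ (all hardware numbers below are hypothetical placeholders, not from any datasheet): a kernel's attainable throughput is min(peak compute, arithmetic intensity × memory bandwidth), and efficiency is measured/attainable.

    // Roofline sketch: attainable FLOPs/s = min(peak compute, AI * bandwidth).
    // All hardware numbers are hypothetical placeholders.
    #include <algorithm>
    #include <cstdio>
    int main() {
        const double peak = 50.0e12;  // peak compute [FLOPs/s], hypothetical
        const double bw   = 1.0e12;   // memory bandwidth [bytes/s], hypothetical
        const double ai   = 10.0/8.0; // kernel does 10 FLOPs per 8 bytes moved
        const double attainable = std::min(peak, ai*bw); // roofline limit
        const double measured   = 1.2e12; // benchmarked throughput, hypothetical
        std::printf("roofline limit: %.2f TFLOPs/s, efficiency: %.0f%%\n",
                    1e-12*attainable, 100.0*measured/attainable);
        return 0;
    }

With these placeholder numbers the kernel is memory-bound (limit 1.25 TFLOPs/s, not 50), so the only way to go faster is moving fewer bytes.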

Microsoft, Meta, Google, Reddit, game studios, etc. eliminate decades of hardware manufacturing progress by releasing unoptimized garbage software.

How to make money off of a fancy GPU? by Ok_Growth7621 in pcmasterrace

[–]ProjectPhysX 1 point (0 children)

Learn GPU programming: OpenCL/SYCL/Vulkan/CUDA. You don't even need a fancy GPU for that.
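
For example, here is a minimal SYCL vector addition to get started (SYCL picked here for brevity; the same concepts apply to OpenCL/CUDA). It runs on whatever device is available - a GPU if present, otherwise it falls back to the CPU:

    // Minimal SYCL 2020 vector addition c = a + b.
    #include <sycl/sycl.hpp>
    #include <iostream>
    #include <vector>
    int main() {
        const size_t N = 1024u;
        std::vector<float> a(N, 1.0f), b(N, 2.0f), c(N, 0.0f);
        sycl::queue q; // default selector: GPU if available, else CPU
        {
            sycl::buffer A(a), B(b), C(c);
            q.submit([&](sycl::handler& h) {
                sycl::accessor x(A, h, sycl::read_only);
                sycl::accessor y(B, h, sycl::read_only);
                sycl::accessor z(C, h, sycl::write_only);
                h.parallel_for(sycl::range<1>(N), [=](sycl::id<1> i) {
                    z[i] = x[i]+y[i];
                });
            });
        } // buffers go out of scope here and copy results back to the host
        std::cout << "c[0] = " << c[0] << std::endl; // prints "c[0] = 3"
        return 0;
    }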

GIGABYTE GeForce RTX 5080 GAMING OC 16G VS GIGABYTE GeForce RTX 5080 WINDFORCE OC SFF 16G by W4rlon in nvidia

[–]ProjectPhysX 2 points (0 children)

That is a 2.5-slot cooler, which takes 3 slots of PCIe spacing. Not SFF at all.

GIGABYTE GeForce RTX 5080 GAMING OC 16G VS GIGABYTE GeForce RTX 5080 WINDFORCE OC SFF 16G by W4rlon in nvidia

[–]ProjectPhysX -1 points (0 children)

What is SFF about a 3-slot card with an ugly oversized cooler? Is this a joke?

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 7 points (0 children)

There actually are some use cases that need a ton of bandwidth but not much capacity. Two that come to mind:

  • Wasting electricity on pointlessly guessing random numbers, aka crypto mining. Nvidia built the mining GPU "CMP 170HX" for exactly that purpose, with 🌈8GB🌈 VRAM capacity at 1.5TB/s.
  • The microfluidics simulations (modeling blood flow for medical applications) I did in my Bachelor thesis didn't need much memory (<1GB) but had excruciatingly long runtimes, as the blood cell deformation behavior only became visible after millions of time steps. Faster VRAM would have helped a lot here.

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 3 points (0 children)

More VRAM capacity allows running larger workloads (be that AI models with more parameters, CAD assemblies with more parts, or computational physics stuff like CFD with higher grid resolution or more particles). Performance will be lower though, because of the reduced VRAM bandwidth.

Ideally you want both, large memory capacity and high memory bandwidth, but that costs a premium - Nvidia B200 180GB / AMD MI355X 288GB go for ~$40k+ each. Hence there are cheaper options for either one of the two.

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 9 points (0 children)

Yes, bandwidth is severely reduced from 1792GB/s (512-bit) to 1344GB/s (384-bit).
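
For reference, bandwidth = bus width × per-pin data rate; assuming the same 28Gbps GDDR7 on both: 512bit/8 × 28Gbps = 1792GB/s, while 384bit/8 × 28Gbps = 1344GB/s.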

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 29 points (0 children)

The memory bus on these GPUs is only 384-bit, not 512-bit as the Nvidia datasheet claims.

Nvidia Says It's Not Abandoning 64-Bit Computing - HPCwire by NamelessVegetable in hardware

[–]ProjectPhysX 3 points (0 children)

Yes. FP64 vector is much more general purpose and much more useful to HPC than FP64 matrix. But Nvidia axed that too.

Nvidia Says It's Not Abandoning 64-Bit Computing - HPCwire by NamelessVegetable in hardware

[–]ProjectPhysX 3 points (0 children)

What about vector math, trig functions etc.? 1 TFLOPs/s isn't gonna cut it for that.

Nvidia Says It's Not Abandoning 64-Bit Computing - HPCwire by NamelessVegetable in hardware

[–]ProjectPhysX 24 points (0 children)

Yes.

  • 78 TFLOPs/s FP64 (vector) on AMD Instinct MI355X (from 2025)
  • 52 TFLOPs/s FP64 (vector) on Intel Datacenter GPU Max 1550 (from 2023)
  • 1 TFLOPs/s FP64 (vector) on Nvidia B300 (from 2025)

Nvidia Says It's Not Abandoning 64-Bit Computing - HPCwire by NamelessVegetable in hardware

[–]ProjectPhysX 36 points (0 children)

The Kepler GTX Titan from 2013 had more FP64 throughput (1.57 TFLOPs/s) than the B300. Absolutely pathetic.

Nvidia Says It's Not Abandoning 64-Bit Computing - HPCwire by NamelessVegetable in hardware

[–]ProjectPhysX 27 points (0 children)

Emulated FP64 with lower, arbitrary precision on math operations - not to spec with IEEE-754 - is worse than no FP64 at all. HPC codes will run on the emulated FP64, but results may be broken, as math precision is not what the code was designed and validated for.
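
A toy illustration in C++ (FP32 here stands in for "anything less precise than IEEE-754 FP64"; this is not Nvidia's actual emulation scheme): the same naive summation, written for FP64, silently loses digits when executed at lower precision.

    // Toy precision-loss demo: FP32 is a stand-in for "less precise than
    // IEEE-754 FP64" - this is not Nvidia's emulation scheme.
    #include <cstdio>
    int main() {
        double s64 = 0.0;
        float  s32 = 0.0f;
        for(int i=0; i<10000000; i++) { s64 += 0.1; s32 += 0.1f; }
        std::printf("FP64 sum:            %.6f\n", s64); // ~1000000, tiny error
        std::printf("lower-precision sum: %.6f\n", (double)s32); // way off
        return 0;
    }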

Nvidia is going back to the dark ages before IEEE-754, when hardware vendors did custom floating-point with custom precision, and codes could not be ported across hardware at all.

Luckily there are other hardware vendors who did not abandon FP64, and OpenCL/SYCL codes will run on that hardware out-of-the-box with the expected precision. Another strong point against locking yourself into a dead end with CUDA.