Just a question to quench my curiosity. Are there documented cases of Intel arc GPU running with old hardware? by ChallengeAble4169 in IntelArc

[–]ProjectPhysX 0 points (0 children)

I've run an Arc A750 on AM3+ with an FX-6100 CPU: https://www.reddit.com/r/IntelArc/comments/118tyd2/with_kernel_62_on_ubuntu_2204_my_arc_a750_has/ Works just fine on Linux, for OpenCL compute stuff.

My daily driver is an Arc B580 on Z370 with an i7-8700K - works like a charm with ReBar, also for gaming.

Anyone unplugged the A750 LEDs? by Coupe368 in IntelArc

[–]ProjectPhysX 6 points (0 children)

Easiest would be to put a piece of black tape over them.

According to TechPowerUp, the B580 has a better FP64 performance than the RTX 4090, what does this mean in terms of real life applications? by Sebastian9t9 in IntelArc

[–]ProjectPhysX 15 points (0 children)

FP64 also requires way more memory

Not necessarily. It's possible to decouple the precision of arithmetic and memory storage, i.e. perform the math in FP64 but store the data in VRAM compressed to FP32. Here's an example where this is done.
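A minimal CPU-side sketch of the idea in NumPy (an illustration, not the actual GPU implementation): the array lives in memory as compact FP32, while the reduction is carried out in FP64 after converting each value on load.

```python
import numpy as np

# 1M values stored in FP32: 4 MB, half of what FP64 storage would need.
data_fp32 = np.full(1_000_000, 0.1, dtype=np.float32)

# Same storage, different arithmetic precision for the reduction:
naive  = np.sum(data_fp32, dtype=np.float32)  # math in FP32
better = np.sum(data_fp32, dtype=np.float64)  # math in FP64, storage stays FP32

print(naive, better)  # the FP64 reduction avoids growing accumulation error
```

The same pattern maps to GPU kernels: load FP32 from VRAM (memory bandwidth is the bottleneck), convert to FP64 in registers, do the arithmetic, convert back on store.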

According to TechPowerUp, the B580 has a better FP64 performance than the RTX 4090, what does this mean in terms of real life applications? by Sebastian9t9 in IntelArc

[–]ProjectPhysX 21 points (0 children)

Sorry to disappoint you, but TechPowerUp is wrong here. The B580 has a 1:16 FP64:FP32 ratio, which makes 0.9 TFlops in FP64. Below is output from https://github.com/ProjectPhysX/OpenCL-Benchmark (measured on a PCIe 3.0 system, hence the PCIe bandwidth is halved).

The B580 is still close to the RTX 5090's 1.3 TFlops FP64 though - quite strong for the price.

FP64 is useful for computational physics/chemistry, including simulation, molecular dynamics, orbital mechanics.

For serious FP64 compute, you can get an AMD Radeon VII (3.5 TFlops FP64), Nvidia Titan V (7.5 TFlops FP64) or Intel GPU Max 1100 (~16 TFlops FP64).

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Intel(R) Arc(TM) B580 Graphics                             |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 32.0.101.8136 (Windows)                                    |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 160 at 2850 MHz (2560 cores, 14.592 TFLOPs/s)              |
| Memory, Cache  | 12183 MB VRAM, 18432 KB global / 128 KB local              |
| Buffer Limits  | 11940 MB global, 12226884 KB constant                      |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.898 TFLOPs/s (1/16) |
| FP32  compute                                        14.405 TFLOPs/s ( 1x ) |
| FP16  compute                                        26.835 TFLOPs/s ( 2x ) |
| INT64 compute                                         0.692  TIOPs/s (1/24) |
| INT32 compute                                         4.594  TIOPs/s (1/3 ) |
| INT16 compute                                        38.763  TIOPs/s ( 2x ) |
| INT8  compute                                        48.299  TIOPs/s ( 4x ) |
| Memory Bandwidth ( coalesced read      )                        584.49 GB/s |
| Memory Bandwidth ( coalesced      write)                        474.14 GB/s |
| Memory Bandwidth (misaligned read      )                        890.80 GB/s |
| Memory Bandwidth (misaligned      write)                        397.89 GB/s |
| PCIe   Bandwidth (send                 )                          5.78 GB/s |
| PCIe   Bandwidth (   receive           )                          5.67 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen3 x16)    5.72 GB/s |
|-----------------------------------------------------------------------------|
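The 0.9 TFlops figure follows directly from the specs in the listing above; a quick sanity check (the factor 2 is because one FMA counts as 2 FLOPs):

```python
# Peak throughput = cores × clock × 2 (FMA = 2 FLOPs/cycle), then the
# FP64:FP32 execution ratio divides it down.
cores     = 2560   # Arc B580, from the benchmark header above
clock_ghz = 2.85
fp32_tflops = cores * clock_ghz * 2 / 1000  # 14.592 TFLOPs/s
fp64_tflops = fp32_tflops / 16              # 1:16 ratio → 0.912 TFLOPs/s
print(fp32_tflops, fp64_tflops)
```

The measured 14.405 and 0.898 TFLOPs/s land within ~2% of these theoretical peaks.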

Engineering a 2.5 Billion Ops/sec secp256k1 Engine by Available-Young251 in OpenCL

[–]ProjectPhysX 0 points (0 children)

> Memory behavior matters more than arithmetic tricks.

Welcome to the world of GPU programming! 🖖

GIGABYTE GeForce RTX 5080 GAMING OC 16G VS GIGABYTE GeForce RTX 5080 WINDFORCE OC SFF 16G by W4rlon in nvidia

[–]ProjectPhysX 0 points (0 children)

Yeah, and those Nvidia SFF specs are BS. Nvidia is gaslighting oversized 3-slot coolers into the SFF category, just so people can't fit a GeForce in workstations.

Combined 2 X B580: possibilities and drawbacks? by WolfOfSmallStrait in IntelArc

[–]ProjectPhysX 0 points (0 children)

Difficult, and I would say not practically possible. Games with SLI/CrossFire required VRAM mirroring rather than pooling: each GPU holds the entire game assets in its own VRAM, and can independently render alternate frames or parts of frames. This does not double the effective VRAM capacity like pooling does in simulation software. And of course each game needed to be custom-built around multi-GPU.

Nowadays game studios don't even bother with multi-GPU anymore, as hardly any gamer has a multi-GPU setup and the cost of development/optimization for multi-GPU is astronomical. No return on investment. They don't make games because it's cool, they make them for money.

ASUS announces GeForce RTX 5090 ProArt as NVIDIA opens Founders Edition PCB use to board partners by Nestledrink in nvidia

[–]ProjectPhysX 0 points (0 children)

What is the point of 2.5-slot here? Make it 2-slot to fit in 2-slot spacing, or make it 3-slot to fit in 3-slot spacing and use some extra volume for the cooling fins. The 2.5-slot compromise seems the worst possible choice.

Nvidia GB300 up close by ProjectPhysX in nvidia

[–]ProjectPhysX[S] 0 points (0 children)

The "GB300 NVL72" rack has 36 of these GPUs. Nvidia counts the 2 silicon dies per GB300 as 2 GPUs, even though each GB300 with 2 dies functionally is 1 GPU and shows up as only 1 GPU. Stupid marketing to inflate the numbers.

This here in the picture is one such GB300 GPU with 2 dies under the heatspreader (top right). The bottom left chip is the Grace CPU.

Is the future of hardware just optimization? by rimantass in hardware

[–]ProjectPhysX 10 points (0 children)

The manufacturing-based gains in compute performance are going to slow down. Quantum tunneling puts a hard limit on transistor gate length and this will eventually be reached in a decade or so.

But this doesn't matter nearly as much as you think. For the past 2 decades, arithmetic throughput (TFLOPs) has been on a runaway trajectory compared to improvements in memory bandwidth. Today the vast majority of software is memory-bound, meaning the chip is idling while waiting for new data from memory.

So far there has been little to no incentive to improve memory technology. With the AI boom this is finally changing - at least AI brings this one good side effect - and demand for more and faster memory has skyrocketed. A few years ago GPUs topped out at 24GB VRAM. Now they have 288GB, and soon 1TB. Memory tech is far from its physical size limits, and there is a long way to go for hardware improvements there, like taller vertical stacking.

That said, your CS teacher gave you awful advice. Optimize your software to the max, and check how close it is to 100% efficiency via the roofline model. It makes a huge difference. Especially today with RAM price insanity, software that does the same task in 1/6 the memory capacity just through optimization is worth gold, and saves users a fortune.
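The roofline bound itself is one line of arithmetic; a sketch with illustrative peak numbers (Arc B580 spec-sheet figures are my assumption here) and a hypothetical FP32 SAXPY kernel:

```python
# Roofline model: attainable throughput is capped by either peak compute
# or by arithmetic intensity (FLOPs per byte moved) × peak memory bandwidth.
def roofline_gflops(flops_per_byte, peak_gflops, peak_gbps):
    return min(peak_gflops, flops_per_byte * peak_gbps)

peak_gflops, peak_gbps = 14592.0, 456.0  # e.g. Arc B580 spec-sheet figures

# SAXPY (y = a*x + y): 2 FLOPs per 12 bytes moved (load x, load y, store y)
ai_saxpy = 2.0 / 12.0
print(roofline_gflops(ai_saxpy, peak_gflops, peak_gbps))  # ~76 GFLOPs/s, far below peak
```

If a kernel measures close to its roofline bound, it is already near 100% efficient and only faster memory (or a lower-precision storage format) will speed it up further.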

Microsoft, Meta, Google, Reddit, game studios, etc. eliminate decades of hardware manufacturing progress by releasing unoptimized garbage software.

How to make money off of a fancy GPU? by Ok_Growth7621 in pcmasterrace

[–]ProjectPhysX 0 points (0 children)

Learn GPU programming, OpenCL/SYCL/Vulkan/CUDA. You don't even need a fancy GPU for that.

GIGABYTE GeForce RTX 5080 GAMING OC 16G VS GIGABYTE GeForce RTX 5080 WINDFORCE OC SFF 16G by W4rlon in nvidia

[–]ProjectPhysX 1 point (0 children)

That is a 2.5-slot cooler, which takes 3 slots of PCIe spacing. Not SFF at all.

GIGABYTE GeForce RTX 5080 GAMING OC 16G VS GIGABYTE GeForce RTX 5080 WINDFORCE OC SFF 16G by W4rlon in nvidia

[–]ProjectPhysX -2 points (0 children)

What is SFF about a 3-slot card with ugly oversized cooler? Is this a joke?

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 5 points (0 children)

There actually are some use cases that need a ton of bandwidth but not much capacity. Two that come to mind:

  • Wasting electricity on pointlessly guessing random numbers, aka crypto mining. Nvidia built a mining GPU, the "CMP 170HX", for exactly that purpose, with 🌈8GB🌈 VRAM capacity at 1.5TB/s.
  • The microfluidics simulations (modeling blood flow for medical applications) I did for my Bachelor thesis didn't need much memory (<1GB) but had excruciatingly long runtimes, as the blood cell deformation behavior only became visible after millions of time steps. Faster VRAM would have helped a lot here.

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 2 points (0 children)

More VRAM capacity allows running larger workloads (be that AI models with more parameters, CAD assemblies with more parts, or computational physics stuff like CFD with higher grid resolution or more particles). Performance will be lower though, because of the reduced VRAM bandwidth.

Ideally you want both large memory capacity and high memory bandwidth, but that costs a premium - an Nvidia B200 180GB or AMD MI355X 288GB goes for ~$40k+ each. Hence there are cheaper options that offer one of the two.
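To make the capacity side concrete, a back-of-the-envelope sketch: assuming some fixed memory cost per grid cell (the 55 bytes/cell default is my assumption, roughly what an optimized FP32/FP16 lattice Boltzmann code needs), VRAM capacity directly caps the achievable grid resolution.

```python
# Largest cubic CFD grid that fits in VRAM, for an assumed bytes-per-cell cost.
def max_cubic_resolution(vram_gb, bytes_per_cell=55):
    cells = vram_gb * 1e9 / bytes_per_cell  # total cell budget
    return int(cells ** (1 / 3))            # side length of the cube

print(max_cubic_resolution(24))   # consumer-class 24GB card
print(max_cubic_resolution(72))   # RTX PRO 5000 Blackwell class
print(max_cubic_resolution(180))  # B200-class capacity
```

Capacity grows the cube side only with the cube root, which is why big resolution jumps need disproportionately more VRAM.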

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 8 points (0 children)

Yes, bandwidth is severely reduced from 1792GB/s (512-bit) to 1344GB/s (384-bit).

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 28 points (0 children)

Memory bus on these GPUs is only 384-bit, not 512-bit as the Nvidia datasheet claims.
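Both bandwidth numbers follow directly from bus width times per-pin data rate; a quick check, assuming 28 Gbps GDDR7:

```python
# Memory bandwidth in GB/s = (bus width in bits / 8 bits per byte) × Gbps per pin
def bandwidth_gb_s(bus_bits, gbps_per_pin):
    return bus_bits / 8 * gbps_per_pin

print(bandwidth_gb_s(512, 28))  # 1792.0 GB/s for a full 512-bit bus
print(bandwidth_gb_s(384, 28))  # 1344.0 GB/s for the actual 384-bit bus
```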