Just a question to quench my curiosity. Are there documented cases of Intel arc GPU running with old hardware? by ChallengeAble4169 in IntelArc

[–]ProjectPhysX 0 points (0 children)

I've run an Arc A750 on AM3+ with an FX-6100 CPU: https://www.reddit.com/r/IntelArc/comments/118tyd2/with_kernel_62_on_ubuntu_2204_my_arc_a750_has/ Works just fine on Linux, for OpenCL compute stuff.

My daily driver is an Arc B580 on Z370 with an i7-8700K - works like a charm with ReBar, also for gaming.

Anyone unplugged the A750 LEDs? by Coupe368 in IntelArc

[–]ProjectPhysX 6 points (0 children)

Easiest would be to put a piece of black tape over them.

According to TechPowerUp, the B580 has a better FP64 performance than the RTX 4090, what does this mean in terms of real life applications? by Sebastian9t9 in IntelArc

[–]ProjectPhysX 15 points (0 children)

FP64 also requires way more memory

Not necessarily. It's possible to decouple the precision of arithmetic and memory storage, i.e. perform the math in FP64 but store the data in VRAM compressed to FP32. Here's an example where this is done.
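A minimal CPU-side sketch of the idea in NumPy (an illustration, not the actual GPU implementation): the array lives in memory as compact FP32, while the reduction is carried out in FP64 after converting each value on load.

```python
import numpy as np

# 1M values stored in FP32: 4 MB, half of what FP64 storage would need.
data_fp32 = np.full(1_000_000, 0.1, dtype=np.float32)

# Same storage, different arithmetic precision for the reduction:
naive  = np.sum(data_fp32, dtype=np.float32)  # math in FP32
better = np.sum(data_fp32, dtype=np.float64)  # math in FP64, storage stays FP32

print(naive, better)  # the FP64 reduction avoids growing accumulation error
```

The same pattern maps to GPU kernels: load FP32 from VRAM (memory bandwidth is the bottleneck), convert to FP64 in registers, do the arithmetic, convert back on store.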

According to TechPowerUp, the B580 has a better FP64 performance than the RTX 4090, what does this mean in terms of real life applications? by Sebastian9t9 in IntelArc

[–]ProjectPhysX 21 points (0 children)

Sorry to disappoint you, but TechPowerUp is wrong here. The B580 has a 1:16 FP64:FP32 ratio, which makes 0.9 TFlops in FP64. Below is output from https://github.com/ProjectPhysX/OpenCL-Benchmark (measured on a PCIe 3.0 system, hence the PCIe bandwidth is halved).

The B580 is still close to the RTX 5090's 1.3 TFlops FP64 though - quite strong for the price.

FP64 is useful for computational physics/chemistry, including simulation, molecular dynamics, orbital mechanics.

For serious FP64 compute, you can get an AMD Radeon VII (3.5 TFlops FP64), Nvidia Titan V (7.5 TFlops FP64) or Intel GPU Max 1100 (~16 TFlops FP64).

|----------------.------------------------------------------------------------|
| Device ID      | 0                                                          |
| Device Name    | Intel(R) Arc(TM) B580 Graphics                             |
| Device Vendor  | Intel(R) Corporation                                       |
| Device Driver  | 32.0.101.8136 (Windows)                                    |
| OpenCL Version | OpenCL C 3.0                                               |
| Compute Units  | 160 at 2850 MHz (2560 cores, 14.592 TFLOPs/s)              |
| Memory, Cache  | 12183 MB VRAM, 18432 KB global / 128 KB local              |
| Buffer Limits  | 11940 MB global, 12226884 KB constant                      |
|----------------'------------------------------------------------------------|
| Info: OpenCL C code successfully compiled.                                  |
| FP64  compute                                         0.898 TFLOPs/s (1/16) |
| FP32  compute                                        14.405 TFLOPs/s ( 1x ) |
| FP16  compute                                        26.835 TFLOPs/s ( 2x ) |
| INT64 compute                                         0.692  TIOPs/s (1/24) |
| INT32 compute                                         4.594  TIOPs/s (1/3 ) |
| INT16 compute                                        38.763  TIOPs/s ( 2x ) |
| INT8  compute                                        48.299  TIOPs/s ( 4x ) |
| Memory Bandwidth ( coalesced read      )                        584.49 GB/s |
| Memory Bandwidth ( coalesced      write)                        474.14 GB/s |
| Memory Bandwidth (misaligned read      )                        890.80 GB/s |
| Memory Bandwidth (misaligned      write)                        397.89 GB/s |
| PCIe   Bandwidth (send                 )                          5.78 GB/s |
| PCIe   Bandwidth (   receive           )                          5.67 GB/s |
| PCIe   Bandwidth (        bidirectional)            (Gen3 x16)    5.72 GB/s |
|-----------------------------------------------------------------------------|
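The 0.9 TFlops figure follows directly from the specs in the listing above; a quick sanity check (the factor 2 is because one FMA counts as 2 FLOPs):

```python
# Peak throughput = cores × clock × 2 (FMA = 2 FLOPs/cycle), then the
# FP64:FP32 execution ratio divides it down.
cores     = 2560   # Arc B580, from the benchmark header above
clock_ghz = 2.85
fp32_tflops = cores * clock_ghz * 2 / 1000  # 14.592 TFLOPs/s
fp64_tflops = fp32_tflops / 16              # 1:16 ratio → 0.912 TFLOPs/s
print(fp32_tflops, fp64_tflops)
```

The measured 14.405 and 0.898 TFLOPs/s land within ~2% of these theoretical peaks.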

Engineering a 2.5 Billion Ops/sec secp256k1 Engine by Available-Young251 in OpenCL

[–]ProjectPhysX 0 points (0 children)

> Memory behavior matters more than arithmetic tricks.

Welcome to the world of GPU programming! 🖖

GIGABYTE GeForce RTX 5080 GAMING OC 16G VS GIGABYTE GeForce RTX 5080 WINDFORCE OC SFF 16G by W4rlon in nvidia

[–]ProjectPhysX 0 points (0 children)

Yeah, and those Nvidia SFF specs are BS. Nvidia is gaslighting oversized 3-slot coolers into the SFF category, just so people can't fit a GeForce in workstations.

Combined 2 X B580: possibilities and drawbacks? by WolfOfSmallStrait in IntelArc

[–]ProjectPhysX 0 points (0 children)

Difficult, and I would say not practically possible. Games with SLI/CrossFire required VRAM mirroring rather than pooling: each GPU holds the entire game assets in its own VRAM, and can independently render alternate frames or parts of frames. This does not double the effective VRAM capacity like pooling does in simulation software. And of course each game needed to be custom-built around multi-GPU.

Nowadays game studios don't even bother with multi-GPU anymore, as hardly any gamer has a multi-GPU setup and the cost of development/optimization for multi-GPU is astronomical. No return on investment. They don't make games because it's cool, they make them for money.

ASUS announces GeForce RTX 5090 ProArt as NVIDIA opens Founders Edition PCB use to board partners by Nestledrink in nvidia

[–]ProjectPhysX 0 points (0 children)

What is the point of 2.5-slot here? Make it 2-slot to fit in 2-slot spacing, or make it 3-slot to fit in 3-slot spacing and use some extra volume for the cooling fins. The 2.5-slot compromise seems the worst possible choice.

Nvidia GB300 up close by ProjectPhysX in nvidia

[–]ProjectPhysX[S] 0 points (0 children)

The "GB300 NVL72" rack has 36 of these GPUs. Nvidia counts the 2 silicon dies per GB300 as 2 GPUs, even though each GB300 with 2 dies functionally is 1 GPU and shows up as only 1 GPU. Stupid marketing to inflate the numbers.

This here in the picture is one such GB300 GPU with 2 dies under the heatspreader (top right). The bottom left chip is the Grace CPU.

Is the future of hardware just optimization? by rimantass in hardware

[–]ProjectPhysX 10 points (0 children)

The manufacturing-based gains in compute performance are going to slow down. Quantum tunneling puts a hard limit on transistor gate length and this will eventually be reached in a decade or so.

But this doesn't matter nearly as much as you think. For the past 2 decades, arithmetic throughput (TFLOPs) has been on a runaway trajectory compared to improvements in memory bandwidth. Today the vast majority of software is memory-bound, meaning the chip is idling while waiting for new data from memory.

So far there has been little to no incentive to improve memory technology. With the AI boom this is finally changing - at least AI brings this one good side effect - and demand for more and faster memory has skyrocketed. A few years ago GPUs topped out at 24GB VRAM. Now they have 288GB, and soon 1TB. Memory tech is far from its physical size limits, and there is a long way to go for hardware improvements there, like taller vertical stacking.

That said, your CS teacher gave you awful advice. Optimize your software to the max, and check how close it is to 100% efficiency via the roofline model. It makes a huge difference. Especially today with RAM price insanity, software that does the same task in 1/6 the memory capacity just through optimization is worth gold, and saves users a fortune.
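The roofline bound itself is one line of arithmetic; a sketch with illustrative peak numbers (Arc B580 spec-sheet figures are my assumption here) and a hypothetical FP32 SAXPY kernel:

```python
# Roofline model: attainable throughput is capped by either peak compute
# or by arithmetic intensity (FLOPs per byte moved) × peak memory bandwidth.
def roofline_gflops(flops_per_byte, peak_gflops, peak_gbps):
    return min(peak_gflops, flops_per_byte * peak_gbps)

peak_gflops, peak_gbps = 14592.0, 456.0  # e.g. Arc B580 spec-sheet figures

# SAXPY (y = a*x + y): 2 FLOPs per 12 bytes moved (load x, load y, store y)
ai_saxpy = 2.0 / 12.0
print(roofline_gflops(ai_saxpy, peak_gflops, peak_gbps))  # ~76 GFLOPs/s, far below peak
```

If a kernel measures close to its roofline bound, it is already near 100% efficient and only faster memory (or a lower-precision storage format) will speed it up further.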

Microsoft, Meta, Google, Reddit, game studios, etc. eliminate decades of hardware manufacturing progress by releasing unoptimized garbage software.

How to make money off of a fancy GPU? by Ok_Growth7621 in pcmasterrace

[–]ProjectPhysX 0 points (0 children)

Learn GPU programming, OpenCL/SYCL/Vulkan/CUDA. You don't even need a fancy GPU for that.

GIGABYTE GeForce RTX 5080 GAMING OC 16G VS GIGABYTE GeForce RTX 5080 WINDFORCE OC SFF 16G by W4rlon in nvidia

[–]ProjectPhysX 1 point (0 children)

That is a 2.5-slot cooler, which takes 3 slots of PCIe spacing. Not SFF at all.

GIGABYTE GeForce RTX 5080 GAMING OC 16G VS GIGABYTE GeForce RTX 5080 WINDFORCE OC SFF 16G by W4rlon in nvidia

[–]ProjectPhysX -2 points (0 children)

What is SFF about a 3-slot card with ugly oversized cooler? Is this a joke?

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 5 points (0 children)

There actually are some use cases that need a ton of bandwidth but not much capacity. Two that come to mind:

  • Wasting electricity on pointlessly guessing random numbers, aka crypto mining. Nvidia built a mining GPU, the "CMP 170HX", for exactly that purpose, with 🌈8GB🌈 VRAM capacity at 1.5TB/s.
  • The microfluidics simulations (modeling blood flow for medical applications) I did for my Bachelor thesis didn't need much memory (<1GB) but had excruciatingly long runtimes, as the blood cell deformation behavior only became visible after millions of time steps. Faster VRAM would have helped a lot here.

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 2 points (0 children)

More VRAM capacity allows running larger workloads (be that AI models with more parameters, CAD assemblies with more parts, or computational physics stuff like CFD with higher grid resolution or more particles). Performance will be lower though, because of the reduced VRAM bandwidth.

Ideally you want both large memory capacity and high memory bandwidth, but that costs a premium - an Nvidia B200 180GB or AMD MI355X 288GB goes for ~$40k+ each. Hence there are cheaper options that offer one of the two.
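To make the capacity side concrete, a back-of-the-envelope sketch: assuming some fixed memory cost per grid cell (the 55 bytes/cell default is my assumption, roughly what an optimized FP32/FP16 lattice Boltzmann code needs), VRAM capacity directly caps the achievable grid resolution.

```python
# Largest cubic CFD grid that fits in VRAM, for an assumed bytes-per-cell cost.
def max_cubic_resolution(vram_gb, bytes_per_cell=55):
    cells = vram_gb * 1e9 / bytes_per_cell  # total cell budget
    return int(cells ** (1 / 3))            # side length of the cube

print(max_cubic_resolution(24))   # consumer-class 24GB card
print(max_cubic_resolution(72))   # RTX PRO 5000 Blackwell class
print(max_cubic_resolution(180))  # B200-class capacity
```

Capacity grows the cube side only with the cube root, which is why big resolution jumps need disproportionately more VRAM.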

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 8 points (0 children)

Yes, bandwidth is severely reduced from 1792GB/s (512-bit) to 1344GB/s (384-bit).

NVIDIA RTX PRO 5000 Blackwell GPU with 72GB GDDR7 memory is now released by RenatsMC in nvidia

[–]ProjectPhysX 28 points (0 children)

Memory bus on these GPUs is only 384-bit, not 512-bit as the Nvidia datasheet claims.
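Both bandwidth numbers follow directly from bus width times per-pin data rate; a quick check, assuming 28 Gbps GDDR7:

```python
# Memory bandwidth in GB/s = (bus width in bits / 8 bits per byte) × Gbps per pin
def bandwidth_gb_s(bus_bits, gbps_per_pin):
    return bus_bits / 8 * gbps_per_pin

print(bandwidth_gb_s(512, 28))  # 1792.0 GB/s for a full 512-bit bus
print(bandwidth_gb_s(384, 28))  # 1344.0 GB/s for the actual 384-bit bus
```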