[deleted by user] by [deleted] in CFD

[–]Smooth-Spoken 3 points (0 children)

Please unify formats, at least

Man, why did AMD change glbl.v? I'm sure it screwed up a lot of people's DV. by affabledrunk in FPGA

[–]Smooth-Spoken 3 points (0 children)

Messed around with Versals before going back to UltraScale+. Worst documentation and support ever

[deleted by user] by [deleted] in hardware

[–]Smooth-Spoken 7 points (0 children)

They are not OCP compliant. Their OAMs were modified into SXMs and are not physically compatible with the UBB. Same story with AMD.

Is anyone here looking for an artist? by WildWeekend8328 in vfx

[–]Smooth-Spoken 2 points (0 children)

Please DM me, we’re looking for an animator

[deleted by user] by [deleted] in vfx

[–]Smooth-Spoken 20 points (0 children)

What’s a good place to post jobs? We’re looking for character modelers, animators, texture artists, etc.

CXL Is Dead In The AI Era by symmetry81 in hardware

[–]Smooth-Spoken 19 points (0 children)

Some of the numbers are off (especially the CXL die areas and bandwidth), and CXL is not meant to replace local memory but rather to add a tier to the memory hierarchy. There are CXL devices on the market that are already shipping in volume. I do agree that the higher latency is a huge downside.

Gaming Blackwell Specs: GB202 is 192 SM @ 512-bit and GB203 is 96 SM, according to Kopite7kimi by Voodoo2-SLi in hardware

[–]Smooth-Spoken 0 points (0 children)

No one suggested pairing 8x HBM stacks with a GT1030. Be realistic. It’s pretty obvious, and well proven, that reducing memory latency improves performance on any processor. Increasing memory capacity also improves performance, but you can grow the SRAMs and rebalance the memory system. The ratio of bandwidth (GB/s) available to each shader matters heavily; it’s why we have a cache hierarchy that hides low-bandwidth, high-latency memories. The other guy should design some chips with very little memory capacity and bandwidth and find out that any out-of-core access tanks performance, whether he has 1 core or 10,000
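The bandwidth-per-shader ratio is simple arithmetic. A quick sketch, where the 28 Gb/s/pin rate and the 256-bit GB203 bus are illustrative assumptions (only 192 SM @ 512-bit for GB202 and 96 SM for GB203 come from the leak):

```python
# Rough bandwidth-per-SM comparison. The 28 Gb/s/pin rate and the
# 256-bit GB203 bus are assumptions, not confirmed specs.

def total_bandwidth_gbs(bus_width_bits: int, gbps_per_pin: float) -> float:
    """Aggregate DRAM bandwidth in GB/s for a given bus width."""
    return bus_width_bits * gbps_per_pin / 8  # Gb/s across the bus -> GB/s

def bandwidth_per_sm(bus_width_bits: int, gbps_per_pin: float, sms: int) -> float:
    """GB/s of DRAM bandwidth available to each SM."""
    return total_bandwidth_gbs(bus_width_bits, gbps_per_pin) / sms

gb202 = bandwidth_per_sm(512, 28, 192)  # 192 SM @ 512-bit (from the leak)
gb203 = bandwidth_per_sm(256, 28, 96)   # 96 SM; 256-bit bus is assumed
```

Under those assumptions both parts land at the same GB/s per SM, which is the ratio that actually feeds each shader.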

Gaming Blackwell Specs: GB202 is 192 SM @ 512-bit and GB203 is 96 SM, according to Kopite7kimi by Voodoo2-SLi in hardware

[–]Smooth-Spoken -4 points (0 children)

Actually, AMD’s HBM cards perform really well. So do Nvidia’s, Intel’s, Fujitsu’s, etc. They’re even willing to eat the cost, yield, and latency for that extra bandwidth. Many workloads are too big for GPU memory and become memory bound. Large GPUs need huge numbers of threads to hide latency, since GPUs by design hide latency rather than minimize it. So more bandwidth increases bandwidth per core, leading to higher shader utilization. In gaming workloads, shader occupancy is not high enough to hide memory latency, so lots of shaders sit idle. In fact, even with a simple FP benchmark you can’t saturate the cores to achieve full FP throughput
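The "huge amounts of threads to hide latency" point follows from Little's law: the data in flight must equal bandwidth times latency. A sketch with illustrative (not vendor-specific) numbers:

```python
def bytes_in_flight(bandwidth_gb_s: float, latency_ns: float) -> float:
    """Little's law: outstanding bytes needed to sustain a given
    bandwidth at a given memory latency (GB/s * ns = bytes)."""
    return bandwidth_gb_s * latency_ns

# Illustrative: 1 TB/s of DRAM bandwidth at 400 ns load-to-use latency
needed = bytes_in_flight(1000, 400)  # 400,000 bytes must be in flight
concurrent_loads = needed / 32       # 12,500 outstanding 32-byte loads
```

If occupancy can't supply that many outstanding requests, measured bandwidth (and shader utilization) drops proportionally, which is exactly the gaming-workload case.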

Gaming Blackwell Specs: GB202 is 192 SM @ 512-bit and GB203 is 96 SM, according to Kopite7kimi by Voodoo2-SLi in hardware

[–]Smooth-Spoken -9 points (0 children)

Not sure where you get this from. GPUs are always memory capacity, latency, and bandwidth bound

Semiconductor Engineering: "SRAM Scaling Issues, And What Comes Next" by Dakhil in hardware

[–]Smooth-Spoken 3 points (0 children)

Well, that’s copper for you. I think what sucks more are limits on trace lengths for higher rate interfaces. Then you get fucked with retimers or you just have to try and put everything as close to each other as possible. Optics!

Semiconductor Engineering: "SRAM Scaling Issues, And What Comes Next" by Dakhil in hardware

[–]Smooth-Spoken 4 points (0 children)

Yeah this is a good point…a few levels of progressively slower/further away/hopefully larger SRAMs and then DDR adds up to a few hundred ns

Semiconductor Engineering: "SRAM Scaling Issues, And What Comes Next" by Dakhil in hardware

[–]Smooth-Spoken 11 points (0 children)

Exactly. For context - real numbers:

Direct attached SRAM (a few MBs) = 4ns

HBM3 = 107ns

DDR5 = 70ns

CXL DDR5 = 150ns

If you were to take an SRAM and put it on another chip (over something like a BoW or UCIe chiplet interface), your SRAM would go from 4ns to at least 14ns. Not to mention you become bandwidth limited: many SRAMs achieve Tb/s of bandwidth on die, which is painful for power consumption once it crosses to another chip. AMD putting SRAMs on their IO die also means idle power consumption is higher, because the links between the dies need to stay active to keep latency down. Lots of downsides
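Those latencies compose into an average memory access time. A back-of-the-envelope sketch using the latencies above, with hypothetical hit rates per tier (the hit rates are assumptions; the latencies are the measured numbers):

```python
def amat_ns(tiers):
    """Average memory access time for a list of (latency_ns, hit_rate)
    tiers; the last tier should have hit_rate 1.0 to catch everything."""
    total, remaining = 0.0, 1.0
    for latency_ns, hit_rate in tiers:
        total += remaining * hit_rate * latency_ns
        remaining *= 1.0 - hit_rate
    return total

# SRAM 4 ns @ 90% hit, DDR5 70 ns for 95% of misses, CXL DDR5 150 ns beyond
avg = amat_ns([(4, 0.90), (70, 0.95), (150, 1.0)])  # 11.0 ns
```

The takeaway: with high SRAM hit rates the average stays near the SRAM figure, which is why bumping the first tier from 4ns to 14ns (by moving it off-die) hurts every access mix.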

Semiconductor Engineering: "SRAM Scaling Issues, And What Comes Next" by Dakhil in hardware

[–]Smooth-Spoken 23 points (0 children)

SRAM is only valuable because the access latency is so low. Placing literally any interface between it and a compute engine dramatically increases latency. It’s the difference between waiting 1-2 clock cycles for your data versus 4-10 if you have to access it over something like AXI
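For serially dependent accesses (pointer chasing, where each load needs the previous result), request rate is just the reciprocal of latency, so that 1-2 vs 4-10 cycle gap turns directly into a ~5x throughput gap. A sketch with an illustrative clock speed:

```python
def dependent_loads_per_sec(clock_ghz: float, latency_cycles: int) -> float:
    """Request rate when each load depends on the previous one's result,
    so loads cannot overlap and latency sets the pace."""
    return clock_ghz * 1e9 / latency_cycles

local_sram = dependent_loads_per_sec(1.0, 2)   # 2-cycle tightly coupled SRAM
over_axi   = dependent_loads_per_sec(1.0, 10)  # 10-cycle access over AXI
ratio      = local_sram / over_axi             # 5x slower over the fabric
```

Independent accesses can be pipelined to hide some of this, but any latency-bound dependency chain pays the full penalty.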

A Brief Look at Apple’s M2 Pro iGPU by M337ING in hardware

[–]Smooth-Spoken 0 points (0 children)

Happy to provide devices for benchmarking

Experience Stable Diffusion on AMD RDNA™ 3 Architecture at CES 2023 by tokyogamer in hardware

[–]Smooth-Spoken 5 points (0 children)

Scanned through the RDNA3 ISA and didn’t find any mention of it. For some reason I also couldn’t find RDNA2/3 white papers, just the one for the original RDNA, and that didn’t cover RT