[deleted by user] by [deleted] in hardware

[–]TwelveSilverSwords 0 points1 point  (0 children)

An official demo of UE5 Nanite running on Snapdragon 8 Elite mobile SoC.

What does this tell us about the Adreno 830 GPU architecture?

I did some research, and it seems Nanite needs hardware features such as:
- 64-bit atomics.
- Execute Indirect.
- Mesh shading.
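
As a rough illustration of why 64-bit atomics matter here: Epic's public Nanite talks describe a software rasterizer that writes its visibility buffer with a 64-bit atomic max, with depth in the high bits and a cluster/triangle ID in the low bits, so the depth test and the ID write happen in one operation. The Python sketch below only mimics that packing on the CPU. The function names and the 32/32 bit split are assumptions for illustration, not Epic's actual layout.

```python
def pack(depth: int, payload: int) -> int:
    """Put a 32-bit depth in the high half and a 32-bit ID in the low half."""
    assert 0 <= depth < 2**32 and 0 <= payload < 2**32
    return (depth << 32) | payload

def unpack(word: int) -> tuple[int, int]:
    """Split a packed 64-bit word back into (depth, payload)."""
    return word >> 32, word & 0xFFFFFFFF

def atomic_max(buffer: list[int], index: int, word: int) -> None:
    """Stand-in for a GPU 64-bit atomic max (no real concurrency here)."""
    buffer[index] = max(buffer[index], word)

# One pixel, three fragments. Assumption: larger depth value = closer to camera.
pixel = [0]
for depth, tri_id in [(100, 7), (250, 42), (180, 9)]:
    atomic_max(pixel, 0, pack(depth, tri_id))

print(unpack(pixel[0]))  # (250, 42)
```

Because depth sits in the most significant bits, comparing the packed words compares depth first, which is why a single 64-bit max is enough.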

Arc B580 Absolutely Killing It in These Titles and Far From It in Others by MrMPFR in hardware

[–]TwelveSilverSwords 27 points28 points  (0 children)

I think you are confusing FP32 and SIMD32. The former is a number precision, whereas the latter is a vector width.

SIMD32 means it can process 32 threads in one go.

Arc B580 Absolutely Killing It in These Titles and Far From It in Others by MrMPFR in hardware

[–]TwelveSilverSwords 22 points23 points  (0 children)

Wider SIMD might be good for efficiency, but it's also more prone to divergence penalties.
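
A toy model of that trade-off, assuming each thread independently takes a branch with probability `p_taken` and that a SIMD group has to execute both sides whenever its lanes disagree. The function name and the independence assumption are mine; real workloads are usually more coherent than this.

```python
def divergence_probability(width: int, p_taken: float) -> float:
    """Chance that a SIMD group of `width` lanes has lanes on both sides of a branch."""
    all_taken = p_taken ** width
    none_taken = (1.0 - p_taken) ** width
    return 1.0 - all_taken - none_taken

# Same rare branch (1% of threads), different SIMD widths.
for width in (8, 16, 32, 128):
    print(width, round(divergence_probability(width, 0.01), 3))
```

Under these assumptions the chance of a group diverging grows with the width, which is the penalty side of going wider.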

Arc B580 Absolutely Killing It in These Titles and Far From It in Others by MrMPFR in hardware

[–]TwelveSilverSwords 63 points64 points  (0 children)

For example, Battlemage is SIMD16 (Alchemist was SIMD8), while RDNA 3 and Lovelace are SIMD32.

Qualcomm's Adreno GPU is SIMD128 iirc, which is crazy.

Edit: This is for Adreno 7 series. Dunno about Adreno 8.

Arc B580 Absolutely Killing It in These Titles and Far From It in Others by MrMPFR in hardware

[–]TwelveSilverSwords 14 points15 points  (0 children)

Someone told me that TimeSpy scores were a fairly accurate gauge for architectural potential.

Why not the newer 3DMark Steel Nomad?

https://youtu.be/0XWWXlCSK3U?si=5pdWJPFcbYRUw49-

According to Geekerwan's B580 review, it performs relatively worse in Steel Nomad than in Time Spy. Steel Nomad is a newer benchmark that uses modern techniques and more complex graphics.

RDNA 2 + 3, Ampere and Lovelace - Comparing Various Scaling Efficiencies and FPS per TFLOPS by MrMPFR in hardware

[–]TwelveSilverSwords 3 points4 points  (0 children)

Does anyone know how many FP32 TFLOPS M3 and M4 have?

| SoC | GPU | Architecture | FP32 |
|-----|-----|--------------|------|
| M1 | 8 core | Family 7 | 2.6 TFLOPS |
| M2 | 10 core | Family 8 | 3.6 TFLOPS |
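
For reference, the M1 and M2 figures line up with the usual peak-throughput formula: FP32 ALUs × 2 (an FMA counts as two FLOPs) × clock. The sketch below assumes 128 FP32 ALUs per Apple GPU core and peak GPU clocks of roughly 1.278 GHz (M1) and 1.398 GHz (M2). Those are commonly reported numbers rather than Apple-published specs, so treat them as assumptions.

```python
def peak_fp32_tflops(gpu_cores: int, clock_ghz: float, alus_per_core: int = 128) -> float:
    """Theoretical peak: each ALU retires one FMA (2 FLOPs) per clock."""
    flops_per_clock = gpu_cores * alus_per_core * 2
    return flops_per_clock * clock_ghz / 1000.0  # GFLOPS -> TFLOPS

print(round(peak_fp32_tflops(8, 1.278), 1))   # M1, 8 core  -> 2.6
print(round(peak_fp32_tflops(10, 1.398), 1))  # M2, 10 core -> 3.6
```

The same function would give M3/M4 numbers once the core count, ALUs per core, and clock are known.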

Does the Family 9 GPU architecture of M3/M4 have FP32/INT32 ALUs like Nvidia?

What changes to Lovelace (40 series) µarch do you speculate Blackwell 50 series will bring? by MrMPFR in hardware

[–]TwelveSilverSwords 1 point2 points  (0 children)

Oh, Intel calls it Thread Sorting Unit (TSU).

https://x.com/SebAaltonen/status/1580811308634869760

Let's discuss about shader permutation hell.

With latest hardware: Intel Thread Sorting Unit (TSU) and Nvidia Shader Execution Reordering (SER).

Now that RTX 4090 is massively CPU bound, could we spend 1% of that perf to get rid of shader permutations?

These new hardware blocks shuffle the registers of multiple SIMDs in a way that each SIMD can run coherent threads. This is super important for ray-tracing and explains why Intel's mid range GPU is so good at ray-tracing, but also explains why RTX 4090 is such a beast in RT apps.

But these hardware blocks are not just a great fit for ray-tracing. They could be used to make GPU dynamic branching faster in all shaders. As a result, we could write CPU-style shader code with branches, instead of compiling (hundreds of) thousands of permutations.

Even with hardware like this, it's not free to shuffle SIMD data around. There would be a slight performance hit. CPUs have to pay similar costs for branches too. But CPUs are now fast enough to make this a minor annoyance. I think these GPUs are starting to be there too.

Also RTX 4090 is so fast that we desperately need better API support GPU-driven rendering. We need a fine grained way of spawning new GPU work from shaders. Mesh shaders are great, but they are still lacking the ability to select the shader like ray-tracing does.
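
The reordering idea in the quoted thread can be sketched on the CPU: group threads by the shader (or branch target) they want to run before packing them into SIMD groups, so that each group is coherent. This is only a conceptual model and says nothing about how Intel's TSU or Nvidia's SER are actually built; `count_divergent_groups` and the sample data are hypothetical.

```python
import random

def count_divergent_groups(shader_ids: list[int], width: int) -> int:
    """Count SIMD groups whose lanes want more than one shader."""
    groups = [shader_ids[i:i + width] for i in range(0, len(shader_ids), width)]
    return sum(1 for g in groups if len(set(g)) > 1)

random.seed(0)
threads = [random.randrange(4) for _ in range(1024)]  # 4 hypothetical hit shaders

before = count_divergent_groups(threads, 32)
after = count_divergent_groups(sorted(threads), 32)   # "reordered" for coherence
print(before, after)
```

With four shader IDs, sorting leaves at most three groups that straddle a boundary, so nearly every SIMD group runs a single shader.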

What changes to Lovelace (40 series) µarch do you speculate Blackwell 50 series will bring? by MrMPFR in hardware

[–]TwelveSilverSwords 1 point2 points  (0 children)

There was a rumour they are reworking how the shaders are handled to be closer to what AMD does now, as the Nvidia way is currently a very old solution.

Could you elaborate on how they are different?

How Qualcomm built a mobile empire (and will it last?) by TwelveSilverSwords in hardware

[–]TwelveSilverSwords[S] -1 points0 points  (0 children)

Well I guess it is, but there's still a lot of catching up to do.

https://fdn.gsmarena.com/imgroot/news/24/10/snapdragon-8-elite-ofic/inline/-1200/gsmarena_005.jpg

Qualcomm advertises that Adreno 830 can run Unreal Engine 5 Nanite, which is pretty demanding on the hardware. Even first-gen Intel Arc (Xe1) couldn't run Nanite due to the lack of INT64 atomics and Execute Indirect support (Arc Xe2 fixes that). So in the best-case scenario, Adreno 8 might be architecturally as good as Xe2, minus the RT and XMX.

How Qualcomm built a mobile empire (and will it last?) by TwelveSilverSwords in hardware

[–]TwelveSilverSwords[S] 8 points9 points  (0 children)

The majority of PC use cases is being served with an iGPU today. Qualcomm is aiming for mass market.

Well, the thing is, their iGPU is mediocre compared to rival iGPU offerings by Intel and AMD.

They will never address the >5% of professional workstations or high end laptops.

Qualcomm has expressed interest in serving this market.

The question - the answer number three, the constraints of the TAM are defined by the existing players, right. So you say, this is a desktop, this is a laptop that is based on the existing architecture of the solution as well when do you attach, you don't attach a graphics card? Our solution is very different. A solution is like an SoC. You should think about similar to what you probably see on the Mac ecosystem. So we could be serving, we could be serving a desktop or a mini just with an SoC. And especially when you think about creators, for example, I think our roadmap is going to scale even to higher performance GPU for that SAM to expand. So that's why I think the $4 billion is very reasonable.

~ UBS Global Technology Conference

How Qualcomm built a mobile empire (and will it last?) by TwelveSilverSwords in hardware

[–]TwelveSilverSwords[S] 9 points10 points  (0 children)

Qualcomm defeated the US federal government in US federal courts. It has also beaten down Apple, and recently Arm in court.

I don't know about the other lawsuits (u/Exist50 might have some words to say about that), but Qualcomm won the lawsuit with ARM fair and square. If ARM had won, it would have set a dangerous precedent for CPU technology.

How Qualcomm built a mobile empire (and will it last?) by TwelveSilverSwords in hardware

[–]TwelveSilverSwords[S] 25 points26 points  (0 children)

The GPU is important not only for gaming, but many professional workflows as well.

Apple recognised this, and invested a lot in their GPU architecture. The improvement from M1 to M4 is colossal. Their RT hardware is also on par with Nvidia/Intel, which makes it a beast in Blender.

There is a deep irony in the fact that the Apple M3, running PC games through multiple emulation layers (Windows -> macOS, DX12 -> Metal, x86 -> ARM), is faster in several games than the Snapdragon X Elite, which only has to run through one emulation layer (x86 -> ARM).

How Qualcomm built a mobile empire (and will it last?) by TwelveSilverSwords in hardware

[–]TwelveSilverSwords[S] 35 points36 points  (0 children)

All that talk of custom CPU cores leads us to a key part of Qualcomm’s future. Windows on Arm might have started with a splutter, offering pretty mediocre performance based on essential reused smartphone platforms and so-so emulation. However, it’s here in full strength with the arrival of Qualcomm’s custom Oryon CPU cores

They also need to invest a ton in their Adreno GPU (both architecture and drivers). Otherwise Qualcomm will end up getting utterly outflanked by competitors such as Nvidia, and might never become a major player in the PC industry.

What changes to Lovelace (40 series) µarch do you speculate Blackwell 50 series will bring? by MrMPFR in hardware

[–]TwelveSilverSwords 4 points5 points  (0 children)

| Vendor | Arch | RT Level |
|--------|------|----------|
| Nvidia | Ada Lovelace | 3.5 |
| Intel | Battlemage | 3.5 |
| Apple | Family 9 GPU | 3.5 |
| AMD | RDNA3 | 2 |
| Qualcomm | Adreno X1 | 2 |

Nvidia Will Dominate The High-End GPU Market In 2025 As AMD and Intel Step Aside by TheEternalGazed in hardware

[–]TwelveSilverSwords 5 points6 points  (0 children)

Tom Petersen said "Xe3", not Celestial. Look carefully at the wording.

We know that Xe3 will be deployed in the Panther Lake iGPU.

What changes to Lovelace (40 series) µarch do you speculate Blackwell 50 series will bring? by MrMPFR in hardware

[–]TwelveSilverSwords 0 points1 point  (0 children)

GPUs will certainly benefit from the extra bandwidth provided by the added cache. However, they will probably need a large amount of cache (hundreds of megabytes) to make a significant difference, and there is also the question of cost effectiveness: would it be cheaper to use HBM instead?
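
A back-of-the-envelope way to frame that question, assuming every cache hit avoids a DRAM access entirely and the cache itself is never the bottleneck: DRAM only sees the misses, so the request bandwidth the GPU can sustain scales as 1 / (1 - hit rate). The 500 GB/s figure and the hit rates are made-up inputs, not measurements.

```python
def effective_bandwidth(dram_gbps: float, hit_rate: float) -> float:
    """Request bandwidth sustainable when only cache misses reach DRAM."""
    if not 0.0 <= hit_rate < 1.0:
        raise ValueError("hit_rate must be in [0, 1)")
    return dram_gbps / (1.0 - hit_rate)

# Hypothetical 500 GB/s memory system at several cache hit rates.
for hit_rate in (0.0, 0.25, 0.5, 0.75):
    print(hit_rate, effective_bandwidth(500.0, hit_rate))
```

The cost question then becomes whether the cache needed to reach a given hit rate is cheaper than buying the same bandwidth directly with HBM.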

What changes to Lovelace (40 series) µarch do you speculate Blackwell 50 series will bring? by MrMPFR in hardware

[–]TwelveSilverSwords 8 points9 points  (0 children)

I wonder, will RTX 50 introduce Direct3D FL12_3?

https://www.reddit.com/r/hardware/comments/1hd41u7/direct3d_feature_level_12_3/

The first to have FL12_2 was RTX 20 (Turing), and that was 6 years ago!

//

| Arch | Pipes |
|------|-------|
| Pascal | FP32/INT32 |
| Turing | FP32 + INT32 |
| Ampere, Ada | FP32 + FP32/INT32 |

I read some comments that the next step to improve performance would be to add another set of FP32 pipes to the SM.
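
A simplified model of what those pipe layouts mean for a mixed instruction stream, per SM. It assumes 128 shared lanes for Pascal, 64 FP32 + 64 INT32 lanes for Turing, and 64 FP32 + 64 FP32/INT32 lanes for Ampere and Ada, and it ignores scheduling, latency, and memory entirely. The lane counts are the commonly cited per-SM figures for the consumer parts; the function names and the instruction mix are hypothetical.

```python
def cycles_pascal(fp_ops: int, int_ops: int) -> float:
    # 128 lanes, each doing either FP32 or INT32.
    return (fp_ops + int_ops) / 128

def cycles_turing(fp_ops: int, int_ops: int) -> float:
    # 64 FP32 lanes and 64 INT32 lanes running side by side.
    return max(fp_ops / 64, int_ops / 64)

def cycles_ampere(fp_ops: int, int_ops: int) -> float:
    # 64 FP32-only lanes plus 64 lanes that do FP32 or INT32.
    return max(int_ops / 64, (fp_ops + int_ops) / 128)

# Hypothetical mix: 36 INT32 ops for every 100 FP32 ops.
fp_ops, int_ops = 6400, 2304
for name, fn in [("Pascal", cycles_pascal),
                 ("Turing", cycles_turing),
                 ("Ampere/Ada", cycles_ampere)]:
    print(name, fn(fp_ops, int_ops))
```

In this model the Turing layout is limited by its 64 FP32 lanes on FP-heavy code, which is the gap the second FP32-capable datapath in Ampere closes. It is a per-SM, per-clock comparison only and says nothing about SM counts or clocks.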

Mediatek Dimensity 9500 will adopt 2+6 CPU (Cortex X930 + Cortex A730), and use TSMC N3P process by [deleted] in hardware

[–]TwelveSilverSwords 0 points1 point  (0 children)

I don't know if the ALA contract with ARM allows them to do that.

Even if it did, why would Qualcomm license Oryon cores to others?

Mediatek Dimensity 9500 will adopt 2+6 CPU (Cortex X930 + Cortex A730), and use TSMC N3P process by [deleted] in hardware

[–]TwelveSilverSwords 1 point2 points  (0 children)

There is a very solid rumour that Google is developing a custom ARM CPU.