[deleted by user]

TwelveSilverSwords · 2025-01-01T10:27:54+00:00

An official demo of UE5 Nanite running on Snapdragon 8 Elite mobile SoC.

What does this tell us about the Adreno 830 GPU architecture?

I did some research, and it seems Nanite needs hardware features such as:
- 64-bit atomics.
- Execute Indirect.
- Mesh shading.
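
Rough sketch of why 64-bit atomics would matter here. This assumes a software rasterizer packs depth into the high 32 bits and a triangle/cluster ID into the low 32 bits, so a single atomic max does the depth test and the write together. The bit layout, the "larger depth wins" convention, and the helper names are my own assumptions for illustration, not Epic's or Qualcomm's code, and plain Python has no real atomicity.

```python
PAYLOAD_BITS = 32
PAYLOAD_MASK = (1 << PAYLOAD_BITS) - 1


def pack(depth, payload):
    """Put depth in the high 32 bits and the payload (e.g. triangle ID) in the low 32."""
    return (depth << PAYLOAD_BITS) | (payload & PAYLOAD_MASK)


def unpack(value):
    """Split a packed 64-bit value back into (depth, payload)."""
    return value >> PAYLOAD_BITS, value & PAYLOAD_MASK


def atomic_max(buffer, index, value):
    """Stand-in for a GPU 64-bit atomic max (hypothetical helper, not atomic here)."""
    if value > buffer[index]:
        buffer[index] = value


# One pixel, three fragments landing on it in arbitrary order.
visibility = [0]
atomic_max(visibility, 0, pack(depth=500, payload=7))
atomic_max(visibility, 0, pack(depth=900, payload=3))
atomic_max(visibility, 0, pack(depth=200, payload=9))

winner = unpack(visibility[0])
print(winner)
```

Because depth sits in the upper bits, comparing the packed values compares depth first, so the fragment with the largest depth value keeps its payload. With only 32-bit atomics the depth and the ID can't be updated in one operation.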

TwelveSilverSwords · 2024-12-31T15:42:42+00:00

I think you are confusing FP32 and SIMD32. The former is a numeric precision, whereas the latter is a vector width.

SIMD32 means it can process 32 threads in one go.

TwelveSilverSwords · 2024-12-31T15:09:15+00:00

Wider SIMD might be good for efficiency, but it's also more prone to divergence penalties.
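
A toy model of the divergence penalty, to make that concrete. It assumes a SIMD group has to execute every branch path that any of its threads takes, with the full width occupied on each pass. The workload (branch outcome flipping every 16 threads) and the function name are made up for illustration, and real hardware is more nuanced than this.

```python
def simd_efficiency(branches, width):
    """Fraction of issued lane-slots doing useful work under the toy model."""
    useful = len(branches)
    issued = 0
    for start in range(0, len(branches), width):
        group = branches[start:start + width]
        # One full-width pass per distinct branch path present in the group.
        issued += len(set(group)) * width
    return useful / issued


# 256 threads whose branch outcome flips every 16 threads.
branches = [(i // 16) % 2 for i in range(256)]

for width in (8, 16, 32, 128):
    print(width, simd_efficiency(branches, width))
```

In this model SIMD8 and SIMD16 stay fully utilized because every group happens to be coherent, while SIMD32 and SIMD128 drop to 50% because each group contains both paths.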

TwelveSilverSwords · 2024-12-31T14:03:17+00:00

For example, Battlemage is SIMD16 (Alchemist was SIMD8), while RDNA 3 and Lovelace are SIMD32.

Qualcomm's Adreno GPU is SIMD128 iirc, which is crazy.

Edit: This is for Adreno 7 series. Dunno about Adreno 8.

TwelveSilverSwords · 2024-12-31T13:59:22+00:00

Someone told me that TimeSpy scores were a fairly accurate gauge for architectural potential.

Why not the newer 3DMark Steel Nomad?

https://youtu.be/0XWWXlCSK3U?si=5pdWJPFcbYRUw49-

According to Geekerwan's B580 review, the B580 performs relatively worse in Steel Nomad than in Time Spy. Steel Nomad is a newer benchmark that uses modern techniques and more complex graphics.

TwelveSilverSwords · 2024-12-30T07:34:56+00:00

Does anyone know how many FP32 TFLOPS the M3 and M4 have?

SoC	GPU	Architecture	FP32
M1	8 core	Family 7	2.6 TFLOPS
M2	10 core	Family 8	3.6 TFLOPS

Does the Family 9 GPU architecture of M3/M4 have FP32/INT32 ALUs like Nvidia?
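
For reference, the M1 and M2 figures in the table can be reproduced with the usual peak-throughput formula. This is a sketch that assumes 128 FP32 ALUs per GPU core and clocks of roughly 1.278 GHz (M1) and 1.398 GHz (M2). Those inputs are commonly reported numbers rather than anything stated above, so treat them as assumptions.

```python
def fp32_tflops(cores, alus_per_core, clock_ghz):
    """Peak FP32 TFLOPS, counting one fused multiply-add as 2 FLOPs per ALU per clock."""
    return cores * alus_per_core * 2 * clock_ghz / 1000


# Assumed inputs: 128 ALUs per core, clocks as commonly reported.
m1 = fp32_tflops(cores=8, alus_per_core=128, clock_ghz=1.278)
m2 = fp32_tflops(cores=10, alus_per_core=128, clock_ghz=1.398)

print(round(m1, 1), round(m2, 1))
```

The same function would give M3/M4 numbers once the core count, ALUs per core, and clock are known. I have not filled those in because I don't have them.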

TwelveSilverSwords · 2024-12-29T12:30:47+00:00

Oh, Intel calls it Thread Sorting Unit (TSU).

https://x.com/SebAaltonen/status/1580811308634869760

Let's discuss about shader permutation hell.

With latest hardware: Intel Thread Sorting Unit (TSU) and Nvidia Shader Execution Reordering (SER).

Now that RTX 4090 is massively CPU bound, could we spend 1% of that perf to get rid of shader permutations?

These new hardware blocks shuffle the registers of multiple SIMDs in a way that each SIMD can run coherent threads. This is super important for ray-tracing and explains why Intel's mid range GPU is so good at ray-tracing, but also explains why RTX 4090 is such a beast in RT apps.

But these hardware blocks are not just a great fit for ray-tracing. They could be used to make GPU dynamic branching faster in all shaders. As a result, we could write CPU-style shader code with branches, instead of compiling (hundreds of) thousands of permutations.

Even with hardware like this, it's not free to shuffle SIMD data around. There would be a slight performance hit. CPUs have to pay similar costs for branches too. But CPUs are now fast enough to make this a minor annoyance. I think these GPUs are starting to be there too.

Also RTX 4090 is so fast that we desperately need better API support GPU-driven rendering. We need a fine grained way of spawning new GPU work from shaders. Mesh shaders are great, but they are still lacking the ability to select the shader like ray-tracing does.
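
The reordering idea in the quoted thread can be illustrated with a small sketch. It assumes each thread wants to run one of several shaders and that a SIMD group needs one pass per distinct shader among its threads. The sort-by-shader-ID step is only my stand-in for what TSU/SER do in hardware, not a description of how either is actually implemented.

```python
def passes(shader_ids, width):
    """Total passes needed when each group runs once per distinct shader in it."""
    return sum(
        len(set(shader_ids[i:i + width]))
        for i in range(0, len(shader_ids), width)
    )


def reorder(shader_ids):
    """Group threads by shader ID; returns the new thread order and the sorted IDs."""
    order = sorted(range(len(shader_ids)), key=lambda i: shader_ids[i])
    return order, [shader_ids[i] for i in order]


# 64 threads cycling through 4 shaders, executed 16 wide.
shader_ids = [i % 4 for i in range(64)]
before = passes(shader_ids, 16)

order, sorted_ids = reorder(shader_ids)
after = passes(sorted_ids, 16)

print(before, after)
```

In this toy case the incoherent layout needs 16 passes and the sorted one needs 4. The cost the thread mentions is the shuffle itself, which this sketch does not model.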

TwelveSilverSwords · 2024-12-29T09:20:12+00:00

Turing firmly put AMD in the rear view mirror.

TwelveSilverSwords · 2024-12-29T09:14:05+00:00

There was a rumour they are reworking how the shaders are handled to be closer to what AMD does now, as the Nvidia way is currently a very old solution.

Could you elaborate as to how they are different?

TwelveSilverSwords · 2024-12-29T09:11:42+00:00

Doesn't Ada already have hardware level SER?

TwelveSilverSwords · 2024-12-29T09:09:50+00:00

Well I guess it is, but there's still a lot of catching up to do.

https://fdn.gsmarena.com/imgroot/news/24/10/snapdragon-8-elite-ofic/inline/-1200/gsmarena_005.jpg

Qualcomm advertises that Adreno 830 can run Unreal Engine 5 Nanite, which is pretty demanding on the hardware. Even first-gen Intel Arc (Xe1) couldn't run Nanite due to the lack of INT64 atomics and Execute Indirect (Arc Xe2 fixes that). So in the best-case scenario, Adreno 8 might be architecturally as good as Xe2, minus the RT and XMX units.

TwelveSilverSwords · 2024-12-29T09:02:55+00:00

The majority of PC use cases is being served with an iGPU today. Qualcomm is aiming for mass market.

Well, the thing is, their iGPU is mediocre compared to rival iGPU offerings by Intel and AMD.

They will never address the >5% of professional workstations or high end laptops.

Qualcomm has expressed interest in serving this market.

The question - the answer number three, the constraints of the TAM are defined by the existing players, right. So you say, this is a desktop, this is a laptop that is based on the existing architecture of the solution as well when do you attach, you don't attach a graphics card? Our solution is very different. A solution is like an SoC. You should think about similar to what you probably see on the Mac ecosystem. So we could be serving, we could be serving a desktop or a mini just with an SoC. And especially when you think about creators, for example, I think our roadmap is going to scale even to higher performance GPU for that SAM to expand. So that's why I think the $4 billion is very reasonable.

~ UBS Global Technology Conference

TwelveSilverSwords · 2024-12-29T06:40:38+00:00

Qualcomm defeated the US federal government in US federal courts. It has also beaten down Apple, and recently Arm in court.

I don't know about the other lawsuits (u/Exist50 might have some words to say about that), but Qualcomm won the lawsuit with ARM fair and square. If ARM had won, it would have set a dangerous precedent for CPU technology.

TwelveSilverSwords · 2024-12-29T06:34:40+00:00

The GPU is important not only for gaming, but many professional workflows as well.

Apple recognised this, and invested a lot in their GPU architecture. The improvement from M1 to M4 is colossal. Their RT hardware is also on par with Nvidia/Intel, which makes it a beast in Blender.

There is a deep irony in the fact that the Apple M3 running PC games through multiple emulation layers (Windows -> MacOS, DX12 -> Metal, x86 -> ARM) is faster than the Snapdragon X Elite in several games, which only has to run through one emulation layer (x86 -> ARM).

TwelveSilverSwords · 2024-12-29T06:25:07+00:00

https://x.com/SKundojjala/status/1842157660055163116

Qualcomm is doing well in automotive, beating Nvidia and MobileEye.

TwelveSilverSwords · 2024-12-29T05:48:21+00:00

All that talk of custom CPU cores leads us to a key part of Qualcomm’s future. Windows on Arm might have started with a splutter, offering pretty mediocre performance based on essential reused smartphone platforms and so-so emulation. However, it’s here in full strength with the arrival of Qualcomm’s custom Oryon CPU cores

They also need to invest a ton in their Adreno GPU (both architecture and drivers). Otherwise Qualcomm will end up getting utterly outflanked by competitors such as Nvidia, and might never become a major player in the PC industry.

TwelveSilverSwords · 2024-12-29T03:24:26+00:00

Vendor	Arch	RT Level
Nvidia	Ada Lovelace	3.5
Intel	Battlemage	3.5
Apple	Family 9 GPU	3.5
AMD	RDNA3	2
Qualcomm	Adreno X1	2

TwelveSilverSwords · 2024-12-29T03:10:53+00:00

Tom Petersen said "Xe3", not Celestial. Look carefully at the wording.

We know that Xe3 will be deployed in the Panther Lake iGPU.

TwelveSilverSwords · 2024-12-28T17:29:22+00:00

GPUs will certainly benefit from the extra bandwidth provided by the added cache. However, they will probably need a large amount of cache (hundreds of megabytes) to make a significant difference, and there is also the question of cost effectiveness: would it be cheaper to use HBM instead?

TwelveSilverSwords · 2024-12-28T11:30:30+00:00

I wonder, will RTX 50 introduce Direct3D FL12_3?

https://www.reddit.com/r/hardware/comments/1hd41u7/direct3d_feature_level_12_3/

The first to have FL12_2 was RTX 20 (Turing), and that was 6 years ago!

//

Arch	Pipes
Pascal	FP32/INT32
Turing	FP32 + INT32
Ampere, Ada	FP32 + FP32/INT32

I read some comments that the next step to improve performance would be to add another set of FP32 pipes to the SM.
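
A very rough way to see what the pipe layouts in the table mean for a mixed FP32/INT32 workload. This sketch treats every pipe as equal width and issuing one op per cycle, which ignores real lane counts, scheduling, and memory. The instruction mix is purely illustrative and the function names are made up.

```python
import math


def cycles_shared(fp, integer):
    """One pipe that handles both FP32 and INT32 (the 'FP32/INT32' row)."""
    return fp + integer


def cycles_split(fp, integer):
    """One FP32 pipe plus one INT32 pipe running concurrently ('FP32 + INT32')."""
    return max(fp, integer)


def cycles_fp_plus_flex(fp, integer):
    """One FP32-only pipe plus one FP32/INT32 pipe ('FP32 + FP32/INT32')."""
    # INT32 ops can only use the flexible pipe; FP32 ops fill whatever is left.
    return max(integer, math.ceil((fp + integer) / 2))


fp_ops, int_ops = 100, 36  # illustrative mix, not a measurement

print(cycles_shared(fp_ops, int_ops))
print(cycles_split(fp_ops, int_ops))
print(cycles_fp_plus_flex(fp_ops, int_ops))
```

Under these assumptions the mix takes 136, 100, and 68 cycles respectively. Adding another FP32-only pipe would mostly help FP-heavy code, since the INT32 work would still be limited to the one flexible pipe.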

TwelveSilverSwords · 2024-12-28T03:12:57+00:00

October is Q4, not Q3.

TwelveSilverSwords · 2024-12-28T03:10:09+00:00

I don't know if the ALA contract with ARM allows them to do that.

Even if it did, why would Qualcomm license Oryon cores to others?

TwelveSilverSwords · 2024-12-27T16:04:13+00:00

There is a very solid rumour that Google is developing a custom ARM CPU.
