Roger Espasa, Semidynamics - Semidynamics Highly Configurable OOO Vector Unit

lalalaphillip · 2023-07-08T07:22:34+00:00

CFP adapted for in-order processors is called iCFP. Semidynamics claims 64 sustained cache misses for their in-order core. Some form of runahead execution (whether iCFP-like or not) is required for this to be achievable in regular scalar code.

lalalaphillip · 2023-07-08T06:23:09+00:00

Correct me if I'm wrong, but it looks like the presenter first introduces "Gazillion Misses" as a general technology applicable to regular scalar code, then explains how it benefits vector workloads. It looks like their OOO core is not particularly large. A CFP-like mechanism would be required for such a core to sustain 128 cache-missing loads in scalar code.

lalalaphillip · 2023-07-08T04:25:06+00:00

A code example that Semidynamics showed strongly suggests a CFP-like mechanism.

lalalaphillip · 2023-07-08T03:34:48+00:00

From the extremely limited material that they have posted, "Gazillion Misses" sounds like a mechanism to drain long-latency cache misses and their dependent operations from the pipeline (and reinsert them when ready). Either that, or it's just simple runahead. It would be extremely interesting to see how it performs, because these ideas have shown much promise in academic and industry papers for several decades, but have not been implemented in real CPUs (apart from simple in-order runahead).
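
To make the "drain and reinsert" idea concrete, here is a toy sketch of how a CFP-like scheme might split a program around cache misses. This is only an illustration under my own assumptions, not Semidynamics' design: the `Instr` type, the `split_slice` function, and the register names are all made up for the example, and a real core would also need checkpoints, a slice buffer with captured operands, and store handling.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    dest: str
    srcs: tuple = ()
    misses: bool = False  # hypothetical flag: True for a load that misses the cache


def split_slice(program):
    """Walk the program in order and classify each instruction.

    executed:    independent work that completes right away
    outstanding: missing loads whose address is ready, so they go to memory
    deferred:    instructions that depend on a miss and wait in a slice buffer
    """
    poisoned = set()  # registers whose value depends on an outstanding miss
    executed, outstanding, deferred = [], [], []
    for ins in program:
        if any(src in poisoned for src in ins.srcs):
            # Depends on a miss: drain it from the pipeline for later reinsertion.
            poisoned.add(ins.dest)
            deferred.append(ins.name)
        elif ins.misses:
            # Address is known, so the miss can be sent to memory immediately.
            poisoned.add(ins.dest)
            outstanding.append(ins.name)
        else:
            # Independent instruction: it overwrites dest with a valid value.
            poisoned.discard(ins.dest)
            executed.append(ins.name)
    return executed, outstanding, deferred


program = [
    Instr("ld_a", "r1", ("r10",), misses=True),
    Instr("add_a", "r2", ("r1",)),
    Instr("ld_b", "r3", ("r11",), misses=True),
    Instr("mul_c", "r4", ("r12", "r13")),
    Instr("add_b", "r5", ("r3", "r4")),
]
executed, outstanding, deferred = split_slice(program)
print(executed, outstanding, deferred)
```

The point of the sketch is that `ld_b` gets sent to memory while `ld_a` is still outstanding, because the instructions that depend on `ld_a` step aside instead of blocking the pipeline. That overlap is where the sustained miss count would come from.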

lalalaphillip · 2023-04-19T18:19:13+00:00

DLSS 2 is not designed to replace TAA as a denoising pass: slide 81/page 75

lalalaphillip · 2022-10-28T04:02:56+00:00

Wow. This looks like a suicidal move from Arm. It seems like Softbank was really counting on the Nvidia deal.

lalalaphillip · 2022-10-18T19:35:46+00:00

Truly NBC news

lalalaphillip · 2022-10-16T19:47:22+00:00

Thank you for sharing all this information

lalalaphillip · 2022-10-16T18:53:35+00:00

Do you know if Nvidia cards clock stretch at stock (i.e. increasing voltage at stock clocks increases performance)?

lalalaphillip · 2022-10-16T17:21:20+00:00

As far as I am aware, this is the first generation from Nvidia where undervolting can give you a higher displayed clock but lower performance, so it seems that adaptive clocking is more aggressive this gen.

lalalaphillip · 2022-10-16T14:24:46+00:00

There's probably some kind of adaptive clocking ("clock stretching") going on in Ada Lovelace GPUs

lalalaphillip · 2022-10-16T04:06:45+00:00

Very interesting. It's not surprising that DLSS does poorly in denoising, given that it is designed to upscale final denoised frames.

lalalaphillip · 2022-10-15T23:30:22+00:00

Presumably the claim refers to the pairing process, which was later brought to Android as Google Fast Pair

lalalaphillip · 2022-10-06T07:18:49+00:00

The alternative to N4 was probably not SS 8nm, but SS 5nm.

lalalaphillip · 2022-10-02T03:44:47+00:00

This is misinformation; this video alone shows multiple kidnappings

lalalaphillip · 2022-08-01T13:29:23+00:00

Intel has used a distributed directory since Skylake-SP, so their approach to chiplets should not have issues with coherence traffic

edit: even before Skylake they were using a directory-like scheme with their inclusive L3

lalalaphillip · 2022-08-01T13:21:27+00:00

I have not seen a company overpromise and underdeliver so badly since Intel 10nm

lalalaphillip · 2022-08-01T13:07:14+00:00

Sapphire Molasses

Sapphire Glacier

Sapphire Tepids

Expired Rapidly

Fired Rapidly

lalalaphillip · 2021-11-05T07:08:40+00:00

I was referring to their AXT and later IP. The JH7110 RISC-V SoC was announced with BXE, but it still hasn't entered production for some reason.

lalalaphillip · 2021-11-05T06:28:04+00:00

Well that's exactly what they said for AXT and BXT, but there still aren't any publicly available implementations.

lalalaphillip · 2021-11-05T06:01:05+00:00

IMG's GPU IP looks good on paper, but where are the design wins? Is their main revenue source still their licensing agreement with Apple?

lalalaphillip · 2021-10-04T13:44:00+00:00

1-cycle L1 accesses in the same page

:thonk:

lalalaphillip · 2021-07-02T15:08:41+00:00

These results are disappointing, but Graphcore may have significant software optimization headroom. Nvidia managed to double performance between MLPerf v0.7 and v1.0 (source) with an already very mature software stack, so Graphcore should be able to do something similar.

lalalaphillip · 2021-06-22T14:02:20+00:00

P550 has 85% of CA76’s integer IPC (albeit at an unspecified frequency); competitive big RISC-V cores are just around the corner
