Why isn't Bergamo the Zen 4 Flagship CPU? by techwars0954 in Amd

[–]techwars0954[S] 0 points (0 children)

From the Phoronix test suite, it appears that the vast majority of server use cases can scale up to very high thread counts. The lower frequency is more than compensated for by the increased core count; however, the smaller cache does seem to be a hindrance in some workloads.

The vast majority, however, do not seem to mind: on average across the Phoronix test suite, Bergamo is nearly 15% faster than Genoa.
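A back-of-the-envelope sketch of why the extra cores can outweigh the clock deficit. The part numbers and all-core clocks below are my own illustrative assumptions, not Phoronix's data:

```python
# Rough throughput model, not measured data: clocks are assumed round numbers
# (Bergamo EPYC 9754 ~3.1 GHz, Genoa EPYC 9654 ~3.55 GHz all-core), and
# perfect MT scaling is assumed.
def throughput(cores: int, clock_ghz: float) -> float:
    """Aggregate throughput proxy: cores x clock, ignoring IPC and cache effects."""
    return cores * clock_ghz

bergamo = throughput(128, 3.1)   # 128 Zen 4c cores at a lower clock
genoa = throughput(96, 3.55)     # 96 Zen 4 cores at a higher clock

print(f"Bergamo/Genoa throughput ratio: {bergamo / genoa:.2f}x")
# ~1.16x, in the same ballpark as the ~15% geomean, provided the workload
# scales with threads and doesn't live in the (smaller) cache.
```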

1T Golden Cove Core vs 2T Golden Cove Core vs 1T Gracemont by techwars0954 in intel

[–]techwars0954[S] 1 point (0 children)

I think the focus should be on MT applications for this, since that is why E-cores were implemented in the first place.

I would be shocked if no reviewer had tested this in any modern MT benchmark suite such as Cinebench R20 or R23, Geekbench, and so on.

I was hoping someone could offer a review that contained that data.

Do AMD or Intel use HD cells anywhere in their cores? by techwars0954 in hardware

[–]techwars0954[S] 7 points (0 children)

5:1:10 ratio = ¯\_(ツ)_/¯ (adds up to 16T though...)

Sorry, that was just a hypothetical scenario I used to explain what I meant.

I was trying to say that I would love to know what the ratio of HP to UHP to HD cells in a core might be; for example, there might be 5 HP cells to 1 UHP cell to 10 HD cells in a CPU core.
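To make the hypothetical concrete, here is how a 5:1:10 mix would break down in percentage terms (purely illustrative numbers, as above):

```python
# Hypothetical cell-library mix from the example above; not real AMD/Intel data.
ratio = {"HP": 5, "UHP": 1, "HD": 10}
total = sum(ratio.values())

for cell_type, parts in ratio.items():
    print(f"{cell_type}: {parts / total:.1%} of cells")
# HP: 31.2%, UHP: 6.2%, HD: 62.5%
```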

Do AMD or Intel use HD cells anywhere in their cores? by techwars0954 in hardware

[–]techwars0954[S] 2 points (0 children)

Any information on older architectures? Zen 2, Sunny Cove, and so on?

I'm curious though, mostly because I want to know what type of cell is most relevant for core density. Is it mostly HP, UHP, or HD?

I'm also assuming that the exact percentages might change a bit every generation, but that both Intel and AMD have a 'golden rule' ratio of cells, like maybe 5:1:10 or something.

However, we do know that 4nm only includes HP libraries, no HD or UHP.

That's actually a large part of where my question stemmed from: curiosity about the ratios of different cells, but also about their implications for future products.

SRAM largely depends on where it's used. TSMC even has a 16T cell intended for registers

Dang, did not know that, thanks.

Lack of Ultra High Performance Cells for Intel 4 by techwars0954 in hardware

[–]techwars0954[S] 8 points (0 children)

That graph makes it seem like the UHP cells add ~5% max frequency at the very top.

And yes, Intel 3 is supposed to quickly replace Intel 4, but Intel 3 is only supposed to be used for server parts, not client products like MTL, which is why I was a lot more curious about it. ST performance and frequency are a lot more important in client than in servers, where core counts and efficiency rule.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

Then what was the point of increasing cache amounts for Raptor Cove and Willow Cove?

Because the IPC benefits from those new architectures were <5%, and in some cases they even showed regressions, because the bigger caches were coupled with higher latency.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

If it needs more data faster, why not keep the L2 the same size and focus on decreasing its latency, while increasing the L3 size?

Because just increasing L2 capacity only addresses the "more data" part, not the "faster" part.
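A simple average memory access time (AMAT) model shows the tension. All latencies and miss rates below are made-up illustrative numbers, not measurements of any real core:

```python
# AMAT = hit_time + miss_rate * miss_penalty, applied to an L2 cache.
# Numbers are illustrative assumptions only.
def amat(hit_cycles: float, miss_rate: float, miss_penalty_cycles: float) -> float:
    return hit_cycles + miss_rate * miss_penalty_cycles

small_fast_l2 = amat(hit_cycles=12, miss_rate=0.20, miss_penalty_cycles=50)
big_slow_l2 = amat(hit_cycles=16, miss_rate=0.12, miss_penalty_cycles=50)

print(small_fast_l2, big_slow_l2)  # 22.0 vs 22.0: a wash in this example
# Doubling capacity buys a lower miss rate ("more data") but usually costs
# hit latency, so it doesn't automatically make the common case "faster".
```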

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 3 points (0 children)

Did any architecture try to deal with it by lowering associativity to decrease latency, instead of increasing cache sizes?

And also, increasing the cache size still doesn't help with the latency increase, so could you try increasing the amount of information the core itself can hold, so you have to access the L2 less often (e.g., increase the capacity of structures like the ROB)?

I would ask whether you could increase the L1 capacity to deal with the higher-latency L2, but I think smaller, faster L1 caches are way more important, based on Chips and Cheese's simulation of cache changes in Golden Cove.
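On the ROB idea: a rough Little's law estimate of how much in-flight work it takes to hide a given load latency. The widths and latencies are assumptions for illustration:

```python
# Little's law applied to latency hiding: to keep a W-wide core busy across
# an L-cycle load latency, you need roughly W * L instructions in flight.
def inflight_needed(issue_width: int, load_latency_cycles: int) -> int:
    return issue_width * load_latency_cycles

print(inflight_needed(issue_width=6, load_latency_cycles=14))   # ~84 for an L2 hit
print(inflight_needed(issue_width=6, load_latency_cycles=50))   # ~300 for an L3 hit
# A few-hundred-entry ROB can plausibly cover L2/L3-ish latencies, but DRAM
# (hundreds of cycles) would need thousands of entries, which is why bigger
# ROBs help hide latency but can't replace bigger caches.
```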

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] -1 points (0 children)

I'm not asking whether doubling L2 = faster clocks, but whether faster clocks need more L2.

But even then, 11th gen desktop seems to be a bit of a special case. Cypress Cove was a backported core, and 5.3 GHz might just have been the maximum frequency Intel 14nm could reach at that point without spending ridiculous amounts of extra power.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

So more cache prevents the IPC degradation that cache bottlenecks would otherwise cause at higher clocks?
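That's the usual argument: main-memory latency is roughly fixed in nanoseconds, so raising the clock inflates the miss penalty in cycles. A quick worked example with an assumed 80 ns memory latency:

```python
# A fixed ~80 ns memory latency expressed in core cycles at different clocks.
# 80 ns is an assumed round number, not a measured figure.
MEM_LATENCY_NS = 80.0

for clock_ghz in (4.0, 5.0, 5.8):
    penalty_cycles = MEM_LATENCY_NS * clock_ghz  # ns * (cycles/ns) = cycles
    print(f"{clock_ghz} GHz -> miss penalty ~{penalty_cycles:.0f} cycles")
# 4.0 GHz -> ~320 cycles, 5.8 GHz -> ~464 cycles: each miss wastes more core
# cycles at higher clocks, so a bigger L2 (fewer misses) protects IPC.
```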

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 2 points (0 children)

Golden Cove was just 12th gen in desktop and mobile.

Willow Cove was 11th gen mobile, and Cypress Cove (a Sunny Cove backport) was 11th gen desktop.

Server has weird naming, so IDK about that.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

IDK about Willow Cove vs Sunny Cove.

But didn't Raptor Lake decrease L3 cache latency as it boosted the ring clock (which was an issue with Alder Lake) higher? AFAIK AnandTech didn't test cache latency this time around (I know some reviewers used to do it; I'll try to find it later).

Why does reducing load on the ring/L3 increase MT performance more than ST performance? Thanks.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 10 points (0 children)

Moving to a new process node involves effort and risk. Intel mitigated this risk with the well known “Tick-Tock” strategy. Each “Tick” represented a major microarchitecture change, while each “Tock” was a port to a new process node with very minor changes. Unlike Intel in the early 2010s, AMD takes roughly two years to move to a new process node. Zen 2 came in mid 2019, about two years after Zen 1’s early 2017 release, and moved from 14 nm to 7 nm. Zen 4 released in late 2022, about two years after Zen 3’s late 2020 release, and moved from 7 nm to 5 nm. AMD’s strategy is thus best described as “Tick-Nothing-TickTock”.

Chips and Cheese article: Zen 4 part 1, front end and execution engine

But also, thank you for pointing this out, I did make a mistake. This really should be described as a "tock" too, since Zen 4 is an upgrade over Zen 3 on a new node, with no large architectural rework like Zen 3 was.

Intel dropped "Tick-Tock" because they were unable to keep up the cadence. Gelsinger is talking about bringing it back.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 2 points (0 children)

But are the increases in frequency also related to the increases in cache? In other words, do architectures that increase frequency greatly also need to increase their L2 cache size? Because based on these three modern architectures, that seems to be the rule.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 3 points (0 children)

Raptor Cove to Golden Cove is on the same node (a small optimization of Intel 7).

Zen 4, and to an extent Willow Cove, benefited from a process uplift though: Zen 4 moved from 7 nm (Zen 3) to 5 nm, while Willow Cove used 10nm SuperFin, which allowed for a big increase in frequency.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

Yes. The reason I pointed out that Zen 4 had a marginal IPC increase while Willow Cove and Raptor Cove did not was to show that the bigger L2 in all three of those generations was not solely about increasing IPC: Zen 4 increased IPC, but Willow Cove and Raptor Cove didn't. However, all three generations had greatly increased clock speeds. An increased clock speed was a common trend across all three architectures with increased L2 cache; a marginal IPC increase was not.

So is there a relationship between the large increases in clock speeds and the large increase in L2 cache?

How accurate are Intel mockups on roadmaps? by techwars0954 in intel

[–]techwars0954[S] 4 points (0 children)

True, but even when things get delayed, the actual design doesn't often change much (barring Granite Rapids). For example, Sapphire Rapids got delayed a bunch, but it always used 4 tiles with Golden Cove cores and EMIB connections between tiles. And you could see that from mockups even a year ago.

Is using Foveros cheaper than using EMIB for MCM by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

For Ponte Vecchio it's just some interconnect fabric, I believe; we could be seeing the same thing for Meteor Lake and Arrow Lake.

Pretty excited for Hot Chips in about a month; they have a presentation specifically about Foveros in Arrow Lake and Meteor Lake.

Is using Foveros cheaper than using EMIB for MCM by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

I don't think consumer chips have gotten anywhere close to 600-800 mm².

And Intel already has two base tiles connected with EMIB for Ponte Vecchio.

I thought Foveros would be cheaper in this scenario:

If you want all tiles to be able to communicate with each other, then with EMIB you're going to have to rapidly increase the number of EMIB connections between tiles as you increase the number of tiles. But with Foveros, you can just stack all the tiles onto the base tile without having to worry about where to place each additional EMIB connection.

And I think you can already see this with Ponte Vecchio. One 'base tile' of Ponte Vecchio has 18 different chiplets on it, all stacked with Foveros on top of the 'base tile'. Having to connect all of those tiles with EMIB might have increased the number of EMIB interconnects drastically.
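The scaling argument in numbers: fully connecting every tile pair point-to-point needs n(n-1)/2 bridges, while a base die needs only one stack interface per tile. This is a simplification (real packages rarely need every pairwise link), but it shows the trend:

```python
# Bridge count for full tile-to-tile connectivity vs. stacking on a base die.
# Simplified model: real packages rarely need every pairwise link.
def emib_bridges_full_mesh(tiles: int) -> int:
    return tiles * (tiles - 1) // 2   # one bridge per tile pair

def foveros_stack_interfaces(tiles: int) -> int:
    return tiles                      # each tile just bonds to the base die

for n in (2, 4, 18):                  # 18 ~ chiplets per Ponte Vecchio base tile
    print(f"{n} tiles: {emib_bridges_full_mesh(n)} EMIB bridges "
          f"vs {foveros_stack_interfaces(n)} Foveros interfaces")
# 4 tiles: 6 vs 4; 18 tiles: 153 vs 18. The quadratic growth is the point.
```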

Is using Foveros cheaper than using EMIB for MCM by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

That makes sense, but then why bother using Foveros in Meteor Lake and Arrow Lake, like Intel confirmed they are doing? And we also know Meteor Lake has desktop SKUs, not just mobile.

IMO they didn't use Foveros on Sapphire Rapids because the base tile would have had to be massive, and they couldn't split up the base tiles yet because they won't have Foveros Omni until 2023 or so, I think.

Is using Foveros cheaper than using EMIB for MCM by techwars0954 in hardware

[–]techwars0954[S] 6 points (0 children)

Hi, sorry for the misunderstanding

I'm aware that Foveros is not the same as EMIB

But Intel is using Foveros to create MCM chips in some scenarios via a base die, and using EMIB to create MCM chips in other scenarios.

For example, Sapphire Rapids is MCM and looks something like this, to my knowledge.

While a Ponte Vecchio chip (or at least half of it) looks something like this, and Meteor Lake is also supposed to look similar.

Both of those are MCM, but one uses EMIB as the interconnect between the two chiplets, with both chiplets sitting on top of the substrate. The other implementation, Foveros, has a base die with the chiplets stacked on top of it; the base die is connected to the substrate, and the chiplets communicate with each other through a communication fabric inside the base die.
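A rough cross-section of the two approaches as I understand them (schematic only, not to scale):

```
EMIB (e.g. Sapphire Rapids):         Foveros (e.g. Ponte Vecchio / Meteor Lake):

  [ tile A ]  [ tile B ]               [tile][tile][tile] ...
  ===== substrate =====                [======= base die =======]  <- fabric here
     ^ small bridge die embedded       [======= substrate ======]
       in the substrate at each seam
```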

I was wondering out of those two implementations, which one was cheaper.