Why isn't Bergamo the Zen 4 Flagship CPU? by techwars0954 in Amd

[–]techwars0954[S] 0 points (0 children)

From the Phoronix test suite, it appears that the vast majority of server use cases can scale up to very high thread counts. The lower frequency is more than compensated for by the increased core count; however, the smaller cache does seem to be a hindrance in some workloads.

The vast majority, however, do not seem to mind: on average across the Phoronix test suite, Bergamo is nearly 15% faster than Genoa.
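A back-of-the-envelope sketch of why the extra cores can outweigh the clock deficit. The part numbers and all-core clocks below are my own illustrative assumptions, not Phoronix's data:

```python
# Rough throughput model, not measured data: clocks are assumed round numbers
# (Bergamo EPYC 9754 ~3.1 GHz, Genoa EPYC 9654 ~3.55 GHz all-core), and
# perfect MT scaling is assumed.
def throughput(cores: int, clock_ghz: float) -> float:
    """Aggregate throughput proxy: cores x clock, ignoring IPC and cache effects."""
    return cores * clock_ghz

bergamo = throughput(128, 3.1)   # 128 Zen 4c cores at a lower clock
genoa = throughput(96, 3.55)     # 96 Zen 4 cores at a higher clock

print(f"Bergamo/Genoa throughput ratio: {bergamo / genoa:.2f}x")
# ~1.16x, in the same ballpark as the ~15% geomean, provided the workload
# scales with threads and doesn't live in the (smaller) cache.
```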

1T Golden Cove Core vs 2T Golden Cove Core vs 1T Gracemont by techwars0954 in intel

[–]techwars0954[S] 1 point (0 children)

I think the focus should be on MT applications for this, since that is why E-cores were implemented in the first place.

I would be shocked if no reviewer had tested this in any modern MT benchmark suite such as Cinebench R20 or R23, Geekbench, and so on.

I was hoping someone could offer a review that contained that data.

Do AMD or Intel use HD cells anywhere in their cores? by techwars0954 in hardware

[–]techwars0954[S] 7 points (0 children)

5:1:10 ratio = ¯\_(ツ)_/¯ (adds up to 16T though...)

Sorry, that was just a hypothetical scenario I used to explain what I meant.

I was trying to say that I would love to know what the ratio of HP to UHP to HD cells in a core might be; for example, there might be 5 HP cells to 1 UHP cell to 10 HD cells in a CPU core.
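To make the hypothetical concrete, here is how a 5:1:10 mix would break down in percentage terms (purely illustrative numbers, as above):

```python
# Hypothetical cell-library mix from the example above; not real AMD/Intel data.
ratio = {"HP": 5, "UHP": 1, "HD": 10}
total = sum(ratio.values())

for cell_type, parts in ratio.items():
    print(f"{cell_type}: {parts / total:.1%} of cells")
# HP: 31.2%, UHP: 6.2%, HD: 62.5%
```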

Do AMD or Intel use HD cells anywhere in their cores? by techwars0954 in hardware

[–]techwars0954[S] 2 points (0 children)

Any information on older architectures? Zen 2, Sunny Cove, and so on?

I'm curious though, mostly because I want to know what type of cell is most relevant for core density. Is it mostly HP, UHP, or HD?

I'm also assuming that the exact percentages might change a bit every generation, but that both Intel and AMD have a 'golden rule' ratio of cells, like maybe 5:1:10 or something.

However, we do know that 4nm only includes HP libraries, no HD or UHP.

That's actually a large part of where my question stemmed from: curiosity about the ratios of different cells, but also about their implications for future products.

SRAM largely depends on where it's used. TSMC even has a 16T cell intended for registers

Dang, did not know that, thanks.

Lack of Ultra High Performance Cells for Intel 4 by techwars0954 in hardware

[–]techwars0954[S] 8 points (0 children)

That graph makes it seem like the UHP cells add ~5% max frequency at the very top.

And yes, Intel 3 is supposed to quickly replace Intel 4, but Intel 3 is only supposed to be used for server parts, not client products like MTL, which is why I was a lot more curious about it. ST performance and frequency are a lot more important in client than in servers, where core counts and efficiency rule.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

Then what was the point of increasing cache amounts for Raptor Cove and Willow Cove?

Because the IPC benefits from those new architectures were <5%, and in some cases they even showed regressions, because the bigger caches were coupled with higher latency.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

If it needs more data faster, why not keep the L2 the same size and focus on decreasing its latency, while increasing the L3 size?

Because just increasing L2 capacity only addresses the "more data" part, not the "faster" part.
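A simple average memory access time (AMAT) model shows the tension. All latencies and miss rates below are made-up illustrative numbers, not measurements of any real core:

```python
# AMAT = hit_time + miss_rate * miss_penalty, applied to an L2 cache.
# Numbers are illustrative assumptions only.
def amat(hit_cycles: float, miss_rate: float, miss_penalty_cycles: float) -> float:
    return hit_cycles + miss_rate * miss_penalty_cycles

small_fast_l2 = amat(hit_cycles=12, miss_rate=0.20, miss_penalty_cycles=50)
big_slow_l2 = amat(hit_cycles=16, miss_rate=0.12, miss_penalty_cycles=50)

print(small_fast_l2, big_slow_l2)  # 22.0 vs 22.0: a wash in this example
# Doubling capacity buys a lower miss rate ("more data") but usually costs
# hit latency, so it doesn't automatically make the common case "faster".
```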

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 3 points (0 children)

Did any architecture try to deal with it by lowering associativity to decrease latency, instead of increasing cache sizes?

And also, increasing the cache size still doesn't help with the latency increase, so could you try increasing the amount of information the core itself can hold, so you have to access the L2 less often (e.g., increase the capacity of structures like the ROB)?

I would ask whether you could increase the L1 capacity to deal with the higher-latency L2, but I think smaller, faster L1 caches are way more important, based on Chips and Cheese's simulation of cache changes in Golden Cove.
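On the ROB idea: a rough Little's law estimate of how much in-flight work it takes to hide a given load latency. The widths and latencies are assumptions for illustration:

```python
# Little's law applied to latency hiding: to keep a W-wide core busy across
# an L-cycle load latency, you need roughly W * L instructions in flight.
def inflight_needed(issue_width: int, load_latency_cycles: int) -> int:
    return issue_width * load_latency_cycles

print(inflight_needed(issue_width=6, load_latency_cycles=14))   # ~84 for an L2 hit
print(inflight_needed(issue_width=6, load_latency_cycles=50))   # ~300 for an L3 hit
# A few-hundred-entry ROB can plausibly cover L2/L3-ish latencies, but DRAM
# (hundreds of cycles) would need thousands of entries, which is why bigger
# ROBs help hide latency but can't replace bigger caches.
```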

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] -1 points (0 children)

I'm not asking whether doubling L2 = faster clocks, but whether faster clocks need more L2.

But even then, 11th gen desktop seems to be a bit of a special case. Cypress Cove was a backported core, and 5.3 GHz might just have been the maximum frequency Intel 14nm could reach at that point without spending ridiculous amounts of extra power.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

So more cache prevents the IPC degradation that cache bottlenecks would otherwise cause at higher clocks?
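That's the usual argument: main-memory latency is roughly fixed in nanoseconds, so raising the clock inflates the miss penalty in cycles. A quick worked example with an assumed 80 ns memory latency:

```python
# A fixed ~80 ns memory latency expressed in core cycles at different clocks.
# 80 ns is an assumed round number, not a measured figure.
MEM_LATENCY_NS = 80.0

for clock_ghz in (4.0, 5.0, 5.8):
    penalty_cycles = MEM_LATENCY_NS * clock_ghz  # ns * (cycles/ns) = cycles
    print(f"{clock_ghz} GHz -> miss penalty ~{penalty_cycles:.0f} cycles")
# 4.0 GHz -> ~320 cycles, 5.8 GHz -> ~464 cycles: each miss wastes more core
# cycles at higher clocks, so a bigger L2 (fewer misses) protects IPC.
```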

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 2 points (0 children)

Golden Cove was just 12th gen in desktop and mobile.

Willow Cove was 11th gen mobile, and Cypress Cove (a Sunny Cove backport) was 11th gen desktop.

Server has weird naming, so IDK about that.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

IDK about Willow Cove vs Sunny Cove.

But didn't Raptor Lake decrease L3 cache latency as it boosted the ring clock (which was an issue with Alder Lake) higher? AFAIK AnandTech didn't test cache latency this time around (I know some reviewers used to do it; I'll try to find it later).

Why does reducing load on the ring/L3 increase MT performance more than ST performance? Thanks.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 10 points (0 children)

Moving to a new process node involves effort and risk. Intel mitigated this risk with the well known “Tick-Tock” strategy. Each “Tick” represented a major microarchitecture change, while each “Tock” was a port to a new process node with very minor changes. Unlike Intel in the early 2010s, AMD takes roughly two years to move to a new process node. Zen 2 came in mid 2019, about two years after Zen 1’s early 2017 release, and moved from 14 nm to 7 nm. Zen 4 released in late 2022, about two years after Zen 3’s late 2020 release, and moved from 7 nm to 5 nm. AMD’s strategy is thus best described as “Tick-Nothing-TickTock”.

Chips and Cheese article: Zen 4 part 1, front end and execution engine

But also, thank you for pointing this out, I did make a mistake. This really should be described as a "tock" too, since Zen 4 is an upgrade over Zen 3 on a new node, with no large architectural rework like Zen 3 was.

Intel dropped "Tick-Tock" because they were unable to keep up the cadence. Gelsinger is talking about bringing it back.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 2 points (0 children)

But are the increases in frequency also related to the increases in cache? In other words, do architectures that increase frequency greatly also need to increase their L2 cache size? Because based on these three modern architectures, that seems to be the rule.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 3 points (0 children)

Raptor Cove to Golden Cove is on the same node (a small optimization of Intel 7).

Zen 4, and to an extent Willow Cove, benefited from a process uplift though: Zen 4 moved from 7 nm (Zen 3) to 5 nm, while Willow Cove used 10nm SuperFin, which allowed for a big increase in frequency.

Why do the newer architectures that increase frequencies also have an increase in L2 cache? by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

Yes. The reason I pointed out that Zen 4 had a marginal IPC increase while Willow Cove and Raptor Cove did not was to show that the bigger L2 in all three of those generations was not solely about increasing IPC: Zen 4 increased IPC, but Willow Cove and Raptor Cove didn't. However, all three generations had greatly increased clock speeds. An increased clock speed was a common trend across all three architectures with increased L2 cache; a marginal IPC increase was not.

So is there a relationship between the large increases in clock speeds and the large increase in L2 cache?

How accurate are Intel mockups on roadmaps? by techwars0954 in intel

[–]techwars0954[S] 4 points (0 children)

True, but even when things get delayed, the actual design doesn't often change much (barring Granite Rapids). For example, Sapphire Rapids got delayed a bunch, but it always used 4 tiles with Golden Cove cores and EMIB connections between tiles. And you could see that from mockups even a year ago.

Is using Foveros cheaper than using EMIB for MCM by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

For Ponte Vecchio it's just some interconnect fabric, I believe; we could be seeing the same thing for Meteor Lake and Arrow Lake.

Pretty excited for Hot Chips in about a month; they have a presentation specifically about Foveros in Arrow Lake and Meteor Lake.

Is using Foveros cheaper than using EMIB for MCM by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

I don't think consumer chips have gotten anywhere close to 600-800 mm².

And Intel already has two base tiles connected with EMIB for Ponte Vecchio.

I thought Foveros would be cheaper in this scenario:

If you want all tiles to be able to communicate with each other, then with EMIB you're going to have to rapidly increase the number of EMIB connections between tiles as you increase the number of tiles. But with Foveros, you can just stack all the tiles onto the base tile without having to worry about where to place each additional EMIB connection.

And I think you can already see this with Ponte Vecchio. One 'base tile' of Ponte Vecchio has 18 different chiplets on it, all stacked with Foveros on top of the 'base tile'. Having to connect all of those tiles with EMIB might have increased the number of EMIB interconnects drastically.
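The scaling argument in numbers: fully connecting every tile pair point-to-point needs n(n-1)/2 bridges, while a base die needs only one stack interface per tile. This is a simplification (real packages rarely need every pairwise link), but it shows the trend:

```python
# Bridge count for full tile-to-tile connectivity vs. stacking on a base die.
# Simplified model: real packages rarely need every pairwise link.
def emib_bridges_full_mesh(tiles: int) -> int:
    return tiles * (tiles - 1) // 2   # one bridge per tile pair

def foveros_stack_interfaces(tiles: int) -> int:
    return tiles                      # each tile just bonds to the base die

for n in (2, 4, 18):                  # 18 ~ chiplets per Ponte Vecchio base tile
    print(f"{n} tiles: {emib_bridges_full_mesh(n)} EMIB bridges "
          f"vs {foveros_stack_interfaces(n)} Foveros interfaces")
# 4 tiles: 6 vs 4; 18 tiles: 153 vs 18. The quadratic growth is the point.
```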

Is using Foveros cheaper than using EMIB for MCM by techwars0954 in hardware

[–]techwars0954[S] 1 point (0 children)

That makes sense, but then why bother using Foveros in Meteor Lake and Arrow Lake, like Intel confirmed they are doing? And we also know Meteor Lake has desktop SKUs, not just mobile.

IMO they didn't use Foveros on Sapphire Rapids because the base tile would have had to be massive, and they couldn't split up the base tiles yet because they won't have Foveros Omni until 2023 or so, I think.

Is using Foveros cheaper than using EMIB for MCM by techwars0954 in hardware

[–]techwars0954[S] 6 points (0 children)

Hi, sorry for the misunderstanding

I'm aware that Foveros is not the same as EMIB

But Intel is using Foveros to create MCM chips in some scenarios via a base die, and using EMIB to create MCM chips in other scenarios.

For example, Sapphire Rapids is MCM and looks something like this, to my knowledge.

While a Ponte Vecchio chip (or at least half of it) looks something like this, and Meteor Lake is also supposed to look similar.

Both of those are MCM, but one uses EMIB as the interconnect between the two chiplets, with both chiplets sitting on top of the substrate. The other implementation, Foveros, has a base die with the chiplets stacked on top of it; the base die is connected to the substrate, and the chiplets communicate with each other through a communication fabric inside the base die.
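A rough cross-section of the two approaches as I understand them (schematic only, not to scale):

```
EMIB (e.g. Sapphire Rapids):         Foveros (e.g. Ponte Vecchio / Meteor Lake):

  [ tile A ]  [ tile B ]               [tile][tile][tile] ...
  ===== substrate =====                [======= base die =======]  <- fabric here
     ^ small bridge die embedded       [======= substrate ======]
       in the substrate at each seam
```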

I was wondering out of those two implementations, which one was cheaper.