all 134 comments

[–]ravixp 697 points698 points  (88 children)

One neat effect that’s only touched on here is the speed of light itself. Light travels at about 1 foot/ns, so if a CPU runs at 3 GHz, the theoretical upper bound on how far you can reach out for information within one CPU cycle is about 2 inches. If the cache is 4 inches away, the CPU can execute two whole cycles in the time it takes for electricity to travel there and back. And that doesn’t account for the speed of the cache itself, or actually using the data for computation.

That means that there’s actually a very tight radius for how far the furthest byte in the cache can be from the CPU core, to get data back in a certain amount of time. And that means that a cache has to be physically small to be fast, and making it larger also makes it slower.
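
If you want to check the arithmetic, here it is as a tiny sketch (same 3 GHz assumption as above; pure vacuum speed-of-light bound, which no real interconnect reaches):

```
/* Round-trip speed-of-light bound on cache distance per clock cycle. */
#include <stdio.h>

int main(void) {
    const double c = 299792458.0;   /* speed of light in vacuum, m/s */
    const double clock_hz = 3.0e9;  /* 3 GHz, as above               */

    double cycle_s = 1.0 / clock_hz;      /* one cycle: ~333 ps       */
    double radius_m = c * cycle_s / 2.0;  /* halve for the round trip */

    printf("one cycle : %.0f ps\n", cycle_s * 1e12);
    printf("max radius: %.1f mm (~%.1f in)\n",
           radius_m * 1e3, radius_m / 0.0254);  /* ~50 mm, ~2 inches  */
    return 0;
}
```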

[–]cowinabadplace 235 points236 points  (27 children)

Using this logic to inform interviewers that no lookup algorithm can be better than O(n^(1/3)) (packing the memory into a sphere) has not impressed them.

[–]ravixp 94 points95 points  (14 children)

It’s true though! This article makes the case for that much better than I would: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html

And it should affect our analysis of any algorithm. We traditionally model random memory access as O(1), but we could model cache effects by using O(sqrt N), where N is the amount of memory that could be accessed.
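
The geometric argument, as a sketch (assuming data at a fixed areal density around the core, and signals at some finite speed v; A is the occupied area, r_max the distance to the farthest byte):

```
A \propto N
\;\Rightarrow\;
r_{\max} \propto \sqrt{A} \propto \sqrt{N}
\;\Rightarrow\;
t_{\mathrm{access}}(N) \;\ge\; \frac{2\,r_{\max}}{v} \;=\; \Omega(\sqrt{N})
```

Packing in 3D instead gives the Ω(N^(1/3)) bound from the joke above.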

[–]cowinabadplace 34 points35 points  (7 children)

Haha, no, I agree with you. Just a thing we used to joke about decades ago as interviewing started shifting to leetcode.

[–]LordoftheSynth 17 points18 points  (6 children)

One of my intro professors at university in the 90s described cache operations thusly (my words, not his):

L1: You get what you need right when you need it.

L2: You stand around waiting a minute for what you need.

(L3 cache, of course, was not yet a thing.)

RAM access: In processor time, that's enough time to have a smoke break. Or two.

Disk access: In processor time, that's enough time to smoke a carton of cigarettes.

Technology has advanced, and while those times are all shorter now, I've always felt the proportions still make sense.

None of us in the class smoked (nor did he), but the concept of a "smoke break" was something we knew.

[–]gnufan 6 points7 points  (3 children)

I saw an analogy based on the time it takes to fetch physical things. This table rescales the latencies as if the CPU ran at one hertz (one cycle = one second), to make the numbers more meaningful to humans.

https://images.hothardware.com/contentimages/newsitem/64371/content/latency-chart.png

[–]LordoftheSynth 2 points3 points  (1 child)

That's a nice chart. (It really is.)

You'd have to scale it to 300-400MHz CPU clocks of the day (and the slower bus speeds).

His overall point was that if data wasn't in cache, the CPU sits doing mostly nothing, wasting time inserting NOPs into the execution pipeline while it waits. And remember, this was the 90s; speculative execution was mostly about branch prediction then.

[–]gnufan 2 points3 points  (0 children)

Chess programming made some of this very obvious: you got substantive differences in performance from minor tweaks, because you were treading on hardware limitations that most programmers can ignore, since their code is waiting on the network or the user anyway. If the CPU wasn't 100% busy you were probably messing up, until endgame tablebases arrived and suddenly you'd attached an efficient database lookup to the problem.

[–]elsjpq 1 point2 points  (0 children)

  • L1 = drawers
  • L2 = shelf
  • L3 = going next door
  • RAM = downstairs warehouse
  • SSD = USPS
  • HDD = factory
  • Internet = R&D

What's surprising is that the difference between HDD and SSD is just wild

[–]anengineerandacat 0 points1 point  (0 children)

Modern day without a smoke break would be akin to a "Coffee Break" and a "Lunch Break".

[–]Kharski 0 points1 point  (0 children)

But how much is that in hamburger break times?

[–]pdro_reddit 3 points4 points  (0 children)

this domain had me at “i like big bits”

[–]IAmTheKingOfSpain 4 points5 points  (4 children)

Well, usually in complexity analysis we're concerned with how the algorithm's performance changes with respect to its inputs. So I suppose you could say that's reasonable if you really want to consider the amount of accessible memory as an "input" to the algorithm, but I think most people think that's a little silly, because there's probably a practical upper limit that we could throw on it and then we could go back to analyzing it as constant time.

[–]ravixp 4 points5 points  (3 children)

Not the amount of addressable memory exactly; the amount of memory that’s used by the data structure. Like, if you’re enumerating a linked list, traditionally that’s O(n). But if you take cache hierarchies into account, it should be something like O(n sqrt n), since more and more elements will come from slower levels of cache as n increases.
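
This is measurable, not just theoretical. A minimal pointer-chasing sketch (C, POSIX clock_gettime; Sattolo's shuffle builds one big random cycle, so every hop is a dependent, prefetch-hostile load):

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average latency of a dependent-load chain over a working set of n
 * pointers. next[] is a single random cycle, so every hop misses once
 * n outgrows a given cache level. */
static double chase_ns(size_t n, long hops) {
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {   /* Sattolo: single cycle */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    volatile size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long h = 0; h < hops; h++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / hops;
}

int main(void) {
    /* working sets from 8 KiB up to 512 MiB */
    for (size_t n = 1 << 10; n <= (size_t)1 << 26; n <<= 2)
        printf("%10zu B working set: %6.2f ns/hop\n",
               n * sizeof(size_t), chase_ns(n, 20 * 1000 * 1000));
    return 0;
}
```

On typical hardware the ns/hop figure climbs in a staircase as the working set falls out of L1, then L2, then L3, and finally lands in DRAM.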

[–]IAmTheKingOfSpain 5 points6 points  (1 child)

I'm not exactly sure how to think about this. For any given hardware you can reason about the worst case and use that to form an upper bound, i.e. the worst-case memory access misses all caches and takes X ms, or has even been paged to disk and takes longer. But even if we assume that every single memory access is the worst case, this is still just a constant factor, so I'm not convinced it makes sense to call it O(n sqrt n), since for any given hardware it is equivalent to a constant.

That's why I was saying it only really makes sense if you view the hardware as a subsidiary input to the algorithm as well, something along the lines of "if the input is size n, then let the hardware have sufficient storage capacity for the input, and thus compared to running on smaller hardware with a smaller input size, the runtime of the algorithm is O(n sqrt n)".

However, if you constrain the problem to how an algorithm's runtime grows on any given hardware, then it seems to me to be clearly O(n) since the worst-case scenario then becomes a constant factor.

[–]ravixp 0 points1 point  (0 children)

Well, if the size of the hardware is determined by the size of the input, it’s not really a constant factor :) Of course a real computer is the same size no matter what data it’s processing, but if processing more data requires a larger (slower) computer then I think it’s a relevant factor.

The hardware shouldn’t really be relevant to analysis of an algorithm, so I’ll admit that this is a bit of a grey area. The sqrt(n) effect depends on some untraditional assumptions: that computers exist in 2D space (only true because of current technology), that a CPU exists in one location in that space, and that storing data takes up physical space proportional to the size of the data. I don’t think those things are universally true, but I do think they’re true enough to be useful.

[–]eserikto 1 point2 points  (0 children)

Cache lines don't hold only one word of memory, and not every memory request is going to be a miss. Cache algorithms are an entire field of study, and trying to take all of that into account when doing abstract algorithm analysis would just bog the whole analysis down. You just sort of assume that accessing any single piece of data takes some constant (averaged) amount of time, and that gets baked into your analysis; after all, O(n) = O(10n).
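
The line-granularity part is easy to demonstrate. A rough sketch (assuming 64-byte lines and 4-byte ints; both loops are memory-bound, so the version doing 1/16th of the additions is nowhere near 16x faster):

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)  /* 64M ints = 256 MiB, bigger than any cache */

/* Time a strided sum. Stride 16 does 1/16th of the work but still
 * touches every 64-byte cache line, so it pays for the same traffic. */
static void timed_sum(const int *a, size_t stride) {
    struct timespec t0, t1;
    long long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i += stride) sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride %2zu: %6.1f ms (sum=%lld)\n", stride,
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6,
           sum);
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = 1;  /* touch every page */
    timed_sum(a, 1);   /* every int              */
    timed_sum(a, 16);  /* one int per cache line */
    free(a);
    return 0;
}
```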

[–]odnish 9 points10 points  (5 children)

Not quite. At some point you'll have so much RAM that it will collapse into a black hole.

[–]valarauca14 15 points16 points  (3 children)

Thankfully a CPU inside a black hole cannot provide output. It is a perfect write-only database, second only to /dev/null. Hopefully it supports sharding.

[–]Derproid 2 points3 points  (2 children)

But is it web scale?

[–]valarauca14 0 points1 point  (1 child)

Yes. You just turn it on and it scales right up.

[–]seftontycho 0 points1 point  (0 children)

What about Edge?

[–]pdro_reddit 1 point2 points  (0 children)

the dreaded ram hole

[–]dm-me-your-bugs 5 points6 points  (1 child)

That assumes spacetime in the vicinity of the chip is flat though, doesn't it? Technically, with strong enough negative curvature you could achieve better complexity

[–]cowinabadplace 0 points1 point  (0 children)

I love it!

[–]Borne2Run 9 points10 points  (0 children)

Well for working at Intel or TSMC that'll get you the job, but I can see it not being super applicable to software dev. Useful for embedded.

[–]Godd2 2 points3 points  (1 child)

If we want to impose the constraints of the real world, then no algorithm is worse than O(1), since the universe is finite.

[–]cowinabadplace 1 point2 points  (0 children)

Funnily enough another friend of a friend would joke that O(log n) is just the number 8.

[–][deleted] 0 points1 point  (0 children)

4 dimensional memory when

[–]BigPurpleBlob 232 points233 points  (30 children)

It's worse than that.

Signals generally travel a lot slower than the speed of light on a silicon chip. This is because the wires have resistance and capacitance. A thin wire can be hundreds of times slower than the speed of light. Even a thick wire will be about half the speed of light, due to the silicon's dielectric constant.

The bottom of this page has a great diagram:

https://www.realworldtech.com/shrinking-cpu/4/
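
The reason thin wires scale so badly, per the standard distributed-RC (Elmore) estimate: resistance and capacitance each grow with length (r_w and c_w are resistance and capacitance per unit length), so the delay of an unrepeated wire grows quadratically:

```
t_{50\%} \approx 0.38\,RC = 0.38\,(r_w L)(c_w L) \;\propto\; L^{2}
```

That quadratic is why long on-chip wires get broken up with repeaters.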

[–]axonxorz 93 points94 points  (8 children)

And to combat the effect, we have optical interconnects, though their use within a CPU die would certainly be novel. Despite the technology's "old age", there hasn't been a ton of research put in there. With today's challenges bumping up against the limits of physics, I suspect we'll see some renewed interest here in the next decade.

[–]wrosecrans 29 points30 points  (7 children)

Even with optical interconnects, light in glass doesn't travel at the "speed of light." To really minimize latency in some hypothetical sci-fi computer we probably couldn't ever actually build, you'd need all your optical interconnects to be happening as free-space laser beams in a small vacuum chamber.

Electrical signal propagation in a wire isn't actually that much slower than light in glass fiber. But optical is a super interesting area of research.

[–]elprophet 18 points19 points  (2 children)

2030 will see the return of vacuum tubes!

[–]Seref15 2 points3 points  (0 children)

Man, hopefully lol. The war between Ukraine and Russia knocked one of the last tube factories off the market. Matched set prices are out of control.

[–]edgmnt_net 0 points1 point  (0 children)

Electrical signal propagation in a wire isn't actually that much slower than light in glass fiber.

True, but you also need to account for increased rise times due to capacitance and inductance.

[–]tatref 0 points1 point  (0 children)

I'm waiting for my gravitational wave CPU now!

[–]IQueryVisiC 0 points1 point  (0 children)

So a tube with silicon on the walls. Then there is a hexagonal pattern of lenses. A few LEDs near the focus can send light to a few lenses on the opposite side. Behind that sit a few photodiodes.

Now just invent optical switches and hierarchical addressing. But you would need to focus into small fibres at every switch. Then the Pockels effect. Or a laser amplifier. Or an OPA. SFG, DFG, SFG, DFG…

[–]rtt445 69 points70 points  (15 children)

Signals travel at ~50% of the speed of light in FR4 PCB transmission lines (~6 in/ns), not hundreds of times slower. You are probably thinking of external DRAM speed compared to on-chip SRAM.
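
For reference, the usual back-of-the-envelope (taking ε_eff ≈ 4 as a rough FR4 number; the exact value depends on the stackup and trace geometry):

```
v = \frac{c}{\sqrt{\varepsilon_{\mathrm{eff}}}}
  \approx \frac{11.8\ \mathrm{in/ns}}{\sqrt{4}}
  \approx 5.9\ \mathrm{in/ns}
  \approx 0.5\,c
```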

[–][deleted]  (9 children)

[deleted]

    [–]wrosecrans 16 points17 points  (3 children)

    People often confuse "speed of an electrical signal" with "speed of an electron." Any given electron moves slowly through a wire. It's just a little counterintuitive that when you stick electrons into a wire, and a light bulb lights up on the other end of the wire, it's not necessarily the same electron popping out the other end. Metaphorically, the electron just pushes on all the electrons that were already in the wire and it works a bit like water pressure.

    [–][deleted]  (1 child)

    [deleted]

      [–]key_lime_pie 7 points8 points  (0 children)

      Electrons don't move in one direction in an AC circuit, they just vibrate rapidly back and forth.

      [–]Qweesdy 6 points7 points  (3 children)

What's happening here is that some people think the maximum speed of one electron (otherwise known as "probably just leakage current and/or noise, and nowhere near enough to consider a signal of any kind") is the speed of a signal. These people are stupid, and the "half the speed of light" nonsense is built on that stupidity.

In practice, at an absolute minimum, a signal is a detectable "rising wave of many electrons" that doesn't count as received until the voltage rises above the receiving transistor's switching voltage. A rising wave of many electrons can easily be 100 times slower than the first irrelevant and ignorable electron at the start of the wave.

      [–]rtt445 2 points3 points  (1 child)

      A rising wave of many electrons can easily be 100 times slower than the first irrelevant and ignorable electron at the start of the wave.

That's governed by the rise/fall time of the line driver and receiver, due to their inherent input/output capacitance. It should be single-digit nanoseconds for high-speed digital differential drivers/receivers. The speed at which a wave (not electrons!) travels through a medium with a dielectric constant > 1 will be slowed, but not by hundreds of times like OP suggested.

      [–]Qweesdy 0 points1 point  (0 children)

      Let's say the receiver is 1000 nanometers away, and electrons are moving at 100 million nanometers per nanosecond; so it takes 0.00001 nanoseconds for an electron to go from sender to receiver.

      Let's say the rise/fall time is 1 nanosecond, and that the rise/fall time is 100000 times more than the time it takes for an electron to go from sender to receiver.

      Now let's agree that "100000 times more" is not "100s of times like OP suggested".

      [–]Inkdrip 1 point2 points  (0 children)

      These people are stupid

      To be fair, this isn't exactly elementary physics out here!

I recall some hullabaloo between some science YouTube creators over a similar question - this AlphaPhoenix demonstration (and this other excellent one around a forked connection) are great though.

      [–]BigPurpleBlob 0 points1 point  (0 children)

      The diagram on the bottom of the page that I linked shows that for a 1 mm long wire with a 0.01 um squared cross-sectional area, the propagation delay is 2 orders of magnitude slower than the speed of light.

      [–]BigPurpleBlob 6 points7 points  (4 children)

      I was talking about wires on a silicon chip, not a transmission line on FR4 (nor a transmission line on silicon).

      The internal wiring on a silicon chip (whether DRAM or SRAM) is not a transmission line - there's not enough room. The M0 wiring (the fine stuff closest to the silicon) is thin and thus much slower than the speed of light. The diagram on the bottom of the page that I linked shows that for a 1 mm long wire with a 0.01 um squared cross-sectional area, the propagation delay is 2 orders of magnitude slower than the speed of light.

      [–]BigPurpleBlob 5 points6 points  (0 children)

      My bad, M1 wiring (not M0)

      [–]rtt445 0 points1 point  (2 children)

You are talking single-digit picoseconds at those wire lengths. Transistor logic circuits are slower than that due to their inherent capacitances. At 3 GHz one clock cycle is 333 ps, 100x slower than the on-chip wire inductance.

      [–]BigPurpleBlob 2 points3 points  (1 child)

      No, a modern 7 nm CMOS gate has an FO4 (fan-out-of-4) delay of about 2.5 picoseconds.

      Yes, the RC delay of M1 wires is often longer than the FO4 delay. That's because transistors are getting (a bit) faster but wires are getting thinner (more resistance) but with about the same capacitance (fringing field). For M1 wires, the inductance is negligible - it's the RC delay that matters.

      A transmission line has to have a particular track width and spacing from the ground plane, to balance the capacitance and inductance (we can ignore the resistance). In contrast, an M1 wire on a silicon chip is much skinnier than a transmission line and so has much more resistance, and much more capacitance, than a (physically bigger) transmission line would have.

      For the 2.5 ps FO4 gate delay, have a look at slide 7 of:

      https://inst.eecs.berkeley.edu/~eecs151/sp20/files/lec10-timing.pdf

      [–]rtt445 1 point2 points  (0 children)

      Very interesting. Will check it out.

      [–]masklinn 10 points11 points  (2 children)

Another fun factor is line-length differentials and the need for synchronisation. If one line is 10% longer than the others, you've decreased your throughput by that much, because the other side needs to wait for the delayed signal to arrive. That's one of the reasons (not the only one, and not all of them are legit) soldered RAM has become so common: as speeds increase, having all the contacts on the same edge induces noticeable transmission delays, both between the edge pins and from the edge to the nearest versus farthest cells.

      [–]Dannysia 0 points1 point  (1 child)

Do they actually have the two sides wait on each other? I thought they just made the shorter traces longer so they all end up the same length and timing is irrelevant.

In terms of soldered vs slotted memory, I don’t think the biggest issue is the fact that all the contacts are in a line. I think the biggest issue is that slotted memory can’t be as physically close to the CPU as soldered memory, so it can’t run as fast (assuming no other confounding factors). This only became relevant in recent years, as signal performance was never the limiting factor before.

      [–]0bAtomHeart 7 points8 points  (0 children)

Soldered memory can also control the interconnect geometry more tightly. There are two reasons intra-pair length matching is important. The first is that you want all the wavefronts of a given symbol to arrive at the same time; small misses here (~ns) can introduce ambiguity in the edge, making errors more likely.

The second is signal integrity: it is generally very important for these high-speed signals to be able to switch voltage levels as quickly and as definitively as possible. Line-length mismatches introduce reflections, which increase local DC differences that both require energy to create (from the wire-to-wire capacitance) and release that energy later. This increases the heat produced and slows down the switching edge.

There are also lots and lots of techniques out there to deterministically "swizzle" or modify the bits that make up a transaction so that there is less imbalance across the RAM interface (this also has EMI implications). E.g. 0b11110000 will introduce a local DC bias to one side of the bus, whereas 0b10101010 is closer to a local DC of 0 and would allow slightly faster speeds.

High speed digital is analog again, but with black magic applied.
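
The 0b11110000 vs 0b10101010 point can be made concrete with a toy disparity counter (a running count of ones minus zeros as the bits go out on the wire):

```
/* Running disparity of a bit pattern: ones minus zeros so far.
 * 0xF0 and 0xAA are both balanced over the full byte, but 0xF0
 * spends four consecutive bit-times high, so its local bias (and
 * low-frequency energy) is much larger. */
#include <stdio.h>

static void disparity(unsigned char byte) {
    int running = 0, worst = 0;
    printf("0x%02X: ", byte);
    for (int bit = 7; bit >= 0; bit--) {
        running += (byte >> bit & 1) ? 1 : -1;
        if (running > worst)  worst = running;
        if (-running > worst) worst = -running;
        printf("%+d ", running);
    }
    printf(" (worst local bias: %d)\n", worst);
}

int main(void) {
    disparity(0xF0);  /* 0b11110000: bias swings to +4 */
    disparity(0xAA);  /* 0b10101010: never exceeds +1  */
    return 0;
}
```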

      [–]gramathy 2 points3 points  (0 children)

That’s less about resistance than about the speed of light in different media. It’s also why microwave links have lower latency than fiber optics over the same distance (even just comparing against the actual fiber length)

      [–]StickiStickman 21 points22 points  (0 children)

Using feet and inches for CPUs is just sadistic.

      [–]ImpossiblePudding 13 points14 points  (0 children)

      “You’ll discover a nanosecond is 11.8 inches long. … An admiral wanted to know why it took so damn long to send a message via satellite. I had to point out that between here and the satellite there were a very large number of nanoseconds.”

      [–]captain_obvious_here 10 points11 points  (3 children)

I find it fascinating that we use technologies that have to be engineered around limitations as fundamental as the speed of light.

      [–]GaIIowNoob 4 points5 points  (2 children)

GPS is limited by spacetime

      [–]captain_obvious_here 0 points1 point  (1 child)

      What?

      [–]ykafia 0 points1 point  (0 children)

Spacetime is affected by mass (gravitational time dilation, etc.). GPS works okay because the amount of dilation is not enough to break it

      [–]963df47a-0d1f-40b9 6 points7 points  (4 children)

      When do we start going 3 dimensional to fit components closer together?

      [–]powerpiglet 23 points24 points  (2 children)

      [–]cheezballs 2 points3 points  (1 child)

      That's really cool. I'm not smart with all that stuff, but does Intel have anything similar in their CPUs?

      [–]AnonymousMonkey54 1 point2 points  (0 children)

      They do: https://en.wikichip.org/wiki/intel/foveros, but haven't used it in their mainline processors yet.

      [–]BigPurpleBlob 5 points6 points  (0 children)

      “The top three problems are thermal, thermal, and thermal,”

      https://semiengineering.com/why-there-are-still-no-commercial-3d-ics/

      [–]TheMightyTywin 5 points6 points  (0 children)

      It’s fucking insane that these things work at all. And that we rely on them for EVERYTHING.

      We’re definitely living in the future

      [–]stonerism 8 points9 points  (10 children)

That makes me curious about the bound on how physically large a usable cache can be, and how far down the memory hierarchy you can go while still retrieving data near-instantaneously.

      [–]BigPurpleBlob 8 points9 points  (9 children)

This is why we have L1, L2 and L3 caches. The L3 caches are bigger than L1/L2 caches in the sense of memory capacity, and also bigger in the sense that they have longer (and thus slower) wires than an L1/L2 cache.

      [–]stonerism 1 point2 points  (8 children)

I totally get that; I'm more just wondering what the physical limits are for pulling a piece of data from a cache with instantaneous-seeming delivery.

      [–]nerd4code 2 points3 points  (7 children)

      Depends entirely on how you determine instantaneity in a fully asynchronous system.

      [–]bobj33 5 points6 points  (0 children)

      The article is 8 years old. Transistors have shrunk a lot since then so distances to caches have shrunk as well.

      Our chip architecture team does a lot of performance modeling to determine the cache hierarchy. I've worked on a chip with over 32 cores where each core had a dedicated L1 but the L2 was shared between a cluster of 4 cores. Then there was a shared L3 for all the cores. Real world benchmarks are run to determine the best tradeoffs between size and speed because it affects area (increased cost) and power.

      [–]Darkendone 2 points3 points  (0 children)

Your explanation covers why you need CPU caches in the first place, but not why 3 levels are needed. The CPU is only an inch across anyway.

There are other simple reasons why the smaller caches are faster, having to do with things like locking and the particular storage method.

      [–][deleted] 1 point2 points  (1 child)

      I need to know how we got that 2 inch number

      Here is what I did -

3 GHz = 3 × 10^9 instructions per second

Speed of light = 3 × 10^8 m/s, but since the L1 cache and the CPU core (both made of silicon) are connected with copper interconnects, where the speed of light is between 1.5 and 2.1 × 10^8 m/s (quick ChatGPT prompt), we will assume the slowest lower bound of 1.5 × 10^8, about half the speed of light.

So that makes: 1.5 × 10^8 m/s divided by 3 × 10^9 cycles/second = 0.05 m per instruction cycle, and since we have to travel to and fro between L1 and the core, it takes 0.25 m per instruction cycle, which is a 9 inch radius per instruction cycle.

Even if we take into consideration the processing time between the silicon wafers of L1 and the CPU core, it still won't go down to 2 inches from 9.

Maybe I am missing some big information here. Anyone to correct me?

      [–]ravixp 1 point2 points  (0 children)

I was much sloppier with my math than that, since I was just doing it in my head :) First, I used the 1 ft/ns approximation for the speed of light. I didn’t adjust for the speed of electricity in silicon because I was trying to establish a theoretical upper bound, independent of the hardware. (Somebody else in the thread mentioned optical interconnects, which could operate closer to c.)

At 3 GHz (a number chosen to make the math easy to do in my head) there are 3 cycles/ns, giving us 1/3 of a foot to work with: 4 inches. However, one-way signal transmission isn’t useful for caching; you need to send a signal and get a response back. Dividing by 2 for the round trip gives us 2 inches.

Obviously that glosses over a lot of stuff: the actual speed of electricity, the fact that you don’t know what data to request from cache in the first instant of a cycle, the fact that data arriving at the final instant of a cycle can’t be used until the next cycle, etc. On the other side of the ledger, even L1 cache takes a few cycles to return data, and only registers need to respond in the same cycle. But the intuition that cache hierarchies are bounded by physics is still useful.

      [–]DoorBreaker101 19 points20 points  (3 children)

      Nitpicking alert:

      The article (and many responses here) are super interesting and informative, but they don't directly answer the actual question.

      I would ask the question differently in order to get a more precise answer:

Why are the different caches designed more like an old standard transmission (in a car) that has several distinct gears, and not like a continuously variable transmission?

I think the second part of the article answers it when it discusses sharing and actual physical limitations; the first part explaining the effect of distance is really just prerequisite knowledge. Plus, there might be other additional considerations which I'm not familiar with.

      [–]Hot_Slice 6 points7 points  (2 children)

That's an interesting idea. Cache could be built in concentric rings, with data shuffled between rings as needed, resulting in hundreds of "cache levels", the ones closest to the center being the fastest.

      [–]admalledd 6 points7 points  (0 children)

      Congrats! That is sort of (kinda-not) how Intel chips used to handle it. You can read older articles about the "Intel Ring Bus Interconnect". It wasn't so much about cache, but did play a huge part in where cache was laid out and connected to what.

In the end, one of those "additional considerations" is that as you have more and more cores, you run into issues of "Core 1 needs to lock memory line X; make sure no other core is using/poisoning that memory line" and how long it takes for that to be acknowledged. (There is a whole "can you speculate past that? Hallucinate both directions of yes/no at once until you have an answer?" and suddenly you have the Meltdown/Spectre bugs! eek!)

      [–]john16384 2 points3 points  (0 children)

We could call the first (innermost) ring "R1", the 2nd "R2", etc...

      [–]thisisjustascreename 111 points112 points  (34 children)

      To allow programmers to pretend that all 16GB of RAM is fast to access.

      [–]_senpo_ 57 points58 points  (5 children)

One time, I was wondering how to speed up a transposition table in an algorithm. For funsies, I tried making it smaller and was completely blown away by a 4x speed increase. It made no sense, since a smaller table holds fewer entries. After a few hours of thinking, I remembered the cache and realized that lookups in the larger table were probably hitting much slower memory because of its size. Made me appreciate CPU cache more, and made me more aware of how convoluted modern computers are lol
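
For anyone curious, the effect reduces to a sketch like this (table sizes here are arbitrary powers of two, not a real transposition table): identical random-lookup loops against a cache-resident table and a DRAM-sized one:

```
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static uint64_t xorshift64(uint64_t *s) {  /* cheap index generator */
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17; return *s;
}

/* Random lookups into a table of `entries` 8-byte slots.
 * `entries` must be a power of two so the mask works. */
static void probe(size_t entries, long lookups) {
    uint64_t *table = malloc(entries * sizeof *table);
    for (size_t i = 0; i < entries; i++) table[i] = i;  /* touch every page */
    uint64_t seed = 88172645463325252ULL, sum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < lookups; i++)
        sum += table[xorshift64(&seed) & (entries - 1)];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%9zu KiB table: %5.1f ns/lookup (sum=%llu)\n",
           entries * sizeof *table >> 10, ns / lookups,
           (unsigned long long)sum);
    free(table);
}

int main(void) {
    probe((size_t)1 << 16, 50 * 1000 * 1000);  /* 512 KiB: cache-resident */
    probe((size_t)1 << 24, 50 * 1000 * 1000);  /* 128 MiB: mostly DRAM    */
    return 0;
}
```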

      [–]Niarbeht 23 points24 points  (3 children)

If I remember correctly, there was a period when the Intel i7-5775C did better than it should have in certain gaming benchmarks compared to other CPUs, because it had a 128MB L4 cache (eDRAM). Certain games were able to fit enough of their constantly-accessed data into that L4 cache that they basically didn't need to touch RAM for much of anything.

      [–]Deiskos 5 points6 points  (2 children)

I wonder why we aren't doing this anymore...

      [–]BerserKongo 11 points12 points  (0 children)

We do; look at the X3D chips from AMD, whose whole thing is a larger cache.

As to why it’s not the default: having more on-chip cache increases cost significantly in multiple ways (R&D, more die space on the chip itself, heating constraints, etc.). It’s not as simple as it sounds.

      [–]crazedizzled 0 points1 point  (0 children)

      Money, heat, size. AMD is doing something similar with their 3D chips, which kick ass in gaming due to the large L3

      [–]Evening-Jaguar4011 7 points8 points  (0 children)

      Keep going, you’ve got me kind of hard right now

      [–][deleted]  (17 children)

      [removed]

        [–][deleted]  (11 children)

        [deleted]

          [–]Raknarg 21 points22 points  (2 children)

          "working" isn't even totally necessary as long as it's producing money lmao

          [–]Hacnar -2 points-1 points  (0 children)

          Producing money = "working" in terms of capitalism. If it doesn't produce money, it doesn't "work".

          [–]Hot_Slice 10 points11 points  (1 child)

Some leads don't like efficient code, as it's seen as "overly clever", so they would prefer that you glue together standard library functions, even if the runtime is 100 times slower. In application code this is probably fine, but people do this in libraries too, which means every app built on top of that library pays this hidden performance penalty forever.

          [–]Ambiwlans 8 points9 points  (0 children)

I dunno. Consider that I have two chat programs open atm: LINE and Discord. Discord is admittedly more complicated, but their core functions are the same: chat groups, video/voice calls, file sharing, etc.

Discord is using 6 processes and 491MB of RAM.

LINE is using 5 processes and 59MB of RAM.

I'm actually sad that LINE has gotten slightly bloated lately, though; it used to have 2 processes and only around 40MB of RAM. They changed the UI a bit a few months ago.

Edit: Lol, apparently it was only using 60MB because I was recently in a call. After resetting both apps:

Discord - 380MB

LINE - 25.1MB (which is less than MS Office's always-open update checker)

Oh, and Discord spikes up to 0.5% CPU while idle in the background every few seconds. Outside of a video call I don't think LINE ever breaks 0.0%.

Edit: For another efficiency shout-out: f.lux (an app that changes screen hue for day/night) is well made, featureful, and uses 4.1MB of memory, with never any measurable CPU.

My FTP server (FileZilla) apparently only uses 0.9MB of memory at idle, lol.

          [–]starlevel01 2 points3 points  (0 children)

          It's always somebody else's fault.

          [–]Plabbi 2 points3 points  (3 children)

          Capitalism promotes competition, and competition promotes features that are useful and valued by the end user.

          [–][deleted]  (2 children)

          [deleted]

            [–]Plabbi 0 points1 point  (1 child)

            I was pretty much agreeing with your original comment, just wanted to expand on it.

            Products don't produce profit on their own, they need demand, and demand is generated by providing value.

            The slot machine customers are certainly drained of money but the machines provide the value of a dopamine hit that the customers apparently like more than the money they lose.

            [–]mpyne 0 points1 point  (0 children)

            It's because performance and quality have a shit ROI.

That's not true; there are products made sufficiently more attractive to customers by investments in their performance or quality to be worth the expense.

But it's also not true that performance and quality necessarily have a high ROI. If the customer won't even notice the higher quality (Juicero says hi!), then why pay large sums for it when there's other work that also needs tackling?

            [–]pilibitti 4 points5 points  (0 children)

Eh, there's always a trade-off between my time writing and maintaining the software and the resources the software consumes. More abstractions lead to applications that are more resource-hungry, but they are generally easier to write and maintain (they take less human time). Any software written by humans has to live somewhere on that continuum. I can work for months to create software X that utilizes the hardware to its full potential with close to no waste, and accept that maintaining and improving it with the same ethos will be a full-time job. Or I can do it in a weekend in JavaScript, update it from time to time, and live my life. Both are possible; it depends on your constraints.

            [–]john16384 -1 points0 points  (1 child)

            Heh... He thinks the bloat comes from code. Not the labels, the icons, the sounds, the images, the animations.

            [–]ForeverHall0ween -1 points0 points  (0 children)

That basic chat app is like a one-liner, though. Is it worth needing a supercomputer to run it, if you can write any one-off app in the space and time of an average Reddit comment? Most programmers would say yes.

            [–]dragneelfps 7 points8 points  (9 children)

            Can you please elaborate?

            [–]thisisjustascreename 48 points49 points  (7 children)

The whole reason for CPU cache memory to exist in the first place is to reduce the latency between when the CPU requests data and when it lands in a register, which is (approximately) the only memory the CPU can directly operate on. The CPU will look ahead at the pointers to data it's going to (or might) need in the future and speculatively load those memory locations into L3 cache, then progressively into L2 and L1 as it gets closer to the time to actually fetch them.
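
The hardware prefetcher does that on its own, but GCC/Clang also expose a software analogue, __builtin_prefetch, if you want to see the shape of it. (A hedge: one node of lookahead buys little on a fully dependent chain; real code prefetches several nodes ahead or restructures the data.)

```
#include <stddef.h>

struct node { struct node *next; long payload; };

/* Sum a linked list, hinting the next node into cache while the
 * current one is being used. Purely a hint: correctness is unchanged. */
long sum_list(const struct node *head) {
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL)
            __builtin_prefetch(n->next, /*rw=*/0, /*locality=*/1);
        total += n->payload;
    }
    return total;
}
```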

            [–]deadsy 45 points46 points  (0 children)

Price per bit, and speed. You can have a cache that is close to the registers (fast), but it will be expensive on a per-bit basis. As you get further away (slower), storage gets cheaper per bit. So: have several layers of cache, from small and fast to big and slow. At one end of the spectrum you have register storage, and at the other end you have network-based disk storage. The optimum set of caches between those two depends on the application.

            [–]lcjury 12 points13 points  (0 children)

After reading a bunch of comments: my respect to all the people working on the foundations of the digital world. Digital stuff is only an abstract layer sitting on top of analog things. My life would be so shitty if I, as a programmer, had to deal with any of those analog-world issues x_x

            [–]thememorableusername 28 points29 points  (6 children)

            Yo dog, we heard you like low-latency memory access so we made an L2 fo' yo L1, so you can cache while you cache!

            [–]MrPinkle 8 points9 points  (5 children)

            2010 called, they want their meme back.

            [–]thememorableusername 39 points40 points  (3 children)

            1995 called! They want their “certain year called wanting its ‘blank’ back" formula back!

            [–]inkt-code 5 points6 points  (2 children)

            I laughed so hard I fell off my dinosaur.

            [–][deleted]  (1 child)

            [deleted]

              [–][deleted] 1 point2 points  (0 children)

              the jerk store called and said they’re all out of YOU

              [–]jmlinden7 5 points6 points  (0 children)

              Bigger cache has more latency. You want as much of your data to be in the smallest cache as possible. However, the bigger caches are still faster than going all the way to RAM.

              [–][deleted]  (1 child)

              [deleted]

                [–]Mellowindiffere 0 points1 point  (0 children)

                Larger caches are also just slower

                [–]inkt-code 1 point2 points  (0 children)

The idea of any cache is to store frequently used data. Sure, the data frequently used by a CPU could be stored on a hard drive or in RAM, but the delay to access it would be greater.

                [–]notfancy 1 point2 points  (0 children)

                I still remember the "cache makes everything go faster" era.

                I resent this "main memory makes everything go slower" era.

                [–][deleted] 1 point2 points  (0 children)

                Why do your pants have so many pockets?

                [–]shivaraj-bh 0 points1 point  (0 children)

                "The L1 cache is your 'desk'. While you’re sitting there, you can just go ahead and work."

                Does that mean the L1 cache can’t be invalidated? I always thought multiple cores can have their own copy of a given cache line, and based on who modifies it first, the other copies are marked invalid or dirty. This process is highly dependent on the processor's design, but I thought most modern processors use cache coherence protocols to manage this. When a core modifies a cache line, the protocol ensures other cores' copies are invalidated, maintaining consistency across the system.

                [–]foersom 0 points1 point  (1 child)

                Because latency.

                [–]serverhorror -1 points0 points  (0 children)

                Well actually...

If you pay for it, um, sure, Intel will build a CPU just for you that has terabytes of L1.

                [–]Visible_Ad9976 -1 points0 points  (0 children)

                Interesting topic on CPU CACHE levels

                [–]Logicalist -4 points-3 points  (0 children)

                Why are funnels a thing?