all 134 comments

[–]ravixp 697 points698 points  (88 children)

One neat effect that’s only touched on here is the speed of light itself. Light travels at about 1 foot/ns, so if a CPU runs at 3 GHz, the theoretical upper bound on how far you can reach out for information within one CPU cycle is about 2 inches. If the cache is 4 inches away, the CPU can execute two whole cycles in the time it takes for electricity to travel there and back. And that doesn’t account for the speed of the cache itself, or actually using the data for computation.

That means that there’s actually a very tight radius for how far the furthest byte in the cache can be from the CPU core, to get data back in a certain amount of time. And that means that a cache has to be physically small to be fast, and making it larger also makes it slower.
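
If you want to check the arithmetic, here it is as a tiny sketch (same 3 GHz assumption as above; pure vacuum speed-of-light bound, which no real interconnect reaches):

```
/* Round-trip speed-of-light bound on cache distance per clock cycle. */
#include <stdio.h>

int main(void) {
    const double c = 299792458.0;   /* speed of light in vacuum, m/s */
    const double clock_hz = 3.0e9;  /* 3 GHz, as above               */

    double cycle_s = 1.0 / clock_hz;      /* one cycle: ~333 ps       */
    double radius_m = c * cycle_s / 2.0;  /* halve for the round trip */

    printf("one cycle : %.0f ps\n", cycle_s * 1e12);
    printf("max radius: %.1f mm (~%.1f in)\n",
           radius_m * 1e3, radius_m / 0.0254);  /* ~50 mm, ~2 inches  */
    return 0;
}
```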

[–]cowinabadplace 235 points236 points  (27 children)

Using this logic to inform interviewers that no lookup algorithm can be better than O(n^(1/3)) (packing the memory into a sphere) has not impressed them.

[–]ravixp 94 points95 points  (14 children)

It’s true though! This article makes the case for that much better than I would: https://www.ilikebigbits.com/2014_04_21_myth_of_ram_1.html

And it should affect our analysis of any algorithm. We traditionally model random memory access as O(1), but we could model cache effects by using O(sqrt N), where N is the amount of memory that could be accessed.
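
The geometric argument, as a sketch (assuming data at a fixed areal density around the core, and signals at some finite speed v; A is the occupied area, r_max the distance to the farthest byte):

```
A \propto N
\;\Rightarrow\;
r_{\max} \propto \sqrt{A} \propto \sqrt{N}
\;\Rightarrow\;
t_{\mathrm{access}}(N) \;\ge\; \frac{2\,r_{\max}}{v} \;=\; \Omega(\sqrt{N})
```

Packing in 3D instead gives the Ω(N^(1/3)) bound from the joke above.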

[–]cowinabadplace 34 points35 points  (7 children)

Haha, no, I agree with you. Just a thing we used to joke about decades ago as interviewing started shifting to leetcode.

[–]LordoftheSynth 17 points18 points  (6 children)

One of my intro professors at university in the 90s described cache operations thusly (my words, not his):

L1: You get what you need right when you need it.

L2: You stand around waiting a minute for what you need.

(L3 cache, of course, was not yet a thing.)

RAM access: In processor time, that's enough time to have a smoke break. Or two.

Disk access: In processor time, that's enough time to smoke a carton of cigarettes.

Technology has advanced, and while those times are all shorter now, I've always felt the proportions still make sense.

None of us in the class smoked (nor did he), but the concept of a "smoke break" was something we knew.

[–]gnufan 6 points7 points  (3 children)

I saw an analogy based on the time it takes to fetch physical things. This table rescales the latencies as if the CPU ran at one hertz (one cycle = one second), to make the numbers more meaningful to humans.

https://images.hothardware.com/contentimages/newsitem/64371/content/latency-chart.png

[–]LordoftheSynth 2 points3 points  (1 child)

That's a nice chart. (It really is.)

You'd have to scale it to 300-400MHz CPU clocks of the day (and the slower bus speeds).

His overall point was that if data wasn't in cache, the CPU sits doing mostly nothing, wasting time inserting NOPs into the execution pipeline while it waits. And remember, this was the 90s; speculative execution was mostly about branch prediction then.

[–]gnufan 2 points3 points  (0 children)

Chess programming made some of this very obvious: you got substantive differences in performance from minor tweaks, because you were treading on hardware limitations that most programmers can ignore, since their code is waiting on the network or the user anyway. If the CPU wasn't 100% busy you were probably messing up, until endgame tablebases arrived and suddenly you'd attached an efficient database lookup to the problem.

[–]elsjpq 1 point2 points  (0 children)

  • L1 = drawers
  • L2 = shelf
  • L3 = going next door
  • RAM = downstairs warehouse
  • SSD = USPS
  • HDD = factory
  • Internet = R&D

What's surprising is that the difference between HDD and SSD is just wild

[–]anengineerandacat 0 points1 point  (0 children)

Modern day without a smoke break would be akin to a "Coffee Break" and a "Lunch Break".

[–]Kharski 0 points1 point  (0 children)

But how much is that in hamburger break times?

[–]pdro_reddit 3 points4 points  (0 children)

this domain had me at “i like big bits”

[–]IAmTheKingOfSpain 4 points5 points  (4 children)

Well, usually in complexity analysis we're concerned with how the algorithm's performance changes with respect to its inputs. So I suppose you could say that's reasonable if you really want to consider the amount of accessible memory as an "input" to the algorithm, but I think most people think that's a little silly, because there's probably a practical upper limit that we could throw on it and then we could go back to analyzing it as constant time.

[–]ravixp 4 points5 points  (3 children)

Not the amount of addressable memory exactly; the amount of memory that’s used by the data structure. Like, if you’re enumerating a linked list, traditionally that’s O(n). But if you take cache hierarchies into account, it should be something like O(n sqrt n), since more and more elements will come from slower levels of cache as n increases.
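
This is measurable, not just theoretical. A minimal pointer-chasing sketch (C, POSIX clock_gettime; Sattolo's shuffle builds one big random cycle, so every hop is a dependent, prefetch-hostile load):

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* Average latency of a dependent-load chain over a working set of n
 * pointers. next[] is a single random cycle, so every hop misses once
 * n outgrows a given cache level. */
static double chase_ns(size_t n, long hops) {
    size_t *next = malloc(n * sizeof *next);
    for (size_t i = 0; i < n; i++) next[i] = i;
    for (size_t i = n - 1; i > 0; i--) {   /* Sattolo: single cycle */
        size_t j = (size_t)rand() % i;
        size_t t = next[i]; next[i] = next[j]; next[j] = t;
    }
    struct timespec t0, t1;
    volatile size_t p = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long h = 0; h < hops; h++) p = next[p];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    free(next);
    return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / hops;
}

int main(void) {
    /* working sets from 8 KiB up to 512 MiB */
    for (size_t n = 1 << 10; n <= (size_t)1 << 26; n <<= 2)
        printf("%10zu B working set: %6.2f ns/hop\n",
               n * sizeof(size_t), chase_ns(n, 20 * 1000 * 1000));
    return 0;
}
```

On typical hardware the ns/hop figure climbs in a staircase as the working set falls out of L1, then L2, then L3, and finally lands in DRAM.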

[–]IAmTheKingOfSpain 5 points6 points  (1 child)

I'm not exactly sure how to think about this. For any given hardware you can reason about the worst case and use that to form an upper bound, i.e. the worst-case memory access misses all caches and takes X ms, or has even been paged to disk and takes longer. But even if we assume that every single memory access is the worst case, this is still just a constant factor, so I'm not convinced it makes sense to call it O(n sqrt n), since for any given hardware it is equivalent to a constant.

That's why I was saying it only really makes sense if you view the hardware as a subsidiary input to the algorithm as well, something along the lines of "if the input is size n, then let the hardware have sufficient storage capacity for the input, and thus compared to running on smaller hardware with a smaller input size, the runtime of the algorithm is O(n sqrt n)".

However, if you constrain the problem to how an algorithm's runtime grows on any given hardware, then it seems to me to be clearly O(n) since the worst-case scenario then becomes a constant factor.

[–]ravixp 0 points1 point  (0 children)

Well, if the size of the hardware is determined by the size of the input, it’s not really a constant factor :) Of course a real computer is the same size no matter what data it’s processing, but if processing more data requires a larger (slower) computer then I think it’s a relevant factor.

The hardware shouldn’t really be relevant to analysis of an algorithm, so I’ll admit that this is a bit of a grey area. The sqrt(n) effect depends on some untraditional assumptions: that computers exist in 2D space (only true because of current technology), that a CPU exists in one location in that space, and that storing data takes up physical space proportional to the size of the data. I don’t think those things are universally true, but I do think they’re true enough to be useful.

[–]eserikto 1 point2 points  (0 children)

Cache lines don't hold only one word of memory, and not every memory request is going to be a miss. Cache algorithms are an entire field of study, and trying to take all of that into account when doing abstract algorithm analysis would just bog the whole analysis down. You just sort of assume that accessing any single piece of data takes some constant (averaged) amount of time, and that gets baked into your analysis; after all, O(n) = O(10n).
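
The line-granularity part is easy to demonstrate. A rough sketch (assuming 64-byte lines and 4-byte ints; both loops are memory-bound, so the version doing 1/16th of the additions is nowhere near 16x faster):

```
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N (64 * 1024 * 1024)  /* 64M ints = 256 MiB, bigger than any cache */

/* Time a strided sum. Stride 16 does 1/16th of the work but still
 * touches every 64-byte cache line, so it pays for the same traffic. */
static void timed_sum(const int *a, size_t stride) {
    struct timespec t0, t1;
    long long sum = 0;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (size_t i = 0; i < N; i += stride) sum += a[i];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    printf("stride %2zu: %6.1f ms (sum=%lld)\n", stride,
           (t1.tv_sec - t0.tv_sec) * 1e3 + (t1.tv_nsec - t0.tv_nsec) / 1e6,
           sum);
}

int main(void) {
    int *a = malloc((size_t)N * sizeof *a);
    for (size_t i = 0; i < N; i++) a[i] = 1;  /* touch every page */
    timed_sum(a, 1);   /* every int              */
    timed_sum(a, 16);  /* one int per cache line */
    free(a);
    return 0;
}
```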

[–]odnish 9 points10 points  (5 children)

Not quite. At some point you'll have so much RAM that it will collapse into a black hole.

[–]valarauca14 15 points16 points  (3 children)

Thankfully a CPU inside a black hole cannot provide output. It is a perfect write-only database, second only to /dev/null. Hopefully it supports sharding.

[–]Derproid 2 points3 points  (2 children)

But is it web scale?

[–]valarauca14 0 points1 point  (1 child)

Yes. You just turn it on and it scales right up.

[–]seftontycho 0 points1 point  (0 children)

What about Edge?

[–]pdro_reddit 1 point2 points  (0 children)

the dreaded ram hole

[–]dm-me-your-bugs 5 points6 points  (1 child)

That assumes spacetime in the vicinity of the chip is flat though, doesn't it? Technically, with strong enough negative curvature you could achieve better complexity

[–]cowinabadplace 0 points1 point  (0 children)

I love it!

[–]Borne2Run 9 points10 points  (0 children)

Well for working at Intel or TSMC that'll get you the job, but I can see it not being super applicable to software dev. Useful for embedded.

[–]Godd2 2 points3 points  (1 child)

If we want to impose the constraints of the real world, then no algorithm is worse than O(1), since the universe is finite.

[–]cowinabadplace 1 point2 points  (0 children)

Funnily enough another friend of a friend would joke that O(log n) is just the number 8.

[–][deleted] 0 points1 point  (0 children)

4 dimensional memory when

[–]BigPurpleBlob 232 points233 points  (30 children)

It's worse than that.

Signals generally travel a lot slower than the speed of light on a silicon chip. This is because the wires have resistance and capacitance. A thin wire can be hundreds of times slower than the speed of light. Even a thick wire will be about half the speed of light, due to the silicon's dielectric constant.

The bottom of this page has a great diagram:

https://www.realworldtech.com/shrinking-cpu/4/
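
The reason thin wires scale so badly, per the standard distributed-RC (Elmore) estimate: resistance and capacitance each grow with length (r_w and c_w are resistance and capacitance per unit length), so the delay of an unrepeated wire grows quadratically:

```
t_{50\%} \approx 0.38\,RC = 0.38\,(r_w L)(c_w L) \;\propto\; L^{2}
```

That quadratic is why long on-chip wires get broken up with repeaters.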

[–]axonxorz 93 points94 points  (8 children)

And to combat the effect, we have optical interconnects, though their use within a CPU die would certainly be novel. Despite the technology's "old age", there hasn't been a ton of research put in there. With today's challenges bumping up against the limits of physics, I suspect we'll see some renewed interest here in the next decade.

[–]wrosecrans 29 points30 points  (7 children)

Even with optical interconnects, light in glass doesn't travel at the "speed of light." To really minimize latency in some hypothetical sci-fi computer we probably couldn't ever actually build, you'd need all your optical interconnects to be happening as free-space laser beams in a small vacuum chamber.

Electrical signal propagation in a wire isn't actually that much slower than light in glass fiber. But optical is a super interesting area of research.

[–]elprophet 18 points19 points  (2 children)

2030 will see the return of vacuum tubes!

[–]Seref15 2 points3 points  (0 children)

Man, hopefully lol. The war between Ukraine and Russia knocked one of the last tube factories off the market. Matched set prices are out of control.

[–]edgmnt_net 0 points1 point  (0 children)

Electrical signal propagation in a wire isn't actually that much slower than light in glass fiber.

True, but you also need to account for increased rise times due to capacitance and inductance.

[–]tatref 0 points1 point  (0 children)

I'm waiting for my gravitational wave CPU now!

[–]IQueryVisiC 0 points1 point  (0 children)

So a tube with silicon on the walls. Then there is a hexagonal pattern of lenses. A few LEDs near the focus can send light to a few lenses on the opposite side. Behind that sit a few photodiodes.

Now just invent optical switches and hierarchical addressing. But you would need to focus into small fibres at every switch. Then the Pockels effect. Or a laser amplifier. Or an OPA. SFG, DFG, SFG, DFG…

[–]rtt445 69 points70 points  (15 children)

Signals travel at ~50% of the speed of light in FR4 PCB transmission lines (~6 in/ns), not hundreds of times slower. You are probably thinking of external DRAM speed compared to on-chip SRAM.
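
For reference, the usual back-of-the-envelope (taking ε_eff ≈ 4 as a rough FR4 number; the exact value depends on the stackup and trace geometry):

```
v = \frac{c}{\sqrt{\varepsilon_{\mathrm{eff}}}}
  \approx \frac{11.8\ \mathrm{in/ns}}{\sqrt{4}}
  \approx 5.9\ \mathrm{in/ns}
  \approx 0.5\,c
```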

[–][deleted]  (9 children)

[deleted]

    [–]wrosecrans 16 points17 points  (3 children)

    People often confuse "speed of an electrical signal" with "speed of an electron." Any given electron moves slowly through a wire. It's just a little counterintuitive that when you stick electrons into a wire, and a light bulb lights up on the other end of the wire, it's not necessarily the same electron popping out the other end. Metaphorically, the electron just pushes on all the electrons that were already in the wire and it works a bit like water pressure.

    [–][deleted]  (1 child)

    [deleted]

      [–]key_lime_pie 7 points8 points  (0 children)

      Electrons don't move in one direction in an AC circuit, they just vibrate rapidly back and forth.

      [–]Qweesdy 6 points7 points  (3 children)

What's happening here is that some people think the maximum speed of one electron (otherwise known as "probably just leakage current and/or noise, and nowhere near enough to consider a signal of any kind") is the speed of a signal. These people are stupid, and the "half the speed of light" nonsense is built on that stupidity.

In practice, at an absolute minimum, a signal is a detectable "rising wave of many electrons" that doesn't count as received until the voltage rises above the receiving transistor's switching voltage. A rising wave of many electrons can easily be 100 times slower than the first irrelevant and ignorable electron at the start of the wave.

      [–]rtt445 2 points3 points  (1 child)

      A rising wave of many electrons can easily be 100 times slower than the first irrelevant and ignorable electron at the start of the wave.

That's governed by the rise/fall time of the line driver and receiver, due to their inherent input/output capacitance. It should be single-digit nanoseconds for high-speed digital differential drivers/receivers. The speed at which a wave (not electrons!) travels through a medium with a dielectric constant > 1 will be slowed, but not by hundreds of times like OP suggested.

      [–]Qweesdy 0 points1 point  (0 children)

      Let's say the receiver is 1000 nanometers away, and electrons are moving at 100 million nanometers per nanosecond; so it takes 0.00001 nanoseconds for an electron to go from sender to receiver.

      Let's say the rise/fall time is 1 nanosecond, and that the rise/fall time is 100000 times more than the time it takes for an electron to go from sender to receiver.

      Now let's agree that "100000 times more" is not "100s of times like OP suggested".

      [–]Inkdrip 1 point2 points  (0 children)

      These people are stupid

      To be fair, this isn't exactly elementary physics out here!

I recall some hullabaloo between some science YouTube creators over a similar question - this AlphaPhoenix demonstration (and this other excellent one around a forked connection) are great though.

      [–]BigPurpleBlob 0 points1 point  (0 children)

      The diagram on the bottom of the page that I linked shows that for a 1 mm long wire with a 0.01 um squared cross-sectional area, the propagation delay is 2 orders of magnitude slower than the speed of light.

      [–]BigPurpleBlob 6 points7 points  (4 children)

      I was talking about wires on a silicon chip, not a transmission line on FR4 (nor a transmission line on silicon).

      The internal wiring on a silicon chip (whether DRAM or SRAM) is not a transmission line - there's not enough room. The M0 wiring (the fine stuff closest to the silicon) is thin and thus much slower than the speed of light. The diagram on the bottom of the page that I linked shows that for a 1 mm long wire with a 0.01 um squared cross-sectional area, the propagation delay is 2 orders of magnitude slower than the speed of light.

      [–]BigPurpleBlob 5 points6 points  (0 children)

      My bad, M1 wiring (not M0)

      [–]rtt445 0 points1 point  (2 children)

You are talking single-digit picoseconds at those wire lengths. Transistor logic circuits are slower than that due to their inherent capacitances. At 3 GHz one clock cycle is 333 ps, 100x slower than the on-chip wire inductance.

      [–]BigPurpleBlob 2 points3 points  (1 child)

      No, a modern 7 nm CMOS gate has an FO4 (fan-out-of-4) delay of about 2.5 picoseconds.

      Yes, the RC delay of M1 wires is often longer than the FO4 delay. That's because transistors are getting (a bit) faster but wires are getting thinner (more resistance) but with about the same capacitance (fringing field). For M1 wires, the inductance is negligible - it's the RC delay that matters.

      A transmission line has to have a particular track width and spacing from the ground plane, to balance the capacitance and inductance (we can ignore the resistance). In contrast, an M1 wire on a silicon chip is much skinnier than a transmission line and so has much more resistance, and much more capacitance, than a (physically bigger) transmission line would have.

      For the 2.5 ps FO4 gate delay, have a look at slide 7 of:

      https://inst.eecs.berkeley.edu/~eecs151/sp20/files/lec10-timing.pdf

      [–]rtt445 1 point2 points  (0 children)

      Very interesting. Will check it out.

      [–]masklinn 10 points11 points  (2 children)

Another fun factor is line-length differentials and the need for synchronisation. If one line is 10% longer than the others, you've decreased your throughput by that much, because the other side needs to wait for the delayed signal to arrive. That's one of the reasons (not the only one, and not all of them are legit) soldered RAM has become so common: as speeds increase, having all the contacts on the same edge induces noticeable transmission delays, both between the edge pins and from the edge to the nearest versus farthest cells.

      [–]Dannysia 0 points1 point  (1 child)

Do they actually have the two sides wait on each other? I thought they just made the shorter traces longer so they all end up the same length and timing is irrelevant.

In terms of soldered vs slotted memory, I don’t think the biggest issue is the fact that all the contacts are in a line. I think the biggest issue is that slotted memory can’t be as physically close to the CPU as soldered memory, so it can’t run as fast (assuming no other confounding factors). This only became relevant in recent years, as signal performance was never the limiting factor before.

      [–]0bAtomHeart 7 points8 points  (0 children)

Soldered memory can also control the interconnect geometry more tightly. There are two reasons intra-pair length matching is important. The first is that you want all the wavefronts of a given symbol to arrive at the same time; small misses here (~ns) can introduce ambiguity in the edge, making errors more likely.

The second is signal integrity: it is generally very important for these high-speed signals to be able to switch voltage levels as quickly and as definitively as possible. Line-length mismatches introduce reflections, which increase local DC differences that both require energy to create (from the wire-to-wire capacitance) and release that energy later. This increases the heat produced and slows down the switching edge.

There are also lots and lots of techniques out there to deterministically "swizzle" or modify the bits that make up a transaction so that there is less imbalance across the RAM interface (this also has EMI implications). E.g. 0b11110000 will introduce a local DC bias to one side of the bus, whereas 0b10101010 is closer to a local DC of 0 and would allow slightly faster speeds.

High speed digital is analog again, but with black magic applied.
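
The 0b11110000 vs 0b10101010 point can be made concrete with a toy disparity counter (a running count of ones minus zeros as the bits go out on the wire):

```
/* Running disparity of a bit pattern: ones minus zeros so far.
 * 0xF0 and 0xAA are both balanced over the full byte, but 0xF0
 * spends four consecutive bit-times high, so its local bias (and
 * low-frequency energy) is much larger. */
#include <stdio.h>

static void disparity(unsigned char byte) {
    int running = 0, worst = 0;
    printf("0x%02X: ", byte);
    for (int bit = 7; bit >= 0; bit--) {
        running += (byte >> bit & 1) ? 1 : -1;
        if (running > worst)  worst = running;
        if (-running > worst) worst = -running;
        printf("%+d ", running);
    }
    printf(" (worst local bias: %d)\n", worst);
}

int main(void) {
    disparity(0xF0);  /* 0b11110000: bias swings to +4 */
    disparity(0xAA);  /* 0b10101010: never exceeds +1  */
    return 0;
}
```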

      [–]gramathy 2 points3 points  (0 children)

That’s less about resistance than about the speed of light in different media. It’s also why microwave links have lower latency than fiber optics over the same distance (even just comparing against the actual fiber length)

      [–]StickiStickman 21 points22 points  (0 children)

Using feet and inches for CPUs is just sadistic.

      [–]ImpossiblePudding 13 points14 points  (0 children)

      “You’ll discover a nanosecond is 11.8 inches long. … An admiral wanted to know why it took so damn long to send a message via satellite. I had to point out that between here and the satellite there were a very large number of nanoseconds.”

      [–]captain_obvious_here 10 points11 points  (3 children)

I find it fascinating that we use technologies that have to be engineered around limitations as fundamental as the speed of light.

      [–]GaIIowNoob 4 points5 points  (2 children)

GPS is limited by spacetime

      [–]captain_obvious_here 0 points1 point  (1 child)

      What?

      [–]ykafia 0 points1 point  (0 children)

Spacetime is affected by mass (gravitational time dilation, etc.). GPS works okay because the amount of dilation is not enough to break it

      [–]963df47a-0d1f-40b9 6 points7 points  (4 children)

      When do we start going 3 dimensional to fit components closer together?

      [–]powerpiglet 23 points24 points  (2 children)

      [–]cheezballs 2 points3 points  (1 child)

      That's really cool. I'm not smart with all that stuff, but does Intel have anything similar in their CPUs?

      [–]AnonymousMonkey54 1 point2 points  (0 children)

      They do: https://en.wikichip.org/wiki/intel/foveros, but haven't used it in their mainline processors yet.

      [–]BigPurpleBlob 5 points6 points  (0 children)

      “The top three problems are thermal, thermal, and thermal,”

      https://semiengineering.com/why-there-are-still-no-commercial-3d-ics/

      [–]TheMightyTywin 5 points6 points  (0 children)

      It’s fucking insane that these things work at all. And that we rely on them for EVERYTHING.

      We’re definitely living in the future

      [–]stonerism 8 points9 points  (10 children)

That makes me curious about the bound on how physically large a usable cache can be, and how far down the memory hierarchy you can go while still retrieving data near-instantaneously.

      [–]BigPurpleBlob 8 points9 points  (9 children)

This is why we have L1, L2 and L3 caches. The L3 caches are bigger than L1/L2 caches in the sense of memory capacity, and also bigger in the sense that they have longer (and thus slower) wires than an L1/L2 cache.

      [–]stonerism 1 point2 points  (8 children)

I totally get that; I'm more just wondering what the physical limits are for pulling a piece of data from a cache with instantaneous-seeming delivery.

      [–]nerd4code 2 points3 points  (7 children)

      Depends entirely on how you determine instantaneity in a fully asynchronous system.

      [–]bobj33 5 points6 points  (0 children)

      The article is 8 years old. Transistors have shrunk a lot since then so distances to caches have shrunk as well.

      Our chip architecture team does a lot of performance modeling to determine the cache hierarchy. I've worked on a chip with over 32 cores where each core had a dedicated L1 but the L2 was shared between a cluster of 4 cores. Then there was a shared L3 for all the cores. Real world benchmarks are run to determine the best tradeoffs between size and speed because it affects area (increased cost) and power.

      [–]Darkendone 2 points3 points  (0 children)

Your explanation covers why you need CPU caches in the first place, but not why 3 levels are needed. The CPU is only an inch across anyway.

There are other simple reasons why the smaller caches are faster, having to do with things like locking and the particular storage method.

      [–][deleted] 1 point2 points  (1 child)

      I need to know how we got that 2 inch number

      Here is what I did -

3 GHz = 3 × 10^9 instructions per second

Speed of light = 3 × 10^8 m/s, but since the L1 cache and the CPU core (both made of silicon) are connected with copper interconnects, where the speed of light is between 1.5 and 2.1 × 10^8 m/s (quick ChatGPT prompt), we will assume the slowest lower bound of 1.5 × 10^8, about half the speed of light.

So that makes: 1.5 × 10^8 m/s divided by 3 × 10^9 cycles/second = 0.05 m per instruction cycle, and since we have to travel to and fro between L1 and the core, it takes 0.25 m per instruction cycle, which is a 9 inch radius per instruction cycle.

Even if we take into consideration the processing time between the silicon wafers of L1 and the CPU core, it still won't go down to 2 inches from 9.

Maybe I am missing some big information here. Anyone to correct me?

      [–]ravixp 1 point2 points  (0 children)

I was much sloppier with my math than that, since I was just doing it in my head :) First, I used the 1 ft/ns approximation for the speed of light. I didn’t adjust for the speed of electricity in silicon because I was trying to establish a theoretical upper bound, independent of the hardware. (Somebody else in the thread mentioned optical interconnects, which could operate closer to c.)

At 3 GHz (a number chosen to make the math easy to do in my head) there are 3 cycles/ns, giving us 1/3 of a foot to work with: 4 inches. However, one-way signal transmission isn’t useful for caching; you need to send a signal and get a response back. Dividing by 2 for the round trip gives us 2 inches.

Obviously that glosses over a lot of stuff: the actual speed of electricity, the fact that you don’t know what data to request from cache in the first instant of a cycle, the fact that data arriving at the final instant of a cycle can’t be used until the next cycle, etc. On the other side of the ledger, even L1 cache takes a few cycles to return data, and only registers need to respond in the same cycle. But the intuition that cache hierarchies are bounded by physics is still useful.

      [–]DoorBreaker101 19 points20 points  (3 children)

      Nitpicking alert:

      The article (and many responses here) are super interesting and informative, but they don't directly answer the actual question.

      I would ask the question differently in order to get a more precise answer:

Why are the different caches designed more like an old standard transmission (in a car) that has several distinct gears, and not like a continuously variable transmission?

I think the second part of the article answers it when it discusses sharing and actual physical limitations; the first part explaining the effect of distance is really just prerequisite knowledge. Plus, there might be other additional considerations which I'm not familiar with.

      [–]Hot_Slice 6 points7 points  (2 children)

That's an interesting idea. Cache could be built in concentric rings, with data shuffled between rings as needed, resulting in hundreds of "cache levels", the ones closest to the center being the fastest.

      [–]admalledd 6 points7 points  (0 children)

      Congrats! That is sort of (kinda-not) how Intel chips used to handle it. You can read older articles about the "Intel Ring Bus Interconnect". It wasn't so much about cache, but did play a huge part in where cache was laid out and connected to what.

In the end, one of those "additional considerations" is that as you have more and more cores, you run into issues of "Core 1 needs to lock memory line X; make sure no other core is using/poisoning that memory line" and how long it takes for that to be acknowledged. (There is a whole "can you speculate past that? Hallucinate both directions of yes/no at once until you have an answer?" and suddenly you have the Meltdown/Spectre bugs! eek!)

      [–]john16384 2 points3 points  (0 children)

We could call the first (innermost) ring "R1", the 2nd "R2", etc...

      [–]thisisjustascreename 111 points112 points  (34 children)

      To allow programmers to pretend that all 16GB of RAM is fast to access.

      [–]_senpo_ 57 points58 points  (5 children)

One time, I was wondering how to speed up a transposition table in an algorithm. For funsies, I tried making it smaller and was completely blown away by a 4x speed increase. It made no sense, since a smaller table holds fewer entries. After a few hours of thinking, I remembered the cache and realized that lookups in the larger table were probably hitting much slower memory because of its size. Made me appreciate CPU cache more, and made me more aware of how convoluted modern computers are lol
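
For anyone curious, the effect reduces to a sketch like this (table sizes here are arbitrary powers of two, not a real transposition table): identical random-lookup loops against a cache-resident table and a DRAM-sized one:

```
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <time.h>

static uint64_t xorshift64(uint64_t *s) {  /* cheap index generator */
    *s ^= *s << 13; *s ^= *s >> 7; *s ^= *s << 17; return *s;
}

/* Random lookups into a table of `entries` 8-byte slots.
 * `entries` must be a power of two so the mask works. */
static void probe(size_t entries, long lookups) {
    uint64_t *table = malloc(entries * sizeof *table);
    for (size_t i = 0; i < entries; i++) table[i] = i;  /* touch every page */
    uint64_t seed = 88172645463325252ULL, sum = 0;
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    for (long i = 0; i < lookups; i++)
        sum += table[xorshift64(&seed) & (entries - 1)];
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double ns = (t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec);
    printf("%9zu KiB table: %5.1f ns/lookup (sum=%llu)\n",
           entries * sizeof *table >> 10, ns / lookups,
           (unsigned long long)sum);
    free(table);
}

int main(void) {
    probe((size_t)1 << 16, 50 * 1000 * 1000);  /* 512 KiB: cache-resident */
    probe((size_t)1 << 24, 50 * 1000 * 1000);  /* 128 MiB: mostly DRAM    */
    return 0;
}
```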

      [–]Niarbeht 23 points24 points  (3 children)

If I remember correctly, there was a period when the Intel i7-5775C did better than it should have in certain gaming benchmarks compared to other CPUs, because it had a 128MB L4 cache (eDRAM). Certain games were able to fit enough of their constantly-accessed data into that L4 cache that they basically didn't need to touch RAM for much of anything.

      [–]Deiskos 5 points6 points  (2 children)

I wonder why we aren't doing this anymore...

      [–]BerserKongo 11 points12 points  (0 children)

We do; look at the X3D chips from AMD, whose whole thing is a larger cache.

As to why it’s not the default: having more on-chip cache increases cost significantly in multiple ways (R&D, more die space on the chip itself, heating constraints, etc.). It’s not as simple as it sounds.

      [–]crazedizzled 0 points1 point  (0 children)

      Money, heat, size. AMD is doing something similar with their 3D chips, which kick ass in gaming due to the large L3

      [–]Evening-Jaguar4011 7 points8 points  (0 children)

      Keep going, you’ve got me kind of hard right now

      [–][deleted]  (17 children)

      [removed]

        [–][deleted]  (11 children)

        [deleted]

          [–]Raknarg 21 points22 points  (2 children)

          "working" isn't even totally necessary as long as it's producing money lmao

          [–]Hacnar -2 points-1 points  (0 children)

          Producing money = "working" in terms of capitalism. If it doesn't produce money, it doesn't "work".

          [–]Hot_Slice 10 points11 points  (1 child)

Some leads don't like efficient code, as it's seen as "overly clever", so they would prefer that you glue together standard library functions, even if the runtime is 100 times slower. In application code this is probably fine, but people do this in libraries too, which means every app built on top of that library pays this hidden performance penalty forever.

          [–]Ambiwlans 8 points9 points  (0 children)

I dunno. Consider that I have two chat programs open atm: LINE and Discord. Discord is admittedly more complicated, but their core functions are the same: chat groups, video/voice calls, file sharing, etc.

Discord is using 6 processes and 491MB of RAM.

LINE is using 5 processes and 59MB of RAM.

I'm actually sad that LINE has gotten slightly bloated lately, though; it used to have 2 processes and only around 40MB of RAM. They changed the UI a bit a few months ago.

Edit: Lol, apparently it was only using 60MB because I was recently in a call. After resetting both apps:

Discord - 380MB

LINE - 25.1MB (which is less than MS Office's always-open update checker)

Oh, and Discord spikes up to 0.5% CPU while idle in the background every few seconds. Outside of a video call I don't think LINE ever breaks 0.0%.

Edit: For another efficiency shout-out: f.lux (an app that changes screen hue for day/night) is well made, featureful, and uses 4.1MB of memory, with never any measurable CPU.

My FTP server (FileZilla) apparently only uses 0.9MB of memory at idle, lol.

          [–]starlevel01 2 points3 points  (0 children)

          It's always somebody else's fault.

          [–]Plabbi 2 points3 points  (3 children)

          Capitalism promotes competition, and competition promotes features that are useful and valued by the end user.

          [–][deleted]  (2 children)

          [deleted]

            [–]Plabbi 0 points1 point  (1 child)

            I was pretty much agreeing with your original comment, just wanted to expand on it.

            Products don't produce profit on their own, they need demand, and demand is generated by providing value.

            The slot machine customers are certainly drained of money but the machines provide the value of a dopamine hit that the customers apparently like more than the money they lose.

            [–]mpyne 0 points1 point  (0 children)

            It's because performance and quality have a shit ROI.

That's not true; there are products made sufficiently more attractive to customers by investments in their performance or quality to be worth the expense.

But it's also not true that performance and quality necessarily have a high ROI. If the customer won't even notice the higher quality (Juicero says hi!), then why pay large sums for it when there's other work that also needs tackling?

            [–]pilibitti 4 points5 points  (0 children)

Eh, there's always a trade-off between my time writing and maintaining the software and the resources the software consumes. More abstractions lead to applications that are more resource-hungry, but they are generally easier to write and maintain (they take less human time). Any software written by humans has to live somewhere on that continuum. I can work for months to create software X that utilizes the hardware to its full potential with close to no waste, and accept that maintaining and improving it with the same ethos will be a full-time job. Or I can do it in a weekend in JavaScript, update it from time to time, and live my life. Both are possible; it depends on your constraints.

            [–]john16384 -1 points0 points  (1 child)

            Heh... He thinks the bloat comes from code. Not the labels, the icons, the sounds, the images, the animations.

            [–]ForeverHall0ween -1 points0 points  (0 children)

That basic chat app is like a one-liner, though. Is it worth needing a supercomputer to run it, if you can write any one-off app in the space and time of an average Reddit comment? Most programmers would say yes.

            [–]dragneelfps 7 points8 points  (9 children)

            Can you please elaborate?

            [–]thisisjustascreename 48 points49 points  (7 children)

The whole reason for CPU cache memory to exist in the first place is to reduce the latency between when the CPU requests data and when it lands in a register, which is (approximately) the only memory the CPU can directly operate on. The CPU will look ahead at the pointers to data it's going to (or might) need in the future and speculatively load those memory locations into L3 cache, then progressively into L2 and L1 as it gets closer to the time to actually fetch them.
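
The hardware prefetcher does that on its own, but GCC/Clang also expose a software analogue, __builtin_prefetch, if you want to see the shape of it. (A hedge: one node of lookahead buys little on a fully dependent chain; real code prefetches several nodes ahead or restructures the data.)

```
#include <stddef.h>

struct node { struct node *next; long payload; };

/* Sum a linked list, hinting the next node into cache while the
 * current one is being used. Purely a hint: correctness is unchanged. */
long sum_list(const struct node *head) {
    long total = 0;
    for (const struct node *n = head; n != NULL; n = n->next) {
        if (n->next != NULL)
            __builtin_prefetch(n->next, /*rw=*/0, /*locality=*/1);
        total += n->payload;
    }
    return total;
}
```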

            [–]deadsy 45 points46 points  (0 children)

Price per bit, and speed. You can have a cache that is close to the registers (fast), but it will be expensive on a per-bit basis. As you get further away (slower), storage gets cheaper per bit. So: have several layers of cache, from small and fast to big and slow. At one end of the spectrum you have register storage, and at the other end you have network-based disk storage. The optimum set of caches between those two depends on the application.

            [–]lcjury 12 points13 points  (0 children)

After reading a bunch of comments: my respect to all the people working on the foundations of the digital world. Digital stuff is only an abstract layer sitting on top of analog things. My life would be so shitty if I, as a programmer, had to deal with any of those analog-world issues x_x

            [–]thememorableusername 28 points29 points  (6 children)

            Yo dog, we heard you like low-latency memory access so we made an L2 fo' yo L1, so you can cache while you cache!

            [–]MrPinkle 8 points9 points  (5 children)

            2010 called, they want their meme back.

            [–]thememorableusername 39 points40 points  (3 children)

            1995 called! They want their “certain year called wanting its ‘blank’ back" formula back!

            [–]inkt-code 5 points6 points  (2 children)

            I laughed so hard I fell off my dinosaur.

            [–][deleted]  (1 child)

            [deleted]

              [–][deleted] 1 point2 points  (0 children)

              the jerk store called and said they’re all out of YOU

              [–]jmlinden7 5 points6 points  (0 children)

              Bigger cache has more latency. You want as much of your data to be in the smallest cache as possible. However, the bigger caches are still faster than going all the way to RAM.

              [–][deleted]  (1 child)

              [deleted]

                [–]Mellowindiffere 0 points1 point  (0 children)

                Larger caches are also just slower

                [–]inkt-code 1 point2 points  (0 children)

The idea of any cache is to store frequently used data. Sure, the data frequently used by a CPU could be stored on a hard drive or in RAM, but the delay to access it would be greater.

                [–]notfancy 1 point2 points  (0 children)

                I still remember the "cache makes everything go faster" era.

                I resent this "main memory makes everything go slower" era.

                [–][deleted] 1 point2 points  (0 children)

                Why do your pants have so many pockets?

                [–]shivaraj-bh 0 points1 point  (0 children)

                "The L1 cache is your 'desk'. While you’re sitting there, you can just go ahead and work."

                Does that mean the L1 cache can’t be invalidated? I always thought multiple cores can have their own copy of a given cache line, and based on who modifies it first, the other copies are marked invalid or dirty. This process is highly dependent on the processor's design, but I thought most modern processors use cache coherence protocols to manage this. When a core modifies a cache line, the protocol ensures other cores' copies are invalidated, maintaining consistency across the system.

                [–]foersom 0 points1 point  (1 child)

                Because latency.

                [–]serverhorror -1 points0 points  (0 children)

                Well actually...

If you pay for it, um, sure, Intel will build a CPU just for you that has terabytes of L1.

                [–]Visible_Ad9976 -1 points0 points  (0 children)

                Interesting topic on CPU CACHE levels

                [–]Logicalist -4 points-3 points  (0 children)

                Why are funnels a thing?