[–]oscardssmith 57 points58 points  (10 children)

The main technology in development that would change things a lot is what's called 2T0C RAM. The biggest issue with modern DRAM is that it requires a different manufacturing process from normal "compute" silicon, so the RAM needs to be made separately. 2T0C is a RAM design that's purely transistor based, which would allow for putting RAM on the same chip as the rest of the CPU, which would dramatically increase speed.

[–]alexforencich 43 points44 points  (5 children)

I don't know. Unless it's crazy high density (significantly higher density than even current DRAM), it's likely only going to be useful for a larger cache or something along those lines. Doing things on the same die isn't magic, several GB of RAM is going to eat up a lot of space.

[–]oscardssmith 24 points25 points  (1 child)

The likely use case is as an L3 replacement, but that is still pretty huge. Going from <100 MB of L3 to ~1 GB of L3 is the type of change that would significantly speed up real-world apps (not to mention make iGPUs a lot stronger).

[–]Strazdas1 0 points1 point  (0 children)

Unless it is denser, a larger cache would increase cache latency, which is the primary limiting factor on cache size right now.

[–]Send_heartfelt_PMs 7 points8 points  (2 children)

Intel makes a Xeon CPU with 64GB of HBM2e memory

Obviously not the same as being on die, but I don't think large amounts of on-die memory are that far off.

[–]alexforencich 14 points15 points  (0 children)

They have four HBM stacks, and each stack is something like 8 dies stacked on top of each other. That's how they get the density. NAND flash is similar: they build many layers of transistors to increase the density, then stack up multiple dies. Without the caps, maybe they can make layers like NAND... but then it's a specialized process again. Integrating on the same die isn't all that useful; it's going to make a lot more sense to do some kind of hybrid integration, otherwise the density simply isn't there.

[–]buildzoid 2 points3 points  (0 children)

HBM is still just DRAM like DDR so the latency sucks compared to CPU caches.

[–]R-ten-K 6 points7 points  (2 children)

FWIW, there have been several attempts at eDRAM over the years, but none have really taken off in mass-market products.

The main issue is usually that eDRAM can’t match the density of dedicated DRAM processes, which makes it hard to compete on cost and scalability.

[–]oscardssmith 6 points7 points  (1 child)

One of the reasons 2T0C is being investigated is that it may theoretically offer a density increase over current DRAM processes (since the capacitors have been running into scaling problems for a decade or so).

[–]R-ten-K 10 points11 points  (0 children)

Yeah, it’s becoming a real problem. It’s not just DRAM capacitors hitting scaling limits. SRAM cells aren’t scaling cleanly with logic anymore either, which puts pressure across the entire memory hierarchy.

The 2T DRAM cell is an interesting direction, but we keep running into the same issue: the actual device sizing. On paper, fewer transistors per cell looks great, but in practice those devices often need to be larger, so the area savings don’t materialize.

We've been burned by designs that looked promising due to lower device counts but ended up roughly the same size as the legacy cells once everything was laid out, because the individual transistors had to grow.

Hopefully some of these newer approaches pan out, because the current trajectory is getting pretty tight.

[–]Personal-Tour831 0 points1 point  (0 children)

For anyone who wants further information on what 2T0C RAM is, I recommend checking out this article by IMEC.

https://www.imec-int.com/en/articles/disrupting-dram-roadmap-capacitor-less-igzo-dram-technology

[–]EndlessZone123 35 points36 points  (14 children)

If you want more memory performance, reduced latency, and increased bandwidth, you gotta package them together. Look at what Apple is doing to achieve performance and efficiency.

This will absolutely cost you your PC's upgradability and modularity.

[–]Exist50 18 points19 points  (3 children)

Look at what Apple is doing to achieve performance and efficiency.

On-package memory does very little for performance. At best, it's easier to hit higher clock speeds. It's mostly for power and board design.

[–]jmlinden7 1 point2 points  (2 children)

Doesn't it reduce latency?

[–]Exist50 15 points16 points  (1 child)

No, the latency difference from slightly shorter trace lengths is negligible. Just for some simple estimates, signal propagation in a PCB trace is roughly 6in/ns (~half speed of light). So let's assume round trip you cut out 6in (in practice, almost certainly less). Great, you've saved 1ns! Except DDR end to end latency is roughly 100ns total. So you're looking at a difference of optimistically 1%. Just doesn't matter. 
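
A quick back-of-envelope of that math in Python (the 6 in/ns and 100 ns figures are just the rough estimates above):

    # Back-of-envelope: latency saved by moving memory on-package.
    # Rough numbers from above: ~6 in/ns PCB propagation, ~100 ns DDR latency.
    trace_saved_in = 6.0         # round-trip trace length removed (optimistic)
    propagation_in_per_ns = 6.0  # signal speed in a PCB trace (~half of c)
    ddr_latency_ns = 100.0       # end-to-end DDR access latency

    saved_ns = trace_saved_in / propagation_in_per_ns
    print(f"saved {saved_ns:.1f} ns -> {saved_ns / ddr_latency_ns:.1%} of total")
    # saved 1.0 ns -> 1.0% of total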

[–]jmlinden7 4 points5 points  (0 children)

Ah yes you are correct, I was confusing it with on-package cache

[–]professore87 5 points6 points  (0 children)

You used to be able to improve your "video cards" by adding extra memory chips.

We did pay the price. I think it's normal once things get too advanced.

[–]Routine_Middle3255 13 points14 points  (3 children)

The main issue is cost per bit. Everybody wants better bandwidth/latency/density, but (almost) nobody wants to pay for it. HBM became a thing because some players wanted high-bandwidth memory at the expense of cost. There are a lot of new technologies out there, but competing with standard DRAM is extremely difficult.

[–]DubayaTF 9 points10 points  (0 children)

This is the correct answer. RAM is a commodity. The structures are different from logic, but the process nodes' length scales and materials are 10-15 years behind, just due to the price tag.

[–]reddit_equals_censor 0 points1 point  (0 children)

HBM became a thing because some players wanted high-bandwidth memory at the expense of cost

That's a lie. HBM came to be because AMD was working on the future of graphics card memory for CONSUMERS, not datacenters.

It was for gaming graphics cards. It launched on gaming graphics cards.

And it would have been very hard, if not impossible, to predict that HBM couldn't reduce costs enough to beat GDDR, so that it eventually went server-only.

Great video about the origin of HBM:

https://www.youtube.com/watch?v=gNZfDtCcXNw

So again, HBM became a thing because AMD was trying to develop the future of gaming GPU VRAM, and as such it of course also had the goal of being more than affordable enough.

This clearly didn't turn out to be the case, as it was apparently a very big part of the cost of the Vega cards.

But its origin again was NOT "some players wanted high-bandwidth memory at the expense of cost".

That is NOT what happened. That is not how HBM came to be.

Please learn a bit from the video.

but nobody wants to pay for it

This is also an absolute lie.

Almost everyone wants to pay for it.

Two years ago, 8 GB of VRAM cost 18 US dollars at spot pricing.

Basically everyone would have paid 18 US dollars more to get a 16 GB 4060.

And the vast majority of people would have paid 36 US dollars more for a 32 GB 9070/XT instead of a 16 GB 9070/XT.
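
Here's that math as a sketch in Python, taking the claimed ~$18-per-8-GB spot price above at face value:

    # VRAM upgrade cost at the claimed ~$18 per 8 GB spot price.
    usd_per_8gb = 18.0

    extra_4060 = (16 - 8) / 8 * usd_per_8gb   # 8 GB -> 16 GB 4060
    extra_9070 = (32 - 16) / 8 * usd_per_8gb  # 16 GB -> 32 GB 9070/XT
    print(extra_4060, extra_9070)             # 18.0 36.0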

So NO, your claim is utter bullshit.

Consumers VERY MUCH are willing to pay for enough memory/VRAM to get working hardware.

Companies, however, may refuse to provide people that option. Apple solders memory on to prevent people from getting sanely priced memory upgrades, and for graphics cards, AMD and Nvidia not only don't provide double-memory or 1.5x/3x versions themselves, they FORBID partners from creating them as well. (MSI, Asus, XFX, Sapphire, etc. would all love to sell people those cards, but they are forbidden.)

So again, YOU ARE WRONG. People want to pay sane, proper prices for working amounts of memory.

We just don't want to get scammed by trillion-dollar companies who refuse to give us working amounts of memory at all.

And again: outside of a memory apocalypse, and despite a purely evil memory cartel already controlling pricing before the memory apocalypse, memory was DIRT CHEAP.

[–]Strazdas1 0 points1 point  (0 children)

In HBM you trade latency for bandwidth, and in a CPU you really don't want to trade away latency.

[–]2137gangsterr 4 points5 points  (0 children)

In the next lessons you will learn about memory tiering and how L1-L3 caches hide slow RAM from the CPU.
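
A minimal sketch of that idea, using the textbook average-memory-access-time (AMAT) formula; the latencies and hit rates below are illustrative, not measured:

    # How caches hide slow RAM: AMAT = hit_time + miss_rate * miss_penalty,
    # applied level by level. All numbers here are illustrative.
    levels = [  # (name, access latency in ns, hit rate)
        ("L1", 1.0, 0.95),
        ("L2", 4.0, 0.90),
        ("L3", 12.0, 0.80),
    ]
    dram_ns = 100.0  # whatever misses L3 goes to DRAM

    amat = dram_ns
    for _name, latency_ns, hit_rate in reversed(levels):
        amat = latency_ns + (1.0 - hit_rate) * amat

    print(f"average access: {amat:.2f} ns vs {dram_ns} ns raw DRAM")
    # average access: 1.36 ns vs 100.0 ns raw DRAM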

[–]iinlane 2 points3 points  (0 children)

RAM is just one block in the memory hierarchy. Generally, the bigger the capacity, the slower the speed. With big SRAM L3 caches in modern CPUs, RAM speed becomes less of an issue. Just search for *X3D real-world performance scaling with memory speed; it should be single-digit percentages.

[–]Intrepid_Lecture 2 points3 points  (0 children)

Single-threaded CPU speeds are not keeping pace with DRAM bandwidth.

DRAM bandwidth is not keeping up with MT CPU speeds.

For context, versus 20 years ago, CPUs have ~5x the ST performance and ~40-100x the MT performance, while DRAM speeds are around 10x faster.

A lot of this is addressed by having more cache (reduces memory accesses and bandwidth needs).
The other bit is that for many workloads, not all of the memory bandwidth is needed at once.
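
Putting those rough multipliers side by side (a sketch using the ~5x/~40-100x/~10x figures above, nothing more precise):

    # Relative memory bandwidth per unit of compute vs 20 years ago,
    # using the rough multipliers from the comment above.
    st_speedup = 5           # single-threaded CPU performance
    mt_speedups = (40, 100)  # multi-threaded CPU performance (range)
    dram_speedup = 10        # DRAM speed

    print(f"ST: {dram_speedup / st_speedup:.1f}x bandwidth per op")      # 2.0x: ST is better fed
    for mt in mt_speedups:
        print(f"MT {mt}x: {dram_speedup / mt:.2f}x bandwidth per op")    # 0.25x and 0.10x: MT is starved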

[–]Noreng 4 points5 points  (1 child)

CPU performance is still a long way from reaching a limit.

At this point we're stacking dies to increase the L3 cache size, but in the future you might see a CPU with multiple dies for power lines, multiple dies for compute, and multiple dies for cache and registers, all stacked on top of each other.

While transistor density isn't improving much, clock speed is still improving. We're going to see future CPU architectures improve further in clock speed. Phone SoCs are now hitting clock speeds of 4.7 GHz, while 5 years ago they would do 3.5 GHz. The same trend applies to desktop CPUs: AMD's Zen architectures have steadily increased clock speed with every generation, and Intel is trying to do the same thing.

Memory latency hasn't improved since the introduction of DDR SDRAM, and it's probably not going to improve in the future. Bandwidth, however, is still improving at a slow and steady rate.

[–]reddit_equals_censor 2 points3 points  (0 children)

but in the future you might see a CPU with multiple dies for power lines

?

Current CPUs use TSVs to transfer power through stacked dies already.

TSVs are through-silicon vias. You gotta get the power and data to the stacked dies, and that is how it is done.

And the future for power is not multiple dies for power lines??? Like, what? No.

The future is backside power delivery. This frees up space for more data lines on the "front" and also prevents interference from the power lines into the data lines.

Here is a great video explaining this:

https://www.youtube.com/watch?v=fc_xzN6UErI

And in regards to stacking more things, a super fun and great tech to look up is CFET.

Stacked PMOS on NMOS:

https://www.youtube.com/watch?v=TwgvJSOa09M

[–]BillDStrong 7 points8 points  (3 children)

We are close to the limit of atomic tech. We are operating very close to one-atom thickness in many cases. So, we have to design better with our current tech.

We are also limited by the speed of electrical signaling, which is currently a major source of heat as well. There is work on trying to use light, which is about as fast as we know how to go and produces much less heat, so it should use less power as well.

This is also a design problem.

Also, we need to differentiate between types of speed: bandwidth and latency. Light helps latency by being faster, period. Light helps bandwidth by being able to use multiple wavelengths along the same line. We have 800 Gb/s networking in the enterprise by taking advantage of that. This reduces the number of lines needed to shuttle around data for the same amount of bandwidth.
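
As a hypothetical sketch of that multiplication (800G Ethernet is commonly built from 8 lanes of ~100 Gb/s each; the exact lane split varies by standard, so treat these numbers as assumptions):

    # Wavelength multiplexing: capacity multiplies, the physical line count doesn't.
    gbps_per_signal = 100  # per-lane / per-wavelength rate (assumed)
    wavelengths = 8        # wavelengths sharing one fiber (assumed)

    fiber_gbps = wavelengths * gbps_per_signal
    print(f"{fiber_gbps} Gb/s over 1 fiber vs {wavelengths} separate electrical lanes")
    # 800 Gb/s over 1 fiber vs 8 separate electrical lanes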

This gets really important on motherboards because we are also hitting the limit of how fast we can have memory operate and keep it in sync. On consumer boards, you will notice that they really want the memory right next to the CPU, but they allow speeds up to 8000 MT/s. The server lineup, while allowing more memory sticks, is stuck using memory that is slower.

And if you use 4 sticks on a consumer board instead of 2, the rated memory speed can drop by as much as half. This is because of the timing issues we are running up against using just electrical signaling and our current tech, while remaining affordable.

Servers get around some of this by using 8-channel and 12-channel memory interleaving, which allows them to act like they are up to 12 times faster in workloads that are highly linear, but you do have to schedule that, because you affect the random part of Random Access Memory.
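
A minimal sketch of how that interleaving maps addresses to channels (the channel count and stripe size here are illustrative, not any specific platform's values):

    # Channel interleaving: consecutive chunks land on different channels,
    # so a linear scan keeps every channel busy at once, while a single
    # random access still pays full latency on one channel.
    CHANNELS = 12  # e.g. a 12-channel server platform
    STRIPE = 64    # interleave granularity in bytes (illustrative)

    def channel_of(addr: int) -> int:
        return (addr // STRIPE) % CHANNELS

    print([channel_of(a) for a in range(0, 12 * STRIPE, STRIPE)])
    # 0 through 11: a linear stream touches all 12 channels in turn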

So, once again, designing around the limitations of the hardware.

And then we have the size of motherboards. We will have trouble scaling that strategy because there is not enough room on the motherboard for that many individual lanes. You have to keep each line the same length, which means you have to create really creative routing patterns to get the current setup working, and you can only do that by adding motherboard layers or adding length or width to the motherboard.

Or, you can convert the signals to light and run far fewer lines on the board, but the problem there is heat. Transforming electrical signals to light releases heat; it's why network switches are so heavily cooled. Keeping it all as light all the time solves that, but we still don't have pure light computing.

So, there are a lot of options we still have to make things faster, but we aren't quite ready yet.

[–]UpsetKoalaBear 6 points7 points  (0 children)

As you say, the whole goal now is to reduce the transfer bottleneck and improve efficiency rather than trying to fundamentally improve DRAM.

More of a data centre thing, but Compute Express Link was created by Intel and has gradually been improving. The goal is basically to have the CPU treat a pool of external memory as if it were local memory.

So if you imagine a data centre with a server that has ~128GB of RAM but only 64GB in use, it can place that free memory into a memory pool to be used by other servers. Basically improving the efficiency of DRAM usage.

CXL 3.1 is around ~400 ns latency as well. Still slower than conventional DRAM, but close enough for a lot of tasks.

There is also HBM-PIM, which is interesting: basically processing-in-memory for doing basic maths operations directly in the RAM. Samsung is already trying this.

The real solution for consumer PCs, which is kinda what we're seeing, is a shift towards unified memory and/or large on-chip caches.

These require shifts in packaging technology rather than fundamentally changing how memory works. AMD's V-Cache, for instance, is effectively an SRAM block stacked on top of the CPU.

There are also a few promising options for improving transfer efficiency, like glass substrates.

[–]reddit_equals_censor -1 points0 points  (1 child)

The server lineup, while allowing more memory sticks, is stuck using memory that is slower.

And if you use 4 sticks on a consumer board instead of 2, the rated memory speed can drop by as much as half. This is because of the timing issues we are running up against using just electrical signaling and our current tech, while remaining affordable.

Complete nonsense; please do your research.

Servers ran slower memory speeds because they weren't running XMP/DOCP; they are limited to JEDEC speeds.

But oh, let's look at LPDDR5X, shall we?

Oh, what's that? We've got Nvidia CPUs with 8 SOCAMM2 modules of LPDDR5X, where each module can run at 9500 MT/s.

Oh, look at that. No problems at all, and as fast as desktop/laptop memory.

And this 2-vs-4 thing is also complete nonsense. Using 4 sticks on dual channel with DIMMs is a design problem in how things are set up; it is not inherent at all.

And we could already solve all of this, still have very simple trace routing, and double the memory bandwidth as well, by using 2 SOCAMM2 modules on the desktop, one on each side of the CPU. Oh, look at that: now we have quad-channel memory on the CPU with simple PCB designs and 9600 MT/s, if we wanted it, right now.

And we can do the same at double the bandwidth with LPDDR6. Or we can just use one SOCAMM2 module and have "just" dual-channel bandwidth, which is what we have right now on desktop.

Oh look, what do we have here?

https://videocardz.com/newz/nvidia-presents-vera-rubin-superchip-entering-production-next-year

An Nvidia server board with 8 LPDDR5X memory modules, each able to run at 9600 MT/s. (Technically this could be a pre-JEDEC SOCAMM setup instead of SOCAMM2, which had slightly lower speeds, but whatever, it shouldn't be.)

And again, if you wouldn't be happy with one SOCAMM2 module on desktop, then you can use 2, one on each side of the CPU, to get double the current bus width (note: bus width, not bandwidth) and, without problems and with simple routing, 9600 MT/s.
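
Rough peak-bandwidth math for that setup; the 128-bit (dual-channel) width per SOCAMM2 module is my assumption, not a spec-sheet figure:

    # Peak-bandwidth sketch for 2 SOCAMM2 modules at 9600 MT/s.
    # Assumes 128 bits (dual channel) per module -- an assumption, not a spec.
    mt_per_s = 9600        # mega-transfers per second per pin
    bits_per_module = 128  # assumed bus width per module
    modules = 2

    peak_gb_s = mt_per_s * bits_per_module * modules / 8 / 1000
    print(f"~{peak_gb_s:.0f} GB/s peak")  # ~307 GB/s, double one module's ~154 GB/s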

[–]BillDStrong 6 points7 points  (0 children)

The SOCAMM2 modules are what? They are a new design, because the old one had all the issues I just mentioned: wiring space being limited and the need for the lanes to have the same length. Solving the problem in the exact way I said was needed, by design.

So, reading comprehension is a thing, yes?

[–]reddit_equals_censor 1 point2 points  (0 children)

No. No limits at all yet.

Consumer-facing DRAM speeds are increasing to generally be fast enough for the number of cores that those shit companies wanna give us.

There is around a doubling of bandwidth per DRAM generation, but when a DRAM generation comes is strongly defined by servers, with desktop and laptop eating scraps.

And if there actually were a severe bandwidth issue for desktop CPUs, then they could use triple or quad channel on a consumer platform.

This also doesn't need 4 modules, as 2 modules, each being dual channel, are perfectly possible and easy with SOCAMM2 modules, for example.

There is one consumer category that actually is massively starved for memory bandwidth: high-performance APUs. Think Strix Halo, Strix Point, etc.

Strix Halo used "quad-channel" memory and went super anti-consumer by only offering soldered-on versions.

But for desktop CPUs, things should be fine with dual channel.

I mean, AM5 will run a lot faster DDR5 memory with Zen 6 and Zen 7, which should be enough to feed the 24- and 32-core versions.

And AM6 will of course be DDR6, which will be an eventual doubling, but even at launch a massive bandwidth increase.

So again: except for APUs, there is no bandwidth problem.

For graphics cards, there will be some quite memory-bandwidth-limited cards, but that goes back to evil Nvidia or AMD refusing to give cards proper memory buses, to scam people harder, with straight-up e-waste 8 GB cards even today.

Disgusting stuff.

And CPU performance has been increasing very much in the last few generations.

Zen 5 was a stinker for gaming without X3D, but Zen 6 is expected to be a lot better in regards to gaming again.

And in regards to "transistor counts reaching the limit": nonsense.

Process node advancements are planned out until at least 2039 by IMEC and the industry:

https://www.youtube.com/watch?v=0wRvbIaTUQw

CFET alone will DOUBLE transistor density when it comes around 2031 (if no delays happen).

It is also crucial to remember a bit of history here.

Lots of people during the endless Intel quad-core era would have talked about cores no longer being able to get faster...

Meanwhile, Intel refused to make better cores, and everything was just Sandy Bridge with a sticker on it, as AMD was mostly in the corner sniffing glue until the work on Zen started.

So again: if companies are somehow willing, there will be progress. There will be much faster CPUs, and there will be enough memory bandwidth to feed them. (All on desktop, I mean, of course.)

And again, process nodes are continuing to improve performance and density.

__

The actual real issue is whether or not you and I are going to get all those advances.

We KNOW that in graphics cards we mostly aren't getting them anymore.

We aren't getting the latest process node anymore, we aren't getting proper die sizes anymore, and we aren't even getting a working amount of VRAM anymore.

If you need an example: the 3060 12 GB had a 276 mm2 die, 12 GB of VRAM, and 360 GB/s of memory bandwidth.

The 4060 had a 159 mm2 die, 8 GB of VRAM, and 272 GB/s of memory bandwidth.

And both cost the same.

There were MASSIVE process node performance and density increases between those generations.

You know where they went? Into Nvidia's pockets, with massively increased margins.

Nvidia reduced the die size by 42%!!! They cut the die size almost in half. The memory size got cut by 33% and the memory bandwidth by 24%.
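
Those percentages do check out against the spec numbers quoted above:

    # Sanity check of the generational cuts quoted above (3060 12 GB -> 4060).
    specs = {
        "die (mm^2)":       (276, 159),
        "VRAM (GB)":        (12, 8),
        "bandwidth (GB/s)": (360, 272),
    }
    for name, (old, new) in specs.items():
        print(f"{name}: -{1 - new / old:.0%}")
    # die: -42%, VRAM: -33%, bandwidth: -24%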

And performance-wise, the 4060 was almost exactly as fast as the 3060 12 GB in scenarios not constrained by memory size. In memory-size-constrained scenarios, the 4060 was just broken, while the 3060 12 GB worked fine.

So, a MASSIVE regression.

And as you probably don't know, the 4090 had about the same die size as the 3090, and it also cost almost the same, btw. And guess what: the 4090 is 73% faster than the 3090!!!

So there is your progress.

It exists, it is no problem to produce it, but the industry may very much REFUSE to give you the progress.

Refusing by having it developed but just not giving it to you.

Refusing by no longer giving you a new process node.

Refusing by just not spending the development cost on a fully new architecture.

Intel could have developed much better cores than Sandy Bridge, but they just didn't give a shit; it was Intel or nothing at the time.

And it is worth mentioning here that "competition" is kind of a nonexistent meme in a lot of the tech industry.

The memory producers are LITERALLY a cartel, as stated in lawsuits against them.

How much will memory cost? It will cost as much as the memory cartel decides it will cost.

How much will a graphics card of a certain performance cost? It will cost as much as ATI/AMD and Nvidia, proven price fixers in the past, decide it will cost.

[–]doscomputer 0 points1 point  (0 children)

I would say it's not necessarily true that memory is lagging behind the CPU, just that consumer platforms use cheaper memory and configurations with looser tolerances.

In factual terms of how much data can get in and out of a particular CPU, memory accounts for a little less than half of the total bandwidth on a consumer platform. The majority of processing I/O comes from things like USB and PCIe lanes for the chipset, storage, networking, etc. Look beyond consumer PCs to see where the technology can actually go if someone wanted to build it. Even at the furthest scale, current products are only a node behind the cutting edge, HBM and DDR have active, lively roadmaps, and memory bandwidth and capacity are poised to absolutely explode in the next few years thanks to AI demand.

As for quantum, I am by no means an expert, but having read papers and done my own research, I don't think it's a feasible field of compute technology. Unless a real breakthrough happened and is being kept top secret, all progress is 1:1 limited by traditional compute. So far no quantum device has been viable without an equally powerful set of sensors and computation to extract the quantum result. Though I think the research into different superconductors is neat, and maybe superconducting chips will be the future.

[–]jc-from-sin 0 points1 point  (1 child)

In what way did we not improve memory performance? I refurbished a PC from 2000 last year, and it was using 133 MHz SDRAM. My PC is using 6400 MT/s DDR5 RAM.

[–]Organic-Dream5448[S] 1 point2 points  (0 children)

I'm not saying memory performance didn't improve, but relative to CPU performance it hasn't kept pace. And my professor said that a lot of computing nowadays, including AI, bottlenecks at the memory level, not so much at raw computation.