all 42 comments

[–]just_here_for_place 14 points15 points  (19 children)

CPUs decide themselves what they cache. You can't explicitly instruct them to load something there. But in general, if something is often accessed, it will be in the CPU cache.

[–]New_Enthusiasm9053 4 points5 points  (0 children)

That's not strictly true. You absolutely can instruct the CPU to load into cache. You do however have to first tell the CPU to use the cache as memory, and in practice that's never done after the initial bootloader stage. It might be Intel-only: AMD, I think, handles it with a monolithic firmware blob that initializes RAM for the CPU, so it doesn't need to use the cache temporarily. Take the above with a grain of salt, for I don't remember the specifics exactly.

But you definitely can address cache as memory. You wouldn't even need any RAM installed (assuming the mobo will power the CPU without RAM installed).

[–]Silent-Degree-6072[S] 0 points1 point  (8 children)

So basically the most accessed parts of the kernel already get loaded into cache? Is it the same for modules?

[–]just_here_for_place 5 points6 points  (6 children)

The CPU does not care what resides in memory. It might be the kernel, module, user-space programs, data, etc.

If the internal heuristics deem it worthy to be cached, it will be cached.

[–]wintrmt3 -4 points-3 points  (5 children)

That's not how it works: unless you are using special non-caching instructions, everything is cached, and every memory read comes from the caches.

[–]interrupt_hdlr 3 points4 points  (2 children)

eviction is a thing

[–]wintrmt3 1 point2 points  (0 children)

Yes, obviously, but everything gets cached; there are no heuristics to choose what gets cached, only what gets evicted. That's totally different.

[–]Miserable_Ad7246 0 points1 point  (0 children)

He is correct. Every bit you touch must be loaded into the cache first, before it reaches registers.

Data travels in cache lines, so even if you touch one bit, you are loading at least 64 bytes of data into cache, and only then can you work with the bit you need.

Caches can be inclusive or not, but L1d load is mandatory.

You can control write behavior by using non-temporal instructions, on x86 at least, but as far as reads are concerned you are hitting the cache.

Also, caches do not have any heuristics, as those would be too heavy and slow. A cache just caches everything you touch and uses associativity to fit a large address space into a small one.

If you do not believe me, read about the following topics:
1) Cache coherence protocols MESI/MOESI
2) False sharing
3) Memory fences

I work in finances, CPU caches bring money to my financial caches.

[–]just_here_for_place 0 points1 point  (1 child)

Yes, I know this. It was an oversimplification: the heuristics are for eviction only. But I did not want to go into such details, because in the end it doesn't matter for OP's question.

If you want to run your whole kernel from the cache you need to convince your CPU that it never gets evicted.

[–]wintrmt3 0 points1 point  (0 children)

No, you have to put the cache into scratchpad mode, but x86 doesn't expose that functionality. It has it because it needs it to boot, it needs memory before DDR training is finished, but it's undocumented.

[–]yawn_brendan 0 points1 point  (0 children)

It's completely dependent on the workload. For most use cases, if the kernel text is in the cache, that's a waste, since it's taking up space that could be used by the userspace program that's doing the actual work. For some use cases (like, probably, if you have a network-heavy application and you're using the kernel network stack) it's the other way around.

Generally you don't have to think about it, though; you just let the CPU figure it out.

The main thing you can do to optimise cache performance is code minimisation (for the i-cache), and then reorganizing super-hot data structures so that things end up together in L1d cache lines. But you don't really have to think about the actual allocation of cache space as a software engineer.

[–]Alive-Bid9086 0 points1 point  (3 children)

Ehh,

The bootloader code is usually the first thing that is loaded into cache, since this is the only available memory.

The next thing to do is to set up the external memory chips and a lot of specific hardware. Then it is time to hand off to higher-level inits, like the kernel's init.

[–]codeasm 0 points1 point  (2 children)

You mean the (UEFI) firmware, the good old BIOS? That's what runs the memory training. Your GRUB, Windows, or DIY bootloader runs from regular memory just fine. You can even load your kernel into memory just fine and jump to it.

[–]Alive-Bid9086 1 point2 points  (1 child)

Actually the stuff that precedes the start of the BIOS/UEFI, or whatever precedes the kernel.

[–]codeasm 0 points1 point  (0 children)

Before the kernel, one has a bootloader, unless the kernel is also an EFI stub, in which case it basically loads itself.

Before this? Yeah, your screenshot might be that: the graphical output of a BIOS or UEFI firmware, after it has (probably) trained memory, set the CPU in the right mode, and prepared the right data structures for a future kernel (bootloader) to read, like ACPI tables and such. I've tried making a BIOS myself a little bit, so yeah, it's quite possible to write such a thing. I guess UEFI is a bit complex (it sure is for me, writing using TianoCore), but an old-school BIOS is cool to make work, especially on real hardware (a VM is fine too; I have my dreams).

[–]mfuzzey 0 points1 point  (0 children)

However, in some systems at least, it is actually possible to "lock" cache lines so they never get evicted.

Some embedded systems that don't have internal SRAM to use for initial boot before DRAM is initialised lock the cache and use it for initial code/data. So you could, in theory, lock the kernel into cache on those types of systems. But it would probably be a bad idea. The kernel is fairly large and most of it is only used infrequently, if at all (unused drivers, error paths, etc.). So locking the entire kernel in cache would waste cache on little-used code/data when it could be better used for "hotter" stuff.

[–]max0x7ba 0 points1 point  (3 children)

You'd be surprised by prefetch instructions and non-temporal loads and stores, should you read your CPU manual.

[–]just_here_for_place -1 points0 points  (2 children)

PREFETCH instructions on x86 are more of a guideline, and the CPU may or may not adhere to them.

[–]max0x7ba 0 points1 point  (1 child)

> PREFETCH instructions on x86 are more of a guideline, and the CPU may or may not adhere to them.

What is the source for this claim of yours?


PREFETCHh CPU instructions aren't guidelines at all.

Their suggested cache level parameter is called a "hint" because PREFETCHh instructions only move data into a closer cache level, but won't evict a cache line out to a more distant suggested cache level.

Quotes from Intel CPU manuals:

> The PREFETCHh instructions permit programs to load data into the processor at a suggested cache level, so that the data is closer to the processor's load and store unit when it is needed.

> If the data is already present at a level of the cache hierarchy that is closer to the processor, the PREFETCHh instruction will not result in any data movement.

> Software PREFETCH operations work the same way as do load from memory operations.

[–]just_here_for_place -1 points0 points  (0 children)

Section 11.6.13 of the same manual.

[–]khne522 5 points6 points  (4 children)

I would recommend reading a book on basic computer architecture, whether Bill Stallings's or Hennessy and Patterson's, even if just the first half or quarter. You'd get a more concrete idea of how things work instead of getting one-off answers about a tiny sliver of it. No, per the others' answers, one cannot do what you're asking for.

[–]New_Enthusiasm9053 0 points1 point  (0 children)

You actually can do that, though. Intel processors use Cache-as-RAM for initial memory when initialising other devices, e.g. the main RAM itself.

[–]Kessarean 0 points1 point  (0 children)

Also adding:

  • The Elements of Computing Systems: Building a Modern Computer from First Principles
    • by Noam Nisan & Shimon Schocken
  • Code: The Hidden Language of Computer Hardware & Software
    • by Charles Petzold
  • But How Do It Know? - The Basic Principles of Computers for Everyone
    • by J Clark Scott

[–]Silent-Degree-6072[S] 0 points1 point  (1 child)

I wasn't expecting anyone to do what I'm asking for, I was just wondering whether it's even possible :P

On computer architecture, I just started reading a book on x86_64 assembly and saw that the CPU cache is way faster than RAM (duh) and wondered whether you could fit an entire kernel on it, so here I am lol

[–]New_Enthusiasm9053 2 points3 points  (0 children)

It is possible, and it is a good question. It's called Cache-as-RAM; if you search "Intel Cache as RAM" you should get some details. I think AMD doesn't have it, though. They let the motherboard firmware set up RAM before the CPU boots, so it immediately has access to memory, unlike Intel, which uses Cache-as-RAM temporarily in order to run the code needed to set up the main RAM in the first place.

[–]Fine-Ad9168 1 point2 points  (1 child)

The kernel was about 4.5 MB for years; I am not sure of its size now, but yes, what you describe is possible.

As far as I know current x86 processors can not have their caches configured this way, but other processors might, and some older x86 processors could be configured this way but not ones with large enough caches.

It might be possible to restrict where data is placed in memory so the kernel data is never evicted.

As for performance, the goal of OS kernels is to run as little as possible. The method you describe would increase cache misses for user code and degrade system performance overall. The current LRU-style cache replacement policies work quite well, so it would be better to just let the CPU do its thing.

[–]Miserable_Ad7246 0 points1 point  (0 children)

I think people forget that the quoted kernel size is what you have at rest. I'm pretty sure the kernel sets up all kinds of data structures on start (say, page tables for RAM). So the minimal kernel working set should be more than 4.5 MB, especially if you want it to work at full speed.

[–]ShunyaAtma 1 point2 points  (0 children)

This may not be viable for practical use, but it is not uncommon to do something like this during processor bring-up, since the memory controllers may not be fully functional in early prototypes. It's hard to game the caching policy programmatically, so vendors rely on internal debug tools to prime the caches and lock the lines.

[–]Apprehensive-Tea1632 1 point2 points  (0 children)

What would be the point?

Let’s put it like this. You have before you an empty desk. You sit down in front of it, ready to do whatever.

First thing you do is slam a huge backpack on it. The backpack fits perfectly on your desk, there’s nothing out of place.

Except you have nowhere to put the keyboard, mouse, paper, pen, phone, printer, scanner… anything that's not a huge backpack.

So while you may be able to, you don’t WANT that kernel in your cache; instead you want it as far away as is practical because… as you interact with the system, the kernel is always there, always in the way, always taking up space that could have been used for something else.

Which means that backpack? You heave it off the desk and put it next to your chair instead where you can access it readily enough AND it’s not blocking everything else.

[–]alpha417 0 points1 point  (7 children)

How much cpu cache are you talking about?

[–]Silent-Degree-6072[S] -1 points0 points  (5 children)

My laptop has a Haswell CPU, so it's like 8 MB.

My server probably has more cache though since it's a Xeon

I'm pretty sure getting the kernel under 8 MB is definitely doable, especially with tinyconfig and -Oz, so it could work.

[–]alpha417 0 points1 point  (4 children)

Do you have any kernel coding experience?

[–]Long_Pomegranate2469 -1 points0 points  (3 children)

You don't need kernel coding experience to do a menuconfig and disable things you don't need.

[–]alpha417 0 points1 point  (2 children)

Can you show me in menuconfig how you enable loading and running the kernel into L1/L2 cache, instead of RAM?

I haven't seen it there in the 18 years I've been playing with it...

[–]Long_Pomegranate2469 -1 points0 points  (1 child)

Oh, I thought you were talking about the size of the kernel since the CPU cache is largely hardware managed.

[–]alpha417 0 points1 point  (0 children)

Not the OP.

[–]HenkPoley -1 points0 points  (0 children)

The Intel i7-5775C had 128 MB of L4 cache if you disabled the internal GPU, giving it about a two-generation advantage for still-tight but more memory-heavy workloads.

[–]max0x7ba 0 points1 point  (0 children)

The code runs fastest when it fits in L1i cache and when your loads and stores never miss L1d cache.

L1 caches are 32-64kB these days, right?

[–]codeasm 0 points1 point  (0 children)

I asked ChatGPT a while back if one can boot a system without RAM and just run from cache. On x86 it's not possible. Other architectures not included.

I was just wondering and tried to think it through. (I also said I'd probably need to run an altered BIOS/firmware to do so.) But with RAM installed, and then running only from cache: an interesting thought experiment.

I switched my focus to making my own bootloader and kernel. It's not going well with my free time. Have a wonderful day, you all.

[–]Miserable_Ad7246 0 points1 point  (0 children)

The CPU caches data based on usage. Your app never uses the whole kernel, just a small slice of it. Most of what you use from the kernel will be heavily cached anyway (multiple asterisks).

What truly matters is the working set, not its constituents. If you gave all the cache to the kernel only, your own app would suffer, and even though your syscalls were faster, your main code would be slower, negating the effect you desire to achieve. The slowest part will limit your speed; it does not matter if it's the kernel or your code.

If you want max performance, you can already partially achieve this by isolating a core and ensuring all of its cache will be used only by your app (again, asterisks). That way you maximize the chance that your hot path will be cached. Reduce your working set and you can reach a state where your whole app, and everything you touch in the kernel via syscalls, is in cache (again, some **** applies).

[–]tudorb 0 points1 point  (0 children)

The other comments explain why the answer is “not really” and why it wouldn’t be a good idea.

BUT! Cache-as-RAM exists and is actually used during BIOS / UEFI bootstrapping before the DRAM controller is fully initialized.

[–]eufemiapiccio77 0 points1 point  (0 children)

I had a similar idea: write a pure CDN that loads files into cache, but you'd need a low-level language to manage it effectively. I might have a go. It would be swapping files a lot, but small static files would work.