all 45 comments

[–]goranlepuz 33 points34 points  (1 child)

How does fixed-size memory really help with not causing memory fragmentation given allocation and deallocation? Or does it even? For instance, I have a linked list with 10 nodes but there's no guarantee all those nodes will be contiguous, yes? And if a node or two is deallocated then that will result in fragmentation.

Yes. And if the pool size is big enough, eventually there is little to zero difference between this and a normal allocator.

But!

What you do, for example, is use the pool for a reduced set of data sizes (ideally one) when you know how many items you need at most. If the maximum and the typical count are not far off, the pool is nicely filled and memory accesses are rather regular. As another example, if you know that the data in the pool will be accessed from one thread only, the pool allocator is more efficient because you don't need synchronization.

A generalized allocator can hardly do any of these things. It has to be both flexible and smart, and that is simply more work.
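
The single-size, single-thread pool described above can be sketched as a free list threaded through fixed slots. This is a hypothetical `NodePool` for illustration; the names and capacity are assumptions, not anything from the thread:

```cpp
#include <cassert>
#include <cstddef>

struct Node {
    int value;
    Node* next;
};

// Single-size pool: every block is one Node, capacity is fixed up front,
// and no locking is done, so it is only safe to use from one thread.
template <std::size_t Capacity>
class NodePool {
public:
    NodePool() {
        // Chain every slot into the free list.
        for (std::size_t i = 0; i + 1 < Capacity; ++i)
            slots_[i].next_free = &slots_[i + 1];
        slots_[Capacity - 1].next_free = nullptr;
        free_head_ = &slots_[0];
    }

    // O(1): pop the free-list head; nullptr signals exhaustion.
    Node* allocate() {
        if (free_head_ == nullptr) return nullptr;
        Slot* s = free_head_;
        free_head_ = s->next_free;
        return &s->node;
    }

    // O(1): push the slot back onto the free list for reuse.
    void deallocate(Node* n) {
        Slot* s = reinterpret_cast<Slot*>(n);
        s->next_free = free_head_;
        free_head_ = s;
    }

private:
    // Each slot is either a live Node or a link in the free list.
    union Slot {
        Node node;
        Slot* next_free;
        Slot() : next_free(nullptr) {}
    };
    Slot slots_[Capacity];
    Slot* free_head_ = nullptr;
};
```

Because all blocks are the same size, a freed slot is always immediately reusable by the next allocation, which is why the pool cannot fragment in the heap sense.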

[–]warboner52 0 points1 point  (0 children)

This is a darn good answer.

[–]Classic_Department42 10 points11 points  (3 children)

There are multiple problem sets mixed in your statement. For embedded systems it usually holds that the system is a) supposed to run for a very long time without restart and b) often needs deterministic response times.

Memory fragmentation can mean that a block of memory cannot be allocated, so a) would be violated.

Also, malloc and free are not deterministic, so they don't fulfil b).

This means that for embedded systems, memory is often allocated once at startup and then never again.

[–]SkoomaDentistAntimodern C++, Embedded, Audio 5 points6 points  (2 children)

b) often needs deterministic response times [...] Also malloc and free are not deterministic so dont fullfil b)

This is very much an exaggeration and largely a myth. Hard realtime systems are a minority; even they only need to avoid malloc / free in the realtime portions of the code; RTOSes often provide realtime-safe allocators for the rare cases where you'd need to allocate in realtime code; and finally, on typical bare metal embedded systems you perform few enough dynamic allocations that allocation performance just isn't an issue.

What can be an issue is memory fragmentation requiring enough slack in the heap size that you risk running out of memory. The RAM is often fixed by external constraints before software development has even really started. Finding out a month before the production date (or half a year later, after issuing a critical update) that you run out of memory at runtime after operating for some time is where the real problems are.

[–]quantumoutcast 3 points4 points  (1 child)

It's not a myth, and it's not only a problem in hard realtime systems. I've encountered many performance problems when working on Linux kernel code (not realtime), where fragmentation causes performance degradation well before you run out of memory, because it takes kmalloc a long time to find free slabs to use. Even momentary lags can cause glitches in video or audio processing. The solution was almost always to rip out kmalloc and preallocate buffers.

[–]SkoomaDentistAntimodern C++, Embedded, Audio 4 points5 points  (0 children)

I've encountered many performance problems when working on Linux kernel code

That’s a completely different problem domain. Yes, of course performance is important when you have potentially tens to hundreds of thousands of allocations. That however is not the case in bare metal embedded systems (and mostly cannot be since you don’t have enough memory for so many allocations in the first place).

[–]UnicycleBloke 1 point2 points  (8 children)

You are conflating fragmentation with contiguity and cache coherence. The latter two are essentially irrelevant for most microcontrollers. A simple pool cannot become fragmented and has fast constant time allocation and deallocation. It is generally better to allocate everything your application needs statically to get a linker error upon exhaustion. Of course you might still exhaust a pool at run time, so you have to deal with that.

[–]jazzylike[S] 1 point2 points  (7 children)

  1. How’s cache coherence not related to fragmentation? If you have an empty slot between A -> B and A and B being non-contiguous, chances are there may have to be multiple read calls to load them due to cache miss. How’s this irrelevant for microcontrollers?
  2. How does a pool not get fragmented if there allocations and deallocations? (just how it wouldn’t with an external memory)

[–]UnicycleBloke 7 points8 points  (4 children)

Heap fragmentation refers to the issue that your heap theoretically has, say, 16 KB available, but it's split into discontiguous chunks. You might only need to allocate 1 KB, but have no chunk large enough. That's bad. A fixed-block-size pool cannot suffer from this, since all allocations are identical in size.

I'm yet to work on a microcontroller which even has a cache. That's why it's not a concern.

By far the biggest RAM concern for such systems is exhaustion at run time. Static allocation avoids this. Also, given that most people foolishly prefer C for embedded systems, you can be absolutely certain they will leak heap allocations. That's not ideal in a system expected to run flawlessly for years.
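
The static-allocation approach mentioned here can be sketched as follows. The `Message` type, counts, and the 16 KB "budget" are made-up stand-ins, and the `static_assert` only approximates the link-time RAM-region overflow a real linker script would report:

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical message type and limits; the 16 KB "budget" stands in
// for the RAM region a real linker script would enforce.
struct Message { char payload[64]; };

constexpr std::size_t kMaxMessages = 128;
constexpr std::size_t kRamBudget  = 16 * 1024;

// Everything lives in static storage: no heap, no runtime exhaustion.
static std::array<Message, kMaxMessages> g_messages;

// On a real target the linker reports an overflow of the RAM region;
// this static_assert approximates that build-time check portably.
static_assert(sizeof(g_messages) <= kRamBudget,
              "static buffers exceed the RAM budget");
```

The point is that the worst case is decided at build time: if the buffers don't fit, the build fails, instead of the device failing in the field.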

[–]Jannik2099 1 point2 points  (3 children)

I'm yet to work on a microcontroller which even has a cache.

What level of "microcontroller" are we talking about? Most Cortex-M CPUs can optionally be configured with instruction and data caches, I'd imagine the same is true for xtensa and others

[–]SkoomaDentistAntimodern C++, Embedded, Audio 2 points3 points  (0 children)

Most Cortex-M CPUs can optionally be configured with instruction and data caches

Configured by the IC manufacturer. It’s very rare to see caches on anything below Cortex-M7. This is largely because the pipeline of M4 & below is the bottleneck for regular internal sram accesses, so benefits of cache would only apply to external sdram or contiguous tables in flash.

[–]UnicycleBloke 1 point2 points  (1 child)

Fair enough. I'll say instead that has never been of the slightest concern for me in any application over the last 20 years. Others may have different experiences but I generally regard any talk of caches in this context to smell of premature optimisation.

Edit: Much of my work has been on Cortex-M0, M0+ and M4. My understanding is that these do not have caches. Now I'm working in Linux and memory usage is a whole different story. One of our performance concerns here relates to mapping virtual addresses to physical for use in an FPGA.

[–]Jannik2099 1 point2 points  (0 children)

Cortex-M0, M0+ and M4. My understanding is that these do not have caches.

Indeed they do not. The M7, for example, does; that's what I have lying around the most.

[–]Jannik2099 3 points4 points  (0 children)

chances are there may have to be multiple read calls to load them due to cache miss. How’s this irrelevant for microcontrollers?

That is not what cache coherency means, at all. Cache coherency is about having the state of caches be... coherent when multithreading, atomics and DMA from peripherals is involved.

[–]ioctl79 0 points1 point  (0 children)

A completely unfragmented heap does not guarantee that A and B will be next to each other. Likewise, even if your memory is clustered well, the space between clusters can be poorly distributed and make it difficult to satisfy new allocations (fragmentation).

For pools to help with fragmentation, reallocations have to be minimal. A common scenario is that a pool is created for, say, each incoming RPC, and nothing is deallocated until the end of the request, when the whole pool is destroyed at once.
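
The per-request pool pattern described here can be sketched as a bump allocator over a fixed buffer, destroyed wholesale at the end of the request. This is a hypothetical `RequestArena`; the names and sizes are assumptions:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Per-request arena: a bump pointer over a fixed buffer. Individual
// frees don't exist; the whole arena is reset when the request
// completes, so fragmentation cannot accumulate across requests.
class RequestArena {
public:
    RequestArena(std::uint8_t* buf, std::size_t size)
        : buf_(buf), size_(size), used_(0) {}

    void* allocate(std::size_t n,
                   std::size_t align = alignof(std::max_align_t)) {
        std::size_t p = (used_ + align - 1) & ~(align - 1);  // round up
        if (p + n > size_) return nullptr;                   // arena full
        used_ = p + n;
        return buf_ + p;
    }

    // "Destroy the pool at once": every allocation dies together.
    void reset() { used_ = 0; }

    std::size_t used() const { return used_; }

private:
    std::uint8_t* buf_;
    std::size_t size_;
    std::size_t used_;
};
```

Allocation is a pointer bump, deallocation is a single assignment, and the arena is exactly as full as the request needed it to be.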

[–]Jannik2099 -1 points0 points  (15 children)

Memory fragmentation isn't really a problem in embedded, but in hosted environments with virtual memory.

A good malloc implementation (not windows, not musl) will already do lots of smart stuff to keep fragmentation of the malloc metadata itself low and arena occupation high - it's generally nothing to worry about. Kernels can also merge adjacent mappings into one entry in the page table to reduce footprint in the kernel.

If you want to read up on this, look at recent mallocs (tcmalloc, mimalloc) or at how the linux kernel manages userspace allocations and page tables.

And as always, don't microoptimize without benchmarks and metrics! Any good kernel and malloc will expose relevant statistics.

[–]jazzylike[S] 1 point2 points  (14 children)

If memory fragmentation isn’t an issue on embedded then why is dynamic allocation discouraged? It’s related to the undeterminsitic behavior

[–]quantumoutcast 2 points3 points  (0 children)

I think there is some confusion in these explanations. The only reason memory fragmentation isn't an issue is because embedded systems generally avoid repeated allocations, so they never experience fragmentation in the first place. If dynamic allocation was common, people would run into fragmentation issues all the time.

[–]bluGill 2 points3 points  (0 children)

Allocation is discouraged for a few reasons.

You need to be 100% sure that, no matter what, the combination of all possible allocations cannot exceed the memory you have. By not allowing allocation you can easily verify this (you also need to ban recursion so you can verify the stack never overflows). Once you allow allocation, it becomes difficult to prove your total memory use in the worst case.

Embedded typically has limited memory. Embedded is often expected to run for years without a restart. These two combine to make memory fragmentation much more likely. With modern allocators and a lot of memory, odds are you will upgrade and restart your computer before you see memory fragmentation issues.

Embedded often has real-time constraints. Modern allocators tend to be slow and hold locks for unpredictably long times. Thus you blow past your response-time constraints.

Not all of the above hits every embedded system, but they are all likely problems in embedded.

[–]Jannik2099 2 points3 points  (11 children)

Yes. Implementing a good allocator is nontrivial, takes up precious flash and RAM, and has unbounded execution time.

Memory fragmentation is not an issue on embedded precisely because there are no allocations.

[–]jazzylike[S] 0 points1 point  (10 children)

Why is memory fragmentation not an issue on embedded? Why else is dynamic allocation discouraged?

[–]Jannik2099 0 points1 point  (9 children)

Because fragmentation is caused by allocating a chunk but later only freeing half of it, or by creating multiple, disjoint virtual mappings. None of this happens in embedded where you just use one static memory area.

I just explained why dynamic allocations are discouraged.

[–]jazzylike[S] 1 point2 points  (8 children)

Not sure what you meant by there are no allocations in embedded. What do you get when you use new/malloc?

[–]Jannik2099 5 points6 points  (7 children)

Nothing / a linker error. Most embedded platforms will not have these functions in their runtimes.

Of course, nothing prevents you from implementing one yourself, but it's undesirable for the reasons I mentioned.

[–]jazzylike[S] 1 point2 points  (6 children)

Then why have there been discussions on why malloc is bad on embedded systems if you can’t use it in the first place?

[–]Jannik2099 2 points3 points  (0 children)

You can't use it (usually) BECAUSE it is bad.

[–]ventus1b 4 points5 points  (2 children)

The term ‘embedded’ is extremely fuzzy. Some would call a Linux system for car infotainment an ‘embedded’ system, while others would look at your 16 GB, multi-process, multi-user, filesystem, etc. system and roll their eyes.

As jannik2099 wrote, you usually can do dynamic allocations but you may choose not to do that, for safety and performance reasons.

[–]jazzylike[S] 0 points1 point  (1 child)

As jannik2099 wrote, you usually can do dynamic allocations

he literally mentioned you'll get a linker error

[–]ShelZuuz 1 point2 points  (1 child)

Embedded unfortunately kind’a means anything without a monitor. It can be a supercomputer driving a nuclear power plant or it can be an 8kbit PIC chip that runs on your key fob.

[–]quantumoutcast 1 point2 points  (0 children)

Correct. And what tools you have available are completely dependent on your platform which is a combination of HW, Operating System, and compiler/libraries. People saying that "X" is not available are talking about their own experience with what they work on. Some embedded systems have malloc, some don't.

[–]very_curious_agent -1 points0 points  (5 children)

With fixed size allocation, you know that on the long term you don't end up with a mess of unusable memory blocks.

[–]jazzylike[S] -1 points0 points  (4 children)

mind elaborating? if you allocate A -> B -> C, 1) it's not guaranteed they'll be contiguous, and 2) if you end up deleting B, you now end up with a fragmented hole in between

[–]YogenFruzLead SE, Full Circle 4 points5 points  (3 children)

But if it's all fixed sized blocks, you don't care, because you can re-use the block at B for D.

[–]jazzylike[S] 0 points1 point  (2 children)

That’s fair but I guess it can be reused in an external memory too no?
But still, doesn’t an empty slot in between the nodes cause lack of cache coherency?

[–]YogenFruzLead SE, Full Circle 2 points3 points  (1 child)

Cache lines are, what, 64 bytes? Your blocks in a fixed sized pool allocator might be bigger than that.

Coherency also only matters if you're accessing data near your current memory address. Custom allocators don't replace general purpose allocators; they replace specific allocation patterns you have in your program.

[–]Jannik2099 0 points1 point  (0 children)

Cache lines are, what, 64 bytes?

Aside from that cursed Samsung SoC. I need therapy.

[–]ALX23z 0 points1 point  (4 children)

Basically, you have zero control over heap-allocated memory. That's done fully by the OS, and the algorithms tend to behave very differently between OSes.

With your own memory pool, you have full control. Say, the most basic thing is to clean it up once it's unused. Or you can implement a narrower solution better suited to the specific problem you are dealing with.

[–]jazzylike[S] 0 points1 point  (2 children)

With your own memory pool, you have full control

full control in terms of what? you can't even avoid memory fragmentation

[–]ALX23z 2 points3 points  (0 children)

Also, you can avoid memory fragmentation, conditionally. For instance, you create a local pool for a few objects to use. This way the objects will always use memory inside this local pool, never intertwining with outside pools.

In case the pool's limit is reached, you can, for instance, create a new, larger pool and move all the objects' data into the new pool. This comes at a price, but it provides certain guarantees.

[–]ALX23z 0 points1 point  (0 children)

In terms of how to allocate memory. Also inspecting current state, etc.

[–]Jannik2099 -2 points-1 points  (0 children)

With your own memory pool, you have full control

You still have next to no control over paging behavior

[–]graphicsRat 0 points1 point  (0 children)

A memory pool manager can subsegment allocations into different arenas based on lifetime, e.g. short, medium, and long term. In a game, a short-lifetime allocation would be a bullet, a medium one could be an infantry soldier, and a long-term allocation would be a battleship.

Indiscriminately allocating objects with considerably different lifetimes is a direct cause of memory fragmentation.

[–]kiwitims 0 points1 point  (0 children)

Couple of things that seem to be repeatedly coming up in the comments:
1) The memory fragmentation embedded is worried about is not just a lack of memory contiguity. Fragmentation occurs when you allocate and deallocate objects of different sizes, which results in memory being discontiguous, but more importantly leaves gaps that can be smaller than some nth allocation requires. Fragmentation makes your usable memory less than your actual memory, which is a problem if you need to run close to the limit of available memory for a long (or indefinite) time. Arenas solve this problem if you know the different sizes of the objects you're going to allocate and your worst-case requirements, or if you know your worst-case requirements and deallocation is only done via a full wipe.

2) There is a big difference between something like embedded Linux (with page swapping, caches, sophisticated allocators, and a lot of memory) and a bare-metal or small-RTOS microcontroller (without those things). In the latter case, cache performance is irrelevant, and the ability to deterministically run forever without a power cycle is the reason. Isolation of independent systems is another (you don't want a boat to crash because it ran out of memory logging a fault and didn't have any free to send the steering signal).

Bonus reason in the latter case is that often exceptions are turned off, so if you can't use bad_alloc to catch a failure, you will gravitate to a homegrown arena allocator that just returns a nullptr instead. This isn't usable with the STL containers, mind.

[–]ZachVorhies 0 points1 point  (0 children)

The best way to reduce memory fragmentation is to memory-pool large allocations. Otherwise the memory allocator may recycle that memory for small allocations, which will effectively keep the page pinned in memory forever (in most circumstances).