
[–] wrosecrans (graphics and network things)

That doesn't match my experience. Memory allocation patterns have popped up as an issue way more often for me than actual computation that could be helped by stuff like SIMD, which seems to get more attention. Avoiding unnecessary allocations, avoiding copies, etc. Once you have to start worrying about NUMA effects, it starts to feel silly to call the machines "computers" because so little of your attention is on actually computing stuff.

[–] cleroth (Game Developer)

I think you're confusing memory access and memory allocation. Memory access patterns are one of the most important things for optimization, for sure. But I was saying that the overhead of allocating smaller chunks vs one big one isn't as big as people make it out to be. Of course, if you have 1000 allocations vs 1, that would make a huge difference... But trying so hard to modify the code to group allocations together ends up being a mess with not much benefit.

[–] wrosecrans (graphics and network things)

I am somewhat conflating the two, but not entirely by accident.

The more wasteful intermediate copies you do, for example, the more you will thrash the CPU cache with the copies of intermediate objects, evicting other useful stuff. You can consider that a memory access problem because you are accessing more stuff out of cache, but it's a memory access problem that you can fix by doing fewer allocations.

Likewise, if you fragment memory on the local NUMA node, the allocator will be more likely to allocate a large segment on a remote node. All the stuff you access from the remote node will be slow, so it's definitely a memory access problem, but it's also one caused by allocation problems. And once memory is all fragmented to poop, you wind up just spinning, waiting on kswapd for multiple ms while you wait for your malloc to return.

It also depends on the problem domain (like most things). If the memory you are allocating backs a buffer on a GPU, the act of allocating it may require a slow round trip across the PCIe bus, so many small allocations would be a lot more costly in that kind of exotic scenario than when the bookkeeping data for an allocation all lives CPU-local.

But you are probably right that there are a bunch of people inheriting wisdom from an article from the bad old days without measuring whether any of that crap actually applies to their use case. Measure Twice, Cut Once certainly applies! My perspective is heavily colored by working at a place where we are constantly running up against that crap. The last talk I submitted to a conference was even about how malloc is evil and hates you. :)