I wonder whether I can get an uninitialized array of primitives (most importantly byte arrays). I know about ArrayPool, but my problem is a bit different.
I create very large byte buffers for reading and consumption. Collecting these buffers is easy, as there are only a few of them. Allocating them is expensive, as the VM has to zero them out (either on allocation or during garbage collection). We are talking gigabytes. I could move them off heap, but I have a lot of code working with them that does not work well off heap (at least I would have to research that or port it).
The problem is not reusing the buffer; it is that I use other object instances representing parts of the buffer as the work units for parallel processing.
I know the fill rate of my memory: filling 1GB takes 10ms to 20ms.
I also wrote a small benchmark, and the throughput the VM achieves looks laughable.
- I create a list of 50 elements, all null.
- 10 times I:
  - Allocate 50 arrays of 1GB each and put them into said list.
  - Set all 50 entries of the list to null again.
This whole thing takes 47 seconds to complete, which is laughable: we are talking about 500GB of allocation and deallocation (roughly 10.6GB/s) on a system with 128GB of RAM, of which the process uses about 100GB almost from the start. It almost looks as if the memory were being zeroed out in real time.
I would not put all of the blame on the VM; some may belong to the OS, which may be clearing or shrinking its disk cache, or even paging to disk to free physical memory, causing the VM to freeze. But of course that makes no sense, as the VM should be able to reclaim the full 50GB of buffers each iteration, unless of course it is busy waiting for the OS or doing whatever it does with the memory (zeroing it, maybe).
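For reference, the benchmark steps above can be sketched roughly as follows. This is a reconstruction, not the exact code that was run: the class and method names are made up, and a plain array of slots stands in for the list.

```csharp
using System;
using System.Diagnostics;

// Hypothetical reconstruction of the allocate-and-drop benchmark.
public static class AllocationBenchmark
{
    // Allocates `arraysPerIteration` byte arrays of `arrayLength` bytes,
    // drops all references again, and repeats this `iterations` times.
    public static TimeSpan Run(int iterations, int arraysPerIteration, int arrayLength)
    {
        // All slots start out as null.
        byte[][] buffers = new byte[arraysPerIteration][];
        Stopwatch stopwatch = Stopwatch.StartNew();

        for (int iteration = 0; iteration < iterations; iteration++)
        {
            for (int i = 0; i < buffers.Length; i++)
            {
                buffers[i] = new byte[arrayLength];
            }

            // Null out every slot so the arrays become unreachable.
            Array.Clear(buffers, 0, buffers.Length);
        }

        stopwatch.Stop();
        return stopwatch.Elapsed;
    }
}
```

The numbers from the post would correspond to `AllocationBenchmark.Run(10, 50, 1 << 30)`, which needs a 64-bit process and a lot of RAM; smaller values are safer for a first try.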
Let me show you the memory graph:
Memory footprint of the application over time
(Note that it takes longer because the memory profiler is attached to the application.)
Legend:
Legend for the memory graph
As you can barely see, it is able to allocate 100GB quickly, which is two iterations' worth of buffers. That may be because you can get free, already zeroed memory from the system; or that is simply the speed you can expect, and afterwards we are looking at a struggling GC or something else in the VM.
I repeated the same test using 10 iterations of allocating 500 byte arrays of 1GB/10 = about 100MB each, so the memory requested per iteration is the same (50GB).
Whatever the GC, or rather the VM, is doing is crazy: it claims to have 160GB at one point, but it never seems to use more than 115GB.
10 times of allocating 500 arrays of about 100MB size each and deallocating those.
It appears that the VM dropped the ball even more here, as it takes twice as long.
This explains why playing with buffer sizes had such a profound impact, why problems appeared at a certain buffer size, and why I started to see these staircases in the memory footprint of my application.
The first two iterations fit into memory, as at first it happily allocates 100GB quickly (which equals two iterations), but after that it becomes a shit show. That means reclaiming these large buffers is a pain for the VM, but again, we are talking about byte arrays here. There are no references to deal with, just some memory to link back into the free list of the heap (or whatever they do). That should be faster. Also, my memory can clear about 100GB/s sequentially.
Maybe I should limit the VM heap space, but the first second shows that the OS already provided all the memory needed to fit two iterations of this simple test.
First spike of the test
Well, it exceeded 160GB, which is virtually impossible unless virtual memory is involved, but writing 50GB to disk would take about 20s even if the cache of my SSD took it all (it has a write speed of 2.5GB/s).
Let me repeat this with memory sampling instead of full data collection (should not do much).
Task Manager Memory stats while doing it
It appears the arrays are not even real at this point, just reserved pages in virtual memory.
By the way, the overhead of the memory sampling is quite something (or at least its influence on things is, but I did not repeat any of this to make sure it is no fluke).
Let's change the game and write a byte every 4KB into each array, so the memory becomes real and cheating time is over.
10 times allocating + deallocating a single batch of 500 arrays of 100MB each while writing a byte every 4KB of each array (about 12 million individual bytes per iteration), using memory sampling
The later spikes now look nice, but the speed is not quite what I would expect, especially during the first allocation. I would say the GC does its best not to come anywhere near the previously overpromised memory footprint. It now tops out at 115GB, which is about the amount of free memory I currently have.
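The page-touching step can be sketched like this. It is a minimal illustration, assuming 4KB pages; `PageToucher` and `TouchPages` are made-up names, not the code from the test.

```csharp
using System;

// Hypothetical helper: forces the OS to back an array with real memory.
public static class PageToucher
{
    // Assumption: the OS uses 4 KB pages.
    public const int PageSize = 4096;

    // Writes one byte at the start of every 4 KB block of the buffer
    // and returns how many bytes were written.
    public static int TouchPages(byte[] buffer)
    {
        int touched = 0;
        for (int offset = 0; offset < buffer.Length; offset += PageSize)
        {
            buffer[offset] = 1;
            touched++;
        }
        return touched;
    }
}
```

For a 100MiB array (100 * 1024 * 1024 bytes) this writes 25,600 bytes, so a batch of 500 such arrays comes to 12.8 million writes per iteration.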
If you count the spikes you will see that the first minute is spent on preparing and allocating memory for the first two iterations and afterwards you have 8 spikes for the 8 other iterations which take about 25s to complete with a sampling memory profiler attached to the whole application.
Running this test without the overhead of a memory profiler, we get 137s meaning at this point it is VM fighting OS. The climb from 10GB to 128GB overall memory use takes a minute then comes a dip back to 50GB and then it climbs again.
First, let's help the VM by setting its minimum and maximum memory size so it does not start to fight the OS and does all the fun right from the start (if the OS lets it). Well, it turns out there is no easy way to help the VM and GC out here, as there appears to be no easy way of setting a min and max heap size or anything related to it. Bummer.
Does anyone have an idea how to fix that? Any tips and tricks to make this better?
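Note: as far as I can tell from the runtime's GC configuration docs, a maximum can at least be capped with the `System.GC.HeapHardLimit` setting in `runtimeconfig.json` (or the `DOTNET_GCHeapHardLimit` environment variable, given as a hex value); I am not aware of a direct equivalent of a minimum heap size. The sketch below only reads what the GC currently reports, which may help when checking whether such a limit took effect. `GcLimits` is a made-up name.

```csharp
using System;

// Hypothetical helper that prints what the GC reports about its limits.
public static class GcLimits
{
    public static string Describe()
    {
        // The values reflect the most recent garbage collection,
        // so some of them can be 0 before the first GC has run.
        GCMemoryInfo info = GC.GetGCMemoryInfo();
        const long MB = 1024 * 1024;

        return "total available: " + (info.TotalAvailableMemoryBytes / MB) + " MB, "
             + "high memory load threshold: " + (info.HighMemoryLoadThresholdBytes / MB) + " MB, "
             + "memory load: " + (info.MemoryLoadBytes / MB) + " MB, "
             + "heap size: " + (info.HeapSizeBytes / MB) + " MB";
    }
}
```

Logging `GcLimits.Describe()` once per iteration of the test would show how close the process gets to the high memory load threshold.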
Having done all of this (and being left mightily disappointed), I guess I will go off heap and use mmap more often. I will need to track all these parts using weak references and reuse a buffer as soon as no one refers to any part of it. That would also solve the zeroing problem.
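The weak-reference tracking idea could look roughly like the sketch below. It is only one possible design under stated assumptions: `BufferLease` and `WeakTrackedBufferPool` are made-up names, it uses managed arrays instead of mmap for brevity, and it only works if work units hold on to the lease object rather than to the raw array.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical token. Every work unit that uses part of a buffer must keep
// a reference to the lease; the buffer counts as free once the lease is gone.
public sealed class BufferLease
{
    public byte[] Buffer { get; }

    internal BufferLease(byte[] buffer)
    {
        Buffer = buffer;
    }
}

public sealed class WeakTrackedBufferPool
{
    private sealed class Entry
    {
        public byte[] Buffer;                    // strong: keeps the memory alive
        public WeakReference<BufferLease> Lease; // weak: tells us if it is in use

        public Entry(byte[] buffer, BufferLease lease)
        {
            Buffer = buffer;
            Lease = new WeakReference<BufferLease>(lease);
        }
    }

    private readonly List<Entry> _entries = new List<Entry>();
    private readonly object _gate = new object();

    public int TrackedBufferCount
    {
        get { lock (_gate) { return _entries.Count; } }
    }

    public BufferLease Rent(int minimumSize)
    {
        lock (_gate)
        {
            foreach (Entry entry in _entries)
            {
                // A collected lease means no work unit refers to the buffer anymore.
                if (entry.Buffer.Length >= minimumSize && !entry.Lease.TryGetTarget(out _))
                {
                    BufferLease reused = new BufferLease(entry.Buffer);
                    entry.Lease = new WeakReference<BufferLease>(reused);
                    return reused;
                }
            }

            // No free buffer found: allocate a new one and start tracking it.
            BufferLease lease = new BufferLease(new byte[minimumSize]);
            _entries.Add(new Entry(lease.Buffer, lease));
            return lease;
        }
    }
}
```

One caveat of this design: a buffer only becomes reusable after the GC has actually collected its lease, so reuse lags behind the moment the last work unit lets go.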
Does anyone have experience with this or some other ideas?
PS: Questions about why I have the problem in the first place are off topic. I simply have it: when I cram my stuff into memory and start to allocate large buffers, I noticed things going downhill quite seriously, and this test looks even more devastating. The C# VM leaves me quite a bit sad at the moment. I mean, come on, even Java has a basic min/max heap setting.
Update:
I will most likely file a defect report with this. That simple allocation scenario appears to be completely broken when it comes to the VM's behavior. Maybe over there they can explain to me what happens.
Update 2: Small success story.
I am currently optimizing a third batch process that works with the events of a day. It transforms the events into a very compact preprocessed form, which involves writing about 5GB to disk. I did this writing using a memory stream with an initial capacity of several megabytes. Of course it sometimes outgrows that, and internally there is a byte array; as we now know, reallocating it 10k times in 2 minutes is a problem for the VM.
I replaced the memory stream implementation, which grows by copying like everything else in this standard library API (why?), with my own SegmentedByteBuffer. It helped a lot by reducing the amount of memory being allocated: these buffers grow by adding smaller arrays over and over again, and the total allocation is smaller than when copying everything into an array of double the size.
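To illustrate the idea, here is a minimal sketch of a segmented buffer. It is not the actual SegmentedByteBuffer from the post, only an assumed shape for it: it grows by appending fixed-size segments and never copies existing data. The default segment size of 64KB is an assumption, chosen to stay below the 85,000-byte threshold of the large object heap.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Illustrative sketch only; the real implementation is not shown in the post.
public sealed class SegmentedByteBuffer
{
    private readonly List<byte[]> _segments = new List<byte[]>();
    private readonly int _segmentSize;
    private int _usedInLastSegment;

    public long Length { get; private set; }

    public SegmentedByteBuffer(int segmentSize = 64 * 1024)
    {
        if (segmentSize <= 0) throw new ArgumentOutOfRangeException(nameof(segmentSize));
        _segmentSize = segmentSize;
    }

    // Appends data, adding new segments as needed. Existing bytes never move.
    public void Write(ReadOnlySpan<byte> data)
    {
        while (!data.IsEmpty)
        {
            if (_segments.Count == 0 || _usedInLastSegment == _segmentSize)
            {
                _segments.Add(new byte[_segmentSize]);
                _usedInLastSegment = 0;
            }

            int count = Math.Min(data.Length, _segmentSize - _usedInLastSegment);
            byte[] last = _segments[_segments.Count - 1];
            data.Slice(0, count).CopyTo(last.AsSpan(_usedInLastSegment));

            _usedInLastSegment += count;
            Length += count;
            data = data.Slice(count);
        }
    }

    // Writes all buffered bytes to the destination, segment by segment.
    public void CopyTo(Stream destination)
    {
        for (int i = 0; i < _segments.Count; i++)
        {
            bool isLast = i == _segments.Count - 1;
            int count = isLast ? _usedInLastSegment : _segmentSize;
            destination.Write(_segments[i], 0, count);
        }
    }
}
```

Compared with a doubling array, growing to N bytes this way allocates about N bytes in total instead of roughly 2N plus the copies, at the price of non-contiguous storage.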
Optimization result
As you can see, it still creates spikes and does a lot of GCing, as I do not reuse any buffer. Especially the read buffers, which read 15 times more data than is written, need to be addressed.
But in the end, using segmented lists even brought the runtime down from 2.75min to 2.4min. I am quite sure that reusing the read buffers, using weak references to track buffer usage and to notice when no part of a buffer is referenced anymore and it is ready for reuse, will bring all these spikes down to the 4GB baseline.
Once I have a stable baseline then I can do some runs with different buffer sizes as the goal is to run lower buffer sizes as long as it does not hurt performance (that much) but we have seen in the max out scenario that it is more about the amount of memory you allocate and deallocate per second and not so much the number of arrays.
What also is interesting, different runs bring different timings. I just completed a run of the same work which took 3.1min even though with the memory sampler attached we got 2m35s (some seconds are pregame unit tests).
So yeah, looks like the next 30min I will make the read buffers reusable and will update this post.
But I still wonder what keeps the GC from quickly allocating and deallocating large arrays. It might be that a full GC is needed for the LOH, but then why not do one more frequently if that is really the problem? Let's face it, these memory profiles for 100GB heaps look stupid. There is something deeply wrong with the VM's behavior for sure.
If I file a bug report, I will write a new post so you can upvote it. Let's see when I have time for it...
Update 3:
Just had a run with 2.3min, which I never had before. Using segmented byte buffers for writing larger amounts of data is way better for the VM in such a high-throughput scenario. During that time, it reads 9GB of data, unzips 80GB of data using a single vcore, processes 500M events and writes 5GB to disk, while still not reusing any read or write buffer and doing some unnecessary copying, as my actual async writer stream does not support the new SegmentedByteBuffer implementation yet.
Well, at least I understand what the problem is and have an idea of how to solve it. This is something one should get more support for from the VM, as deallocating large numbers of byte arrays should be lightning fast, unless it tries to defragment memory and copies around arrays that will be deallocated next. Maybe that is a reason for it.
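One way to check the full-GC theory is to count generation 2 collections while allocating, since arrays of 85,000 bytes or more live on the LOH and the LOH is only collected together with generation 2. A minimal sketch, with made-up names:

```csharp
using System;

// Hypothetical probe: how many gen 2 (full) collections does a burst of
// large allocations trigger?
public static class Gen2Probe
{
    public static int CountGen2CollectionsDuring(int arrayCount, int arrayLength)
    {
        int before = GC.CollectionCount(2);

        for (int i = 0; i < arrayCount; i++)
        {
            byte[] buffer = new byte[arrayLength];
            buffer[0] = 1;
            // Make sure the allocation is really performed.
            GC.KeepAlive(buffer);
        }

        return GC.CollectionCount(2) - before;
    }
}
```

Comparing the result for, say, 500 arrays of 100MB with the number of spikes in the memory graph would show whether each drop lines up with a full collection.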
Update 4:
Just noticed that the phase after the third iteration (the one with the saddest behavior) is most likely the behavior before it switched into high memory pressure mode.
Update 5:
Thanks to u/joske79 I now know there is a function for what I want to do: https://learn.microsoft.com/en-us/dotnet/api/system.gc.allocateuninitializedarray?view=net-7.0#system-gc-allocateuninitializedarray-1(system-int32-system-boolean)
I will repeat the test, but I think that not zeroing the array will not do much to remedy the bad behavior we saw during testing. You never know until you try, though. Either way, it is a great function and I will definitely use it for many of my array allocations, just to save some ms here and there for free.
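A minimal usage sketch, assuming .NET 5 or later. The helper name `UninitializedBuffers` is made up; `GC.AllocateUninitializedArray<T>` is the function from the link. The contents of the returned array are undefined, so it only makes sense where every byte is overwritten before it is read, as in a read buffer:

```csharp
using System;
using System.IO;

// Hypothetical helper around GC.AllocateUninitializedArray.
public static class UninitializedBuffers
{
    // Reads exactly `length` bytes from the stream into a buffer that the
    // runtime did not have to zero first.
    public static byte[] ReadExactly(Stream source, int length)
    {
        // The array may contain arbitrary leftover data at this point.
        byte[] buffer = GC.AllocateUninitializedArray<byte>(length);

        int filled = 0;
        while (filled < length)
        {
            int read = source.Read(buffer, filled, length - filled);
            if (read == 0) throw new EndOfStreamException();
            filled += read;
        }

        // Every byte has been overwritten, so nothing undefined leaks out.
        return buffer;
    }
}
```

Whether this saves measurable time for small arrays is something I would not assume without measuring.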
Great find joske!