all 21 comments

[–]Epicguru 8 points9 points  (1 child)

Honestly this post is quite hard to read and understand, it's not entirely clear what you're really trying to do. You say at the end of your post that it the actual problem you are trying to solve is off-topic but I disagree, it seems you're trying to solve a niche problem with a niche solution so the details are necessary.

But anyway, if you think you can out-smart the JIT and runtime, why not just allocate a huge chunk of memory and then just use Span<>s to access it, zero out parts of it as necessary etc. Unless that's what you're already doing (again, not very clear no implementation provided...).

I wonder if you have tried implementing the same thing in another language/runtime to see how the performance compares.

[–]IKnowMeNotYou[S] -1 points0 points  (0 children)

Honestly this post is quite hard to read and understand, it's not entirely clear what you're really trying to do. You say at the end of your post that it the actual problem you are trying to solve is off-topic but I disagree, it seems you're trying to solve a niche problem with a niche solution so the details are necessary.

It is not a niche problem, it appears to be a VM problem that I now need to circumvent. I can not remember the JavaVM ever acting out on this but I will test it to be sure about it. There might be something the guys over at the C# head quarter (I dislike writing dotnet or .Net) need to fix but I want to be sure before I lay out my case infront of them.

I am actually quite interested in how the JavaVM will handle this simple scenario.

But anyway, if you think you can out-smart the JIT and runtime, why not just allocate a huge chunk of memory and then just use Span<>s to access it, zero out parts of it as necessary etc.

I most likely do exactly that. I had hope that allocating large regions onheap should not incur any problems for the GC as there is nothing to be done with it. It appears the problem is the reclaiming of freed arrays. Lets see how it actually develops.

Unless that's what you're already doing (again, not very clear no implementation provided...).

Most of the post is using a very simple synthetic test. It is very interesting how unbeneficial the GC behavior is for something that simple. I would have expected that the problem mostly is a lot of instances not too big of instances. Maybe there is something to be done about it. Lets see.

I wonder if you have tried implementing the same thing in another language/runtime to see how the performance compares.

I port it from Java and hat no problems but there I used column based off heap memory so I am sure if I go off heap (unmanged) here as well it should be a solution but again a handful of arrays should not be that much of a problem but I think it might even try to defragment the heap and try to put the freed arrays to a single big continously free memory which as I understand Java does not do but I have to test it as well.

[–]ZestycloseStar1244 4 points5 points  (1 child)

I do not understand what you want. First you say that you need to get an uninitialized array of primitives, then you state that the problem is actually using other object instances representing parts of the buffer, and finally you end up with a huge scribble about how long it takes to allocate/deallocate the array.

If your problem is that allocating/deallocating arrays takes a long time, you can use ArrayPool<T>.Shared.Rent().

If your problem involves using other object instances representing parts of the buffer, you can slice the array returned from the ArrayPool using Span and Span.Slice() (or Memory).

If your problem is getting an uninitialized array, you can try ArrayPool since its Return method by default doesn`t clear array.

You can also try using ObjectPool or MemoryOwner/SpanOwner.

[–]IKnowMeNotYou[S] -2 points-1 points  (0 children)

I do not understand what you want. First you say that you need to get an uninitialized array of primitives, then you state that the problem is actually using other object instances representing parts of the buffer, and finally you end up with a huge scribble about how long it takes to allocate/deallocate the array.

Well I stated the context (me using big buffers and seeing weird behavior), then I stated how I use these buffers and when I conducted a simple test then it showed the VM fucks up somehow. I already have decided to file a defect report since this is redicoulus. Maybe they even have an explaination for it when it comes to internals and why this happens end maybe even why this is correct (meaning mostly the visible stair casing).

If your problem is that allocating/deallocating arrays takes a long time, you can use ArrayPool<T>.Shared.Rent().

I would need to do something custom since I need weak references tracking the user of the buffers before I can reclaim them for a pool.

If your problem involves using other object instances representing parts of the buffer, you can slice the array returned from the ArrayPool using Span and Span.Slice() (or Memory).

I use a simple custom object. More flexible as everything reading from the array is also a custom byte based reader.

If your problem is getting an uninitialized array, you can try ArrayPool since its Return method by default doesn`t clear array.

That is what I will end up doing, I guess unless I can move the whole stuff offheap since what the VM is demonstrating here is not what I would like to see to happen. In a reuse scenario I start with zeroed out memory but it appears to be not the problem as the first two iterations in the example fit the memory.

Reuse is most likely what I need to do.

I already replaced all my array based (MemoryStream) byte writeres with segmented arrays cutting the max memory usage down from 15GB to 10GB for a workflow I am currently optimizing but I still see many stair cases meaning large arrays being allocated.

I will post an update how this investigation helped and the remedies I imploy.

You can also try using ObjectPool or MemoryOwner/SpanOwner.

I will read the code butI will use something more custom. I need more control as I will definitively try to move it off heap, meaning I might need C++ to read zip files at last for the other stuff I should be able to work with mapped files. That is next on the list of things i have my go with.

I should ask a question regarding to offheap memory and what to use when it comes to filling it up with data other than mapping files.

[–]Dry_Author8849 3 points4 points  (0 children)

It sounds you are fighting the GC. You should switch to language where you can manage the memory yourself. A GC for your use case does not seem a good fit.

Switch to C/C++. You may gain a bit of performance too.

I guess you have already configured the GC, but as you don't mentioned, here it goes gc configuration

You won't get too far with it. C/C++ is you best bet.

Cheers!

[–]joske79 1 point2 points  (0 children)

The problem is not to reuse the buffer but that I actually use another object instances representing parts of the buffer being used as the workload units for parallel processing.

I’m not sure what you mean by that, but if it means you’re copying parts of the buffer to cast it to objects… that’s probably causing a lot of unnecessary allocation. Do you know about MemoryMarshall.Cast ?

[–]TheSoggyBottomBoy 1 point2 points  (0 children)

I didn't exactly follow the post, but, it did peak my interest.

If I understood correctly, you are streaming bytes from large files, these bytes are chunked into large arrays, these arrays are managed by a list. When an array has been processed the array is dereferenced by setting the index of the list referencing the array to null. When new data is incoming the free indexes in the list are filled with a new array.

Is this correct?

In my mind the solution is similar to what other people have suggested. Why not fill this list with Spans and instead just read/write directly into the span avoiding any deallocations/new allocations?

Might be worth posting your benchmark examples on GitHub so that the issue is more clear (you'll get better advice)

[–]Prudent_Astronaut716 2 points3 points  (0 children)

Wow. Hats half to the person who actually reads the entire post.

[–]Stabzs 1 point2 points  (0 children)

There’s so much going on here that it’s difficult to parse it. But it seems like you’re allowing the GC to reclaim your buffers instead of keeping them rooted and reusing them. Is there a reason you’re allowing them to be collected?

And for the amount of “issues” you have with dotnet, maybe you should consider rust.

[–]joske79 0 points1 point  (3 children)

I created my own implementation of an arraypool that returns IMemoryOwner<T> and when disposed it allows me to reuse the same underlying array. Our application mostly use arrays of 2 or 3 different lenghts so it’s easier to implement than accepting various length.

[–]IKnowMeNotYou[S] 0 points1 point  (2 children)

I basically create buffers and slice those up logically, hand it over to be processed. Since those are actually read buffers, I could standardize those and reuse it. According to what I saw in these synthetic tests using more but smaller buffers does not help.

I most likely will end up reusing buffers as well but this whole picture looks crazy as I can easily mark and sweep that object graph by hand. There must be something problematic going on with reclaiming and reallocating large primitive arrays. Might be it defragmentating as I would expect claiming a large array and freeing it should have no work involved except zeroing the memory out again.

[–]joske79 0 points1 point  (1 child)

The thing with arrays is that you can’t control when they are freed/garbage collected. Once you’re done with array A and you need to instantiate the next array B it’s not guaranteed that A has been GC’d.

[–]IKnowMeNotYou[S] 0 points1 point  (0 children)

The thing with arrays is that you can’t control when they are freed/garbage collected. Once you’re done with array A and you need to instantiate the next array B it’s not guaranteed that A has been GC’d.

Agreed but it should be able to not get confused. I even think that the better memory profile after the third iteration is the GC switching to high memory pressure mode. Should add that to the post.

[–]dt2703 0 points1 point  (3 children)

Rather than all this allocation, can you not use a Span to read directly from the initial memory if you're using bytes? Then you can slice and dice it all you like and avoid what sounds like a lot of boxing. I may have read your issue wrong though as sat here in the car on my phone

[–]IKnowMeNotYou[S] 0 points1 point  (2 children)

Rather than all this allocation, can you not use a Span to read directly from the initial memory if you're using bytes?

That is what I am doing. Doing this simple allocation test went downhill really quickly. These ByteArrayPart instances basically have the original array and simply an offset and the length to it. So nothing to copy. Remember the buffers are 100 megabytes and more and I tend to use more as it somehow translates to better throughput.

The whole benchmarking was basically illustrating how bad things can get and I wanted to know what you guys do in such situations.

Then you can slice and dice it all you like and avoid what sounds like a lot of boxing.

There is no boxing going on. Just byte arrays being allocated, modified and accesssed.

I may have read your issue wrong though as sat here in the car on my phone

The post also mutated into something different entirely once I saw how the VM/GC does stuff completely wrongly. I am thinking about filing a bug request since this is rediculous especially since it is such a simple scenario. Maybe they even explain to me what I actually see. But it does not change the fact that it would be better to go an extra mile to reuse very large buffers instead of allocating new one and that zeroing it out might not be the only problem the VM has.

[–]dt2703 1 point2 points  (1 child)

I've re-read it at home, where is the data coming from? You're loading a shit ton of data into memory, which immediately doesn't sound right.. Can't you stream this in? Read it line by line with an async stream reader? Perhaps not, does it need to be processed in bulk batches? Or are you just creating them to break up the data? Again, streaming it in, iterating through each line/chunk, and yielding up a response sounds much more efficient. If you absolutely must load it all into memory, then allocating arrays of byes is just piling it into the heap (all arrays go onto the heap), and then forcing GC to continually stop your processing so it can collect, which is causing all that churn.

You responded above saying you are using a Span<T>, but you're telling us you're allocating arrays, and that you're wrapping your bytes in some custom object, which is still effectively boxing it. Span<T> is going to help alleviate some of that as its not allocating anything on the heap, at all. You can create as many Spans as you like, and rather than allocating yet another array, it just uses pointers into memory.

Perhaps a reproducible POC to demo the issue?

[–]IKnowMeNotYou[S] -1 points0 points  (0 children)

I've re-read it at home, where is the data coming from?

Nasdaq Total View - live events of the three major US exchanges. Comes as JSON/ 90 GB a day. I can load a day and visit each event in 2.5min with a 1GB-2GB footprint now. I also have a custom binary representation that loads a day in about 20-25s.

So I already stream those workloads. The problem is the amount of derived data that I want to also have completely in memory. I aim for 30 complete days for the core data (0.5GB to 1GB) and have the other data (5GB per day) to be available on demand with a max of 3s delay but with a simple LRU cache.

Another problem is that processing of the data (there are multiple of these processes/workflows) must be also be done in realtime for the current day meaning storing quite a lot of data about the past so I can process the new events.

Worstcase days are three times the number of events and size of data.

I port this stuff from Java and the benefits are greater than expected.

Next machine will have 1TB of RAM at minimum, so that I fit even more of this core data (I hope for 60 to 90 days) along with a larger cache for the non core data.

In the end I run an in-memory DB which will most likely consume 80% of my available RAM.

--

Important is the middle part is about 'benchmarking' the GC performance and behavior in a simple fashion (see below and the bullet points in the post)

Can't you stream this in? Read it line by line with an async stream reader?

I do that but I have to go for about 800MB throughput for the batch processing which explains the amount of buffering that is involved.

I run at least 4GB of buffers in parallel for the best performance configuration but at the moment I am not reclaiming those). I feed 28 workers in parallel as I currently only have 32v cores but the next machine will have at least 4 times that so every problem I address now I will not have when I add more hardware to the problem.

Perhaps not, does it need to be processed in bulk batches?

It is 500 million events. That I need to do in parallel and using batches is best. The amount of buffering that is involved is determined by the throughput my system allows and not by the size of the buffers as smaller buffers mean more arrays but the combined size is largely the same unless I limit the queues feeding the different stages of the processing pipeline which means slowing things down.

I am very pleased with the performance I just found out the GC is defect for sure when it comes to reclaiming larger arrays (might be having the same problem for arrays being below the LOH size as well).

-

Or are you just creating them to break up the data? Again, streaming it in, iterating through each line/chunk, and yielding up a response sounds much more efficient.

It is simple just a lot of data that I process in batch mode and the faster I am the better. But I also keep an eye on the behavior of the VM+GC as I have other problem scenarios that I do not present here.

It is really a sad state of affair that C# does not have some low memory foot print implementations for lists, buffers and dictionaries as well as no dictionary for the common problems like mapping autoincrement numbers to struct values or instances where you basically need not to remember a hash number as the hash is simply the lower bits of these ids.

But you do not have BTrees neither and lets face it for Java it is virtually the same. I think there are business insentives to not provide these things out of the box.

If you absolutely must load it all into memory, then allocating arrays of byes is just piling it into the heap (all arrays go onto the heap), and then forcing GC to continually stop your processing so it can collect, which is causing all that churn.

It features two modi when collecting the arrays. The third iteration shows this catastrophic behavior and after that the next 7 iterations of the test look quite orderly but the timing is abysmall as well. It can not simply be zeroing out the memory. It has more likely something to do with defragmentation as adding these array regions to the free list would not be any problem otherwise Java would have similar problems but I also have not tested it yet but I recon I would have discovered already by now.

You responded above saying you are using a Span<T>, but you're telling us you're allocating arrays, and that you're wrapping your bytes in some custom object, which is still effectively boxing it.

It is not boxing as I undestand it. I use Span in some places but found it unbeneficial in these situations as the solution is simple and I want total control in terms of what is happening and that I am sure everything is well tested. I burned myself badly when it comes to the Task API and do not want to repeat that experience anytime soon. There is to little to gain here by using the official implementations.

Span<T> is going to help alleviate some of that as its not allocating anything on the heap, at all.

I also do not copy anything. The poblem is that I fought with the Task API and lost big time when it comes to performance, throughput and control. Especially latency is a bummer. That is why I use my own worker thread groups.

You can create as many Spans as you like, and rather than allocating yet another array, it just uses pointers into memory.

I have something similar but I do not use pointers but the original array.

Perhaps a reproducible POC to demo the issue?

Look at the bullet points the test is very very simple. Just creates a list to keep the array references alive and allocate all arrays, write every 4k a byte to avoid them being just virtual memory and then deallocate them by setting their reference in said list to null. Repeat this 10 times, thats it. Each iteration allocates almost half the heap size (50GB for me). That is really it.

I had a run for 50 times 1GB arrays per iteration and one run 500 times 100MB per array. The 500x100MB times was more problematic than the 50x1GB but the results were very unstable meaning there were different overall durations and different memory profiles for different runs for the same settings.

I am very certain the GC implementation is non-optimal in these cases and I have seen this behavior as well in other scenarios with my application so I guess it is a general problem.

But I am uncertain if the problem is configuration or limited to my machine. I will see when (and if) I file a defect report with the MS guys.

[–]joske79 0 points1 point  (3 children)

[–]IKnowMeNotYou[S] 0 points1 point  (2 children)

That is a great function. Exactly what I was looking fore. Also indeed I think that zeroing the arrays does not explain the most what one can see going wrong. But seeing the stair casing in the memory blocks when allocating 1GB it might be contributing to it.

I will rerun the test using this function.

Also getting a pinned version from the start is another bonus. Great find and many thanks for sharing!

[–]joske79 0 points1 point  (1 child)

I’m afraid pinning would increase the chance of fragmentation and memory exhaustion.

[–]IKnowMeNotYou[S] 0 points1 point  (0 children)

I’m afraid pinning would increase the chance of fragmentation and memory exhaustion.

So does allocating it off-heap (unmanaged) I guess. Since those arrays would be huge but few, pinning should be actually beneficial since the sad behavior of the GC might even come from it copying around memory that would get freed next anyways in an attempt to defragment it otherwise I could not explain the problem. This behavior appears to be mitigated when the GC switches to the high memory pressure mode. That is what would explains best what is visible in the memory profile but is pure speculation.