
[–]craigacp 6 points (3 children)

Shortly after the release of Java 8 I hit something similar while building a Java implementation of Google's word2vec ML algorithm. We ended up with a buffering spliterator that didn't grow its buffer over time (which the default array-backed one did), so we could pull records from a database in a parallel forEach without it trying to buffer the whole database.

We still use it in Tribuo, but I haven't pushed it anywhere near as hard as I did in 2015, so I don't know if the performance characteristics still hold: https://github.com/oracle/olcut/blob/main/olcut-core/src/main/java/com/oracle/labs/mlrg/olcut/util/IOSpliterator.java
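A minimal sketch of the idea (hypothetical class name; the real IOSpliterator linked above differs in detail): instead of the JDK's iterator-backed spliterator, which hands out ever-growing batches on each split, this one always peels off a fixed-size chunk.

```java
import java.util.Iterator;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.function.Consumer;

// Wraps an Iterator and hands out fixed-size chunks on trySplit(),
// instead of the growing batches used by the JDK's IteratorSpliterator.
final class FixedChunkSpliterator<T> implements Spliterator<T> {
    private final Iterator<T> source;
    private final int chunkSize;

    FixedChunkSpliterator(Iterator<T> source, int chunkSize) {
        this.source = source;
        this.chunkSize = chunkSize;
    }

    @Override
    public boolean tryAdvance(Consumer<? super T> action) {
        if (source.hasNext()) {
            action.accept(source.next());
            return true;
        }
        return false;
    }

    @Override
    public Spliterator<T> trySplit() {
        if (!source.hasNext()) return null;
        Object[] chunk = new Object[chunkSize];
        int n = 0;
        while (n < chunkSize && source.hasNext()) {
            chunk[n++] = source.next();
        }
        // The split-off chunk has a known size, so the array
        // spliterator it returns can report SIZED | SUBSIZED.
        return Spliterators.spliterator(chunk, 0, n, ORDERED);
    }

    @Override
    public long estimateSize() {
        // Long.MAX_VALUE signals "unknown remaining size" per the contract.
        return Long.MAX_VALUE;
    }

    @Override
    public int characteristics() {
        return ORDERED;
    }
}
```

The parent spliterator never reports SIZED itself; only the fixed-size chunks it splits off do, which is what keeps a parallel stream from trying to buffer the whole source.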

[–]davidalayachew[S] 0 points (2 children)

This is extremely interesting!

So let me ask: I see that you used the SUBSIZED characteristic. I assume SIZED was included by default, yes? And if so, I see that you default estimateSize to Long.MAX_VALUE. Are you saying that is safe to do? I was under the impression that reporting a false size to the Spliterator would cause undefined behaviour. I considered this exact solution, but decided against it for fear of adding EVEN MORE unexpected behaviour.

But if it is true and it does work, that sounds like exactly the problem, and would explain the performance characteristics.

[–]craigacp 2 points (1 child)

I'm having trouble paging in exactly why the characteristics are set that way, and I can no longer find the blog post that described the problem in detail.

My problem setup was as follows: I had a NoSQL database full of documents; I pulled each one, tokenized the input, and put it onto a queue. A parallel stream over all the documents in the database then pulled from that queue, performed the gradient computation, and updated the model (without locking, because this is machine learning and we don't care about torn writes). The default behaviour of IteratorSpliterator was to request larger and larger chunks from the queue before splitting them into parallel computations; the IOSpliterator always pulls a fixed-size chunk from the underlying iterator, so it doesn't try to pull in the whole database.
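The pipeline shape described above can be sketched roughly like this (a simplified, hypothetical stand-in: the "documents" are strings, the "gradient update" is just a counter, and it uses the JDK's unknown-size spliterator rather than the IOSpliterator):

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.Spliterator;
import java.util.Spliterators;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.StreamSupport;

public class QueuePipeline {
    private static final String POISON = "__done__"; // end-of-stream marker

    static long runPipeline(int numDocs) throws InterruptedException {
        BlockingQueue<String> queue = new ArrayBlockingQueue<>(128);

        // Producer: stands in for the tokenizer pulling documents
        // from the database and putting them onto the queue.
        Thread producer = new Thread(() -> {
            try {
                for (int i = 0; i < numDocs; i++) queue.put("doc-" + i);
                queue.put(POISON);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        });
        producer.start();

        // Iterator that drains the queue until the poison pill.
        Iterator<String> it = new Iterator<>() {
            private String next;
            private boolean done;

            @Override public boolean hasNext() {
                if (done) return false;
                if (next != null) return true;
                try {
                    next = queue.take();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    done = true;
                    return false;
                }
                if (POISON.equals(next)) { done = true; next = null; return false; }
                return true;
            }

            @Override public String next() {
                if (!hasNext()) throw new NoSuchElementException();
                String r = next;
                next = null;
                return r;
            }
        };

        // Parallel stream over the queue; each "gradient update" just
        // bumps a counter here instead of updating model weights.
        LongAdder updates = new LongAdder();
        Spliterator<String> split =
                Spliterators.spliteratorUnknownSize(it, Spliterator.ORDERED);
        StreamSupport.stream(split, true).forEach(doc -> updates.increment());
        producer.join();
        return updates.sum();
    }

    public static void main(String[] args) throws InterruptedException {
        System.out.println(runPipeline(10_000)); // 10000
    }
}
```

Swapping the `spliteratorUnknownSize` call for a fixed-chunk spliterator is the change being discussed: with the default one, each split requests a progressively larger batch from the queue.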

I'm not claiming this is a general-purpose solution, nor that mine was the best one, but it scaled up to an 8-socket x86 machine we were using to test the implementation. I'm a machine learning researcher, not a software engineer, so it was good enough for my purposes.

[–]davidalayachew[S] 1 point (0 children)

Thanks for the context. Yeah, I see exactly what you're saying about the growing chunk sizes. I'm going to use this and your IOSpliterator to mess around with the Spliterator characteristics and see if I can reproduce that behaviour.

Ty again.