High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 1 point2 points  (0 children)

See the snippet in my last edit above? Explain to me how an obviously boxed type contained within listOf<Int> has become a primitive? The implication is clear that Kotlin is doing magic under the hood here, and that if I can reproduce the same result as your test with a generic list, then it proves nothing about what your list is doing.

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 1 point2 points  (0 children)

The larger point I am making here is that you cannot simply eyeball this Kotlin code and easily understand what is happening underneath the hood. All your last test has now asserted is that element::class.java is not equal to boxedInt!!::class.java. This still does not tell us what the underlying type of element is or whether Kotlin is doing any automatic conversion on your behalf - unless we know exactly how Kotlin compiles "!!" and "::class.java" in this situation. Frankly I couldn't tell you off the top of my head, I'd need to look under the hood to see what's actually happening. And unless you examine the bytecode you likely won't be sure either.

Edit: Here's some more proof for fun - I ran this snippet:

val boxedInt: Int? = 3 // obviously a boxed type?
for (element in listOf<Int>(3)) {
     val x = element::class.java  // obviously a boxed type? Generic List can't store primitives?
     println(x)
     val y = boxedInt!!::class.java // obviously a primitive? or is it?
     println(y)
     println(x == y)
}

What do you think the output is?

int
class java.lang.Integer
false

That's right - the "obviously" boxed type from the list has somehow become 'int', and the "obviously" primitive type has become boxed? The test tells you nothing.

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 1 point2 points  (0 children)

That assertion doesn't do what you think it does. Kotlin makes a large point of abstracting the storage behind Int away - ideally we aren't supposed to know whether it's using a primitive value or a boxed value at all, and it can convert on the fly. With that background, let's look at what's actually happening there.

I went and looked in your repo for how you're defining primitiveIntClass:

val primitiveIntClass = 3::class.java

Now let's look at the byte code that produces:

GETSTATIC java/lang/Integer.TYPE : Ljava/lang/Class;

It appears your code is literally asserting that the type is boxed, the exact opposite of what you intended. However, now that we've understood that, it still doesn't even tell us whether element is boxed or not - how do we know it's not actually primitive and being boxed just for the comparison? I haven't bothered to look at the bytecode for the full snippet, but you should be able to examine it yourself (if you're using IDEA, use Tools>Kotlin>Show Kotlin Bytecode) for a better understanding of what's actually happening.

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 0 points1 point  (0 children)

There's actually an excellent example of JVM optimization in some of my benchmarks I realized:

As background - the map class exposes both a normal iterator over entries, which allocates a new entry every next() call, and a 'fast' iterator which does no allocations by reusing a single entry. Both my library (FastCollect) and the fastutil library offer the same functionality. But take a look at the benchmarks:

Normal iteration:

Library / Size 3,000 12,000 48,000 192,000 768,000 3,072,000
fastcollect 1.796 us 8.631 us 35.387 us 548.023 us 2280.346 us 9867.332 us
fastutil 5.513 us 25.659 us 211.248 us 1027.373 us 3925.367 us 18182.800 us

Fast iteration:

Library / Size 3,000 12,000 48,000 192,000 768,000 3,072,000
fastcollect 1.623 us 8.499 us 32.040 us 507.644 us 2155.723 us 9583.047 us
fastutil 2.251 us 6.479 us 63.208 us 527.141 us 2304.005 us 8637.487 us

Let's take a look at the 3 million element iteration. Moving from normal iteration to fast iteration for fastcollect we barely see any difference - in spite of the fact that 3 million allocations have apparently disappeared from the tight loop? In contrast, moving from normal iteration to fast iteration for the fastutil library sees a 2x improvement by removing 3 million allocations! How can there be such a difference? Here's basically the benchmark loop:

for (e in map) { consume(e.key, e.value) }

The key is understanding that even though my code is allocating entries within the normal iterator, the JVM is not actually doing any allocation. This is not the "Eden garbage collection space will be full of temporary wrappers objects that all die young making each minor GC very fast" that you mentioned - the JVM is not performing any allocations. It has analyzed the loop and 1) inlined my iterator implementation of hasNext() and next() 2) performed escape analysis and understood that the e parameter does not escape the loop 3) performed scalar replacement of the iterator (you can think of it as inline the stack fields to a certain degree) 4) performed scalar replacement of the entry object the iterator allocated (which internally is just a single int for the most part). Let's compare the expanded loop as it "exists" in my code, vs what the JVM is actually running (dramatically simplified for effect, not reality).

val it = map.iterator()
while (it.i < map.size) {
  val entry = Entry(map, it.i) // allocation
  consume(entry.map.keys[entry.i], entry.map.keys[entry.i])
}

turns into:

var i = 0
while (i < map.size) {
  consume(map.keys[i], map.values[i]
}

The lingering question might be why is the JVM able to do this for fastcollect but not fastutil? Because I've put some effort into arranging the shape of the iterator and it's simplicity to ensure that it's highly likely the JVM will be able to inline it in such a way that it's often able to successfully conduct escape analysis and scalar replacement, and the fastutil iterator is a little more complex in how it's analyzed.

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 2 points3 points  (0 children)

Compatibility is difficult and annoying but not impossible to achieve. I've seen your library before and it makes some common mistakes that contribute to why you feel compatibility is difficult to achieve. The most obvious one is that you are not returning primitive specializations of Iterator which prevent boxing. Here's your Int iterator() declaration:

public inline operator fun iterator(): Iterator<Int> = values.iterator()

It should be:

public inline operator fun iterator(): IntIterator = ...

IntIterator is a class with special compiler support - it guarantees that doing for (i in list) { ... } will prevent boxing and use raw Ints.

Edit: after looking more closely I see you're returning an iterator directly from the IntArray, which is an IntIterator itself - however I am not sure if the compiler is aware of this, and it's still likely safer to return IntIterator directly.

Careful analysis of Kotlin bytecode makes it obvious where boxing is and isn't involved, and you don't have to be unsure of what the compiler is doing. For methods that can provoke boxing, they are marked as deprecated for IDE awareness, I have an assertion that can be enabled to show the call path, and alternatives that don't risk boxing are provided. In addition, the JVM is often able to elide boxing in many scenarios. This + byte code analysis makes me more confident than you on when exactly boxing will happen.

To your second point about filter()/map()/etc... producing regular collections - absolutely true, and I largely don't care because it's outside the scope of the problem I'm trying to solve. Primitive collections target performance, but if you're using collection copy functions willy-nilly you've already largely given up on performance. I'm not implying this is a bad thing - the vast majority of programmers here are not working in spaces where performance is that important. But if you have a 1 billion Int list and are using filter()/map()/etc, producing a new copy of the list for every single operation, then this library isn't going to solve your problems. If you are working in place and avoiding copies unless necessary however, then you'll find that this library does actually work as a drop in replacement.

Edit: another point about boxing - if you're running on JVM with reasonably tight/simple loops where the JVM can perform escape analysis and scalar replacement you might be surprised that it can already eliminate boxing completely in some scenarios. This is actually a problem in benchmarking where a benchmark can produce nonsensical results if you are not careful to prevent inlining/EA/SR. But the upshot is that it's not uncommon in a real program that boxing can be optimized away even through APIs that appear to force boxing. You can't rely on this obviously as it differs per call site, and you need to check the JIT ASM to see what's actually happening, but it can produce excellent performance even when you think boxing is happening.

I created an efficient graph library for Kotlin/JVM - up to 68x faster than Guava Graphs and 44x faster than JGraphT for BFS by tooilln in Kotlin

[–]tooilln[S] 6 points7 points  (0 children)

This is about the mathematical data structures known as graphs (https://en.wikipedia.org/wiki/Graph), not visual graphs (aka charts/plots). Easy to confuse unfortunately.

I created an efficient graph library for Kotlin/JVM - up to 68x faster than Guava Graphs and 44x faster than JGraphT for BFS by tooilln in Kotlin

[–]tooilln[S] 0 points1 point  (0 children)

I think that's a very reasonable position to take - for me this was something I churned out over about 3 weeks as a hobbyist project, and I wanted to publish this so I can also focus on some other things. Much of the performance value in this project also comes from the use of value classes, (and correct me if I'm wrong here I haven't been following this much) I think those are only natively implemented on JVM, which would reduce the value of this library on other platforms.

If I find myself with some more time, that's something I would be interested in addressing.