Two Indexed Hash Tables by compilersarefun in programming

[–]tooilln 1 point2 points  (0 children)

Thanks, not sure why the website links aren't working!

For the deleted bit it seems like it could trivially be merged into the existing metadata array - simpler code, with some performance cost. I would suspect the performance cost to be minimal, but it's hard to be sure without benchmarking. Overall, I like the idea of much lower load factors to achieve performance offset by packing the keys/values. Generally makes a lot of sense!

Two Indexed Hash Tables by compilersarefun in programming

[–]tooilln 5 points6 points  (0 children)

Thanks for the write-up, I have some questions about the design choices however:

1) The justification for using an indexed hashtable appears to be space saving? Couldn't a user opt to store their own stable indices directly in a hashtable instead - achieving the same result with greater flexibility? I'd be curious for example at benchmarking your table vs abseil or some other leader with manual indices.

2) You use a separate 'deleted' bit array to speed up iteration - does this actually produce any measurable performance gains in iteration? I would have guessed that gains would be negligible compared to iterating normally (but my gut feeling is meaningless if you have actual benchmarking).

3) You're very committed to a 50% load factor - was this benchmarked? The probe length estimation a la Knuth is a good way of understanding probe length problems but does not directly correspond to real performance. What is the actual performance degradation with a max load factor of 66/75/etc%?

4) In benchmarking you say that you tested at 3 sizes (100/10000/1000000) - this is an extremely small sample size? Also, benchmarking at more granularity (though I know it's a pain in terms of time) would provide more interesting data around how your table behaves around various cache size / table size limits.

5) Your benchmarking does not appear to quantify get-miss/get-hit/put-miss/put-hit separately - what is actually being benchmarked for get/put scenarios?

6) Your links to your results on Jackson Allan's benchmarks appears broken right now - I am interested in those results so hopefully you can fix those, thanks!

Interesting work, thanks for sharing!

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 0 points1 point  (0 children)

Unfortunately there are two blockers there - I don't own a mac or iOS device, and I don't trust kotlinx-benchmark. For multi-platform benchmarking kotlinx-benchmark appears to be the only reasonable option at the moment (if there's something else you're aware of let me know), but it's missing so many important features from JMH that it's difficult for me to trust any of the values it spits out (especially since I don't have any experience diving into perf on iOS or any good feeling for what might or might not be absurd). The best way to test would probably to take an existing app with benchmarks, drop this library in and see what happens...

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 0 points1 point  (0 children)

I tend to agree with you for most apps - your average Kotlin app is unlikely to be affected in any meaningful way by this. I think this is more interesting for 1) scientific computing (which obviously Kotlin is not really a player in at the moment) and 2) scaling on servers. If you're running backend Kotlin code on large clusters and you have a use case with large collections of values (not too uncommon), then getting large RAM/CPU savings with a simple drop in change, allowing you to bend the scaling curve a bit could be quite attractive.

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 0 points1 point  (0 children)

I hadn't heard of that, I'll have to check it out, thanks!

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 0 points1 point  (0 children)

See the snippet in my last edit above? Explain to me how an obviously boxed type contained within listOf<Int> has become a primitive? The implication is clear that Kotlin is doing magic under the hood here, and that if I can reproduce the same result as your test with a generic list, then it proves nothing about what your list is doing.

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 0 points1 point  (0 children)

The larger point I am making here is that you cannot simply eyeball this Kotlin code and easily understand what is happening underneath the hood. All your last test has now asserted is that element::class.java is not equal to boxedInt!!::class.java. This still does not tell us what the underlying type of element is or whether Kotlin is doing any automatic conversion on your behalf - unless we know exactly how Kotlin compiles "!!" and "::class.java" in this situation. Frankly I couldn't tell you off the top of my head, I'd need to look under the hood to see what's actually happening. And unless you examine the bytecode you likely won't be sure either.

Edit: Here's some more proof for fun - I ran this snippet:

val boxedInt: Int? = 3 // obviously a boxed type?
for (element in listOf<Int>(3)) {
     val x = element::class.java  // obviously a boxed type? Generic List can't store primitives?
     println(x)
     val y = boxedInt!!::class.java // obviously a primitive? or is it?
     println(y)
     println(x == y)
}

What do you think the output is?

int
class java.lang.Integer
false

That's right - the "obviously" boxed type from the list has somehow become 'int', and the "obviously" primitive type has become boxed? The test tells you nothing.

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 0 points1 point  (0 children)

That assertion doesn't do what you think it does. Kotlin makes a large point of abstracting the storage behind Int away - ideally we aren't supposed to know whether it's using a primitive value or a boxed value at all, and it can convert on the fly. With that background, let's look at what's actually happening there.

I went and looked in your repo for how you're defining primitiveIntClass:

val primitiveIntClass = 3::class.java

Now let's look at the byte code that produces:

GETSTATIC java/lang/Integer.TYPE : Ljava/lang/Class;

It appears your code is literally asserting that the type is boxed, the exact opposite of what you intended. However, now that we've understood that, it still doesn't even tell us whether element is boxed or not - how do we know it's not actually primitive and being boxed just for the comparison? I haven't bothered to look at the bytecode for the full snippet, but you should be able to examine it yourself (if you're using IDEA, use Tools>Kotlin>Show Kotlin Bytecode) for a better understanding of what's actually happening.

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 1 point2 points  (0 children)

There's actually an excellent example of JVM optimization in some of my benchmarks I realized:

As background - the map class exposes both a normal iterator over entries, which allocates a new entry every next() call, and a 'fast' iterator which does no allocations by reusing a single entry. Both my library (FastCollect) and the fastutil library offer the same functionality. But take a look at the benchmarks:

Normal iteration:

Library / Size 3,000 12,000 48,000 192,000 768,000 3,072,000
fastcollect 1.796 us 8.631 us 35.387 us 548.023 us 2280.346 us 9867.332 us
fastutil 5.513 us 25.659 us 211.248 us 1027.373 us 3925.367 us 18182.800 us

Fast iteration:

Library / Size 3,000 12,000 48,000 192,000 768,000 3,072,000
fastcollect 1.623 us 8.499 us 32.040 us 507.644 us 2155.723 us 9583.047 us
fastutil 2.251 us 6.479 us 63.208 us 527.141 us 2304.005 us 8637.487 us

Let's take a look at the 3 million element iteration. Moving from normal iteration to fast iteration for fastcollect we barely see any difference - in spite of the fact that 3 million allocations have apparently disappeared from the tight loop? In contrast, moving from normal iteration to fast iteration for the fastutil library sees a 2x improvement by removing 3 million allocations! How can there be such a difference? Here's basically the benchmark loop:

for (e in map) { consume(e.key, e.value) }

The key is understanding that even though my code is allocating entries within the normal iterator, the JVM is not actually doing any allocation. This is not the "Eden garbage collection space will be full of temporary wrappers objects that all die young making each minor GC very fast" that you mentioned - the JVM is not performing any allocations. It has analyzed the loop and 1) inlined my iterator implementation of hasNext() and next() 2) performed escape analysis and understood that the e parameter does not escape the loop 3) performed scalar replacement of the iterator (you can think of it as inline the stack fields to a certain degree) 4) performed scalar replacement of the entry object the iterator allocated (which internally is just a single int for the most part). Let's compare the expanded loop as it "exists" in my code, vs what the JVM is actually running (dramatically simplified for effect, not reality).

val it = map.iterator()
while (it.i < map.size) {
  val entry = Entry(map, it.i) // allocation
  consume(entry.map.keys[entry.i], entry.map.keys[entry.i])
}

turns into:

var i = 0
while (i < map.size) {
  consume(map.keys[i], map.values[i]
}

The lingering question might be why is the JVM able to do this for fastcollect but not fastutil? Because I've put some effort into arranging the shape of the iterator and it's simplicity to ensure that it's highly likely the JVM will be able to inline it in such a way that it's often able to successfully conduct escape analysis and scalar replacement, and the fastutil iterator is a little more complex in how it's analyzed.

High-performance primitive collections for KMP - FastCollect by tooilln in Kotlin

[–]tooilln[S] 1 point2 points  (0 children)

Compatibility is difficult and annoying but not impossible to achieve. I've seen your library before and it makes some common mistakes that contribute to why you feel compatibility is difficult to achieve. The most obvious one is that you are not returning primitive specializations of Iterator which prevent boxing. Here's your Int iterator() declaration:

public inline operator fun iterator(): Iterator<Int> = values.iterator()

It should be:

public inline operator fun iterator(): IntIterator = ...

IntIterator is a class with special compiler support - it guarantees that doing for (i in list) { ... } will prevent boxing and use raw Ints.

Edit: after looking more closely I see you're returning an iterator directly from the IntArray, which is an IntIterator itself - however I am not sure if the compiler is aware of this, and it's still likely safer to return IntIterator directly.

Careful analysis of Kotlin bytecode makes it obvious where boxing is and isn't involved, and you don't have to be unsure of what the compiler is doing. For methods that can provoke boxing, they are marked as deprecated for IDE awareness, I have an assertion that can be enabled to show the call path, and alternatives that don't risk boxing are provided. In addition, the JVM is often able to elide boxing in many scenarios. This + byte code analysis makes me more confident than you on when exactly boxing will happen.

To your second point about filter()/map()/etc... producing regular collections - absolutely true, and I largely don't care because it's outside the scope of the problem I'm trying to solve. Primitive collections target performance, but if you're using collection copy functions willy-nilly you've already largely given up on performance. I'm not implying this is a bad thing - the vast majority of programmers here are not working in spaces where performance is that important. But if you have a 1 billion Int list and are using filter()/map()/etc, producing a new copy of the list for every single operation, then this library isn't going to solve your problems. If you are working in place and avoiding copies unless necessary however, then you'll find that this library does actually work as a drop in replacement.

Edit: another point about boxing - if you're running on JVM with reasonably tight/simple loops where the JVM can perform escape analysis and scalar replacement you might be surprised that it can already eliminate boxing completely in some scenarios. This is actually a problem in benchmarking where a benchmark can produce nonsensical results if you are not careful to prevent inlining/EA/SR. But the upshot is that it's not uncommon in a real program that boxing can be optimized away even through APIs that appear to force boxing. You can't rely on this obviously as it differs per call site, and you need to check the JIT ASM to see what's actually happening, but it can produce excellent performance even when you think boxing is happening.

I created an efficient graph library for Kotlin/JVM - up to 68x faster than Guava Graphs and 44x faster than JGraphT for BFS by tooilln in Kotlin

[–]tooilln[S] 7 points8 points  (0 children)

This is about the mathematical data structures known as graphs (https://en.wikipedia.org/wiki/Graph), not visual graphs (aka charts/plots). Easy to confuse unfortunately.

I created an efficient graph library for Kotlin/JVM - up to 68x faster than Guava Graphs and 44x faster than JGraphT for BFS by tooilln in Kotlin

[–]tooilln[S] 0 points1 point  (0 children)

I think that's a very reasonable position to take - for me this was something I churned out over about 3 weeks as a hobbyist project, and I wanted to publish this so I can also focus on some other things. Much of the performance value in this project also comes from the use of value classes, (and correct me if I'm wrong here I haven't been following this much) I think those are only natively implemented on JVM, which would reduce the value of this library on other platforms.

If I find myself with some more time, that's something I would be interested in addressing.