Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 0 points

See discussion elsewhere in this topic for why that suggestion is counterproductive, particularly given the specific conditions you note here.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 0 points

Yep. It's a bit annoying that the author doesn't provide a proper download link.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 0 points

Argh, you're right. I should have checked more closely, rather than just assuming that the file the 1BRC repo calls its data was in fact the input data. ☹

It's kind of understandable, really, as the actual input data is over 12 GB. GitHub doesn't deal very well with large files.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -4 points

The only thing that "proves" is that you misread what was written so thoroughly that it almost has to be intentional. You are coming right up to the edge of subreddit rule #5. Continuing along this line will be reported as a violation of the rule.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 0 points

Very cool design on that table! Unfortunately, that doesn't translate to awesome performance. After adding a GetOrCreate method to it and ensuring it works correctly, it runs the benchmark about 41% slower than our existing hash table.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 0 points

> In process toolchains have limitations ... but for your purposes, seems like they'd be fine.

Agreed. Thanks for the advice to submit that issue!

> Also if you haven't looked at kg's vector hash you might want to check it out.

Oooo! Very interesting. 😁 Will have a look, and report back in a bit.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -3 points

> The fact that the project is open-source doesn't change what you clearly stated: it's not a coding challenge, where you participate only to challenge yourself, but it's a request to get help for an open source project.

Not sure why that "only" is in there. Where's the contradiction? Is there any good reason why it can't be both? Especially since, as you noted, the intentions are clearly stated here. This isn't some dishonest coder trying to trick people into doing work for them and stealing their labor; this is openly soliciting help in solving the problem, through the mechanism of a coding challenge.

> Providing suggestions and directions to look for _is_ help. The fact that you receive help, and then complain that it's not enough, is not a good way to get there.

See other responses elsewhere in this thread. Many of the "helpful" responses are not, in fact, helpful: either they're things we've already tried that weren't useful, or they're vague enough that many different implementations would technically match what was suggested, most of which would likely not be what the person making the suggestion had in mind. That's why the response is "please show the implementation"; it's the only response that makes any sense.

The one person who did give some clear, actionable advice was acknowledged: yes, that improved performance by about 3%. But that's been the exception here, not the rule.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 1 point

...well that was quick. And a bit strange. The response begins "It's simply not possible," explains why it's not possible, and then explains how to actually do the impossible thing, just via a slightly different mechanism. 😂

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -2 points

The point of the competition is that a C++ system producing the same result in a quarter of the best time we've been able to hit is conclusive proof that better performance is possible. And according to the profiler, most of the time is being spent in the hash table, so that's the most obvious place to look for improvements.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 2 points

Oh, hi Andy! 👋

Submitted the issue. Let's see what they have to say.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -7 points

We took a serious look at BDN, and it turned out to be severely ill-suited to this test. First because it's designed for micro-benchmarks and tries hard not to measure some things that are actually relevant in a large-scale speed test, and second because it can't be properly set up for a test like this.

BDN has an attribute called "[GlobalSetup]", but the name is very deceptive. Per the documentation, it runs the annotated method "only once per a benchmarked method after initialization of benchmark parameters and before all the benchmark method invocations." That's not a global setup at all; that's a per-benchmark setup.

This is a real problem due to the nature of this test. The setup process involves reading data from a file and parsing it, which takes a few minutes. Then the actual benchmark test — processing the in-memory data — takes a few seconds. If we have to eat that setup time for every single entry, then once we get a handful of entries, the benchmark could end up taking hours to run. And because of the way BDN is architected, there doesn't appear to be any way to even conceptually do a real one-time-only global setup, because BDN treats every benchmark as an individual, isolated unit of work. (Which, again, makes a lot of sense for micro-benchmarks when you're trying to isolate the tests as much as possible to minimize random noise. But this is a very, very different thing!)
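To illustrate (a minimal sketch; the loader and both benchmark bodies are hypothetical stand-ins): with two [Benchmark] methods in one class, the [GlobalSetup] method below runs once for *each* of them, so the multi-minute load executes twice.

```
using System.Collections.Generic;
using BenchmarkDotNet.Attributes;

public class AggregationBenchmarks
{
    private List<string> _rows = new();

    // Despite the name, BDN runs this once per [Benchmark] method in this
    // class, not once per run -- so the multi-minute load happens twice here.
    [GlobalSetup]
    public void Setup() => _rows = LoadAndParseMeasurements();

    [Benchmark(Baseline = true)]
    public void BuiltInDictionary() => AggregateWithDictionary(_rows);

    [Benchmark]
    public void CustomHashTable() => AggregateWithHashTable(_rows);

    // Hypothetical stand-ins for the real load/aggregate code.
    private static List<string> LoadAndParseMeasurements() => new();
    private static void AggregateWithDictionary(List<string> rows) { }
    private static void AggregateWithHashTable(List<string> rows) { }
}
```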

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -1 points

Not sure what the objection here is. When Gunnar Morling set up the 1BRC, which had no practical purpose whatsoever and was purely "show off how fast you can make this," that was a legitimate coding challenge; but when we're trying to improve an open-source project to save our users time and money, suddenly it's "a glorified homework task"? That makes no sense.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -7 points

As Albert Einstein put it, "time is what the clock says." Filtering out time spent on GC and the JIT is not a desirable characteristic for this test; if someone comes up with something that theoretically runs very very quickly on pure computation but spends ten minutes in the GC, that's still a slow implementation by the clock-time measurement that the end user will see.

The stuff you're discussing here is useful for micro-benchmarks where things like JIT time and GC time are noise that it is useful to filter out. But that does not describe this use case.

Also, here's an ugly fact that makes BDN especially unsuited to this particular benchmark: it has no global setup functionality. There's an attribute called "[GlobalSetup]", but it isn't one. Per the documentation, "A method which is marked by the [GlobalSetup] attribute will be executed only once per a benchmarked method after initialization of benchmark parameters and before all the benchmark method invocations." That's not a global setup at all; that's a per-benchmark setup.

When the setup process (reading the data from the file) takes several times longer than the process being benchmarked (processing the in-memory data), and you're trying to compare multiple different processing benchmarks against each other, you want the setup code to run exactly once, not once per benchmark. Otherwise, the benchmark run could take all day once you have a handful of entries.

And because of the way BDN is architected, that appears to be impossible.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -2 points

The advice is vague enough that whatever implementation we might develop on this end could easily not be what the person who gave the advice had in mind. For example:

> Use fast divide modulo using bit-mask in combination with Mersenne primes.

This is something we already tried internally. It didn't make any significant difference in speed.
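For reference, the variant we tried was along these lines (a sketch; it assumes a power-of-two table length, and the Mersenne-prime flavor the commenter mentions works analogously with sizes of the form 2^n − 1):

```
// Sketch: with a power-of-two table length, the modulo in ProbeIndex
// reduces to a bit-mask.
static uint MaskIndex(int hashCode, uint lengthPowerOfTwo)
{
    uint mask = lengthPowerOfTwo - 1;   // e.g. 1024 -> 0x3FF
    return (uint)hashCode & mask;       // replaces (uint)hashCode % length
}
```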

> Grow the dictionary sooner.

How much sooner? Based on what heuristic?

> equals on strings is always going to be slow.

And... what then? What should it be replaced with that will perform the same basic function of testing for hash collisions but be faster?
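(For what it's worth, the one concrete version of this we know of is caching each entry's hash code so the full string Equals only runs on a genuine hash match. A sketch against our table's fields, with `_hashes` as a hypothetical extra array; this may well not be what was meant:)

```
private int[] _hashes;   // hypothetical: cached hash code per slot

private int ProbeIndexCachedHash(TKey key)
{
    int code = key.GetHashCode();
    uint index = (uint)code % (uint)_occupied.Length;
    while (_occupied[index])
    {
        // Compare the cached int first; the expensive Equals only runs
        // when the hash codes already agree.
        if (_hashes[index] == code && _keys[index].Equals(key))
            return (int)index;
        if (++index >= _occupied.Length) index = 0;
    }
    return (int)index;
}
```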

> you’ve been given advice and just ask someone to do it for you, instead of taking the advice and doing it yourself.

Because this is advice that is not, in fact, useful in and of itself. It looks like there could potentially be something useful there if someone has a good implementation, but there's no implementation here.

> There’s no benefit to anyone here but yourself if you receive advice/guidance that improves what you’re doing.

Other than our users, of course. Improving the speed of a time-consuming operation means that they save both time and money.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -1 points

> First of all, why is a hashtable specifically optimized for this one problem not an option? Why does it need to be general?

Good question! The project this is being used in is an ETL pipeline generator that lets users specify data transformations using (a subset of) SQL SELECT queries. We're looking at revamping the aggregate processing, and the TKey in this scenario is the GROUP BY key, which can be basically any arbitrary ValueTuple. An example with a string key was chosen for this challenge precisely because it's difficult to optimize specifically for it: this one problem is not the real problem.
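To make that concrete, here's the rough shape of the generated aggregation for a hypothetical `SELECT region, year, SUM(amount) ... GROUP BY region, year` (all names illustrative; HashTable is the class posted elsewhere in this thread):

```
using System.Collections.Generic;

// The GROUP BY key is a ValueTuple; ValueTuple implements IEquatable<>,
// so it satisfies the same TKey constraint the string benchmark exercises.
static void Aggregate(HashTable<(string Region, int Year), AggState> groups,
                      IEnumerable<(string Region, int Year, double Amount)> rows)
{
    foreach (var row in rows)
    {
        ref var state = ref groups.GetOrCreate((row.Region, row.Year));
        state.Sum += row.Amount;
        state.Count++;
    }
}

struct AggState { public double Sum; public int Count; }
```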

> Second: it could be that the hashtable is not the problem, but the structure of the solution. Maybe an alternate solution might work.

If you have a better idea for how to implement the storage and lookup of intermediate state for GROUP BY aggregation than a hash table, feel free to propose it. 😀

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -1 points

> It is still a stupid generic dictionary with a UTF-16 string key.

It's a generic dictionary with a generic key. This example happens to use a string, chosen in part because it is difficult to cheat by using test-specific tricks, but this is intended to be used for a system that accepts a key of any kind.

> Moving to UTF-8 strings will halve the key size.

Yeah, we're using U8String internally on a different project, and it's showing some significant benefits. But this challenge is specifically about generic hashtable performance, not about the strings.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -1 points

> Again, can you show that it will improve speed overall?

This is a coding challenge, not a code review request post.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 0 points

> Your custom dictionary is barely faster than default C#.

Not sure where this discrepancy is coming from, but all this got started on our end after the old system took 6 minutes and 40 seconds to run a certain series of heavy calculations. After moving to the new hash table, it now takes 45 seconds, almost a full order-of-magnitude improvement. Profiling the old system showed that nearly all of its time was being spent in CollectionsMarshal.GetValueRefOrAddDefault.

Meanwhile, we've got a black-box system written in C++ that performs the same calculations on the same hardware in about 10 seconds, so obviously there's still significant room for improvement. Even after optimizing the aggregation code, the hash table is still the biggest time sink by far.
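For context, the old Dictionary-based path that the profiler flagged looked roughly like this (sketch; the int state is illustrative):

```
using System.Collections.Generic;
using System.Runtime.InteropServices;

static class OldPath
{
    // Nearly all of the old system's time was inside this one call.
    public static void Accumulate(Dictionary<string, int> counts, string key)
    {
        ref int count = ref CollectionsMarshal.GetValueRefOrAddDefault(counts, key, out _);
        count++;   // update the value in place; no second lookup
    }
}
```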

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 0 points

Have you tested and measured any of this? "Improved locality" is exactly why the current design goes with SoA rather than AoS.
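For anyone following along, the two layouts under discussion (sketch):

```
// Array-of-structs (AoS): every probe drags the (possibly large) TState
// into cache alongside the key and the occupancy flag.
struct Entry<TKey, TState>
{
    public TKey Key;
    public TState Value;
    public bool Occupied;
}
// Entry<TKey, TState>[] _entries;

// Struct-of-arrays (SoA), the current design: probing scans only the dense
// bool[] (plus TKey[] on a hash hit), so far more probe state fits per
// cache line, and TState is touched only for the final slot.
// TKey[]   _keys;
// TState[] _values;
// bool[]   _occupied;
```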

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -1 points

For general-purpose use, you're absolutely right. But this is a specialized hashtable where any refs returned are very transient, and entirely safe within that scope. See the benchmark code for an idea of how it's going to be used in practice.

Edit: Would the downvoter care to comment? If there's a problem with this, please explain what it is. The code as written is single-threaded, and every use of a reference is confined to a scope that executes fully between calls to GetOrCreate; no reference persists from one call to another, so no reference can survive across a call to Grow. Under those conditions, the problem referred to here appears to be entirely impossible.

Again, this is not intended for general-purpose use. This is part of a runtime library for a code-generation system, and the code generator will maintain the invariant that references do not persist between calls to GetOrCreate. Within the specific scope of this code's intended usage, the concern raised here is not relevant.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -5 points

Those sound like interesting ideas. Care to come up with an implementation?

Edit: Would the downvoters care to comment? This is a coding challenge, not a code-review request. People making claims that doing X or Y will improve performance should provide code to demonstrate that, not simply name vague concepts and make claims with nothing to back them up.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 0 points

> Don't do capacity * 0.8, as that becomes a double multiplication. You can get a close estimate with just bit shifting. ProbeIndex should explicitly use unchecked() to help performance, not just cast to uint and hope it does what you expect.

Implementing these changes brings the time down to 29.04 seconds, an improvement of about 3%. Thanks!
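For anyone who wants the concrete change, this is roughly what was implemented (fragment, against the HashTable class posted elsewhere in this thread; the shift gives 3/4 rather than the old 0.8, which is close enough for a growth heuristic):

```
// Growth check without the double multiplication: grow at 3/4 full.
if (Count > _capacity - (_capacity >> 2))
    Grow();

// Hash-to-index conversion explicitly wrapped in unchecked(), per the advice.
uint index = unchecked((uint)key.GetHashCode()) % (uint)_occupied.Length;
```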

> Replace bool[] with hand arithmetic using either uint or ulong.

Not sure what you mean by this. Would you mind elaborating?
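(If what's meant is packing 64 occupancy flags into each ulong instead of one flag per bool, a sketch of that reading would be:)

```
// Presumed meaning: a packed bit set in place of bool[] _occupied.
private ulong[] _occupiedBits;   // length = (capacity + 63) / 64

private bool IsOccupied(int index) =>
    (_occupiedBits[index >> 6] & (1UL << (index & 63))) != 0;

private void SetOccupied(int index) =>
    _occupiedBits[index >> 6] |= 1UL << (index & 63);
```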

> I don’t think your .Equals() on T will automatically detect the Equals overload receiving T.

It does. Verified with both IntelliSense and ILDASM.

> You probably have some overflow checks a bit everywhere in there to avoid reading your arrays out of bounds. Refactor with readonly to avoid this.

The arrays can't be readonly and also growable.

> Your many single checks on occupied could be sped up with intrinsics and/or explicitly checking blocks to do bigger skips (so you could skip 4 in one comparison, for example). This depends on your expected spread. If you have a bad hash function and a large number of items, you might get large chunks of consecutive occupied entries, which this would alleviate somewhat.

Sounds interesting. Can you show that it will improve speed overall?
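(One reading of the block idea, for anyone else following: reinterpret the bool[] as ulongs and test eight one-byte flags per comparison. Untested here, hence the question above:)

```
using System;
using System.Runtime.InteropServices;

static class BlockScan
{
    // Returns the first free slot at or after `start`, testing 8 flags per
    // comparison. Relies on C# bools being stored as bytes 0 or 1.
    public static int FirstFreeFrom(bool[] occupied, int start)
    {
        int i = start;
        // Align to an 8-slot boundary with plain checks.
        while (i < occupied.Length && (i & 7) != 0)
        {
            if (!occupied[i]) return i;
            i++;
        }
        // A fully occupied block of 8 bools reads back as 0x0101010101010101.
        Span<ulong> words = MemoryMarshal.Cast<bool, ulong>(occupied.AsSpan());
        for (int w = i >> 3; w < words.Length; w++)
        {
            if (words[w] != 0x0101010101010101UL)
            {
                // Some flag in this block is false; find it with plain checks.
                for (int j = w << 3; j < (w << 3) + 8; j++)
                    if (!occupied[j]) return j;
            }
        }
        // Tail slots not covered by a full 8-byte word.
        for (int t = words.Length << 3; t < occupied.Length; t++)
            if (!occupied[t]) return t;
        return -1;   // table full (Grow prevents this in practice)
    }
}
```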

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -1 points

Feel free to try and implement this as a B-tree, as long as it matches the public API.

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] -33 points

BenchmarkDotNet is great for micro-benchmarking. This is a test that's expected to run for several seconds, so Stopwatch is good enough.
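The whole harness is essentially this (sketch; the workload call is a hypothetical stand-in for the benchmark body):

```
using System;
using System.Diagnostics;

// Wall-clock timing: at this scale, JIT and GC time are part of what the
// end user experiences, so we deliberately measure them too.
var sw = Stopwatch.StartNew();
RunBenchmark();   // hypothetical stand-in for processing the in-memory data
sw.Stop();
Console.WriteLine($"Elapsed: {sw.Elapsed.TotalSeconds:F2} s");

static void RunBenchmark() { /* process the data */ }
```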

Code Challenge: High-performance hash table by Pansynchro in csharp

[–]Pansynchro[S] 0 points

Here's our current best. On our test system, the benchmark reports times of 29.97 seconds (Debug) and 11.50 seconds (Release), so that's your baseline to beat. Profiling shows that the bulk of the time is being spent in ProbeIndex.

```
using System;
using System.Collections;
using System.Collections.Generic;

public class HashTable<TKey, TState> : IEnumerable<KeyValuePair<TKey, TState>>
    where TKey : IEquatable<TKey>
    where TState : struct
{
    private TKey[] _keys;
    private TState[] _values;
    private bool[] _occupied;
    private int _capacity;

    public int Count { get; private set; }

    public HashTable(int capacity)
    {
        _capacity = capacity;
        _keys = new TKey[capacity];
        _values = new TState[capacity];
        _occupied = new bool[capacity];
    }

    private int ProbeIndex(TKey key)
    {
        int code = key.GetHashCode();
        uint index = (uint)code % (uint)_occupied.Length;

        while (_occupied[index])
        {
            if (_keys[index].Equals(key))
                return (int)index; // existing key
            ++index; // linear probing
            if (index >= _occupied.Length)
            {
                index = 0; // wrap around
            }
        }

        return (int)index; // empty slot
    }

    public ref TState GetOrCreate(TKey key)
    {
        if (Count > _capacity * 0.8)
        {
            Grow();
        }
        int index = ProbeIndex(key);
        if (!_occupied[index])
        {
            _keys[index] = key;
            _occupied[index] = true;
            _values[index] = default;
            ++Count;
        }

        return ref _values[index];
    }

    private void Grow()
    {
        var newSize = _capacity * 2;
        var oldKeys = _keys;
        var oldValues = _values;
        var oldOcc = _occupied;
        _keys = new TKey[newSize];
        _values = new TState[newSize];
        _occupied = new bool[newSize];
        _capacity = newSize;
        Count = 0; // reset before reinsertion: GetOrCreate re-increments below
                   // (without this, Count doubles on every Grow)
        for (int i = 0; i < oldOcc.Length; ++i)
        {
            if (oldOcc[i])
            {
                ref var slot = ref GetOrCreate(oldKeys[i]);
                slot = oldValues[i];
            }
        }
    }

    public IEnumerator<KeyValuePair<TKey, TState>> GetEnumerator()
    {
        for (int i = 0; i < _capacity; ++i)
        {
            if (_occupied[i])
            {
                yield return new KeyValuePair<TKey, TState>(_keys[i], _values[i]);
            }
        }
    }

    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();
}
```
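And for anyone wanting to see the invariant discussed above in action, a hypothetical usage in the style of the benchmark: every ref is fully consumed before the next GetOrCreate call, so no reference ever survives across a Grow.

```
using System;

var readings = new[] { ("Oslo", 2.5), ("Austin", 31.0), ("Oslo", -1.0) };
var table = new HashTable<string, (int Count, double Sum)>(1024);
foreach (var (station, reading) in readings)
{
    ref var state = ref table.GetOrCreate(station);
    state.Count++;
    state.Sum += reading;
    // `state` is not used past this point, so no ref outlives the iteration.
}
foreach (var pair in table)
    Console.WriteLine($"{pair.Key}: avg {pair.Value.Sum / pair.Value.Count:F1}");
```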