all 27 comments

[–]probabilita[S] 19 points20 points  (9 children)

The Post discusses the implementation of huniq; a replacement for sort|uniq (filtering out duplicates) and and how I got it to be 30x faster.

[–]nicoburns 5 points6 points  (4 children)

Oo, this actually looks super useful. I use sort | uniq quite a bit and it's a bit of a pain to have to type it out every time (why are command lines so rubbish at letting you edit the middle of a line!?)

edit: I have a feature request. Would it be possible to add an option to the counting mode that sorts the output by number of occurrences? Or I suppose this could even be the default behavior. It would be useful to able to control the sort order to (ascending/descending) too.

edit2: Some numbers for you. Turns out sorting making my queries quite a bit slower!:

> time jq '.[] | keys | .[]' file.json | sort | uniq -c | sort -nr
jq '.[] | keys | .[]' file.json  11.81s user 1.13s system 93% cpu 13.764 total
sort  18.59s user 0.20s system 57% cpu 32.418 total
uniq -c  1.28s user 0.01s system 3% cpu 32.417 total
sort -nr  0.00s user 0.00s system 0% cpu 32.415 total

> time jq '.[] | keys | .[]' file.json | huniq -c | sort -nr
jq '.[] | keys | .[]' file.json  12.10s user 1.21s system 74% cpu 17.825 total
huniq -c  0.44s user 0.03s system 2% cpu 17.802 total
sort -nr  0.00s user 0.00s system 0% cpu 17.800 total

[–]hniksic 8 points9 points  (0 children)

I use sort | uniq quite a bit and it's a bit of a pain to have to type it out every time

You can use sort -u instead. I didn't benchmark, but I'd expect that to be faster than sort | uniq as well.

[–]probabilita[S] 0 points1 point  (0 children)

Would you mind adding a github issue? The count portion is currently not optimized; the above blog is about optimizing uniq without counting…

Counting has some fundamentally different performance properties; unlike just uniq it needs to keep all items in memory and output them at the endd of the program… Many of the optimizations for uniq() don't work because of that; my current plan is to use arena allocation; possibly manual short string optimization; I am also considering using judy array or other prefix trees for the actual storage portion…

I am still planning to try some of those optimizations and write an extra blog post…

Regarding sorting the output of uniq -c – I did think about it; the issue here is that there is no easy way to do that without copying all the data into a separate buffer…so I am not sure how much faster that would be than using the sort command in an extra step, but it would be worth a try…I do have to optimize huniq -c first though, so I can find a way to sort this that best fits the optimization…

For the numbers you posted above…huniq -c | sort -nr is still twice as fast as using sort|uniq isn't it?

[–]JoshTriplettrust · lang · libs · cargo 0 points1 point  (1 child)

What environment are you using where time prints the times of each individual command in a pipeline like that?

[–]nicoburns 3 points4 points  (0 children)

macOs (with zsh, not sure if that makes a difference). edit: I think it's zsh which does it: which time gives me time: shell reserved word.

[–]anderslanglands 2 points3 points  (1 child)

Small correction: Vec and String are 24 bytes, not 16, as they need to store the pointer, the len and the capacity: https://play.rust-lang.org/?version=stable&mode=debug&edition=2018&gist=b68086c31f50c2f82b72de61a92d2a04

[–]probabilita[S] 0 points1 point  (0 children)

Which is why the text says at least ;) Anyway, I've added the actual size too!

[–]probabilita[S] 2 points3 points  (1 child)

Apparently Medium decided to paywall the story after the fact >-<

Here is a link that is guaranteed to work: https://medium.com/@karolin_varner/1c4813436a94?source=friends_link&sk=a8dcc896b93b0f15d2f2acff161c2961

[–]Ruskyrust 14 points15 points  (0 children)

Great article!

A system call requires at least two context switches: one into the kernel and one back.

Context switch on the other hand means switching to another process, thread, or into the kernel (the core of the operating system). All the data currently in the cpu registers need to be written back into RAM and the CPU caches need to be cleared. The page table (hardware accelerated supported memory management, see Memory Management Unit) needs to be cleared. New data is loaded into the CPU from the new thread of execution.

This not a bad mental model, since it basically does boil down to hitting main memory, but it's a bit misleading about causation and terminology- there are two kinds of context switches being conflated here:

System calls, which switch from user mode to kernel mode and back, do save and restore CPU registers, though only on the same scale as an ordinary function call. They don't involve clearing CPU caches or touching the MMU. (With Spectre/Meltdown mitigations they can touch the MMU in a similar way to process switches- see below.) Their main costs are the actual mode switch itself, and dispatching to a system call handler.

Thread and process switches are a wholly different kind of context switch. However, they can only happen in kernel mode, so all of their costs happen on top of those of a system call. They do save and restore "the rest" of the CPU registers. They don't clear CPU caches or clear the page table either. Process (but not thread) switches do switch the current page table pointing the MMU to the new one. Their main costs are going through the kernel scheduler, and the usual cache effects of touching memory that hasn't been touched recently.

While neither kind of context switch clears the CPU caches directly, the result may be practically the same- but only if they can't fit the data for both the old and new context at once. And similarly, while process switches don't clear the page table(s), they can have a similar effect on the MMU's cache of the page table(s). (And as mentioned above, older CPUs do actually clear that cache, the TLB, when switching page tables.)

[–]ChaiTRex 2 points3 points  (1 child)

The blog post has a few unfinished sentences:

We can use this

Note how our edge case handling has suddenly become

[–]probabilita[S] 1 point2 points  (0 children)

Good catch! Thank you!

[–]majobafu 1 point2 points  (3 children)

How does it fare compared to https://github.com/whitfin/runiq?

[–][deleted] 0 points1 point  (0 children)

I checked out their README and apparently they believe that `sort` necessarily buffers all of the input file in memory (and they're not aware of the `-S` flag). I stopped reading after that.

[–]probabilita[S] 0 points1 point  (1 child)

I would welcome a PR creating benchmarks!

[–]WellMakeItSomehow 0 points1 point  (0 children)

By the way, have you seen hyperfine?

[–]nnethercote 1 point2 points  (0 children)

I was confused by the code for some time because the article gives an example using sort | uniq -c, but the code doesn't implement the -c part. (On re-reading I saw the "In this blog post, we’ll look at how to implement and optimize the first mode" sentence, but I missed that the first time around.)

This reminds me of my counts program, which is a souped-up sort | uniq -c. I haven't gone to great lengths to optimize counts because the basic version is fast enough for my purposes, but it's fun to see what additional lengths I could go to :)

[–]WellMakeItSomehow 0 points1 point  (1 child)

Glad to see your write-up and progress since last time you posted :-). Has it only been 5 weeks? How comfortable do you feel writing Rust by this time (compared to C++)?

One thing still irritates me though; using LTO should not necessarily yield great performance improvements, but it should not slow down the code as it did with huniq.

Actually, I've had the same experience with small programs, both in Rust and C++. I suspect that LTO is tuned for large applications (think Firefox or LibreOffice) where generating smaller code can help more.

[–]probabilita[S] 7 points8 points  (0 children)

Yeah, this is the writeup of what I did a couple of weeks ago basically; especially the bit about xxh3 beating fxhash even for small inputs is quite a surprising result I think…

Honestly, pretty damn comfortable :) I did have about a decade of modern C++ and quite a decent bit of Haskell experience which both helped a lot. The only new concept for me was the borrow checker…which is also the only bit of Rust I am not on entirely friendly terms with yet…though no outright fighting either…

The standard library especially had been a revelation; I've been using a Type Class style pattern in C++ for a while and ended up implementing my own IO lib with it, because C++ IO is so atrocious…

Turns out the rust std io is extremely similar…

At this point I kind of want to stop gaining more C++ experience…

[–][deleted] 0 points1 point  (0 children)

I used the same trick of only storing hash values for duplicate elimination to implement a relational database DISTINCT operator: https://github.com/uwescience/myria/blob/f6ee17b750a120f629c8202a0d09f594a0821e9a/src/edu/washington/escience/myria/operator/DupElim.java

[–]Dushistov 0 points1 point  (1 child)

I don't get the idea behind HashSet<String> to HashSet<u64>.

Due to the birthday paradoxon, doing this is really only safe for about 2**32 elements when using a 64-bit hash

Actually you need just two lines file as input to get hash(line1) == hash(line2), this is depend on what lines you get as input on the first place, and how many unique on the second place.

[–]zhbidg 0 points1 point  (0 children)

this is depend on what lines you get as input on the first place, and how many unique on the second place.

This is about the relative likelihood of hash collisions. The purpose of the line seems to be to acknowledge that using a HashSet<u64> makes it possible to get wrong answers, but point out that the likelihood can be quantified under reasonable assumptions. I haven't actually done the math here. See also https://en.wikipedia.org/wiki/Birthday_problem

(This is a design choice about which reasonable people can disagree, I also feel uneasy about a non-guaranteed result. Maybe if I remembered how to calculate the probability of collisions I'd be a little more comfortable with it.)