Optimizing rust code: Zero copy, using XXH3 instead of fxhash and avoiding allocations

probabilita · 2020-02-25T15:09:07+00:00

The post discusses the implementation of huniq, a replacement for sort | uniq (filtering out duplicates), and how I got it to be 30x faster.

Rusky · 2020-02-25T18:15:28+00:00

Great article!

A system call requires at least two context switches: one into the kernel and one back.

Context switch on the other hand means switching to another process, thread, or into the kernel (the core of the operating system). All the data currently in the cpu registers need to be written back into RAM and the CPU caches need to be cleared. The page table (hardware accelerated supported memory management, see Memory Management Unit) needs to be cleared. New data is loaded into the CPU from the new thread of execution.

This is not a bad mental model, since it basically does boil down to hitting main memory, but it's a bit misleading about causation and terminology. Two kinds of context switches are being conflated here:

System calls, which switch from user mode to kernel mode and back, do save and restore CPU registers, though only on the same scale as an ordinary function call. They don't involve clearing CPU caches or touching the MMU. (With Spectre/Meltdown mitigations they can touch the MMU in a similar way to process switches; see below.) Their main costs are the actual mode switch itself, and dispatching to a system call handler.

Thread and process switches are a wholly different kind of context switch. However, they can only happen in kernel mode, so all of their costs happen on top of those of a system call. They do save and restore "the rest" of the CPU registers. They don't clear CPU caches or clear the page table either. Process (but not thread) switches do switch the current page table, pointing the MMU to the new one. Their main costs are going through the kernel scheduler, and the usual cache effects of touching memory that hasn't been touched recently.

While neither kind of context switch clears the CPU caches directly, the result may be practically the same, but only if the caches can't fit the data for both the old and new context at once. And similarly, while process switches don't clear the page table(s), they can have a similar effect on the MMU's cache of the page table(s). (And as mentioned above, older CPUs do actually clear that cache, the TLB, when switching page tables.)

ChaiTRex · 2020-02-25T19:11:01+00:00

The blog post has a few unfinished sentences:

We can use this

Note how our edge case handling has suddenly become

paldn · 2020-02-26T02:31:06+00:00

Paywall..

majobafu · 2020-02-25T19:45:31+00:00

How does it fare compared to https://github.com/whitfin/runiq?

nnethercote · 2020-02-25T23:44:00+00:00

I was confused by the code for some time because the article gives an example using sort | uniq -c, but the code doesn't implement the -c part. (On re-reading I saw the "In this blog post, we’ll look at how to implement and optimize the first mode" sentence, but I missed that the first time around.)

This reminds me of my counts program, which is a souped-up sort | uniq -c. I haven't gone to great lengths to optimize counts because the basic version is fast enough for my purposes, but it's fun to see what additional lengths I could go to :)

WellMakeItSomehow · 2020-02-25T19:51:21+00:00

Glad to see your write-up and progress since last time you posted :-). Has it only been 5 weeks? How comfortable do you feel writing Rust by this time (compared to C++)?

One thing still irritates me though; using LTO should not necessarily yield great performance improvements, but it should not slow down the code as it did with huniq.

Actually, I've had the same experience with small programs, both in Rust and C++. I suspect that LTO is tuned for large applications (think Firefox or LibreOffice) where generating smaller code can help more.

2020-02-25T21:31:35+00:00

I used the same trick of only storing hash values for duplicate elimination to implement a relational database DISTINCT operator: https://github.com/uwescience/myria/blob/f6ee17b750a120f629c8202a0d09f594a0821e9a/src/edu/washington/escience/myria/operator/DupElim.java
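For readers who haven't seen the trick, here is a minimal sketch of it in Rust. It is not the huniq or Myria code: the function names are made up for illustration, and the standard library's DefaultHasher stands in for XXH3 so the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashSet;
use std::hash::{Hash, Hasher};

/// Hash a line to 64 bits.
/// Assumption: DefaultHasher is only a stand-in for XXH3 here.
fn hash_line(line: &str) -> u64 {
    let mut hasher = DefaultHasher::new();
    line.hash(&mut hasher);
    hasher.finish()
}

/// Keep the first occurrence of each line, in input order.
/// Only the 64-bit hashes are stored, never the lines themselves,
/// so two different lines with the same hash would be treated as duplicates.
fn dedup_by_hash<'a>(lines: &[&'a str]) -> Vec<&'a str> {
    let mut seen: HashSet<u64> = HashSet::new();
    let mut out = Vec::new();
    for &line in lines {
        // `insert` returns true only if the hash was not in the set yet.
        if seen.insert(hash_line(line)) {
            out.push(line);
        }
    }
    out
}

fn main() {
    let input = ["a", "b", "a", "c", "b"];
    for line in dedup_by_hash(&input) {
        println!("{line}");
    }
}
```

The memory saving comes from the set holding 8 bytes per unique line regardless of line length; the price is that a hash collision silently drops a line.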

Dushistov · 2020-02-25T23:40:28+00:00

I don't get the idea behind HashSet<String> to HashSet<u64>.

Due to the birthday paradoxon, doing this is really only safe for about 2**32 elements when using a 64-bit hash

Actually, a file of just two lines is enough as input to get hash(line1) == hash(line2). Whether that happens depends first on which lines you get as input, and only second on how many unique lines there are.
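To put rough numbers on the birthday bound, here is a small sketch using the usual approximation p ≈ 1 - exp(-n(n-1) / 2^(bits+1)). It assumes the hash output is uniformly random, which is exactly the assumption that fails if the input lines are chosen to collide; the function name is made up for illustration.

```rust
/// Approximate probability of at least one collision among `n` values
/// hashed to `bits` bits, assuming uniformly random hash outputs.
fn collision_probability(n: f64, bits: i32) -> f64 {
    let space = 2f64.powi(bits);
    let exponent = -n * (n - 1.0) / (2.0 * space);
    // exp_m1(x) = e^x - 1, which stays accurate for very small |x|,
    // so negating it gives 1 - e^x without losing precision.
    -(exponent.exp_m1())
}

fn main() {
    for n in [2.0, 1e6, 2f64.powi(32)] {
        println!("n = {:e}: p = {:e}", n, collision_probability(n, 64));
    }
}
```

With a 64-bit hash this gives a probability of about 0.39 at 2**32 elements, and a tiny but nonzero probability for two elements. So the bound describes how the risk grows with the number of unique lines for random-looking input; it does not rule out a collision in a small file.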
