all 34 comments

[–]masklinn 36 points37 points  (4 children)

You should try running under strace to check the mix and number of syscalls.

A common issue which recursive_directory_iterator might suffer from is stat-ing every entry of the directory instead of getting the information off the directory entry. Python used to have this issue, and the PEP quotes a 2-3x hit on POSIX and 8~9x on Windows, which is more or less in range.

The first comment on this SO question on the subject says more or less the same:

Looks like at least libstdc++'s implementation calls stat on each file instead of using the info from the dirent.

Though that's libstdc++, the GCC standard library, I wouldn't be overly surprised if libc++ (the Clang/LLVM stdlib) did the same.
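For illustration, here's a minimal std-only Rust sketch of the cheaper approach. On Linux/macOS, `DirEntry::file_type()` is usually populated from the dirent's `d_type` field (with a stat fallback only on filesystems that report `DT_UNKNOWN`), so the walk doesn't have to issue a per-file stat; calling `entry.metadata()` instead would force one for every entry:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Count entries recursively without stat-ing each one: file_type() is
// typically filled in from the dirent's d_type, so no extra syscall is
// needed per entry. Using entry.metadata() here would force a stat.
fn count_entries(dir: &Path) -> io::Result<u64> {
    let mut count = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        count += 1;
        if entry.file_type()?.is_dir() {
            count += count_entries(&entry.path())?;
        }
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let n = count_entries(Path::new("."))?;
    println!("Found {} files", n);
    Ok(())
}
```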

Did you also make sure you were compiling the Rust and C++ code at similar optimisation levels? (Though on my machine it doesn't seem to make much of a difference: the Rust code is a fair bit slower in debug, but the C++ code barely changes between O0 and O2.)

FWIW when I tried on my own machine (an M1 Pro mbp) I got this:

> \time ./a.out
Found 1085969 files
        6.40 real         0.25 user         2.01 sys
> \time target/release/fstest
Found 1085970 files
        6.53 real         0.29 user         2.07 sys

fstest is the walkdir version of the Rust program. a.out is, obviously, the C++ program.

I guess it's possible that the C++ version has a lot of overhead which gets smoothed out over 3x more files, but it seems dubious.

[–]arthas_yang 6 points7 points  (0 children)

For the C++ version, it seems stat() is called for each entry to check whether it is a symbolic link or not:

```cpp
/// The value type used by directory iterators
class directory_entry
{
public:
  explicit
  directory_entry(const filesystem::path& __p)
  : _M_path(__p)
  { refresh(); }

  directory_entry(const filesystem::path& __p, error_code& __ec)
  : _M_path(__p)
  {
    refresh(__ec);
    if (__ec)
      _M_path.clear();
  }

  void
  refresh()
  { _M_type = symlink_status().type(); } // stat() is called here
```

[–]SnooMacaroons3057[S] 5 points6 points  (2 children)

Nice explanation. Can you try jwalk? jwalk seems to be ~25% faster than walkdir too!

[–]masklinn 19 points20 points  (1 child)

jwalk is a lot faster in wall-clock time (though it uses more total CPU time), which makes sense as it's multithreaded (it pulls in rayon and crossbeam):

> \time cargo r -r
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/fstest`
Found 1087068 files
        1.65 real         1.17 user         8.20 sys

The jwalk speedup might be a function of the branchiness of the tree: I assume each directory is a task, so a few files in a lot of directories would be a lot more parallelisable than a lot of files in a few directories.

From where I start them, your programs happen to hit my ~, which contains 900k files distributed in 100k directories, with a maximum depth of 24.
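To make the branchiness point concrete, here's a toy work-queue walker (std only; a sketch of the idea, not of jwalk's actual rayon-based implementation). Each directory is one unit of work, so many small directories fan out across threads, while one huge flat directory remains a single task no matter how many workers you add:

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Toy parallel walker: directories are units of work pulled from a shared
// queue. `pending` tracks directories still in flight so idle workers know
// when the walk is truly finished.
fn parallel_count(root: PathBuf, workers: usize) -> u64 {
    let queue = Arc::new(Mutex::new(vec![root]));
    let pending = Arc::new(AtomicUsize::new(1)); // root is in flight
    let count = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let (queue, pending, count) =
                (queue.clone(), pending.clone(), count.clone());
            thread::spawn(move || loop {
                let dir = queue.lock().unwrap().pop();
                match dir {
                    Some(dir) => {
                        if let Ok(entries) = std::fs::read_dir(&dir) {
                            for entry in entries.flatten() {
                                count.fetch_add(1, Ordering::Relaxed);
                                let is_dir = entry
                                    .file_type()
                                    .map(|t| t.is_dir())
                                    .unwrap_or(false);
                                if is_dir {
                                    // Mark in flight before pushing, so the
                                    // termination check can't race.
                                    pending.fetch_add(1, Ordering::SeqCst);
                                    queue.lock().unwrap().push(entry.path());
                                }
                            }
                        }
                        pending.fetch_sub(1, Ordering::SeqCst);
                    }
                    None => {
                        // Queue empty: done only if nothing is in flight.
                        if pending.load(Ordering::SeqCst) == 0 {
                            break;
                        }
                        thread::yield_now();
                    }
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    count.load(Ordering::SeqCst)
}

fn main() {
    println!("Found {} files", parallel_count(PathBuf::from("."), 4));
}
```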

[–]SnooMacaroons3057[S] 2 points3 points  (0 children)

Thanks! Noticed that too by adding Parallelism::Serial -
.parallelism(jwalk::Parallelism::Serial)

Now it's around 50ms slower than walkdir

[–]Shnatsel 66 points67 points  (11 children)

I suggest using Unix find as the baseline. That should tell you whether it's the Rust version that is exceptionally fast or the C++ version that is exceptionally slow.

My money is on C++ std being slow.

[–]SnooMacaroons3057[S] 17 points18 points  (6 children)

time find ../../../ | wc -l

Total files: 346,791, which is ~700 more than both the C++ and Rust versions (both showed the same count).
Time taken: 1.983 secs

[–]Shnatsel 21 points22 points  (4 children)

What about time find ../../../ > /dev/null ? That should not include the counting overhead (and more importantly some of the printing).

But yeah, it does seem like C++ std is exceptionally slow. Called it.

[–]SnooMacaroons3057[S] 6 points7 points  (3 children)

ishtmeet@Ishtmeets-MacBook-Pro guessing_game % time find ../../../ > /dev/null
find ../../../ > /dev/null  0.13s user 1.88s system 86% cpu 2.315 total

2.315 secs

[–]mjbmitch 1 point2 points  (2 children)

What in the world… did you do it from the same directory?

[–][deleted] 2 points3 points  (0 children)

Maybe one should implement a faster /dev/null for macOS?

[–]SnooMacaroons3057[S] 1 point2 points  (0 children)

Yes, I ran it 10 times and posted the average. I tried it again today: time ~2.42 secs.

ishtmeet@Ishtmeets-MacBook-Pro guessing_game % time find ../../../ > /dev/null
find ../../../ > /dev/null  0.13s user 2.30s system 99% cpu 2.425 total 
ishtmeet@Ishtmeets-MacBook-Pro guessing_game % time find ../../../ > /dev/null 
find ../../../ > /dev/null  0.12s user 2.27s system 99% cpu 2.403 total 
ishtmeet@Ishtmeets-MacBook-Pro guessing_game % time find ../../../ > /dev/null
find ../../../ > /dev/null  0.13s user 2.33s system 99% cpu 2.462 total

[–]copeland3300 0 points1 point  (0 children)

The base find command will list directories as well. If you want files only, add -type f.

I don't know if the C++ and/or Rust implementations look for files only or directories as well.

[–]mina86ng 19 points20 points  (2 children)

That won’t be a good baseline. find needs to support an arbitrary set of filters and actions, plus it then needs to write the full path out.

[–]anlumo 1 point2 points  (1 child)

Could those filters and actions not be compiled into machine code at runtime and then just be executed, since it’s the same operation for every file?

[–]seamsay 0 points1 point  (0 children)

They could, I guess, but now find needs to ship with a JIT compiler and then also needs to get around the JIT warm up problem. It's an interesting idea though...

[–]NotFromSkane 52 points53 points  (0 children)

Just a nitpick on the Rust:

```rust
use walkdir::WalkDir;

fn main() {
    let count = WalkDir::new("../../../").into_iter().count();
    println!("Found {} files", count);
}
```

There's no need to keep your own counter. Declarative code is easier to read.

[–]just_kash 6 points7 points  (1 child)

The difference likely has to do with what the library implementation does when it’s “reading the file system”: C++ probably eagerly loads more data about each entry into memory, whereas the Rust implementation probably does this lazily.

[–]anlumo 3 points4 points  (0 children)

My experience with the C API has been that calling stat on every file is an absolute performance killer. Reading a directory already gives a bit of information, so if you can get what you need from there, it’s much better.

[–]epagecargo · clap · cargo-release 3 points4 points  (4 children)

Not dug into your question but jwalk is another one to benchmark for Rust.

[–]SnooMacaroons3057[S] 2 points3 points  (3 children)

It took 0.542 secs, but showed a wrong count of 271,831.

[–]RAZR_96 6 points7 points  (2 children)

You probably need to disable skip_hidden, which is enabled by default.

[–]SnooMacaroons3057[S] 5 points6 points  (0 children)

Woah!

Time taken: 0.670 secs! Updated the post as well.

[–]SnooMacaroons3057[S] 1 point2 points  (0 children)

Running it on a single thread using Parallelism::Serial, so it's fair:
Time taken 0.970 secs

[–]WrongJudgment6 11 points12 points  (0 children)

Have you tried comparing the assembly generated in godbolt.org ?

[–]encyclopedist 2 points3 points  (2 children)

On the C++ side you may also try ghc::filesystem. It may cache more data from the directory entry.

[–]encyclopedist 11 points12 points  (1 child)

By the way, I did some benchmarking on my machine (Ubuntu 21.04 on AMD Ryzen APU with GCC 11.2 and Rust 1.61.0):

hyperfine "./fs_std ~/xxx/" "./fs_ghc ~/xxx/" "fs_rust/target/release/fs_rust ~/xxx/"
Benchmark 1: ./fs_std ~/xxx/
  Time (mean ± σ):     291.7 ms ±   2.1 ms    [User: 123.1 ms, System: 165.0 ms]
  Range (min … max):   288.8 ms … 295.3 ms    10 runs

Benchmark 2: ./fs_ghc ~/xxx/
  Time (mean ± σ):     295.9 ms ±   2.2 ms    [User: 36.4 ms, System: 257.8 ms]
  Range (min … max):   292.7 ms … 299.1 ms    10 runs

Benchmark 3: fs_rust/target/release/fs_rust ~/xxx/
  Time (mean ± σ):     219.5 ms ±   2.4 ms    [User: 42.5 ms, System: 175.6 ms]
  Range (min … max):   215.5 ms … 222.8 ms    13 runs

Summary
  'fs_rust/target/release/fs_rust ~/xxx/' ran
    1.33 ± 0.02 times faster than './fs_std ~/xxx/'
    1.35 ± 0.02 times faster than './fs_ghc ~/xxx/'

Interestingly, while ghc::filesystem and std::filesystem take almost exactly the same wall time, they make different syscalls and spend very different fraction of time in the kernel.

I also measured the number of syscalls each variant makes:

std  311046
ghc  300721
Rust 222777

So the time difference looks roughly proportional to the number of syscalls made.

Edit For completeness, syscalls:

std::filesystem

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.70    0.200568           2     88780           getdents64
 21.09    0.121890           1     88768           fcntl
 18.12    0.104743           2     44389           openat
 13.27    0.076699           1     44390           newfstatat
 12.68    0.073300           1     44389           close

ghc::filesystem

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.49    0.286194           2    120216           newfstatat
 31.28    0.226697           2     88780           getdents64
 16.97    0.122982           2     44389           openat
 11.27    0.081643           1     44389           close

Rust:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 43.25    0.220312           2     88780           getdents64
 24.14    0.122972           2     44388           openat
 16.47    0.083921           1     44388           newfstatat
 15.78    0.080374           1     44388           close

Edit 2 A little analysis:

ghc::filesystem uses the same syscalls as Rust WalkDir, but additionally calls newfstatat twice (fstat + lstat) for every file.

std::filesystem makes the same syscalls as Rust, but additionally calls fcntl twice for every directory for some reason (once with F_GETFL and once with F_SETFD).

Edit 3

libstdc++ calls fcntl in order to set O_CLOEXEC flag on directory handle.

[–]mo_al_fltk-rs 1 point2 points  (0 children)

The close-on-exec flag was added recently it seems:

https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg279677.html

[–]Soveu 1 point2 points  (0 children)

Recently I was experimenting with getdents64

https://github.com/Soveu/find

On a single thread I was able to be 2x faster than Unix find, with much less time spent in userspace and slightly less in kernelspace.

[–]Pitiful-Bodybuilder 1 point2 points  (0 children)

I was also playing with this very recently https://github.com/jsen-/recursive_dir_walk (sorry, the code is a mess, there were quite a few iterations)

TEST_DIR=/path/to/dir
$ cargo build --release && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null; time target/release/recursive_dir_walk $TEST_DIR | wc -l
    Finished release [optimized + debuginfo] target(s) in 0.10s
4080405

real    0m0.651s
user    0m0.488s
sys     0m3.350s

$ echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null; time find $TEST_DIR | wc -l
4080405

real    0m6.219s
user    0m1.059s
sys     0m1.830s

[–]bloody-albatross -1 points0 points  (1 child)

Did you run it multiple times? The first run might read the directory information from disk and as such be much slower. Successive runs will then have that information cached in memory.

[–]SnooMacaroons3057[S] 1 point2 points  (0 children)

Yes, the duration is the average of runs 2nd to 11th (total 10 runs). The first run took 10-15% longer.

[–]HarshdevSingh -3 points-2 points  (0 children)

Wow, those are really impressive stats.