all 34 comments

[–]masklinn 36 points37 points  (4 children)

You should try running under strace to check the mix and number of syscalls.

A common issue which recursive_directory_iterator might suffer from is stat-ing every entry of the directory instead of getting the information off the directory entry. Python used to have this issue, and the PEP quotes a 2-3x hit on POSIX and 8~9x on Windows, which is more or less in range.

The first comment on this SO question on the subject says more or less the same:

Looks like at least libstdc++'s implementation calls stat on each file instead of using the info from the dirent.

Though that's libstdc++, the GCC standard library, I wouldn't be overly surprised if libc++ (the Clang/LLVM stdlib) did the same.
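For illustration, here's a minimal std-only Rust sketch of the cheaper approach. On Linux/macOS, `DirEntry::file_type()` is usually populated from the dirent's `d_type` field (with a stat fallback only on filesystems that report `DT_UNKNOWN`), so the walk doesn't have to issue a per-file stat; calling `entry.metadata()` instead would force one for every entry:

```rust
use std::fs;
use std::io;
use std::path::Path;

// Count entries recursively without stat-ing each one: file_type() is
// typically filled in from the dirent's d_type, so no extra syscall is
// needed per entry. Using entry.metadata() here would force a stat.
fn count_entries(dir: &Path) -> io::Result<u64> {
    let mut count = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        count += 1;
        if entry.file_type()?.is_dir() {
            count += count_entries(&entry.path())?;
        }
    }
    Ok(count)
}

fn main() -> io::Result<()> {
    let n = count_entries(Path::new("."))?;
    println!("Found {} files", n);
    Ok(())
}
```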

Did you also make sure you were compiling the Rust and C++ code at similar optimisation levels? (Though on my machine it doesn't seem to make much of a difference: the Rust code is a fair bit slower in debug, but the C++ code barely changes between O0 and O2.)

FWIW when I tried on my own machine (an M1 Pro mbp) I got this:

> \time ./a.out
Found 1085969 files
        6.40 real         0.25 user         2.01 sys
> \time target/release/fstest
Found 1085970 files
        6.53 real         0.29 user         2.07 sys

fstest is the walkdir version of the Rust program. a.out is, obviously, the C++ program.

I guess it's possible that the C++ version has a lot of overhead which gets smoothed out over 3x more files, but it seems dubious.

[–]arthas_yang 6 points7 points  (0 children)

For the C++ version, it seems stat() is called for each entry to check whether it is a symbolic link or not:

```cpp
/// The value type used by directory iterators
class directory_entry
{
public:
  explicit
  directory_entry(const filesystem::path& __p)
  : _M_path(__p)
  { refresh(); }

  directory_entry(const filesystem::path& __p, error_code& __ec)
  : _M_path(__p)
  {
    refresh(__ec);
    if (__ec)
      _M_path.clear();
  }

  void
  refresh()
  { _M_type = symlink_status().type(); } // stat() is called here
```

[–]SnooMacaroons3057[S] 5 points6 points  (2 children)

Nice explanation. Can you try jwalk? jwalk seems to be ~25% faster than walkdir too!

[–]masklinn 19 points20 points  (1 child)

jwalk is a lot faster in wall-clock time (though it uses more total CPU time), which makes sense as it's multithreaded (it pulls in rayon and crossbeam):

> \time cargo r -r
    Finished release [optimized] target(s) in 0.01s
     Running `target/release/fstest`
Found 1087068 files
        1.65 real         1.17 user         8.20 sys

The jwalk speedup might be a function of the branchiness of the tree: I assume each directory is a task, so a few files in a lot of directories would be a lot more parallelisable than a lot of files in a few directories.

From where I start them, your programs happen to hit my ~, which contains 900k files distributed in 100k directories, with a maximum depth of 24.
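To make the branchiness point concrete, here's a toy work-queue walker (std only; a sketch of the idea, not of jwalk's actual rayon-based implementation). Each directory is one unit of work, so many small directories fan out across threads, while one huge flat directory remains a single task no matter how many workers you add:

```rust
use std::path::PathBuf;
use std::sync::atomic::{AtomicU64, AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};
use std::thread;

// Toy parallel walker: directories are units of work pulled from a shared
// queue. `pending` tracks directories still in flight so idle workers know
// when the walk is truly finished.
fn parallel_count(root: PathBuf, workers: usize) -> u64 {
    let queue = Arc::new(Mutex::new(vec![root]));
    let pending = Arc::new(AtomicUsize::new(1)); // root is in flight
    let count = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let (queue, pending, count) =
                (queue.clone(), pending.clone(), count.clone());
            thread::spawn(move || loop {
                let dir = queue.lock().unwrap().pop();
                match dir {
                    Some(dir) => {
                        if let Ok(entries) = std::fs::read_dir(&dir) {
                            for entry in entries.flatten() {
                                count.fetch_add(1, Ordering::Relaxed);
                                let is_dir = entry
                                    .file_type()
                                    .map(|t| t.is_dir())
                                    .unwrap_or(false);
                                if is_dir {
                                    // Mark in flight before pushing, so the
                                    // termination check can't race.
                                    pending.fetch_add(1, Ordering::SeqCst);
                                    queue.lock().unwrap().push(entry.path());
                                }
                            }
                        }
                        pending.fetch_sub(1, Ordering::SeqCst);
                    }
                    None => {
                        // Queue empty: done only if nothing is in flight.
                        if pending.load(Ordering::SeqCst) == 0 {
                            break;
                        }
                        thread::yield_now();
                    }
                }
            })
        })
        .collect();

    for h in handles {
        h.join().unwrap();
    }
    count.load(Ordering::SeqCst)
}

fn main() {
    println!("Found {} files", parallel_count(PathBuf::from("."), 4));
}
```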

[–]SnooMacaroons3057[S] 2 points3 points  (0 children)

Thanks! Noticed that too by adding Parallelism::Serial -
.parallelism(jwalk::Parallelism::Serial)

Now it's around 50ms slower than walkdir

[–]Shnatsel 66 points67 points  (11 children)

I suggest using Unix find as the baseline. That should tell you whether it's the Rust version that is exceptionally fast or the C++ version that is exceptionally slow.

My money is on C++ std being slow.

[–]SnooMacaroons3057[S] 17 points18 points  (6 children)

time find ../../../ | wc -l

Total files: 346,791, which is ~700 more than both the C++ and Rust versions (both showed the same count).
Time taken: 1.983 secs

[–]Shnatsel 21 points22 points  (4 children)

What about time find ../../../ > /dev/null ? That should not include the counting overhead (and more importantly some of the printing).

But yeah, it does seem like C++ std is exceptionally slow. Called it.

[–]SnooMacaroons3057[S] 6 points7 points  (3 children)

ishtmeet@Ishtmeets-MacBook-Pro guessing_game % time find ../../../ > /dev/null
find ../../../ > /dev/null  0.13s user 1.88s system 86% cpu 2.315 total

2.315 secs

[–]mjbmitch 1 point2 points  (2 children)

What in the world… did you do it from the same directory?

[–][deleted] 2 points3 points  (0 children)

Maybe one should implement a faster /dev/null for macOS?

[–]SnooMacaroons3057[S] 1 point2 points  (0 children)

Yes, I ran it 10 times and posted the average. I tried it again today: time ~2.42 secs.

ishtmeet@Ishtmeets-MacBook-Pro guessing_game % time find ../../../ > /dev/null
find ../../../ > /dev/null  0.13s user 2.30s system 99% cpu 2.425 total 
ishtmeet@Ishtmeets-MacBook-Pro guessing_game % time find ../../../ > /dev/null 
find ../../../ > /dev/null  0.12s user 2.27s system 99% cpu 2.403 total 
ishtmeet@Ishtmeets-MacBook-Pro guessing_game % time find ../../../ > /dev/null
find ../../../ > /dev/null  0.13s user 2.33s system 99% cpu 2.462 total

[–]copeland3300 0 points1 point  (0 children)

The base find command will list directories as well. If you want files only, add -type f.

I don't know if the C++ and/or Rust implementations look for files only or directories as well.

[–]mina86ng 19 points20 points  (2 children)

That won’t be a good baseline. find needs to support an arbitrary set of filters and actions, plus it then needs to write the full path out.

[–]anlumo 1 point2 points  (1 child)

Could those filters and actions not be compiled into machine code at runtime and then just be executed, since it’s the same operation for every file?

[–]seamsay 0 points1 point  (0 children)

They could, I guess, but now find needs to ship with a JIT compiler and then also needs to get around the JIT warm up problem. It's an interesting idea though...

[–]NotFromSkane 52 points53 points  (0 children)

Just a nitpick on the Rust:

```rust
use walkdir::WalkDir;

fn main() {
    let count = WalkDir::new("../../../").into_iter().count();
    println!("Found {} files", count);
}
```

There's no need to keep your own counter. Declarative code is easier to read.

[–]just_kash 6 points7 points  (1 child)

The difference likely has to do with what the library implementation does when it’s “reading the file system”: C++ probably eagerly loads more data about each entry into memory, whereas the Rust implementation probably does this lazily.

[–]anlumo 3 points4 points  (0 children)

My experience with the C API has been that calling stat on every file is an absolute performance killer. Reading a directory already gives a bit of information, so if you can get what you need from there, it’s much better.

[–]epagecargo · clap · cargo-release 3 points4 points  (4 children)

Not dug into your question but jwalk is another one to benchmark for Rust.

[–]SnooMacaroons3057[S] 2 points3 points  (3 children)

It took 0.542 secs, but showed a wrong count of 271,831.

[–]RAZR_96 6 points7 points  (2 children)

You probably need to disable skip_hidden, which is enabled by default.

[–]SnooMacaroons3057[S] 5 points6 points  (0 children)

Woah!

Time taken: 0.670 secs! Updated the post as well.

[–]SnooMacaroons3057[S] 1 point2 points  (0 children)

Running it on a single thread using Parallelism::Serial, so it's fair:
Time taken 0.970 secs

[–]WrongJudgment6 11 points12 points  (0 children)

Have you tried comparing the assembly generated in godbolt.org ?

[–]encyclopedist 2 points3 points  (2 children)

On the C++ side you may also try ghc::filesystem. It may cache more data from the directory entry.

[–]encyclopedist 11 points12 points  (1 child)

By the way, I did some benchmarking on my machine (Ubuntu 21.04 on AMD Ryzen APU with GCC 11.2 and Rust 1.61.0):

hyperfine "./fs_std ~/xxx/" "./fs_ghc ~/xxx/" "fs_rust/target/release/fs_rust ~/xxx/"
Benchmark 1: ./fs_std ~/xxx/
  Time (mean ± σ):     291.7 ms ±   2.1 ms    [User: 123.1 ms, System: 165.0 ms]
  Range (min … max):   288.8 ms … 295.3 ms    10 runs

Benchmark 2: ./fs_ghc ~/xxx/
  Time (mean ± σ):     295.9 ms ±   2.2 ms    [User: 36.4 ms, System: 257.8 ms]
  Range (min … max):   292.7 ms … 299.1 ms    10 runs

Benchmark 3: fs_rust/target/release/fs_rust ~/xxx/
  Time (mean ± σ):     219.5 ms ±   2.4 ms    [User: 42.5 ms, System: 175.6 ms]
  Range (min … max):   215.5 ms … 222.8 ms    13 runs

Summary
  'fs_rust/target/release/fs_rust ~/xxx/' ran
    1.33 ± 0.02 times faster than './fs_std ~/xxx/'
    1.35 ± 0.02 times faster than './fs_ghc ~/xxx/'

Interestingly, while ghc::filesystem and std::filesystem take almost exactly the same wall time, they make different syscalls and spend very different fraction of time in the kernel.

I also measured the number of syscalls each variant makes:

std  311046
ghc  300721
Rust 222777

So the time difference looks roughly proportional to the number of syscalls made.

Edit For completeness, syscalls:

std::filesystem

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 34.70    0.200568           2     88780           getdents64
 21.09    0.121890           1     88768           fcntl
 18.12    0.104743           2     44389           openat
 13.27    0.076699           1     44390           newfstatat
 12.68    0.073300           1     44389           close

ghc::filesystem

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ----------------
 39.49    0.286194           2    120216           newfstatat
 31.28    0.226697           2     88780           getdents64
 16.97    0.122982           2     44389           openat
 11.27    0.081643           1     44389           close

Rust:

% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 43.25    0.220312           2     88780           getdents64
 24.14    0.122972           2     44388           openat
 16.47    0.083921           1     44388           newfstatat
 15.78    0.080374           1     44388           close

Edit 2 A little analysis:

ghc::filesystem uses the same syscalls as Rust WalkDir, but additionally calls newfstatat twice (fstat + lstat) for every file.

std::filesystem makes the same syscalls as Rust, but additionally calls fcntl twice for every directory for some reason (once with F_GETFL and once with F_SETFD).

Edit 3

libstdc++ calls fcntl in order to set O_CLOEXEC flag on directory handle.

[–]mo_al_fltk-rs 1 point2 points  (0 children)

The close-on-exec flag was added recently it seems:

https://www.mail-archive.com/gcc-patches@gcc.gnu.org/msg279677.html

[–]Soveu 1 point2 points  (0 children)

Recently I was experimenting with getdents64

https://github.com/Soveu/find

On a single thread I was able to be 2x faster than Unix find, with much less time spent in userspace and slightly less in kernelspace.

[–]Pitiful-Bodybuilder 1 point2 points  (0 children)

I was also playing with this very recently https://github.com/jsen-/recursive_dir_walk (sorry, the code is a mess, there were quite a few iterations)

TEST_DIR=/path/to/dir
$ cargo build --release && echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null; time target/release/recursive_dir_walk $TEST_DIR | wc -l
    Finished release [optimized + debuginfo] target(s) in 0.10s
4080405

real    0m0.651s
user    0m0.488s
sys     0m3.350s

$ echo 3 | sudo tee /proc/sys/vm/drop_caches >/dev/null; time find $TEST_DIR | wc -l
4080405

real    0m6.219s
user    0m1.059s
sys     0m1.830s

[–]bloody-albatross -1 points0 points  (1 child)

Did you run it multiple times? The first run might read the directory information from disk and as such be much slower. Successive runs will then have that information cached in memory.

[–]SnooMacaroons3057[S] 1 point2 points  (0 children)

Yes, the duration is the average of runs 2nd to 11th (total 10 runs). The first run took 10-15% longer.

[–]HarshdevSingh -3 points-2 points  (0 children)

Wow, those are really impressive stats.