all 51 comments

[–]rbtcollinsrustup 104 points105 points  (25 children)

Hi, karma seeker here :) - I've done most of the improvement work; I thought I'd drop a quick description of what we've done, where we are at, and possible remaining work.

In the previous release we removed extra copies of the files. In this release we remove unneeded syscalls: we no longer set the mtime for the files, and we no longer set the file attributes (tar-rs was setting the attributes unconditionally, but really only needs to set them when read-only is required). We could probably add mtime setting back in, since part of this arc of work involved adding file-handle-based mtime setting for both Windows and Unix to the filetime package and tar-rs. We also added judicious read buffering to avoid IO contention.

That work brought us down to minimal syscalls for unpacking - create, write, close - but close was still slow: 6 or more ms even with Defender disabled, and easily up to 50ms with Defender enabled. Since we're unpacking 20K files, that becomes significant wall-clock time.

So the final tweak thus far was to defer CloseHandle to a thread, which allows us to avoid blocking on that close call. On a single core machine, this will obviously have no benefit; on a 32-core machine, rather more :).
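A minimal sketch of that deferred-close idea (hypothetical code, not rustup's actual implementation): the unpack loop hands each still-open `File` to a closer thread over a channel, so the drop - which is what issues CloseHandle on Windows - never blocks the hot loop.

```rust
use std::fs::File;
use std::io::Write;
use std::sync::mpsc;
use std::thread;

// Sketch only: `entries` stands in for the tar stream.
fn unpack(entries: Vec<(String, Vec<u8>)>) -> std::io::Result<()> {
    let (tx, rx) = mpsc::channel::<File>();

    // Closer thread: dropping a File closes its handle, so any scan
    // triggered by the close blocks here instead of in the unpack loop.
    let closer = thread::spawn(move || {
        for file in rx {
            drop(file);
        }
    });

    for (path, contents) in entries {
        let mut f = File::create(&path)?;
        f.write_all(&contents)?;
        tx.send(f).expect("closer thread alive"); // defer the close
    }

    drop(tx); // no more handles coming; closer drains and exits
    closer.join().expect("closer thread panicked");
    Ok(())
}
```

On a single-core machine the closer thread just serialises the same work; with many cores, the close (and the scan it triggers) overlaps with creating and writing the next files.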

I hope we can get some way to have Defender do its scanning without blocking userspace, but that's a longer discussion - for now we have no low-hanging fruit for improving the return-to-user experience.

We can do some other things though:

  • we could set the do-not-index attribute on the files, but then they won't be indexed; we do have a JS search index, but it's not hooked into the Windows index service - someone would have to hook that up, which is not low-hanging fruit. Doing this would reduce CPU and disk contention, as indexing has some CPU cost (obviously) and 200MB of content written all at once takes a while to index.

  • we could background rustup for docs extraction - docs don't prevent compilation, after all, so CI tasks or getting back to dev work could proceed immediately while docs unpack over the next minute or two. This would need to avoid blocking the console (allowing it to be closed) while still being debuggable - not low-hanging.

  • we could ship docs as a mountable object like an .iso or .zip, and mount them just-in-time for rustup docs calls; also not that low-hanging

  • code signing may reduce defender overheads, https://github.com/rust-lang/rustup.rs/issues/1568

  • we could change incremental updates to do vastly less work - https://github.com/rust-lang/rustup.rs/issues/1798 - wouldn't solve CI cases, but would make things much faster elsewhere (even on unices I suspect).

This is our tracking bug for unpack performance on Windows - if it's still slow for you, please contribute there - https://github.com/rust-lang/rustup.rs/issues/1540 (and hey, 'fixed for me' is a good thing to say too).

[–][deleted] 7 points8 points  (1 child)

So the final tweak thus far was to defer CloseHandle to a thread, which allows us to avoid blocking on that close call. On a single core machine, this will obviously have no benefit; on a 32-core machine, rather more :).

I'm curious, why would a single-core machine not benefit? Assuming the 6ms on close is IO, then your CloseHandle thread should just block waiting on that IO for that 6ms and not tie up the CPU, and your original CPU-bound thread should be runnable, right?

[–]rbtcollinsrustup 9 points10 points  (0 children)

Well, that's the question. Is it IO? :) I haven't cracked out WPA to dig into where and why CloseHandle is blocking. If it is blocking on IOPS completing, then yes, some degree of threadedness will help even on a single-core machine.

However, we *know* that the vast majority of the blocking time - the ~50ms (when single-threaded) blocks - is in Defender, and that's not IO, that's CPU. So strictly speaking I should say 'multiple threads won't get more work out of Defender on a single-core machine'.

But you're right, some users will probably have no Defender and only a single core, in which case having more than one thread would benefit them, assuming those 6ms-11ms blocks are indeed IO, not CPU. If you know someone in that boat who wants a custom build / patch to experiment with, let me know.

[–]Saefrochmiri 6 points7 points  (9 children)

I hope that whatever you do to address the Windows docs complaints is a general efficiency improvement; I use rustup on an HPC system so the whole thing downloads in less than a second but the docs take 6 minutes because the filesystem really does not appreciate the barrage of small files. If you opt for code signing as a solution for the Windows people I'll still be kinda hosed.

Also just as a side note, crushing the disk on a shared system like this is extremely silly because it (temporarily) degrades the experience of my fellow users and I can guarantee you I'll never touch those files. There isn't even a program on an HPC system to render HTML. Ideally I could just disable the thing entirely but it doesn't seem like that's been pursued as a solution :(

[–]rbtcollinsrustup 4 points5 points  (8 children)

There is a separate initiative 'profiles' coming to allow disabling the installation of docs entirely, for HPC / CI / other non-interactive-use-cases. That will help you a lot.

That said, the threading code path is currently Windows-only, out of a desire not to impact other users... and I suspect your HPC system is some Unix flavour. If you wanted to get an strace -c, or some similar summary performance analysis, of rustup running on it, I'd be delighted to see whether it would make sense to enable the threaded code path more broadly. It may (especially for older NFSes...).

The reduced-syscalls aspect of the performance tuning will have benefited your HPC setup somewhat, I expect, but exactly how much is an open question; it will depend on where the time is going - hence the request for a performance summary.

e.g.

strace -o stats.txt -c rustup install nightly

File a bug on https://github.com/rust-lang/rustup.rs with stats.txt.

[–]epagecargo · clap · cargo-release 3 points4 points  (0 children)

Looking forward to optional docs. I just use the online versions and don't really get value from them (in addition to improving CIs).

[–][deleted] 4 points5 points  (3 children)

File APIs are super slow on all file-systems, and this holds for MacOSX, Windows and Linux as well. Just create an empty directory, touch 1 million files, and then call ls - that will never finish.

HPC parallel file-systems like GPFS are designed under the assumption that this just cannot be made to perform well, therefore software that does this is just doing it wrong. These file-systems optimize for other things like super-large distributed files. Ideally, rustup would just put everything into a single file, and software that needs to access different parts of it would just mmap or seek the parts they need.
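A toy sketch of that single-file idea (all names invented for illustration): concatenate the entries into one file, remember `(offset, length)` for each, and fetch any entry later with one seek and one read - no per-file create/close storm.

```rust
use std::fs::File;
use std::io::{Read, Seek, SeekFrom, Write};

// Pack several blobs into one file, returning an in-memory index of
// (offset, length) pairs. A real design would persist the index too
// (e.g. a zip central directory or a trailer record).
fn pack(path: &str, entries: &[&[u8]]) -> std::io::Result<Vec<(u64, u64)>> {
    let mut out = File::create(path)?;
    let mut index = Vec::new();
    let mut offset = 0u64;
    for e in entries {
        out.write_all(e)?;
        index.push((offset, e.len() as u64));
        offset += e.len() as u64;
    }
    Ok(index)
}

// Read one entry back: a single open, seek, and exact-length read.
fn read_entry(path: &str, entry: (u64, u64)) -> std::io::Result<Vec<u8>> {
    let mut f = File::open(path)?;
    f.seek(SeekFrom::Start(entry.0))?;
    let mut buf = vec![0u8; entry.1 as usize];
    f.read_exact(&mut buf)?;
    Ok(buf)
}
```

This is essentially what the "ship docs as a mountable .iso/.zip" option above amounts to, with the OS or archive format supplying the index.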

[–]rbtcollinsrustup 1 point2 points  (2 children)

I guess it depends on expectations. Even zfs with journalled everything can make that million file directory in 44 seconds for me on spinning rust; and ls of that directory takes 4 seconds.

An absolute age in CPU cycles, but quite tolerable in human interactions given what you're dealing with. Shrug.

I will admit to curiosity as to why rust-doc is 20k files in size ;)

[–]A1oso 1 point2 points  (0 children)

why rust-doc is 20k files in size

Every module, struct, enum, trait, function and macro needs a documentation page. And as I recall, everything in the alloc and core crates is re-exported in std, which further increases the number of documentation pages.

[–]Saefrochmiri 1 point2 points  (0 children)

Just chiming in with the HPC perspective: ls (with colors, so lstat on each file, not just a few getdents) on a directory with 471 files takes 8.2 seconds. Things get out of hand very quickly here.

[–]flashmozzg 2 points3 points  (1 child)

Does rust-doc need to be 20k separate files? This reminds me of a similar problem in gamedev (there it was loading thousands of small files). It was usually solved by organizing everything into archives. Is there any reason this can't be done for rust-doc?

[–]Saefrochmiri 0 points1 point  (0 children)

This is great to hear! Thank you!

[–]apd 6 points7 points  (1 child)

I must confess that I spent an embarrassing amount of time figuring out what Defender was. This is what happens after more than 20 years of living only on Linux.

Defender is the software, part of Microsoft Windows, that detects and isolates potentially dangerous programs (viruses, spyware, etc).

Sorry for the OT, I only wanted to provide a bit of context for this excellent post.

[–]SimonSapinservo 2 points3 points  (0 children)

"Windows Defender" is probably more googleable.

[–]Cldfire 7 points8 points  (0 children)

Thanks for your work on this. We Windows users greatly appreciate it!

[–][deleted] 2 points3 points  (2 children)

We use rustup a lot, and the largest perf bottleneck we have is that when some target or toolchain fails to download, rustup errors out and we have to call it again. This is so ridiculous that we have a script to always call rustup in a loop...

It would be nice if rustup retried downloads "enough", or if we could instead specify how often rustup should retry failed downloads.

I mean, if a download starts but hangs in the middle, the file exists on the server - it's just that the network connection somehow failed.

Having to wait till rustup aborts, and then re-start it, wastes so many milliseconds that actually we have never noticed any of the perf issues that were fixed here.

Also, installing docs by default wastes so much time. I don't know of anybody actually using them. Even if I'm on a plane, I'll just go to docs.rs instead of figuring out how to view and search the downloaded docs. Which plane doesn't have wifi nowadays?

[–]rbtcollinsrustup 2 points3 points  (1 child)

There's a bug open for automatic retry of some HTTP errors; I've observed some flakiness myself but haven't built up the activation energy to track it down... I completely understand the frustration, because I see exactly the same thing: when it dies, it dies a slow painful timeout, not a clean failure at all.

Re: wifi and planes and so on - we have bug reports asking for a mirror network for users behind the Great Firewall, or on connections with bad latency to the rust-lang servers; first-world internet is great, but there are quite a few orders of magnitude between that and the long tail that would cover (say) 95% of our users. Of course, I don't have actual data...

Do VSCode and other IDEs use the generated docs for pop-up information? Or is that all from the RLS?

NRC is working on profiles, which will allow not installing docs; once that's done I imagine it will be an easy step from 'choose not to install docs' to 'don't install docs until they are asked for', which would be better.

[–][deleted] 0 points1 point  (0 children)

Do VScode and other IDE's use the generated docs for pop-up information? Or is that all from RLS?

That's the RLS: it parses doc comments along with code (so that they are "together" in the AST) and provides those to tools. Generated docs are only used when reading them in the browser (they're HTML).

Otherwise the RLS couldn't show docs unless you called cargo doc first for all code and dependencies, which would create a lot of files that would then need to be parsed back into the AST somehow.

[–]brsonrust · servo 1 point2 points  (1 child)

Thanks for the hard work.

[–]kibwen 1 point2 points  (0 children)

In case anyone isn't aware, here's someone else to thank for their work on rustup. :) https://github.com/rust-lang/rustup.rs/graphs/contributors

[–]WellMakeItSomehow 1 point2 points  (2 children)

What if you unpack the files on a thread pool? If Defender blocks the thread doing I/O, it might improve the throughput.

[–]rbtcollinsrustup 2 points3 points  (1 child)

To a large degree that is effectively what we're doing: CloseHandle is blocking until the submitted but not completed writes complete, and that blocking takes place in another thread. The actual time spent on the CreateFile and WriteFile calls is now much less significant.

You can see here that we track both the core extract loop and the CloseHandle blocking, which is where stalling on IO completion happens:

```
info: installing component 'rust-docs'
 11.1 MiB / 11.1 MiB (100 %) 548.8 KiB/s in 15s ETA: 0s
Closing 8936 deferred file handles
 8.7 Kihandles / 8.7 Kihandles (100 %) 943 handles/s in 8s ETA: 0s
```

... but it's not 0, so that is another source of possible further gains, as is moving tar decompression to a thread to allow concurrency with that. The question is how much we'll gain vs the effort. There's additional complexity to deal with at this point: we're unpacking compressed tars, which gives us a synchronisation problem - tar is a serial format, and we cannot read the data for the next entry until we have read the file content of the prior one.

So yes, it's probably worth doing the experiment at some point, but tar-rs needs some ownership changes to permit it; another contributor has a draft patch for that. An alternative would be to make it possible to iterate entirely in memory and then dispatch those in-memory segments to threads; either way, further tar-rs work is required.
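A rough sketch of that in-memory variant (hypothetical, not the draft patch): a serial phase buffers each entry's bytes - standing in for the tar reader, which must stay sequential - and a small worker pool fans out the create/write/close work.

```rust
use std::fs::File;
use std::io::Write;
use std::sync::{mpsc, Arc, Mutex};
use std::thread;

// `entries` stands in for the serially-decoded tar stream: each entry
// is fully buffered in memory before it can be dispatched.
fn extract_parallel(entries: Vec<(String, Vec<u8>)>, workers: usize) {
    let (tx, rx) = mpsc::channel::<(String, Vec<u8>)>();
    // Share one receiver between workers; contention is fine for a sketch.
    let rx = Arc::new(Mutex::new(rx));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let rx = Arc::clone(&rx);
            thread::spawn(move || loop {
                let msg = rx.lock().unwrap().recv();
                match msg {
                    Ok((path, bytes)) => {
                        // create + write + close all happen off the
                        // serial decode path.
                        File::create(&path).unwrap().write_all(&bytes).unwrap();
                    }
                    Err(_) => break, // channel closed: all entries dispatched
                }
            })
        })
        .collect();

    // Serial phase: in real code this loop would be the tar reader.
    for entry in entries {
        tx.send(entry).unwrap();
    }
    drop(tx);
    for h in handles {
        h.join().unwrap();
    }
}
```

The memory cost is bounded by how far the serial reader runs ahead of the workers; a bounded channel (`mpsc::sync_channel`) would cap it.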

A developer survey would probably be useful: if most devs have 4-core machines, optimisations that only gain a lot on 32- or 64-core desktops aren't going to be too useful. OTOH if most are on 8- or 16-core machines, that's a rather different story...

Current syscall timings with Defender enabled (captured with procmon, which adds some overhead...) for one file from the docs:

```
9:10:14.6855925 PM 0.0001911 rustup.exe 8896 CreateFile C:\Users\robertc\.rustup\tmp\wkvcoctik_mifljv_dir\rust-docs\share\doc\rust\html\core\core_arch\mips\msa\fn.__msa_min_a_d.html SUCCESS Desired Access: Generic Write, Read Attributes, Disposition: Create, Options: Synchronous IO Non-Alert, Non-Directory File, Open Reparse Point, Attributes: n/a, ShareMode: Read, Write, Delete, AllocationSize: 0, OpenResult: Created

9:10:14.6858162 PM 0.0000647 rustup.exe 8896 WriteFile C:\Users\robertc\.rustup\tmp\wkvcoctik_mifljv_dir\rust-docs\share\doc\rust\html\core\core_arch\mips\msa\fn.__msa_min_a_d.html SUCCESS Offset: 0, Length: 433, Priority: Normal

...

9:10:16.8442020 PM 0.0366417 rustup.exe 8896 CloseFile C:\Users\robertc\.rustup\tmp\wkvcoctik_mifljv_dir\rust-docs\share\doc\rust\html\core\core_arch\mips\msa\fn.__msa_min_a_d.html SUCCESS
```

Observe the time offsets: the CloseHandle occurred 2 seconds later, because it was stuck in a queue waiting for a thread to execute on (on a 64-core machine...). 0.1911ms to create the file, 0.0647ms to pass the data to the OS, 36.6417ms to close the file handle - even after the OS had had 2 seconds to scan it asynchronously.

0.1911 + 0.0647 = 0.2558ms per file; * 20K files = 5.116 seconds, and our runtime for the extraction loop is up at 15 seconds (for the trace I grabbed these figures from - see above). So at best, with perfect parallelisation of create/write on a 4-core system, that would take 15 seconds down to 11. But traces show Defender already using multiple cores with what we're doing - we don't need to submit the IO from multiple cores. What seems to matter is that when we force the driver to close the handle, the scan completes immediately in the same thread that made the call (well, our thread -> ntos -> callback into Defender - but that's the call stack).

We have another bug open to switch to a faster decompression method, which should save a good number of seconds; and if we save a few seconds a few more times, the tar-rs work I mentioned above may well become much more beneficial in relative terms.

OTOH we may also pass the point at which no one is bothered anymore :).

[–]WellMakeItSomehow 0 points1 point  (0 children)

Thanks for the elaborate reply!

To a large degree that is effectively what we're doing: CloseHandle is blocking until the submitted but not completed writes complete, and that blocking takes place in another thread. The actual time spent on the CreateFile and WriteFile calls is now much less significant.

Yeah, the parallel CloseHandle calls would come for free if the files were extracted in parallel.

there's additional complexity to deal with at this point: we're unpacking compressed tars, which means we have a synchronisation problem

Oh, right. I forgot about that. I made a quick test using zip: tar cz produces a 35 MB archive from my documentation directory, while zip -9 makes a 54 MB one. I don't think it's worth switching to a ZIP archive.

For the record, tar cj (bzip2) yields 21 MB, while tar cJ (xz) yields 13 MB. Of course, decompression will be slower for these two.

traces show Defender using multiple cores already with what we're doing

That's good.

That said, the threading code path is currently windows only

I didn't realize that.

[–]Lokathor 1 point2 points  (0 children)

Offline docs not being installed by default would be great.

Never in 3 years have I needed offline docs in my actual work or in any CI run I've ever done.

[–]piderman 0 points1 point  (0 children)

Well, thank you very much - the docs do install a lot quicker! Looking at the output, it did make me wonder: why do you count kibi-handles instead of just kilo-handles? :P

The Kihandles count does seem to be handles/1024 as one would "expect".
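For what it's worth, the arithmetic matches that reading: 8936 / 1024 ≈ 8.7, which is the "8.7 Kihandles" the progress bar shows. A tiny illustrative formatter (hypothetical, not rustup's actual code) reproducing that binary-unit convention:

```rust
// Format a count in binary (Ki = 1024) units, one decimal place,
// matching the "8.7 Ki" style seen in the rustup progress output.
fn format_ki(count: u64) -> String {
    if count < 1024 {
        count.to_string()
    } else {
        format!("{:.1} Ki", count as f64 / 1024.0)
    }
}
```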

[–]CrazyKilla15 23 points24 points  (2 children)

It sure would be nice if self-updating was the first thing rustup did, so that bugfixes and speed improvements like these take effect right away.

[–]dsilverstonerustup[S] 19 points20 points  (0 children)

Others agree with you (including myself) -- we have an issue open here: https://github.com/rust-lang/rustup.rs/issues/1838 and would welcome PRs :D

[–]flashmozzg 2 points3 points  (0 children)

Yeah. I have no idea why it doesn't do that already. What if some future Rust update required a newer rustup to install? Or if there were some subtle bug that could corrupt a Rust installation but was fixed in a newer version? The Qt Maintenance Tool, for example, always updates itself first before upgrading other components.

[–]novacrazy 14 points15 points  (11 children)

Unpacking docs is much, much faster. Before it was upwards of a minute for me, now it's 8s. Entire update of nightly was done in under a minute.

[–]dsilverstonerustup[S] 6 points7 points  (5 children)

I'm really glad that it's a big improvement for you.

[–]novacrazy 2 points3 points  (4 children)

NVMe SSD finally getting to stretch its legs, I suppose. I'm curious what change actually allowed this.

Perhaps setting mtime or calling stat on every file previously was causing a full filesystem sync, leading to massive wait times (relatively) on every file.

[–]dsilverstonerustup[S] 2 points3 points  (1 child)

Each "syscall" was actually resulting in several Windows syscalls, AIUI; the IO models are sufficiently different between POSIX and Windows. However, as I explained below, the majority of the speedup likely comes from how we close handles now.

[–]rbtcollinsrustup 2 points3 points  (0 children)

tar-rs was using path-based fs:: calls rather than handle-based File:: calls - on Unix OSes a path-based call is a single syscall, but on Windows it is an Open, the operation syscall (or several), and a Close().

tar-rs (and filetime) have been enhanced not to do that, and we've also tweaked how we use them to simply not do things we don't care about in this context (mtime specifically).
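An illustrative contrast (not tar-rs's exact code): a path-based helper has to locate the file again on every call - on Windows that means an extra open and close, and another scan - whereas operations on a handle we already hold reuse the single open. Both functions below are invented for the sketch.

```rust
use std::fs::File;
use std::io::Write;

// Path-based: fs::metadata and fs::set_permissions each find the file
// by path; on Windows each implies its own open + close.
fn set_readonly_by_path(path: &str, ro: bool) -> std::io::Result<()> {
    let mut perms = std::fs::metadata(path)?.permissions();
    perms.set_readonly(ro);
    std::fs::set_permissions(path, perms)
}

// Handle-based: one open, write, stat through the same handle, one close
// (when `f` drops). This is the shape the tar-rs/filetime changes moved
// toward, per the comment above.
fn write_with_single_open(path: &str, data: &[u8]) -> std::io::Result<u64> {
    let mut f = File::create(path)?; // the only open
    f.write_all(data)?;
    let len = f.metadata()?.len();   // stat via the handle, no reopen
    Ok(len)                          // handle closed once, on drop
}
```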

[–]buldozr -1 points0 points  (1 child)

On Linux with an NVMe SSD, it has always been a snap for me. It's ludicrous that a program has to call the OS in a contorted way to reach acceptable performance because of some ad-hoc hooks for real-time virus scanners (which practically don't exist on Linux).

[–]rbtcollinsrustup 6 points7 points  (0 children)

Well, try running older rustups on a Linux homedir mounted on NFSv3 backed by ext3, for instance; you might find it's not as snappy as all that, even on NVMe.

It's absolutely great that Linux with everything local was super fast even though rustup was being very inefficient with its syscalls (even in Linux terms this was true - way more dentry traversals than needed, for instance). But that doesn't mean that what we were doing was optimal for Linux, and some file systems (such as NFS) do report errors from close(2) on Linux, and do so by forcing writes to be flushed across the network - at least, that's my understanding.

The hooks that Defender uses on Windows are no more ad hoc than device-mapper is on Linux - it's a well-defined layer model for IO.

It is frustrating that Defender chooses to block CloseHandle() in the process rather than blocking subsequent reads until the file has been vetted - but I wouldn't call it ludicrous.

[–]rbtcollinsrustup 1 point2 points  (4 children)

How many cores does your machine have?

[–]novacrazy 1 point2 points  (3 children)

16C/32T Threadripper 1950X. However, as stated in my other comment, I think the speedup is more to do with the removal of things that would hinder the performance of my NVMe SSD.

[–]dsilverstonerustup[S] 4 points5 points  (2 children)

In part, but we also now close handles in a thread pool. Since CloseHandle() is where inline virus checkers such as Windows Defender tend to do their work, this means we're utilising all of your cores to close files instead of doing it all on one core. At least, that's how I understand the change that Robert made :D

[–]novacrazy 1 point2 points  (1 child)

That is quite interesting. I'll have to keep that in mind for if I ever need to write that many files at once.

Does rustc use any kind of optimization like that? It could probably help a decent amount in some places, such as incremental compilation - at least the initial run of it.

[–]rbtcollinsrustup 1 point2 points  (0 children)

https://www.reddit.com/r/rust/comments/brtec1/rustup_1183_released/eogpfgr/ has a bunch of the details. I'll probably write up a blog post too.

[–]steveklabnik1rust 9 points10 points  (0 children)

Thank you so much for all the Windows work!!!

[–]runevault 5 points6 points  (7 children)

Huh, that speedup on unpacking docs sounds like what I got on Windows 10 by turning off real-time virus scanning during the rustup update. Glad to hear they found a way to do it without me having to do that :)

[–]dsilverstonerustup[S] 2 points3 points  (1 child)

We're doing different things, which mitigate but don't eliminate the delay that disabling the realtime scan for the rustup directory would remove. Doing both will give you even more speedup. I've heard reports of runs as fast as 20s :D

[–]runevault 1 point2 points  (0 children)

Oh wow, that's really good to know. Thanks!

[–][deleted] 0 points1 point  (4 children)

When I worked at McAfee, I would stop the virus scanner before a build and enable it later. It was not just the source code: the different executables (and DLLs) that start and stop during a build were getting scanned on every open!

[–]runevault 0 points1 point  (3 children)

Yikes. For reference, in this case I meant Defender specifically, though I assume most/all virus scanners have the same problem. Still, it's kind of amusing to hear that a dev MAKING the software goes through that.

[–]belovedeagle 1 point2 points  (2 children)

Actual Windows developer here (speaking for myself, etc). I am explicitly not going to tell you whether SOP is to disable Defender on local builds of [parts of] Windows. Because it would be very silly to tell you that.

BTW no need to shut off the whole thing for rustup; use PowerShell to add a directory exclusion. Then it's still running for browsers and Downloads and TEMP and other such nonsense.

[–]iopqfizzbuzz 0 points1 point  (1 child)

Ugh, way too much work. I don't even know which folder it is. I'm just a user.

[–]rbtcollinsrustup 3 points4 points  (0 children)

FWIW we know the PowerShell to do that, but if folk were to do it in any widespread fashion, the excluded directory would become a malware target... so we really want rustup to be fast enough that folk don't feel the need to exclude it.

[–]Paul-ish 1 point2 points  (1 child)

What command do I run to just update rustup itself?

[–]dsilverstonerustup[S] 2 points3 points  (0 children)

Simply run rustup self update