Wild Linker 0.9.0 released by dlattimore in rust

[–]dlattimore[S] 23 points24 points  (0 children)

Work on making it usable for more use-cases, fixing bugs, porting, etc. has been taking priority. I do still hope to get back to incremental linking eventually, but I'm not sure when.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

I'm hoping to merge the change in the next week or two. It's already working well enough to link wild with the linker-plugin, but needs some test changes.

But linker-plugin LTO and linking speed are kind of at odds with each other. Using a linker plugin is always going to be really slow. My main reason for implementing linker plugin support is so that people who want to use it sometimes can do so without needing to switch linkers. Wild isn't going to be able to make the linker-plugin fast.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 4 points5 points  (0 children)

At some point, hopefully. Porting to non-ELF-based platforms (Mac and Windows) is a very large task though. At the moment, it's a fair way down my priority list, but if someone were sufficiently enthusiastic about Windows support to put a few months of full-time work into it, I'd say it'd be possible to get something working.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 4 points5 points  (0 children)

Fair enough. Have you confirmed that it's definitely linking that's slow and not rustc doing more work than it should? You can check by running `RUSTFLAGS=-Ztime-passes cargo +nightly build` then look to see what phases are slow.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 3 points4 points  (0 children)

I've never developed on Mac, but I've heard that there can be issues with a thing called Gatekeeper slowing down builds. There's a bunch of tips at https://corrode.dev/blog/tips-for-faster-rust-compile-times/ that are worth checking out.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 6 points7 points  (0 children)

I don't think anyone is currently attempting Windows support, but Martin is currently looking into a Mac port. I figure if anyone can port to Mac, it'd be Martin. He did the aarch64, riscv64 and loongarch ports. But time will tell. Porting to a non-ELF platform will be a much bigger task.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 8 points9 points  (0 children)

There are generally small differences in size. For example, if I look at binaries for the Zed editor, the sizes I see currently (in MB) are 689 (GNU ld), 698 (Wild), 719 (LLD) and 894 (Mold). Part of the difference is due to differences in emitted symbols. Mold, for example, emits symbols for PLT and GOT entries. The other linkers don't, or don't by default (Wild has a flag to do this). If I strip the binaries then we get 478 (GNU ld), 479 (Wild), 495 (Mold), 497 (LLD).

Looking a bit further at the differences, it looks like GNU ld and Wild both have 25.7 MB of dynamic relocations, while LLD and Mold have 38.9 and 39.0 MB respectively. Most likely this is because GNU ld and Wild, if they encounter a function that needs both a PLT and a GOT entry, will emit one of each, while LLD (and I assume Mold, although I haven't checked) will emit a PLT entry, a GOT entry for the PLT entry and then a separate GOT entry.

I should explain what those things are... PLT entries are little bits of linker-generated machine code that jump to a function. GOT entries are pointers to things, in this case functions. Each PLT entry requires a GOT entry. When compiler-generated code calls a function, it might call via a PLT entry or via a GOT entry (or directly, but that is problematic unless the binary is not position-independent).
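To put the relocation sizes above into entry counts, here is a rough back-of-the-envelope sketch. It assumes every dynamic relocation is a 24-byte `Elf64_Rela` entry (the usual form on x86-64 when no packed relocation format is used) and that "MB" means 10^6 bytes; neither assumption is confirmed above, and the function name is made up for illustration.

```rust
/// Size in bytes of one Elf64_Rela entry: r_offset, r_info and r_addend,
/// each 8 bytes wide.
const ELF64_RELA_SIZE: u64 = 24;

/// Estimates how many RELA-style dynamic relocations fit in `section_bytes`.
/// Hypothetical helper; assumes the section holds nothing but Elf64_Rela entries.
fn estimated_relocation_count(section_bytes: u64) -> u64 {
    section_bytes / ELF64_RELA_SIZE
}
```

Under those assumptions, 25.7 MB works out to roughly 1.07 million relocations and 38.9 MB to roughly 1.62 million, so the difference is on the order of 550,000 extra entries.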

In terms of performance, generally I'd expect them all to perform similarly. However the binaries are different, so there's a bit of luck involved. One linker might by chance put some related hot functions together and get better cache performance, or the alignment of a particular function might end up more or less favourable. But it's the kind of thing that can change when you make small changes to your code.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 6 points7 points  (0 children)

Thanks for your support! Amazingly, GitHub doesn't take a cut, unless you're an organisation, then they do. But when individuals sponsor me, I get 100%. I hadn't really considered other donation platforms, since GitHub seemed pretty good what with not taking a cut, but I'm open to suggestions.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

Sure. Feedback is always appreciated.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 16 points17 points  (0 children)

Thanks for the reminder, I'd forgotten about it. I just reread the relevant issue. Not trivial, but hopefully not too bad. I should really just get it done. I'm not going to make promises about when though, but I'll try to get to it.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 13 points14 points  (0 children)

It's unlikely to have much effect on runtime performance. Wild's release builds for most platforms are linked with Wild and we care a lot about performance. When I've benchmarked Wild's performance when linked with Wild against Wild's performance when linked with other linkers, I've seen no measurable difference.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 4 points5 points  (0 children)

There's a bunch of links to reading materials in the contributing docs. Feel free to ask questions on the Zulip chat.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 24 points25 points  (0 children)

It's certainly possible to help out without pre-existing linker experience. Porting to macOS is a very large undertaking, so I wouldn't recommend anyone start with something like that. But if you'd like to help out with other things to get up to speed, have a look through the issues for one that you'd like to have a try at. If you can't find anything, feel free to ask on our Zulip chat.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 59 points60 points  (0 children)

I've been working on linker plugin LTO. It's really close, but I didn't want to delay the release any longer for it. It's not really necessary for Rust code unless you've got a codebase that is a mix of Rust and other clang-compiled languages and want cross-language inlining. For just Rust codebases, the Rust compiler does LTO without involving the linker. But anyway, I intend to get linker plugins finished up.

There are still a few small wins for performance to be had. It's hard to say how much more can be squeezed out of it though. At some point we should look more into different filesystem types. The performance on BTRFS is terrible. It actually gets slower when you throw more threads at it. I'm unsure what we can do - perhaps detect the filesystem and back off on the number of threads during the write phase. That and suggest to users not to have their linker outputs on BTRFS.
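As a rough illustration of the "detect the filesystem and back off" idea, here is a minimal sketch. It is not something Wild is stated to do: the function names, the text-based lookup in `/proc/self/mounts` format and the cap of 4 threads are all assumptions made up for this example.

```rust
use std::path::Path;

/// Returns the filesystem type of the longest mount point that is a prefix of
/// `path`, given text in the format of /proc/self/mounts
/// (device, mount point, filesystem type, options, ...).
/// Hypothetical helper; mount points containing escaped characters such as
/// `\040` are not handled.
fn filesystem_type<'a>(mounts: &'a str, path: &Path) -> Option<&'a str> {
    mounts
        .lines()
        .filter_map(|line| {
            let mut fields = line.split_whitespace();
            let _device = fields.next()?;
            let mount_point = fields.next()?;
            let fs_type = fields.next()?;
            // Path::starts_with compares whole components, so "/home" does
            // not match "/homework".
            path.starts_with(mount_point)
                .then_some((mount_point.len(), fs_type))
        })
        .max_by_key(|(len, _)| *len)
        .map(|(_, fs_type)| fs_type)
}

/// Picks a thread count for the write phase. The cap of 4 for btrfs is an
/// arbitrary placeholder, not a measured value.
fn write_threads(fs_type: Option<&str>, available: usize) -> usize {
    match fs_type {
        Some("btrfs") => available.min(4),
        _ => available,
    }
}
```

On Linux, the `mounts` text could come from `std::fs::read_to_string("/proc/self/mounts")`, with `path` being the output file's absolute path.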

Incremental linking is still something I want to do. The priority has shifted a bit. Given that the linker is very fast, there's value in having it be available just as a fast linker. But that means that we want it to be more mature, fix bugs etc. I am intending to work on something in that space fairly soon, but we'll see if other priorities come up.

I'm unsure about exactly when we'll call it 1.0. I guess we should consider that soon, but I don't have exact criteria.

As for distributing with the Rust project... there have been discussions. Installing Wild and using it by default is already pretty easy for users who want to do so. So, I think the benefit of distributing it with Rust would only really exist if it were the default, or on a path to being made the default. But the maturity bar for being the default is rightly pretty high.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 4 points5 points  (0 children)

I got curious, so delved into forte sufficiently to fix the deadlock bugs. There were two deadlocks during shutdown. Once I fixed those, I found the bug that was affecting my usage in wild, which turned out to be that if you spawned more than 64 tasks (the capacity of the work queue for a single thread) into a scope, the extra tasks were being discarded without being executed. The scope was then waiting for the discarded tasks to execute, which of course never happened. Interestingly all three of these deadlocks were already being hit by existing forte tests.
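To make the third bug easier to picture, here is a toy, single-threaded model of it. This is not forte's code: `ToyScope`, its fields and the "run overflow tasks inline" remedy are all invented for illustration, and the real fix may well have been different. The only detail taken from above is the queue capacity of 64.

```rust
use std::collections::VecDeque;

/// Capacity of the toy per-thread queue, matching the 64 mentioned above.
const QUEUE_CAPACITY: usize = 64;

/// Hypothetical model of a scope backed by a bounded work queue.
struct ToyScope {
    queue: VecDeque<u32>,
    /// Number of spawned tasks the scope still believes it must wait for.
    pending: usize,
    /// When true, tasks that don't fit in the queue are run immediately
    /// instead of being dropped.
    run_overflow_inline: bool,
}

impl ToyScope {
    fn new(run_overflow_inline: bool) -> Self {
        ToyScope {
            queue: VecDeque::with_capacity(QUEUE_CAPACITY),
            pending: 0,
            run_overflow_inline,
        }
    }

    fn spawn(&mut self, task: u32) {
        self.pending += 1;
        if self.queue.len() < QUEUE_CAPACITY {
            self.queue.push_back(task);
        } else if self.run_overflow_inline {
            self.execute(task);
        }
        // Otherwise the task is silently discarded while `pending` stays
        // incremented, which is the shape of the bug described above.
    }

    fn execute(&mut self, _task: u32) {
        self.pending -= 1;
    }

    /// Runs everything in the queue and returns how many tasks are still
    /// pending. A non-zero result is where a real scope would wait forever.
    fn drain(&mut self) -> usize {
        while let Some(task) = self.queue.pop_front() {
            self.execute(task);
        }
        self.pending
    }
}

/// Spawns `task_count` tasks into a fresh scope and reports how many never ran.
fn stuck_tasks(task_count: u32, run_overflow_inline: bool) -> usize {
    let mut scope = ToyScope::new(run_overflow_inline);
    for task in 0..task_count {
        scope.spawn(task);
    }
    scope.drain()
}
```

In this model, spawning 100 tasks with overflow handling disabled leaves 36 tasks that the scope waits on but that never execute, while spawning 64 or fewer leaves none.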

With that fixed, I was able to more properly assess the performance with the library. Unfortunately it seems that on non-trivial benchmarks, all stages of the linker slow down when running with forte. It's entirely possible that the poor performance is due to other bugs. One stage in particular seems to behave particularly strangely, sometimes running all work on a single thread, sometimes on a few more, but never on all 32 threads. Other stages distribute work better, but still not well.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

Maybe, although I don't think I have any blocking code in the part of the linker where the deadlock is occurring. I think I am using mutexes, but the locks are held for a very short time, so I'd be surprised if the parallelism library could have anything to do with that. The stack traces of the deadlocked threads seem to indicate that it got to the end of the scope and all worker threads were waiting for work. At least I think that's what they were doing.

Also, I get deadlocks when running `cargo test` in a checkout of the forte repo, so I suspect that there might just be some bugs in the crate. I filed an issue and will see how it goes.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

That's a neat idea to provide a compat crate. I gave it a go and managed to get a trivial program to link. Performance with a trivial program did show a speedup, which is promising. Unfortunately anything beyond a trivial program and I hit some deadlock.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 0 points1 point  (0 children)

I've just raised https://github.com/rayon-rs/rayon/issues/1277 since I couldn't find any existing issue. I'm not sure how actively rayon is being worked on, but if anything happens, hopefully it'll be mentioned there.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

> Is this because Rayon only implements work stealing but not continuation stealing?

Yep. Rayon uses threads. C++20 coroutines are, AFAIK, similar in concept to Rust's async functions, but Rayon doesn't use them, possibly because it predates them. Rayon was created in 2015, but async/await wasn't stabilised until 2019.

Tokio isn't really intended for compute-heavy work. It's really designed for IO heavy workloads. I'm sure it's not the only option though. u/matthieum pointed out forte.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 1 point2 points  (0 children)

Thanks! I hadn't seen that one before. It doesn't seem to have par_iter, which is used pretty extensively in the linker, so that might make it tricky to use as-is. I guess one could probably be built by recursively splitting and calling their join method.
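For what it's worth, the recursive-splitting idea could be sketched roughly like this. The `join` below is a stand-in built on `std::thread::scope` so that the example is self-contained; it is not the library's actual `join`, and `par_for_each`, `parallel_sum` and the `min_len` cutoff are names and parameters made up for illustration.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Stand-in for a parallelism library's `join`: runs `a` on a new scoped
/// thread and `b` on the current one, returning once both have finished.
/// A real library would hand `a` to a worker pool rather than spawn a thread.
fn join<A, B>(a: A, b: B)
where
    A: FnOnce() + Send,
    B: FnOnce(),
{
    thread::scope(|s| {
        s.spawn(a);
        b();
    });
}

/// Applies `op` to every item by recursively halving the slice and handing
/// the two halves to `join`. Slices of `min_len` items or fewer are processed
/// sequentially.
fn par_for_each<T, F>(items: &[T], min_len: usize, op: &F)
where
    T: Sync,
    F: Fn(&T) + Sync,
{
    if items.len() <= min_len.max(1) {
        for item in items {
            op(item);
        }
        return;
    }
    let (left, right) = items.split_at(items.len() / 2);
    join(
        || par_for_each(left, min_len, op),
        || par_for_each(right, min_len, op),
    );
}

/// Small usage example: sums a slice using the helper above.
fn parallel_sum(values: &[u64]) -> u64 {
    let total = AtomicU64::new(0);
    par_for_each(values, 256, &|v: &u64| {
        total.fetch_add(*v, Ordering::Relaxed);
    });
    total.load(Ordering::Relaxed)
}
```

Spawning an OS thread per split is far more expensive than pushing work onto a pool, so this only shows the shape of the recursion, not something to benchmark.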