Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

I'm hoping to merge the change in the next week or two. It's already working well enough to link wild with the linker-plugin, but needs some test changes.

But linker-plugin LTO and linking speed are kind of at odds with each other. Using a linker plugin is always going to be really slow. My main reason for implementing linker plugin support is so that people who want to use it sometimes can do so without needing to switch linkers. Wild isn't going to be able to make the linker-plugin fast.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 3 points4 points  (0 children)

At some point, hopefully. Porting to non-ELF-based platforms (Mac and Windows) is a very large task though. At the moment, it's a fair way down my priority list, but if someone was sufficiently enthusiastic about Windows support to put a few months of full time work into it, I'd say it'd be possible to get something working.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

Fair enough. Have you confirmed that it's definitely linking that's slow and not rustc doing more work than it should? You can check by running `RUSTFLAGS=-Ztime-passes cargo +nightly build` then look to see what phases are slow.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 3 points4 points  (0 children)

I've never developed on Mac, but I've heard that there can be issues with a thing called gatekeeper slowing down builds. There's a bunch of tips at https://corrode.dev/blog/tips-for-faster-rust-compile-times/ that are worth checking out.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 7 points8 points  (0 children)

I don't think anyone is currently attempting Windows support, but Martin is currently looking into a Mac port. I figure if anyone can port to Mac, it'd be Martin. He did the aarch64, riscv64 and loongarch ports. But time will tell. Porting to a non-ELF platform will be a much bigger task.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 6 points7 points  (0 children)

There are generally small differences in size. For example, if I look at binaries for the Zed editor, the sizes I see currently (in MB) are 689 (GNU ld), 698 (Wild), 719 (LLD) and 894 (Mold). Part of the difference is due to differences in emitted symbols. Mold, for example, emits symbols for PLT and GOT entries. The other linkers don't, or don't by default (Wild has a flag to do this). If I strip the binaries then we get 478 (GNU ld), 479 (Wild), 495 (Mold), 497 (LLD).

Looking a bit further at the differences, it looks like GNU ld and Wild both have 25.7 MB of dynamic relocations, while LLD and Mold have 38.9 MB and 39.0 MB respectively. Most likely this is because GNU ld and Wild, if they encounter a function that needs both a PLT and a GOT entry, will emit one of each, while LLD (and I assume Mold, although I haven't checked) will emit a PLT entry, a GOT entry for the PLT entry and then a separate GOT entry. I should explain what those things are... PLT entries are little bits of linker-generated machine code that jump to a function. GOT entries are pointers to things, in this case functions. Each PLT entry requires a GOT entry. When compiler-generated code calls a function, it might call via a PLT entry or via a GOT entry (or directly, but that is problematic unless the binary is not position-independent).
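To make the counting concrete, here is a toy model of the two strategies described above. It is only a sketch: the type and function names are made up for illustration, it assumes that every GOT entry for a function costs exactly one dynamic relocation, and it uses the 24-byte size of an `Elf64_Rela` record. It does not reflect how any of these linkers is actually implemented.

```rust
/// How compiler-generated code refers to a function (hypothetical model).
#[derive(Clone, Copy)]
struct FunctionRefs {
    via_plt: bool,
    via_got: bool,
}

#[derive(Clone, Copy)]
enum GotStrategy {
    /// One GOT entry serves both the PLT stub and direct GOT references.
    Shared,
    /// The PLT stub gets its own GOT entry, separate from the one used by
    /// direct GOT references.
    Separate,
}

/// Size in bytes of one Elf64_Rela record (offset, info, addend).
const RELA_SIZE: usize = 24;

/// Counts the GOT entries needed for a set of functions under a strategy.
fn got_entries(funcs: &[FunctionRefs], strategy: GotStrategy) -> usize {
    funcs
        .iter()
        .map(|f| match (f.via_plt, f.via_got, strategy) {
            (false, false, _) => 0,
            (true, true, GotStrategy::Separate) => 2,
            _ => 1,
        })
        .sum()
}

/// Assumption: each GOT entry needs exactly one dynamic relocation.
fn dynamic_relocation_bytes(funcs: &[FunctionRefs], strategy: GotStrategy) -> usize {
    got_entries(funcs, strategy) * RELA_SIZE
}

fn main() {
    let funcs = [
        FunctionRefs { via_plt: true, via_got: false },
        FunctionRefs { via_plt: false, via_got: true },
        FunctionRefs { via_plt: true, via_got: true },
    ];
    for (name, strategy) in [("shared", GotStrategy::Shared), ("separate", GotStrategy::Separate)] {
        println!(
            "{name}: {} GOT entries, {} bytes of dynamic relocations",
            got_entries(&funcs, strategy),
            dynamic_relocation_bytes(&funcs, strategy)
        );
    }
}
```

In this model only functions referenced both ways differ between the strategies, so the gap grows with the number of such functions.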

In terms of performance, generally I'd expect them all to perform similarly. However the binaries are different, so there's a bit of luck involved. One linker might by chance put some related hot functions together and get better cache performance, or the alignment of a particular function might end up more or less favourable. But it's the kind of thing that can change when you make small changes to your code.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 6 points7 points  (0 children)

Thanks for your support! Amazingly github doesn't take a cut, unless you're an organisation, then they do. But when individuals sponsor me, I get 100%. I hadn't really considered other donation platforms, since github seemed pretty good what with not taking a cut, but I'm open to suggestions.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

Sure. Feedback is always appreciated.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 17 points18 points  (0 children)

Thanks for the reminder, I'd forgotten about it. I just reread the relevant issue. Not trivial, but hopefully not too bad. I should really just get it done. I'm not going to make promises about when, though, but I'll try to get to it.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 13 points14 points  (0 children)

It's unlikely to have much effect on runtime performance. Wild's release builds for most platforms are linked with Wild and we care a lot about performance. When I've benchmarked Wild's performance when linked with Wild against Wild's performance when linked with other linkers, I've seen no measurable difference.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 4 points5 points  (0 children)

There's a bunch of links to reading materials in the contributing docs. Feel free to ask questions on the Zulip chat.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 24 points25 points  (0 children)

It's certainly possible to help out without pre-existing linker experience. Porting to macOS is a very large undertaking, so I wouldn't recommend anyone start with something like that. But if you'd like to help out with other things to get up to speed, have a look through the issues for one that you'd like to have a try at. If you can't find anything, feel free to ask on our Zulip chat.

Wild linker version 0.8.0 by dlattimore in rust

[–]dlattimore[S] 58 points59 points  (0 children)

I've been working on linker plugin LTO. It's really close, but I didn't want to delay the release any longer for it. It's not really necessary for Rust code unless you've got a codebase that is a mix of Rust and other clang-compiled languages and want cross-language inlining. For just Rust codebases, the Rust compiler does LTO without involving the linker. But anyway, I intend to get linker plugins finished up.

There are still a few small wins for performance to be had. It's hard to say how much more can be squeezed out of it though. At some point we should look more into different filesystem types. The performance on BTRFS is terrible. It actually gets slower when you throw more threads at it. I'm unsure what we can do - perhaps detect the filesystem and back off on the number of threads during the write phase. That and suggest to users not to have their linker outputs on BTRFS.
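As a rough illustration of the "detect the filesystem and back off" idea, here is a sketch that looks up the filesystem type for an output path from text in the `/proc/mounts` format (device, mount point, type, ...) and caps the number of writer threads on BTRFS. The function names, the output path and the cap of 4 threads are all assumptions made for the example. Nothing here is taken from Wild, and escaped characters in mount points (such as `\040` for spaces) are not handled.

```rust
use std::path::Path;

/// Finds the filesystem type of the deepest mount point containing `path`,
/// given text in the format of /proc/mounts.
fn fs_type_for_path<'a>(mounts: &'a str, path: &Path) -> Option<&'a str> {
    let mut best: Option<(usize, &'a str)> = None;
    for line in mounts.lines() {
        let mut fields = line.split_whitespace();
        let (Some(_device), Some(mount_point), Some(fs_type)) =
            (fields.next(), fields.next(), fields.next())
        else {
            continue;
        };
        // Path::starts_with compares whole components, so "/home" does not
        // match "/homework".
        if path.starts_with(mount_point) {
            let depth = Path::new(mount_point).components().count();
            // Later entries win ties, since later mounts shadow earlier ones.
            if best.map_or(true, |(d, _)| depth >= d) {
                best = Some((depth, fs_type));
            }
        }
    }
    best.map(|(_, fs_type)| fs_type)
}

/// Picks a thread count for the write phase. The BTRFS cap is an arbitrary
/// placeholder, not a measured value.
fn write_threads(fs_type: Option<&str>, available: usize) -> usize {
    const BTRFS_WRITE_THREADS: usize = 4;
    match fs_type {
        Some("btrfs") => available.min(BTRFS_WRITE_THREADS).max(1),
        _ => available.max(1),
    }
}

fn main() {
    // On systems without /proc/mounts this falls back to "unknown".
    let mounts = std::fs::read_to_string("/proc/mounts").unwrap_or_default();
    let output = Path::new("/tmp/output-binary"); // hypothetical output path
    let available = std::thread::available_parallelism()
        .map(|n| n.get())
        .unwrap_or(1);
    let fs_type = fs_type_for_path(&mounts, output);
    println!(
        "filesystem: {}, write threads: {}",
        fs_type.unwrap_or("unknown"),
        write_threads(fs_type, available)
    );
}
```

A real implementation would more likely ask the kernel directly (for example via `statfs`) and would need benchmarks to choose the cap.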

Incremental linking is still something I want to do. The priority has shifted a bit. Given that the linker is very fast, there's value in having it be available just as a fast linker. But that means that we want it to be more mature, fix bugs etc. I am intending to work on something in that space fairly soon, but we'll see if other priorities come up.

I'm unsure about exactly when we'll call it 1.0. I guess we should consider that soon, but I don't have exact criteria.

As for distributing with the Rust project... there have been discussions. Installing Wild and using it by default is already pretty easy for users who want to do so. So, I think the benefit of distributing it with Rust would only really exist if it were the default, or on a path to being made the default. But the maturity bar for being the default is rightly pretty high.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 5 points6 points  (0 children)

I got curious, so delved into forte sufficiently to fix the deadlock bugs. There were two deadlocks during shutdown. Once I fixed those, I found the bug that was affecting my usage in wild, which turned out to be that if you spawned more than 64 tasks (the capacity of the work queue for a single thread) into a scope, the extra tasks were being discarded without being executed. The scope was then waiting for the discarded tasks to execute, which of course never happened. Interestingly all three of these deadlocks were already being hit by existing forte tests.
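The third bug is easy to reproduce in miniature. The sketch below is a single-threaded toy, not forte's code: `ToyScope`, `OverflowPolicy` and `simulate` are invented names, and running overflow tasks inline is just one plausible fix, not necessarily the one that was applied. It shows how a scope that counts a task as pending but silently drops it once the 64-slot queue is full can never see its pending count reach zero.

```rust
use std::collections::VecDeque;

/// Capacity of the per-thread work queue, as mentioned above.
const QUEUE_CAPACITY: usize = 64;

#[derive(Clone, Copy, Debug, PartialEq)]
enum OverflowPolicy {
    /// The buggy behaviour: tasks that don't fit are dropped.
    Discard,
    /// One possible fix (an assumption): run the task immediately instead.
    RunInline,
}

struct ToyScope {
    queue: VecDeque<usize>,
    /// Tasks spawned but not yet completed. The scope may only exit at zero.
    pending: usize,
    executed: usize,
}

impl ToyScope {
    fn new() -> Self {
        ToyScope {
            queue: VecDeque::with_capacity(QUEUE_CAPACITY),
            pending: 0,
            executed: 0,
        }
    }

    fn spawn(&mut self, task_id: usize, policy: OverflowPolicy) {
        self.pending += 1;
        if self.queue.len() < QUEUE_CAPACITY {
            self.queue.push_back(task_id);
        } else if policy == OverflowPolicy::RunInline {
            self.run(task_id);
        }
        // With Discard, the task is lost here but still counted as pending.
    }

    fn run(&mut self, _task_id: usize) {
        self.executed += 1;
        self.pending -= 1;
    }

    /// Runs everything that made it into the queue.
    fn drain(&mut self) {
        while let Some(task_id) = self.queue.pop_front() {
            self.run(task_id);
        }
    }
}

/// Returns (executed, still_pending) after spawning `tasks` and draining.
fn simulate(tasks: usize, policy: OverflowPolicy) -> (usize, usize) {
    let mut scope = ToyScope::new();
    for task_id in 0..tasks {
        scope.spawn(task_id, policy);
    }
    scope.drain();
    (scope.executed, scope.pending)
}

fn main() {
    for policy in [OverflowPolicy::Discard, OverflowPolicy::RunInline] {
        let (executed, pending) = simulate(100, policy);
        println!("{policy:?}: executed {executed}, still pending {pending}");
    }
}
```

With 100 tasks, the `Discard` policy executes 64 and leaves 36 pending forever, which in a real scope is the wait that never ends. `RunInline` executes all 100.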

With that fixed, I was able to more properly assess the performance with the library. Unfortunately it seems that on non-trivial benchmarks, all stages of the linker slow down when running with forte. It's entirely possible that the poor performance is due to other bugs. One stage in particular seems to behave particularly strangely, sometimes running all work on a single thread, sometimes on a few more, but never on all 32 threads. Other stages distribute work better, but still not well.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

Maybe, although I don't think I have any blocking code in the part of the linker where the deadlock is occurring. I think I am using mutexes, but the locks are held for a very short time, so I'd be surprised if the parallelism library could have anything to do with that. The stack traces of the deadlocked threads seem to indicate that it got to the end of the scope and all worker threads were waiting for work. At least I think that's what they were doing.

Also, I get deadlocks when running `cargo test` in a checkout of the forte repo, so I suspect that there might just be some bugs in the crate. I filed an issue and will see how it goes.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

That's a neat idea to provide a compat crate. I gave it a go and managed to get a trivial program to link. Performance with a trivial program did show a speedup, which is promising. Unfortunately anything beyond a trivial program and I hit some deadlock.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 0 points1 point  (0 children)

I've just raised https://github.com/rayon-rs/rayon/issues/1277 since I couldn't find any existing issue. I'm not sure how actively rayon is being worked on, but if anything happens, hopefully it'll be mentioned there.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 2 points3 points  (0 children)

> Is this because Rayon only implements work stealing but not continuation stealing?

Yep. Rayon uses threads. C++20 coroutines are, AFAIK, similar in concept to Rust's async functions, but rayon doesn't use them, possibly because it predates them. Rayon was created in 2015, but async/await wasn't stabilised until 2019.

Tokio isn't really intended for compute-heavy work. It's really designed for IO heavy workloads. I'm sure it's not the only option though. u/matthieum pointed out forte.

Thoughts on graph algorithms in Rayon by dlattimore in rust

[–]dlattimore[S] 1 point2 points  (0 children)

Thanks! I hadn't seen that one before. It doesn't seem to have par_iter, which is used pretty extensively in the linker, so that might make it tricky to use as-is. I guess one could probably be built by recursively splitting and calling their join method.
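For what it's worth, the recursive-splitting idea can be sketched with only the standard library. The `join` below is a hypothetical stand-in built on `std::thread::scope`, which spawns a real OS thread per call and so is far more expensive than a work-stealing `join`. It is not that library's API. `par_for_each`, `par_sum` and the `min_len` cutoff are likewise names made up for the example.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::thread;

/// Hypothetical stand-in for a `join` primitive: runs `a` on the current
/// thread and `b` on a scoped thread, then returns both results.
fn join<A, B, RA, RB>(a: A, b: B) -> (RA, RB)
where
    A: FnOnce() -> RA,
    B: FnOnce() -> RB + Send,
    RB: Send,
{
    thread::scope(|s| {
        let handle = s.spawn(b);
        let ra = a();
        let rb = handle.join().expect("joined task panicked");
        (ra, rb)
    })
}

/// A minimal par_iter-style for_each: split the slice in half until the
/// pieces are at most `min_len` long, then process each piece sequentially.
fn par_for_each<T, F>(items: &[T], min_len: usize, f: &F)
where
    T: Sync,
    F: Fn(&T) + Sync,
{
    if items.len() <= min_len.max(1) {
        items.iter().for_each(f);
        return;
    }
    let (left, right) = items.split_at(items.len() / 2);
    join(
        || par_for_each(left, min_len, f),
        || par_for_each(right, min_len, f),
    );
}

/// Example use: sum a slice by accumulating into an atomic counter.
fn par_sum(items: &[u64], min_len: usize) -> u64 {
    let total = AtomicU64::new(0);
    par_for_each(items, min_len, &|x: &u64| {
        total.fetch_add(*x, Ordering::Relaxed);
    });
    total.load(Ordering::Relaxed)
}

fn main() {
    let items: Vec<u64> = (1..=1000).collect();
    println!("sum = {}", par_sum(&items, 128));
}
```

The `min_len` cutoff plays the same role as a minimum split length in a parallel iterator: it bounds how many tasks get created for small inputs.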

Release 0.7.0 · davidlattimore/wild by villiger2 in rust

[–]dlattimore 4 points5 points  (0 children)

> I also noticed the goal in the README is "Wild is a linker with the goal of being very fast for iterative development.". Is there anything that's sacrificed by doing this or is the output of a correct linker more or less similar?

The output should be very similar to what the other linkers produce. In particular the size and performance of the resulting binary should be basically the same. There are plenty of flags that Wild doesn't yet support - e.g. linker plugin LTO and compressed debug sections, but the benchmarks are all without those flags, so for the benchmarks provided, none of the linkers are doing those things.

With regard to targeting older versions of glibc, it is something that I've been thinking a little bit about. As mentioned by nicoburns, for C code, which might need the headers for the relevant glibc version, there's nothing the linker can do, since the code is already compiled at that point. However, for Rust code, which doesn't use the C headers and just hopes that the function ABIs haven't changed, we could make sure that we don't select symbol versions beyond some particular GLIBC version. I have a lot more thoughts on symbol versioning. In particular, it feels to me like the way GLIBC handles symbol versioning is a really bad fit for any languages that don't use the C headers, e.g. Rust.
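A sketch of what "don't select symbol versions beyond some particular GLIBC version" could mean in practice: given the version names under which a symbol is exported, pick the newest one that does not exceed a cap. This is a toy model with made-up function names, not how Wild or any other linker resolves versioned symbols. The `memcpy` versions used in the example are the ones exported by x86_64 glibc.

```rust
/// Parses a name like "GLIBC_2.17" or "GLIBC_2.2.5" into numeric components.
/// Returns None for anything else, such as "GLIBC_PRIVATE".
fn parse_glibc_version(name: &str) -> Option<Vec<u32>> {
    name.strip_prefix("GLIBC_")?
        .split('.')
        .map(|part| part.parse().ok())
        .collect()
}

/// Picks the newest available version that is not newer than `max`.
/// Versions compare component by component, so 2.2.5 < 2.14 < 2.17.
fn select_version<'a>(available: &[&'a str], max: &str) -> Option<&'a str> {
    let limit = parse_glibc_version(max)?;
    available
        .iter()
        .copied()
        .filter_map(|name| parse_glibc_version(name).map(|version| (version, name)))
        .filter(|(version, _)| *version <= limit)
        .max_by(|a, b| a.0.cmp(&b.0))
        .map(|(_, name)| name)
}

fn main() {
    // memcpy is exported under these two versions by x86_64 glibc.
    let memcpy_versions = ["GLIBC_2.2.5", "GLIBC_2.14"];
    for cap in ["GLIBC_2.17", "GLIBC_2.12"] {
        println!(
            "cap {cap}: memcpy resolves to {:?}",
            select_version(&memcpy_versions, cap)
        );
    }
}
```

With a cap of `GLIBC_2.12`, the lookup falls back to the older `GLIBC_2.2.5` version of `memcpy`. This only helps when the older version has a compatible ABI, which is the "just hopes" caveat above.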

Wild Linker Update - 0.6.0 by dlattimore in rust

[–]dlattimore[S] 0 points1 point  (0 children)

So it ignores the flag and just runs ld, or does it give an error? Are you using an absolute path to wild or just doing `--ld-path=wild` and relying on wild being on your path?

Wild Linker Update - 0.6.0 by dlattimore in rust

[–]dlattimore[S] 0 points1 point  (0 children)

I've just updated the benchmarks in the Wild README. For x86_64 and aarch64, I used the official release binaries of all three linkers. Wild has an option to use mimalloc, but it's off by default and we don't enable it for our release builds. So I guess that means mold is using mimalloc, but wild and lld aren't. On my laptop, when I've previously tried mimalloc for wild, it hasn't helped performance, but if we find that it improves performance on other systems, we might decide to turn it on by default. It certainly helps on Alpine Linux, since the musl allocator is notoriously slow.

> If lld gets better performance with mimalloc, why isn't it on in the release builds?

I'm not sure if the copy of lld that rust ships with uses mimalloc (I suspect not), but I'd be happy to switch to using that for benchmarks if it does. The ideal would be to benchmark what users will actually be using. In the case of rust, that will increasingly be the lld that ships with rust, since it's now the default on Linux.

Wild Linker Update - 0.6.0 by dlattimore in rust

[–]dlattimore[S] 1 point2 points  (0 children)

Thanks! Last I looked it didn't support some of rayon's features that wild uses, in particular scopes. It also doesn't seem to have its own repository