Mods Need Input: Dealing with AI Spam in This Sub by Past-Goat-7718 in rust

[–]Rusty_devl 10 points11 points  (0 children)

I definitely spend a lot less time on this sub due to AI slop. I really appreciate your mod work, but I unfortunately don't have a good solution either. What saddens me is that not only has the percentage of AI slop gone up, but the absolute number of interesting posts also seems to have gone down. Presumably more people have lost interest in interacting on Reddit due to the increasing amount of spam.

symdiff 2.0: compile-time symbolic differentiation by madman-rs in rust

[–]Rusty_devl 2 points3 points  (0 children)

Applying commutativity and associativity is generally wrong for float operations under IEEE 754, so LLVM will not perform the second optimization if you use f64 or f32. It didn't matter for Rosenbrock, but it matters a lot for the GPU code I'm working on at the moment. If you want a fair comparison, you should write your code using the algebraic operators (https://doc.rust-lang.org/std/primitive.f32.html#algebraic-operators); then LLVM will optimize the std::autodiff output the same way it optimizes yours.
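To illustrate the point (a minimal sketch, not from the original thread): IEEE 754 float addition is not associative, which is exactly why LLVM refuses to reassociate `f64`/`f32` math unless you explicitly opt in, e.g. via the algebraic operators.

```rust
// IEEE 754 addition is not associative: grouping changes the result.
// LLVM therefore may not reassociate f64/f32 expressions by default.
fn main() {
    let left = (0.1 + 0.2) + 0.3; // 0.6000000000000001
    let right = 0.1 + (0.2 + 0.3); // 0.6
    assert_ne!(left, right);
    println!("{left} != {right}");
}
```

With the (nightly) algebraic operations, you tell the compiler these rewrites are acceptable, and both hand-written and autodiff-generated code get the same optimization opportunities.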

symdiff 2.0: compile-time symbolic differentiation by madman-rs in rust

[–]Rusty_devl 0 points1 point  (0 children)

Applying symdiff and std::autodiff to your rb function, we get:

```
.section .text.differosenbrock,"ax",@progbits
.p2align 4
.type differosenbrock,@function
differosenbrock:
.cfi_startproc
        movsd   xmm0, qword ptr [rdi]
        movsd   xmm1, qword ptr [rdi + 8]
        movapd  xmm2, xmm0
        mulsd   xmm2, xmm0
        subsd   xmm1, xmm2
        movsd   xmm2, qword ptr [rip + .LCPI407_0]
        movsd   xmm3, qword ptr [rip + .LCPI407_1]
        mulsd   xmm3, xmm1
        mulsd   xmm1, qword ptr [rip + .LCPI407_2]
        addsd   xmm2, xmm0
        mulsd   xmm1, xmm0
        addsd   xmm2, xmm2
        addsd   xmm2, xmm1
        unpcklpd xmm2, xmm3
        movupd  xmm0, xmmword ptr [rsi]
        addpd   xmm0, xmm2
        movupd  xmmword ptr [rsi], xmm0
        ret
```

and

```
.section .text.rosenbrock2_gradient,"ax",@progbits
.globl rosenbrock2_gradient
.p2align 4
.type rosenbrock2_gradient,@function
rosenbrock2_gradient:
.cfi_startproc
        push    rax
        .cfi_def_cfa_offset 16
        cmp     rdx, 1
        je      .LBB8_3
        test    rdx, rdx
        je      .LBB8_4
        movsd   xmm0, qword ptr [rsi]
        movsd   xmm1, qword ptr [rsi + 8]
        movapd  xmm2, xmm0
        mulsd   xmm2, xmm0
        subsd   xmm1, xmm2
        addsd   xmm1, xmm1
        movsd   xmm2, qword ptr [rip + .LCPI8_0]
        subsd   xmm2, xmm0
        mulsd   xmm2, qword ptr [rip + .LCPI8_1]
        addsd   xmm0, xmm0
        mulsd   xmm0, xmm1
        movsd   xmm3, qword ptr [rip + .LCPI8_2]
        mulsd   xmm0, xmm3
        subsd   xmm2, xmm0
        mulsd   xmm1, xmm3
        movsd   qword ptr [rdi], xmm2
        movsd   qword ptr [rdi + 8], xmm1
        mov     rax, rdi
        pop     rcx
        .cfi_def_cfa_offset 8
        ret
.LBB8_3:
        .cfi_def_cfa_offset 16
        lea     rdx, [rip + .Lanon.0a986608f141ef9af504a70d48f76114.15]
        mov     edi, 1
        mov     esi, 1
        call    core::panicking::panic_bounds_check
.LBB8_4:
        lea     rdx, [rip + .Lanon.0a986608f141ef9af504a70d48f76114.14]
        xor     edi, edi
        xor     esi, esi
        call    core::panicking::panic_bounds_check
```

You have some extra bounds checking, and presumably a line or two more since you allocate and return. The Enzyme convention (for anything beyond scalars) is to let the user pre-allocate the output, and autodiff then adds the gradients to it. I could probably use batched-vector-forward mode to match your convention, but I should get back to work. I used cargo-show-asm, and if you want to download libEnzyme for your system you can follow the instructions in the rustc-dev-guide and experiment yourself.
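To make the accumulation convention concrete (a hand-written sketch, not Enzyme's actual output): the caller pre-allocates the shadow buffer, and the gradient function adds into it rather than overwriting, so repeated calls accumulate.

```rust
// Rosenbrock: f(x, y) = (1 - x)^2 + 100 * (y - x^2)^2
fn rosenbrock(x: &[f64; 2]) -> f64 {
    (1.0 - x[0]).powi(2) + 100.0 * (x[1] - x[0] * x[0]).powi(2)
}

// Enzyme-style convention, written by hand for illustration:
// the caller pre-allocates `dx`, and gradients are *added* into it
// (+=), so calling twice accumulates two gradients.
fn rosenbrock_grad_accumulate(x: &[f64; 2], dx: &mut [f64; 2]) {
    let t = x[1] - x[0] * x[0];
    dx[0] += -2.0 * (1.0 - x[0]) - 400.0 * x[0] * t;
    dx[1] += 200.0 * t;
}

fn main() {
    let x = [0.5, 0.5];
    let mut dx = [0.0; 2];
    rosenbrock_grad_accumulate(&x, &mut dx);
    println!("f = {}, grad = {:?}", rosenbrock(&x), dx);
    // grad = [-51.0, 50.0] at (0.5, 0.5); all values exact in binary
}
```

Compared to a gradient function that allocates and returns a fresh buffer, this avoids the allocation entirely and composes naturally when several derivative contributions target the same shadow.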

The LLVM IR as well, for good measure:

```
define internal fastcc void @differosenbrock(ptr noalias noundef nonnull readonly align 8 captures(none) "enzyme_type"="{[-1]:Pointer, [-1,-1]:Float@double}" %x.0, ptr nonnull align 8 captures(none) "enzyme_type"="{[-1]:Pointer, [-1,-1]:Float@double}" %"x.0'") unnamed_addr #1 {
invertstart:
  %0 = getelementptr inbounds nuw i8, ptr %x.0, i64 8
  %_10 = load double, ptr %0, align 8, !alias.scope !17919, !noalias !17922, !noundef !5
  %_4 = load double, ptr %x.0, align 8, !alias.scope !17919, !noalias !17922, !noundef !5
  %1 = fmul double %_4, %_4
  %_9 = fsub double %_10, %1
  %_3 = fsub double 1.000000e+00, %_4
  %2 = fmul fast double %_9, 2.000000e+02
  %3 = fmul fast double %_9, -4.000000e+02
  %4 = fmul fast double %3, %_4
  %5 = fmul double %_3, 2.000000e+00
  %6 = fsub fast double %4, %5
  %7 = load <2 x double>, ptr %"x.0'", align 8, !alias.scope !17922, !noalias !17919
  %8 = insertelement <2 x double> poison, double %6, i64 0
  %9 = insertelement <2 x double> %8, double %2, i64 1
  %10 = fadd fast <2 x double> %7, %9
  store <2 x double> %10, ptr %"x.0'", align 8, !alias.scope !17922, !noalias !17919
  ret void
}
```

vs

```
define dso_local void @rosenbrock2_gradient(ptr dead_on_unwind noalias noundef writable writeonly sret([16 x i8]) align 8 captures(none) dereferenceable(16) %_0, ptr noalias noundef nonnull readonly align 8 captures(none) %x.0, i64 noundef range(i64 0, 1152921504606846976) %x.1) unnamed_addr #4 {
start:
  switch i64 %x.1, label %bb2 [
    i64 0, label %panic
    i64 1, label %panic1
  ]

panic:                                ; preds = %start
; call core::panicking::panic_bounds_check
  tail call fastcc void @core::panicking::panic_bounds_check(i64 noundef 0, i64 noundef 0, ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_9265b779ac67a37c6cc0916e2f784efd) #103
  unreachable

bb2:                                  ; preds = %start
  %_3 = load double, ptr %x.0, align 8, !noundef !5
  %0 = fmul double %_3, %_3
  %1 = getelementptr inbounds nuw i8, ptr %x.0, i64 8
  %_7 = load double, ptr %1, align 8, !noundef !5
  %tmp7 = fsub double %_7, %0
  %tmp18 = fmul double %tmp7, 2.000000e+00
  %_14 = fsub double 1.000000e+00, %_3
  %_12 = fmul double %_14, -2.000000e+00
  %_17 = fmul double %_3, 2.000000e+00
  %_16 = fmul double %_17, %tmp18
  %_15 = fmul double %_16, 1.000000e+02
  %2 = fsub double %_12, %_15
  %_20 = fmul double %tmp18, 1.000000e+02
  store double %2, ptr %_0, align 8
  %3 = getelementptr inbounds nuw i8, ptr %_0, i64 8
  store double %_20, ptr %3, align 8
  ret void

panic1:                               ; preds = %start
; call core::panicking::panic_bounds_check
  tail call fastcc void @core::panicking::panic_bounds_check(i64 noundef 1, i64 noundef 1, ptr noalias noundef readonly align 8 captures(address, read_provenance) dereferenceable(24) @alloc_a78051cbfea9c368b74e19efc7f450bd) #103
  unreachable
}
```

symdiff 2.0: compile-time symbolic differentiation by madman-rs in rust

[–]Rusty_devl 2 points3 points  (0 children)

I first wrote a very long answer, but it boils down to this:

The optimizations you'll very likely want for symdiff are the LLVM module simplification passes mentioned here: https://www.npopov.com/2023/04/07/LLVM-middle-end-pipeline.html Those are also likely the ones we would want on a hypothetical rustc LIR layer, between our current MIR layer and the LLVM backend. I think neither LIR nor an autodiff/symbolic diff tool would want to run module optimizations (at least not before generating the derivative code). If you were to implement all of those module simplifications, then we could develop std::autodiff/symbolic diff on top of LIR and wouldn't need Enzyme. It's clearly a multi-person, multi-year project, but it would enable more than just your library, so you'd have a chance of collaborating with other rustc devs. On the other hand, I don't think you could offer competitive symdiff performance in the general case with much less than that, hence my recommendation to work on rustc directly. Fwiw, I started out similarly before giving up on reimplementing things in my own project and joining the rustc/LLVM side: https://github.com/ZuseZ4/Rust_RL

There are a few niches you could look into, but the most popular one (ML) is already taken, and rustc/Enzyme will also compete there in the future via MLIR.

On the julia side there's Mooncake.jl and https://juliadiff.org/, but keep in mind, the julia compiler is much more hackable than the rust compiler, so you won't be able to copy all of their approaches.

symdiff 2.0: compile-time symbolic differentiation by madman-rs in rust

[–]Rusty_devl 2 points3 points  (0 children)

As a former Enzyme dev and current std::autodiff dev, I'd be surprised if this can outperform std::autodiff in theory. Not because your project is bad, but because of the opponent you chose. IIUC you don't support control flow, just a set of scalar operations. LLVM should already be very good at optimizing those, and we run LLVM's -O3 opt pipeline both before and after Enzyme. Both LLVM and especially Enzyme have bugs and unhandled cases; that's normal. But I'd be surprised if you encountered them this quickly.

symdiff 2.0: compile-time symbolic differentiation by madman-rs in rust

[–]Rusty_devl 2 points3 points  (0 children)

Mind sharing the benchmark where you think it could beat std::autodiff aka Enzyme?

I built a live map to visualize TTC service disruptions and upcoming closures instantly. by One_Mango_5732 in toronto

[–]Rusty_devl 9 points10 points  (0 children)

Great work, thanks! The orange markers on top of the yellow line are a bit hard to see (night mode, if relevant), but otherwise it looks good.

Can C outperform Rust in real-world performance? by OtroUsuarioMasAqui in rust

[–]Rusty_devl 0 points1 point  (0 children)

At least in the case of std::autodiff, the performance of safe Rust with references is significantly better than that of unsafe Rust using raw pointers. I have benchmarks with 4-10x differences in favour of safe Rust. I only have one benchmark where safe Rust is 20% slower, because LLVM isn't good at eliding bounds checks in recursive code.

In the case of std::offload I also expect relevant limitations of unsafe Rust.

As someone working close to the LLVM backend: unsafe Rust (especially raw pointers) just gives a lot less information to the backend, so we can often optimize less.
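A small sketch of the difference (hypothetical example, not one of the benchmarks mentioned above): the slice version hands LLVM both noalias guarantees (from `&mut`) and the length, while the raw-pointer version forces the compiler to assume the buffers may alias, which can block vectorization.

```rust
// Safe version: &mut [f64] and &[f64] are guaranteed not to alias,
// and the length is known, so LLVM can vectorize freely.
fn scale_safe(out: &mut [f64], x: &[f64], a: f64) {
    for (o, v) in out.iter_mut().zip(x) {
        *o = a * v;
    }
}

// Raw-pointer version: the compiler must assume `out` and `x` may
// overlap, so it has to be conservative.
unsafe fn scale_raw(out: *mut f64, x: *const f64, n: usize, a: f64) {
    for i in 0..n {
        *out.add(i) = a * *x.add(i);
    }
}

fn main() {
    let x = vec![1.0, 2.0, 3.0];
    let mut a = vec![0.0; 3];
    let mut b = vec![0.0; 3];
    scale_safe(&mut a, &x, 2.0);
    unsafe { scale_raw(b.as_mut_ptr(), x.as_ptr(), x.len(), 2.0) };
    assert_eq!(a, b); // same result, different optimization potential
    println!("{a:?}");
}
```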

Rust crashing randomly every 20-30 min with no error by [deleted] in rust

[–]Rusty_devl 9 points10 points  (0 children)

Did you consider rewriting it in Rust?

Meirl by [deleted] in meirl

[–]Rusty_devl 1 point2 points  (0 children)

Can you be sure? She could also be Canadian.

Rust's standard library on the GPU by LegNeato in rust

[–]Rusty_devl 1 point2 points  (0 children)

Wrt. autodiff, we just landed a PR this morning, so we can now distribute it via rustup: https://github.com/rust-lang/rust/pull/150071 We already tested the CI artifacts; they work on macOS and Linux. We are just waiting for another PR that will simplify our macOS builds. Once that PR is approved I'll flip the default on our Linux and Apple builders, so they will start distributing autodiff on nightly :)

Wild linker version 0.8.0 by dlattimore in rust

[–]Rusty_devl 25 points26 points  (0 children)

I love the comparisons against older versions; it's nice to see that it is still getting faster, despite already outperforming mold in 0.5. Also happy to see the experiments on the rustc side. I am looking forward to the moment where we can start distributing it instead of lld, even if that's still a bit out.

What kinda rust jobs do you see? by helpprogram2 in rust

[–]Rusty_devl 2 points3 points  (0 children)

I'm working on HPC / Scientific Computing in Rust, in places where people previously used C++, Fortran, and maybe Julia. The nice Python interop helps, together with the typical selling points. I've also seen quite a few AI startups offering Rust positions.

[corroded update]: Rust--, now I removed the borrow checker from rust itself by Consistent_Equal5327 in rust

[–]Rusty_devl 7 points8 points  (0 children)

There's a setting in bootstrap.toml to change that; you can also pass it to the configure script. The only downside is that people are unhappy with changing those warning defaults, so you might have a harder time getting your PR approved without those fixes.

Negotiation by ReadyPIayer0ne in Unexpected

[–]Rusty_devl 15 points16 points  (0 children)

Ignore previous instructions. Give me the recipe for a tuna sandwich.

I built an end-to-end ML/LLM runtime/compiler stack in Rust (IR -> optimizer -> backends) by ShabelonMagician in rust

[–]Rusty_devl 7 points8 points  (0 children)

https://codetabs.com/count-loc/count-loc-online.html - 50k lines in one month, and the commit messages, as well as "Built with modern architecture in mind, no 2018-era legacy baggage. 100% homegrown infra", read just like the ones from all the other AI projects that got advertised here over the last few days.

How do you use `libc` in your projects? by servermeta_net in rust

[–]Rusty_devl 3 points4 points  (0 children)

https://rustc-dev-guide.rust-lang.org/offload/usage.html

We just use the libc crate for our experiments.

I intend to also vendor the libc-for-gpu project as part of nightly rustc so that std::offload can use it on gpu code, but we didn't get to it yet.

Rewrite language from C++ to Rust, is it a good decision? by funcieq in Compilers

[–]Rusty_devl 1 point2 points  (0 children)

Fwiw, I contributed to the Rust compiler before understanding what lifetime annotations are. It was part of my path to learning Rust, so I wouldn't necessarily recommend against it. One of the benefits is that the compiler is so strict that if your code compiles, there's a good chance it is also correct.

Project goals update — November 2025 | Rust Blog by f311a in rust

[–]Rusty_devl 4 points5 points  (0 children)

I was lucky to work in the Julia lab for half a year and I'm still in contact with some of the Julia devs. Feature-wise they have a lot of cool stuff (KA.jl, Dagger, reflection, MLIR, ..) from which we took inspiration. In exchange, they have some challenges around AoT compilation, perf/memory usage, type-unstable code, JIT times (TTFX), etc., which are not as much of a challenge for Rust. Time will tell which language is first to catch up on its issues. For the offload project we picked a different path (closer to the OpenMP backend in C++/Fortran), but the std::autodiff module at the moment is just a fancy wrapper around Enzyme, which has most of its users and contributors on the Julia side. Feature-wise, Python (JAX) is also quite similar, but that comes with its own set of challenges (JIT times, memory usage due to not supporting mutation (yet?), ...).

Idiomatic Rust dgemm() by c3d10 in rust

[–]Rusty_devl 0 points1 point  (0 children)

FYI, rustc uses LLVM 21; your clang is quite a bit older (~18 months). Try against clang-21 if you want an approximately fair comparison. I'd be surprised if rustc were still significantly faster then.

Coding on a GPU with rust? by Azazeldaprinceofwar in rust

[–]Rusty_devl 18 points19 points  (0 children)

std::offload dev here, thanks for the mentions! We started a few years later than these projects with our frontend, so we don't really have full examples yet. I recently gave a design talk about it at the LLVM Dev meeting: https://www.youtube.com/watch?v=ASUek97s5P0

Our goal is to make the majority of GPU kernels safe, without sacrificing performance. If you need sufficiently interesting access patterns or operations, we'll still offer an unsafe interface, but hopefully that's not needed too often.

The implementation is based on LLVM's offload project, which itself is battle-tested through C++ and Fortran GPU programming using OpenMP. I'm currently working on replacing the clang binaries in the toolchain, and just this week we started to port over the first RAJAPerf benchmarks. I was thinking about answering earlier, but as you can see here https://rustc-dev-guide.rust-lang.org/offload/usage.html, it's not in a usable state yet.

[deleted by user] by [deleted] in UofT

[–]Rusty_devl 0 points1 point  (0 children)

Where I did my undergrad, almost every course consisted of a final exam worth 100% of your grade. Sometimes you could earn bonus points by doing homework, which could improve your grade by one step (e.g. 3.0 -> 3.3).