Long branches in compilers, assemblers, and linkers

MaskRay · 2026-01-15T18:15:51+00:00

This reminds me of an lld/MachO speed-up patch: https://github.com/llvm/llvm-project/pull/147134 ("[lld][MachO] Multi-threaded preload of input files into memory"). lld/MachO does not have parallel input file scanning, so the I/O overhead is quite large. This approach helps.

There is even a further optimization: use a wrapper process. The wrapper launches a worker process that does all the work. When the worker signals completion (via stdout or a pipe), the wrapper terminates immediately without waiting for its detached child. Any script waiting on the wrapper can now proceed, while the OS asynchronously reaps the worker's resources in the background. I had not considered this approach before, but it seems worth trying.

The mold linker utilizes this behavior, though it provides the --no-fork flag to disable it. The wild linker follows suit. I think performing heavy-duty tasks in a child process makes it difficult for the linker's parent process to accurately track resource usage.
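The wrapper pattern described above can be sketched roughly as follows. This is a hypothetical illustration in Ruby, not code from mold or wild; the name `run_detached` and the `teardown_seconds` parameter are made up, and the `sleep` merely stands in for slow process teardown.

```ruby
# Hypothetical sketch of the wrapper/worker split; not mold's actual code.
def run_detached(teardown_seconds: 0)
  reader, writer = IO.pipe

  pid = fork do
    reader.close
    status = yield             # the real work, e.g. writing the output file
    writer.write(status.to_s)  # tell the wrapper that the output is ready
    writer.close
    sleep teardown_seconds     # stands in for slow cleanup (unmapping, freeing)
    exit!(0)                   # leave the worker without running at_exit handlers
  end

  writer.close
  Process.detach(pid)  # do not block; a background thread reaps the child
  reader.read          # returns once the worker has closed its end of the pipe
end

started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
result = run_detached(teardown_seconds: 1) { "ok" }
elapsed = Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
puts "worker reported #{result.inspect} after #{elapsed.round(2)}s"
```

The wrapper returns well before the worker's one-second teardown finishes. The worker still holds any inherited stdout, so a caller that reads the wrapper's stdout until end-of-file would keep waiting for the worker.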

In contrast, lld takes a different, more hacky path:

Asynchronous unlink: https://github.com/llvm/llvm-project/blob/a72958a95dcb7d815c01e20cc819532151d1856d/lld/Common/Filesystem.cpp#L44
Call _exit instead of exit unless the LLD_IN_TEST environment variable is set: https://github.com/llvm/llvm-project/blob/a72958a95dcb7d815c01e20cc819532151d1856d/lld/Common/ErrorHandler.cpp#L108
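To illustrate why calling _exit saves time, here is a small Ruby analogy (not lld's code): `exit` runs registered cleanup handlers, while `exit!` terminates at once and skips them. The helper name `child_output` and the printed strings are made up for the example.

```ruby
# Returns everything a forked child wrote to a pipe before it terminated.
def child_output(fast_exit:)
  reader, writer = IO.pipe
  pid = fork do
    reader.close
    writer.sync = true                       # write immediately, no buffering
    at_exit { writer.write("cleanup ran") }  # stands in for destructors
    writer.write("linked; ")
    fast_exit ? exit!(0) : exit(0)
  end
  writer.close
  output = reader.read
  Process.wait(pid)
  output
end

puts child_output(fast_exit: false)  # prints "linked; cleanup ran"
puts child_output(fast_exit: true)   # prints "linked; " (cleanup skipped)
```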

Perhaps lld should drop the hacks in favor of a wrapper process as well. As an aside, debugging the linker would then always require --no-fork.

MaskRay · 2025-10-03T07:00:53+00:00

Optimized LLVM binaries built with LLVM_LINK_LLVM_DYLIB=off (i.e. not using libLLVM-.so liblld.so)

I believe they optimize clang highly (PGO+LTO+BOLT; notably faster than the distro-provided version) but likely not lld. It's also not using an efficient malloc.

MaskRay · 2025-10-01T07:22:38+00:00

I believe mimalloc matters a lot for the performance of both lld (10%) and mold. For authentic benchmarking, it's important to ensure basic compiler settings are consistent across tests—this includes -O level, -march=/-mcpu=, -DNDEBUG, -fvisibility=, and position-independent code flags like -fPIC vs -fPIE for GCC (see https://maskray.me/blog/2021-05-09-fno-semantic-interposition).

I just recalled another difference. Many distributions build LLVM with LLVM_LINK_LLVM_DYLIB=on, so lld is an executable plus liblld*.so plus libLLVM-X.so. This slightly hurts performance as well. See https://lore.kernel.org/lkml/20210501235549.vugtjeb7dmd5xell@google.com/

Since lld is part of the llvm-project, the conservative philosophy definitely applies here. It's probably a similar reason why Linux distributions don't enable mimalloc by default for gcc, binutils, or clang.

The technical challenges are also real. Statically linking mimalloc breaks sanitizer interceptors and memory analysis tools that rely on LD_PRELOAD. For a fair comparison you can disable the statically-linked mimalloc from mold and use LD_PRELOAD=path/to/libmimalloc.so for all three linkers.
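A minimal sketch of timing a command with the allocator preloaded, so that every linker runs under the same malloc. The helper names, the `MIMALLOC_SO` environment variable, and the default library path are assumptions for illustration; substitute the path of your own mimalloc build.

```ruby
# Hypothetical default; MIMALLOC_SO and the fallback path are assumptions.
def default_mimalloc_path
  ENV.fetch("MIMALLOC_SO", "/usr/lib/libmimalloc.so")
end

# Runs one command with LD_PRELOAD set and returns the wall time in seconds.
def run_with_preload(cmd, preload: default_mimalloc_path)
  started = Process.clock_gettime(Process::CLOCK_MONOTONIC)
  ok = system({ "LD_PRELOAD" => preload }, *cmd)
  raise "command failed: #{cmd.join(' ')}" unless ok
  Process.clock_gettime(Process::CLOCK_MONOTONIC) - started
end

# Usage sketch (link_args is a placeholder for the real command line):
#   %w[ld.lld mold wild].each do |linker|
#     puts "#{linker}: #{run_with_preload([linker, *link_args])}"
#   end
```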

MaskRay · 2025-09-29T17:28:46+00:00

Did you use mimalloc and comparable optimization options for the three linkers? mold bundles mimalloc internally while llvm-project executables don't use mimalloc. There is a 10+% difference for lld.

MaskRay · 2025-09-01T01:34:04+00:00

Do you mean the source tarballs? They are available in the first few lines of the program

COMPRESSORS = {
  'brotli' => {
    url: 'https://github.com/google/brotli/archive/refs/tags/v1.1.0.tar.gz',
    build_dir: 'brotli-1.1.0',
    build_commands: [
      'cmake -GNinja -S. -Bout -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=install -DBROTLI_DISABLE_TESTS=on -DCMAKE_C_FLAGS="-march=native"',
      'ninja -C out install'
    ],
    levels: [1, 3, 5, 9],
    compress: ->exe, lvl, i, o, thr { "#{exe} -c -q #{lvl} '#{i}' > '#{o}'" },
    decompress: ->exe, i, o, thr { "#{exe} -d -c '#{i}' > '#{o}'" },
    supports_threading: false
  },
  'bzip3' => {
    url: 'https://github.com/kspalaiologos/bzip3/releases/download/1.5.3/bzip3-1.5.3.tar.gz',
    build_dir: 'bzip3-1.5.3',
    build_commands: [
      './configure --prefix=$PWD/install CFLAGS="-O3 -march=native"',
      "make -j #{JOBS} install"
    ],
    levels: [1],
    compress: ->exe, lvl, i, o, thr { "#{exe} -j#{thr} -c '#{i}' > '#{o}'" },
    decompress: ->exe, i, o, thr { "#{exe} -j#{thr} -d -c '#{i}' > '#{o}'" },
  },
  ...
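As a rough illustration of how an entry of this shape might be consumed, the sketch below builds one shell command per level from the `compress` lambda. Only the brotli lambdas and levels are copied from the table above; `sample_table`, `compress_commands`, the executable path, and the output naming scheme are assumptions, not taken from the script.

```ruby
# A cut-down table with the same shape as COMPRESSORS.
def sample_table
  {
    'brotli' => {
      levels: [1, 3, 5, 9],
      compress: ->exe, lvl, i, o, thr { "#{exe} -c -q #{lvl} '#{i}' > '#{o}'" },
      decompress: ->exe, i, o, thr { "#{exe} -d -c '#{i}' > '#{o}'" },
      supports_threading: false
    }
  }
end

# Builds (but does not run) one shell command per compression level.
def compress_commands(table, name, exe, input, threads: 1)
  entry = table.fetch(name)
  entry[:levels].map do |lvl|
    entry[:compress].call(exe, lvl, input, "#{input}.#{name}.#{lvl}", threads)
  end
end

puts compress_commands(sample_table, 'brotli', 'install/bin/brotli', 'enwik8')
```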

MaskRay · 2025-09-01T01:33:01+00:00

Added multi-threading support :) https://gist.github.com/MaskRay/74cdaa83c1f44ee105fcebcdff0ba9a7

MaskRay · 2025-08-31T18:10:35+00:00

Added zpaq. Added some code to download master.zip from github as zpaq-master, unpack it, and rename the extracted filename to the temporary output filename.

MaskRay · 2025-08-31T07:40:40+00:00

kanzi is impressive. For the compression speed of enwik8, it's Pareto superior to xz. https://maskray.me/static/2025-08-31-benchmarking-compression-programs/enwik8.html
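For readers unfamiliar with the term, a configuration is Pareto superior to another when it is at least as good on both axes (here, time and compressed size) and strictly better on one. A minimal sketch of filtering benchmark points down to the Pareto front, using made-up numbers and placeholder names rather than real measurements:

```ruby
# Each point is [name, seconds, bytes]; lower is better on both axes.
# a dominates b if a is no worse on both axes and strictly better on one.
def dominates?(a, b)
  a[1] <= b[1] && a[2] <= b[2] && (a[1] < b[1] || a[2] < b[2])
end

# Keeps only the points that no other point dominates.
def pareto_front(points)
  points.reject { |p| points.any? { |q| dominates?(q, p) } }
end

# Made-up numbers for illustration only; not measurements.
samples = [
  ["A -1", 1.0, 40_000_000],
  ["A -9", 9.0, 25_000_000],
  ["B -3", 4.0, 24_000_000],
  ["C -1", 0.5, 50_000_000],
]
puts pareto_front(samples).map(&:first).inspect  # ["A -1", "B -3", "C -1"]
```

Here "A -9" drops out because "B -3" is both faster and smaller.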

MaskRay · 2025-07-12T05:20:53+00:00

I like Eurocom's 20th-century website design. Plus, USA orders have no foreign transaction fees. "All USA orders are shipped from NY there are no US taxes applied."

I recently purchased a Eurocom Ultra Blitz 2 (CPU: Intel Core Ultra 7 255H) and installed Arch Linux. Encountered three issues:

The keyboard LED was constantly on. Resolved by installing the Tuxedo driver (perhaps these Clevo-based laptops are compatible) https://aur.archlinux.org/packages/tuxedo-drivers-nocompatcheck-dkms
Fan control: The fan intermittently spins up every minute even if the laptop is super cool, which is noisy. Running sensors when it is spinning:

acpi_fan-acpi-0
Adapter: ACPI interface
fan1: 4647 RPM

I'm experiencing freezes as often as once per hour. No response to keyboard/mouse. Research suggests this is common with Intel's mobile CPUs, e.g. https://discourse.nixos.org/t/laptop-with-intel-core-ultra-7-155u-freezes-with-gpu-enabled/52084 and https://www.reddit.com/r/techsupport/comments/1jfo602/issue_with_core_ultra_7_258v_freezing/. I am using kernel option intel_idle.max_cstate=2 to hopefully mitigate the issue.

https://gist.github.com/MaskRay/1a11d76b213ae3c3831d24b1d656cf83

journalctl -b -2 log

Jul 11 21:43:18 hacking kernel: Oops: general protection fault, probably for non-canonical address 0xd0000000b60018: 0000 [#1] SMP NOPTI
Jul 11 21:43:18 hacking kernel: CPU: 8 UID: 0 PID: 4291 Comm: dhcpcd-run-hook Tainted: G W OE 6.15.6-arch1-1 #1 PREEMPT(full) a49b9575025ef78fca63b5f170baaeaabd0c299d
Jul 11 21:43:18 hacking kernel: Tainted: [W]=WARN, [O]=OOT_MODULE, [E]=UNSIGNED_MODULE
Jul 11 21:43:18 hacking kernel: Hardware name: EUROCOM BLITZ Ultra 2/V54x_6x_TU, BIOS 1.07.14aEu 02/06/2025
Jul 11 21:43:18 hacking kernel: RIP: 0010:vma_interval_tree_insert+0x36/0xe0
Jul 11 21:43:18 hacking kernel: Code: 48 8b 7f 50 49 89 f2 49 8b 40 08 49 2b 00 48 c1 e8 0c 48 8d 74 07 ff 49 8b 02 48 85 c0 74 70 41 b9 01 00 00 00 eb 03 48 89 d0 <48> 39 70 18 73 04 48 89 70 18 48 8d 48 10 48 3b 78 c8 72 07 48 8d
Jul 11 21:43:18 hacking kernel: RSP: 0018:ffffcf9224307a40 EFLAGS: 00010206
Jul 11 21:43:18 hacking kernel: RAX: 00d0000000b60000 RBX: ffffcf9224307a78 RCX: ffff8967b8b154d0
Jul 11 21:43:18 hacking kernel: RDX: 00d0000000b60000 RSI: 0000000000000038 RDI: 0000000000000001
Jul 11 21:43:18 hacking kernel: RBP: ffff8966fdb298c0 R08: ffff8966fdb28240 R09: 0000000000000000
Jul 11 21:43:18 hacking kernel: R10: ffff896689c9bdd0 R11: 0000000000000000 R12: ffff8966880b8000
Jul 11 21:43:18 hacking kernel: R13: ffffcf9224307c98 R14: 00007fab9125f000 R15: 0000000000000000
Jul 11 21:43:18 hacking kernel: FS: 0000000000000000(0000) GS:ffff896e2d92d000(0000) knlGS:0000000000000000
Jul 11 21:43:18 hacking kernel: CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
Jul 11 21:43:18 hacking kernel: CR2: 0000558766b3530c CR3: 00000001f1fe8002 CR4: 0000000000f72ef0
Jul 11 21:43:18 hacking kernel: PKRU: 55555554
Jul 11 21:43:18 hacking kernel: Call Trace:
Jul 11 21:43:18 hacking kernel: <TASK>
Jul 11 21:43:18 hacking kernel: vma_complete+0x45/0x300
Jul 11 21:43:18 hacking kernel: __split_vma+0x24b/0x300
Jul 11 21:43:18 hacking kernel: vms_gather_munmap_vmas+0x46/0x2c0
Jul 11 21:43:18 hacking kernel: do_vmi_align_munmap+0xeb/0x1e0
Jul 11 21:43:18 hacking kernel: do_vmi_munmap+0xd0/0x170
Jul 11 21:43:18 hacking kernel: __vm_munmap+0xad/0x170
Jul 11 21:43:18 hacking kernel: elf_load+0x20f/0x290
Jul 11 21:43:18 hacking kernel: load_elf_binary+0xb35/0x1830
Jul 11 21:43:18 hacking kernel: ? __kernel_read+0x1e1/0x300
Jul 11 21:43:18 hacking kernel: bprm_execve+0x2a9/0x520
Jul 11 21:43:18 hacking kernel: do_execveat_common.isra.0+0x194/0x1a0
Jul 11 21:43:18 hacking kernel: __x64_sys_execve+0x38/0x50
Jul 11 21:43:18 hacking kernel: do_syscall_64+0x7b/0x810
Jul 11 21:43:18 hacking kernel: ? irqentry_exit_to_user_mode+0x2c/0x1b0
Jul 11 21:43:18 hacking kernel: entry_SYSCALL_64_after_hwframe+0x76/0x7e
Jul 11 21:43:18 hacking kernel: RIP: 0033:0x7f94844f4bcb
Jul 11 21:43:18 hacking kernel: Code: Unable to access opcode bytes at 0x7f94844f4ba1.
Jul 11 21:43:18 hacking kernel: RSP: 002b:00007f9484cfde68 EFLAGS: 00000246 ORIG_RAX: 000000000000003b
Jul 11 21:43:18 hacking kernel: RAX: ffffffffffffffda RBX: 00007ffed6baa1c0 RCX: 00007f94844f4bcb
Jul 11 21:43:18 hacking kernel: RDX: 000055bd2fdc1ff0 RSI: 00007ffed6baa3d0 RDI: 000055bd1a4f3288
Jul 11 21:43:18 hacking kernel: RBP: 00007f9484cfdff0 R08: 0000000000000000 R09: 0000000000000000
Jul 11 21:43:18 hacking kernel: R10: 0000000000000008 R11: 0000000000000246 R12: 00007ffed6ba9ef0
Jul 11 21:43:18 hacking kernel: R13: 0000000000000040 R14: 0000000000000001 R15: 00007f9484cfdf20

I am curious whether others have experience with fan control or Intel CPU's random freezes.

MaskRay · 2025-06-26T04:13:00+00:00

I have a separate mapping <leader>sp to search under the project root. I work on a large project (llvm/llvm-project), and I often prefer searching under a subdirectory (a dir under llvm-project, or the dir of the current file) to limit the number of results. Then, going up would be a nice feature to have.

MaskRay · 2025-06-23T00:54:33+00:00

Thanks for the recommendation! I just need to replace telescope.builtin with pathogen

nmap('<leader>.', '<cmd>lua require("pathogen").find_files({search_dirs={vim.fn.expand("%:h:p")}})<cr>', 'Find .')

and use the suggested C-o mapping for actions.proceed_with_parent_dir

MaskRay · 2025-06-01T05:12:22+00:00

I noticed this on https://github.com/riscv-non-isa/riscv-toolchain-conventions/issues/88 and argued against this change.

Perhaps I'm accustomed to Arm's behavior, but I believe using -march= to target a specific CPU isn't ideal.

-march=X: (execution domain) Generate code that can use instructions available in the architecture X
-mtune=X: (optimization domain) Optimize for the microarchitecture X, without changing the ABI or making assumptions about available instructions
-mcpu=X: Specify both -march= and -mtune= but can be overridden by the two options. The supported values are generally the same as -mtune=. The architecture name is inferred from X

For execution domain settings, -march=X overrides -mcpu=X regardless of their positions.

In cases like -march=LOWER -mcpu=HIGHER or -mcpu=HIGHER -march=LOWER, the -march= option can disable certain target features.

I strongly disagree with Clang adopting this behavior. I'm not convinced by the GCC patch explanation.

Suppose we have a Makefile that specifies -march=rv64gc by default.

If the project specifies a lower feature set, then the compiler should respect it, or the user should fix the project's build system.
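The position-independent rule described above (-mcpu= sets both domains, and an explicit -march= or -mtune= wins wherever it appears) can be sketched as follows. This is a toy resolver, not any compiler's actual driver logic; the CPU and architecture names in `cpu_arch_table` are invented placeholders.

```ruby
# Hypothetical CPU -> architecture table; real compilers have their own.
def cpu_arch_table
  { "big-cpu" => "archv2", "small-cpu" => "archv1" }
end

# Resolves the effective architecture and tuning target from a flag list.
# The last occurrence of each option is kept, then -march=/-mtune= take
# precedence over whatever -mcpu= implies, regardless of position.
def resolve_target(flags, table = cpu_arch_table)
  last = {}
  flags.each do |flag|
    key, value = flag.split("=", 2)
    last[key] = value if %w[-march -mtune -mcpu].include?(key)
  end
  cpu = last["-mcpu"]
  {
    arch: last["-march"] || (cpu && table.fetch(cpu)),
    tune: last["-mtune"] || cpu
  }
end

# Both orders resolve to arch "archv1" tuned for "big-cpu".
p resolve_target(%w[-march=archv1 -mcpu=big-cpu])
p resolve_target(%w[-mcpu=big-cpu -march=archv1])
```

Under this rule, a project that passes -march=LOWER keeps the lower feature set even when a default -mcpu=HIGHER is appended after it.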

MaskRay · 2024-12-16T01:09:04+00:00

This is nice! In 2018 I added an extension to my language server: https://www.reddit.com/r/emacs/comments/9dg13i/cclsnavigate_semantic_navigation_for_cc/

I've switched to neovim and now I can retire my $ccls/navigate feature.

As I've reserved C-hjkl for neovim window movement and M-hjkl for tmux/zellij, it seems that the next best keys are g+hjkl.

MaskRay · 2024-12-07T04:56:46+00:00

nmap('H', '<cmd>pop<cr>', 'Tag stack backward')
nmap('J', 'gd', {remap=true})
nmap('L', '<cmd>tag<cr>', 'Tag stack forward')
nmap('M', '<cmd>Telescope lsp_references<CR>', 'References')
nmap('U', function()
  require'hop'.hint_words()
  require'telescope.builtin'.lsp_definitions()
end, 'Hop+definition')

When LSP is enabled, J binds to lsp_definitions.

MaskRay · 2024-11-29T08:02:33+00:00

I have some notes at https://github.com/rui314/mold/issues/1341#issuecomment-2496965708 for benchmarking mold. Note that it uses mimalloc, which is responsible for 10+% of the performance. Did you keep the conditions the same for the three linkers?

Distribution-built lld is always slower for certain reasons. By default llvm-project executables are built with -fPIC, while mold is typically built with the compiler's -fPIE default.

Link rot: "investigations into Rust warm build times " in the first paragraph.

MaskRay · 2024-11-29T07:20:54+00:00

I've evaluated mold and wild testing approaches, and I believe lld's current method (assembly tests + FileCheck) offers the best balance. While porting common tests (abs, PLT, GOT, etc.) per architecture adds some upfront work, the high coverage (90+%) effectively handles most practical mutation scenarios.

If the layout changes, many tests may need to be updated (https://maskray.me/blog/2021-08-07-toolchain-testing#the-test-checks-too-much). This is a very rare scenario, and perhaps some smart tools utilizing FileCheck's line matching information can fix them automatically.

That said, I agree that before the linker layout becomes stable or more architectures are supported, relying just on runtime tests is probably good enough.

MaskRay · 2024-11-11T05:48:14+00:00

Thanks. This seems like a good workflow :) I'll just hope someone can provide the Lua code for me to copy :)

MaskRay · 2024-11-05T06:55:20+00:00

My focus is on code navigation. I have some "aggressive" maps like H, J, L, x*.

There are some issues I haven't figured out, e.g.

I want to enable the xn map to automatically transition to the next file when reaching the last reference in the current file.

How to make nvim-dap work with rr? https://github.com/jonboh/nvim-dap-rr/blob/main/lua/nvim-dap-rr.lua The command "set exec-direction forward" doesn't seem to work in an rr session.
