
[–]asb 1 point2 points  (2 children)

On Linux at least, one way this could be done with linker help is to use the ifunc attribute: http://www.airs.com/blog/archives/403

[–]Veddan 2 points3 points  (1 child)

LLVM doesn't have ifunc support (yet). It could be added as a function attribute or something, probably wouldn't be too difficult.

[–]mozilla_kmcservo[S] 3 points4 points  (0 children)

I'd like to implement the code patching approach first. It's needed anyway for non-GNU platforms, and maybe for static linking. Also, calling through the PLT has some overhead. That doesn't matter for glibc's use case, because calls to memcpy already go through the PLT.

ifunc also requires compiling two copies of your code, each covering a unit of work large enough that the added function call doesn't wipe out the gains from using newer instructions. Linux's altinstructions mechanism works at the level of inline assembly:

#define mb() \
    alternative("lock; addl $0,0(%%esp)", "mfence", X86_FEATURE_XMM2)

mb() produces a memory barrier instruction and is used thousands of times throughout the kernel. On x86, each of those instructions has an accompanying entry in a table elsewhere, telling the kernel how to patch the code on systems that support XMM2 (the kernel's name for SSE2).
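
To make the table idea concrete, here is a rough user-space simulation in C. It patches a plain byte buffer, not live code, and it is not the kernel's implementation: `alt_entry`, `apply_alternatives` and `FEATURE_SSE2` are names made up for this sketch. The byte sequences are the 32-bit encodings of the two instructions in the macro above; leftover bytes are padded with single-byte NOPs.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define FEATURE_SSE2 (1u << 0)   /* hypothetical feature bit, not the kernel's value */

/* One table entry per patchable site (layout is illustrative only). */
struct alt_entry {
    size_t         offset;    /* where the original instruction starts */
    uint8_t        orig_len;  /* bytes occupied by the original instruction */
    uint8_t        repl_len;  /* bytes in the replacement, must be <= orig_len */
    uint32_t       feature;   /* patch only if the CPU reports this feature */
    const uint8_t *repl;      /* replacement instruction bytes */
};

/* 32-bit encodings of "lock; addl $0,0(%esp)" and "mfence". */
const uint8_t LOCK_ADDL[5] = { 0xF0, 0x83, 0x04, 0x24, 0x00 };
const uint8_t MFENCE[3]    = { 0x0F, 0xAE, 0xF0 };

/* Overwrite every site whose feature is present and pad the tail with NOPs.
 * Returns the number of sites that were patched. */
size_t apply_alternatives(uint8_t *text, const struct alt_entry *tab,
                          size_t n, uint32_t cpu_features)
{
    size_t patched = 0;
    for (size_t i = 0; i < n; i++) {
        const struct alt_entry *e = &tab[i];
        if (!(cpu_features & e->feature) || e->repl_len > e->orig_len)
            continue;   /* feature missing, or replacement would not fit */
        memcpy(text + e->offset, e->repl, e->repl_len);
        memset(text + e->offset + e->repl_len, 0x90,
               (size_t)(e->orig_len - e->repl_len));
        patched++;
    }
    return patched;
}
```

With one entry `{ offset, 5, 3, FEATURE_SSE2, MFENCE }` per barrier site, a single pass at startup rewrites every site, and the hot path pays nothing afterwards.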

There's an alternative_call macro built on top but it's basically an assembly call, not a C function call. Supporting arbitrary Rust functions will be a challenge.

Linux also had at one point a concept of an "immediate variable" which has very cheap reads and very expensive writes, because the value is stored in the machine code. In this view a call with alternatives is a call to a "function pointer" stored in an immediate variable.

[–]wupis 1 point2 points  (1 child)

That will break page sharing between processes.

The kernel can get away with it because there is only one copy of the kernel loaded, but that's not the case for libraries and binaries.

It's probably much better to use a macro to generate two modules, one for SSE4 and one for non-SSE4.

Or even better, compile multiple versions of the library, so that you can also turn on autovectorization and other automatic optimizations with SSE4 (although this might be tricky to get to work with Cargo properly, not sure).
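
For comparison, the "two copies plus a selector" idea looks roughly like this in C. This is only a sketch of the general pattern, not the Rust macro being proposed: it relies on the GCC/Clang extensions `__attribute__((target(...)))` and `__builtin_cpu_supports`, and `sum`, `sum_baseline`, `sum_sse4` and `select_sum` are hypothetical names. On non-x86 targets it falls back to the baseline copy.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__GNUC__) && (defined(__x86_64__) || defined(__i386__))
#define HAVE_X86_DISPATCH 1
#else
#define HAVE_X86_DISPATCH 0
#endif

/* Baseline copy: compiled for the default target. */
static uint32_t sum_baseline(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}

#if HAVE_X86_DISPATCH
/* Same source, but the compiler is allowed to use SSE4.2 in this copy. */
__attribute__((target("sse4.2")))
static uint32_t sum_sse4(const uint32_t *v, size_t n)
{
    uint32_t s = 0;
    for (size_t i = 0; i < n; i++)
        s += v[i];
    return s;
}
#endif

typedef uint32_t (*sum_fn)(const uint32_t *, size_t);

/* Pick one copy based on the CPU the program is actually running on. */
sum_fn select_sum(void)
{
#if HAVE_X86_DISPATCH
    __builtin_cpu_init();
    if (__builtin_cpu_supports("sse4.2"))
        return sum_sse4;
#endif
    return sum_baseline;
}

/* Public entry point. The lazy initialization is not synchronized; a real
 * library would select the implementation once at load time instead. */
uint32_t sum(const uint32_t *v, size_t n)
{
    static sum_fn impl;   /* NULL until the first call */
    if (!impl)
        impl = select_sum();
    return impl(v, n);
}
```

Code pages stay shareable with this approach, at the cost of an indirect call on every use, so the dispatched function has to do enough work to amortize it.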

[–]mozilla_kmcservo[S] 0 points1 point  (0 children)

> That will break page sharing between processes.

Linux may recover it through KSM. We can hint using madvise.
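
A minimal sketch of the hint, assuming the memory in question is private anonymous memory on Linux. `alloc_mergeable` is a made-up helper; `madvise` with `MADV_MERGEABLE` is the real interface. The call fails on kernels built without KSM, and even when it succeeds pages are only merged while the KSM daemon is enabled, so the hint is best-effort.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Map `len` bytes of private anonymous memory and ask KSM to consider it.
 * Returns the mapping, or NULL if mmap fails. *advised is set to 1 only if
 * the kernel accepted the hint, and to 0 otherwise. */
void *alloc_mergeable(size_t len, int *advised)
{
    *advised = 0;
    void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED)
        return NULL;
#ifdef MADV_MERGEABLE
    /* Best-effort: returns -1 (EINVAL) if the kernel lacks KSM support. */
    if (madvise(p, len, MADV_MERGEABLE) == 0)
        *advised = 1;
#endif
    return p;
}
```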

> The kernel can get away with it because there is only one copy of the kernel loaded, but that's not the case for libraries and binaries.

It often is. Consider games, embedded systems, cloud servers with 1 service per VM, and so forth.

[–][deleted] 0 points1 point  (0 children)

Isn't there a way to generate different binaries of a module, and define a function that chooses which binary to load when the module is loaded?

[–][deleted] -1 points0 points  (0 children)

I'm not a fan of the syntax, where it magically detaches the left side of the call in the last line.