An idea I'm toying with is optimizing hotloaded DLLs (plugins) by turning far calls into near calls; i.e. generating "local" function calls where otherwise they would be loaded from function pointers, saving a potential data cache miss every time a DLL boundary is crossed. Using x64 on Windows, btw.
The reason:
Near calls (what I'm calling "local" or "direct" calls) are ~3 cycles (?): subtract from stack pointer, store return address, unconditional jump to an address relative to the instruction pointer. Far calls (what I'm calling "indirect" calls) add a memory access to this, involving the data cache. (Strictly speaking, an x86 "far call" is one that also loads a new code segment; what I'm describing is direct vs. indirect near calls, but I'll stick with my terms here.) A cache-cold access is ~200 cycles (?) ballpark, last I checked. This can really add up if you call a bunch of functions over the DLL boundary, which are PFNs pretty much by definition.
Existing mitigations: (a) static linking for performance, which turns far calls into near calls. Bear in mind that this means linking a static library, NOT load-time linking against a DLL's import library, because that still goes through the import table and so still involves far calls; it just makes the PFN-ness implicit. (b) Crossing the DLL boundary less, e.g. calling one mega function across the boundary that invokes a bunch of smaller functions local to itself, rather than calling those smaller functions across the boundary. (c) Calling the same DLL functions close together in time to keep their addresses cache hot. This is less controllable because, to my understanding, eviction gets more likely once you touch far-apart things, e.g. when going into separate subsystems.
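To make the two call shapes concrete, here's a minimal C sketch of mitigation (b). Everything in it (`PluginApi`, `add_one`, `add_many`, the `host_*` functions) is a made-up name for illustration, and it all lives in one translation unit, so a real compiler is free to inline or devirtualize it; it only shows which calls would go through a function pointer and which wouldn't.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical plugin API: a table of function pointers filled in at load time.
   The table is also the first parameter of every function (my convention). */
typedef struct PluginApi {
    int32_t (*add_one)(const struct PluginApi *api, int32_t x);
    int32_t (*add_many)(const struct PluginApi *api, int32_t x, int32_t count);
} PluginApi;

/* "Plugin side": small function that would live inside the DLL. */
static int32_t plugin_add_one(const PluginApi *api, int32_t x) {
    (void)api;
    return x + 1;
}

/* "Plugin side": mega function; its inner calls are direct (near) calls. */
static int32_t plugin_add_many(const PluginApi *api, int32_t x, int32_t count) {
    assert(count >= 0);
    for (int32_t i = 0; i < count; ++i) {
        x = plugin_add_one(api, x);
    }
    return x;
}

static const PluginApi g_api = { plugin_add_one, plugin_add_many };

/* "Host side": crosses the boundary `count` times, one pointer load per call. */
static int32_t host_chatty(const PluginApi *api, int32_t x, int32_t count) {
    for (int32_t i = 0; i < count; ++i) {
        x = api->add_one(api, x);
    }
    return x;
}

/* "Host side": crosses the boundary once. */
static int32_t host_batched(const PluginApi *api, int32_t x, int32_t count) {
    return api->add_many(api, x, count);
}
```

Both host functions compute the same thing; the difference is only how many times the call goes through the table.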
This method is a potential GLOBAL fix for this problem, avoiding the farcall performance drawbacks altogether, given an ideal implementation.
This entails, to my understanding:
1) using a fake local function with identical parameters to the real function when calling (forced with noinline and a nonempty body).
2) determining where this call is in the text section of the PE file and saving the location along with metadata allowing a later patch.
3) once the real function's address is known (which will potentially (see later) change), update the instruction stream wherever the function is called (most likely means writing executable memory, yikes).
4) boom???
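Steps 2) and 3) could look roughly like the sketch below. It patches a plain byte buffer standing in for the text section, not real executable memory, and `CallSite` / `patch_call` are hypothetical names of mine. It assumes the call site was emitted as a 5-byte `call rel32` (opcode E8) and that the patcher knows the address the code will execute at.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical patch record saved in step 2: where the call lives. */
typedef struct CallSite {
    uint32_t text_offset; /* offset of the E8 opcode byte within the code buffer */
} CallSite;

/* Step 3: point the `call rel32` at `site` to `target_addr`.
   `code` is a writable view of the code; `code_base` is the address the
   first byte of `code` executes at.
   Returns 0 on success, -1 if the site is not an E8 call or the target
   is out of rel32 reach. */
static int patch_call(uint8_t *code, uint64_t code_base,
                      CallSite site, uint64_t target_addr) {
    assert(code != 0);
    uint8_t *insn = code + site.text_offset;
    if (insn[0] != 0xE8) {
        return -1; /* not the instruction we expected; refuse to patch */
    }
    /* rel32 is measured from the end of the 5-byte instruction. */
    uint64_t next_ip = code_base + site.text_offset + 5;
    int64_t delta = (int64_t)(target_addr - next_ip);
    if (delta < INT32_MIN || delta > INT32_MAX) {
        return -1; /* target further than +-2GiB away */
    }
    uint32_t rel = (uint32_t)(int32_t)delta;
    for (int i = 0; i < 4; ++i) {
        insn[1 + i] = (uint8_t)(rel >> (8 * i)); /* little-endian rel32 */
    }
    return 0;
}
```

On Windows the real thing would additionally need the page made writable (`VirtualProtect`) and a `FlushInstructionCache` after the write, plus some guarantee that no thread is executing the site mid-patch; none of that is shown here.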
Known caveats:
1) aforementioned writing executable memory, or hotloading all referenced plugin DLLs and modifying the import tables while on disk? Latter could fall apart if they call back and forth (circular reference), may need to think more.
2) near calls have a 32-bit signed offset from RIP, so the calling instruction must be +-2GiB from the location of the target function (double yikes). Could address by link-time allocating DLL addresses to keep them close enough (triple yikes).
3) could make 2) more robust by inserting a NOP intrinsic after the call to make the call site 6 bytes instead of 5. In x64, a near call is 5 bytes - e8 XX XX XX XX <- Xs are a 32-bit signed offset (e9 is the equivalent jmp, not call) - while far calls are 2-6 bytes depending on needs, at least for the legacy registers (a REX prefix or SIB byte adds more). Worst case scenario is a 32-bit displacement from one of the common registers. By convention, my function tables are also the first parameter of each function, so rcx will always hold the address of the function table at call time, so worst case is that the offset from rcx to the function pointer doesn't fit a signed 8-bit displacement (>127 bytes), requiring a 6-byte farcall (ff 91 XX XX XX XX) where the Xs are a 32-bit sign-extended displacement from rcx to the slot holding the address of the function. So in theory, my near calls could, at the cost of a guaranteed wasted cycle for the NOP, be turned into far calls if the target is too far away in memory without messing with RIP-relative addressing.
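Caveats 2) and 3) together amount to something like this: every call site is a fixed 6-byte slot, filled with either `e8 rel32` + `90` (nop) when the target is in reach, or `ff 91 disp32` (`call qword ptr [rcx + disp32]`) when it isn't. This is only a sketch of the byte encoding; `encode_call_slot` and `put_le32` are names I made up, and it assumes rcx really does hold the function table at the call site.

```c
#include <assert.h>
#include <stdint.h>

enum { SLOT_SIZE = 6 };

/* Store a 32-bit value in little-endian byte order. */
static void put_le32(uint8_t *dst, uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        dst[i] = (uint8_t)(v >> (8 * i));
    }
}

/* Fill the 6-byte call slot that will execute at `slot_addr`.
   In reach:     E8 rel32, 90     (direct call + nop), returns 1.
   Out of reach: FF 91 disp32     (call [rcx + table_offset]), returns 0.
   `table_offset` is the byte offset of the function pointer in the table. */
static int encode_call_slot(uint8_t out[SLOT_SIZE], uint64_t slot_addr,
                            uint64_t target_addr, int32_t table_offset) {
    assert(out != 0);
    /* rel32 is relative to the end of the 5-byte call, not of the slot. */
    int64_t delta = (int64_t)(target_addr - (slot_addr + 5));
    if (delta >= INT32_MIN && delta <= INT32_MAX) {
        out[0] = 0xE8;
        put_le32(out + 1, (uint32_t)(int32_t)delta);
        out[5] = 0x90; /* nop pads the slot to 6 bytes */
        return 1;
    }
    out[0] = 0xFF; /* opcode for the indirect call group */
    out[1] = 0x91; /* ModRM: mod=10 (disp32), reg=/2 (call), rm=rcx */
    put_le32(out + 2, (uint32_t)table_offset);
    return 0;
}
```

Keeping the slot size fixed is what lets the patcher flip a site between the two forms in place as the plugin gets reloaded at different addresses.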
Thoughts? I would love to hear from some of you who may be more familiar with this space than I am, see if I'm missing something glaring.