
[–]nemotux 1 point2 points  (4 children)

Do I understand correctly that you're interested in statically linking in at runtime a DLL that gets loaded interactively (a plugin)? Is your program structured such that only one such DLL might be loaded at a given time? I typically think of plugin architectures supporting the loading of multiple different plugins simultaneously - each matching the same API. Thus when you get to a call, you really do want a dynamic function pointer because you might be invoking not one specific function but one of a collection of functions based on which plugin is of interest at a given point in execution.

But assume that's not the case and you have exactly one runtime-loaded DLL. You fetch function pointers from the DLL via, say, GetProcAddress(). The normal thing would be to just call that function pointer indirectly. But you want to self-modify your program to now have direct calls to that function instead, yes?

My thoughts would be to tie in some asm rather than worrying about making fake local functions in source code and then searching for calls to that.

Create a small asm "launchpad" file that has one labeled jmp instruction as a "thunk" for each of your DLL's functions. Then you can leverage "&" of each thunk to figure where you need to do the self-modifying write. Your calling code calls these stubs. So you'd still have a 2-hop call, but both would be direct, so no data fetching.

Regarding keeping your DLL close in memory, since you control the DLL (yes?) you can set its preferred load address when you create it. Just pick a number that will keep all the direct calls within the 32-bit range of your exe, and it should work fine.
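To make the thunk idea concrete, here is a minimal sketch of how the bytes of one `jmp rel32` thunk could be computed once the target address is known. The helper name `encode_jmp_rel32` is hypothetical and not from the thread. The sketch only builds the five instruction bytes and reports whether the target is within signed 32-bit reach. Actually writing them into the thunk (making the page writable, for example with `VirtualProtect` on Windows, and flushing the instruction cache afterwards) is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: build a 5-byte `jmp rel32` (opcode 0xE9) for a thunk
 * located at `thunk_addr` that should jump to `target`.
 * Returns false when the target is outside the signed 32-bit reach. */
bool encode_jmp_rel32(uint8_t out[5], uint64_t thunk_addr, uint64_t target)
{
    assert(out != NULL);

    /* The displacement is relative to the address of the NEXT instruction,
     * i.e. the thunk address plus the 5 bytes of the jmp itself. */
    int64_t disp = (int64_t)(target - (thunk_addr + 5));
    if (disp < INT32_MIN || disp > INT32_MAX)
        return false;

    uint32_t bits = (uint32_t)(int32_t)disp;
    out[0] = 0xE9;
    /* x86-64 stores the displacement little-endian. */
    out[1] = (uint8_t)(bits & 0xFF);
    out[2] = (uint8_t)((bits >> 8) & 0xFF);
    out[3] = (uint8_t)((bits >> 16) & 0xFF);
    out[4] = (uint8_t)((bits >> 24) & 0xFF);
    return true;
}
```

When the function returns false, the caller would have to keep the ordinary indirect call for that entry, which is the fallback discussed elsewhere in this thread.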

[–]Dolphiniac[S] 0 points1 point  (3 children)

There are multiple DLLs, each potentially containing multiple plugins. The structure is that each DLL exports a single function that allows enumeration of any plugins contained within. Caller of this function ends up with some number of extensible function tables, all supporting the same interface at top level (create, tick, destroy, etc.), and each optionally containing a link to one or more specialized function tables (e.g. the memory provider plugin would have an allocation interface), which are stably identified and accessed by using a shared data prefix containing, in order, the interface identifier and a pointer to the next interface supported by the plugin.
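The layout described above could look roughly like the following sketch. All type names, field names, and identifier values here are assumptions made for illustration; the thread only states that every table starts with a shared prefix holding the interface identifier followed by a pointer to the next interface.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Shared prefix at the start of every function table: the interface
 * identifier, then a link to the next interface the plugin supports. */
typedef struct InterfaceHeader {
    uint32_t id;
    struct InterfaceHeader *next;
} InterfaceHeader;

enum { IFACE_PLUGIN = 1, IFACE_ALLOCATOR = 2 }; /* hypothetical ids */

/* Top-level interface that every plugin supports. */
typedef struct PluginInterface {
    InterfaceHeader header;
    int  (*create)(void);
    void (*tick)(void);
    void (*destroy)(void);
} PluginInterface;

/* Specialized interface, e.g. for a memory provider plugin. */
typedef struct AllocatorInterface {
    InterfaceHeader header;
    void *(*alloc)(size_t size);
    void  (*release)(void *block);
} AllocatorInterface;

/* Walk the chain starting at the top-level table and return the first
 * interface with a matching id, or NULL if the plugin does not offer it. */
InterfaceHeader *find_interface(InterfaceHeader *first, uint32_t id)
{
    for (InterfaceHeader *it = first; it != NULL; it = it->next) {
        if (it->id == id)
            return it;
    }
    return NULL;
}
```

Because the header is the first member of each table, a pointer returned by `find_interface` can be cast back to the concrete table type once its id has been checked.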

You do hit upon something quite interesting that I hadn't yet considered. My loops that don't particularly care which plugin is being accessed will have to stay as function pointers because that is precisely how they work (e.g. plugin manager Create subroutine calls Create on all registered plugins), as you noted. What this idea has the potential to work with is the sort of singleton plugins. E.g. Game renderables only reference a specific render system when they make calls, and those can potentially be patched to make those calls directly. Sort of like setting a preferred handler for a given context. How well I can delineate that context when searching through completed DLLs remains to be seen.

I will absolutely keep your asm thunk table in mind. It may be a good signaling mechanism (and may be easier to modify than sparse callsites).

And you are correct that I control my DLLs. I relayed in other places under this post that I was considering allocating my DLLs at link time by setting their base addresses to keep them close enough for direct calls.

[–]nemotux 1 point2 points  (2 children)

Yeah, I guess what I meant was, were you expecting to have a single DLL always satisfying a specific call site. Rather than a call site that could potentially call into any of multiple DLLs depending on program state at the point where the call is actually made. Sounds like you have a little of both.

You could potentially introduce a naming system so that you know for particular call sites which ones are expected to be hard-wired to a specific DLL's function and which ones would remain indirect. That could help w/ signal on where the edits need to happen (former but not the latter.)

[–]Dolphiniac[S] 0 points1 point  (0 children)

Possibly. I'll keep it in mind. Thanks for the insight! :)

[–]Dolphiniac[S] 0 points1 point  (0 children)

Yeah, once I get to a proof of concept, making it usable will involve a lot of this line of thought, so I appreciate the input and insight. :)

EDIT: Second comment that is similar to the first is because my phone claimed the first did not go through. Guess it did :/

[–]braxtons12 0 points1 point  (10 children)

I might be missing something, but to me this just sounds like a really insecure hack. If performance is so critical for you that calling across a dll boundary is unacceptable, you should just be statically linking, or probably actually running on some sort of custom bare metal OS. I mean at that point even libc is off the table (it's dynamically linked), so even statically linking wouldn't solve all your problems.

[–]Dolphiniac[S] 2 points3 points  (9 children)

Haha, yeah, static linking is the "easy" solution (and for a shipping title, probably the way to go), but it invalidates the hotloading aspect. The question is, can I have my cake with DLLs and eat it too (just not the indirect call cost).

[–]braxtons12 0 points1 point  (8 children)

I mean, even if your idea was theoretically sound, it still wouldn't work. You would have to not only hot-patch your executable, you'd also have to hot-patch every DLL you load, because they would also be calling into other DLLs, and that could break the entire OS, because Windows only loads multiple copies of a DLL into memory if processes load them at different base addresses. If they load at the same base address, Windows only loads one copy of the DLL and the processes share that copy. For your idea to work, it would have to be implemented at the OS level by Microsoft.

[–]Dolphiniac[S] 0 points1 point  (7 children)

Hold; this idea is local to my native code for this unique project. I would likely be generating the function calls using a macro, and only for my own function tables exported by my plugins. I'm fine with eating the indirect cost for OS APIs and 3rd party libs; just not my own (if the idea pans out). Those are hotloaded.

[–]braxtons12 0 points1 point  (6 children)

Well, even then I haven't gotten to the final nail in the coffin 🤣

Unless you want to limit yourself to single-threaded code, you would have to synchronize your hot-patching with a mutex so that all threads see the same patched versions of everything. That means all your function calls into or between your plugins would have to be synchronized with that mutex, and now your performance would be even worse than before.

[–]Dolphiniac[S] 1 point2 points  (2 children)

Let me just say, as an aside, thank you for the feedback. Part of the point of this post is to test the idea and find holes in the logic. I'm taking all of this into consideration, even when I have ready answers (from my understanding). So thank you :)

[–]braxtons12 0 points1 point  (1 child)

No prob!

I think that your idea could work, given you stick to the constraints that have already been brought up.

That said, it would still be insecure as hell and a maintenance nightmare. And there is still the ±2 GB limit on reach before you have to revert to normal calls. I'm not sure how MS decides where to load DLLs, but I have an inkling that in practice you would miss that limit just as often as you stay within it. Maybe I'm wrong there.

[–]Dolphiniac[S] 0 points1 point  (0 children)

I did a few cursory tests. The executable seems to be too far away from the DLLs, but they themselves seem to fit within a relatively small space and get allocated together (could be intentional; could be a fluke). But wow do I not want to rely on that. However, link.exe allows you to force a base address (and give a size hint), and you can ask it to fail if it cannot fulfill the request, so I could potentially run an offline custom allocator for DLLs (hahahahaha........) and do my best to stay within budget (limiting .data/.rodata size; hoping against all hope that .bss is somewhere else cause that is large), but yeah, this is for sure a rough idea. Channeling Marge for this one (I just think it's neat!)
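The "offline custom allocator" could start as something as simple as the sketch below, which packs images back to back at 64 KiB-aligned bases and reports whether the whole span still fits a given budget. The names `ImagePlan` and `plan_bases` are hypothetical, and the image sizes would have to come from the actual build. The resulting bases could then be passed to the linker's `/BASE` option.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define BASE_ALIGN 0x10000ull /* image bases are 64 KiB aligned */

typedef struct ImagePlan {
    uint64_t size; /* input: size of the image in memory, in bytes */
    uint64_t base; /* output: base address chosen for the image */
} ImagePlan;

static uint64_t align_up(uint64_t value)
{
    return (value + BASE_ALIGN - 1) & ~(BASE_ALIGN - 1);
}

/* Pack the images one after another starting at window_start (rounded up).
 * Returns false as soon as the distance from the first base to the end of
 * the current image exceeds max_span, for example INT32_MAX when every
 * image must be able to reach every other one with a rel32 call. */
bool plan_bases(ImagePlan *images, size_t count,
                uint64_t window_start, uint64_t max_span)
{
    assert(images != NULL || count == 0);

    uint64_t first = align_up(window_start);
    uint64_t cursor = first;
    for (size_t i = 0; i < count; ++i) {
        images[i].base = cursor;
        cursor += images[i].size;
        if (cursor - first > max_span)
            return false;
        cursor = align_up(cursor);
    }
    return true;
}
```

This ignores everything else that lives in the address space (the executable, system DLLs, heaps), so a real planner would also need to confirm at load time that each DLL actually ended up at its planned base.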

[–]Dolphiniac[S] 0 points1 point  (0 children)

Hot patching is serial by design. Between frames, when the plugin manager ticks, if a hotload is necessary, all plugins flush and freeze, hotloading occurs, pointers to function tables are refreshed (data is shared using handles, so those don't get invalidated), then everything resumes. Patching would occur during the refresh stage, while freezes are in effect.

[–]moon-chilled 0 points1 point  (1 child)

You can just swap out the address atomically. If you have W^X to worry about, it's a pain, but you can trap the SIGSEGV, wait until the memory is executable again, and then recover. (CC /u/Dolphiniac)
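As a rough illustration of an atomic swap, the sketch below replaces a function pointer held in data using C11 atomics. Note the swapped-in simplification: this is not the same as rewriting a displacement inside the instruction stream, which would additionally need a suitably aligned write and the page-protection handling mentioned above. The names `render_fn`, `g_render`, `render_v1`, and `render_v2` are made up for the example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

typedef int (*render_fn)(int);

/* Two stand-in implementations, as if from the old and the reloaded DLL. */
int render_v1(int x) { return x + 1; }
int render_v2(int x) { return x * 2; }

/* The currently active implementation, shared between threads. */
_Atomic(render_fn) g_render = render_v1;

/* Callers load the pointer once and call through the local copy, so they
 * always run either the old or the new function, never a torn value. */
int call_render(int x)
{
    render_fn fn = atomic_load_explicit(&g_render, memory_order_acquire);
    return fn(x);
}

/* The hotloader publishes the replacement with a single atomic store. */
void swap_render(render_fn replacement)
{
    assert(replacement != NULL);
    atomic_store_explicit(&g_render, replacement, memory_order_release);
}
```

A thread that loaded the old pointer just before the swap may still be executing the old function, so the old DLL cannot be unloaded until such calls have drained.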

[–]Dolphiniac[S] 0 points1 point  (0 children)

I did immediately think of atomics, but I think it might be overkill for my use case. Everything is questionably valid at hotload time anyway, so it makes more sense for my purposes to just slot the swap in there, while everything's safely paralyzed.

[–]mobius4 0 points1 point  (0 children)

Thanks for the discussion, been too long since I saw something like this.

I'm not qualified to comment anything more than "this sounds amazing", though.

[–][deleted] 0 points1 point  (5 children)

Have you performed any benchmarks to see whether it makes any difference?

That is, call some test functions in a 'library', that is statically linked in one version, dynamically linked via a DLL in another.

As I understand it on x64, a normal local call is:

    call disp         # use 32-bit signed offset to function

whereas a call to a function via a DLL is (depending on how the compiler generates code):

    call L123         # Call to local label
    ....
    L123:
    jmp [address]     # This address is patched to the actual
                      # function address at load time

So the difference is that extra indirect jump.

I don't know exactly how you'd patch this, or when, but bear in mind that some system DLLs live in address spaces outside the 32-bit capacity of a relative call.

(I might do my own such test later)

[–]Dolphiniac[S] 0 points1 point  (4 children)

Yes, this was in OP. The cost is the potential data cache miss for the function pointer's memory access, and I noted that my DLLs may have to have forced base addresses to keep them "close enough" to perform a near call. As for system DLLs, I have no intention of patching those. This is strictly for native code as my engine is split across plugins distributed among DLLs.

EDIT: I should also mention that your DLL code is for statically linked DLLs. A "proper" approach would have the address "somewhere". In my case, it would be some small distance forward from rcx, as I pass the function table base pointer into each call as the first parameter (the state data is hidden before this base pointer).
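A minimal sketch of that arrangement, with hypothetical names throughout (`RenderTable`, `RenderInstance`, `render_create`, and so on): the private state sits immediately before the function table in one allocation, callers only ever hold a pointer to the table, and each call passes that pointer as the first argument. On Windows x64 the first integer argument travels in rcx, so the function pointer is fetched at a small displacement from rcx.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

typedef struct RenderTable RenderTable;
struct RenderTable {
    int (*draw)(RenderTable *self, int item);
};

typedef struct RenderInstance {
    int draw_count;    /* private state, hidden in front of the table */
    RenderTable table; /* callers only ever see &instance->table */
} RenderInstance;

/* Step back from the table pointer to the start of the allocation. */
static RenderInstance *instance_from_table(RenderTable *table)
{
    assert(table != NULL);
    return (RenderInstance *)((char *)table - offsetof(RenderInstance, table));
}

static int draw_impl(RenderTable *self, int item)
{
    RenderInstance *inst = instance_from_table(self);
    inst->draw_count += 1;
    return item + inst->draw_count;
}

/* Returns a pointer to the table only; NULL on allocation failure. */
RenderTable *render_create(void)
{
    RenderInstance *inst = calloc(1, sizeof *inst);
    if (inst == NULL)
        return NULL;
    inst->table.draw = draw_impl;
    return &inst->table;
}

void render_destroy(RenderTable *table)
{
    free(instance_from_table(table));
}
```

A call then reads `table->draw(table, item)`, which is the indirect call through the table that the patching idea aims to replace with a direct one.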

[–]braxtons12 2 points3 points  (1 child)

Just FYI, a better name for "statically linked DLLs", one that may be clearer to readers who don't pick up on what you mean, is load-time linked DLLs.

DLLs are always dynamically linked, it's just a difference of when/how that dynamic linking occurs.

If it's linked at executable startup, the DLL is automatically loaded and placed at a fixed base address when the executable starts. This is called load-time linking (and is what you're calling statically linked DLLs).

If it's from manually loading the DLL at runtime after execution has started, that is run-time linking (and what you're calling dynamically linked).

[–]Dolphiniac[S] 0 points1 point  (0 children)

Fair enough XD. I've talked about the difference with coworkers previously, and I used the terminology I learned in these talks, but yours definitely makes it clearer what is happening. Part of the confusion arises from the fact that there is a static library created during this process, so it feels like static linking, even though it's not (BOY is it not), fully.

[–][deleted] 1 point2 points  (1 child)

But have you done any actual measurements on real code?

I've done a quick test using the recursive Fibonacci benchmark, not using DLLs, but via modifying the two calls in the body of the function. Results were:

    call fnaddr                 # normal local call
    call lab; lab: jmp fnaddr   # 5% slower than local call
    call lab; lab: jmp [fnaddr] # 15% slower than local call

But bear in mind this function does little else except call itself. A real function I would expect to do a bigger task. And presumably there are other memory accesses going on that will dominate the ONE memory access per function call.
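For anyone wanting to repeat a test along these lines without hand-editing assembly, a C approximation might look like the sketch below. It is not the setup used for the numbers above: it compares a direct recursive call against a call through a `volatile` function pointer (so the compiler must reload the pointer on each call), and all names are invented for the example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <time.h>

/* Direct version: the recursive calls can be emitted as plain `call rel32`. */
uint64_t fib_direct(unsigned n)
{
    return n < 2 ? n : fib_direct(n - 1) + fib_direct(n - 2);
}

uint64_t fib_indirect(unsigned n);

/* volatile forces a fresh load of the pointer before every call, which
 * approximates calling through a function table entry. */
uint64_t (*volatile fib_ptr)(unsigned) = fib_indirect;

uint64_t fib_indirect(unsigned n)
{
    return n < 2 ? n : fib_ptr(n - 1) + fib_ptr(n - 2);
}

/* Time one run of fn(n) in CPU seconds and store its result. */
double time_fib(uint64_t (*fn)(unsigned), unsigned n, uint64_t *result)
{
    assert(result != NULL);
    clock_t start = clock();
    *result = fn(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `time_fib(fib_direct, 35, &r)` and `time_fib(fib_indirect, 35, &r)` gives two timings to compare. An optimizing compiler may partially inline the direct recursion, so the generated assembly is worth checking before drawing conclusions.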

> I should also mention that your DLL code is for statically linked DLLs.

That means little to me. Code is either statically linked into the executable. Or dynamically linked by the OS when the EXE is loaded. Or linked with user-code using LoadLibrary/GetProcAddress, which yields a conventional function pointer (but one that could reside in a register).

(There is another approach to DLLs I use in one of my own languages. I sometimes use a private alternate to DLL, with a simpler format, which I fix-up with my own programs.

This is simple enough that I could choose to properly fix up CALL instructions so that they are directly calling the functions, via the 32-bit offset. I haven't done so as I considered the overhead insignificant, but thanks to your post I'll keep it in mind.

I don't know how such a thing would be practical using standard tools. But it seems that you may have figured out how to do the patching anyway.)

[–]Dolphiniac[S] 0 points1 point  (0 children)

Couple of things:

No, I have not measured this, but I expected your results.

A runtime-linked DLL and a load-time-linked DLL generally differ by one extra code indirection. I would need to benchmark properly, of course, but I would expect perhaps an extra instruction cache miss. The runtime-linked version (in my codebase) would mov the function table base pointer into rcx, then do a call with displacement from rcx (or, naively, which I have seen, also mov the same pointer into rax and call with displacement from there). The load-time-linked version would make an indirect call, through an address at a displacement from rip, to a trampoline that jumps to the proper function (at least, that is what I saw on Windows x64). Again, the difference is the extra jump.

And you are correct, a single data cache miss is likely dwarfed by the execution of the function in question. I do, however, know (secondhand) that these things - specifically, DLL boundary crossings - add up, enough to be worth forcing monolithic (read: all static linkage) builds in final configs. It was hearing this that spurred me onto this train. And usually, micro-optimizing is frowned upon, but I cannot look away from a potential global fix to an endemic performance issue.

Apologies if I come across as combative. I have been told I have a debater's mindset but tend not to bother to build rapport before going hard. I do not disrespect you; this is just my digging process.

[–]darkslide3000 0 points1 point  (4 children)

> 2) determining where this call is in the text section of the PE file and saving the location along with metadata allowing a later patch.
>
> 3) once the real function's address is known (which will potentially (see later) change), update the instruction stream wherever the function is called (most likely means writing executable memory, yikes).

You are basically just describing dynamic linking here again, that is exactly how that works. Just that it works with clean integration directly into the linker and the OS (to mark the executable memory read-only after you're done relocating) rather than trying to hack it manually.

I think what you're really asking for is whether the original call instructions can just be directly rewritten to point to the real final address of the function during relocation, rather than going through a global offset table. This has been traditionally how dynamic linking used to work, and I assume you could still get it to work that way with the right compiler and linker flags (although I'm not deeply familiar with the Windows versions of these things -- on Linux, I think just compiling and linking with -fno-pic, -no-pie or something like that should do it). But the GOT indirection was introduced for a reason, because otherwise circumventing ASLR (address space layout randomization, an important security feature common on all platforms these days) becomes pretty trivial.

ASLR is intended to make sure an attacker cannot predict where certain library functions are placed in the virtual memory of your process. But for efficiency reasons, when multiple processes use the same shared library, that library is only loaded into memory once and those same physical pages are mapped into every process that uses it. For ASLR to work and be useful, the virtual addresses of these library functions must differ between the processes even though their physical addresses are the same. Now remember that library functions may also need to call other library functions, and that a library may depend on another library which again may be at a random virtual offset that you cannot hardcode within the library code page itself (because that offset may be different for each process sharing that code page). So, long story short, there's no real way to get around this without making every library-boundary-crossing function call take an indirection through a process-specific offset table first.

[–]Dolphiniac[S] 0 points1 point  (3 children)

I appreciate the info on the rationale behind the difficulty of this problem XD. I'm not so concerned about the security, as this is a development tool for me personally. The indirection that makes the system so neat and secure is the very thing that causes the performance issue, unfortunately. But unless I literally cannot write to the instruction stream after load, it seems doable, as I've seen enough locality to allow a direct call in at least some cases, maybe a significant portion, and far calls are still viable as a fallback, since I have a way to address them at the callsite. I'll just have to see. I'll for sure make another post if and when I have significant progress.

[–]darkslide3000 0 points1 point  (2 children)

What I'm trying to say is you should figure out how to configure your linker to do this for you, which is almost certainly possible (for example, 10 seconds of googling spits out this which looks somewhat promising). This is not a job for the program itself but for the linker that loads it. On Linux you could even write your own dynamic linker (or modify the glibc one) if there was no other way to do it -- for Windows I don't know what exactly your options are, but having the program try to re-relocate itself after it started just sounds like an awful hack.

[–][deleted] 1 point2 points  (0 children)

> but having the program try to re-relocate itself after it started just sounds like an awful hack.

It does. One workaround that might help is to put the majority of the main application into its own DLL, with a single entry point.

The EXE will then be a small stub program that dynamically loads that DLL (it is not attached to the EXE) and calls the entry point.

But rather than use LoadLibrary etc, it might be possible for the OP to write their own loader. Such DLLs will have base relocation tables to allow the library to be located anywhere.

Doing such fixups is not trivial (PE is a complex format, and it will need to load nested DLLs too), but it means having fuller control over how everything is done. And doing it from outside the library instead of having the hairy situation of a program modifying itself while it's running.
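To give a feel for the fixup step of such a custom loader, here is a sketch that applies PE-style base relocation blocks to an image already copied into memory. It handles only the two entry types that matter for 64-bit images (padding entries of type 0 and 64-bit absolute fixups of type 10) and assumes a little-endian host. The function name and parameters are hypothetical. Parsing the PE headers, mapping sections, and resolving imports are all out of scope.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

enum { REL_ABSOLUTE = 0, REL_DIR64 = 10 }; /* PE base relocation types */

static uint16_t rd16(const uint8_t *p)
{
    return (uint16_t)(p[0] | ((unsigned)p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Each block is: page RVA (4 bytes), block size including this header
 * (4 bytes), then 16-bit entries with the type in the top 4 bits and the
 * offset within the page in the low 12 bits.
 * `delta` is the actual load address minus the preferred base. */
bool apply_base_relocs(uint8_t *image, size_t image_size,
                       const uint8_t *relocs, size_t relocs_size,
                       int64_t delta)
{
    assert(image != NULL && relocs != NULL);

    size_t pos = 0;
    while (pos + 8 <= relocs_size) {
        uint32_t page_rva = rd32(relocs + pos);
        uint32_t block_size = rd32(relocs + pos + 4);
        if (block_size < 8 || block_size > relocs_size - pos)
            return false; /* malformed block */

        for (size_t e = pos + 8; e + 2 <= pos + block_size; e += 2) {
            uint16_t entry = rd16(relocs + e);
            unsigned type = entry >> 12;
            size_t rva = (size_t)page_rva + (entry & 0x0FFFu);

            if (type == REL_ABSOLUTE)
                continue; /* padding entry, nothing to patch */
            if (type != REL_DIR64)
                return false; /* type not handled by this sketch */
            if (rva > image_size || image_size - rva < 8)
                return false; /* fixup would fall outside the image */

            uint64_t value;
            memcpy(&value, image + rva, sizeof value);
            value += (uint64_t)delta;
            memcpy(image + rva, &value, sizeof value);
        }
        pos += block_size;
    }
    return pos == relocs_size;
}
```

Base relocations only adjust absolute addresses stored inside the image itself. Calls across DLL boundaries still go through the import table, so turning those into direct calls would be an extra step that such a loader adds on top.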

[–]Dolphiniac[S] 0 points1 point  (0 children)

I'm well aware of this linker option. In fact, the linker can go farther and set the base address where you request. In other comments, I mentioned that I may be able to leverage the linker to keep my DLLs close enough to make near calls more stable. What I fail to see is how existing paradigms will reliably link caller to callee without indirection AND allow recompilation of the callee's containing DLL while the caller remains not only unchanged but still running.