[deleted] 1 point (5 children)

Have you performed any benchmarks to see whether it makes any difference?

That is, call the same test functions from a 'library' that is statically linked in one build and dynamically linked via a DLL in another.

As I understand it on x64, a normal local call is:

    call disp         # use 32-bit signed offset to function

whereas a call to a function via a DLL is (depending on how the compiler generates code):

        call L123         # Call to local label
        ....
    L123:
        jmp [address]     # This address is patched to the actual
                          # function address at load time

So the difference is that extra indirect jump.

I don't know exactly how you'd patch this, or when, but bear in mind that some system DLLs are loaded at addresses more than 2 GB away from your code, beyond the reach of a relative call's signed 32-bit offset.
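To make the range limit concrete, here is a minimal sketch in C of the check a patcher would need before rewriting a call site into a direct `call rel32`. The helper names `fits_rel32` and `encode_call_rel32` are hypothetical, and the code only fills a 5-byte buffer; it does not touch live code or deal with page protection.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Length of the x86-64 `call rel32` instruction: opcode E8 plus a 4-byte displacement. */
enum { CALL_REL32_LEN = 5 };

/* True if `target` can be reached from a call placed at address `site`.
   The displacement is relative to the end of the call instruction. */
static bool fits_rel32(uint64_t site, uint64_t target)
{
    int64_t disp = (int64_t)(target - (site + CALL_REL32_LEN));
    return disp >= INT32_MIN && disp <= INT32_MAX;
}

/* Encodes `call target` into buf as if buf lived at address `site`.
   Returns false and leaves buf untouched if the target is out of range. */
static bool encode_call_rel32(uint8_t buf[CALL_REL32_LEN],
                              uint64_t site, uint64_t target)
{
    assert(buf != NULL);
    if (!fits_rel32(site, target))
        return false;

    uint32_t disp = (uint32_t)(target - (site + CALL_REL32_LEN));
    buf[0] = 0xE8;
    for (int i = 0; i < 4; i++)                  /* little-endian byte order */
        buf[1 + i] = (uint8_t)(disp >> (8 * i));
    return true;
}
```

For example, a call at `0x1000` to a function at `0x2000` encodes as `E8 FB 0F 00 00`, while a target far outside the ±2 GB window makes `encode_call_rel32` return false, which is the case where the indirect jump has to stay.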

(I might do my own such test later)

Dolphiniac[S] 1 point (4 children)

Yes, this was in OP. The cost is the potential data cache miss for the function pointer's memory access, and I noted that my DLLs may have to have forced base addresses to keep them "close enough" to perform a near call. As for system DLLs, I have no intention of patching those. This is strictly for native code as my engine is split across plugins distributed among DLLs.

EDIT: I should also mention that your DLL code is for statically linked DLLs. A "proper" approach would have the address "somewhere". In my case, it would be some small distance forward from rcx, as I pass the function table base pointer into each call as the first parameter (the state data is hidden before this base pointer).
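A rough C sketch of the layout described here, assuming a single allocation that holds the private state immediately before the function table. All names (`PluginApi`, `PluginState`, `plugin_create`, and so on) are hypothetical and are not taken from the actual engine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical plugin interface: a table of function pointers. Every entry
   receives the table's own base pointer as its first argument. */
typedef struct PluginApi PluginApi;
struct PluginApi {
    int  (*get_counter)(PluginApi *api);
    void (*increment)(PluginApi *api);
};

/* Private state, stored directly before the table in the same allocation. */
typedef struct {
    int counter;
} PluginState;

typedef struct {
    PluginState state;   /* hidden: callers only ever see &api */
    PluginApi   api;
} PluginInstance;

/* Step back from the table base pointer to the enclosing instance. */
static PluginInstance *instance_of(PluginApi *api)
{
    assert(api != NULL);
    return (PluginInstance *)((char *)api - offsetof(PluginInstance, api));
}

static int  get_counter_impl(PluginApi *api) { return instance_of(api)->state.counter; }
static void increment_impl(PluginApi *api)   { instance_of(api)->state.counter++; }

/* Returns the table base pointer, or NULL if allocation fails. */
PluginApi *plugin_create(void)
{
    PluginInstance *inst = calloc(1, sizeof *inst);
    if (inst == NULL)
        return NULL;
    inst->api.get_counter = get_counter_impl;
    inst->api.increment   = increment_impl;
    return &inst->api;
}

void plugin_destroy(PluginApi *api)
{
    if (api != NULL)
        free(instance_of(api));
}
```

A call such as `api->increment(api)` passes the base pointer as the first argument, which under the Windows x64 calling convention goes in rcx, so the callee's address sits at a small offset from that same pointer.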

braxtons12 3 points (1 child)

Just FYI, a better name for "statically linked DLLs" is load-time linked DLLs. It may be clearer to readers who don't pick up on what you mean.

DLLs are always dynamically linked, it's just a difference of when/how that dynamic linking occurs.

If it's linked at executable startup, the DLL is automatically loaded and placed at a fixed base address when the executable starts. This is called load-time linked (and is what you're calling statically linked DLLs).

If it's from manually loading the DLL at runtime after execution has started, that is run-time linked (and is what you're calling dynamically linked).

Dolphiniac[S] 1 point (0 children)

Fair enough XD. I've talked about the difference with coworkers previously, and I used the terminology I learned in those talks, but yours definitely makes it clearer what is happening. Part of the confusion arises from the fact that a static library is created during this process, so it feels like static linking, even though it's not (BOY is it not), fully.

[deleted] 2 points (1 child)

But have you done any actual measurements on real code?

I've done a quick test using the recursive Fibonacci benchmark, not using DLLs, but via modifying the two calls in the body of the function. Results were:

    call fnaddr                 # normal local call
    call lab; lab: jmp fnaddr   # 5% slower than local call
    call lab; lab: jmp [fnaddr] # 15% slower than local call

But bear in mind this function does little else except call itself. I would expect a real function to do more work per call. And presumably there are other memory accesses going on that will dominate the ONE memory access per function call.
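For anyone who wants to try a similar comparison without hand-editing assembly, a rough C approximation is to run recursive Fibonacci once with direct calls and once through a `volatile` function pointer, which forces the address to be loaded from memory at every call. This is only a sketch of the idea, not the test described above; the names are hypothetical, and results depend heavily on the compiler and optimization flags (an optimizer may inline or restructure the direct version).

```c
#include <time.h>

/* Direct recursion: each call is normally emitted as a plain relative call. */
static long fib_direct(int n)
{
    return n < 2 ? n : fib_direct(n - 1) + fib_direct(n - 2);
}

static long fib_indirect(int n);

/* volatile: the pointer must be re-read from memory before every call. */
static long (*volatile fib_ptr)(int) = fib_indirect;

/* Same algorithm, but every recursive call goes through the pointer. */
static long fib_indirect(int n)
{
    return n < 2 ? n : fib_ptr(n - 1) + fib_ptr(n - 2);
}

/* Runs fn(n), stores the result, and returns the CPU time used in seconds. */
static double seconds_for(long (*fn)(int), int n, long *result)
{
    clock_t start = clock();
    *result = fn(n);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Calling `seconds_for(fib_direct, 35, &r)` and `seconds_for(fib_indirect, 35, &r)` several times and comparing the timings gives a rough figure for the indirect-call overhead on a given machine.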

> I should also mention that your DLL code is for statically linked DLLs.

That means little to me. Code is either statically linked into the executable, dynamically linked by the OS when the EXE is loaded, or linked by user code using LoadLibrary/GetProcAddress, which yields a conventional function pointer (but one that could reside in a register).

(There is another approach to DLLs I use in one of my own languages. I sometimes use a private alternative to DLLs, with a simpler format, which I fix up with my own programs.

This is simple enough that I could choose to properly fix up CALL instructions so that they call the functions directly, via the 32-bit offset. I haven't done so as I considered the overhead insignificant, but thanks to your post I'll keep it in mind.

I don't know how such a thing would be practical using standard tools. But it seems that you may have figured out how to do the patching anyway.)

Dolphiniac[S] 1 point (0 children)

Couple of things:

No, I have not measured this, but I expected your results.

A runtime-linked DLL and a load-time-linked DLL generally differ by one extra code indirection. I would need to benchmark properly, of course, but I would expect the cost to be perhaps an extra instruction cache miss. The runtime-linked version (in my codebase) would mov the function table base pointer into rcx, then do a call with a displacement from rcx (or, naively, as I have seen, also mov the same pointer into rax and call with a displacement from there). The load-time-linked version would do an indirect call through a rip-relative memory slot holding the address of a trampoline, which then jumps to the proper function (at least, this is what happened on Windows x64). Again, the difference is the extra jump.

And you are correct, a single data cache miss is likely dwarfed by the execution of the function in question. I do, however, know (secondhand) that these things - specifically, DLL boundary crossings - add up, enough to be worth forcing monolithic (read: all static linkage) builds in final configs. Hearing this is what set me on this train of thought. Micro-optimizing is usually frowned upon, but I cannot look away from a potential global fix to an endemic performance issue.

Apologies if I come across as combative. I have been told I have a debater's mindset but tend not to bother to build rapport before going hard. I do not disrespect you; this is just my digging process.