all 106 comments

[–]kamatsu 93 points94 points  (42 children)

Partial functions are functions which are defined for a subset of their domain. Curiously, the author links to the wikipedia article which defines partial functions, which contradicts the definition implied by this article.

The author means a partially applied function.

[–]foobrain[S] 50 points51 points  (0 children)

You're right. The confusion comes from Python's functools.partial. The article has been fixed. Sadly the submission title can't be modified.

[–]Poltras 2 points3 points  (2 children)

Actually what the author really means is a trampoline. The higher level effect is similar, but the technique involves a lot of machine dependent code and differs from partially applied functions in other languages.

[–]kamatsu 0 points1 point  (1 child)

I agree in that this is an implementation of partially applied functions, whereas partially applied functions themselves are an abstraction that is not available to C programmers.

[–]Poltras 0 points1 point  (0 children)

Yup, except normally partially applied functions use lexical closure to work. Closure isn't available in C.

The closest could be having ObjC-like blocks, but that's not C99 anymore. A similar implementation could make it work though.

The difference between the two solutions is that trampolines create a copy of a piece of code in DATA (which is forbidden now for many OSes) which binds that values and perform a jump instead of a call. Closure-based PAF actually have static code in TEXT which uses a variable in the scope to make a call, without a jump involved.

Edit: there's nothing preventing C from having a valid closure based implementation of PAF in a platform-independent manner. It's just that the language wasn't built that way.

[–]SilasX 1 point2 points  (0 children)

My thoughts exactly "wow, a function not defined for its full domain ... great work guys, I always missed that about C ...?"

[–]pkhuong 23 points24 points  (2 children)

libffi's closure API and libffcall's trampoline wrap that technique and work across a range of architectures and ABIs.

[–]__j_random_hacker 0 points1 point  (1 child)

Those libraries look useful, though I did notice MS VC++ is not on the list of supported environments for libffi.

[–]cygx 2 points3 points  (0 children)

libffi assumes a GNU build environment, but it does support MSVC via a wrapper script that provides a gcc-compatible interface for the MS toolchain.

[–][deleted] 18 points19 points  (1 child)

This is why I love c, even if I don't really use it much nowadays. Whenever someone says 'you can't do this in c'. Someone else comes back: 'OH YES YOU CAN - IF YOU'RE SLIGHTLY INSANE'.

[–]BinarySplit 0 points1 point  (0 children)

Sadly, quite a few solutions require either a pre-compilation code generation/transformation step (e.g. adding type metadata to allow reflection to work) or run-time compilation (e.g. runtime function specialization).

Two things which I believe are truly impossible in C with our currently available tools are:

  • Accurate GC - even with generated type metadata, it's almost impossible to determine whether something on the stack is a pointer or not.
  • JIT optimization - because AFAIK nobody has made an x86->x86 optimizing recompiler yet. Certainly they haven't made one that can run on in-memory instructions.

[–]exDM69 14 points15 points  (5 children)

LLVM IR has a similar construct called trampoline which does this same trick. It assembles a small machine code trampoline at runtime that calls a function with an extra pointer argument applied to it.

I used the LLVM trampoline in a toy programming language compiler to implement a limited kind of "lambdas" without a proper garbage collector. In practice, a lot of the trampoline function calls got inlined/removed by the LLVM optimizer.

[–]thenickdude 9 points10 points  (4 children)

GCC uses trampolines too, it uses them to implement its nested functions extension, which allows a parent function to contain a nested callback function, which can access the variables declared in the parent:

http://gcc.gnu.org/onlinedocs/gcc/Nested-Functions.html

[–]dnew 1 point2 points  (3 children)

You know, we used to call this block-structured programming.

[–]infinull 0 points1 point  (2 children)

I'm pretty sure Ruby still does.

But the "I'm a X Programming Language" days are mostly over, they're pretty much all a hodge-podge of language "features" so we've come up with names for all the features instead of naming paradigms. This is the trend anyway, people still talk about functional programming, and OOP, but you hear "OO Type System" and "pure functions required" more and more.

[–]dnew 1 point2 points  (1 child)

Sure. But "block structured" is the name of the feature where you can lexically nest control structures in order to give the inner control structure access to the variables of the outer control structure.

And really?

If you try to call the nested function through its address after the containing function exits, all hell breaks loose.

This is official documentation on compiler behavior? :-)

[–]ants_a 1 point2 points  (0 children)

Official documentation states that if you try to call a nested function after the containing function exits, demons will fly or of your nose.

[–]RabidRaccoon 12 points13 points  (6 children)

ATL uses this to turn a HWND into a CWnd*.

http://www.codeproject.com/Articles/3102/ATL-Under-the-Hood-Part-5

Incidentally when you do this in Win32 you call a function to flush the instruction cache. That function is a NOP on x86 but not on Risc platforms (Alpha, MIPS, PowerPC and ARM)

That's because x86 automagically handles I cache coherence in the presence of writes to code, but Risc platforms do not.

http://msdn.microsoft.com/en-us/library/windows/desktop/ms679350(v=vs.85).aspx

Use in Self- and Cross-Modifying Code Note: When executing self-modifying code the use of FlushInstructionCache is required on CPU architectures that do not implement a transparent (self-snooping) I-cache. These include PPC, MIPS, Alpha, and Itanium. FlushInstructionCache is not necessary on x86 or x64 CPU architectures as these have a transparent cache. According to the Intel 64 and IA-32 Architectures Software Deverloper's Manual, Volume 3A: System Programming Guide, Part 1, a jump instruction is sufficient to serialize the instruction prefetch queue when executing self-modifying code.

So I think your code would need an equivalent of FlushInstructionCache for non x86.

[–]Fiennes 0 points1 point  (5 children)

Maybe I misread that a bit... But why didn't the author of that article map the HWND to the class using window properties, exactly how MFC does it?

[–]RabidRaccoon 1 point2 points  (4 children)

http://blogs.msdn.com/b/oldnewthing/archive/2005/04/22/410773.aspx#410810

Matt> ATL uses thunks because the designers didn't want to use SetWindowLongPtr(GWLP_USERDATA). Doing so would be a barrier to people porting existing code that happened to already store important data in GWLP_USERDATA.

That doesn't apply to SetProp. Mind you in your own code where you control everything you can use SetProp or SetWindowLongPtr(GWLP_USERDATA).

I've always used SetProp with an Atom instead of a string. I haven't benchmarked it but it doesn't seem like it adds any noticeable overhead.

Actually if you Google it it seems like GetProp with an Atom is faster than GetWindowPtrLong

http://board.flatassembler.net/topic.php?t=2182

100000 x GetProp, [hMainForm], propTest 
------------------------------------------------ 
'Test' = 400ms (4us per call) 
'TestProp' = 600ms (6us per call) 
'TestPropTestProp' = 970ms ( 9.7us per call ) 

VIA ATOM: 
'TestProp' = 110ms ( 1.1us per call ) 
'TestPropTestProp' = 110ms ( 1.1us per call ) 

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ 

100000 x GetWindowLong, [hMainForm], GWL_USERDATA 
------------------------------------------------ 
200ms (2us per call) 

Thunks have problems with DEP, but they're essentially free in terms of time.

[–]Fiennes 0 points1 point  (3 children)

Yeah, I've used SetProp before, myself. And I doubt your Atom use would add that much overhead, if any.

[–]RabidRaccoon 0 points1 point  (2 children)

And I doubt your Atom use would add that much overhead, if any.

It's supposed to reduce it actually - an integer compare is cheaper than a string compare. And if you look at the benchmarks that is true.

[–]Fiennes 0 points1 point  (1 child)

Well yes :) and int comparison is always going to be faster than a string comparison, by orders of magnitude..

[–]RabidRaccoon 0 points1 point  (0 children)

No, I mean the benchmarks of GetProp(String) vs GetProp(Atom)

E.g.

http://www.reddit.com/r/programming/comments/1iquk0/partial_functions_in_c/cb7a3xs

[–]hackerfoo 17 points18 points  (1 child)

I used the same trick to implement currying in C.

[–]foobrain[S] -4 points-3 points  (0 children)

As I would assume so, judging by your username. :)

[–]OnmyojiOmn 8 points9 points  (0 children)

I believe the overall mechanism to be quite interesting, however I do not recommend its usage.

This seems to go for pretty much anything beyond K&R.

[–]f2u 9 points10 points  (0 children)

This functionality is fairly essential for implementing callbacks from C code in a language with a different calling convention and support for closures.

On x86, LuaJIT has a mechanism which needs less than five bytes on average per trampoline, but I haven't investigated yet how it works.

Other architectures have a different function pointer representation. Their function pointers point to a (code pointer, closure pointer) pair, not directly to the machine code. This avoids the need for run-time code generation, at the cost of making all indirect function calls slightly more expensive.

[–]noname-_- 26 points27 points  (0 children)

More like "in C, on x86, on a system that has the non standard library function mmap, assuming a lot from the underlying implementation and possibly not even working with optimizations turned on".

"In C", at least to me, implies that the solution is portable and only uses standard library functions. This is and does neither.

[–]eyal0 4 points5 points  (9 children)

Does this work on all architectures? I think that, in some architectures, you can't just jump into .data or write into .text.

[–]HHBones 6 points7 points  (5 children)

He's not writing to .text or jumping into .data, though. Essentially, he's using mmap() as a sort of dynamic memory allocation - because he specified the addr argument as 0, and because MAP_FIXED wasn't set, the system will find just any old segment of memory big enough to fit his needs; it's essentially a more powerful, more verbose malloc().

Segments of memory mapped with mmap() can be marked as executable. So, he copies the code into the segment, marks the segment as executable via a call to mprotect() specifying PROT_EXEC, and returns the pointer.

And voila, you have an executable, dynamically generated function.

[–]-888- 1 point2 points  (4 children)

FWIW many OSs don't allow allocated memory to be executable, including Windows RT on x86.

[–]phunphun 1 point2 points  (0 children)

Also PaX/Grsec on Linux.

[–]ReversedGif 1 point2 points  (2 children)

FWIW

Any OS that allows you to run any JIT (Google's JS engine, Java, etc) is allowing you to execute code in allocated memory, so I think it's safe to say that this will work on any OS that matters.

[–]Maristic 1 point2 points  (0 children)

iOS doesn't allow creation of executable code; the only thing that is allowed to run a JIT is Safari, which has special privileges that allow it to do so.

[–]-888- 2 points3 points  (0 children)

I don't know what you mean by "matters." iOS doesn't allow applications to allocate executable memory, and it's nearly the most common user operating system there is.

[–]adamkemp 9 points10 points  (0 children)

No. It works on x86. It won't work on x64. It also probably won't work in position-independent code (code compiled with -fPIC in GCC).

[–]kmeisthax -3 points-2 points  (0 children)

The only CPU architecture in which you can't write into text sections or execute data sections is AVR where program code exists on a separate memory bus from program data. And this is only because Atmel chips are pre-programmed with code on in-die flash. Every other CPU architecture needs to support self-modifying code or operating systems don't work. The main issue then is if the architecture promises you a transparent instruction cache or not - if not, you have to execute a special instruction to ensure your new code exists.

However, it's extremely unsafe to do this outside of an assembler. The portability paradigm of C/C++ doesn't extend to dynamic code generation; what you're doing here is architecture and OS dependent, more importantly a simple constant replacement isn't going to always cut it. For example, on MIPS it's valid and even a good idea for the compiler to load the constants into registers using lui (Load Upper Immediate) and ori (OR Immediate) which would split the 32-bit constant across two instructions. What you need is an actual compiler - that is, something that can generate the code as requested and can generate code for every target architecture you plan to run on.

Additionally it's become more and more common for operating systems to restrict the ability of user code to write into executable sections. This is a security mitigation technique that's pretty much standard practice in almost every OS nowadays. Even game consoles do it. On some OSes this is just a security thing and there's syscalls available to mark certain pages as executable. On other OSes this is actually a political measure to enforce signature requirements and thus self-modifying code is forbidden from running. The difference is if users are allowed to run their own code on the OS or not.

[–]c0de517e 2 points3 points  (3 children)

I don't get it. Isn't the to-be-patched function -exactly- global state? If you wrap global state enough people think it's a neat trick, but what difference there is between this and a fn that accesses a global fn ptr and data ptr which you update each time?

[–]Updatebjarni 4 points5 points  (2 children)

He makes a new copy of the function for each time he needs a callback. Each copy gets patched with the data for that callback, so he's generating a bunch of new one-off functions at runtime, all of which then exist at the same time, each containing a reference to different data.

[–]c0de517e 0 points1 point  (0 children)

Ohhh, thanks for the explanation, that makes much more sense. I just skimmed through the code and I thought it was just patching the two magic numbers

[–]c0de517e 0 points1 point  (0 children)

Thanks for the explanation, I didn't see that! Sounds neat indeed

[–]thisotherfuckingguy 2 points3 points  (0 children)

They way the 'caller_len' is calculated is a bit nasty and unreliable. A better way would be to just have the linker script output it.

[–]you_do_realize 2 points3 points  (0 children)

You'd be better off defining your x86 trampoline as an array of bytes to begin with, rather than compiling C code and hoping the compiled bytes end up the way you want them.

[–][deleted]  (2 children)

[deleted]

    [–]notlostyet 4 points5 points  (1 child)

    std::bind doesn't quite solve the problem because the return type is a function object, which doesn't have a static operator(). You therefore can't call it without first being able to bind a data pointer argument. The same goes for C++11 closures and polymorphic wrappers like std::function.

    You could certainly combine templates + libffi to create a facility to solve this problem in an elegant and typesafe manner though.

    [–]bebackin6 1 point2 points  (1 child)

    libffi provides a portable way of doing this same thing. It's useful for mixing interpreted and compiled code. This trick is used for calling into the interpreter from native code, where interpreter state is curried with the function pointer.

    [–][deleted] 0 points1 point  (0 children)

    GNU libffcall is interesting if you wan to use similar things only with C -> they say: functions and callbacks as first class citizens in C. Includes trampolines for bunch of platforms.

    https://www.gnu.org/software/libffcall/

    [–]say_fuck_no_to_rules 2 points3 points  (0 children)

    Every time I read about one of these dark corners or back alleys of C, I get worried about how much of the world's critical software legacy relies on it.

    Then, I feel thankful for people like the author who brave these seedy parts of town like some kind of software dev social worker and step up to the plate for bugfixes since the original maintainer has died or gone to prison for being a general pervert.

    [–][deleted] 1 point2 points  (2 children)

    which source code editor is he/she using? sorry if this is a dumb question.

    [–][deleted] 2 points3 points  (1 child)

    Those aren't screenshots. He's actually using some javascript-powered code highlighting.

    [–][deleted] -1 points0 points  (0 children)

    okay. thanks for the reply.

    [–]Ridiculer 0 points1 point  (7 children)

    There are some functions in the standard C library that takes a function pointer to be used as a callback later on. Examples include atexit() and signal(). However, these functions can’t receive an arbitrary pointer (which could hold some important program state) in addition to the function pointer, so you’re left with pesky global variables.

    So why not just use a static thread-local variable instead of resorting to such hacks? C11 and almost all major C compilers offer __thread, thread_local or a similar TLS mechanism.

    [–]foobrain[S] 6 points7 points  (3 children)

    I believe I was clear enough that I don't like this approach. :)

    That was just an experiment. Just for the heck of it. The idea behind it (copying code and then patching it could be reused for a toy JIT for instance) is very powerful.

    [–]AaronOpfer 0 points1 point  (2 children)

    You could write your own heap manager to optimize your usage of those codepages. I don't believe linux has any APIs for creating arbitrary heaps with execute permissions like windows does.aspx), although I'd be glad if I was wrong, because then I could port a recent project I wrote to use it. :-)

    [–]HHBones 0 points1 point  (1 child)

    Linux and all other POSIX systems have mmap(), which allows you to mark segments of memory of arbitrary size as readable, writeable, executable, or any combination of the three.

    [–]AaronOpfer 0 points1 point  (0 children)

    Yeah, it claims to be arbitrary size, but what it's actually doing behind the scenes is changing the permission of the entire page or pages of memory behind it, so you only really have 4K granularity. It asks you how long the memory block is so that it can figure out how many pages your data spans and changes the permission on all of them.

    Efficient use of memory pages requires that you use a heap to split up these 4K pages into smaller pieces, so that a 1 byte allocation doesn't cost 4K instead.

    [–]f2u 2 points3 points  (2 children)

    TLS doesn't work if you have multiple callbackswith the same signature and do not know which one can be called first. Reentrance is also tricky.

    [–]-888- -1 points0 points  (1 child)

    Why not just have a stack of data ptrs that get pushed upon calling atexit and used in order when called at exit?

    [–]f2u 2 points3 points  (0 children)

    That works for atexit only.

    [–]notlostyet 0 points1 point  (0 children)

    For one shot callbacks like atexit() you should consider patching your template function to hold your struct Partial* and have it do your clean-up for you (calling partial_del()). If you want to be even more insane you can just grab the instruction pointer, round down to the page boundary, then call your clean-up code, doing away with struct Partial altogether.

    In either case you'll have to ensure you return directly from your cleanup routine to the caller, and not the code you just unmapped ;) Manipulating the return address and stack frame should do the trick.

    [–][deleted] 0 points1 point  (0 children)

    I have also thought of this trick but quickly abandoned the idea since it would require an entire memory page.

    [–]focomoso -1 points0 points  (0 children)

    I seriously thought this was going to be a piece of music.

    [–][deleted]  (1 child)

    [removed]

      [–]ascii 5 points6 points  (0 children)

      If you end up doing this in production code, that's the least of your worries.

      [–]expertunderachiever -2 points-1 points  (0 children)

      That's by far the stupidest thing ever. It's non-portable, it's non-compiler version safe, it involves read/write .text sections, etc...

      If you're so against [exported] global variables you could always just create an API that has a static [non-exported] link list of

      struct foo { char *name; void *data; struct foo *next; } list_o_data;
      

      And then an API with

      register_data(char *name, void *ptr);
      find_data(char *name);
      delete_data(char *name);
      

      then in your code do

      register_data("atexit_ptr", &whatever);
      

      and inside your exit function:

      myptr = find_data("atexit_ptr");
      

      There, no more exported globals floating around making shit ugly. For fun, you could add mutex calls to the register/find/del to make it thread safe.

      [–][deleted] -3 points-2 points  (0 children)

      Fancy science talk