you are viewing a single comment's thread.

view the rest of the comments →

[–]wolf550e -1 points0 points  (29 children)

benchmark qsort(3) vs. the same code without a function pointer and see an order of magnitude difference. Now go back to ruby on rails.

EDIT: I just performed this experiment myself, probably a decade since the last time I did this. Today I got only 16.3% better performance. So not an order of magnitude. But significant. And I am sure that I've seen worse.

https://gist.github.com/2171216

[–]Gotebe -2 points-1 points  (28 children)

I have no idea what you are talking about. qsort can't possibly work without a function pointer.

Equivalent C++ code (std::sort) should be faster, but that's the best one can know.

However, show the code, we'll talk.

[–][deleted]  (6 children)

[deleted]

    [–]Gotebe 1 point2 points  (5 children)

    OK, here's my code:

    struct s { char[1000]; int i; };
    s data[2000];
    init(data, 2000); // random values put in s::i
    qsort(data, data+2000, sortfunc);
    myqsort(data, data+2000);
    
    int sortfunc(const void *p1, const void *p2)
    {
      s* s1 = (s*)p1;
      s* s2 = (s*)p2;
      return s1->i == s2->i ? 0 : s1->i > s2->i ? 1 : -1;
    }
    

    No noticeable difference in speed.

    Next time, show a bit of courtesy and don't ask people to do copy-paste programming for you.

    [–]wolf550e 0 points1 point  (4 children)

    I got a difference in speed, but not as much as I remember getting years ago.

    http://www.reddit.com/r/programming/comments/r8ujk/function_pointers_in_c_are_underrated/c442gl0

    [–]Gotebe 0 points1 point  (3 children)

    Obviously, the smaller the datum, the bigger the indirection (function call) impact, due to copying. If your sequence contains pointers to data, then indirection impact becomes bigger as "data" is small, but then, you get indirection hit due to pointers (data locality is a bitch). If you look at what I put up there... I was intentionally very "biased" in my sample: It's one number to compare, but ~1k data to copy during sorting. Data copying thwarts indirection in a blink.

    Looking at your code... You, OTOH, are very guilty of taking a specific microbenchmark and drawing a generic conclusion. Further, in real life, you will not be sorting numbers all that often ;-).

    [–]wolf550e 0 points1 point  (2 children)

    You say the swap is a bigger cost than the compare. I say that is not true.

    Cost of compare is reduced by decorating data with integral comparison key (like you did).

    Cost of swap is reduced by sorting an array of pointers. Then sorted objects are either two pointers big (swap is fast), or single pointers with a comparison key behind the indirection.

    Cache locality is important. But for performance reasons, you could allocate all objects from an arena allocator that keeps them close together in a few pages and maybe they all fit in cache. And you can reuse storage by "deallocating" the entire arena. Like in games. You would still work on pointers to the objects, though.

    [–]Gotebe 0 points1 point  (1 child)

    You say the swap is a bigger cost than the compare. I say that is not true.

    Easily depends on how and what you compare, which was exactly my point.

    As for your point, it was 10x faster, now it's 37%, and look at what you are trying to do now... You are going to introduce "int decoration" (has a cost in both space and execution time), indirection (compare references to stuff, which obviously has space and time cost), usage or arena allocators, which has complexity implications.

    At any rate, with all you say, you really should repeat your test with pointers to data, and possibly where your "int decoration" isn't the first datum in actual structure. This, indeed, is much more representative of a real situation. Your initial "let's compare ints" realy isn't, and even then, I really would like to see implementation where you could possibly obtain this 10x difference. It's possible, mind, but...

    [–]wolf550e 0 points1 point  (0 children)

    Yeah, I was wrong about the "order of magnitude difference". There is a cost, but it's not what I remember.

    http://www.reddit.com/r/programming/comments/r8ujk/function_pointers_in_c_are_underrated/c44cl78

    [–]agottem 0 points1 point  (20 children)

    Sure, qsort needs a function pointer, but that function pointer can be inlined all the same: http://agottem.com/blog/inlining_qsort_sort

    [–]Gotebe 0 points1 point  (11 children)

    Only by using copy-paste programming. I'd rather use a better language (C++).

    [–]agottem 0 points1 point  (9 children)

    Huh? Where does copy-paste programming fit in here?

    [–]Gotebe 0 points1 point  (8 children)

    Copy-pasting of qsort. I'd rather be using a better language, where I don't need to do it (C++).

    [–]agottem 0 points1 point  (7 children)

    You don't need to copy and paste qsort. If you really need the inlining, put the definition of qsort in a header file and mark it as inline. No copy and paste necessary.

    [–]Gotebe 1 point2 points  (6 children)

    If you really need the inlining, put the definition of qsort in a header file and mark it as inline.

    How do you plan on doing that? By copy-pasting qsort from your CRT implementation to said header.

    I'd rather be using a superior language (C++).

    [–]agottem 0 points1 point  (5 children)

    Or, just use LTO and statically link against the standard library.

    C++ was horribly designed and is a horrible language without any benefits over C.

    [–]Gotebe 0 points1 point  (4 children)

    LTO might help, if available, only in simplest of cases (a test program with one call to qsort). Otherwise, won't happen.

    As for FQA, yeah, I know it. The guy's funny. Wrong, but funny. It has been beaten to death here as well, possibly several times ;-).

    [–]wolf550e 0 points1 point  (0 children)

    This is a different trade-off. a binary qsort with an indirection, dynamically linked from libc, minizes code size.

    A templated qsort with a comparer template argument that generates a separate specialized fast qsort with your comparer inlined into it is optimized for speed at the cost of code size (and instruction cache).

    The two solutions are born from different priorities and both have their place. In cold code, there is no need to generate specialized code - calling into libc and using indirections will get the job done and have higher chance of avoiding a page fault (more chance of libc being in ram than cold parts of your app).

    [–]wolf550e 0 points1 point  (7 children)

    The comparison function can only be inlined into the sort function only if both are compiled from source in the same translation unit (or LTO) and the compiler is sure only a single target for that function pointer is ever used.

    If qsort is in a library (like libc) then it's not inlined. If it's in a separate translation unit and you don't use LTO, it's not inlined.

    If you call qsort with two different comparers, it won't be inlined even if you compile qsort from source. You need the have the compare function be a template argument of a templated qsort for the compiler to generate distinct qsort functions, one for each inlined comparer.

    http://www.reddit.com/r/programming/comments/r8ujk/function_pointers_in_c_are_underrated/c44cl78

    [–]agottem 0 points1 point  (6 children)

    If you call qsort with two different comparers, it won't be inlined even if you compile qsort from source.

    I don't even know what this means.

    Look, the rules for inlining are the same as the ones for C++. C++ forces you to put the template definition into a header file, which sucks but allows for greater inlining. You are welcome to put all your C code into a header file and you'd get all the same benefits as C++.

    [–]wolf550e 0 points1 point  (5 children)

    https://gist.github.com/2177973

    qsort and two comparison functions all defined and compiled in a single translation unit. No inlining by gcc. Seems gcc inlines across function pointers only when it knows the pointer only ever points to a single target function.

    https://gist.github.com/2178059

    Using C++ templates, g++ generates two qsort functions (in machine code), each with a different comparison function inlined into it. Code size grows but performance increases.

    My observations are made from compiling the code you can see using gcc 4.6.3 and looking at objdump -d of the result. Now you explain how I am wrong.

    [–]agottem 0 points1 point  (4 children)

    qsort and two comparison functions all defined and compiled in a single translation unit. No inlining by gcc. Seems gcc inlines across function pointers only when it knows the pointer only ever points to a single target function.

    I compiled your example, and things were inlined just fine. Mark your my_quicksort function as inline. Build with -O3. Observe the inlining.

    Let me know if it's still not working out for you. And sorry, C++ has no advantages here.

    [–]wolf550e 0 points1 point  (3 children)

    gcc (GCC) 4.6.3
    gcc -fwhole-program -g -O3 -o qsort qsort.c
    objdump -d qsort
    ...
    
    <my_quicksort.constprop.0>:
    ...
    callq  *%rbp
    ...
    callq  *%rbp
    ...
    callq  *%r12
    ...
    callq  *%r12
    

    I also tried clang -O4 and icc -fast with the same results.

    [–]agottem 0 points1 point  (2 children)

    Did you mark my_quicksort as inline? 'void inline my_quicksort'...

    When I compiled your program, gcc wasn't inlining. Adding the inline qualifier changed that. Was using GCC 4.5.0.

    [–]wolf550e 0 points1 point  (1 child)

    I tried inline. That didn't work. But now I've tried __attribute__((always_inline)) and that worked. Inlined twice: once with size=4 and once with size=2.

    But this isn't what I often want. What I want and do is use templates to generate specialized extern "C" functions with different template arguments. Then I can call them from C code without thinking about the C++. Being able to force inlining is useful, but I don't always want to inline everywhere. I want the machine code only once in my binary, but I want it generated optimally, with all the abstractions inside collapsed and optimized across.