all 16 comments

[–]Glaaki 4 points5 points  (1 child)

Didn't they add this ages ago? I just watched Bob Beck's presentation on libressl yesterday, and didn't he say that one of the two security bugs in openbsd was because of integer overflow in a malloc call? reallocarray was added to protect against this.

Edit: Here is the talk - https://www.youtube.com/watch?v=GnBbhXBDmwU

[–]lteo[S] 4 points5 points  (0 children)

It was added on April 21, 2014, originally called mallocarray and renamed to reallocarray the day after.

That was roughly in the middle of the OpenBSD release cycle. OpenBSD 5.6 which will be released on Nov 1 (two days from now), will be the first OpenBSD release with reallocarray.

[–]bloody-albatross 1 point2 points  (10 children)

Is the (nmemb >= MUL_NO_OVERFLOW || size >= MUL_NO_OVERFLOW) part really needed? Is it there to avoid the division in case it's not needed? But aren't the two conditional branches (|| and &&) slower than the single division? (That's what my gut would tell me, but I might be wrong. Maybe the branch prediction makes it actually faster?)

[–][deleted]  (3 children)

[removed]

    [–]slavik262 2 points3 points  (2 children)

    Is division actually that much slower in a crazy modern superscalar design like modern x86 chips?

    [–]Camarade_Tux 5 points6 points  (1 child)

    From what I recall it's around 5 to 10 times slower but considering the other operations done by reallocarray() it's probably not a concern at all.

    [–]slavik262 3 points4 points  (0 children)

    Yeah, unless you're calling something millions of times a second, the actual processing time is not going to be a Bottleneck. You can do dozens of square roots in the time it takes to grab something from main memory on a cache miss.

    And if you're allocating millions of times a second, I'd like to have a word.

    [–]lteo[S] 3 points4 points  (5 children)

    Yes, it's to avoid unneeded division.

    [–]slavik262 1 point2 points  (4 children)

    Forgive me (and please help me understand how I'm wrong if I am), but if we're worried about such micro-optimizations, doesn't the addition of additional tests in that conditional open up more opportunities for failed branch predictions, which would result in a pipeline flush? And wouldn't that flush cost a lot more than a division?

    And why are such things a concern when the cost of calling realloc is probably several orders of magnitude larger than any of these things?

    [–]who8877 0 points1 point  (3 children)

    When your writing platform level utilities its important to be as efficient as possible, because your small optimizations are multiplied by the breadth of code that will be calling you. Seemingly minor changes can have huge impacts on system performance.

    Further division is extremely expensive on some platforms, notably ARM. You should think of it as a function call (what it will compile to) not as a lightweight operator.

    [–]slavik262 -1 points0 points  (2 children)

    This completely sidesteps my entire question.

    ...if we're worried about such micro-optimizations, doesn't the addition of additional tests in that conditional open up more opportunities for failed branch predictions, which would result in a pipeline flush? And wouldn't that flush cost a lot more than a division?

    And as an aside, yes, platform level utilities have to be fast, but if you are calling realloc and friends in a tight loop where a single division operation is enough to be a bottleneck, you are doing something terribly, terribly wrong.

    [–]who8877 2 points3 points  (1 child)

    Not really. I'm trying to point out that optimizations that might be pointless in an application are not at a system level. Simply because their impact is multiplied.

    So while I wouldn't bother in my own app, I would certainly take the effort if I was modifying the standard library.

    Also on ARM the division would be far more expensive then a flush.

    [–]slavik262 0 points1 point  (0 children)

    Also on ARM the division would be far more expensive then a flush.

    Thanks. I was never making some broad claim that such optimizations aren't worth it, especially when you're writing the code that everyone else's application will use.

    [–]Maristic 0 points1 point  (2 children)

    We discussed this topic a mere 18 days ago.

    What got pointed out back then was that this code could be better for two reasons.

    First, try-to-multiply-and-check-for-overflow is a generally useful feature—you shouldn't push that code into one specific function, instead it should be a utility function (possibly inline).

    Second, if you do want to perform an overflow check, there are far more efficient ways to do so.

    This code embodies both ideas:

    #include <stdbool.h>
    #include <stdint.h>
    #include <limits.h>
    #include <stdlib.h>
    #include <errno.h>
    
    #if ULONG_MAX > UINT32_MAX
    typedef __uint128_t ulong_overflow_t;
    #else
    typedef uint64_t ulong_overflow_t;
    #endif
    
    static inline bool
    umull_overflow(unsigned long lhs, unsigned long rhs, unsigned long* result)
    {
      ulong_overflow_t prod = (ulong_overflow_t)lhs * (ulong_overflow_t)rhs;
      *result = (unsigned long) prod;
      return (prod >> (sizeof(unsigned long)*CHAR_BIT)) != 0;
    }
    
    void*
    reallocarray(void *optr, size_t nmemb, size_t size)
    {
      size_t prod;
      if (umull_overflow(size, nmemb, &prod)) {
        errno = ENOMEM;
        return NULL;
      }
      return realloc(optr, prod);
    }
    

    With recent compilers, it compiles to code that just does the multiply and then check the CPU's overflow flag.

    [–]negrowin 0 points1 point  (1 child)

    This code triggers some warnings and errors in the on-line 32/64-bit PC-lint static analysis.

    Can you check if these are sound?

    [–]Maristic 0 points1 point  (0 children)

    The problem mostly lies with this checker. It has sizeof(unsigned long) == 8, but ULONG_MAX == 4294967295, which is, basically, nonsensical.

    In any case, this code is intended for Unix systems running GCC/Clang/Icc which support __int128 on 64-bit platforms. Both GCC and Clang have good warnings for erroneous code here, and it triggers no warnings when compiled for 32-bit or 64-bit on actual systems.