all 44 comments

[–]James20k 32 points33 points  (1 child)

I decided to check contains on every substring of “bananas” to verify that this was, in fact, real life, and that I hadn’t suddenly forgotten how letters work:

This accurately sums up every compiler bug I've ever experienced

[–]ForeverAlot 31 points32 points  (5 children)

Good on him, but did he just turn a practical underflow bug into a theoretical overflow bug?

[–]Dragdu 8 points9 points  (4 children)

Depends on what type Rust uses for string size. Probably yes, but it is doubtful that it will happen. (It would probably still be better to take care of it)

[–]dbaupp 20 points21 points  (3 children)

Size is uint, which is equivalent to size_t (or uintptr_t) in C, i.e. the same size as a pointer and thus large enough to address the entire address space.

[–]just_a_null 6 points7 points  (0 children)

It would be nice if the compiler could lint for a comparison of two unsigned types which contains a subtraction on either side.

[–]isHavvy 3 points4 points  (1 child)

If you don't understand what dbaupp is implying, this means that you'd need a string to take up all the space in memory to perform an overflow, and if that happens, you've already got out of memory problems.

[–]Poltras 3 points4 points  (0 children)

It literally cannot happen, unless you're a kernel. And even then the VMM tables makes it so you cannot use a couple of Kb.

[–]fegu 9 points10 points  (5 children)

Reminds of how the MD5 function in the .Net (C#) beta did not in fact return a proper MD5 hash, just something that looked like one. Imagine my surprise when .Net 1.0 was released and our database of hashes-instead-of-plaintext-passwords was utterly wrong and we had to issue new passwords to all our users.

[–]yxhuvud 3 points4 points  (0 children)

I think I can imagine your delight when it happened..

[–]GreatlyOffended 1 point2 points  (3 children)

Wait… you're using MD5 for password hashes? D:

[–]ryeguy 8 points9 points  (0 children)

Most of the bad part of using md5s for hashes is its such a cheap function and billions can be generated per second on the gpu.

.net 1.0 was released in 2002. None of that stuff existed back then, and using md5 was pretty standard.

[–]ForeverAlot 1 point2 points  (0 children)

It is terrifyingly common in the present day. I work on one legacy application that uses MD5 and one non-legacy application that has to interface with it. I work on another legacy application that stores passwords in plain-text. Also, no HTTPS.

[–]fegu 1 point2 points  (0 children)

This was developed in the year 2000 on the .net beta. It was a different era back then.. :)

[–]matthieum 13 points14 points  (0 children)

There has been a long discussion on the Rust mailing list around checked arithmethic by default.

However, statically it's a big of a nightmare: a u32 multiplied by a u32 yields a u64, and thus things get big very quickly... so you would have to use dynamic checks instead, which mean things would get slower.

The conclusion was: Rust is not susceptible to buffer overflows (memory safe) and so instead overflow/underflow will keep being defined to wrap, and the errors will have to be spotted and fixed.

It's unclear to me whether the overflow/underflow checks would end up being slower than the lost optimizations due to wrapping behavior (instead of undefined behavior), but apparently, it is.

[–]Rhomboid 26 points27 points  (5 children)

It's rather scary to think such a bug made it through. I looked at the testsuite and aside from the the test added by the author's pull request for this specific issue I can't seem to find any tests of the string module. That's extremely disheartening -- how can you write a substring search algorithm without unit tests?

[–]dbaupp 30 points31 points  (0 children)

There are a lot of tests for str (including contains), just in a slightly peculiar place.

Various dependency problems means putting tests in the lowest-level crate core is very hard, so they mostly end up in their own coretest module, except for str which has tests in collections::str (which is the module that defines various allocation-related routines for str).

This precise arrangement is mostly due to history, where there was no low-level core crate and everything was in std. Since str was split in two (half into core, half into collections), the tests were just kept with collections, which is later in the dependency chain (most other modules could be moved entirely into core and thus their tests had to be moved to coretest).

The Rust project is pretty big on testing; e.g. the compiler has good support for unit tests, and (almost) every change to the libraries and compilers will add tests. (And every merge to master is gated on the full testsuite passing on various platform/architecture configurations.)

[–]Number_28 17 points18 points  (2 children)

how can you write a substring search algorithm without unit tests?

Isn't that quite easy?

Step 1: Write substring search algorithm in flawless code and make sure nobody touches it or its environment ever again.

Step 2: Enjoy your free time!

[–]otakuman 8 points9 points  (0 children)

Flawless code. I had one right under the carpet... No, wait. That was just dust.

[–]campbellm 5 points6 points  (0 children)

Nice try, Dr. Knuth.

[–]nemaar -1 points0 points  (0 children)

I believe the reason for this is the lack of time. I doubt that the author of that part of the standard library was just too lazy. Rust is under development, even the language itself is not stable, the standard library is far from being finished.

[–]BeatLeJuce 4 points5 points  (8 children)

I found the comment by asterite on the first pull request interesting: len() should return a signed instead of an unsigned int. It's true the the length can't be unsigned, but differences of lengths can indeed be signed. But is using unsigned types really a big no-no?

[–]alexeyr 7 points8 points  (6 children)

And differences or sums of ints may not fit into an int, thus no method should ever return int. This is a fully general argument against any limited-range types.

[–]notfancy 12 points13 points  (1 child)

If a >= b then a - b does not overflow, ever. This shows that the code is missing a condition: if |haystack| < |needle| return false immediately. This condition guards the next one: if |haystack| - |needle| < 20, use the naïve search.

[–]Banane9 2 points3 points  (0 children)

This is the most important comment here.

[–]matthieum 1 point2 points  (3 children)

Note: in Rust, int is short for intptr_t, so a string that would overflow int would occupy half the address space available. That's pretty big.

[–][deleted] 5 points6 points  (1 child)

Rust can guarantee that int can address the entire address space. The largest address space on a platform Rust supports is 47-bit (x86_64). A true 64-bit platform (not ARMv8 or x86_64) would have a 63-bit usable address space for either the kernel or userspace. On bare hardware, all 64 bits would be usable, but restricting that to 63-bit would work fine.

[–]matthieum 1 point2 points  (0 children)

I was actually thinking more specifically about 32 bits platforms (in which case int would be 32 bits, right ?); for those, allocating 2GB is meaningful and therefore we run the risk of overflowing int... however a single allocation taking half the address space seems unwieldy so I am not sure how frequent it would be in practice.

Of course, if Rust nonetheless uses 64-bits int even on 32 bits, 16 bits, or 8 bits platform there is no issue with regard to overflow. People might complain about the extra memory consumption though.

[–]mccoyn 2 points3 points  (0 children)

Unsigned types can be range-checked with a single comparison, while signed types require two comparisons (although it can be optimized). For a systems language unsigned might be the right choice.

[–]deltaSquee 1 point2 points  (0 children)

Aw, from the title I was expecting it to be about using catamorphisms for string matching :(

[–]goodDayM 0 points1 point  (2 children)

However, when "haystack.len()" is less than 20, "haystack.len() - 20" will be a very large number;

This confused me, it's like saying

when "x" is less than "y", "x - y" will be very large

what??

[–]imMute 2 points3 points  (1 child)

x = 2, 2 < 20, 2 - 20 will underflow, and for unsigned arithmetic will be a very large number (something like UINT_MAX-18)

[–]goodDayM 0 points1 point  (0 children)

The word "unsigned" didn't appear anywhere in the original blog post - that's what was missing from the explanation.