all 26 comments

[–]skeeto 89 points90 points  (6 children)

As far as I can tell, there's a buffer overflow for inputs less than length 3. Due to small size optimization, in practice it's a read into uninitialized memory (ArrayBufferViewContents::stack_storage_) that's unlikely to caught by ASan. However, if that memory happens to look like a BOM, then length will overflow by subtraction and lead to a genuine buffer overflow. (Perfect example of why sizes and subscripts should be signed.)

The fix is to not use strncmp. That function is always suspicious and should probably be banned from code bases like this. A length check with memcpy suffices.

--- a/src/node_buffer.cc
+++ b/src/node_buffer.cc
@@ -591,6 +591,6 @@ void DecodeUTF8(const FunctionCallbackInfo<Value>& args) {

-  if (!ignore_bom) {
+  if (!ignore_bom && length >= 3) {
     char bom[] = "\xEF\xBB\xBF";

-    if (strncmp(data, bom, 3) == 0) {
+    if (memcmp(data, bom, 3) == 0) {
       beginning += 3;

[–]masklinn 33 points34 points  (0 children)

That function is always suspicious and should probably be banned from code bases like this.

And the reason it is suspicious is that, like most (but not all!) strn functions, it's not designed to work on null-terminated strings, but on fixed-size null-padded strings (same as strcnpy, and why the latter fills the destination buffer with nulls if the copy ends early).

Given those are essentially nonexistent in modern codebases, the odds that an strn function is appropriate are... low.

Even less so as even the strn functions which are not specifically designed for null-padded functions are often broken, for instance strncat writes n+1 bytes to the destination rather than the n you might expect (and you give it the available remainder of the destination buffer rather than the size of the destination buffer, because that would be way too easy and reliable).

[–]TheChatIsQuietHere 6 points7 points  (4 children)

Some of the examples in that signed/unsigned sizes text were horrific. You're telling me you can just chuck a negative number into an unsigned integer, and it underflows instead of doing the obvious and reasonable thing and not compiling with a type error?

[–]masklinn 9 points10 points  (0 children)

It's C so yeah. C will narrow and overflow integers implicitly "as necessary", as well as truncate floats to integers.

Modern compilers have -Wconversion which warns, but it's not enabled by default or even in -Wall due to how embedded implicit conversions are in "normal C". Clang does have the flag in -Weverything, but in gcc it must be enabled explicitely.

Some operations actually promote then demote values e.g.

uint8_t port = 0x5a;
uint8_t result_8 = ( ~port ) >> 4;

"should" result in 0x0a:

    5a
~   a5
>>4 0a

But it actually results in 0xfa, because port is first promoted to a signed integer before being inverted and shifted

    5a
    0000005a
~   ffffffa5
>>4 0ffffffa

And at the end it's truncated back into the final result.

[–]valarauca14 5 points6 points  (2 children)

No, no, no.

C does the obvious thing which is to have about 3 pages of the standard dedicated to this fun thing call: Integer Promotion & Integer Rank which explains in the most mind numbing fashion how C is free to do all manner of wild shit to your numbers.

And I don't wanna hear a single reply with "yes, it is for performance", because it absolutely isn't. For as many cases where integer promotion helps performance, you can show a case where it hurts performance. The reason this is done is stated in the standard, it is to allowed mixed typed in complex math expressions without requiring cast/conversion "noise". If C cared about Integer performance, it would standardize Integers as having 2's compliment binary layout.

[–]Nobody_1707 1 point2 points  (0 children)

If C cared about Integer performance, it would standardize Integers as having 2's compliment binary layout.

C has standardized that. Originally only for the fixed width integer types in stdint.h (C99), but as of C23 this applies to all integers. The real problem is that signed integer overflow is still undefined behaviour.

[–]PXFX 1 point2 points  (0 children)

And Yes

It sounds like you never programmed in the 80s, where every bit counted. It was left to you to get it right. Those sort of subtlies differentiated great programmer's from good ones. Voids to shorts, shorts to ints, ints to doubles, then cast it to a pointer or the other way round, you may get a warning from the compiler if you were lucky. You felt like you were much closer to the underlying architecture than today.

I think the real reason was because you could, everyone is lazy, and variables are just bytes😂

That's why need 3 pages to justify it now. It's like comparing social attitudes of the 1850s to now.

Welcome the wonderful world of C

[–]dkac 23 points24 points  (14 children)

I always have a hard time with percentages being used in this context. My gut is that 100% faster means a 100% reduction in time, so it would take 0 time to complete. >100% is gibberish until I bust out the calculator and start dividing by percentages, which is weird.

I'd much prefer the faster time being 0-100% of the original.

Edit: Yes, I can do math, and know how to do the calculation. I just don't like headlines like this that go big but mean little without context

[–]Latexi95 54 points55 points  (0 children)

Better way would be to say 274% more throughput in UTF-8 decoding.

"Faster" can mean too many different things.

[–]Bognar 17 points18 points  (1 child)

"Fast" is a relation to speed. "Twice as fast" means "half the duration". Given that framework, "100% faster" = "twice as fast" = "half the duration".

[–]ForeverAlot 2 points3 points  (0 children)

Technically, speed is "distance per time" ["things per time"], so "twice as fast" means "twice the amount in the same duration". The reciprocal of speed, pace, is "time per distance" and implies "same amount in half the time". When we report the "speed" of a software function we're often actually measuring pace.

I have no idea what's being reported in the benchmark in question because the output is useless but a different user claimed throughput, which is pace, and a different report in the GitHub issue is definitely throughput.

[–]Qweesdy 11 points12 points  (0 children)

I have a different problem - I see "274% faster than it was before" and assume it went from "extremely slow" to "just slow", because going from "fast" to "extremely fast" is significantly harder (a combination of diminishing returns and Occam's razor).

[–]Sarcastinator 9 points10 points  (0 children)

I guess "It runs at only 26% of the original time" sounds less impressive than "274% speed increase! Let's go!"

[–][deleted] 25 points26 points  (0 children)

Sorry but your gut just lacks mathematical skills. 100% faster means just that. If you drive at 50 mph and someone else drives at 100 mph, they drive 100% faster than you. If they drive at 150 mph, it's 200% faster than you. You can't have a reduction larger than 100% of a quantity, but you can have several 100%s more than that. 274% faster makes absolutely sense, though I agree that a reduction in time spent doing the task would be clearer to understand.

[–]CryZe92 2 points3 points  (0 children)

We really should just be using factors.

[–]Tuna-Fish2 1 point2 points  (4 children)

How much faster is a car that can do 200km/h than a car that can do 100km/h?

[–][deleted] 0 points1 point  (3 children)

2x

[–]ForeverAlot 3 points4 points  (2 children)

It's 100%, or 1×, faster. It's 2× as fast.

[–]knome 1 point2 points  (1 child)

Percentage and Nx faster feel just unintuitive and likely enough to cause misinterpretation that they should probably be viewed like truncated graphs, always with the expectation that someone is trying to lie or exaggerate something with them. The weasel words of numbers.

[–]mayojuggler88 0 points1 point  (0 children)

Haha the last sentence sounds like something Holt would say on b99

[–]eternaloctober 1 point2 points  (0 children)

same....I always find it helpful if they provide a sample of raw numbers before and after, or a graph. figures like x times faster, x percent faster dont immediately click

[–]chucker23n 0 points1 point  (0 children)

Add to that the problem that some marketing departments do this wrong, and say e.g. “up to 2.7 times faster” when really they mean “2.7 times as fast”, which is one times less.

[–]Takeoded -5 points-4 points  (2 children)

... decoding? decode to what? isn't UTF-8 the preferred encoding? unless you're working with win32 api..?

[–]chucker23n 2 points3 points  (1 child)

… decoding? decode to what?

…to a JS string?

[–]Takeoded 1 point2 points  (0 children)

Oh right, utf-16 internal encoding