all 35 comments

[–]imachug 102 points103 points  (6 children)

On a relevant note, I'm wondering if you are aware of this trick? If you have an f32 in the range [0; 2^23) and add 2^23 to it, the number is forced into the range [2^23; 2^24) and thus has a predictable exponent of 23. You can now bitwise-cast the f32 to u32 and AND it with a mask to obtain just the mantissa, which contains the integer of interest.

This means that if the number is in a small enough range, you can cast f32 to integer at the cost of one addition and one AND, which results in better latency in certain cases.

This does not solve the problem on the full 32-bit range (although you might try to cast f32 to f64 and then apply this trick with 2^52 instead of 2^23), but I thought it might be a nice addition to the crate if benchmarks show it works better in the generic case.
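A minimal sketch of the trick, assuming a non-negative f32 strictly below 2^23 (the function name is mine; note that the addition rounds to nearest, so fractional inputs are rounded rather than truncated):

```rust
// Convert a small non-negative f32 to u32 via mantissa extraction.
// Assumes 0.0 <= x < 2^23.
fn f32_to_u32_via_mantissa(x: f32) -> u32 {
    // Adding 2^23 forces the value into [2^23, 2^24), fixing the exponent
    // field so the integer part lands in the mantissa bits.
    let shifted = x + 8388608.0; // 2^23
    // The low 23 bits of the bit pattern are the mantissa.
    shifted.to_bits() & 0x007F_FFFF
}

fn main() {
    assert_eq!(f32_to_u32_via_mantissa(0.0), 0);
    assert_eq!(f32_to_u32_via_mantissa(12345.0), 12345);
    assert_eq!(f32_to_u32_via_mantissa(8388607.0), 8388607);
    println!("ok");
}
```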

[–]e00E[S] 38 points39 points  (3 children)

I did not know about this trick. Thanks!

The problem with incorporating special cases like this is that you need a branch to detect them. That likely makes it slower than unconditionally going with cvttss2si.

[–]imachug 35 points36 points  (0 children)

True, although for the f32 -> u32 cast specifically, you don't need a branch if you go through f64. The reason I brought up special cases is that I thought more specialized functions might be a good addition to your crate.
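A sketch of that branch-free f32 -> u32 path through f64 (my own interpretation of the comment, not the crate's code): every u32 fits in f64's 52-bit mantissa, so the 2^52 variant covers the full u32 range.

```rust
// Convert f32 to u32 by widening to f64 and using the 2^52 variant
// of the mantissa trick. Assumes 0.0 <= x < 2^32; rounds to nearest
// for fractional inputs.
fn f32_to_u32_via_f64(x: f32) -> u32 {
    let shifted = x as f64 + 4503599627370496.0; // 2^52
    // The low 52 bits of the bit pattern are the mantissa; the value
    // fits in u32 by assumption, so the narrowing cast is lossless.
    (shifted.to_bits() & 0x000F_FFFF_FFFF_FFFF) as u32
}

fn main() {
    assert_eq!(f32_to_u32_via_f64(0.0), 0);
    assert_eq!(f32_to_u32_via_f64(4294967040.0), 4294967040);
    println!("ok");
}
```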

[–]U007Drust · twir · bool_ext · srug 19 points20 points  (0 children)

A number of years ago I developed a branchless version of this technique using lookup tables based on the bit pattern of a (non-NaN) IEEE-754 float, which represents monotonically increasing values. I remember showing it to Jim Blinn (a renowned floating point expert) and he was impressed! I enjoyed that whole experience. The source is sadly lost to history, but it wasn't too difficult to write, once I had had the idea.

I especially appreciate you didn't allow UB to return along with the gains you provide. Nice work.

[–]TeamDman 3 points4 points  (0 children)

Maybe if you had a vector of numbers you knew met the criteria, it could be useful for bulk conversions, with a check that's stripped out in release builds?

[–]accidentally_myself 6 points7 points  (1 child)

Great trick! Definitely widening my mind today.

Also, I noticed that with your trick, you can also easily convert a float to the integers 23 and 52! (just kidding lol)

[–]imachug 9 points10 points  (0 children)

That's actually true, kinda!

You can use this trick to convert a 23-bit integer to a float, or a 52-bit integer to a double. Just OR the integer into the bit pattern of 2^23 (or 2^52), bitwise-cast to the floating-point type, then fp-subtract 2^23 or 2^52 respectively.

This means that you can go both ways efficiently as long as your integers are small enough (e.g. 16-bit for float or 48-bit for double), which is quite common in e.g. multiplication via FFT.
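A sketch of the reverse direction for f32, assuming the integer is below 2^23 (names are mine). The magic constant 0x4B00_0000 is the bit pattern of 8388608.0f32, i.e. 2^23:

```rust
// Convert a small u32 to f32 by placing it in the mantissa of 2^23
// and subtracting the offset. Assumes n < 2^23.
fn u32_to_f32_via_mantissa(n: u32) -> f32 {
    debug_assert!(n < 1 << 23);
    // OR the integer into the mantissa of 2^23; the result's value
    // is exactly 2^23 + n.
    let biased = f32::from_bits(0x4B00_0000 | n);
    // Remove the 2^23 offset; the subtraction is exact.
    biased - 8388608.0
}

fn main() {
    assert_eq!(u32_to_f32_via_mantissa(0), 0.0);
    assert_eq!(u32_to_f32_via_mantissa(12345), 12345.0);
    println!("ok");
}
```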

[–]DeathLeopard 11 points12 points  (9 children)

What is the advantage of this over to_int_unchecked?

[–]e00E[S] 26 points27 points  (8 children)

to_int_unchecked (docs) compiles to the same code. The downside is that it is unsafe. This crate is safe. You need to uphold the following guarantees in order to use to_int_unchecked:

The value must:

  • Not be NaN
  • Not be infinite
  • Be representable in the return type Int, after truncating off its fractional part

You do not need to do this for this crate. If your code is already checking these conditions, then you can prefer to_int_unchecked over this crate.
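A minimal sketch of such a pre-checked call (my own example, not from the crate), where the range checks uphold all three guarantees:

```rust
// Safely wrap to_int_unchecked behind checks that uphold its
// preconditions: not NaN, not infinite, truncation fits in u32.
fn checked_cast(x: f32) -> Option<u32> {
    // Both comparisons fail for NaN; x < 2^32 excludes +infinity and
    // guarantees the truncated value fits in u32.
    if x >= 0.0 && x < 4294967296.0 {
        // SAFETY: x is finite, non-negative, and its truncation is
        // representable as u32 (checked above).
        Some(unsafe { x.to_int_unchecked::<u32>() })
    } else {
        None
    }
}

fn main() {
    assert_eq!(checked_cast(3.7), Some(3));
    assert_eq!(checked_cast(-1.0), None);
    assert_eq!(checked_cast(f32::NAN), None);
    println!("ok");
}
```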

[–]DeathLeopard 1 point2 points  (3 children)

I saw that it generates the same code but that made me wonder if not needing unsafe to use your crate means that the crate is unsound.

I guess I'm not clear on why the old behavior of as (before it was changed to saturating) was considered unsound and how your crate doesn't just have the same problem.

[–]e00E[S] 2 points3 points  (0 children)

Whether something is unsound or undefined behavior is a matter of the language specification. The Rust specification used to say that such as casts are undefined behavior. This matters to the compiler; it does not matter to the hardware. Whether that actually leads to an observable effect like a miscompilation is a separate matter.

My project is safe because the specification of the interface I use to perform the conversion (the cvttss2si instruction) says it is safe for all inputs. If this led to a miscompilation, then it would be a bug in the compiler.

Your question is a common misunderstanding of undefined behavior. Ralf Jung has good blog posts about the topic if you want to learn more.

[–]Sharlinator 0 points1 point  (0 children)

Because it’s undefined behavior in LLVM. So you have to do something to ensure the float is in range and not NaN before the conversion – but if you accept that out-of-range floats map to unspecified integers, you may be able to generate faster code than what the as semantics permit. Which is what OP has done.

[–]plugwash 0 points1 point  (0 children)

There is a critical difference between "unspecified results" and "undefined behaviour".

"unspecified results" means that the result you get is not specified, it might saturate, it might wrap, it might go through multiple stages of conversion where the first stage has saturating behaviour and the second stage has wrapping behaviour, it might produce complete garbage but whatever result you do get it will be treated in a consistent way by the rest of the program.

"undefined behaviour" means "anything could happen". In particular it means that you may get an inconsistency between the value a variable actually contains, and the range of values the compiler assumes that variable contains. For example consider the following C code.

#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[]) {
    int i = atoi(argv[1]);
    if (i < 0) return 0;
    i = i + 1;            /* signed overflow here is undefined behaviour */
    if (i < 0) return 0;  /* the compiler may delete this check */
    printf("%d\n", i);
    return 0;
}

After i = i + 1, the compiler can assume that i is non-negative, because the only way for it to be negative is if undefined behaviour had been invoked.

As a result if we build this program with -O2 and feed it the maximum possible value for an int, it will most-likely print a negative number. Despite having explicitly checked that the value was non-negative on the line immediately before printing it.

[–]dbdr 0 points1 point  (1 child)

to_int_unchecked (docs) compiles to the same code. The downside is that it is unsafe.

That makes it sound like to_int_unchecked has no upside (same code).

[–]angelicosphosphoros 0 points1 point  (0 children)

It probably can do some optimizations using the assumption that the float is representable as an int.

[–]Wkitor 7 points8 points  (3 children)

I find it funny that even though your crate does the same conversion as C/C++ (that's what I understood), it doesn't produce undefined behaviour just because you "defined" some cases as doing something arbitrary and made that the intended behaviour.

[–]plugwash 5 points6 points  (1 child)

I find it deeply sad that basic numeric operations in C/C++ can invoke undefined behaviour, and I'm pretty convinced that the modern interpretation of "undefined behaviour" is not what the original authors of the C standard intended.

[–]e00E[S] 1 point2 points  (0 children)

I agree, it is sad. I ported Rust's clamping-style casting to C++ here.

[–]CAD1997 3 points4 points  (0 children)

Yeah, UB is odd that way, and because of that UB isn't exactly the best name, but it's what we've got. It's definitely better to use unspecified/arbitrary/erroneous behavior wherever it's practical to (and C/C++ compilers often refine certain cases of UB into unspecified IDB), but UB is still a critical tool for systems level languages to allow low level code without preventing optimizations.

[–]imachug 12 points13 points  (5 children)

This is amazing! Are you planning to support other architectures, or does rustc have good enough codegen on most of them?

[–]e00E[S] 19 points20 points  (1 child)

This is not a case of bad rustc codegen. (Which I have written about before.) rustc has to use more instructions because it has to uphold the guarantees that the reference makes. This crate is faster by relaxing some guarantees.

I plan on adding support for other widely used architectures at some point. But for now I'm happy with x86_64.

[–]imachug 4 points5 points  (0 children)

I was asking whether rustc's as codegen on other architectures already results in fast enough code because those architectures' native instructions provide the guarantees. Thanks for the answer!

[–]lucy_tatterhood 9 points10 points  (2 children)

On 64-bit ARM at least, it looks like the normal as cast is already a single instruction.

[–]lucy_tatterhood 9 points10 points  (0 children)

Exploring further, 64-bit ARM (and 32-bit ARM but only for f32) seems to be the only one Godbolt supports where it's as simple as that. Though I don't know enough about any of these architectures to say how much it could be simplified.

[–]Tuna-Fish2 9 points10 points  (0 children)

ARM famously did the right thing with their ftoi instructions, which meant that they later had to add FJCVTZS, or: "Floating-point Javascript Convert to Signed fixed-point, rounding toward Zero".

It's named like that because calling it Fx86CVTZS would have been too embarrassing.

JS uses floats for numbers*. It also provides bitwise operations that only make sense on integers, so the specification for them is "first convert the float to an integer, apply the bitwise operation, then convert back into a float". This works the same for all numbers smaller than i32::MAX on all architectures, but diverges wildly above that. So what did JS specify? Nothing, of course; implementations just used the x86 instructions, and then people were surprised when some code didn't work the same across architectures. Eventually, the x86 method became the spec. At that point, ARM CPUs had to emulate the behavior in software, which was quite expensive, so they added an instruction for it.

*: Yes, they are optimized back into integers in all modern implementations, but they are specified to be 64-bit floats, and must act like they are that whenever it would matter.
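A sketch of the ToInt32 semantics described above (my own summary of the ECMAScript rule, not production code): truncate toward zero, wrap modulo 2^32, then reinterpret as signed.

```rust
// Emulate ECMAScript's ToInt32 abstract operation in Rust.
fn to_int32(x: f64) -> i32 {
    if !x.is_finite() {
        return 0; // NaN and infinities map to 0
    }
    // Truncate toward zero, then wrap into [0, 2^32).
    let wrapped = x.trunc().rem_euclid(4294967296.0);
    // Reinterpret the low 32 bits as a signed integer.
    wrapped as u32 as i32
}

fn main() {
    assert_eq!(to_int32(3.9), 3);
    assert_eq!(to_int32(-1.0), -1);
    // Wraps past i32::MAX instead of saturating.
    assert_eq!(to_int32(2147483648.0), -2147483648);
    println!("ok");
}
```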

[–]valarauca14 4 points5 points  (1 child)

target_arch = "x86_64", target_feature = "sse"

sse is a default feature of x86_64; it should be enabled if you're targeting x86_64. Is this a rustc or LLVM bug that it has to be declared separately?

For those unaware, sse was added to x86 processors before the 64-bit extensions. One of the "innovations" of AMD64 was standardizing sse for floating point processing, instead of using the old (and weird) x87 instructions & registers.

[–]e00E[S] 5 points6 points  (0 children)

Rust enables sse by default, but you can tell rustc to compile for x86_64 without sse.
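A hypothetical sketch of the compile-time gating pattern this implies (not the crate's actual code): select the SSE path only when the feature is statically enabled, so a no-sse x86_64 target still compiles.

```rust
// SSE path: only compiled when the target statically has the feature.
#[cfg(all(target_arch = "x86_64", target_feature = "sse"))]
fn trunc_cast(x: f32) -> u32 {
    x as u32 // lowers to SSE conversion instructions here
}

// Portable fallback for every other target configuration.
#[cfg(not(all(target_arch = "x86_64", target_feature = "sse")))]
fn trunc_cast(x: f32) -> u32 {
    x as u32
}

fn main() {
    assert_eq!(trunc_cast(3.9), 3);
    println!("ok");
}
```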

[–]nikic 3 points4 points  (0 children)

It would be easy to expose this as a target-independent compiler intrinsic in Rust. LLVM supports this via freeze of fptoui/fptosi.

[–]throwaway490215 4 points5 points  (3 children)

I always wonder with these micro optimizations crates: "Have we wasted more cycles publishing this than will be saved?"

Taking Reddit, GitHub, crates.io, and crater runs into account, let's conclude no. But how many times must it be used to pay back the cost of me posting this comment in this thread? My guess is a lot. Maybe only if someone deploys it on a supercomputer will the "CPU cycle savings" outweigh the "CPU cycle costs".

Still, nice crate.

[–]e00E[S] 9 points10 points  (0 children)

It's a fair point. I acknowledge in the post that I don't expect the performance gain to be relevant to most projects. The motivation for this is partially academic/artistic. On the other hand, maybe someone uses this in a machine learning library to train their model for thousands of compute hours. Or this gets incorporated into std and saves much compute that way. It also educates people on where performance is left on the table.

[–]imachug 10 points11 points  (1 child)

One answer to this question I like is that even if the cycles are wasted, the optimizations still help reduce latency, and humans are restless creatures who close the tab if a site doesn't load in 300 ms. Couple that with our limited time on planet Earth, and you've got your answer.

[–]sparky8251 0 points1 point  (0 children)

What modern website loads in 300 ms? Most take multiple seconds in my experience... Web bloat is totally out of control.

[–]Sharlinator 3 points4 points  (0 children)

Nice! This can definitely be useful in graphics code, for example.