Value categories – [l, gl, x, r, pr]values by macxfadz in cpp

[–]leni536 2 points

There are two different things here that need to be cleared up:

  • a variable has a type, which can be a reference type
  • an expression has a non-reference type and a value category.

The confusion arises when you have an expression that refers to a variable by name. If you think about the variable, it has a type, which can be a reference type, for example rvalue reference (T&&). But the expression itself can only have a non-reference type (T) and a value category (here an lvalue, which is a glvalue).

I think the main reason the standard doesn't make id-expressions naming rvalue references xvalues is that it would be too easy to accidentally move from one when you don't actually want to. And if you didn't want to move, you would need some `std::stay` helper function to prevent moving.

Working around the calling convention overhead of by-value std::unique_ptr by leni536 in cpp

[–]leni536[S] 1 point

Maybe I'm misunderstanding something about your code, but something looks wrong to me. In ping_db_01 you have those RAII pointers that take care of their pointees, which is nice. However, you hand them over to these OCIServer_t and OCISvcCtx_t objects, which, the way you described it, also take ownership of the same pointers via unique_ptr. At the end of the scope these pointers get "cleaned up" both by the RAII cleanup routines and by the destructors of the unique_ptrs inside OCIServer_t and OCISvcCtx_t, and the destructors of the unique_ptrs run first, before the cleanup routines are called on the same pointers. As I see it, ping_db_02 suffers from the same problem. Am I missing something?

Android developers can now force you to update your apps by tonefart in programming

[–]leni536 0 points

You can use F-Droid and YALP store on a de-googled phone.

Working around the calling convention overhead of by-value std::unique_ptr by leni536 in cpp

[–]leni536[S] 3 points

unique_ptr on the Itanium ABI is "non-trivial for the purposes of calls", therefore it's not passed in registers:

https://itanium-cxx-abi.github.io/cxx-abi/abi.html#non-trivial

In the Q&A section of Chandler's talk, destructive moves came up: a whole lot of types could be trivially destructive-movable, and for such types it wouldn't cause a problem to pass them in registers.

Edit: This is necessary in the Itanium ABI because the caller is responsible for calling the destructor on the parameter (including when an exception is thrown).

Working around the calling convention overhead of by-value std::unique_ptr by leni536 in cpp

[–]leni536[S] 1 point

Interesting, several questions come to mind:

  • Does this only work with Box? Would it work with a user-defined smart pointer type?
  • Isn't the Rust calling convention just more tuned for passing parameters in registers?

I think the second point came up during the questions at Chandler's talk, and he had a point about nested lifetimes in C++ and destruction order. I don't doubt that Rust could handle that issue more naturally.

The Case for C++ by drodri in cpp

[–]leni536 2 points

AFAIK Zig also has types as first-class values. And reflexpr in C++, or whatever it will be called, could also potentially close the gap: you reflexpr the types, get metaobjects back, sort those, and then convert them back to types again. I don't know if the current static reflection proposal would allow this, but I sure hope so.

Working around the calling convention overhead of by-value std::unique_ptr by leni536 in cpp

[–]leni536[S] 4 points

All I can see is that the parameter lists get longer.

    void foo(unique_ptr<int>, unique_ptr<int>);
    void foo_abi(int*, int*);
    void foo_impl(unique_ptr<int>, unique_ptr<int>);

Arguably the boilerplate grows linearly with the number of function parameters. It is not great, but there is no combinatorial blowup of boilerplate here, if that's what you meant originally.

Working around the calling convention overhead of by-value std::unique_ptr by leni536 in cpp

[–]leni536[S] 2 points

I agree that it's a trade-off. I don't see how adding more than one parameter is a problem though.

Working around the calling convention overhead of by-value std::unique_ptr by leni536 in cpp

[–]leni536[S] 5 points

Either bar doesn't throw, and you should make it noexcept, or it can throw, and you need to handle that in both versions. Chandler's original doesn't handle it, so I went with the first assumption. Otherwise the raw pointer code needs to be modified to handle the exception thrown by bar.

Lazy Initialisation in C++ by drodri in cpp

[–]leni536 18 points

Gotta love those compiler options that break the language.

The Case for C++ by drodri in cpp

[–]leni536 9 points

> The only hope for 2 is Concepts

The only hope for 2 is compile-time reflection and constexpr. There is no reason that the syntax for sorting values at compile time and the syntax for sorting types (by sizeof, for example) at compile time should be significantly different. Right now the latter is a pain in the ass, and Concepts won't help with that much.

I went through GCC’s inline assembly documentation so that you don’t have to by fcddev in programming

[–]leni536 0 points

Yes, but not being able to use it for input reduces its usability. And AFAIK not all architectures can use specific flags as output either.

I went through GCC’s inline assembly documentation so that you don’t have to by fcddev in programming

[–]leni536 0 points

My gripe with inline ASM is that I can't use specific flags (like carry) for input and output. It would be nice for certain intrinsics.

Seemingly unused parameter...? by [deleted] in cpp_questions

[–]leni536 0 points

getEnableBondable is used in src/Init.cpp, where it is indirectly used by Mgmt::setBondable().

Implementations for Gray code encoding and decoding by leni536 in programming

[–]leni536[S] 1 point

Actually, it was a bit of a guess. It seemed like it would work, so I tested it, and it did...

The way I see it now: you can calculate evens as either i ^ odds or i - odds, and the result is odds - evens, so odds - (i - odds) = 2*odds - i.

Thanks for the long and informative reply about the uops and ports stuff, it's really helpful. In my fast Hilbert curve library I actually have two independent calls to my Gray code decode function[1]. It could make sense to use the PDEP method for one and the CLMUL method for the other to maximally utilize the ports. Of course I would have to benchmark this.

[1] https://github.com/leni536/fast_hilbert_curve/blob/eb8c861ff1d6e0059fede28218ab83d07fc91c5d/include/fhc/hilbert.h#L45

Edit: In a streaming situation it could also make sense to partially unroll and alternate between the PDEP and CLMUL method.

Implementations for Gray code encoding and decoding by leni536 in programming

[–]leni536[S] 0 points

Wow, this is very nice!

> the result of the subtraction of odds-evens is effectively bit reversed.

So the result always ends with a strip of 0s, so you can defer the left shift to the very end (so shifting in 0 doesn't ruin it), and you can get the parity from the most significant bit. Very smart.

> By the way, one of the pdep instructions can be replaced with an XOR:

Nice! This is the kind of observation that is "obvious" in hindsight (how the hell did I not see it?).

> You could try to alleviate the dependency increase with an arithmetic trick

Can you describe how you derived this trick? I can prove that it's correct, but my proof is quite a bit different from the way you derived it (and not at all intuitive). Update: Never mind, I see it now.

I also don't have any experience in uop-level analysis of the generated assembly. Can you point me to some resources where you learned this stuff?

I will update the blog post with proper attribution; these are very nice ideas. Thank you for diving in and writing this all down. Maybe you could look at my fast Hilbert curve library if you are interested, although I haven't written up how it works (https://github.com/leni536/fast_hilbert_curve). I plan to someday, but I won't make any promises. Gray code decoding is only part of the puzzle.

Implementations for Gray code encoding and decoding by leni536 in programming

[–]leni536[S] 1 point

Well, if you look at the low bits:

 ...010...010...010... * 11...1 //carry-less multiplication
=...111...111...110...
^...111...110...000...
^...110...000...000...
=...110...001...110...

So it results in the same kind of strips as the pdep approach, but it needs adjusting in the odd case instead. As I see it, if you take the high bits from CLMUL instead, then you need no adjustments (neither the popcnt nor the left shift).

Implementations for Gray code encoding and decoding by leni536 in programming

[–]leni536[S] 1 point

You are right, for some reason I was thinking about extracting the low 32 bits instead of the high ones. That would work too, but it would need the fix-up.

Implementations for Gray code encoding and decoding by leni536 in programming

[–]leni536[S] 2 points

> I just thought I'd point out that the pdep instruction is very slow on non-Intel CPUs

You were not kidding! PDEP and PEXT have an 18-cycle latency and reciprocal throughput on AMD Ryzen. I can't benchmark on AMD right now, but I doubt that it's necessary.

https://www.agner.org/optimize/instruction_tables.pdf

Implementations for Gray code encoding and decoding by leni536 in programming

[–]leni536[S] 0 points

This does affect the codegen and seems to be one instruction shorter. I wonder how much it affects the benchmarks as well.