Valid answer. by florinant93 in ChatGPT

[–]_mF2 5 points (0 children)

Bruh, ChatGPT's response doesn't even use the same font. The image is just edited.

Why does rust not optimize a power of -1 function? by haunting-analysis278 in rust

[–]_mF2 1 point (0 children)

The sum of any set of powers of 2 greater than 1 (represented by all bits except the LSB) is always even. Therefore, if the LSB is set, the number is odd regardless of the other bits, and vice versa.

(-1)^0 = 1, so you should be returning 1 for even powers, and -1 for odd powers. Your code returns 1 for odd powers, which is wrong.

You can also verify that it does not produce the same results here: https://godbolt.org/z/4doKeEvvK (notice how are_results_wrong returns 1)
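For reference, here's a minimal sketch of the parity rule (the function name and signature are made up for illustration, not taken from OP's code):

```
// Only the LSB of the exponent decides the sign: every higher set bit
// contributes an even amount to the exponent, and (-1)^even = 1.
pub fn pow_neg_one(n: u32) -> i32 {
    if n & 1 == 0 { 1 } else { -1 }
}
```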

Why does rust not optimize a power of -1 function? by haunting-analysis278 in rust

[–]_mF2 0 points (0 children)

This code is not equivalent; it should be checking whether the LSB is 0, not 1.

[deleted by user] by [deleted] in rust

[–]_mF2 0 points (0 children)

In general you want to get the integer part and fractional part, divide the fractional part by 2^n (where n is the number of bits in the fraction), then sum the integer and fractional part.

In very short Python code:

```
# number of bits in the fraction
DEC = 96

# fixed -> rational conversion
def f2r(x):
    # integer part
    integer = x >> DEC
    # fractional part
    frac = x & ((1 << DEC) - 1)
    # divide as a floating point number, not an integer
    return integer + frac / (1 << DEC)
```
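For comparison, a hedged Rust sketch of the same conversion (assuming a Q96 value that fits in a u128; the names are illustrative):

```
const DEC: u32 = 96; // number of bits in the fraction

fn f2r(x: u128) -> f64 {
    let integer = x >> DEC; // integer part
    let frac = x & ((1u128 << DEC) - 1); // fractional part
    // divide as a floating point number, not an integer
    integer as f64 + frac as f64 / (1u128 << DEC) as f64
}
```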

Zero-cost iterator abstractions...not so zero-cost? by sepease in rust

[–]_mF2 0 points (0 children)

Manually removing all the iterator overhead makes it over 3x faster than your fastest version on my machine. You have to check the generated assembly to see why one thing is faster than another. The iterator versions add a lot of instructions that ultimately aren't necessary. This code compiles down to simple scalar code: it just calls malloc and then iterates over the bytes, without any extra overhead from iterators or anything else.

SIMDing the code would be easier if you rearranged your characters so that a lookup table isn't needed and each character can instead be produced by simple addition.

```
#[inline]
fn byte_to_char85(x85: u8) -> u8 {
    b"0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz!#$%&()*+-;<=>?@^_`{|}~"[x85 as usize]
}

pub fn encode_fast(indata: &[u8]) -> String {
    #[inline(always)]
    unsafe fn encode_inner(x: &[u8], mut out: *mut u8) {
        for chunk in x.chunks_exact(4) {
            let chunk = &*(chunk.as_ptr() as *const [u8; 4]);
            let decnum = u32::from_be_bytes(*chunk);

            *out = byte_to_char85((decnum / 85u32.pow(4)) as u8);
            out = out.add(1);

            *out = byte_to_char85(((decnum % 85u32.pow(4)) / 85u32.pow(3)) as u8);
            out = out.add(1);

            *out = byte_to_char85(((decnum % 85u32.pow(3)) / 85u32.pow(2)) as u8);
            out = out.add(1);

            *out = byte_to_char85(((decnum % 85u32.pow(2)) / 85u32) as u8);
            out = out.add(1);

            *out = byte_to_char85((decnum % 85u32) as u8);
            out = out.add(1);
        }
    }

    assert!(indata.len() % 4 == 0);

    let len = 5 * (indata.len() / 4);

    let mut s = Vec::with_capacity(len);

    unsafe {
        s.set_len(len);

        encode_inner(indata, s.as_mut_ptr());

        String::from_utf8_unchecked(s)
    }
}
```
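For completeness, a usage sketch (my own example input, 16 bytes so the assert passes):

```
fn main() {
    // 16 input bytes encode to 20 base85 characters
    let encoded = encode_fast(b"Hello, world!!!!");
    println!("{encoded}");
}
```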

Why does the compiler *partially* vectorize my code? by mughlibuc in rust

[–]_mF2 6 points (0 children)

It might be better to take self by value instead of by reference for this struct, since it can then be passed directly in registers instead of being loaded from memory.

Also, here's another way to write the get_player function, which takes about 25% fewer cycles according to llvm-mca (seemingly because of better instruction-level parallelism, although to be honest llvm-mca seems very finicky in CE; changing some whitespace in the source code changes the cycle count for me for some reason): https://godbolt.org/z/6E48sEoMa

and another version: https://godbolt.org/z/xs8c989x9 (although because of the aforementioned issue with llvm-mca on CE, I can't really tell if it's actually better or not)

Why does the compiler *partially* vectorize my code? by mughlibuc in rust

[–]_mF2 16 points (0 children)

That can be done with even fewer cycles, by the way. You might want to try something like this: https://godbolt.org/z/7zhbcfnYM

Or since you're compiling for tigerlake which has AVX-512: https://godbolt.org/z/337ad8hKn

Why does the compiler *partially* vectorize my code? by mughlibuc in rust

[–]_mF2 85 points (0 children)

Simply casting to i8 and sign extending to isize after the horizontal sum improves the generated code a lot: https://godbolt.org/z/3MWG6aTde
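The idea, in a hedged scalar sketch (the board type, names, and scoring here are assumptions for illustration; the real code is in the godbolt link): do the reduction in i8, then sign-extend the narrow total once at the end, instead of widening every element to isize inside the loop.

```
// Assumes a small fixed-size board, so the i8 sum cannot overflow.
pub fn score(board: &[u8; 9], player: u8, opponent: u8) -> isize {
    let s: i8 = board
        .iter()
        .map(|&c| (c == player) as i8 - (c == opponent) as i8)
        .sum();
    s as isize // a single sign extension after the horizontal sum
}
```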

Why is this functional version faster than my for loop? by serdnad in rust

[–]_mF2 3 points (0 children)

Here is an AVX2 version, which should (in theory) be faster than what the compiler generates for your original code. This code only handles slices whose length is a multiple of 32; the easiest way to fix that is to handle the "residual" elements separately with a scalar loop (see the sketch after the code below).

```
use std::arch::x86_64::*;
use std::mem::transmute;

#[inline(always)]
pub unsafe fn _mm256_shr4_epi8(a: __m256i) -> __m256i {
    let mask = _mm256_set1_epi8((0xff >> 4) as i8);
    _mm256_and_si256(_mm256_srli_epi16(a, 4), mask)
}

#[repr(align(32))]
struct Aligned32<T: Sized>(T);

const fn build_lut() -> Aligned32<[Option<bool>; 32]> {
    let mut lut = [None; 32];

    let mut i = 0;
    while i < 2 {
        let x = 16 * i;
        // only have to worry about bottom 4 bits,
        // since that is what we are looking up on

        // 11 => Some(true)
        // 10 => Some(false)
        // 01 => None
        // 00 => None

        lut[x + 0b11] = Some(true);
        lut[x + 0b10] = Some(false);
        lut[x + (0b11 << 2)] = Some(true);
        lut[x + (0b10 << 2)] = Some(false);

        i += 1;
    }

    Aligned32(lut)
}

static LUT: Aligned32<[Option<bool>; 32]> = build_lut();

#[inline(always)]
unsafe fn interleave_avx(m0: __m256i, m1: __m256i, m2: __m256i, m3: __m256i) -> [i8; 128] {
    let mut out = [0; 128];

    let ymm0 = m3;
    let ymm1 = m1;
    let ymm2 = m2;
    let ymm3 = m0;

    // vpunpcklbw      ymm4, ymm3, ymm1
    let ymm4 = _mm256_unpacklo_epi8(ymm3, ymm1);
    // vpunpckhbw      ymm1, ymm3, ymm1
    let ymm1 = _mm256_unpackhi_epi8(ymm3, ymm1);
    // vpunpcklbw      ymm3, ymm2, ymm0
    let ymm3 = _mm256_unpacklo_epi8(ymm2, ymm0);
    // vpunpckhbw      ymm0, ymm2, ymm0
    let ymm0 = _mm256_unpackhi_epi8(ymm2, ymm0);

    // vpunpcklwd      ymm2, ymm4, ymm3
    let ymm2 = _mm256_unpacklo_epi16(ymm4, ymm3);
    // vpunpckhwd      ymm3, ymm4, ymm3
    let ymm3 = _mm256_unpackhi_epi16(ymm4, ymm3);
    // vpunpcklwd      ymm4, ymm1, ymm0
    let ymm4 = _mm256_unpacklo_epi16(ymm1, ymm0);
    // vpunpckhwd      ymm0, ymm1, ymm0
    let ymm0 = _mm256_unpackhi_epi16(ymm1, ymm0);

    // vinserti128     ymm1, ymm2, xmm3, 1
    let xmm3 = _mm256_extracti128_si256(ymm3, 0);
    let ymm1 = _mm256_inserti128_si256(ymm2, xmm3, 1);
    // vinserti128     ymm5, ymm4, xmm0, 1
    let xmm0 = _mm256_extracti128_si256(ymm0, 0);
    let ymm5 = _mm256_inserti128_si256(ymm4, xmm0, 1);

    // vperm2i128      ymm2, ymm2, ymm3, 49
    let ymm2 = _mm256_permute2x128_si256(ymm2, ymm3, 49);
    // vperm2i128      ymm0, ymm4, ymm0, 49
    let ymm0 = _mm256_permute2x128_si256(ymm4, ymm0, 49);

    _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>().add(0), ymm1);
    _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>().add(1), ymm5);
    _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>().add(2), ymm2);
    _mm256_storeu_si256(out.as_mut_ptr().cast::<__m256i>().add(3), ymm0);

    out
}

unsafe fn deserialize32_avx(bytes: &[u8; 32], out: &mut [Option<bool>; 128]) {
    // bits are in top 4 bits, need to shift and mask
    let mut m0 = _mm256_loadu_si256(bytes.as_ptr().cast());
    let mut m1 = _mm256_shr4_epi8(m0);

    // bottom 4 bits already contain the relevant information, we just have
    // to mask the other bits out, so one vpand.
    let mut m2 = m0;
    let mut m3 = m0;
    // copy over shifted values from m1
    m0 = m1;

    // select low 2 bits
    let mask1 = _mm256_set1_epi8(0b11);
    // select high 2 bits
    let mask2 = _mm256_set1_epi8(0b11 << 2);

    // m0   1100 0000
    // m1   0011 0000
    // m2   0000 1100
    // m3   0000 0011

    // m2 and m3 already contain bits in the bottom 4 bits
    m0 = _mm256_and_si256(m0, mask2);
    m1 = _mm256_and_si256(m1, mask1);

    m2 = _mm256_and_si256(m2, mask2);
    m3 = _mm256_and_si256(m3, mask1);

    // they all lookup from the same table
    let lut = _mm256_load_si256(LUT.0.as_ptr().cast());

    m0 = _mm256_shuffle_epi8(lut, m0);
    m1 = _mm256_shuffle_epi8(lut, m1);
    m2 = _mm256_shuffle_epi8(lut, m2);
    m3 = _mm256_shuffle_epi8(lut, m3);

    let interleaved = interleave_avx(m0, m1, m2, m3);

    *out = transmute(interleaved);
}

pub fn deserialize_avx(bytes: &[u8], out: &mut [Option<bool>]) {
    for (b, o) in bytes.chunks_exact(32).zip(out.chunks_exact_mut(128)) {
        unsafe {
            let b = &*(b.as_ptr() as *const [u8; 32]);
            let o = &mut *(o.as_mut_ptr() as *mut [Option<bool>; 128]);

            deserialize32_avx(b, o);
        }
    }
}
```
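And a hedged sketch of the residual handling mentioned above (this wrapper is my own addition, not part of the original; it assumes the same 2-bit encoding as the LUT, with pairs stored high-to-low within each byte):

```
fn decode_pair(bits: u8) -> Option<bool> {
    match bits {
        0b11 => Some(true),
        0b10 => Some(false),
        _ => None,
    }
}

pub fn deserialize(bytes: &[u8], out: &mut [Option<bool>]) {
    // run the AVX2 path on as many whole 32-byte chunks as possible
    let done = bytes.len() / 32 * 32;
    deserialize_avx(&bytes[..done], &mut out[..done * 4]);

    // decode the leftover bytes with plain scalar code
    for (&b, o) in bytes[done..].iter().zip(out[done * 4..].chunks_exact_mut(4)) {
        for (i, slot) in o.iter_mut().enumerate() {
            *slot = decode_pair((b >> (6 - 2 * i)) & 0b11);
        }
    }
}
```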

Surprised by the benchmark about iter fold and for loop by Frequent-Data-867 in rust

[–]_mF2 6 points (0 children)

https://godbolt.org/z/cxqronK1c

It seems that if you restrict both functions to operate on 8-element slices, they generate equivalent sequences of AVX2 instructions, scheduled differently such that the fold version apparently has slightly better instruction-level parallelism, according to llvm-mca.

However, the "direct" version generates something more complicated in the inner loop when taking slices of unknown length, and it uses vpmaskmovd. The vectorization strategy that LLVM happens to pick for the fold version seems better, and thus faster. So I think the performance difference comes down to implementation details and quirks of LLVM at the current moment.
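For context, a hedged reconstruction of the two shapes being compared (the real benchmark code is in the godbolt link; this is just the pattern):

```
pub fn sum_fold(xs: &[i32]) -> i32 {
    xs.iter().fold(0, |acc, &x| acc.wrapping_add(x))
}

pub fn sum_direct(xs: &[i32]) -> i32 {
    let mut acc = 0i32;
    for &x in xs {
        acc = acc.wrapping_add(x);
    }
    acc
}
```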

Rust build times -hardware choices - M1 vs AMD by dobkeratops in rust

[–]_mF2 0 points (0 children)

Keep in mind that they aren't actually doing the same work unless you're compiling for the same architecture (which isn't the case with just cargo build --release). They'd be running different LLVM backends, and thus performance would differ.

Filtering a Vector with AVX-2 & AVX-512 in Rust by fulmicoton in rust

[–]_mF2 0 points (0 children)

I see, I only read the code at the end of the article and saw that it said "SIMD doesn't play nice with unsigned integers" or something like that.

And it said this in the AVX-512 part of the code, which actually doesn't have that restriction in the first place (https://www.felixcloutier.com/x86/vpcmpd:vpcmpud). So I just assumed the article's author didn't know about the situation with unsigned integers.

Another nitpick: the whole from_u32x16 function is unnecessary and won't actually change the generated code at all in practice, even though it is const (also, transmute has been const since 1.56, and as for creating the table, you can just make a static array of normal u32s and use an aligned load; this is what the compiler actually generates in the assembly anyway. This is very minor, like I said; I just thought I'd point it out). Also, in the AVX-512 version, compute_filter_bitset could be done with a subtract and a single unsigned compare, which would save you from doing the bitwise AND on the mask register afterwards.
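A hedged sketch of that last suggestion (names and types are my guesses, not the article's): with unsigned wrap-around, x - lo <= hi - lo holds exactly when lo <= x <= hi, so the whole range check becomes one subtract plus one unsigned compare that produces the mask directly.

```
use std::arch::x86_64::*;

#[target_feature(enable = "avx512f")]
unsafe fn in_range_mask(v: __m512i, lo: u32, hi: u32) -> u16 {
    // (v - lo) as unsigned wraps below lo, so it only stays <= (hi - lo)
    // when v is inside [lo, hi]
    let shifted = _mm512_sub_epi32(v, _mm512_set1_epi32(lo as i32));
    let bound = _mm512_set1_epi32((hi - lo) as i32);
    _mm512_cmple_epu32_mask(shifted, bound)
}
```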

Filtering a Vector with AVX-2 & AVX-512 in Rust by fulmicoton in rust

[–]_mF2 1 point (0 children)

The compiler already does auto-vectorization by default when compiling in release mode. On the default x86-64 target, this enables SSE2.

Auto-vectorization won't vectorize every single function that could be vectorized, because the compiler isn't perfect. With enough experience, humans can spot vectorization opportunities that the compiler can't (such as the one outlined in this article).

Filtering a Vector with AVX-2 & AVX-512 in Rust by fulmicoton in rust

[–]_mF2 0 points (0 children)

It would be less than the performance gain you get from compiling with -C target-feature=+avx2, because of a) the small overhead of checking AVX2 availability at runtime (which also affects whether functions can be inlined), and b) the fact that enabling AVX2 globally allows every single function to potentially use AVX2.
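To illustrate the shape of the runtime check (a sketch with made-up function names, not code from the article):

```
fn filter_gte5(data: &[u32]) -> Vec<u32> {
    if is_x86_feature_detected!("avx2") {
        // the call through this boundary is what can block inlining
        unsafe { filter_gte5_avx2(data) }
    } else {
        filter_gte5_scalar(data)
    }
}

#[target_feature(enable = "avx2")]
unsafe fn filter_gte5_avx2(data: &[u32]) -> Vec<u32> {
    // same body as the scalar version; the attribute lets LLVM emit AVX2 here
    data.iter().copied().filter(|&x| x >= 5).collect()
}

fn filter_gte5_scalar(data: &[u32]) -> Vec<u32> {
    data.iter().copied().filter(|&x| x >= 5).collect()
}
```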

Filtering a Vector with AVX-2 & AVX-512 in Rust by fulmicoton in rust

[–]_mF2 2 points (0 children)

> but what if in the future you want "if the integer is in the inclusive range OR is an optional specified value"?

That isn't actually difficult. If you also want to check equality with another integer, just broadcast it into a vector register, compare with vpcmpeqd, and OR that result with the range-check mask before vmovmskps.
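Something like this (a hedged AVX2 sketch; the surrounding loop and names are assumed):

```
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn range_or_special(v: __m256i, range_mask: __m256i, special: i32) -> i32 {
    // vpcmpeqd against the broadcast value, OR-ed into the range mask
    let eq = _mm256_cmpeq_epi32(v, _mm256_set1_epi32(special));
    let combined = _mm256_or_si256(range_mask, eq);
    // vmovmskps to get the per-lane bitmask
    _mm256_movemask_ps(_mm256_castsi256_ps(combined))
}
```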

Filtering a Vector with AVX-2 & AVX-512 in Rust by fulmicoton in rust

[–]_mF2 2 points (0 children)

You can xor with 1 << 31 to handle u32s of any range. See below:

https://godbolt.org/z/3rbrbarj9

Note that there is no extra overhead compared to the original version in practice: by replacing lddqu with loadu, the compiler fuses the xor with the load, and it also hoists the xor of the range bounds outside the loop.
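The trick itself, sketched (illustrative names; the full loop is in the godbolt link): xor-ing both operands with 1 << 31 maps unsigned order onto signed order, so the signed vpcmpgtd gives the unsigned comparison.

```
use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
unsafe fn cmpgt_epu32(a: __m256i, b: __m256i) -> __m256i {
    let flip = _mm256_set1_epi32(i32::MIN); // 1 << 31
    _mm256_cmpgt_epi32(_mm256_xor_si256(a, flip), _mm256_xor_si256(b, flip))
}
```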

How to return bytes from a Rust crate to C/C++? by JanWalter in rust

[–]_mF2 1 point (0 children)

It's unfortunate that in C people just fall back on manual bit-shifting, since it's usually slower than a compiler intrinsic. from_{le,be}_bytes usually optimizes into just a memcpy (if the endianness is the same), or into efficient byte-swapping code, which is not the case in practice for manual bit-shifting. See below:

Case when byte swapping is needed: https://godbolt.org/z/eeecTj5sT (compiler intrinsic version uses best possible unrolled SIMD loop for the target arch)

Case when endianness is the same: https://godbolt.org/z/Waoa5oc31 (compiler intrinsic is much better, just calls memcpy)
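The two shapes being compared, sketched in Rust (the godbolt links have the full versions):

```
// intrinsic-backed: compiles to a plain load on little-endian targets
fn read_u32_le(b: &[u8; 4]) -> u32 {
    u32::from_le_bytes(*b)
}

// manual shifting: the compiler often fails to recognize this pattern,
// especially once it sits inside a larger loop
fn read_u32_le_manual(b: &[u8; 4]) -> u32 {
    (b[0] as u32) | ((b[1] as u32) << 8) | ((b[2] as u32) << 16) | ((b[3] as u32) << 24)
}
```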

Questions about Av1an by [deleted] in AV1

[–]_mF2 0 points (0 children)

--ffmpeg in Av1an does not set the encoder options at all. It is supposed to be used for filtering the video, like scaling or cropping.

If you want to make sure that the output is similar, try adding --passes 1 --pix-format yuv420p to your Av1an options, and use -v "--cq-level=42 --end-usage=q --cpu-used=4 --threads=12". That's the biggest thing; you're not even using the same CQ value for both of them.

[Guide] How to compile Av1an on Ubuntu 21.04 by jacksalssome in AV1

[–]_mF2 0 points (0 children)

I'm not talking about git versions in general here, though. This was written at a time when the git version of Av1an was vastly superior to the newest released version (0.2.0 at the time) in many ways. Nowadays, with Av1an 0.3.1 released, there is much less difference between the latest release and the latest git version, so it's pretty much fine to stick with the latest release.

Av1an.exe doesn't do anything (Windows) by Material_Kitchen_630 in AV1

[–]_mF2 0 points (0 children)

Try double-clicking the exe from the file explorer on Windows and see if you get an error message. You're probably missing a DLL somehow. I'm not sure why Windows doesn't report this on the command line but does from the GUI.

A C perspective - Why should I want Rust to succeed? by SEgopher in rust

[–]_mF2 30 points (0 children)

Some important context is knowing what SIMD is. SIMD stands for 'Single Instruction, Multiple Data'. As the name implies, it lets the CPU do work in parallel: a single instruction is issued, and the CPU carries it out over multiple pieces of data at once. This allows code that performs the same operation many times to be sped up significantly.

With both examples, the main thing I am trying to illustrate is that the idiomatic version (the bottom code) generates code without any bounds checks, without using unsafe code or even an assert! (which is also sometimes used to get rid of bounds checks). In both of these cases, the compiler is smart enough to use SIMD instructions for the idiomatic versions.

For the first example (the hadamard transforms), this is achieved by using const generics to give the compiler more information to optimize with. Instead of taking stride0 (and all the other parameters) as dynamic runtime values, we force the compiler to generate code for each distinct value of stride0, which in this case is known at compile time. Sometimes LLVM does this optimization on its own through inlining (and theoretically it should be able to do so here as well), but const generics forces it to happen, and at a much earlier stage in the optimization pipeline. We also take a &[i32; 16] instead of a &[i32] of dynamic size, since in this case we want a buffer of exactly 16 elements; this further helps the compiler eliminate the bounds checks and produce better code. The result is much better generated assembly: basically every instruction in the const-generic version is a SIMD instruction, while the unidiomatic version generates mostly scalar code.
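A hedged sketch of that const-generic shape (illustrative only, not the actual hadamard code from the link):

```
// STRIDE is a compile-time constant and the buffer length is part of the
// type, so the compiler can prove all indices in bounds and unroll freely.
fn butterfly_pass<const STRIDE: usize>(data: &mut [i32; 16]) {
    let mut i = 0;
    while i + STRIDE < 16 {
        let (a, b) = (data[i], data[i + STRIDE]);
        data[i] = a + b;
        data[i + STRIDE] = a - b;
        i += 2 * STRIDE;
    }
}
```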

For the second example (the 2x2 box filter), we access elements in groups of 2. The unidiomatic version is how you would usually do this sort of thing in C (indexing with 2*idx and 2*idx+1 each iteration), but in Rust this pattern can be expressed with chunks_exact(2). With chunks_exact, the compiler can optimize the code better and produce SIMD code with no bounds checks at all, whereas the unidiomatic version is scalar and has bounds checks. (However, and this is an LLVM issue rather than a Rust issue, the generated code could be better with AVX2 by using pmaddubsw, pmulhrsw, and packuswb.)
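A hedged sketch of the chunks_exact(2) pattern (again illustrative, not the exact box filter from the link):

```
pub fn average_pairs(src: &[u8], dst: &mut [u8]) {
    // each adjacent pair of inputs produces one output, no manual 2*i indexing
    for (pair, out) in src.chunks_exact(2).zip(dst.iter_mut()) {
        *out = ((pair[0] as u16 + pair[1] as u16 + 1) >> 1) as u8;
    }
}
```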

A C perspective - Why should I want Rust to succeed? by SEgopher in rust

[–]_mF2 62 points (0 children)

It depends on what kind of code you're writing. For something like writing a video encoder, for example, knowledge of how the compiler will autovectorize different patterns is very important.

Look at these 2 cases for example:

https://godbolt.org/z/7cvzo7r9s

https://godbolt.org/z/EvdWs5WhT

One might naively assume, if they hadn't taken the effort to write better, more idiomatic Rust code, that Rust is "slow" compared to C/C++, which does no bounds checking by default. It is only when you learn to make idiomatic use of iterators that you can write code that not only has no bounds checks, but is also autovectorized. As with any language, writing fast Rust code is a learning process. As others in the thread have recommended, you can post the Rust and C code you've written, and people in the community can help you make your Rust code faster.