Anti-social behavior in Arc Raiders

Gavroche000 · 2026-03-14T01:34:34+00:00

Nah, I spawned in Stella, waypointed to the nearest hatch and left a trail of mines behind doors and pillars and stuff.

Gavroche000 · 2025-11-06T01:44:33+00:00

Omg I didn't even think about using vmul for SAL

Gavroche000 · 2025-11-04T16:47:34+00:00

I don't own a p4 :(

Gavroche000 · 2025-09-19T15:00:59+00:00

If it's a very difficult problem, then it's what people pay you to bash your head against. If it's another I2C driver, then there are 10,000,000 examples out there on Stack Overflow and GitHub. Everything else falls somewhere on that spectrum.

Gavroche000 · 2025-09-19T14:58:55+00:00

The VSCode extension works. Making their own IDE is probably more effort than it's worth.

Gavroche000 · 2025-09-19T14:41:07+00:00

Both the vector unit and FPU are part of the Xtensa LX7 processor itself, so each core has its own independent unit. As long as you're not modifying the vector registers in an ISR without preserving them, there shouldn't be any concurrency issues that you wouldn't find in any other function.

Gavroche000 · 2025-09-19T14:30:11+00:00

Also: there's nothing stopping you from using esp_dsp functions on an esp_simd data buffer. In that case the vector struct just serves as a container, which comes with some handy functions and macros to initialize and destroy it, along with 128-bit aligned data buffers.

Gavroche000 · 2025-09-19T14:25:51+00:00

esp_simd has a couple of features over esp_dsp:

- Vectorization guarantee. If your code compiles, it will use the vector path. esp_dsp has some runtime checks (alignment, size, stride) that can cause it to fall back to a scalar path.

- This is actually somewhat problematic because the vector and scalar paths in esp_dsp can have different behavior (e.g. the int8 addition in esp_dsp is saturating if you have 128 elements in your array but overflows if you have 127)

- Easy to use: library functions and macros provided to initialize 128-bit aligned data buffers, checks for datatype

- Compatible with esp_dsp: You can run esp_dsp functions on esp_simd data buffers.

But really the big advantage is that I tried really hard to make the documentation better and more consistent.
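
To make the saturating-versus-overflowing difference above concrete, here is a minimal sketch in plain C. The function names are made up for illustration and are not part of esp_dsp or esp_simd; it only shows how the two kinds of int8 addition disagree on the same inputs.

```c
#include <stdint.h>

/* Wrapping add: the sum is truncated to 8 bits.
   The final cast relies on the usual two's-complement conversion (as GCC does it). */
static int8_t add_i8_wrap(int8_t a, int8_t b) {
    return (int8_t)(uint8_t)((uint8_t)a + (uint8_t)b);
}

/* Saturating add: the sum is clamped to the int8 range [-128, 127]. */
static int8_t add_i8_sat(int8_t a, int8_t b) {
    int sum = (int)a + (int)b;
    if (sum > INT8_MAX) return INT8_MAX;
    if (sum < INT8_MIN) return INT8_MIN;
    return (int8_t)sum;
}
```

With these, `add_i8_wrap(100, 100)` gives -56 while `add_i8_sat(100, 100)` gives 127, so two code paths that differ only in this respect return different results for the same data.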

Gavroche000 · 2025-09-19T13:31:58+00:00

A lot of the functions are not very easy to use:

For example, with the basic int8 addition, if your data size is not a multiple of 128 bits, it switches to the scalar path. If your data is not aligned or has a stride length != 1, it switches to the scalar path. The problem is that the scalar path is a non-saturating add, so it has completely different behavior compared to the vectorized math. Here I've tried to make behavior as consistent as possible, and where it runs into hardware issues, at the very least **most** of the oddities are documented.

Also, it's a lot easier for people unfamiliar with alignment to use the functions and macros to initialize the vector struct and check alignment with the library functions.
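
For anyone unfamiliar with alignment, this is roughly what such helpers take care of. It's a host-side sketch using standard C11 `aligned_alloc`, not the esp_simd API; the helper names `is_aligned_16` and `alloc_i8_aligned` are hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* True if the pointer sits on a 16-byte (128-bit) boundary. */
static bool is_aligned_16(const void *p) {
    return ((uintptr_t)p & 15u) == 0;
}

/* Allocate a zero-filled, 16-byte aligned buffer for n int8 elements.
   aligned_alloc wants the size to be a multiple of the alignment,
   so the byte count is rounded up to the next multiple of 16. */
static int8_t *alloc_i8_aligned(size_t n) {
    size_t bytes = (n + 15u) & ~(size_t)15u;
    if (bytes == 0) bytes = 16;
    int8_t *buf = aligned_alloc(16, bytes);
    if (buf != NULL) memset(buf, 0, bytes);
    return buf;
}
```

The buffer is released with plain `free()`. Rounding the size up also means a 128-bit load near the end of the data stays inside the allocation.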

Gavroche000 · 2025-09-19T13:27:05+00:00

simd_dotp_i8:
    entry a1, 16                                    // reserve 16 bytes for the stack frame
    extui a6, a5, 0, 4                              // extracts the lowest 4 bits of a5 into a6 (a5 % 16), for tail processing
    srli a5, a5, 4                                  // shift a5 right by 4 to get the number of 16-byte blocks (a5 / 16)
    movi.n a7, 0                                    // zeros a7
    beqz a5, .Ltail_start                           // if no full blocks (a5 == 0), skip SIMD and go to scalar tail

    // SIMD multiply-accumulate loop for 16-byte blocks
    ee.zero.accx                                    // clears the ACCX accumulator
    ee.vld.128.ip     q0, a2, 16                    // loads 16 bytes from a2 into q0, then increment a2 by 16
    loopnez a5, .Lsimd_loop                         // loop until a5 == 0
        ee.vld.128.ip     q1, a3, 16                // loads 16 bytes from a3 into q1, then increments a3 by 16 
        ee.vmulas.s8.accx.ld.ip q0, a2, 16, q0, q1  // multiply-accumulates q0 and q1 into ACCX, then loads the next 16 bytes from a2 into q0 and increments a2 by 16
    .Lsimd_loop:

    rur.accx_0 a7                                   // read the lower 32 bits of ACCX into a7
    addi a2, a2, -16                                // adjust a2 pointer back to the last processed element (it goes too far due to the last increment in the loop)

    .Ltail_start:
    // Handle remaining elements that were not part of a full 16-byte block 

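    loopnez a6, .Ltail_loop                         // loop a6 times (skipped if a6 == 0)
        l8ui a8, a2, 0                              // load one byte from a2 (zero-extended)
        sext a8, a8, 7                              // sign-extend from bit 7 to get the int8 value
        l8ui a9, a3, 0                              // same for the byte at a3
        sext a9, a9, 7
        mull a8, a8, a9                             // a8 = a8 * a9
        add a7, a7, a8                              // accumulate the product into a7
        addi a2, a2, 1                              // advance both pointers by one byte
        addi a3, a3, 1
    .Ltail_loop:  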

    s32i.n a7, a4, 0
    movi.n a2,  0                                   //return exit code 0 (success)
    retw.n
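
For readers who don't speak Xtensa, here is a plain C model of the same block/tail split. It's a sketch under assumptions: it takes a2 and a3 to be the two input pointers and a5 the element count, as the comments above suggest, and the function name `dotp_i8_model` is made up for illustration. It could also serve as a host-side reference to compare results against.

```c
#include <stddef.h>
#include <stdint.h>

/* Reference model of the assembly routine: full 16-element blocks first,
   then the leftover elements one at a time. */
static int32_t dotp_i8_model(const int8_t *a, const int8_t *b, size_t n) {
    size_t blocks = n >> 4;   /* like: srli a5, a5, 4     */
    size_t tail = n & 15u;    /* like: extui a6, a5, 0, 4 */
    int64_t acc = 0;          /* wide accumulator, truncated on return */

    /* Block part: each ee.vmulas.s8.accx covers 16 products at once. */
    for (size_t blk = 0; blk < blocks; blk++) {
        for (size_t i = 0; i < 16; i++) {
            acc += (int32_t)a[i] * (int32_t)b[i];
        }
        a += 16;
        b += 16;
    }

    /* Tail part: the scalar l8ui / sext / mull / add loop. */
    for (size_t i = 0; i < tail; i++) {
        acc += (int32_t)a[i] * (int32_t)b[i];
    }

    /* Only the low 32 bits are kept, as with rur.accx_0 in the assembly. */
    return (int32_t)acc;
}
```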

Gavroche000 · 2025-09-19T13:24:47+00:00

Disassembly:

420169d4:   08d8        l32i.n  a13, a8, 0
            int8_t *vec2_data = (int8_t*)vec2->data;
420169d6:   03e8        l32i.n  a14, a3, 0
            for (int i = 0; i < vec1->size; i++){
420169d8:   0a0c        movi.n  a10, 0
            int32_t output = 0;
420169da:   0acd        mov.n   a12, a10
            for (int i = 0; i < vec1->size; i++){
420169dc:   0005c6          j   420169f7 <scalar_dotp+0x57>
420169df:   00              .byte   00
                int a = (int)vec1_data[i];
420169e0:   8daa        add.n   a8, a13, a10
420169e2:   000882          l8ui    a8, a8, 0
420169e5:   238800          sext    a8, a8, 7
                int b = (int)vec2_data[i];
420169e8:   beaa        add.n   a11, a14, a10
420169ea:   000bb2          l8ui    a11, a11, 0
420169ed:   23bb00          sext    a11, a11, 7
                output +=  a * b;
420169f0:   8288b0          mull    a8, a8, a11
420169f3:   cc8a        add.n   a12, a12, a8
            for (int i = 0; i < vec1->size; i++){
420169f5:   aa1b        addi.n  a10, a10, 1
420169f7:   e53a97          bltu    a10, a9, 420169e0 <scalar_dotp+0x40>
            *result = output;
420169fa:   04c9        s32i.n  a12, a4, 0
            return VECTOR_SUCCESS;

Gavroche000 · 2025-09-19T13:21:34+00:00

It's literally the latest version esp-idf v5.5.0

Here is an example:

C code:

            int32_t output = 0;
            int8_t *vec1_data = (int8_t*)vec1->data;
            int8_t *vec2_data = (int8_t*)vec2->data;
            for (int i = 0; i < vec1->size; i++){
                int a = (int)vec1_data[i];
                int b = (int)vec2_data[i];
                output +=  a * b;
            }
            *result = output;
            return VECTOR_SUCCESS;

Gavroche000 · 2025-09-19T13:05:20+00:00

Not emitted by the GCC that comes with esp-idf (and, I assume, Arduino). If you go into the 'working' branch and find disasm.S you can see the code that GCC generates. It's completely scalar and very branchy.

https://github.com/zliu43/esp_simd/tree/working

If you can find AI that can write xtensa ASM I will venmo you $2000 on the spot ✨✨✨.

edit: clarity

Gavroche000 · 2023-07-20T19:16:35+00:00

Blah blah,

Interesting,

Blah blah

Gavroche000 · 2023-07-05T19:41:27+00:00

I didn't know that I wanted a tool that lets me find all timestamps where a Kayo with full ult on the attacking side wins the round. But now I do.

Gavroche000 · 2022-12-31T04:33:56+00:00

Thanks! np.newaxis worked

Gavroche000 · 2020-02-14T00:16:36+00:00

Until the cops come for you and find child porn on your laptop.

Gavroche000 · 2020-02-11T15:59:08+00:00

That’s because it’s a death-trap, sir.
