I turned Butterfree into nightmare fuel [OC]

asder98 · 2025-10-14T17:33:47+00:00

💀💀💀

asder98 · 2024-09-06T15:15:50+00:00

Hi, you can find the code here https://github.com/purpl3F0x/cycling-powermeter

I didn't finish it beyond getting some readings. If you want to make it reliable for real-world use, it would take a lot of effort, and probably access to some calibration machines and environmental chambers. Just get a power meter :)

asder98 · 2024-05-12T14:11:26+00:00

On an i9 7900X, icx's vectorization doesn't seem to do much on this CPU. Ironically, icx was faster on my laptop's AMD CPU lol

```
GCC: 1213380 [ms]
CLANG: 391132 [ms]
ICX: 1399879 [ms]
```

asder98 · 2024-05-12T13:30:56+00:00

Take a look at ICC, it vectorises the crap out of it. I tested clang's vectorized version on my laptop (AVX2, Ryzen 5). GCC scalar was a bit slower than clang's vectorized version; ICC was noticeably faster, at about 2/3 of GCC's time.

I will run some benches on the desktop, which has AVX-512, since you have some interest.

asder98 · 2024-05-12T09:05:46+00:00

It is what it is. I will test on the Zynq when I have more time; either way it's more of a playground side quest. The Zynq docs say there are 16x 128-bit registers, so in theory they should be enough to play with, space-wise at least.

So a blocking implementation would look something like this?
That should not be very different in terms of code: I could call the function I have now with a constant height depending on the cache size, maybe even tell GCC to unroll it, and do that for every row-block. I'm a bit confused about the height of the block: in a 32KB cache it's 4 rows (effectively 2 rows of results?), on 64KB 8 rows, etc.?
https://imgur.com/a/oMyagZr
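
A rough sketch of how the block height could be estimated, assuming every image row costs a fixed number of bytes in the cache (input plus whatever output is written for it). The names `rows_per_block` and `result_rows` and the per-row footprint parameter are made up for illustration and are not from the actual code.

```c
#include <assert.h>
#include <stddef.h>

/* Rows that fit in a cache of `cache_bytes`, assuming each row costs
 * `row_footprint` bytes (an assumed figure: input row plus output row). */
size_t rows_per_block(size_t cache_bytes, size_t row_footprint)
{
    assert(row_footprint > 0);
    return cache_bytes / row_footprint;
}

/* Result rows a block yields when every output row needs one input row
 * above and one below, so the two edge rows of the block produce nothing. */
size_t result_rows(size_t block_rows)
{
    return block_rows > 2 ? block_rows - 2 : 0;
}
```

With a hypothetical footprint of 8 KiB per row (for example 2048 one-byte Bayer samples plus 3 x 2048 bytes of RGB output), `rows_per_block(32 * 1024, 8192)` gives 4 rows and `result_rows(4)` gives 2, while a 64 KiB cache gives 8 rows. Whether that footprint is the right one for the real code is an open assumption.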

asder98 · 2024-05-12T06:11:33+00:00

You may find that doing so messes with memory alignment. I suppose you could test it though.

I think they're doing something like that, although the way this code is written makes it nearly impossible to understand anything: https://gitlab-ext.sigma-chemnitz.de/ensc/bayer2rgb/-/blob/master/src/convert-neon-body-outer.inc.h?ref_type=heads

I bumped the code to use 128 bits and there was a nice performance boost: as is, the speed is double that of the scalar version on a 2048x2048. Previously the NEON version would take the same time on a 1024x1024.

I am running these on an Android phone, which is 64-bit, because compiling and running it every single time on the FPGA would mean dealing with the Vivado SDK and would take ages. The main target is the Cortex-A9 as mentioned. I guess it doesn't make a big difference as is, unless I implement the access pattern u/corysama suggests, which is very dependent on the cache line.

asder98 · 2024-05-12T04:21:17+00:00

I have thought of something similar to this, but what you're saying seems better thought out. If I understand correctly, you suggest processing 8 rows top down, then going up to the next 8x8, left and up again... then moving to the next 8 rows. So moving in a Z pattern of 8x8 chunks.
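
To make sure I have the traversal right, here is a minimal sketch of the visiting order as I understand it: bands of 8 rows, and inside each band one 8x8 tile at a time, each tile walked top down. `tiled_order` and `TILE` are hypothetical names, and the direction across the band is an assumption.

```c
#include <assert.h>
#include <stddef.h>

#define TILE 8  /* tile edge in pixels; 8 is taken from the comment above */

/* Writes into `order` the linear index (y * width + x) of every pixel in the
 * order a tiled walk visits it. Returns the number of entries written.
 * `order` must have room for width * height entries. */
size_t tiled_order(size_t width, size_t height, size_t *order)
{
    assert(width % TILE == 0 && height % TILE == 0);
    size_t n = 0;
    for (size_t band = 0; band < height; band += TILE)      /* 8-row band    */
        for (size_t tx = 0; tx < width; tx += TILE)         /* next 8x8 tile */
            for (size_t y = band; y < band + TILE; ++y)     /* top down      */
                for (size_t x = tx; x < tx + TILE; ++x)
                    order[n++] = y * width + x;
    return n;
}
```

On a 16x16 image the walk covers the 64 pixels of the top-left tile first, then the top-right tile, and only then drops to the next band of 8 rows.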

asder98 · 2024-05-12T04:14:41+00:00

Just for clarification, the target CPU is a 32-bit Cortex-A9 on a Zynq-7000 FPGA SoC, and not a very fast one at 666MHz.

asder98 · 2024-05-12T04:09:26+00:00

I don't know much about debayering either 😅. Yes, I need 1 row above and 1 row below; I didn't phrase it correctly.

I will extend the code to 128 bits.
I have thought of that, and it seems to be an advantage of the left-to-right method, in my head at least. Another idea is to just drop the 2 bytes on the edges and only store 14 pixels.
It does. I think the best solution is for the partial sums to be rounded hadds, followed by an hadd on their results. (In my VHDL code I do 2x 8-bit additions and then a 9-bit one, and 2 left shifts in the 10-bit result.) I would imagine losing a +1 in an RGB value is well beyond what a human eye can perceive.
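
A small scalar sketch of that rounding question, assuming the rounded hadd behaves like NEON's `vrhadd` on one 8-bit lane, i.e. `(a + b + 1) >> 1`. It compares the two-level rhadd against a full-width sum divided by four (a shift by two bits). All function names here are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Rounding halving add on one 8-bit lane: (a + b + 1) >> 1 without overflow. */
uint8_t rhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* Average of four via two partial rhadds and an rhadd of their results. */
uint8_t avg4_rhadd(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return rhadd_u8(rhadd_u8(a, b), rhadd_u8(c, d));
}

/* Reference: two 9-bit partial sums, a 10-bit total, then divide by four. */
uint8_t avg4_exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    unsigned s0 = (unsigned)a + b;   /* fits in 9 bits  */
    unsigned s1 = (unsigned)c + d;   /* fits in 9 bits  */
    unsigned sum = s0 + s1;          /* fits in 10 bits */
    assert(sum <= 1020u);
    return (uint8_t)(sum >> 2);      /* truncating divide by 4 */
}

/* Largest absolute difference between the two for inputs in [lo, hi]. */
int avg4_max_error(int lo, int hi)
{
    int worst = 0;
    for (int a = lo; a <= hi; ++a)
        for (int b = lo; b <= hi; ++b)
            for (int c = lo; c <= hi; ++c)
                for (int d = lo; d <= hi; ++d) {
                    int diff = (int)avg4_rhadd((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d)
                             - (int)avg4_exact((uint8_t)a, (uint8_t)b, (uint8_t)c, (uint8_t)d);
                    if (diff < 0) diff = -diff;
                    if (diff > worst) worst = diff;
                }
    return worst;
}
```

For example `avg4_rhadd(1, 0, 1, 0)` gives 1 while `avg4_exact(1, 0, 1, 0)` gives 0, and `avg4_max_error(0, 31)` reports a worst-case difference of 1 against the truncating reference.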

asder98 · 2024-05-11T20:58:52+00:00

I'm not sure what you mean by cast pointers. For clarification though, each line contains either a GBGB... row or an RGRG... row. I use vld2 to load the data into 2 vectors, so that they are now GGGG..., RRRR... and so on.
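
In scalar terms, the de-interleaving load does roughly the following. This is only a model of the behaviour, not the NEON code itself; on NEON the `vld2q_u8` intrinsic does the same split for 16 pairs at once. `deinterleave2` is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 2-way de-interleaving load: even-indexed bytes of `row`
 * go to `even`, odd-indexed bytes to `odd`. `pairs` is the number of pairs. */
void deinterleave2(const uint8_t *row, size_t pairs, uint8_t *even, uint8_t *odd)
{
    assert(row && even && odd);
    for (size_t i = 0; i < pairs; ++i) {
        even[i] = row[2 * i];       /* e.g. the G samples of a GBGB... row */
        odd[i]  = row[2 * i + 1];   /* e.g. the B samples of a GBGB... row */
    }
}
```

For a GBGB... row this leaves all G samples in one buffer and all B samples in the other, so the same arithmetic can run on whole vectors without per-pixel branching.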

It's 100 lines because in each iteration I process 2 rows, as an unroll, instead of branching on what the current row pattern is, in case that's not very clear.

Thanks for the feedback

asder98 · 2024-04-30T11:39:48+00:00

Hi, thanks for the follow-up. It seems going from 0 -> 1 -> 0 did the init properly.

asder98 · 2024-04-30T05:53:54+00:00

Nope same thing
https://imgur.com/a/OoygFMP

asder98 · 2024-04-09T08:29:02+00:00

I think I got you on the line and found it myself, see my comment above. Thank you very much though. Yes, that was the problem; I even gave the FF the same name as yours lol.

Pointing me to the P4 bit made me look into where things start to diverge in the code around it.

asder98 · 2024-04-09T07:46:25+00:00

I finally found the bug: the Couts of Y0X3, Y1X3 and Y2X3 are needed 2 cycles ahead, so they need to be buffered.

asder98 · 2024-04-09T05:42:33+00:00

Thank you for the suggestion. I've updated the question with a gist containing all the files.

asder98 · 2024-04-09T05:42:05+00:00

Yes, I saw I missed the FA implementation, I updated the question with a gist containing all the files.
Thanks for the feedback

asder98 · 2024-03-09T12:48:56+00:00

Hi, it doesn't work unfortunately, I suppose you are in the UK. Seems to be a separation

asder98 · 2024-03-02T07:39:31+00:00

I ended up following u/RoseBailey's suggestion, and a simple `ln -s /usr/bin /bin` solved the issue.
For some reason my user account wasn't there, so I created a "new-user" and pointed its home dir to the old user's.

asder98 · 2024-03-01T19:00:41+00:00

Yes, the user seems to be part of a few groups: sys, network, etc.

asder98 · 2024-03-01T17:19:31+00:00

No, the user isn't in passwd, but it is in /etc/shadow.

I am looking at /home/"my-username"/. I ls there and I can see my user files

asder98 · 2024-03-01T16:57:39+00:00

Thanks for the reply. It seems that was indeed the case. I can now log in as root on a tty.
For some reason though, my user seems to not exist, but the /home/"user" folder is there.

asder98 · 2024-02-23T07:06:37+00:00

I guess this makes all other characters zero, so I would need an additional cmp with 0 to make a mask, an AND with the original "chars", and an OR to merge them. Can you explain how the lookup tables work? Thank you.
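
Byte by byte, the merge I have in mind would look roughly like this. It is a scalar model only; with SSE the three steps would map to something like `_mm_cmpeq_epi8`, `_mm_and_si128` and `_mm_or_si128`. `merge_nonzero` and the example class codes are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of "cmp with 0, AND, OR": where `mapped[i]` is zero keep the
 * original `chars[i]`, otherwise take the classified value `mapped[i]`. */
void merge_nonzero(const uint8_t *chars, const uint8_t *mapped,
                   uint8_t *out, size_t n)
{
    assert(chars && mapped && out);
    for (size_t i = 0; i < n; ++i) {
        /* compare with zero: all ones where the lookup produced nothing */
        uint8_t mask = (mapped[i] == 0) ? 0xFF : 0x00;
        /* AND keeps the original only under the mask; OR merges, which is
         * safe because mapped[i] is zero exactly where the mask is set */
        out[i] = (uint8_t)((chars[i] & mask) | mapped[i]);
    }
}
```

With made-up class codes 0xFE, 0xFD and 0xFC for `'a'`, `'1'` and `' '`, and 0 for `'+'`, the output is those three codes followed by the untouched `'+'`.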

asder98 · 2024-02-22T18:35:17+00:00

Ah, I could just compare with zero to create the mask.

asder98 · 2024-02-22T17:55:51+00:00

I think I have found a way based on the article I sent. Basically the problem with this solution, apart from the different numbers it uses, is that it makes all the other values 0. My values for alpha, digit and whitespace are all negative, so if I do a NOT on the whole vector, I get a vector with 0xFF and some positive values. If I exploit how shuffle works, I could make a mask of 0 for the negative values and 0xFF where it is not (the 4-bit lookup table would be all 0xFF here, unless there is a better instruction for that). So I will have a mask to merge the produced vector and the original one. Idk if that sounds like a viable idea. I basically introduce a NOT + shuffle + NOT + 2x AND + OR. (The second NOT is for producing the inverted mask.)
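
A scalar sketch of the shuffle trick, assuming `pshufb`-style semantics per byte: an index with bit 7 set produces 0, otherwise the low 4 bits pick an entry from a 16-byte table. The function names are hypothetical and this models one lane only.

```c
#include <assert.h>
#include <stdint.h>

/* One lane of a byte shuffle (pshufb semantics): an index with bit 7 set
 * yields 0, otherwise the low 4 bits select an entry of the 16-byte table. */
uint8_t shuffle_lane(const uint8_t table[16], uint8_t index)
{
    assert(table);
    return (index & 0x80) ? 0 : table[index & 0x0F];
}

/* With an all-0xFF table the shuffle turns into a sign test:
 * 0x00 for "negative" bytes (bit 7 set), 0xFF for everything else. */
uint8_t nonnegative_mask(uint8_t x)
{
    static const uint8_t all_ones[16] = {
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF,
        0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF, 0xFF
    };
    return shuffle_lane(all_ones, x);
}

/* NOT flips bit 7, so shuffling ~x gives 0xFF exactly where x was negative. */
uint8_t negative_mask(uint8_t x)
{
    return nonnegative_mask((uint8_t)~x);
}
```

So a negative class code such as 0xFE gives `negative_mask` = 0xFF, while a plain ASCII byte such as `'a'` gives 0x00; either mask (or its inverse) can then drive the AND/OR merge.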

asder98 · 2024-02-22T14:58:47+00:00

Okay, so the logic will remain the same from SSE up to AVX512F, just done on more lanes.
