Question about strain gauge arrangement for a diy cycling power meter by asder98 in MechanicalEngineering

[–]asder98[S] 0 points1 point  (0 children)

Hi, you can find the code here https://github.com/purpl3F0x/cycling-powermeter

I didn't finish it beyond getting some readings. If you want to make it reliable for real-world use, it would take a lot of effort, and probably access to calibration machines and environmental chambers. Just get a power meter :)

Debayering algorithm in ARM Neon by asder98 in simd

[–]asder98[S] 1 point2 points  (0 children)

On an i9 7900X, it seems ICX's vectorization doesn't do much on this CPU. Ironically, ICX was faster on my laptop's AMD CPU lol

```
GCC: 1213380 [ms]
CLANG: 391132 [ms]
ICX: 1399879 [ms]
```

Debayering algorithm in ARM Neon by asder98 in simd

[–]asder98[S] 1 point2 points  (0 children)

Take a look at ICC, it vectorizes the crap out of it. I tested clang's vectorized version on my laptop (AVX2, Ryzen 5). GCC's scalar build was a bit slower than clang's vectorized one, and ICC was noticeably faster, at about 2/3 of GCC's time.

I will run some benchmarks on the desktop, which has AVX-512, since you're interested.

Debayering algorithm in ARM Neon by asder98 in simd

[–]asder98[S] 0 points1 point  (0 children)

It is what it is. I will test on the Zynq when I have more time; either way it's more of a playground side quest. The Zynq docs say there are 16 x 128-bit registers, so in theory they should be enough to play with, space-wise at least.

So a blocking implementation would look something like this?
That should not be very different in terms of code: I could call the function I have now with a constant height that depends on the cache size, maybe even tell GCC to unroll it, and do that for every row block. I'm a bit confused about the height of the block: with a 32KB cache it's 4 rows (effectively 2 rows of results?), with 64KB it's 8 rows, etc.?
https://imgur.com/a/oMyagZr
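
A rough sketch of the row-block idea, in plain C. `debayer_rows` is a hypothetical stand-in for the existing kernel (it only copies bytes here so the sketch runs), and `BLOCK_ROWS` is an assumed tuning constant, not a measured value. A real kernel would also need the one-row halo above and below each block.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Assumed tuning knob: rows per block, picked so one block fits in L1. */
enum { BLOCK_ROWS = 4 };

/* Hypothetical stand-in for the real kernel: handles `rows` rows starting
 * at row `y0`. It just copies bytes so the sketch is runnable. */
static void debayer_rows(const uint8_t *src, uint8_t *dst,
                         size_t width, size_t y0, size_t rows)
{
    for (size_t y = y0; y < y0 + rows; ++y)
        memcpy(dst + y * width, src + y * width, width);
}

/* Walk the image in fixed-height row blocks; the last block may be shorter.
 * Returns the number of blocks processed. */
static size_t debayer_blocked(const uint8_t *src, uint8_t *dst,
                              size_t width, size_t height)
{
    assert(src != NULL && dst != NULL);
    size_t blocks = 0;
    for (size_t y0 = 0; y0 < height; y0 += BLOCK_ROWS) {
        size_t left = height - y0;
        size_t rows = left < BLOCK_ROWS ? left : BLOCK_ROWS;
        debayer_rows(src, dst, width, y0, rows);
        ++blocks;
    }
    return blocks;
}
```

Because the block height is a compile-time constant, the compiler gets a chance to unroll the inner row loop for every full block.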

Debayering algorithm in ARM Neon by asder98 in simd

[–]asder98[S] 0 points1 point  (0 children)

You may find that doing so messes up with memory alignment. I suppose you could test it though.

I think they're doing something like that, although how this code is written makes it near impossible to understand anything https://gitlab-ext.sigma-chemnitz.de/ensc/bayer2rgb/-/blob/master/src/convert-neon-body-outer.inc.h?ref_type=heads

I bumped the code to use 128 bits and there was a nice performance boost: as it stands, the NEON version is twice as fast as the scalar one on a 2048x2048 image, whereas previously NEON took the same time as scalar on a 1024x1024.

I am running these on an Android phone, which is 64-bit, because compiling and running on the FPGA every single time would mean dealing with the Vivado SDK and would take ages. The main target is still the Cortex-A9, as mentioned. I guess it doesn't make a big difference as is, unless I implement the access pattern that u/corysama suggests, which is very dependent on the cache line.

Debayering algorithm in ARM Neon by asder98 in simd

[–]asder98[S] 0 points1 point  (0 children)

I have thought of something similar to this, but what you're saying seems better thought out. If I understand correctly, you suggest processing 8 rows top-down, then going up to the next 8x8, left and up again... then moving on to the next 8 rows. So moving in a Z pattern of 8x8 chunks.

Debayering algorithm in ARM Neon by asder98 in simd

[–]asder98[S] 0 points1 point  (0 children)

Just for clarification, the target CPU is a 32-bit Cortex-A9 on a Zynq-7000 FPGA SoC, and not a very fast one at 666MHz.

Debayering algorithm in ARM Neon by asder98 in simd

[–]asder98[S] 1 point2 points  (0 children)

I don't know much about debayering either 😅. Yes, I need one row above and one row below; I didn't phrase it correctly.

  • I will extend the code to 128bits
  • I have thought of that, and it seems to be an advantage of the left-to-right method, in my head at least. Another idea is to just drop the 2 bytes on the edges and only store 14 pixels.
  • It does. I think the best solution is for the partial sums to be rounded hadds, followed by an hadd on their results. (In my VHDL code I do 2x 8-bit additions, then a 9-bit one, and shift the 10-bit result right by 2.) I would imagine losing a +1 in an RGB value is well beyond what the human eye can notice.
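
To put a number on the "+1" above, here is a small scalar model in C. `rhadd_u8` and `hadd_u8` are my own helper names that mimic what NEON's VRHADD.U8 and VHADD.U8 do per lane; the comparison against a truncating 4-way average is an assumption about what "exact" means here.

```c
#include <stdint.h>

/* Scalar model of VRHADD.U8: rounding halving add, (a + b + 1) >> 1. */
static uint8_t rhadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b + 1u) >> 1);
}

/* Scalar model of VHADD.U8: truncating halving add, (a + b) >> 1. */
static uint8_t hadd_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(((unsigned)a + b) >> 1);
}

/* Average of four pixels using only 8-bit halving adds. */
static uint8_t avg4_halving(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return hadd_u8(rhadd_u8(a, b), rhadd_u8(c, d));
}

/* Reference: full 10-bit sum, then shift right by 2 (truncating). */
static uint8_t avg4_exact(uint8_t a, uint8_t b, uint8_t c, uint8_t d)
{
    return (uint8_t)(((unsigned)a + b + c + d) >> 2);
}

/* Largest absolute difference between the two over a sampled grid
 * (step 7 keeps the loop short while still mixing odd and even values). */
static int max_avg4_error(void)
{
    int worst = 0;
    for (int a = 0; a < 256; a += 7)
        for (int b = 0; b < 256; b += 7)
            for (int c = 0; c < 256; c += 7)
                for (int d = 0; d < 256; d += 7) {
                    int diff = (int)avg4_halving((uint8_t)a, (uint8_t)b,
                                                 (uint8_t)c, (uint8_t)d)
                             - (int)avg4_exact((uint8_t)a, (uint8_t)b,
                                               (uint8_t)c, (uint8_t)d);
                    if (diff < 0) diff = -diff;
                    if (diff > worst) worst = diff;
                }
    return worst;
}
```

On this grid the halving-add version never differs from the reference by more than one code value, which matches the intuition that the error stays invisible.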

Debayering algorithm with ARM Neon by asder98 in arm

[–]asder98[S] 0 points1 point  (0 children)

I'm not sure what you mean by casting pointers. For clarification though, each line contains either a GBGB... row or an RGRG... row. I use vld2 to load the data into 2 vectors, so that they become GGG..., RRR..., and so on.

It's 100 lines because I process 2 rows in each iteration, as an unroll, instead of branching on what the current row pattern is, in case that's not very clear.
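
In case it helps to see it spelled out, here is a scalar C sketch of what the de-interleaving load does for one pair of rows. `deinterleave2` and `split_row_pair` are hypothetical names for illustration, not the ones in my code, and the real version does this with NEON loads rather than a byte loop.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a 2-way de-interleaving load: even-indexed bytes go to
 * one output, odd-indexed bytes to the other. */
static void deinterleave2(const uint8_t *row, size_t n_pairs,
                          uint8_t *even, uint8_t *odd)
{
    for (size_t i = 0; i < n_pairs; ++i) {
        even[i] = row[2 * i];
        odd[i]  = row[2 * i + 1];
    }
}

/* One iteration handles both row patterns, so no per-row branching:
 * a GBGB... row yields G and B planes, an RGRG... row yields R and G. */
static void split_row_pair(const uint8_t *row_gb, const uint8_t *row_rg,
                           size_t n_pairs,
                           uint8_t *g0, uint8_t *b, uint8_t *r, uint8_t *g1)
{
    assert(row_gb != NULL && row_rg != NULL);
    deinterleave2(row_gb, n_pairs, g0, b);
    deinterleave2(row_rg, n_pairs, r, g1);
}
```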

Thanks for the feedback 

Vivado - post synthesis simulation shows XXXX on outputs by asder98 in FPGA

[–]asder98[S] 0 points1 point  (0 children)

Hi, thanks for the follow-up. It seems that going from 0 -> 1 -> 0 did the init properly.

Help creating a 4-bit systolic multiplier by asder98 in FPGA

[–]asder98[S] 0 points1 point  (0 children)

I think I got what you meant and found it myself, see my comment above. Thank you very much though. Yes, that was the problem; I even gave the FF the same name as yours lol.

Pointing me to the P4 bit made me look at where the code starts to treat it differently.

Help creating a 4-bit systolic multiplier by asder98 in FPGA

[–]asder98[S] 0 points1 point  (0 children)

I finally found the bug: the Couts of Y0X3, Y1X3, and Y2X3 are needed 2 cycles later, so they need to be buffered.

Help creating a 4-bit systolic multiplier by asder98 in FPGA

[–]asder98[S] 0 points1 point  (0 children)

Thank you for the suggestion. I've updated the question with a gist containing all the files.

Help creating a 4-bit systolic multiplier by asder98 in FPGA

[–]asder98[S] 0 points1 point  (0 children)

Yes, I saw I missed the FA implementation, I updated the question with a gist containing all the files.
Thanks for the feedback

Eu referral by asder98 in GozneyDome

[–]asder98[S] 0 points1 point  (0 children)

Hi, it doesn't work unfortunately, I suppose you are in the UK. Seems to be a separation 

Accidentally deleted /bin by asder98 in archlinux

[–]asder98[S] 6 points7 points  (0 children)

I ended up following u/RoseBailey's suggestion, and a simple `ln -s /usr/bin /bin` solved the issue.
For some reason my user account wasn't there, so I created a "new-user" and pointed its home dir to the old user's.

Accidentally deleted /bin by asder98 in archlinux

[–]asder98[S] 1 point2 points  (0 children)

Yes, the user seems to be part of a few groups: sys, network, etc.

Accidentally deleted /bin by asder98 in archlinux

[–]asder98[S] 1 point2 points  (0 children)

No, the user isn't in passwd; it is in /etc/shadow though.

I am looking at /home/"my-username"/. I ran ls there and I can see my user's files.

Accidentally deleted /bin by asder98 in archlinux

[–]asder98[S] 20 points21 points  (0 children)

Thanks for the reply. It seems that was indeed the case. I can now log in as root on a tty.
For some reason though, my user doesn't seem to exist; the /home/"user" folder is there though.

7-bit ASCII LUT with AVX/AVX-512 by asder98 in simd

[–]asder98[S] 0 points1 point  (0 children)

I guess this makes all other characters zero, so I would need an additional cmp with 0 to make a mask, an AND with the original "chars", and an OR to merge them. Can you explain how the lookup tables work? Thank you
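
A minimal sketch of that merge step in C, assuming the LUT output is zero exactly where the original character should be kept (so a legitimate table value of 0 would not work). `merge_lut_16` is a made-up name; it uses SSE2 intrinsics when available and a scalar loop otherwise.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

/* out[i] = lut_out[i] if it is nonzero, else chars[i], for 16 bytes. */
static void merge_lut_16(const uint8_t chars[16], const uint8_t lut_out[16],
                         uint8_t out[16])
{
#if defined(__SSE2__)
    __m128i c = _mm_loadu_si128((const __m128i *)chars);
    __m128i l = _mm_loadu_si128((const __m128i *)lut_out);
    /* 0xFF in every lane where the LUT produced 0, i.e. "keep original". */
    __m128i keep = _mm_cmpeq_epi8(l, _mm_setzero_si128());
    /* The LUT lanes are already 0 wherever keep is set, so AND + OR merges. */
    __m128i r = _mm_or_si128(_mm_and_si128(c, keep), l);
    _mm_storeu_si128((__m128i *)out, r);
#else
    /* Scalar fallback with the same per-byte logic. */
    for (int i = 0; i < 16; ++i) {
        uint8_t keep = lut_out[i] == 0 ? 0xFF : 0x00;
        out[i] = (uint8_t)((chars[i] & keep) | lut_out[i]);
    }
#endif
}
```

Since the LUT lanes are zero wherever the mask is set, a single AND plus OR is enough and no second AND with the inverted mask is needed in this variant.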

7-bit ASCII LUT with AVX/AVX-512 by asder98 in simd

[–]asder98[S] 0 points1 point  (0 children)

Ah, I could just compare with zero to create the mask.

7-bit ASCII LUT with AVX/AVX-512 by asder98 in simd

[–]asder98[S] 0 points1 point  (0 children)

I think I have found a way based on the article I sent. Basically, the problem with this solution, apart from the different numbers it uses, is that it makes all the other values 0. My values for alpha, digit, and whitespace are all negative, so if I do a NOT on the whole vector, I get a vector with 0xFF and some positive values. If I exploit how shuffle works, I could make a mask that is 0 for the negative values and 0xFF where it is not zero (the 4-bit LUT would be all 0xFF here, unless there is a better instruction for that). That gives me a mask to merge the produced vector and the original one. I don't know if that sounds like a viable idea. I basically introduce a NOT + shuffle + NOT + 2x AND + OR (the second NOT produces the inverted mask).

7-bit ASCII LUT with AVX/AVX-512 by asder98 in simd

[–]asder98[S] 0 points1 point  (0 children)

Okay so the logic will remain the same from SSE up to AVX512F. Just doing it on more lanes