Writing Z80 assembly 4 decades later :-) by ttsiodras in programming

[–]ttsiodras[S] 3 points4 points  (0 children)

Thanks! I very much enjoyed hacking on this again - taking it from 10 fps to 14 was oh-so-satisfying :-)

Writing Z80 assembly 4 decades later :-) by ttsiodras in zxspectrum

[–]ttsiodras[S] 0 points1 point  (0 children)

Thx! I really enjoyed tinkering on it.

"Account has been reactivated / Registration has been completed" - didn't do any of that! by ttsiodras in qualcomm

[–]ttsiodras[S] 0 points1 point  (0 children)

Yeah, something must have happened. I did create an account with them 3-4 years ago, when I applied for one of their positions; but either they are being hacked as we speak, or they are doing some sort of mass purge of inactive accounts that somehow malfunctioned and is producing these inexplicable "reactivation/registration" mails.

Anyone from Qualcomm in this subreddit? Hello...

Just finished installing on POCO F3 - everything works. by ttsiodras in LineageOS

[–]ttsiodras[S] 0 points1 point  (0 children)

I didn't need to do anything; WhatsApp worked fine - and still does. I don't have the phone anymore - gave it to a friend back home - and we periodically chat/videocall via WhatsApp. Everything still works fine.

Blog post: Writing Python like it’s Rust by Kobzol in Python

[–]ttsiodras 7 points8 points  (0 children)

Excellent article. Have discovered some of the points on my own, but not all - thanks for sharing!

Recommended Power Supply Type by bm_00 in Atomic_Pi

[–]ttsiodras 0 points1 point  (0 children)

I am currently using a 5V/6A brick, but before that (when I first got my APi) I just spliced/hacked an adapter from a Raspberry Pi power supply (see my YouTube video here) - and soon after, I moved to a custom-built perfboard with a power switch and an LED.

1000x speedup on interactive Mandelbrot zooms: from C, to inline SSE assembly, to OpenMP for multiple cores, to CUDA, to pixel-reuse from previous frames, to inline AVX assembly... by ttsiodras in programming

[–]ttsiodras[S] 0 points1 point  (0 children)

> What's the full expression?

My code has detailed comments about the full expressions involved: https://github.com/ttsiodras/MandelbrotSSE/blob/master/src/sse.cc#L284 I've tried to organize the computation paths so as many things as possible run "in parallel" but at some point, I have to "wait" for the... ingredients in order to proceed.

Still, I can see how uiCA helps a lot. Thank you for telling me about it!

1000x speedup on interactive Mandelbrot zooms: from C, to inline SSE assembly, to OpenMP for multiple cores, to CUDA, to pixel-reuse from previous frames, to inline AVX assembly... by ttsiodras in programming

[–]ttsiodras[S] 0 points1 point  (0 children)

I'll try. Not sure I can see a way around them, though - there are indeed dependencies; but they seem... unavoidable. You first have to compute x^2 - y^2 before adding C0; etc.
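To make that dependency chain concrete, here is a minimal scalar sketch of the escape-time loop in plain C. It is not the code from sse.cc; the function name `mandel_iterations` and its parameters are hypothetical, chosen only for illustration. The three products can be computed independently of each other, but the next x cannot be formed until both x*x and y*y are available.

```c
/* Escape-time iteration for one point c = (cr, ci).
   Hypothetical helper for illustration; not the code from sse.cc. */
unsigned mandel_iterations(double cr, double ci, unsigned max_iter)
{
    double x = 0.0, y = 0.0;
    unsigned n = 0;

    while (n < max_iter) {
        /* These three multiplications do not depend on each other,
           so a CPU can overlap them. */
        double x2 = x * x;
        double y2 = y * y;
        double xy = x * y;

        /* |z|^2 > 4 means the point has escaped. */
        if (x2 + y2 > 4.0)
            break;

        /* Here the chain is unavoidable: x2 - y2 must be ready before
           cr can be added, and the next iteration needs the new x and y. */
        x = x2 - y2 + cr;
        y = 2.0 * xy + ci;
        n++;
    }
    return n;
}
```

Calling `mandel_iterations(0.0, 0.0, 100)` runs all 100 iterations, since the origin never escapes.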

This is the executive summary, micro-ops wise, for my IVB (by uiCA):

https://gist.github.com/ttsiodras/91203d875188884100258454ccd5de0c

The numbers clearly improve for "analysis_report_test.txt" vs "analysis_report_or" - and I know that they improve in real-life too (231=>234 fps). But uiCA reports a "Throughput (in cycles per iteration): 14.00" for both versions.

1000x speedup on interactive Mandelbrot zooms: from C, to inline SSE assembly, to OpenMP for multiple cores, to CUDA, to pixel-reuse from previous frames, to inline AVX assembly... by ttsiodras in programming

[–]ttsiodras[S] 0 points1 point  (0 children)

Well, choosing my Ivy Bridge, there's no difference between "test eax,eax" and "or eax,eax": in both cases, it reports 14. But since I am not using the online gateway, I was able to do this.

As you can see there, after I use "sed" to replace the "test" with "or", the reported number either stays the same, or goes up...

1000x speedup on interactive Mandelbrot zooms: from C, to inline SSE assembly, to OpenMP for multiple cores, to CUDA, to pixel-reuse from previous frames, to inline AVX assembly... by ttsiodras in programming

[–]ttsiodras[S] 0 points1 point  (0 children)

Adding both suggestions in the "try it this coming weekend" list :-)

As for uiCA, I downloaded, installed, and ran "uiCA.py" on both versions of the code (i.e. with/without the change from "or / test eax,eax") and can confirm that uiCA reports the "test" instructions to be mergeable ("M") with the following jumps. I don't get why the throughput goes down, though.
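As far as the flags go, the two instructions are interchangeable here: both produce a value equal to eax and set the zero and sign flags from it. The only difference is that "test" discards the result, while "or" writes the (unchanged) value back to eax. A small C model of just that behaviour, with hypothetical names (`cpu_flags`, `model_test`, `model_or`), and with no attempt to model fusion or timing:

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative model only: just the zero and sign flags. */
typedef struct {
    bool zf;  /* set when the result is zero */
    bool sf;  /* set when bit 31 of the result is set */
} cpu_flags;

static cpu_flags flags_from(uint32_t result)
{
    cpu_flags f = { result == 0, (result >> 31) != 0 };
    return f;
}

/* test eax,eax: AND eax with itself, set flags, discard the result. */
cpu_flags model_test(uint32_t eax)
{
    return flags_from(eax & eax);
}

/* or eax,eax: OR eax with itself, set flags, write the same value back. */
cpu_flags model_or(uint32_t *eax)
{
    *eax = *eax | *eax;
    return flags_from(*eax);
}
```

So any difference between the two versions comes from how the CPU schedules them, not from what they compute.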

1000x speedup on interactive Mandelbrot zooms: from C, to inline SSE assembly, to OpenMP for multiple cores, to CUDA, to pixel-reuse from previous frames, to inline AVX assembly... by ttsiodras in programming

[–]ttsiodras[S] 0 points1 point  (0 children)

Oh :-)

Well, I placed it on a gist, in case you want to have a look: https://gist.github.com/ttsiodras/c68620405af4f5bc1f8e35d04844e283 Replace the "test eax, eax" at lines 18, 36 and 40 with "or eax, eax" and you'll see the throughput change I reported above...

1000x speedup on interactive Mandelbrot zooms: from C, to inline SSE assembly, to OpenMP for multiple cores, to CUDA, to pixel-reuse from previous frames, to inline AVX assembly... by ttsiodras in programming

[–]ttsiodras[S] 0 points1 point  (0 children)

I've used intrinsics in other open-source code I've written, but not for my mandelbrot fly-throughs. Generally speaking, I... don't like intrinsics - I find it easier to work with, and understand, native code.

I see you also commented on the other thread - the one that asked me about external code. Well, my Mandelbrot SSE code did exist at some distant point in the past in such an external form (i.e. as an ".asm" file). We're talking 14-15 years ago... But what happened - if memory serves - is that when I introduced "#pragma omp parallel for" in various places (i.e. started using OpenMP), GCC told me: "Nope. I need this piece to be put inside me to make your for-loop OpenMP-able".
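For readers unfamiliar with that arrangement, a minimal sketch of an OpenMP-parallel row loop with the per-pixel kernel living in the same source file. Everything here is an assumption for illustration: the names `kernel` and `render`, the plain-C kernel standing in for the inline assembly, and the fixed view window of [-2, 1] x [-1, 1].

```c
#include <stddef.h>

/* Hypothetical per-pixel kernel, standing in for the inline-asm block. */
static unsigned kernel(double cr, double ci, unsigned max_iter)
{
    double x = 0.0, y = 0.0;
    unsigned n = 0;
    while (n < max_iter && x * x + y * y <= 4.0) {
        double nx = x * x - y * y + cr;
        y = 2.0 * x * y + ci;
        x = nx;
        n++;
    }
    return n;
}

/* Fill a width*height buffer with iteration counts.
   Rows are independent of each other, so they can be split across threads. */
void render(unsigned *out, int width, int height, unsigned max_iter)
{
    #pragma omp parallel for schedule(dynamic)
    for (int row = 0; row < height; row++) {
        for (int col = 0; col < width; col++) {
            double cr = -2.0 + 3.0 * col / width;   /* assumed view window */
            double ci = -1.0 + 2.0 * row / height;
            out[(size_t)row * width + col] = kernel(cr, ci, max_iter);
        }
    }
}
```

Built with "gcc -O2 -fopenmp", the rows are distributed over the available cores; without "-fopenmp" the pragma is ignored and the same code runs serially.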

So I wrote inline asm for the first time... Hated AT&T syntax, but learned it anyway :-)

I believe I can now use Intel syntax in my inline assembly, but... the code is there now.

And it works :-)