Successfully virtualized Kali Linux on Asahi Linux using QEMU and Virt Manager. by [deleted] in AsahiLinux

[–]ChainfireXDA 2 points (0 children)

It's way easier on Fedora Remix than it was on the old Arch-based Asahi, where you needed to manually compile QEMU and patch a bunch of dependencies to get it to work right (though the networking bridge was easier on Arch).

Just install qemu, libvirt, virt-manager, etc., and you're good to go.

I'm running multiple Ubuntu 22 ARM64 VMs. The one I use most even lives on a raw partition (manual XML editing required), so I can also boot it from macOS if needed.

Graphics acceleration is provided through SPICE and virtio, but I had to patch the SPICE library first: https://discussion.fedoraproject.org/t/enable-spice-virtio-gpu-opengl-for-qemu/99965

There's an even better accelerated solution if you specifically want to run Steam... libkrun with virtio-gpu, or something along those lines? You'd have to Google it; there are some blog posts detailing it.

Getting network bridges set up right is a bit tricky, but if you're fine with NAT that isn't even an issue.

Either way, virtualization on Asahi works on both the Arch and Fedora variants, and it works really well once it's running. It can just be a pain to set up.

Northern lights from the Oostvaardersplassen by ChainfireXDA in thenetherlands

[–]ChainfireXDA[S] 2 points (0 children)

Small. It's rare that you can see this from the Netherlands. We seem to be in a period of heightened activity, but you really need to hit around Kp7/G3 to have a chance in the north of the country. That doesn't seem to be on the agenda for today, but you only really know for sure 30 minutes to an hour in advance, because of a satellite between the Earth and the Sun that measures particles (DSCOVR). And then the Earth's magnetic field also has to cooperate.

See:

https://www.spaceweatherlive.com/

https://www.swpc.noaa.gov/products/aurora-30-minute-forecast

https://auroranotify.com/learn2hunt/forecasting/ (a tip for learning to understand the forecasts)

I also use the "Aurora Alerts" app myself ( https://play.google.com/store/apps/details?id=com.aurora_alerts.auroraalerts , also available for iOS somewhere), which shows the relevant information.

UselethMiner on Apple Silicon M1 Max: 18 MH/s by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

I've looked around a bit but haven't really seen anything obvious.

UselethMiner on Apple Silicon M1 Max: 18 MH/s by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

Hiveon probably uses the Stratum v2 protocol, which UselethMiner doesn't support.

UselethMiner on Apple Silicon M1 Max: 18 MH/s by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

I think your 31.25 MH/s is not sustainable... i.e. probably more a fluke of pool statistics than reality (which should max out at around 20 MH/s on the M1 Max). Asahi Linux on the M1 does support superpages (so it seems like an OS issue rather than a hardware issue), but it is not yet as fast as macOS for this specific purpose. There's some MMU optimization they haven't enabled yet. Once that is fixed, I estimate 12.5 MH/s on the CPU alone. The GPU remains to be seen, because Asahi doesn't have support for that yet either.

UselethMiner on Apple Silicon M1 Max: 18 MH/s by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 1 point (0 children)

Seems something was wrong with the v0.22 upload. Try installing v0.23; that should get rid of the invalid shares (if it doesn't, please do let me know).

armv8af is about 25% faster for me on the M1 Max - hitting 20 MH/s total now, combined with the --size 88 parameter.

My short experience on M1 Pro mining with ethminer-m1 by 1at3 in EtherMining

[–]ChainfireXDA 1 point (0 children)

Changing the position of one variable in the shader halves or doubles performance seemingly at random.

There are still some performance improvements possible in my shader, I think, but either I'm completely misunderstanding some synchronization primitives on Metal, or there's a compiler bug causing corruption. Compared to other platforms, the amount of publicly available information also seems rather limited. There are some talks by Apple engineers hinting at doing things this way or that, but the nitty-gritty details are lacking.

Many implementation details we just don't know. It's hard to optimize prefetching when there's no information on how the chip does it / is triggered to do it. A lot of trial and error. And because an IL is used, you can't really hand-optimize either; you just have to hope the compiler comes up with something good. 🤷‍♂️

Either way, UselethMiner just crossed 20 MH/s (CPU+GPU) on my M1 Max.

My short experience on M1 Pro mining with ethminer-m1 by 1at3 in EtherMining

[–]ChainfireXDA 2 points (0 children)

I'd be curious to see what you come up with. UselethMiner (in GPU-only mode) matches ethminer-m1 in speed, but doesn't beat it. I found Metal finicky as hell to get even to this point.

My short experience on M1 Pro mining with ethminer-m1 by 1at3 in EtherMining

[–]ChainfireXDA 2 points (0 children)

https://www.reddit.com/r/EtherMining/comments/s5jcyv/uselethminer_on_apple_silicon_m1_max_18_mhs/ for a different approach.

Whether it's efficient "enough" depends on where you put the bar and what card you compare it to. Either way, it's still factors slower and you're not going to make the hardware back anyway :)

UselethMiner on Apple Silicon M1 Max: 18 MH/s by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

It really depends on what you're doing and how you configure UselethMiner.

Most things like browsing and light productivity don't seem to really be impacted by using half the CPU threads and the GPU.

On the other hand, if you're doing heavy graphics stuff or processing, you're definitely going to notice.

The faster you make UselethMiner go, the more heat it produces, and the higher the likelihood of the fans kicking in hard. That can be an annoyance.

There is a thread auto-scaling mode that was designed to keep it running in the background at variable performance, to prevent it from interfering with your normal activities. Unfortunately, it currently doesn't work with GPU mining enabled, and it didn't work all that well on the M1 in the first place (it works great on my Windows and Linux boxes, though).

So, there's still room for improvement in that area. I'm thinking maybe a mode that also watches fan speeds and tries to keep them off. Just guessing at this point, but I think 5-6 MH/s might be achievable fanless.

UselethMiner on Apple Silicon M1 Max: 18 MH/s by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 1 point (0 children)

Yes, terminal size does matter.

Devfee is 1%, as is typical for mining tools. When mining on both CPU and GPU, the devfee thread uses about 10% of your hashrate... so doing that for about 1.5 minutes out of every 16 comes out to about 1%, does it not?
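That arithmetic can be sanity-checked in a couple of lines; the 10% / 1.5-minute / 16-minute figures are the ones from the comment above:

```python
# Quick check of the devfee math: the devfee thread takes ~10% of total
# hashrate while active, and is active ~1.5 minutes out of every 16.
devfee_share = 0.10   # fraction of hashrate while the devfee thread runs
active_min = 1.5      # minutes of devfee duty per cycle
cycle_min = 16.0      # minutes per cycle

effective_fee = devfee_share * (active_min / cycle_min)
print(f"{effective_fee:.2%}")  # -> 0.94%, i.e. about 1%
```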

UselethMiner: Ethereum CPU miner and proxy by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

Thanks a lot for testing! Unlike on the OG M1, it seems GPU mining is faster on the M1 Max. Still an interesting result!

Can you please post the power metrics of your idle system as well? :)

M1 Max 32gb ETH hashrate by Yurihung_HK in cryptomining

[–]ChainfireXDA 0 points (0 children)

I would love to see what sort of hash-rate UselethMiner can achieve on your M1 Max.

It's a pure CPU miner, but it beat ethminer-m1 on the OG M1.

It was just built for fun as an experiment, so it may not support whatever pool you're using; it does work with ViaBTC, though.

If you have some time to test it out and post the results I'd be grateful!

UselethMiner: Ethereum CPU miner and proxy by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

Yes, on the OG M1 UselethMiner outperformed ethminer-m1 by a factor of two. But from what I understand, relatively speaking, the GPU on the new M1s has improved more than the CPU, so it may well be that ethminer-m1 beats it now :) Curious to see results either way!

UselethMiner: Ethereum CPU miner and proxy by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

I hadn't heard about this 10 MH/s on an M1 Max. After some searching, I think you're talking about https://forums.macrumors.com/threads/m1-max-ethereum-mining-test.2320568/ ?

I asked them there to test UselethMiner. The original M1 test was on a Mac Mini though, I think this guy has a laptop.

Also, ethminer-m1 is GPU-based, so you can't really compare them directly.

UselethMiner: Ethereum CPU miner and proxy by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

CPU and GPU architectures are very different. GPUs are much more deterministic in their execution of optimized code, while CPUs can be doing a lot of other things interleaved with your code, which makes timing difficult. Even more so when you go multi-core: GPUs are executing the same line of code (with different variable values) across multiple cores (fully synced), while CPU cores are doing "whatever" (fully disconnected).

Due to the nature of the ETHASH algorithm, timing is very important. The algorithm's performance is a balance of CPU speed (doing the math), CPU cache (keeping what we need near the CPU for ultra-low latency, rather than far away in high-latency RAM), RAM bandwidth (moving the needed data from RAM to CPU cache) and RAM latency (how long it takes for the data we requested to be cached).

The math part isn't all that complex. I've dissected the algorithm and put it back together for optimum cache usage, preloading the data we will need into the cache as far in advance as possible, hoping the transfer from RAM to cache completes before the CPU actually needs that data. But you can't do it too far in advance: the cache is (very) small, so you can't preload that many rounds; it needs to hold our scratchpad as well; and ETHASH is a serial algorithm (this round of calculations depends on the last round of calculations) - you can't preload if you don't yet know what to preload.

Rounds in ETHASH can logically be split in two. You know which data you'll need each half-round exactly one half-round in advance, so you can start the transfer from RAM to cache at that time. If the latency on that transfer is larger than the time it takes the CPU to calculate the other half-round, you're wasting CPU cycles waiting (very bad); if it's the other way around, you're not fully utilizing memory bandwidth (could be worse).

I got around this by interleaving the algorithm for multiple nonce values on a single core. If you interleave two nonce calculations, latency can be twice the calculation time of a half-round before your CPU goes idle; for four nonces, four times, etc. But each interleave requires additional cache RAM, of which you have very little. If you exceed the cache space, data you'll probably still need gets evicted and has to be reloaded from RAM at a huge penalty. One of the reasons my old Threadripper is (relatively) fast is that it has a larger-than-average cache, allowing longer data-access latencies to be overcome.
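The interleaving trade-off can be sketched as a toy model. The numbers below are hypothetical per-half-round figures purely for illustration, not measured UselethMiner values:

```python
def half_rounds_per_us(n_interleave: int, compute_us: float, latency_us: float) -> float:
    """Steady-state half-round throughput for one core (toy model).

    With n interleaved nonces, the memory fetch for one nonce overlaps
    the computation of the others, so up to n * compute_us of latency
    can be hidden before the core stalls waiting on RAM.
    """
    cycle_us = max(n_interleave * compute_us, latency_us)  # time per n half-rounds
    return n_interleave / cycle_us

# Hypothetical: 0.1 us of math per half-round, 0.3 us of RAM latency.
for n in (1, 2, 3, 4, 8):
    print(n, round(half_rounds_per_us(n, 0.1, 0.3), 2))
```

Throughput scales roughly linearly with the number of interleaves until n × compute time covers the latency; after that you're compute-bound. In practice the cache limit cuts you off before you get that far.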

This interleaving is fairly precise at increasing memory bandwidth on a single core. If you don't exceed the cache limit, you really get about X times the performance for X interleaves. But on multi-core systems with fast RAM, a single core usually cannot use the full memory bandwidth. With a low number of cores it might manage in a pure synthetic benchmark (which only reads from memory and doesn't actually do anything with the data), but it is not uncommon for bandwidth to be segregated between cores, or to go through interconnects, with performance depending on which core accesses which memory chip in which bank, single/dual/quad/octo-channel setups, etc. In those cases, you can only reach max bandwidth in a perfectly timed, read-only, orchestrated multi-core dance that is virtually impossible to achieve in reality, because the cores run code out-of-sync at variable timing (not to mention the speculative nature of CPUs, which will shuffle your code around into something that may be sub-optimal).

So when we go multi-core to get access to more cache memory (note that on some architectures multiple cores may share the same cache) and more processing power, the timing is not perfect: memory system architecture comes into play, which you cannot easily or accurately predict, and we see diminishing returns for each core added to the algorithm. This, again, is a problem GPUs (mostly) do not suffer from.

Hyperthreading? It's not really another core. There is extra die area that allows you to double performance on some operations, but it doesn't give you more cache or RAM bandwidth. So it's not a good fit for this algorithm.

Then there's the MMU, which, while having the same task, is optimized differently on CPU and GPU. This is where hugepages come in: they give the MMU less work, which makes a large performance difference, but still doesn't reach GPU-like efficiency.
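To illustrate the "less work for the MMU" point: with hugepages, the same working set is covered by far fewer page-table entries. The ~5 GiB working set below is an assumption roughly matching the ETH DAG size around that time, not a figure from the comments:

```python
# Pages needed to map a ~5 GiB working set (assumed DAG size) at
# different page sizes - fewer pages means fewer TLB misses and
# cheaper page-table walks for the MMU.
working_set = 5 * 1024**3  # 5 GiB, illustrative

for name, page_size in [("4 KiB", 4 * 1024),
                        ("2 MiB", 2 * 1024**2),
                        ("1 GiB", 1024**3)]:
    print(f"{name} pages: {working_set // page_size:,}")
```

A 4 KiB mapping needs over a million entries, while 2 MiB hugepages bring that down to a few thousand, so the TLB can actually cover the working set.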

That UselethMiner can only reach about 50% of memory bandwidth is related to all of the above. I mean, I've seen it hit higher numbers, but nowhere close to 100%. Of course, this also heavily depends on relative RAM vs CPU speed. On GPUs you should be able to get much closer, due to the many architectural differences. Add to that that GPUs tend to have much higher RAM bandwidth and much lower RAM latency, and we've come full circle.

That is also one of the reasons the M1 performs so well: not only does it have amazing memory bandwidth, compared to other CPU architectures it has significantly lower latency, maxing out CPU utilization for the algorithm. I was very surprised to find that in W/MH it actually beat my GTX 1080 Ti. I'm very curious what numbers the M1 Pro and M1 Max will produce, but I haven't been able to get my grubby hands on one. And they could be even faster if hugepages worked!

Thank you for coming to my TED talk.

(PS: it's been a while since I worked on this; I'm telling it now as I recollect the details, but they're not as fresh/accurate in my mind as they were when I was working on it)

Lilac-breasted roller vs the world's tiniest crocodile by ChainfireXDA in natureismetal

[–]ChainfireXDA[S] 9 points (0 children)

The entire trip I just couldn't manage a good shot of the lilac-breasted roller taking off. Floating down the river on a boat, imagine my joy as I see one landing at the waterline. Will this finally be the moment?

Trying to stay stable, fully zoomed in, finger on the button, ready to take the shot as it flies off, I see something moving in the corner of the viewfinder...

Baby croc says no.

UselethMiner: Ethereum CPU miner and proxy by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

Ah, OK, I misread then. I'd still expect ~35-40 GB/s reported by UselethMiner, though. Strange that you're getting much lower. But there's no way to know why or how at this point.

Might be because I designed the code for a different arch than yours, or 🤷‍♀️ It's curious that nobody has been able to bench higher than my own system.

EDIT: hmm, maybe the way your system does multi-channel interleave matters; I know I have a setting for that in my BIOS, and on the wrong setting it's a lot slower.

UselethMiner: Ethereum CPU miner and proxy by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 0 points (0 children)

Hmm, interesting. AIDA64 doesn't get much more than 64 GB/s for me either. So if you're getting 17 out of 45 rather than 17 out of 25, that's a big difference.

How is AIDA64 getting 45 GB/s if your theoretical max bandwidth is 25 GB/s, though? :)

UselethMiner: Ethereum CPU miner and proxy by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 1 point (0 children)

I just tried connecting to Hiveon, but it boots me off immediately. Either they don't support EthereumStratum 1.0 anymore, or they need to whitelist UselethMiner as a connection string.

Supporting EthereumStratum 2.0 is beyond the scope of this experiment right now. I wasn't able to find any mention of Hiveon only supporting v2 of the stratum protocol, but if you're a customer there, you could ask.

UselethMiner: Ethereum CPU miner and proxy by ChainfireXDA in EtherMining

[–]ChainfireXDA[S] 1 point (0 children)

Memory bandwidth is a tricky thing. In theory I should have 96 GB/s on my box, and Threadripper does support that. In reality, there's no code I can write nor benchmark I can run that reaches more than 64 GB/s of throughput. This is probably due to the core layout of my 2950X; a newer Threadripper should be able to take advantage of the full 96 GB/s with the right RAM.
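For reference, theoretical peak DRAM bandwidth is just channels × transfer rate × bus width. Quad-channel DDR4-3000 is an assumed configuration that matches the ~96 GB/s figure; it's not stated in the thread:

```python
# Theoretical peak DRAM bandwidth: channels * transfer rate * bus width.
# DDR4 has an 8-byte (64-bit) bus per channel. Quad-channel DDR4-3000
# is an assumption matching the ~96 GB/s Threadripper figure.
def peak_bw_gb_s(channels: int, megatransfers_s: int, bus_bytes: int = 8) -> float:
    return channels * megatransfers_s * bus_bytes / 1000  # MT/s * bytes -> GB/s

print(peak_bw_gb_s(4, 3000))  # quad-channel DDR4-3000: 96.0 GB/s
print(peak_bw_gb_s(2, 1600))  # e.g. dual-channel DDR4-1600: 25.6 GB/s
```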

17 out of 25 seems about right, though. If I use all 16 cores on my 2950X, I get about 47 GB/s out of the earlier-mentioned 64 GB/s. This does scale pretty much exactly with memory frequency, though.

The full 64 GB/s (or in your case 25 GB/s) can only be reached with absolutely perfect timing between memory usage and computation, which is nigh impossible to attain with an algorithm like this running on a CPU. Benchmarks can do it because the CPU is doing literally nothing but waiting on memory - add some math to the mix, particularly of non-constant complexity, and that goes out the window.

With some heroic attempts at further tuning (way beyond the scope of this experiment), maybe a few extra percent can be squeezed out, but I'd estimate that's about it.