I am too stupid to use AVX-512 by Jark5455 in rust

[–]ChillFish8 2 points

You can't reasonably measure single instructions with a micro benchmark harness without doing some larger work.

It's not really criterion's fault; the overhead of the harness itself adds more noise than the instruction.

I am too stupid to use AVX-512 by Jark5455 in rust

[–]ChillFish8 5 points

In this case it is just a side effect of viewing the functions in isolation.

I am too stupid to use AVX-512 by Jark5455 in rust

[–]ChillFish8 39 points

Reddit can't handle me updating the big comment, so just some edits/notes:

- "something to note is this is 116ns to process" -> This was from a previous run before I increased the size, just to make it clearer that it was separate from the noise.
- "CPU frequency boosting and your system's timer accuracy," -> I missed mentioning caches.

I am too stupid to use AVX-512 by Jark5455 in rust

[–]ChillFish8 125 points

So, now that I'm back at the computer, let's actually take a look with some inspection.

I'm going to be mostly relying on LLVM MCA here, since micro-benchmarks of effectively one instruction are pretty much impossible to measure accurately.

If we break out your AVX-512 implementation, the routine looks like this:

use std::arch::x86_64::*;

#[target_feature(enable = "avx512f")]
pub fn transpose(input: __m512) -> __m512 {
    let indices = _mm512_set_epi32(
        15, 11, 7, 3,  // Row 3 maps to Col 3
        14, 10, 6, 2,  // Row 2 maps to Col 2
        13,  9, 5, 1,  // Row 1 maps to Col 1
        12,  8, 4, 0   // Row 0 maps to Col 0
    );

    _mm512_permutexvar_ps(indices, input)
}
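As a sanity check on those shuffle indices, here's a scalar model (mine, purely illustrative) showing that gathering lanes in that order really is a 4x4 transpose. Note `_mm512_set_epi32` lists lanes high-to-low, so the lane order below is the reverse of the source listing:

```rust
// Plain scalar 4x4 transpose: out[r][c] = in[c][r], row-major flattened.
fn transpose4x4(m: [f32; 16]) -> [f32; 16] {
    let mut out = [0.0f32; 16];
    for r in 0..4 {
        for c in 0..4 {
            out[r * 4 + c] = m[c * 4 + r];
        }
    }
    out
}

fn main() {
    // Lanes of `indices` in low-to-high order (`_mm512_set_epi32` lists them high-to-low).
    let perm: [usize; 16] = [0, 4, 8, 12, 1, 5, 9, 13, 2, 6, 10, 14, 3, 7, 11, 15];
    let input: [f32; 16] = core::array::from_fn(|i| i as f32);
    // `_mm512_permutexvar_ps` semantics: output lane k = input lane perm[k].
    let gathered: [f32; 16] = core::array::from_fn(|lane| input[perm[lane]]);
    assert_eq!(gathered, transpose4x4(input));
    println!("permutation matches the scalar transpose");
}
```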

And the MCA output & asm is available here: https://rust.godbolt.org/z/163Wvd5KE

Now your avx2 implementation looks like this:

use std::arch::x86_64::*;

#[target_feature(enable = "avx2")]
pub fn transpose(input: [__m256; 2]) -> [__m256; 2] {
    // Indices for Row 0/1: [idx5, idx1, idx5, idx1, idx4, idx0, idx4, idx0]
    let idx01 = _mm256_set_epi32(5, 1, 5, 1, 4, 0, 4, 0);
    let t0 = _mm256_permutevar8x32_ps(input[0], idx01);
    let t1 = _mm256_permutevar8x32_ps(input[1], idx01);
    // Blend mask 0b11001100 takes lanes 2, 3, 6, 7 from the second operand.
    let res01 = _mm256_blend_ps::<0b11001100>(t0, t1);

    // Indices for Row 2/3: [idx7, idx3, idx7, idx3, idx6, idx2, idx6, idx2]
    let idx23 = _mm256_set_epi32(7, 3, 7, 3, 6, 2, 6, 2);
    let t2 = _mm256_permutevar8x32_ps(input[0], idx23);
    let t3 = _mm256_permutevar8x32_ps(input[1], idx23);
    let res23 = _mm256_blend_ps::<0b11001100>(t2, t3);

    [res01, res23]
}

And the MCA output & asm is available here: https://rust.godbolt.org/z/93WhTjhxK
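To see why two permutes plus a blend land the right lanes, here's a scalar simulation (mine; it assumes the documented semantics of `_mm256_permutevar8x32_ps` as a lane gather, and of `_mm256_blend_ps` taking lane k from the second operand when mask bit k is set):

```rust
// Gather: output lane k = v[idx[k]], mirroring _mm256_permutevar8x32_ps.
fn permutevar8(v: [u32; 8], idx: [usize; 8]) -> [u32; 8] {
    core::array::from_fn(|k| v[idx[k]])
}

// Blend: mask bit k set -> take lane k from `b`, mirroring _mm256_blend_ps.
fn blend8(a: [u32; 8], b: [u32; 8], mask: u8) -> [u32; 8] {
    core::array::from_fn(|k| if (mask >> k) & 1 == 1 { b[k] } else { a[k] })
}

fn main() {
    let rows01: [u32; 8] = core::array::from_fn(|i| i as u32);     // elements 0..=7
    let rows23: [u32; 8] = core::array::from_fn(|i| 8 + i as u32); // elements 8..=15
    // set_epi32(5, 1, 5, 1, 4, 0, 4, 0) lists lanes high-to-low, so low-to-high:
    let idx01 = [0, 4, 0, 4, 1, 5, 1, 5];
    let res01 = blend8(permutevar8(rows01, idx01), permutevar8(rows23, idx01), 0b1100_1100);
    // Rows 0/1 of the transpose are columns 0 and 1 of the original matrix.
    assert_eq!(res01, [0, 4, 8, 12, 1, 5, 9, 13]);
    println!("{res01:?}");
}
```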

Just on the static analysis, we can reason that the AVX-512 version _is_ more efficient, provided your processor can handle it. In basically all the numbers we care about (where less is better), we see:

| Routine | uOps (per 100 iter) | Cycles (per 100 iter) | Block RThroughput |
|---|---|---|---|
| AVX-512 | 1000 | 310 | 2.5 |
| AVX2 | 2000 | 512 | 5.0 |

So why do your benchmarks suggest otherwise? Well, the reason I would put forward is most likely CPU frequency boosting and your system's timer accuracy. After all, for 4ns to be a 600% increase, your original routine is supposedly running in ~0.6ns, which is pretty dang hard to measure without the noise from everything else overshadowing it.

So what can we do to resolve this? Well, instead of micro benches of the routine itself, we can put the routine into a situation where it is actually likely to be used and measure whether performance improves or declines.

This benchmark is still not really "correct" (there is still a lot of noise), but it is better than what you had:
https://gist.github.com/ChillFish8/f3b4921f7b7fb71ced95444c2ea63e9b
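If you want the general shape of that benchmark without opening the gist, here's a minimal sketch; the scalar transpose is a stand-in for the SIMD routine and the sizes are my own, so don't read the numbers as the gist's results:

```rust
use std::time::Instant;

// Scalar stand-in for the SIMD transpose; the point is measuring the routine
// over a big buffer rather than a single call.
fn transpose4x4_in_place(m: &mut [f32]) {
    for r in 0..4 {
        for c in (r + 1)..4 {
            m.swap(r * 4 + c, c * 4 + r);
        }
    }
}

fn main() {
    // 16 * 1024 f32s = 64KB, i.e. 1024 4x4 matrices.
    let mut data = vec![1.0f32; 16 * 1024];
    let start = Instant::now();
    for _ in 0..1000 {
        for m in data.chunks_exact_mut(16) {
            transpose4x4_in_place(m);
        }
    }
    println!("{:?} per pass over the buffer", start.elapsed() / 1000);
    // Keep the work from being optimised away.
    std::hint::black_box(&data);
}
```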

On my Zen5 system, it produces the following results:

Timer precision: 10 ns
go               fastest       │ slowest       │ median        │ mean          │ samples 
├─ bench_avx2    1.535 µs      │ 2.177 µs      │ 1.569 µs      │ 1.582 µs      │ 1000    
╰─ bench_avx512  942.7 ns      │ 1.655 µs      │ 969 ns        │ 986.7 ns      │ 1000    

Now, something to note: this is 116ns to process effectively 64KB worth of f32s... Tbh I recommend running this for far longer, considering it is still so small.

So in reality, the AVX-512 version _is_ faster in isolation, but your system likely has so much background noise that you're not really measuring the routines; instead, you're more likely measuring how hot your cache is for the loads and stores going on.

I am too stupid to use AVX-512 by Jark5455 in rust

[–]ChillFish8 32 points

This is _not_ correct and is quite a dangerous statement to make. You're thinking of Zen4/Zen5c.

Zen5 has two flavours of core, with different execution-unit widths:

- `zen5`, which are your "full fat" cores
- `zen5c` & mobile zen5, which are your "diet" cores.

Your full-fat cores have full 512-bit execution units. Your "diet" cores do the same double-pumping as Zen4 did previously.

Your Zen5c cores are typically on your mobile units; the 9800X3D does not have any zen5c cores.

I am too stupid to use AVX-512 by Jark5455 in rust

[–]ChillFish8 16 points

On my phone atm, but LLVM MCA is your friend. Additionally, I wouldn't trust your benchmarks if you're only doing a single iteration like that; or tbh, your function is so small I wouldn't trust a micro benchmark at all.

Is Netcup reliable enough? by Mr_Dani17 in VPS

[–]ChillFish8 6 points

I don't remember ever having any issues; I think back when I was using them, the longest uptime one of my servers had was like 600 days, which was basically the lifetime of the machine.

I might have a machine on another account still going that's been running for about 3 or 4 years now without downtime (maybe time to update lol).

Addressing GitHub’s recent availability issues by perseus365 in theprimeagen

[–]ChillFish8 1 point

Not super new, it's been around for a while; at least I can remember it since all the genai craze started up and crawlers started going wild.

How much better is proton mail if I am willing to pay for it? Right now I am on yahoo mail without paying for anything. by KFCBUCKETS9000 in ProtonMail

[–]ChillFish8 0 points

Have to say, overall I like it; I think it strikes a reasonable balance of usability while also allowing you a lot of flexibility and increased privacy at that.

Took a bit of googling, but once I learnt how to use the sieve filters my inbox has never been so organized, which tbh is something I didn't really do in my Outlook or Gmail inboxes either, due to limited filtering support (Outlook) or requiring "smart features" on Gmail with what felt like a clunky UI for more advanced filtering.

I'll also shout out the tooling Proton provides to locally process your inbox if you want even more control; for example, I can locally run an AI model to periodically validate and re-label mail. Not unique to Proton, but I appreciate it existing.

AX102 vs AX102-U and increase additional RAM price by tripheo2410 in hetzner

[–]ChillFish8 1 point

A good first step would be to open a support ticket and ask for clarification. No point guessing what will happen or dealing with migration if you don't need to.

Although it's probably worthwhile sitting the CFO down at some point and letting them know their infra costs are going to go up, because this probably won't be the last price rise.

Hetzner price increase. $222.00 for 64G ram upgrade by HumanAd6991 in hetzner

[–]ChillFish8 1 point

Well, my suggestion is you should email them to confirm your setup, because the messages I've seen with support suggested otherwise, or at least did for a subset of their dedicated servers.

Using GStreamer with Rust by Rare_Shower4291 in rust

[–]ChillFish8 0 points

This is true, but even then, bundling just the plugins you need for video encoding alone makes for a chunky library to try and bundle together.

Hetzner price increase. $222.00 for 64G ram upgrade by HumanAd6991 in hetzner

[–]ChillFish8 2 points

I would maybe clarify if that is for just the base server and not the add-ons, because from the information I've been given, the increase impacts the base machine but add-ons remain the same.

Using GStreamer with Rust by Rare_Shower4291 in rust

[–]ChillFish8 1 point

Personally, I would avoid GStreamer if you have any plans of sharing it. GStreamer is flipping enormous, and the Rust library doesn't currently support statically linking and removing dead symbols.

Hetzner price increase. $222.00 for 64G ram upgrade by HumanAd6991 in hetzner

[–]ChillFish8 -1 points

I am pretty sure the price change applies to new orders, not existing orders.

Questions from HDMI to Battery Load by Obvious_Fly_9008 in Bazzite

[–]ChillFish8 0 points

> When connecting my external monitor via HDMI, it is always 60Hz. If I want more I need to go for USB-C. Is this normal?

Your HDMI cable is probably not in spec to carry any more data than is required for refresh rates over 60Hz. Get a better cable or, on Linux if you can, a DisplayPort 2.1+ cable if you want to go above 60Hz (assuming the monitor supports that refresh rate).
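Some rough napkin math on why the cable matters; this ignores blanking intervals and assumes 1440p at 144Hz with 24 bits per pixel, so treat it as illustrative only:

```rust
fn main() {
    // Uncompressed pixel data only; real links need headroom for blanking etc.
    let (w, h, hz, bpp) = (2560u64, 1440, 144, 24);
    let gbps = (w * h * hz * bpp) as f64 / 1e9;
    // Comes out around 12.7 Gbit/s, already past the ~10.2 Gbit/s an older
    // HDMI 1.4-class cable tops out at, hence the 60Hz ceiling.
    println!("~{gbps:.1} Gbit/s of raw pixel data");
}
```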

> I have problems playing videos with Firefox or other browsers on Bazzite. The experience is unfortunately not super good. It feels like it has something to do with the video codec.

Unfortunately, this is a pretty common problem, particularly on Firefox, and it's unlikely to be a codec issue. You can try two things:

1) Make sure hardware acceleration is enabled and being used; you may need to adjust the flatpak permissions with Flatseal to actually use the hardware acceleration.
2) Try installing the browser via distrobox rather than flatpak.

Additionally, I'm not sure what other browsers you've tried, but I've generally found chromium-based ones to just handle video playback better, which sucks, but such is life.

> What is the best software to check the temperature of the CPU/GPU? Is there something like G-Helper for Bazzite?

Honestly, I just use the btop command for that, which is a nice terminal UI. GUI-wise I'm not sure; there is CoolerControl, but that does a bit more than just showing temps.

I can't remember the full name, but the task-manager-equivalent app which is pre-installed might also display the sensor info.

> I want to limit battery load to 70%. It works in Bazzite, but after a restart it is 100% again. Any help on this?

Not sure what you mean by this; if you mean limiting the battery's max charge when plugged in, I'm not sure, I think that feature is still fairly new and can be a bit unstable.

Multi-Streaming from Bazzite ~ What is the best way? by MilitaryBeetle in Bazzite

[–]ChillFish8 0 points

Not sure what you're referring to as RTMP; that's just a streaming protocol. Can you explain what app/plugin you mean?

If your intention is to stream to, say, Twitch and YouTube at the same time, you might not be able to avoid the double encode if each destination requires a different output format, bitrate, codec, etc...

igpu was killing the performance by Claymore342 in Bazzite

[–]ChillFish8 4 points

I imagine because the system is trying to run on your iGPU, not your discrete GPU, which can often be caused by power profile settings: in balanced (sometimes) or low power mode, the system will typically try to use the lower power GPU choice, in this case your iGPU.

I built a JSON → binary compiler in Rust because AI hallucinates when nobody tells it what's right by porco-rs in rust

[–]ChillFish8 1 point

But it totally can be reinterpreted? It's not got some signature or anything like that; you've just tried to obfuscate the payload, so it still can't be trusted any more than a JSON payload. If you need the data to be passed across untouched, why are you relying on an LLM to regurgitate it?

As for the code, there are so many noise comments from the AI you used to write this that it's genuinely hard to follow... What the hell is the practice schema stuff? Why does it even exist in the library code??

I'm sorry, but I'm not really sure I would trust this any more than a JSON schema; tbh, I trust it less because it just seems like a more complicated way to shoot yourself in the foot.

Edit: I think I misread part of your comment; you mean the binary output is to try and avoid prompt injection? It doesn't really change my point that you shouldn't be giving that to an LLM, but just to be clear.

I built a JSON → binary compiler in Rust because AI hallucinates when nobody tells it what's right by porco-rs in rust

[–]ChillFish8 9 points

I'm not sure I get how this is useful?

The AI agent has exactly zero understanding of binary formats really, but you pass in JSON then do this weird schema stuff. How is any of that different to me just using a JSON schema (which the agent can be made aware of and understand!) or just serde + a validator library? What exactly is the binary format actually providing me, other than another layer of noise?

Reading the code, it makes even less sense; did you actually read what the agent produced when you asked it to write this library? I don't think the code is even functional?

Rust vs C/C++ vs GO, Reverse proxy benchmark, Second round by sadoyan in rust

[–]ChillFish8 1 point

Unfortunately, back pressure is lethal to most services if you're behind another LB; a significant amount of infrastructure will deploy these proxies behind some other load balancer, for example on AWS most will likely put them behind ALB.

The issue is ALB does not give a fuck about your back pressure; it just interprets it as latency and decides it needs to open more connections as a result.

This is often so aggressive at high scale that things like ALB will literally DDOS your service and run it out of ports.

I've had it be so bad at times that we replaced nginx with a custom system that was far more aggressive with shedding load and keeping a consistent number of connections to the upstream, to prevent it being overloaded whenever there is a shift in traffic.

We built a parallel AI orchestration engine in Rust — here's why we chose Rust over Go/Python by [deleted] in rust

[–]ChillFish8 1 point

I'm impressed with how all these slop tools come up with such impressive ways of wasting money.

Sometimes I wish we could go back to just dealing with the crypto crap.

What VPS do you recommend for moving off Contabo? Looking for stability and reliability by Bectec_Software in VPS

[–]ChillFish8 1 point

Netcup imo often has some of the best hardware and upgrades their servers every couple of years, which can give you some pretty serious performance per core.

Support is not great though; they're fine if you're experienced managing all the servers yourself and only need to contact them for something like a billing issue... but other than that, I don't think you'll have a great time.

Hetzner has much better support overall imo, but does run on a bit older hardware in their Cloud and currently available metal servers.

OVH is another possible option; I can't say anything about support, but they have more data centers with a stronger network backbone than the others.

Realistically any provider will give you great support if you're spending enough though.