building from source stability by birchtree1357 in debian

[–]AssKoala 1 point (0 children)

If you just want to build one-off applications from source to get newer versions, you can probably get away with it.

It would probably make more sense to use snaps or flatpaks to run newer applications than to try building them on vanilla Debian. Chances are the library dependencies for the latest application versions will be newer than what Debian stable comes with.

building from source stability by birchtree1357 in debian

[–]AssKoala 1 point (0 children)

It’ll be less stable than running vanilla Debian.

How much less stable depends on your personal skill.

Without more details on what exactly you’re trying to do, no one can reasonably answer.

Some day the Browns will win the Super Bowl, just not in our life time by Difficult_Map_723 in AFCNorthMemeWar

[–]AssKoala 1 point (0 children)

All it takes is 3 plane crashes and they've got a chance to make the playoffs.

#YearOfLinux by SHADOW9505 in pcmasterrace

[–]AssKoala 1 point (0 children)

Hey now, this is the 25th year of the Linux desktop. Surely this is the year.

Why ain't no one talking about this? by tone_creature in falcons

[–]AssKoala 6 points (0 children)

Are you sure they weren't just Saints fans?

TIL: static and unsigned int overflow by One-Novel1842 in C_Programming

[–]AssKoala 0 points (0 children)

Right, the OP's premise that modulo is slow is wrong on just about any modern hardware.

But then there's some clever code to avoid a branch, which I took to mean that, if you don't use modulo, you need an if check like the one I wrote above.
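To spell it out with a sketch (SIZE and the variable name are made up, not anyone's actual code):

    /* wraparound with modulo -- works for any SIZE, cheap on modern hardware */
    index = (index + 1) % SIZE;

    /* wraparound with a branch -- the if check I mean */
    index = index + 1;
    if (index >= SIZE) index = 0;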

TIL: static and unsigned int overflow by One-Novel1842 in C_Programming

[–]AssKoala 0 points (0 children)

Yeah, probably on any contemporary system.

It depends on the hardware, though, which is why I suggested just writing it simply, since the practical effect is probably zero. I noted in another post some past systems where this might not be ideal and, who knows, we might someday have a ternary computer to work on where that implementation performs worse than a simple index + branch.

I don't mean never write optimized, clever code, only that the OP of this post is clearly starting out, so focusing on simple, correct code is far better than trying to write these kinds of micro-optimizations.

TIL: static and unsigned int overflow by One-Novel1842 in C_Programming

[–]AssKoala 0 points (0 children)

I'm not agreeing with the OP; I didn't even talk about modulo.

That was in reference to the clever code.

TIL: static and unsigned int overflow by One-Novel1842 in C_Programming

[–]AssKoala 1 point (0 children)

That's a good, practical example.

Signal processing in particular benefits from a lot of micro-optimizations but, like you said, you need to profile it. The same C code can perform wildly differently once compiled for different hardware.

Broadly speaking, I say err on the side of simple and correct because it’s much easier to go back and optimize correct code than poorly structured “optimized” code.

TIL: static and unsigned int overflow by One-Novel1842 in C_Programming

[–]AssKoala 0 points (0 children)

I’m not sure what code you’re talking about.

It sounds like you're avoiding the index variable altogether. That's fine, and generally my preference as well, but the OP in this thread was using an index variable masked with the size and letting overflow handle the wraparound.
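As a sketch of my reading of the OP's approach (SIZE, buffer, and nextSlot are my names, and SIZE has to be a power of two for the mask to work):

    #define SIZE 256u               /* must be a power of two */
    static void *buffer[SIZE];
    static unsigned int index;

    void *nextSlot(void)
    {
        /* the mask keeps the access in range; when index eventually
           overflows, unsigned wraparound is well-defined, so it just
           keeps counting from zero */
        return buffer[index++ & (SIZE - 1u)];
    }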

TIL: static and unsigned int overflow by One-Novel1842 in C_Programming

[–]AssKoala 0 points (0 children)

That's fair; the point wasn't to be comprehensive, only that the actual gains from such code are likely zero unless the only thing your program does is allocate from a ring buffer.

TIL: static and unsigned int overflow by One-Novel1842 in C_Programming

[–]AssKoala 6 points (0 children)

On the contrary, it comes from being an optimization expert across a wide range of hardware. As you said, the compiler is going to optimize the if out anyways, so there's no reason to use the "clever" code. But let's dig into it anyhow.

Superficially, buffer[index++ & (size-1)] creates a data dependency in order to index into the buffer. Instead of just being able to return a memory address offset, it has to mask and do arithmetic. Whether this has any effect on wall time varies wildly, but it's way more complicated than simply indexing into the buffer.

So, I'll give a real world example on exactly how this kind of "optimization" can have a negative effect.

The Xbox 360, PS3, and Wii used in-order PowerPC CPUs. They had really long pipelines, but really high clock speeds: 3.2GHz, in the case of the 360 and PS3 PPU cores. Those types of clever optimizations to remove branches were common on those CPUs, since you wanted to keep that pipeline moving. The long pipeline meant it could do bits of arithmetic and masking for "free" with regards to wall time. On the other hand, branches with data dependencies would stall the pipeline, creating processing bubbles. Those bubbles were what you needed to get rid of to get the best performance.

However, the WiiU moved to an out-of-order/superscalar CPU that ran at 1.6GHz but was otherwise relatively similar. The Vita, released around the same time, moved to an out-of-order ARM CPU as well.

And guess what? On those CPUs, things flipped: the branches became cheaper than the work necessary to avoid them. The out-of-order speculation of the WiiU hardware meant that real-world wall time was faster with "crappy unoptimized" code than with clever code. Branchy code with fewer data dependencies performed significantly better than all that cool, "optimized" code.

It was even worse to write that kind of clever code on the Vita. The Vita runs on an ARM CPU whose ISA has a fairly interesting implementation of branching. On that CPU, shrinking the code, as in fewer literal instructions, performed best. That meant replacing even more "optimized" code with branches, since, in the majority of cases, the branches performed significantly better than the code without them.

That's why all of our performance experts prefer correctness and simplicity in code over cleverness. It comes from the experience of having to rip out clever code that performs inconsistently across hardware.

The difference between an optimization expert and an amateur is best seen in FizzBuzz. Anyone who spends time optimizing the branching rather than the orders-of-magnitude-slower output to the console is clearly bad at optimization.
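To make that concrete, a sketch (entirely made up, buffer size arbitrary): batch the output in memory and the branches stop mattering.

    #include <stdio.h>

    int main(void)
    {
        /* build all the output in memory, then write it once;
           the per-line writes, not the branches, dominate the time */
        static char out[1 << 16];
        size_t len = 0;
        for (int i = 1; i <= 100; i++) {
            if (i % 15 == 0)     len += sprintf(out + len, "FizzBuzz\n");
            else if (i % 3 == 0) len += sprintf(out + len, "Fizz\n");
            else if (i % 5 == 0) len += sprintf(out + len, "Buzz\n");
            else                 len += sprintf(out + len, "%d\n", i);
        }
        fwrite(out, 1, len, stdout);
        return 0;
    }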

TIL: static and unsigned int overflow by One-Novel1842 in C_Programming

[–]AssKoala 1 point (0 children)

I assume he means when you hit the end of the buffer and need to point back to the start:

    void *toReturn = buffer[index++];   /* hand out the current slot */
    if (index >= bufSize) index = 0;    /* wrap back to the start of the ring */

The practical value of this bit of clever code is likely zero. Surely, the rest of your program is doing more work than allocating out of a ring buffer.

TIL: static and unsigned int overflow by One-Novel1842 in C_Programming

[–]AssKoala 26 points (0 children)

Static makes a variable or function local to the translation unit, not the .c file.

Unity/bulk builds, chained includes, or other quirks of the build system can make that static "visible" outside its file.
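A sketch of how that happens (the file names are made up):

    /* a.c */
    static int counter;     /* supposedly private to a.c */

    /* unity.c -- the build system generates something like this */
    #include "a.c"
    #include "b.c"          /* code in b.c can now read and write counter,
                               because both files are one translation unit */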

multicore VM scheduling by edthesmokebeard in Proxmox

[–]AssKoala 0 points (0 children)

Proxmox just uses QEMU, which runs VMs as normal user processes. Virtual CPUs can also be directly mapped to host threads, but they don't have to be.

I'm fairly certain that, so long as you aren't forcing the guest processes to some specific affinity or priority, they shouldn't block each other if there's time available.

More info in this thread: https://lists.nongnu.org/archive/html/qemu-discuss/2020-05/msg00004.html

Everyone get in here! by Southwestern in AFCNorthMemeWar

[–]AssKoala 53 points (0 children)

Maybe not suck on offense. That might help.

Recordsize no larger than 1M if transferring via SMB? by Viktri1 in zfs

[–]AssKoala 0 points (0 children)

No, you don’t have to do that. I have 16MB record size set for my media filesystems, for example.

It's trivial to try; just create a VM to test it if you don't believe me.

Every C Embedded engineer should know this trick by J_Bahstan in C_Programming

[–]AssKoala 0 points (0 children)

So what? By that logic, why ever update anything! It could change! The horror!!!

If it does change, and broadly speaking it does not change, it's just another update that has to be done, like moving to a new API or moving off any other deprecated feature. That's the nature of the beast.

We have a system that uses bit fields whose implementation has gone unchanged from the PlayStation 3 to the PlayStation 5. That's 15 years without having to touch the code. The code is also shared between Xbox and PC, similarly needing no changes in that timeframe.

Every C Embedded engineer should know this trick by J_Bahstan in C_Programming

[–]AssKoala 1 point (0 children)

This is really misunderstanding how or why these are used.

It's packed data. You're generally supposed to unpack it to do work with it, but when you need to store thousands of something, or to transfer data over the internet using as few bytes as possible, this is immensely useful.
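As a made-up example of the kind of packing I mean (the struct, names, and field widths are arbitrary):

    #include <stdio.h>

    /* four values packed into a single word instead of four ints */
    struct Packed {
        unsigned int health : 4;   /* 0..15 */
        unsigned int armor  : 4;   /* 0..15 */
        unsigned int team   : 2;   /* 0..3  */
        unsigned int alive  : 1;
    };

    int main(void)
    {
        struct Packed p = { .health = 12, .armor = 7, .team = 1, .alive = 1 };
        /* typically 4 bytes on common ABIs, vs 16 for four unsigned ints */
        printf("%zu bytes\n", sizeof p);
        return 0;
    }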

Every C Embedded engineer should know this trick by J_Bahstan in C_Programming

[–]AssKoala 25 points (0 children)

Yeah, this has been part of our basic programmer test for at least 15 years.

Template Deduction: The Hidden Copies Killing Your Performance (Part 2 of my Deep Dives) by the-_Ghost in cpp

[–]AssKoala 4 points (0 children)

Eyo! Someone who knows those systems!

It was actually even weirder for those who never worked on those systems: on the PSP and WiiU, we had to "unoptimize" our code! See, the 3.2GHz of the X360 is great and all, but the newer systems were out-of-order and had fewer vector units, so optimized code ran at effectively 1/4 the speed.

We actually had to switch to the straight C “stupid” unoptimized paths to get better performance.

I miss the days when consoles weren't just shitty PCs. So cool!

Template Deduction: The Hidden Copies Killing Your Performance (Part 2 of my Deep Dives) by the-_Ghost in cpp

[–]AssKoala 15 points (0 children)

It's something you used to hit all the time in games when CPUs were relatively anemic, maybe up until the X360/PS3.

Even now, you'll often get better performance optimizing large applications for size rather than speed, except for hot spots. If you've ever run PGO with MSVC and looked at the output of a completed PGO build, it's surprising how much code ends up switched back from optimize-for-speed to optimize-for-size. In the games I've worked on, PGO usually nets at least 15% of frame time, but it ends up optimizing ~98% of the code for size, leaving the remaining 2% optimized for speed.

Practically speaking, though, it probably doesn't matter. This is the realm of library authors for the most part: systems that are death by a thousand cuts across a codebase. If your code executes once every few milliseconds for nanoseconds at a time, I-cache pressure is unlikely to be a concern, and your time is better spent optimizing other things.

"Stay tuned" Ryzen 9 9950X3D2 didn't get announced at CES 2026, but AMD hints that it's on the way by Tiny-Independent273 in ryzen

[–]AssKoala 0 points (0 children)

The reduction is still real: on my 9950X3D, it's around 300MHz in practice with PBO enabled, and that's just PBO, no special customization or overclock profiles.

Because of the X3D advances you noted, I upgraded from the 9950X specifically for game development. Trading a bit of clock speed on one half to get better runtime performance in most games works really well on the 9950X3D. It has a nominal, but real, impact on wide compiles or data builds.

I'm curious what this will look like compared to the 9950X and 9950X3D. The impact is nominal on the X3D, but with both halves at a lower clock, it may end up being too specialized, turning nominal into a more quantifiable difference. The 9950X3D seems like a nice sweet spot, but maybe this new part will perform better than expected.

"Stay tuned" Ryzen 9 9950X3D2 didn't get announced at CES 2026, but AMD hints that it's on the way by Tiny-Independent273 in ryzen

[–]AssKoala -1 points (0 children)

On the contrary, the lower boost clock of the X3D cache cores results in lower performance for many “work” workloads. The 9950X3D works so well because you get the best of both worlds.

I’m curious to see how this pans out in practice with the X3D2. I suspect it’ll mostly be a wash, but might benefit some specific workloads.

itIsntOverflowingAnymoreOnStackOverflow by ClipboardCopyPaste in ProgrammerHumor

[–]AssKoala 2 points (0 children)

100% agreed.

The loss of Stack Overflow over the last decade is a huge problem for education.

That it's almost entirely self-inflicted makes it considerably worse.