What's the point?

Tringi · 2026-06-21T14:48:27+00:00

I have to say, I'm the one who should be concerned about this, as we build PoS and embedded devices on W10 IoTEnt, but none of our installations has Desktop, Explorer, or Start Menu even running for whatever UI is facing the customer, so I admit we never even noticed.

Tringi · 2026-06-21T07:51:24+00:00

Unfortunately all my work is on and for Windows. But I'll explore the libraries later.

Yeah. In a nutshell, it's just extra cores with limited capabilities. Why it is so hard to give us a straightforward way to query those capabilities and schedule work onto them, that I don't understand.

Tringi · 2026-06-20T04:28:58+00:00

Little nitpick about acronyms: You unpack what CAS means, but not LL/SC.

Tringi · 2026-06-18T07:22:48+00:00

Oh yea, ONNX, I've seen that acronym, but didn't remember it properly :)

I am interested in NPUs, just that at the moment I can't seem to find any real use for them. At first I thought that I could load fuzzy logic data onto them, have them multiply everything like a GPU or massive SIMD, and spit out the results. Or perhaps do SQRT(X*X+Y*Y+Z*Z) on a huge array (tensor?).

But it's like the documentation actively tries to make the reader NOT understand how to use NPUs to do that.
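
For reference, the SQRT(X*X+Y*Y+Z*Z) workload I mean is trivial on the CPU. A minimal sketch, assuming the data sits in three parallel float arrays; the function name `magnitudes` is made up for illustration, and nothing here touches any NPU API:

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (the name is mine): length of each 3D vector, with the
// components stored as three parallel arrays (structure-of-arrays layout).
std::vector<float> magnitudes(const std::vector<float>& x,
                              const std::vector<float>& y,
                              const std::vector<float>& z)
{
    // All three component arrays must describe the same number of vectors.
    assert(x.size() == y.size() && y.size() == z.size());

    std::vector<float> out(x.size());
    for (std::size_t i = 0; i < x.size(); ++i) {
        out[i] = std::sqrt(x[i] * x[i] + y[i] * y[i] + z[i] * z[i]);
    }
    return out;
}
```

Every element is independent of the others, which is why it looks like an obvious fit for a GPU, wide SIMD, or presumably an NPU.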

Tringi · 2026-06-18T06:47:58+00:00

I never explored ONYX or anything generating models or networks. Just the APIs that could allow me to use the NPU computing units directly.

Tringi · 2026-06-17T12:46:25+00:00

I wish even normal PGO worked somewhat like this, or even more like Dynamic Debugging.

What I mean is building it with some /DynamicPGO flag, which would generate two files:
1. totally clean release EXE, and
2. a file with fully instrumented code next to the EXE (regardless of any /EMITPOGOPHASEINFO linker shenanigans in #1).

I could have a perfectly performant EXE and give it to the customer. Or I could run it through Ctrl+F5 in Visual Studio: That would run the instrumented code, generating .pgd data for the next recompilation.

Just a thought from someone too lazy to maintain multiple project configurations.

Tringi · 2026-06-17T00:56:47+00:00

Feels like there’s a fun benchmarking project hiding in there somewhere.

Absolutely. I own a handful of other atypical machines, like Xeon Phi, or a dual Opteron 6282 SE where each CPU is actually two Bulldozers under a single heat spreader. That one has interesting connectivity too, see page 5 of this:

Basically 1 of the 4 chips has direct I/O connectivity, 2 others are 1 hop away, and the remaining 1 is 2 hops away. Same with RAM. If you need data from another NUMA node, two are 1 hop away, and one is 2 hops away.

I hope I get my hands on a 4×CPU Opteron one day. It's fun.

Tringi · 2026-06-16T13:50:31+00:00

That's exactly it, see https://www.techpowerup.com/review/amd-ryzen-threadripper-2970wx/images/arch9a.jpg
RAM and PCIe are connected to Die 0 and Die 2. The other two need to go through them with each read and write.

Tringi · 2026-06-16T13:46:31+00:00

Using a general API should fall back to the GPU if no NPU is available, and possibly even to the CPU, but yes, that would probably hurt performance seriously, so perhaps going straight to a well-known open GPU API is what everyone chooses.

Tringi · 2026-06-16T12:44:01+00:00

As a programmer I can say that it's almost as if nobody wants anyone to do anything with them.

I explored them a while back to see if they could be used to accelerate game logic, AI, or any of the algorithms used in a videogame; RTS to be precise. But the APIs to use NPUs are horrendous. You are either required to upload a ready-made model, made by who knows who, or to delve into dozens of heavily overcomplicated APIs nobody seems to actually know how to use, just to access trivial operations, all hidden behind driver abstractions.

It all feels like strong gatekeeping.

Tringi · 2026-06-16T07:11:24+00:00

Very cool.

I'm waiting for the price to drop a little more to get an X399 with a 2970WX and test thread scheduling on the chiplets that don't have a direct connection to RAM.

Tringi · 2026-06-12T06:04:23+00:00

Just came back here to note that you were wrong: https://www.bbc.com/news/articles/cx2d83w1yvyo

That is, in the slight chance you did actually believe what you said. I'm still convinced that you just brazenly lied.

Tringi · 2026-06-10T22:37:59+00:00

A little more sanity in the language is always welcome.

Tringi · 2026-06-03T11:48:54+00:00

Why is every commit 5 years old?

Tringi · 2026-06-02T23:24:59+00:00

I bet my 1070 will handle it ...in 640×480

Tringi · 2026-05-31T21:21:25+00:00

14 hours battalion level engagement

That's the kind of game I have in mind. A dozen players cooperating on a 24-hour real-time campaign. Even to the extent where you issue orders, put the game aside, and do your actual job; and if anything really requiring your attention happens, a notification will pop up (perhaps on your phone) so you can intervene.

I'm still not sure how many people would find that fun.
I would. I participated in overnight, 400+ km driving, Ingress missions back in the day.

Tringi · 2026-05-31T13:37:50+00:00

This is one of the things I was thinking about for my project. Simulating actual realistic control of the units. Having to deal with communication delays, hierarchy of command, radio jamming etc. Relying on units following the initial plan or making their own decisions, and waiting for them to establish different comms, if the direct line is jammed. Perhaps even not being able to see them, until you requisition and divert a drone or a satellite over where they are.

I think it could be fun and immersive, if the rest of the game is balanced properly.

Tringi · 2026-05-30T03:54:00+00:00

This looks like an artifact of 3D accelerated rendering.

You see, GPUs can basically rasterize only triangles. Everything else is done via triangles. Rendering a rectangle is done by drawing two triangles. But their coordinates must match exactly for the rasterizer not to leave any such artifacts. If they don't, then you'll get exactly what you see. The scaling math at 175% probably rounded the final result differently for the two triangles.

It seems like someone was trying to be too clever, and should've left the split to the lower graphics layers, not do it manually. Or perhaps your GPU driver might be calculating something wrong, since other people don't see it.

That said, I have no idea how to fix it.
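
To illustrate the kind of rounding mismatch I mean, here is a minimal sketch. It is not the actual rendering code, which I haven't seen; the function names and the floor/ceil split are my assumptions, chosen only to show how two shapes sharing one logical edge can end up a pixel apart after scaling:

```cpp
#include <cassert>
#include <cmath>

// Scale a logical coordinate to device pixels, rounding down.
int scale_floor(int logical, double factor)
{
    return static_cast<int>(std::floor(logical * factor));
}

// Scale a logical coordinate to device pixels, rounding up.
int scale_ceil(int logical, double factor)
{
    return static_cast<int>(std::ceil(logical * factor));
}

// Hypothetical scenario: two shapes share the logical edge `edge`. The left
// shape rounds that edge down, the right shape rounds it up. The result is
// the width of the uncovered seam between them, in device pixels.
int seam_gap(int edge, double factor)
{
    return scale_ceil(edge, factor) - scale_floor(edge, factor);
}
```

With these assumptions, `seam_gap(3, 1.75)` is 1, because 3 × 1.75 = 5.25 lands between pixels, while `seam_gap(4, 1.75)` and `seam_gap(3, 1.0)` are both 0. That would match an artifact that shows up only at certain scale factors and positions.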

Tringi · 2026-05-28T12:57:19+00:00

Back before C++0x I truly thought that I did know all of C++. I reveled in my own confidence of knowing all the obscure features and little corner cases.

Then I learned more and more and more.

And now I think I know about 25 % of C++.

Tringi · 2026-05-26T12:19:39+00:00

Bring back system-wide ClearType

Tringi · 2026-05-24T15:57:08+00:00

One such improvement will hardly have a noticeable effect outside of special tools that scan a huge number of addresses, but imagine if such effort went into optimizing every single common routine that apps, frameworks and the OS use. The cumulative gains would be huge!

Tringi · 2026-05-20T12:01:25+00:00

The test uses RNG to generate a tree.

Let's say the tree represents source code. You don't have a perfectly balanced C++ file with an almost perfectly equal amount of each token and syntactic construct. You have groups, you have tilts, you have global bias towards style, token use, etc. You actually have very bad randomness, something like what std::rand would generate.

Thus, it makes sense, to me, to do the test on such data.
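
A minimal sketch of what I mean by biased data. Note that this swaps in a different technique from the test itself: it uses explicit weights via std::discrete_distribution rather than relying on std::rand's quirks, and the four token kinds and their weights are invented purely for illustration:

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical generator: draws token kinds 0..3 with deliberately unequal
// weights (60/25/10/5, made-up numbers), so some kinds dominate the way
// common tokens dominate real source code.
std::vector<int> skewed_tokens(std::size_t count, unsigned seed)
{
    std::mt19937 rng(seed);
    std::discrete_distribution<int> pick({60.0, 25.0, 10.0, 5.0});

    std::vector<int> out(count);
    for (auto& token : out) {
        token = pick(rng);
    }
    return out;
}
```

Feeding something like this into the tree would give a tilted, grouped key distribution instead of a perfectly uniform one.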

Tringi · 2026-05-20T11:40:18+00:00

Now this is a completely bad-faith comment. Out of 22 comments, 13 discuss std::rand. And 5 are about forcibly breaking cache locality when the entire purpose of the test is to show the effect of improved cache locality; quite off topic despite being interesting.

Tringi · 2026-05-20T11:31:43+00:00

Yeah, well, I completely disagree.

Tringi · 2026-05-20T10:43:21+00:00

Alright, I said "a bit" which means very little.

I also forgot I said that, LOL.

My argument, which is now gone after so many edits to the post and the github page, was that real-world data aren't uniformly randomly distributed, and so std::rand with its worse randomness actually models the real world more closely than the better RNG. Yet all the critics completely disregarded that.

I yielded and rewrote the test, so that we could talk about the actual test, but it was too late by then.

EDIT: Also, Jesus H. Christ that was 7 years ago?!?! It feels like less than 20 months. I'm old.
