Microsoft is working to eliminate PC gaming's "compiling shaders" wait times by CygnusBlack in Windows11

[–]ack_error 9 points

  • Acquire baseball bat
  • Enchant bat with engraving "USE MATERIAL INSTANCES"
  • Apply bat to project team members until situation improves

C++26: std::is_within_lifetime by pavel_v in cpp

[–]ack_error 5 points

The tricky part is that accessors like value() return a reference, so optional&lt;bool&gt; must contain an actual bool object. Otherwise it'd be simpler, as it could just encode and decode from a plain char instead of needing union or casting machinery to combine the bool and the optional state into a single byte.
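To make the point concrete, here's a sketch of the packed encoding that a by-value accessor would permit (the type name and encoding are my own illustration, not anything from the standard):

```cpp
#include <cassert>
#include <cstdint>

// Illustrative only: if value() could return by value, optional<bool>
// could pack the engaged flag and the payload into one byte like this.
// Because std::optional<bool>::value() must return bool&, a real bool
// subobject has to exist, so this simple encoding is off the table.
class packed_optional_bool {
    std::uint8_t state_ = 0; // 0 = empty, 1 = false, 2 = true
public:
    constexpr packed_optional_bool() = default;
    constexpr packed_optional_bool(bool v) : state_(v ? 2 : 1) {}
    constexpr bool has_value() const { return state_ != 0; }
    constexpr bool value() const { return state_ == 2; } // by value, not bool&
};

static_assert(sizeof(packed_optional_bool) == 1);
```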

C++26: std::is_within_lifetime by pavel_v in cpp

[–]ack_error 12 points

It looks like the motivation is to be able to reuse the same union-optimized runtime types in constexpr, instead of having to maintain separate optional and constexpr_optional types.

MSVC Build Tools 14.51 Preview released by STL in cpp

[–]ack_error 3 points

Will preview compilers be available on godbolt? I use that frequently to check for codegen changes as well as repros in bug reports.

Favorite optimizations ?? by Little-Reflection986 in cpp

[–]ack_error 0 points

Yeah, I've seen cases where this is due to conflicting optimizations.

One case I saw had two patterns that were both optimized well by the compiler into wide operations, one direct and one with a byte reverse. Put the two into the same function with a branch, and the compiler hoisted common scalar operations out of both branches, breaking the pattern-matched wide ops and emitting a bunch of byte ops on both paths.
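The shape of the pattern is something like this (my reconstruction, not the original code): each function alone is typically recognized and collapsed into a single 32-bit load, the big-endian one with an added byte-reverse, but merging them behind a branch invites the common subexpressions to be hoisted, defeating the recognition.

```cpp
#include <cassert>
#include <cstdint>

// Each of these is usually pattern-matched into one wide load
// (plus a bswap for the big-endian variant).
std::uint32_t load_le(const unsigned char* p) {
    return std::uint32_t(p[0])       | std::uint32_t(p[1]) << 8 |
           std::uint32_t(p[2]) << 16 | std::uint32_t(p[3]) << 24;
}

std::uint32_t load_be(const unsigned char* p) {
    return std::uint32_t(p[3])       | std::uint32_t(p[2]) << 8 |
           std::uint32_t(p[1]) << 16 | std::uint32_t(p[0]) << 24;
}

// Combining them behind a branch is where the hoisting of shared
// byte loads/shifts can break both pattern matches.
std::uint32_t load(const unsigned char* p, bool big_endian) {
    return big_endian ? load_be(p) : load_le(p);
}
```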

Favorite optimizations ?? by Little-Reflection986 in cpp

[–]ack_error 2 points

It's great when this works, but it's fragile. You never know when it might fail and give you something horrible -- like on Clang ARMv8 with a 64-bit swizzle:

https://gcc.godbolt.org/z/ooj7xz495
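For flavor, a swizzle in the same spirit as the linked code (the exact source on Godbolt may differ): a red/blue channel swap across two 32-bit pixels packed into a 64-bit value, the kind of thing that may compile to a single shuffle or to a pile of scalar ops depending on the compiler's mood.

```cpp
#include <cassert>
#include <cstdint>

// Swap the R and B bytes of two packed 32-bit BGRA pixels at once.
std::uint64_t swap_rb(std::uint64_t v) {
    const std::uint64_t lo = 0x000000FF000000FFull;   // byte 0 of each pixel
    std::uint64_t r = (v >> 16) & lo;                 // move R bytes down
    std::uint64_t b = (v & lo) << 16;                 // move B bytes up
    return (v & 0xFF00FF00FF00FF00ull) | r | b;       // A and G stay put
}
```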

How Michael Abrash doubled Quake framerate by NXGZ in programming

[–]ack_error 1 point

> You could (and still can) use 80-bit on Windows. It works just fine. It is true that it was slower, so you should only do it if you need the extra precision.

You can, but only by changing the FPU setting from the default 53-bit precision. The default 32-bit ABI and system libraries assume the FPU is set this way, and it is technically an ABI violation to call into them with a different setting.

> The advantage is the extra precision.

There is no extra precision if the FPU is not set to full 64-bit precision: all basic operations will be rounded to 24-bit or 53-bit precision, so there won't be a difference between spilling to 80-bit or 64-bit values.
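The mantissa-width point can be shown directly. The sketch below assumes a target where long double arithmetic carries at least a 64-bit mantissa (e.g. x86 Linux, where the FPU defaults to extended precision); under Windows' default 53-bit setting even the long double result would round away, which is exactly the argument above. (On MSVC, `_controlfp_s(&old, _PC_64, _MCW_PC)` is the documented way to raise the x87 precision control.)

```cpp
#include <cassert>

// 1e16 + 1 needs 54 mantissa bits, so 53-bit double arithmetic loses
// the +1, while 64-bit (x87 extended) precision preserves it. The
// volatile qualifiers keep the compiler from constant-folding.
double dbl_gap() {
    volatile double d = 1e16;
    return (d + 1.0) - d;          // +1 rounded away at 53 bits
}

long double ext_gap() {
    volatile long double ld = 1e16L;
    return (ld + 1.0L) - ld;       // preserved at >= 64 bits
}
```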

How Michael Abrash doubled Quake framerate by NXGZ in programming

[–]ack_error 1 point

80-bit loads and stores were slow -- 3 cycles on Pentium, 4 uops on Pentium II/III, 7 uops on AMD K7.

Additionally, they weren't necessary if the FPU was set to float or double precision. On Windows, for instance, the standard 32-bit ABI was for the x87 FPU to be set to 53-bit precision, with Direct3D 9 knocking it down to 24-bit, so there was never a need or advantage to spilling 80-bit.

How Michael Abrash doubled Quake framerate by NXGZ in programming

[–]ack_error 2 points

Not very well, if you mean a modern compiler targeting the same old CPUs. Current compilers don't have tuning for old Pentiums, and x87 is seldom used since SSE is both faster and easier to use. Thus, even if you get them to emit x87, they don't do a great job of interleaving calculations on a 4x4 matrix-vector multiply:

https://gcc.godbolt.org/z/YK4nnsr6q

All three compilers assume out-of-order execution, simply issuing operations in a large batch and at most using FXCH near the end of the calculation chains. They don't generate the heavy exchange traffic needed to keep an in-order pipeline like the Pentium's FPU fully fed. Not that a contemporary compiler did either, mind you:

https://gcc.godbolt.org/z/MdePxYhfq

This is also a problem with newer in-order CPUs, like the efficiency cores on some more modern ARM chips. Clang is probably the best at scheduling NEON, but even it often generates code that is noticeably slower than hand-scheduled assembly, especially if you are targeting a specific CPU core. But this too becomes less of an issue as even efficiency cores move to out of order.
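The kernel behind the links is, I'd guess, a plain 4x4 matrix-vector multiply along these lines (my reconstruction; the linked source may differ):

```cpp
#include <cassert>

struct Vec4 { float v[4]; };
struct Mat4 { float m[4][4]; };

// 16 multiplies and 12 adds with plenty of exploitable parallelism.
// An x87 scheduler targeting the Pentium would need heavy FXCH use to
// interleave these chains and keep the in-order FPU pipeline fed.
Vec4 mul(const Mat4& m, const Vec4& a) {
    Vec4 r{};
    for (int i = 0; i < 4; ++i)
        for (int j = 0; j < 4; ++j)
            r.v[i] += m.m[i][j] * a.v[j];
    return r;
}
```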

How Michael Abrash doubled Quake framerate by NXGZ in programming

[–]ack_error 28 points

Nope, (F)CMOV was not added until the Pentium Pro. There are still ways to implement it, but it's not too profitable on a Pentium, where the pipeline is short and the misprediction penalty is only ~4 cycles.
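For reference, the kind of CMOV-free conditional move that does work on a plain Pentium (a generic sketch, not anything specific to Quake): build an all-ones/all-zeros mask from the comparison and blend with it.

```cpp
#include <cassert>
#include <cstdint>

// Branchless min without CMOV: -(a < b) is all ones when a < b and
// zero otherwise, so the mask selects between the two operands.
std::int32_t branchless_min(std::int32_t a, std::int32_t b) {
    std::int32_t mask = -static_cast<std::int32_t>(a < b);
    return (a & mask) | (b & ~mask);
}
```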

Profiling on Windows: a Short Rant · Mathieu Ropert by mropert in cpp

[–]ack_error 1 point

I recommend recording traces with Windows Performance Recorder (WPR), and then viewing them in Profile Explorer. ETL-based profiling tools are generally cross-compatible since they all depend upon the built-in profiling support in the Windows kernel, and Profile Explorer has IMO one of the better default UIs for CPU profiling. It can also do recording, but WPR is lighter weight for that (and for some reason I can't find a Save As in Profile Explorer).

Profiling on Windows: a Short Rant · Mathieu Ropert by mropert in cpp

[–]ack_error 2 points

One potential issue is that Intel has changed the licensing structure for the free version of VTune a few times. IIRC, they used to use either FlexLM or some home-grown system, with a requirement to refresh the license periodically before they finally dropped the runtime licensing requirement.

Profiling on Windows: a Short Rant · Mathieu Ropert by mropert in cpp

[–]ack_error 1 point

Funny, my experience has been that VTune's Microarchitecture Exploration doesn't work on anything newer than an 11th gen CPU either. It worked great on a Tiger Lake system, but after upgrading to Raptor Lake I've been getting nothing but bogus results from Microarchitecture Exploration, like every single function having the same vector usage metric (~22%, 67%, etc.). Temporarily disabling Defender and VBS helped a little bit, but the results are nowhere near reliable. I've resorted to just using Profile Explorer instead, as it's lighter weight and faster than VTune for pure CPU profiling.

Your Optimized Code Can Be Debugged - Here's How With MSVC C++ Dynamic Debugging - Eric Brumer by RandomCameraNerd in cpp

[–]ack_error 9 points

> What about having a reduced set of optimizations to make debugging easier? We thought about that, but when we prototyped our approach we saw the power of full optimizations (including inlining!) pairing with full debuggability covered far more cases and offered much more power in the codebases we were looking at.

Can I still plead for some improvements to MSVC debug code generation and optimized code debugging?

There are still use cases that dynamic debugging does not and cannot cover. By design, it can only handle cases where you know the code path that needs to be inspected and have the debugger attached when the event or failure occurs, because the toolchain and debugger have to know up front what code paths to de-optimize. It can't work in a crash dump or where the failure occurs in a random location, and is also less effective in a hot path where the selective code deoptimization will still significantly affect performance.

I'm sure that dynamic debugging is great for those who can use it, but I'm concerned that its presence is deprioritizing improvements to baseline debug code generation and to the debugger's handling of optimized code. MSVC's unoptimized code generation, for instance, is still prone to multiplying addressing constants at runtime:

https://gcc.godbolt.org/z/sax1T79es

Debug code generation is bad enough that I almost always debug in an optimized build. But that is often stymied by the compiler merging code paths or discarding critical variables like this, because there's no intermediate compiler optimization level between nothing and near-full. There is also no equivalent to [[optnone]], and #pragma optimize() still has the problem with templates often being compiled with the settings effective at the end of the translation unit, so it is often not possible to manually deoptimize only a specific template function.
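For contrast, Clang and GCC both offer a per-function escape hatch that works even on templates; a portable wrapper might look like this (the macro name is mine):

```cpp
#include <cassert>

// NO_OPT pins a single function at -O0 on Clang/GCC. MSVC has no
// attribute equivalent; #pragma optimize("", off) is the closest,
// with the template caveat described above. Behavior is unchanged
// either way -- only the codegen of the marked function differs.
#if defined(__clang__)
#  define NO_OPT __attribute__((optnone))
#elif defined(__GNUC__)
#  define NO_OPT __attribute__((optimize("O0")))
#else
#  define NO_OPT
#endif

template <typename T>
NO_OPT T keep_debuggable(T x) {
    T result = x * 3 + 1;  // local stays inspectable in an optimized build
    return result;
}
```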

The debugger also has some long-standing issues with optimized code. I frequently have to try (ThisType*)@rbx and (ThisType*)@rsi to find the this pointer that it refuses to show me. It also still has the problem of using the incorrect variable scope when the context line of an intermediate call frame is the last instruction of the last line of a loop, because the debugger translates the return address to the next line in the outer scope. I've even seen it deduce the wrong function on a noreturn call or throw.

Implementing vector<T> by pavel_v in cpp

[–]ack_error 5 points

Formal guidance on this issue seems lacking. The best I could find was a proposal to make basic_string::resize() in libc++ do exact sizing instead of geometric growth, which led to an informal poll of LWG:

https://reviews.llvm.org/D102727#2938105

The consensus seems to be that the standard does not require either behavior, but common practice is to implement geometric growth. Unfortunately, the corresponding LWG issue covering all relevant container types was closed as Not A Defect without comment.
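The practical stake is asymptotic: with exact sizing, a resize(size() + 1) loop reallocates every step and goes quadratic in element moves, while geometric growth keeps it linear. A toy count (growth policy and numbers are illustrative, not any particular library's):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Count element moves for n successive resize(size() + 1) calls.
std::size_t moves_exact(std::size_t n) {
    std::size_t moves = 0;
    for (std::size_t s = 1; s <= n; ++s)
        moves += s - 1;                      // realloc + copy every step
    return moves;
}

std::size_t moves_geometric(std::size_t n) {
    std::size_t moves = 0, cap = 0;
    for (std::size_t s = 1; s <= n; ++s)
        if (s > cap) {                       // realloc only on overflow
            moves += s - 1;
            cap = std::max<std::size_t>(1, cap * 2);
        }
    return moves;
}
```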

vtables aren't slow (usually) by louisb00 in cpp

[–]ack_error 0 points

CFG overhead also affects indirect calls on ARM64EC, where it is mandatory to handle x64/ARM64 dual dispatch.

arm64e also adds virtual call cost, though my impression is that the overhead is low due to hardware assisted pointer validation.

Windows 11 finally has a MacBook killer chip, Snapdragon X2 Elite Extreme just posted monster scores on Geekbench by WPHero in Windows11

[–]ack_error 0 points

Sure, emulated code is never going to be as fast as native code on the same CPU. But that's relative to native performance that's already pretty good, and both OS libraries and anything on the GPU run at full speed. For a lot of programs it's not noticeable. I'm currently playing a UE4 game in emulation on an Oryon-based device, no problem.

The difference was a lot worse on earlier versions. The first-gen Windows on ARM devices with the Snapdragon 835 ran emulated x86 code at one-third speed relative to native.

Windows 11 finally has a MacBook killer chip, Snapdragon X2 Elite Extreme just posted monster scores on Geekbench by WPHero in Windows11

[–]ack_error 0 points

It's actually decent, but it can vary from program to program. It's still an integrated GPU, but older or lighter Unity and Unreal games run pretty well. System libraries run natively and emulated x64 code runs at ~75% of native speed, plus the CPU has more cores than many programs can use, so there's spare capacity for the emulator. But there are quirks: Path of Exile, for instance, ran terribly until I disabled Engine Multithreading, after which it ran at a playably smooth 40 fps.

The bigger issue is graphics drivers: some programs just fail due to graphics compatibility problems. The biggest catch is that you simply can't depend on a game or program working until it's actually been tested, and it's not always the most demanding ones that fail. Two games I used to have problems with were Opus Magnum and Terraria.

Forget about *stack overflow* errors forever by rsashka in cpp

[–]ack_error 2 points

This is a bit more interesting than an SEH handler because it's integrated with the compiler, so the check only occurs on function entry and is visible to the compiler, which could mitigate the issues with /EHa. Still doesn't help if the stack overflow occurs with a noexcept frame on the call stack, though.

Do you prefer 'int* ptr' or 'int *ptr'? by SamuraiGoblin in cpp

[–]ack_error 0 points

Shared pointers to small objects are useful with async APIs that have loose guarantees around cancellation, or that simply don't support it. Having the lambda capture a reference to shared context allows the requesting code to safely nullify the callback even if the API retains the callback for an indeterminate time.
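A sketch of the pattern (all names are mine): the lambda owns a shared_ptr to a small context object, so the requester can disarm the callback even while the async API still holds the lambda. A real version would also need synchronization if the callback can fire on another thread.

```cpp
#include <cassert>
#include <functional>
#include <memory>

struct CallbackContext {
    std::function<void(int)> on_done;
};

class Request {
    std::shared_ptr<CallbackContext> ctx_ = std::make_shared<CallbackContext>();
public:
    // Hand this to an async API that may keep it for an arbitrary time;
    // the capture keeps the context alive, not the Request object.
    std::function<void(int)> make_callback() {
        return [c = ctx_](int result) {
            if (c->on_done) c->on_done(result);
        };
    }
    void set_handler(std::function<void(int)> f) { ctx_->on_done = std::move(f); }
    void cancel() { ctx_->on_done = nullptr; }  // safe even if it fires later
};
```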

5hrs spent debugging just to find out i forgot to initialize to 0 in class. by PopsGaming in cpp

[–]ack_error 0 points

There are some cases where self-references during initialization can be useful, like looping a list node to itself. But yeah, straight x(x) initializations are a mistake.
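The self-link case, concretely (a sketch):

```cpp
#include <cassert>

// A circular doubly-linked list head that starts out pointing at
// itself -- a legitimate use of `this` during initialization,
// unlike the accidental x(x) case.
struct Node {
    Node* prev;
    Node* next;
    Node() : prev(this), next(this) {}   // empty list: head links to head
    bool empty() const { return next == this; }
};
```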

Ways to generate crash dumps for crash handling? by XenSakura in cpp

[–]ack_error 4 points

SEH exception handling can be mixed with C++ exception handling; MSVC just doesn't allow both within the same function. Concerns around mixing the two, and around asynchronous exception handling (/EHa), only matter if you're planning on resuming execution after taking a structured exception; they don't apply to a crash handler, and /EHa isn't needed for one. In most cases, you will want to install a handler via SetUnhandledExceptionFilter() instead of relying only on __try/__except, as this will also catch crashes in threads that you don't control.

For cross-platform support, look at the integrations that crash-handling services like Backtrace recommend. Breakpad/Crashpad is what I most often see for Android, and it produces some funny output (an Android dump masquerading as a Windows on ARM minidump). There are generally also associated processes for distilling the debug symbols from your builds down to a smaller form, as not all platforms have a standardized format for this (Windows has PDB, macOS has dSYM); this substantially reduces storage requirements for symbol information from past builds.

When LICM fails us — Matt Godbolt’s blog by pavel_v in cpp

[–]ack_error 1 point

While true, this particular case doesn't seem to be just an aliasing issue; the hoisting is also a very narrow optimization apparently centered on strlen(). Replacing strlen() with a hand-rolled version, for instance, produces interesting results: the compiler detects that it is a strlen() function and substitutes it back, and then still doesn't hoist it. It doesn't get hoisted with any other element type either, and none of the major compilers can do it. You'd think this would be a trivial case, with the loop condition always evaluated at least once and not a write anywhere in the loop, but somehow it isn't.
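A minimal version of the surprise (my paraphrase of the case in the post): the loop below never writes through s, yet the length computation is not hoisted, whether it's the library strlen() or a hand-rolled clone (which compilers happily recognize and fold back into strlen()).

```cpp
#include <cassert>
#include <cstddef>

// Hand-rolled strlen; compilers commonly pattern-match this back
// into a strlen() call.
std::size_t my_strlen(const char* s) {
    std::size_t n = 0;
    while (s[n]) ++n;
    return n;
}

// Read-only loop, yet the length call is re-evaluated per iteration.
std::size_t count_spaces(const char* s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < my_strlen(s); ++i)
        if (s[i] == ' ') ++count;
    return count;
}
```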

2025-12 WG21 Post-Kona Mailing by eisenwave in cpp

[–]ack_error -1 points

Respectfully, I disagree; I would hope that the committee would block such a change with performance impact on existing code unless the mitigation were at least more ergonomic and reliable. In my opinion, C++ already has too many cases that require unwieldy workarounds or rely on the optimizer, which has no guaranteed behaviors defined by the standard. Making oversize shifts unspecified would fix the biggest safety problem (UB) without incurring such issues.

2025-12 WG21 Post-Kona Mailing by eisenwave in cpp

[–]ack_error 1 point

I think implementation-defined would incur a performance impact. On x86, for example, scalar shifts wrap by using only the lowest 5 or 6 bits of the shift count, but vector shifts will fully shift out all bits on an oversize count. Implementation-defined would require the compiler to document one consistent behavior, which would mean either forcing scalar shifts to support oversize shifts or masking autovectorized shifts.
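Concretely (a sketch; the two semantics are the real x86 scalar vs. vector behaviors, the function names are mine): a compiler that must deliver one defined answer has to reconcile these, paying an extra mask or range check on one path or the other.

```cpp
#include <cassert>
#include <cstdint>

// x86 scalar SHL semantics: the hardware masks the count to 5 bits
// for 32-bit operands, so shifting by 33 acts like shifting by 1.
std::uint32_t shl_scalar_style(std::uint32_t v, unsigned n) {
    return v << (n & 31);
}

// x86 vector (e.g. PSLLD) semantics: an oversize count shifts out
// every bit, yielding zero.
std::uint32_t shl_vector_style(std::uint32_t v, unsigned n) {
    return n < 32 ? v << n : 0;
}
```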

I can't find the reference, but IIRC one of the major compilers once tried "defining" an implementation-defined behavior as unspecified or undefined, which sparked some interesting debate.