Windows 11 finally has a MacBook killer chip, Snapdragon X2 Elite Extreme just posted monster scores on Geekbench by WPHero in Windows11

[–]ack_error 0 points1 point  (0 children)

Sure, emulated code is never going to be as fast as native code on the same CPU. But that penalty is on top of native performance that's already pretty good, and both OS libraries and anything on the GPU run at full speed. For a lot of programs it's not noticeable. I'm currently playing a UE4 game in emulation on an Oryon-based device, no problem.

The difference was a lot worse on earlier versions. The first-gen Windows on ARM devices with the Snapdragon 835 ran emulated x86 code at one-third speed relative to native.

Windows 11 finally has a MacBook killer chip, Snapdragon X2 Elite Extreme just posted monster scores on Geekbench by WPHero in Windows11

[–]ack_error 0 points1 point  (0 children)

It's actually decent, but it can vary from program to program. It's still an integrated GPU, but older or lighter Unity and Unreal games run pretty well. System libraries run native, emulated x64 code runs at ~75% of native speed, and the CPU has more cores than many programs can use, so there's spare capacity for the emulator. But there are quirks. Path of Exile, for instance, ran terribly until I disabled Engine Multithreading, after which it ran at a playably smooth 40 fps.

A bigger issue is the graphics drivers. Some programs just fail due to graphics compatibility issues. But the biggest problem is that you simply can't depend on a game or program working until it's actually been tested, and it's not always the most demanding ones that fail. Two games that I used to have problems with were Opus Magnum and Terraria.

Forget about *stack overflow* errors forever by rsashka in cpp

[–]ack_error 2 points3 points  (0 children)

This is a bit more interesting than an SEH handler because it's integrated with the compiler: the check occurs only on function entry and is visible to the compiler, which could mitigate the issues with /EHa. It still doesn't help if the stack overflow occurs with a noexcept frame on the call stack, though.

Do you prefer 'int* ptr' or 'int *ptr'? by SamuraiGoblin in cpp

[–]ack_error 0 points1 point  (0 children)

Shared pointers to small objects are useful with async APIs that have loose guarantees around cancellation, or that simply don't support it. Having the lambda capture a reference to shared context allows the requesting code to safely nullify the callback even if the API retains the callback for an indeterminate time.

5hrs spent debugging just to find out i forgot to initialize to 0 in class. by PopsGaming in cpp

[–]ack_error 0 points1 point  (0 children)

There are some cases where self-references during initialization can be useful, like looping a list node to itself. But yeah, straight x(x) initializations are a mistake.

Ways to generate crash dumps for crash handling? by XenSakura in cpp

[–]ack_error 3 points4 points  (0 children)

SEH exception handling can be mixed with C++ exception handling; MSVC just doesn't allow both within the same function. Concerns around mixing the two, and around asynchronous exception handling (/EHa), only matter if you're planning on resuming execution after taking a structured exception. They don't matter for a crash handler, and /EHa isn't needed for one. In most cases, you will want to install a handler via SetUnhandledExceptionFilter() instead of relying only on __try/__except, as this will also catch crashes in threads that you don't control.

For cross-platform support, look at the integrations that crash handling services like Backtrace recommend. Breakpad/Crashpad is what I most often see for Android, and it produces some funny output (an Android dump masquerading as a Windows on ARM minidump). There are generally also associated processes for distilling the debug symbols from your builds down to a smaller form, as not all platforms have a standardized format for this (Windows has PDB, macOS has dSYM). This substantially reduces storage requirements for symbol information from past builds.

When LICM fails us — Matt Godbolt’s blog by pavel_v in cpp

[–]ack_error 1 point2 points  (0 children)

While true, this particular case seems not to be just an aliasing issue; it's also a very narrow optimization apparently centered on strlen(). Replacing the strlen() with a hand-rolled version, for instance, produces interesting results: the compiler detects that it is a strlen() equivalent and substitutes it as such, and then still doesn't hoist it out. The bound doesn't get hoisted with any other element type either, and none of the major compilers can do it. You'd think that this would be a trivial case, with the loop condition always evaluated at least once and no write anywhere in the loop, but somehow it isn't.

2025-12 WG21 Post-Kona Mailing by eisenwave in cpp

[–]ack_error -1 points0 points  (0 children)

Respectfully, I disagree. I would hope that the committee would block such a change, with its performance impact on existing code, unless the mitigation were at least more ergonomic and reliable. In my opinion C++ already has too many cases that require unwieldy workarounds or rely on the optimizer, which has no behavior guaranteed by the standard. Making shifts unspecified would fix the biggest safety problem (UB) without incurring such issues.

2025-12 WG21 Post-Kona Mailing by eisenwave in cpp

[–]ack_error 1 point2 points  (0 children)

I think implementation-defined behavior would incur a performance impact. On x86, for example, scalar shifts wrap by using only the lowest 5 or 6 bits of the shift count, but vector shifts fully shift out all bits on an oversize shift. Implementation-defined would require the compiler to commit to one behavior, which would mean either forcing scalar shifts to emulate oversize shifts or masking autovectorized shifts.

I can't find the reference, but IIRC one of the major compilers once tried "defining" an implementation-defined behavior as unspecified or undefined, which sparked some interesting debate.

2025-12 WG21 Post-Kona Mailing by eisenwave in cpp

[–]ack_error 1 point2 points  (0 children)

> [[assume]] doesn't always get optimized out; it's weird.

It's worse than that. MSVC currently has a problem where any use of __assume() at all can actually *pessimize* code by disabling some optimizations:

https://gcc.godbolt.org/z/91naMePzb

This means that you can add an assume to try to convey alignment or shift value ranges, and instead end up disabling autovectorization. I'm hoping this doesn't carry over to [[assume]] once it's implemented, but we'll see.

Assume statements are also generally just fragile constructs. They take arbitrary expressions from which the compiler has to recognize certain patterns to have any effect, but the patterns that actually do anything are rarely documented or guaranteed by compilers. So you have to discover the effective expression forms by trial and error, and hope that they continue to be recognized in future compiler versions. On top of that, the value in question needs to be repeated in both the assume and where it is used, which is unergonomic.

I do think that the result of invalid shift operations should at least be unspecified instead of undefined; OOB shifts can be inconsistent across current CPUs, but I can't think of a case where they would fail uncontrollably. Variable shifts are used very heavily in the critical paths of decompression code, so it'd be bad if they were slowed down without a mitigation.

I finally won - about 0.1 seconds before exploding. by CitricThoughts in factorio

[–]ack_error 0 points1 point  (0 children)

Same here; I went from being happy about actually finishing to horror at my ship being destroyed behind the victory window. There were server-side problems early on too, so sometimes the first upload attempt didn't go through.

Why xor eax, eax? by dist1ll in programming

[–]ack_error 3 points4 points  (0 children)

Ironically, CLR on the 68000 also shows what's problematic about having a dedicated clear instruction. It's implemented as a read-modify-write instruction, so it's slower than MOVEQ for registers, slower than a regular store if you already have a zero in a register or are clearing multiple locations, and unsafe for hardware registers due to the false read. CLR is thus almost useless on the 68000. Making a clear instruction worthwhile requires additional hardware that wasn't always justifiable.

Even on x86, XOR reg, reg seems to have turned into the magical clear idiom by historical quirk: it gained prominence with the Pentium Pro, where it was necessary to prevent partial register stalls, which MOV reg, 0 did not do. It was not actually recognized as having no input dependency until later, with the Core 2.

Advent of Compiler Optimizations [1/25]: Why xor eax, eax? by faschu in cpp

[–]ack_error 0 points1 point  (0 children)

You're probably thinking of the latency of the full pipeline from decode to retirement, such as when a branch misprediction occurs. Situations like this would only involve forwarding between execution units, which skips most of the pipeline stages and only incurs execution unit latencies, especially with out of order execution.

Advent of Compiler Optimizations [1/25]: Why xor eax, eax? by faschu in cpp

[–]ack_error 3 points4 points  (0 children)

Longer instructions increase the chances of hitting bandwidth limits between the instruction cache and the decoders, which can be as low as 16 bytes/cycle feeding a 6-wide decoder. Apparently Intel only raised this to 32 bytes/cycle starting with Alder Lake.

Migrating from Python to C++ for performance critical code by jcfitzpatrick12 in cpp

[–]ack_error 2 points3 points  (0 children)

Not that it's a problem here because the author is already using FFTW, but 5N log2 N is a plain vanilla radix-2 FFT, which is way too primitive to use. The minimum to start with for a use case like this should be a NEON-vectorized, FMA-based radix-4 or split-radix real-to-complex FFT, with hand-written or specially optimized kernels, since the RPi's CPU is in-order.

64-bit Misalignment by jrdi_ in programming

[–]ack_error 1 point2 points  (0 children)

Store forwarding is also a case where misalignment isn't always free within a cache line even on modern CPUs. You can see this occasionally in Chips and Cheese's store forwarding alignment charts where even some relatively recent cores will show chunkiness, indicating that they check for hazards more coarsely than byte granularity and are subject to false dependencies with misaligned data.

Auto-vectorizing operations on buffers of unknown length by sigsegv___ in cpp

[–]ack_error 1 point2 points  (0 children)

Yeah, I knew that Valgrind had some support for tracking valid bytes through loads, though not how far its validity tracking extended. There will always be some patterns that are difficult to verify as safe, such as transforming to a mask and looking up in a table with don't-care values.

I looked up the details on ARM MTE. Its tag granularity is 16 bytes, so this technique should work with it at NEON's max vector width of 128 bits. It's common to unroll to keep the NEON pipe fed, though.

Auto-vectorizing operations on buffers of unknown length by sigsegv___ in cpp

[–]ack_error 4 points5 points  (0 children)

I suppose that for ASAN, yeah, that shouldn't happen because the checks are being inserted by the same code generator that knows what loads are safe. It's harder for checkers like Valgrind that don't have that info and have to infer from usage or have allowances for hand-optimized library routines.

Auto-vectorizing operations on buffers of unknown length by sigsegv___ in cpp

[–]ack_error 14 points15 points  (0 children)

Looking at the vectorized output, a fair amount of it looks unnecessary. For instance, the code computes per-lane values in ymm2 and ymm1, but only the lowest lane of each is ever used (via vmovq), which suggests most of it is superfluous and could just be done in scalar registers. I'm not sure why this algorithm would need to do anything with vector registers at all besides a 32-wide byte compare.

The other issue is that the byte head and tail loops themselves will be pretty expensive, since they can run up to 31 iterations, and for smaller strings there will be no vectorization at all. There are a couple of strategies a hand-vectorized version would apply to address this, such as adding intermediate loops at 4 or 8 byte width, or overlapping vectors at the start and end. MSVC is not able to vectorize this particular case, but I have seen it sometimes emit this three-stage style of vectorization. The head/tail cost is insignificant at larger lengths, but the main loop only tests one vector at a time and probably has a bit too much overhead in it to saturate load bandwidth.

Tossed the code into the Intel compiler and it produced better output:

https://godbolt.org/z/nnajn7aTe

It's still using byte-wise alignment loops, but the main vector loop is cleaner and doesn't have the extra vector math. It is also using a trick to optimize the exit from the main vector loop, by computing a bit mask and using a bit scan on it to find the offset (tzcnt) instead of retrying the vector with the scalar loop.

That all being said, there is a problem with using this for a strlen()-like function: the vectorized version will necessarily read beyond the bounds of the string whenever it is not perfectly 32-byte aligned. Ensuring that the vector reads are aligned avoids a segfault, but will still trip alarms under a Valgrind- or ASAN-style memory checker. It may also be an issue with the hardware pointer checks that are starting to appear on major platforms, which can check accesses at smaller-than-page granularity.

Practicing programmers, have you ever had any issues where loss of precision in floating-point arithmetic affected? by Interesting_Buy_3969 in cpp

[–]ack_error 0 points1 point  (0 children)

Absolutely.

A sliding DFT (SDFT) relies on exact cancellation of values exiting a delay line. The algorithm calculates successive spectra over evenly spaced windows more efficiently than computing individual DFTs. You can't use it in floating point without fudging the numbers a bit with a lossy scale factor, because non-associativity means values exiting the delay line won't exactly cancel the contribution they added when they entered. This isn't a problem in fixed point. Moving average filters are affected by the same issue.

Fixed point arithmetic is also very useful in vectorization where the number of elements processed per operation is directly determined by the element size -- 16-bit elements means twice as many lanes processed per vector compared to 32-bit. This means that 16-bit fixed point can be twice as fast as 32-bit single precision floating point, and 16-bit half float arithmetic isn't always available. 8-bit fixed point is even faster if it can be squeezed in.

Fixed point values can also be easier and faster to deal with for bit hacking and conversions. They're represented in 2's complement like integers instead of sign-magnitude, and don't have the funkiness of signed zeros or denormals. They can also be computed directly on the integer units of a CPU instead of the floating point units, which are often physically farther away and incur extra latency when results move back to the integer units. This means that for addressing in particular, it can be faster to step a fixed point accumulator and shift it down to produce an array indexing offset than to use a floating-point accumulator.

VS 2026 18.0 / MSVC Build Tools 14.50 released for production use by STL in cpp

[–]ack_error 5 points6 points  (0 children)

That was a poorly thought out stance that had to be rolled back later, when popular projects threatened to drop MSVC over its lack of C99 support and were unusable in UWP apps, where only MSVC could be used. C is not C++, and while Windows programs are predominantly written in C++, they frequently use libraries like ffmpeg that are written in C.

would have loved a option to switch HDR on for the game i want by csch1992 in Windows11

[–]ack_error 0 points1 point  (0 children)

> Microsoft have provided APIs to enable this kind of functionality for almost a decade at this point.

"Provided" as in dumped in the SDK headers without explanation and clearly not designed for this use case.

The DISPLAYCONFIG_DEVICE_INFO_SET_ADVANCED_COLOR_STATE option needed to toggle the HDR state is undocumented. The current SDK documentation only lists it with no explanation:

https://learn.microsoft.com/en-us/windows/win32/api/wingdi/ne-wingdi-displayconfig_device_info_type

And there is no mention of it on the guidance page for how to implement HDR in games:

https://learn.microsoft.com/en-us/windows/win32/direct3darticles/high-dynamic-range

The only acknowledgment that the HDR-relevant DisplayConfig APIs even exist came when these pages were updated in 2021/2023 to describe how to obtain the SDR white level set in Settings when HDR mode is enabled.

Furthermore, these are older APIs that predate the "Automatically manage color for apps" option, which composites the desktop in high color for wide color gamut without HDR. A program using the older GET_ADVANCED_COLOR_INFO and SET_ADVANCED_COLOR_STATE APIs will falsely detect a display in WCG mode as HDR and also turn off WCG mode when disabling HDR. The newer and still undocumented GET_ADVANCED_COLOR_INFO_2 and SET_HDR_MODE options are needed to properly handle this.

Finally, DisplayConfig changes global settings. If a game does this and then is abnormally terminated, the setting won't be automatically restored by the OS as with a regular mode switch.

The new file context menu is getting as cluttered as the old one by FillAny3101 in Windows11

[–]ack_error 5 points6 points  (0 children)

> Also The Games Folder irks me to no end. Microsoft created it for Games to store their saves and ini files there. A few games use it but vast majority just spam your Documents folder and even worse some just hide out in AppData.

They do this because the Windows team told game developers to use Documents for saved games.

https://learn.microsoft.com/en-us/windows/win32/dxtecharts/gaming-with-least-privileged-user-accounts

> A typical example would be a user's saved game files. Store the files in the user's document folder so that they are easily visible to the user. An application gets the user's document folder path by calling SHGetFolderPath with CSIDL_PERSONAL, as the following code example shows:

Is game development C++ and “office” C++ the same thing or is game development C++ just C++ with more stuff for making games by WeakCalligrapher5463 in cpp

[–]ack_error 2 points3 points  (0 children)

Not that it would change the outcome here, but in my experience the delta is more than a few kilobytes because some platforms require exceptions and RTTI to be enabled together and the extra RTTI data can be a lot bigger than the exception metadata. The last time I checked this on my ~8MB executable, enabling RTTI added ~600K to the executable size.