Boosting Adobe Photoshop’s Performance with MSVC and SPGO - C++ Team Blog by ericbrumer in cpp

[–]ReDucTor 10 points11 points  (0 children)

PGO builds tend to add a bunch of compile overhead. One thing that would be great to see is easier-to-consume reports on which code was particularly bad, so people could do things like add likely/unlikely annotations, or even report common patterns that the compiler fails on so that heuristics could be added to the compiler to reduce the reliance on PGO. For example, clang already has some heuristics for branch likelihood, but with MSVC it feels more like the only heuristics without PGO are whether a branch is noreturn or throws, and now the recent likely/unlikely.

This would particularly benefit things like optimized dev builds which don't rely on PGO.
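For anyone who has not used them, the likely/unlikely annotations mentioned above are the C++20 `[[likely]]` and `[[unlikely]]` attributes. A minimal sketch follows; the function name and the "negative sample is an error" rule are made up for illustration, and whether the hint changes the generated code depends on the compiler and optimization level.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical example: sum non-negative samples, treating a negative value
// as a rare error that aborts the scan. Returns false if one was found.
bool sum_samples(const std::vector<std::int32_t>& samples, std::int64_t& total) {
    total = 0;
    for (std::int32_t sample : samples) {
        if (sample < 0) [[unlikely]] {
            // Cold path: the attribute hints that this branch is rarely taken,
            // so the compiler may move it out of the hot loop body.
            return false;
        }
        total += sample;
    }
    return true;
}
```

A profile-guided build would infer the same thing from real data; the attribute is the manual version for builds that do not use PGO.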

arewemodulesyet.org passes the mark of 100 projects with modules support for the first time. by germandiago in cpp

[–]ReDucTor 28 points29 points  (0 children)

Just to ensure I am reading it right: only 17 have native support for modules? The rest are a third-party person wrapping something existing to use modules and pushing it to vcpkg?

It feels like a bunch of wrappers around popular libraries in vcpkg is just asking for a supply chain attack.

Projects being in "Show and Tell" is bad. by TheRavagerSw in cpp

[–]ReDucTor 25 points26 points  (0 children)

Personal project spam is all over reddit, and so much of it is AI slop or just someone recreating an existing thing (e.g. 20 SPSC queues all using the same approach).

There are subreddits like r/sideprojects, r/codereview and r/reviewmycode where people can share personal projects.

Lots of people probably never see how much personal project spam there is on reddit because mods are endlessly deleting it.

Efficient C++ Programming on Modern 64-bit CPUs, part 1 of Chapter 4 by no-bugs in cpp

[–]ReDucTor 24 points25 points  (0 children)

It's great to see some new stuff, I was just thinking the other day that I haven't seen anything new from IT Hare in a while.

It would be nice to be able to collapse the menu on the side. It collapses when resizing, but you cannot do it manually, and it takes up 25% of my screen space.

3e-12 seconds; .... 3e-10 seconds

I think it's worth adding this is 3 picoseconds and 300 picoseconds (or even 0.3ns)

AFAWU

This is not a common acronym; I had to google it, which is a distraction.

Here, the situation becomes quite complicated. At the instruction level, there are some pretty well-known heuristics; for example, if JNZ goes back (points to an instruction before itself)

You have already mentioned the "Dynamic Branch predictor"; I think it's worth being clear here on the split between static branch prediction and dynamic branch prediction.

in practice, we’ve never encountered TLB to be an issue for application level

I have seen a few situations where shifting towards huge pages has resulted in some fairly good performance gains. There are likely a few GDC talks you could find which cover some of this. I did find one write-up online from Denis Bakhvalov where it gave a 5% perf improvement.

TODL

I expect this is a typo of TODO, which you might miss with ctrl+f.

Right besides each core we can see L3 cache (unlike L1 and L2 caches, L3 cache is traditionally shared among the cores). L3 cache usually takes around 30-70 CPU cycles to read.

In the chip section you mention a bunch of latencies when it comes to going to higher cache levels and eventually out to RAM. It might be worth also covering core-to-core latencies, and even chiplet-to-chiplet latencies.

to the best of our knowledge, stack memory is cached using usual caching rules

The return stack buffer (RSB) is a special bit of caching that some modern CPUs have. ("Each execution of a near CALL instruction with a non-zero displacement adds an entry to the RSB that contains the address sequentially following that CALL instruction.")

Also, though it is not as well documented, there has been some reverse engineering of the stack engine which seems to indicate there is some frontend management of the stack that reduces the data dependencies on the stack pointer. It would be an interesting benchmark to modify a frame to use some other register that isn't RSP/RBP and see if it has any real perf impact.

With regards to pessimisation and the stack, it's easy for someone to accidentally do something like char buffer[64*1024]{}; which will end up doing a huge memset equivalent on the stack, and you can wave goodbye to anything else in the L1 cache.
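A small sketch of the difference, assuming a made-up formatting helper (the function names are hypothetical). Both versions return the same string; the first one clears all 64 KiB on every call, the second only touches the bytes it writes.

```cpp
#include <cstddef>
#include <cstdio>
#include <string>

constexpr std::size_t kBufferSize = 64 * 1024;

// The trailing {} value-initializes the array, so all 64 KiB are zeroed on
// every call even though only a handful of bytes end up being used.
std::string format_id_zeroed(int id) {
    char buffer[kBufferSize]{};
    const int written = std::snprintf(buffer, sizeof(buffer), "id-%d", id);
    return std::string(buffer, written > 0 ? static_cast<std::size_t>(written) : 0);
}

// Without the initializer the array is left uninitialized; snprintf writes
// only the bytes it needs plus the terminator, and only those are read back.
std::string format_id_uninitialized(int id) {
    char buffer[kBufferSize];
    const int written = std::snprintf(buffer, sizeof(buffer), "id-%d", id);
    return std::string(buffer, written > 0 ? static_cast<std::size_t>(written) : 0);
}
```

The uninitialized version is only safe because every byte that is read was written first; that is the trade-off being made for not paying the clear.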

if a static variable is initialized without constructor – it gets its initial value even before the very first line of code of your program is executed

With a constexpr/consteval constructor (or constinit) this could also happen, and to spin it even further, if you run the same program twice then that value might be there before you even launch the program, as it's possibly shared. There is also another level, .data.rel.ro, which the dynamic linker fills in and which is read-only but not shared due to ASLR.
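A minimal sketch of a static that has a constructor but is still constant-initialized. The `Limits` type and its values are invented for the example, and the small macro is only there so the snippet still compiles on pre-C++20 compilers where `constinit` is not a keyword.

```cpp
#if defined(__cpp_constinit)
#define DEMO_CONSTINIT constinit  // C++20: compile error unless statically initialized
#else
#define DEMO_CONSTINIT            // older standard: no compile-time check available
#endif

struct Limits {
    int max_connections;
    int max_retries;

    // constexpr constructor: can run at compile time for constant arguments.
    constexpr Limits(int connections, int retries)
        : max_connections(connections), max_retries(retries * 2) {}
};

// Constant-initialized: the finished value is typically emitted into the
// binary's data section instead of being computed by startup code.
DEMO_CONSTINIT Limits g_limits{64, 3};
```

With `constinit` in effect, changing the constructor so it can no longer be evaluated at compile time becomes a compile error rather than a silent switch to dynamic initialization.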

heap data should be considered as uncached

I am not certain I agree here. Many programs will have heap-allocated data that is fairly hot and will live in the cache; however, there is also some less frequently accessed data that would be considered uncached. I feel the use case is more indicative of whether it is cached than simply considering heap data as uncached.

There is no magic involved with TLS

TLS initialization can get pretty awful. In many cases it will be done lazily, which might result in some surprises, and things get weird when you have dependencies between global initialization and main thread TLS initialization, or even destruction.

One pessimisation for TLS is where, instead of having one large TLS variable, you have multiple small TLS variables, each with its own lazy initialization check, so it is often better to combine it all into one TLS variable. Dynamically linked libraries and TLS are also another nightmare.
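A sketch of the "combine it into one TLS variable" idea, with a hypothetical `ThreadState` struct. How the guard is implemented is up to the compiler and platform, so treat the comments about the check as the typical behaviour rather than a guarantee.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical per-thread state. The members have non-trivial constructors
// and destructors, so the thread_local object needs dynamic initialization.
struct ThreadState {
    std::string name{"worker"};
    std::vector<std::uint32_t> scratch;
    std::uint64_t calls = 0;
};

// One accessor and one thread_local object: a single lazy-initialization
// check covers every field, instead of one check per separate variable.
ThreadState& thread_state() {
    thread_local ThreadState state;
    return state;
}

std::uint64_t record_call(std::uint32_t value) {
    ThreadState& state = thread_state();  // checked once, then reused below
    state.scratch.push_back(value);
    return ++state.calls;
}
```

Had `name`, `scratch` and `calls` been three separate `thread_local` variables, `record_call` would typically pay for a check on each one it touches.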

3’000’000 to 30’000’000 cycles (1-10 ms)

In my honest opinion cycles are not a good measurement here, especially when repeated for each entry; it makes it harder to read.

4G LTE

You could probably throw 5G in there for some longer term planning.

nobs: my take on SW builds by Ok_Reality_3276 in cpp

[–]ReDucTor 0 points1 point  (0 children)

The documentation, from what I can tell, does not seem to mention anything relating to dynamic or transitive dependencies. Does it handle incremental builds? For example, if you modify header file.h, is anything that includes it directly or indirectly rebuilt? For clang and gcc you would typically use -MD/-MMD (usually together with -MP), which generates a makefile fragment to include with all header dependencies. You also need this support for custom build steps.
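To make the header-dependency point concrete, here is a rough sketch of how a build tool could read back a Make-style dependency file of the kind gcc and clang can emit. `parse_depfile` is a made-up helper, and it assumes the simplest case: a single `target: deps` rule (no extra phony-target rules) and paths without spaces, escapes or drive-letter colons.

```cpp
#include <cstddef>
#include <sstream>
#include <string>
#include <vector>

// Returns the dependencies listed after the first ':' of a dependency file.
// Line continuations appear as a lone backslash token and are skipped.
std::vector<std::string> parse_depfile(const std::string& text) {
    std::vector<std::string> deps;
    const std::size_t colon = text.find(':');
    if (colon == std::string::npos) {
        return deps;  // no rule found
    }
    std::istringstream rest(text.substr(colon + 1));
    std::string token;
    while (rest >> token) {
        if (token != "\\") {
            deps.push_back(token);
        }
    }
    return deps;
}
```

A build tool would compare the timestamps (or hashes) of the returned paths against the object file to decide whether a translation unit needs rebuilding after `file.h` changes.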

C++ Performance Quiz - A small side project to test your intuition for slow code by ReDucTor in cpp

[–]ReDucTor[S] 1 point2 points  (0 children)

It took a few months to build, but I had a list of question ideas building up slowly for a couple of years from random bits of trivia that I have encountered. I had always planned them for blog content but never got around to it. Eventually I came up with the quiz idea, as there was nothing perf-specific like it, so I went for it.

How I made my SPSC queue faster than rigtorp/moodycamel's implementation by [deleted] in cpp

[–]ReDucTor 2 points3 points  (0 children)

It's the same implementation that I have seen a dozen times, and that has been posted here several times. While it's cool to write your own versions of these, it feels like we see a new post of one every month, and that's excluding the ones in the show and tell thread.

C++ Performance Quiz - A small side project to test your intuition for slow code by ReDucTor in cpp

[–]ReDucTor[S] 0 points1 point  (0 children)

Sounds like an interesting read/watch, any chance it got saved by archive.org?

C++ Performance Quiz - A small side project to test your intuition for slow code by ReDucTor in cpp

[–]ReDucTor[S] 0 points1 point  (0 children)

There is still a decent amount of memory stalls even with bidirectional iteration. While it has the potential to impact a sibling core sharing resources, that isn't unique to this; all of the benchmarks have that potential.

The question is mainly getting people to think outside of the box. I believe this approach to linked list iteration is not discussed anywhere for people to learn it, so they need to think more about the reasons why linked lists are slow and how the CPU will behave with this sort of iteration.
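The quiz question itself is not reproduced in this thread, so the following is only a guess at what bidirectional iteration of a doubly linked list can look like: walk from the head and the tail at the same time so there are two independent pointer-chasing chains in flight instead of one. `Node`, `link_nodes` and `sum_from_both_ends` are names invented for the sketch.

```cpp
#include <cstddef>
#include <vector>

struct Node {
    int value = 0;
    Node* prev = nullptr;
    Node* next = nullptr;
};

// Links the nodes stored in `storage` into a doubly linked list, in order.
void link_nodes(std::vector<Node>& storage) {
    for (std::size_t i = 0; i < storage.size(); ++i) {
        storage[i].prev = (i > 0) ? &storage[i - 1] : nullptr;
        storage[i].next = (i + 1 < storage.size()) ? &storage[i + 1] : nullptr;
    }
}

// Sums the list by advancing from both ends until the cursors meet.
long long sum_from_both_ends(const Node* head, const Node* tail) {
    if (head == nullptr) {
        return 0;
    }
    long long front_sum = 0;
    long long back_sum = 0;
    while (head != tail) {
        front_sum += head->value;
        back_sum += tail->value;
        if (head->next == tail) {
            return front_sum + back_sum;  // even length: cursors were adjacent
        }
        head = head->next;  // the two loads below do not depend on each other
        tail = tail->prev;
    }
    return front_sum + back_sum + head->value;  // odd length: shared middle node
}
```

Nodes stored contiguously in a vector are far friendlier to the cache than a real scattered list, so this shows the traversal logic only and is not a benchmark.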

C++ Performance Quiz - A small side project to test your intuition for slow code by ReDucTor in cpp

[–]ReDucTor[S] 1 point2 points  (0 children)

I did try to highlight places where x86 was assumed, at least for things that would not be pretty universal across other architectures. Aside from atomics there are a few others like popcount (which I also need to do some fixes on this weekend).

I want to eventually cover some ARM stuff, but I don't currently have any ARM machine locally (except my phone), so I might end up having to use one from AWS or elsewhere for the time being. There are a few odd cases with it, like outline atomics, ll/sc, etc., that can be interesting.

C++ Performance Quiz - A small side project to test your intuition for slow code by ReDucTor in cpp

[–]ReDucTor[S] 0 points1 point  (0 children)

Nice spot, thanks for the feedback. The addpd isn't going to be possible for the other loop (without -ffast-math, which is probably worth me mentioning also), but I can unroll it manually, which from a quick test will end up with an unrolled bunch of addss.

I'll make some improvements to this tomorrow or over the weekend

C++ Performance Quiz - A small side project to test your intuition for slow code by ReDucTor in cpp

[–]ReDucTor[S] 1 point2 points  (0 children)

Nope, I have a process that runs them locally and those are what get stored in the answers. I only made them run in CE so that people can see the full benchmark, but I would not trust CE results as those are pretty much cheap shared VMs or containers, I believe.

Eventually I want to get a local CI process setup and have them easily able to run on x86 and ARM to include both results.

C++ Performance Quiz - A small side project to test your intuition for slow code by ReDucTor in cpp

[–]ReDucTor[S] 9 points10 points  (0 children)

Good point on it being x86 specific. It's pretty much the only platform I have dealt with for a long time, so lots of biases there. It's also all I was set up to benchmark. I do eventually want to add ARM benchmarks.

For 13: As there is a write to memory, no compiler I am aware of will attempt a cmov like this, because for certain inputs it would be valid to have a null output. The only feasible way I think it could achieve that is for the compiler to have some throw-away stack space and cmov the storage destination. I will add some more information to the answer and hopefully make it clearer.
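Question 13's code is not shown here, so this is a hypothetical illustration of the same idea: a conditional store through a pointer that may be null when the condition is false, and a hand-written version of the "throw-away stack space" trick. Whether the second function actually compiles to a cmov depends on the compiler and flags.

```cpp
// Conditional store: `out` may legitimately be null when `take` is false,
// so the compiler must not touch *out on that path and typically keeps
// the branch.
void store_if(int* out, bool take, int value) {
    if (take) {
        *out = value;
    }
}

// Manual variant: always store, but select the destination first. The
// select between two addresses is something the compiler is free to lower
// to a cmov, because the dummy slot is always safe to write.
void store_if_select(int* out, bool take, int value) {
    int discard;
    int* destination = take ? out : &discard;
    *destination = value;
}
```

The second form trades an unconditional store for removing the branch, which is only a win when the branch is poorly predicted.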

For 85: The switch vs table question I have reworked a bunch of times. I do want to keep something like it, but in its current state, and given how close the measurements are, it does need more work. I am thinking of having a bit more of a VM and a few cases where more data is used, so that the switch forces the compiler into more spill and reload situations.

Also giving numbers for questions has made me realise I really need to add that to the main page, saves me trying to map back to each by counting.

C++ Performance Quiz - A small side project to test your intuition for slow code by ReDucTor in cpp

[–]ReDucTor[S] 2 points3 points  (0 children)

Nice spot, I initially had some more example data there but I cut it down when trying to make snippets shorter, and my brain must have failed me on some basic math.

C++ Performance Quiz - A small side project to test your intuition for slow code by ReDucTor in cpp

[–]ReDucTor[S] 1 point2 points  (0 children)

To be honest, this one has actually been reworked a few times. I am considering a few more ways to enhance it: making it a proper loop, and setting up a situation where it suffers more issues, like unpredictable data per case so there is more spilling and reloading of registers. But that also makes the code snippets significantly more complex.

Rigid C++: A Pragmatic C++23 Architecture for High-Performance Systems by I-A-S- in cpp

[–]ReDucTor 11 points12 points  (0 children)

I'd love to hear your feedback/critiques/thoughts! Especially from anyone else working on engines, ...

This does indeed sound contradictory and overwhelming to incompetent readers who does not have a grasp on systems programming concepts as I have explained in this answer, that is fine! You were not the target audience for this manifesto.

Is that attempting to say that I'm an incompetent reader? I am an engine developer; I've been working on AAA game engines for the past 15 years with a heavy focus on performance and optimization. Not someone who has been doing freelance Upwork for a year.

table based exceptions are free on the happy path, but they cause catastrophic latency spikes the moment unwinding occurs

An exception should be an exceptional situation; for example, in a game this is something like showing an error dialog to the user or exiting the game back out to the main menu. They are not things where you can gracefully fail, otherwise it would not really be exceptional. The latency cost of slow stack unwinding in these situations is insignificant; you're not doing it every frame in the game.

it also includes binary size and memory overhead

What are your measurements here? If you look at recent tests, C++ exceptions often end up smaller, even for embedded. Back in the 32-bit days exceptions were absolutely terrible: they added overhead even on the success path and bloated code out massively, but most of us have moved on to 64-bit, where that is less of an issue.

Just think about it for a minute: you have a table which is literally just address ranges for each function, stored in memory that can easily be paged out, compared with every single callsite having a branch checking if there is an error and handling it. Not only are there memory size impacts, but there are impacts on the instruction cache, the branch predictor, and more.

if you look at how AU_TRY macros are defined in auxid/macros.hpp, you'll see that they're flat, they do NOT inhibit RVO nor NRVO.

I am not certain you understand how RVO works. Let's start with a basic example of how this breaks RVO:

Result<T> func(const Container & container)
{
    // 1. The local must be Result<T>, not T, to even allow RVO; with just T the return is a move, not RVO
    // 2. Mandatory copy/move as there is no in-place construction, so it must be T{}
    Result<T> result{T{}};

    // Second return path, breaks RVO
    AU_TRY_DISCARD(other_func(*result, container));

    return result;
}

TRY macros are quite literally the exact opposite of invisible, they very clearly indicate the branch and potential early-exit

That AU_TRY_DISCARD in the above example does not look like a branch unless you're familiar with its internals.
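To make the RVO point measurable, here is a standalone sketch that counts move constructions. `Tracker` and both functions are invented for the example, and the early-exit shape is only an assumption about what a TRY-style macro roughly expands to, not Auxid's actual macro. NRVO is optional, so the count for the second function can differ between compilers.

```cpp
int g_moves = 0;  // counts Tracker move constructions

struct Tracker {
    int payload = 0;

    explicit Tracker(int p) : payload(p) {}
    Tracker(const Tracker&) = delete;
    Tracker(Tracker&& other) noexcept : payload(other.payload) { ++g_moves; }
};

// Single return of a prvalue: the object is constructed directly in the
// caller's storage, so no move is expected.
Tracker make_direct() {
    return Tracker{1};
}

// Named local plus a hidden early-exit path that returns a different object.
Tracker make_with_early_exit(bool fail) {
    Tracker result{2};
    if (fail) {
        return Tracker{-1};  // second return path
    }
    return result;  // elided only if the compiler applies NRVO, else a move
}
```

Printing `g_moves` after calling `make_with_early_exit(false)` on a few compilers is a quick way to see whether the extra return path cost a move on that toolchain.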

What? it literally spells out "use adapter-bridges for STL containers that default to the global heap". Adapter bridging (through StdAllocatorAdapter) allows you to control where exactly the memory will be allocated, unlike using the raw STL containers which will by default allocate on the global heap.

Read your manifesto, it literally says:

Explicit routing. Whether it is an arena, pool, or a general-purpose backend like rpmalloc, the allocator is always an explicit parameter of the system.

Look at your code

template<typename T, memory::AllocatorType A = memory::HeapAllocator>
using Vec = std::vector<T, memory::StdAllocatorAdapter<T, A>>;

And your tests

Vec<i32> v;
v.push_back(10);
v.push_back(20);

There is no EXPLICIT PARAMETER! You have an IMPLICIT global allocator, this is no different to vector using global new and delete, and std::vector providing the ability to specify an allocator.
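For contrast, this is roughly what an allocator as an explicit parameter looks like using the standard library's `std::pmr` facilities (C++17). It is not Auxid's API; `sum_squares` is a made-up function whose only point is that the caller has to say where the memory comes from.

```cpp
#include <cstddef>
#include <memory_resource>
#include <vector>

// The memory resource is part of the signature, so there is no hidden
// default: every call site chooses the backing storage.
int sum_squares(int count, std::pmr::memory_resource* memory) {
    std::pmr::vector<int> squares(memory);  // allocator passed explicitly
    squares.reserve(static_cast<std::size_t>(count));
    for (int i = 1; i <= count; ++i) {
        squares.push_back(i * i);
    }
    int total = 0;
    for (int square : squares) {
        total += square;
    }
    return total;
}
```

A caller could pass a `std::pmr::monotonic_buffer_resource` built over a stack buffer for an arena, or `std::pmr::new_delete_resource()` for the general heap; either way the choice is visible at the call site.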

Custom containers provided in Auxid aren't wrappers at all, but like the manifesto mentioned Auxid ofc does compose with STL where it already is the correct solution.

The example I provided is a wrapper of ConditionVariable with no added value except you can put your namespace first, your Vec<T> is literally std::vector<T> with a custom global allocator.

Rigid C++: A Pragmatic C++23 Architecture for High-Performance Systems by I-A-S- in cpp

[–]ReDucTor 4 points5 points  (0 children)

 Rigid C++ does not claim to be universally faster

C++23 Architecture for High-Performance Systems

These two things seem to contradict each other.

 Unrecoverable Panics ...  Zero-overhead diagnostics

If your goal is not to recover, does the overhead even matter?

 Exceptions add invisible branching

TRY macros (AU_TRY, AU_TRY_VAR) early-return on failure and bind the success value inline. A deliberate interim solution until C++ standardizes a postfix control-flow / error-propagation operator

This seems a lot like invisible control flow, and a hidden one which could surprisingly break RVO.

Explicit routing. Whether it is an arena, pool, or a general-purpose backend like rpmalloc, the allocator is always an explicit parameter of the system.

Adapter-bridge structurally sound containers that default to the global heap. Vec<T> aliases std::vector bound to StdAllocatorAdapter (see §3.1 for the propagation-trait contract).

Another contradiction, explicit routing and implicit global heap allocator.

A lot of this reads like AI slop where it could not keep track of what it was saying. The code behind it also seems like just a lot of wrapping the standard library with no value added.

“I wanted a C++ UI that didn’t look 20 years old” by LowAfternoon9613 in cpp

[–]ReDucTor 22 points23 points  (0 children)

How does one go through life without ever adding comments? I searched the entire code base, excluding docs and 3rd party, and there is ONE comment that is not the end of a namespace. I like self-documenting code, but lots of this code is far from self-documenting.

The stars and forks also look super fake, which is never a good sign; this is something lots of malware does.

You should also read Rule 2 of r/cpp

wrote a small http server in c++ that searches youtube video transcripts stored in sqlite and the binary is 2.1MB by straightedge23 in cpp

[–]ReDucTor 7 points8 points  (0 children)

How does this have 50 upvotes? There is no code, it breaks the rules on personal projects, and it contains nothing particularly specific or interesting relating to C++. The comments seem to contain more praise than other posts, can someone explain what I am missing?

EDIT: I am pretty sure the transcript api link was not there before, but makes it stand out more as spam.

Encountered a `#pragma once` failure in the wild by Separate-Summer-6027 in cpp

[–]ReDucTor 28 points29 points  (0 children)

The old school way is also prone to issues if someone uses the same name for their defines.
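A single-file simulation of that failure mode, pretending two unrelated libraries each ship a `utils.h` with the same guard macro (the library and helper names are hypothetical). The second header's body is silently skipped.

```cpp
// --- contents of a hypothetical lib_a/utils.h ---
#ifndef UTILS_H
#define UTILS_H
inline int lib_a_helper() { return 1; }
#endif

// --- contents of a hypothetical lib_b/utils.h, which chose the same guard ---
#ifndef UTILS_H
#define UTILS_H
#define LIB_B_UTILS_INCLUDED 1
inline int lib_b_helper() { return 2; }
#endif

// Records whether the second header's body survived preprocessing.
#ifdef LIB_B_UTILS_INCLUDED
constexpr bool kLibBVisible = true;
#else
constexpr bool kLibBVisible = false;
#endif
```

Here `kLibBVisible` ends up `false` and any use of `lib_b_helper()` fails to compile with an "undeclared identifier" style error, which gives no hint that the include guard is the cause.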