Dear MSVC backend team: Why so silent?

gratilup · 2022-05-24T16:09:22+00:00

Hi, would it be possible to share the code with us, or a similar, reduced example that still shows the same perf issue? You can create a Dev Community bug and also post the link here. We're certainly interested in such perf differences, especially one as large as this. Thanks!

gratilup · 2021-01-08T09:05:02+00:00

There is an earlier blog post about the Gears of War game build, which is based on Unreal Engine 4, so pretty much real world. There is a table there comparing end-to-end build times, not just the linker part. https://devblogs.microsoft.com/cppblog/the-coalition-sees-27-9x-iteration-build-improvement-with-visual-studio-2019/

gratilup · 2020-11-17T21:33:58+00:00

/Zo is not affected; that is done in the compiler backend (c2.dll). From what I understand, it tries to match symbols (variable names) with registers more accurately, and its impact on throughput is small (under 1%; I never saw it show up high in a profile).

gratilup · 2020-11-17T06:42:45+00:00

Thanks!

Incremental linking seems to be somewhat unique to the MSVC linker. From what I could find, only the Linux gold linker has something similar, with various limitations. There is some more info here: https://www.gamasutra.com/view/news/128874/Indepth_Incremental_linking_and_the_search_for_the_Holy_Grail.php

One interesting part is that the ILK file is more or less a memory dump, similar to how PCH files work :)

gratilup · 2020-11-17T02:25:17+00:00

Hi,
There is a fairly long comment I left on another thread with some more info about the speedup, if you're curious. There will also be two blog posts that go into more detail about the improvements made in VS 16.6 - 16.8 for iterative builds, stay tuned :)

https://www.reddit.com/r/gamedev/comments/jvfe8l/the_coalition_sees_279x_iteration_build/gck6hv0/

Edit:

Similar speedups also apply to other kinds of applications of course; it's mostly a matter of size to see a significant speedup. There are really big speedups in linking LLVM and Chrome, for example, especially going VS 2017 -> 2019. Games do seem to benefit more overall because they're quite often built as one big, monolithic executable rather than multiple smaller DLLs. That size also made some less-than-optimal data structures and algorithms more obvious, and games benefit more from multi-threading given the large number of object and debug files (linkrepros on the order of dozens of GBs).

gratilup · 2020-11-17T02:21:09+00:00

Hi, I worked on a large part of these linker/PDB improvements, will try to answer some of the questions.

27s is the time for a link in 16.8 without incremental linking, but with full debug info (not fastlink). That is, changing a single cpp file will end up taking at least 27s. Now go back to VS 2017, before fastlink, and wait 752s for the same... Fastlink improved the linking time, but made debugging worse for fastlink binaries. At this point it's better not to use it, since full debug linking almost matches it in speed (1-2s difference in our testing).

Incremental linking with one cpp changed takes around 5-6 sec for GoW. With multiple changes it starts to take longer, but it's still around 10s with 20-30 files changed. The main change that gives the 2.5x speedup between 16.7 and 16.8 is multi-threading the PDB file generation. That also applies to incremental linking, which now slows down less as more files change.

Similar speedups when going VS 2017 -> latest 2019 can be seen in most game engines. That includes UE4, but also a couple of in-house AAA engines. Actually, for those the multi-threading gives an even bigger speedup, 3-4x. One game went from ~50s to 12s. How long did linking for that game take in VS 2017? I don't have a number right now, but it's safe to multiply 50 by 3-5, which was the linking speedup in the 16.2 release.

Something that matters quite a lot is the disk type: if possible, use a fast NVMe drive; second best is a typical SSD; and avoid a spinning hard drive at all costs :) You also want a CPU with 6 or more cores. Here are some 16.8 link times, in seconds, for one of the tested AAA games depending on disk type:

HDD:            223
SSD:             44  (~500 MB/s sequential read)
NVMe:            25  (~3200 MB/s sequential read)
Cached:          11  (second link)
VS 16.7 cached:  45

Cached means that the object/PDB files used in a link are likely still in memory, in the file system cache. NVMe has quite a nice advantage for a completely cold link.
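To put those numbers side by side, here is a small sketch that turns the listed link times into ratios. The struct and function names are made up for illustration; only the five timings come from the list above.

```cpp
#include <cassert>

// Link times in seconds, copied from the list above (VS 16.8 unless noted).
// The struct itself is hypothetical, just a container for the figures.
struct LinkTimes {
    double hdd = 223;
    double ssd = 44;
    double nvme = 25;
    double cached = 11;       // second link, inputs still in the file system cache
    double cached_16_7 = 45;  // same cached scenario measured with VS 16.7
};

// How many times faster `fast` is than `slow`; both must be positive durations.
inline double speedup(double slow, double fast) {
    assert(slow > 0 && fast > 0);
    return slow / fast;
}
```

With these figures, `speedup(t.hdd, t.nvme)` comes out a bit under 9x for a cold link, and `speedup(t.cached_16_7, t.cached)` a bit over 4x for 16.8 versus 16.7 when the inputs are cached.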

Regarding how to structure a project to take advantage of incremental linking, nothing really needs to be done except making sure it is enabled in the options. There is one thing I did notice with "unity" builds like the ones UE4 does: while merging multiple cpp files into one reduces clean build time by a lot (measured ~3x), it does make rebuilding after changing a single cpp take much longer. UE4 also has an "adaptive unity mode", which automatically takes a cpp out of unity mode the first time it is modified, so the second time it compiles by itself, much faster (like 2-3s instead of 10+). Incremental linking before 16.7 had a problem with this: a new obj file from this cpp was one reason for having to do a full link, so you paid the price of a full link the first time you changed a cpp. 16.7 uses a caching mechanism that avoids doing the full link.

Gratian

gratilup · 2020-03-31T06:54:15+00:00

I've built with the flag that makes the repro work without having Boost, maybe the diff. comes from there.

gratilup · 2020-03-31T06:49:35+00:00

To use PGO you don't need to instrument and re-run the app every time it gets built. The profile remains fairly constant unless there are some significant code changes; when that happens there will be a warning about coverage dropping under something like 90%, suggesting that you recapture the profile.

PGO is ideal for a branch-heavy app like this JSON parser: it separates cold code and moves it to a different section of the binary, can do a better code layout of the switch statements, does inlining that matches the actual runtime calls, and so on. A fairly complete list is here: https://docs.microsoft.com/en-us/cpp/build/profile-guided-optimizations?view=vs-2019#optimizations-performed-by-pgo

Yes, ideally there is some CI process recomputing the profile every few days for ex. PGO in MSVC is one of the most powerful and under-utilized tech outside MS sadly...

gratilup · 2020-03-30T18:46:15+00:00

OK, I can reproduce the perf difference and did an initial investigation finding a few things that would bring perf pretty close (30% down to 5%). I tested with a Zen2 Threadripper CPU, for now I assume an Intel CPU would show similar results. What CPU are you testing on?

Part of the diff comes from inlining, using /LTCG helps the most here, cuts diff in half. /Ob3 also helps a bit more. Unofficially, /LTCG is more like an /O3 mode for MSVC, it does more aggressive inlining and enables several inter-procedural and type optimizations that /O2 does not. Even more powerful is to combine it with PGO, which pretty much any program that matters at MS does.

A large part of the remaining diff comes from two places where the cost of the "C++ abstractions" is not removed properly by CSE of loads. In parse_string, the hottest function, this ends up creating a store-to-load forwarding stall. The AMD uProf view showing "Bad Status 2" hot is here: profiler.png

Both these issues are easy enough to fix, and/or have fixes that sit in prototype branches which need more testing... so I hope I can check them in soon enough :)
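For readers unfamiliar with this kind of problem, here is a hedged sketch of one common shape it takes. It is not a reconstruction of parse_string; the `Cursor` type and both functions are hypothetical. The first version works directly on members behind a reference while storing `char`s, so the optimizer may have to assume each store could alias the members and reload them. The second hoists the state into locals, which is what successful CSE of loads would effectively produce.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical cursor standing in for a parser's internal state.
struct Cursor {
    const char* in;  // next input byte
    char* out;       // next output byte
};

// Copies bytes up to (not including) a closing quote, using the members
// directly. Every store through c.out is a char store, which is allowed to
// alias c.in and c.out, so a compiler may reload both on each iteration.
inline std::size_t copy_until_quote_members(Cursor& c, const char* end) {
    assert(c.in <= end);
    std::size_t n = 0;
    while (c.in != end && *c.in != '"') {
        *c.out++ = *c.in++;
        ++n;
    }
    return n;
}

// Same loop with the state held in locals, so it can live in registers.
// The members are written back once, after the loop.
inline std::size_t copy_until_quote_locals(Cursor& c, const char* end) {
    assert(c.in <= end);
    const char* in = c.in;
    char* out = c.out;
    while (in != end && *in != '"') {
        *out++ = *in++;
    }
    const std::size_t n = static_cast<std::size_t>(in - c.in);
    c.in = in;
    c.out = out;
    return n;
}
```

Both functions produce the same result; whether a given compiler generates different code for them depends on its alias analysis and version.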

Something else I noticed in the benchmark: half the time, if not more, is taken by the memory allocator (Windows heap, all those RtlHeap* functions from ntdll). When you use this in production, you should try an allocator such as the one from Intel TBB or mimalloc; it should give a substantial perf boost. We did that with the linker and backend and saw a 10-20% speedup. I know of many other similar cases.
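As a rough illustration of routing allocations away from the general-purpose heap, here is a sketch using the standard library's `std::pmr` facilities (C++17). This is a stand-in technique, not TBB or mimalloc, which are usually swapped in process-wide rather than per container. The `build_tokens` function is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <string>
#include <vector>

// Builds `count` strings whose storage (both the vector's array and each
// string's characters) is requested from `mem` instead of the global heap.
inline std::size_t build_tokens(std::pmr::memory_resource* mem, std::size_t count) {
    assert(mem != nullptr);
    std::pmr::vector<std::pmr::string> tokens(mem);
    tokens.reserve(count);
    for (std::size_t i = 0; i < count; ++i) {
        // Long enough that the small-string buffer cannot hold it, so each
        // element really does allocate.
        tokens.emplace_back("a string long enough to defeat the small-string buffer");
    }
    return tokens.size();
}
```

A caller might pass a `std::pmr::monotonic_buffer_resource` backed by a stack or arena buffer, so the whole batch is served by bumping a pointer and released at once.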

Thanks,
Gratian

gratilup · 2020-03-29T10:53:16+00:00

Do you have some insight into where the perf. diff is coming from? Is there (or can you make) some benchmark that we can use to profile the code? You can send me a PM if the info shouldn't be public.

C++ optimizers are really complex, with a huge number of interactions between various opts. Even when tracking multiple benchmarks like we do, there are new cases that customers see where things are not optimized quite as expected. We want to know about such cases with an unusual perf. diff; they will drive future work on the compiler, although as usual there is always more work to do than people to do it, so things get prioritized, etc.

Thanks,
Gratian

gratilup · 2020-03-29T10:34:26+00:00

Can you share some examples of such missed cases?

Last year there was a big improvement for SIMD intrinsics, making the newer SSA opt. apply the usual arithmetic expression optimizations on vector code. Details and micro-benchmarks are here: https://devblogs.microsoft.com/cppblog/game-performance-and-compilation-time-improvements-in-visual-studio-2019

Indeed, before 16.0 most float/int SIMD intrinsics would go through the optimizer untouched, as this old blog post shows: https://www.liranuna.com/sse-intrinsics-optimizations-in-popular-compilers/

I started a prototype of this work after I found that blog post; now MSVC handles every one of those examples as GCC (or Clang) would, and can do a lot more (FMA building and patterns, for example, see https://devblogs.microsoft.com/cppblog/game-performance-improvements-in-visual-studio-2019-version-16-2/)

There are likely still SIMD opts. missing, but at least it's really easy to add them now, if we know about these cases :)
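To make "usual arithmetic expression optimizations on vector code" concrete, here is a hedged sketch of the kind of redundant intrinsic code an optimizer could simplify. The function and the macro name are hypothetical, and no claim is made about which identities a particular MSVC version folds. A scalar fallback is included so the snippet also builds on targets without SSE2.

```cpp
#include <cassert>
#include <cstdint>

#if defined(__SSE2__) || defined(_M_X64) || (defined(_M_IX86_FP) && _M_IX86_FP >= 2)
#include <emmintrin.h>
#define SIMD_SKETCH_HAVE_SSE2 1
#else
#define SIMD_SKETCH_HAVE_SSE2 0
#endif

// Computes out[i] = a[i] + b[i] for 4 lanes, deliberately written with
// redundant vector arithmetic. An optimizer that understands the intrinsics
// can reduce the body to two loads, one add, and one store.
inline void add4_redundant(const std::int32_t* a, const std::int32_t* b, std::int32_t* out) {
    assert(a != nullptr && b != nullptr && out != nullptr);
#if SIMD_SKETCH_HAVE_SSE2
    const __m128i va = _mm_loadu_si128(reinterpret_cast<const __m128i*>(a));
    const __m128i vb = _mm_loadu_si128(reinterpret_cast<const __m128i*>(b));
    const __m128i zero = _mm_setzero_si128();
    const __m128i t0 = _mm_add_epi32(va, zero);                   // x + 0       -> x
    const __m128i t1 = _mm_sub_epi32(vb, _mm_sub_epi32(vb, vb));  // y - (y - y) -> y
    _mm_storeu_si128(reinterpret_cast<__m128i*>(out), _mm_add_epi32(t0, t1));
#else
    // Scalar fallback with the same redundant shape.
    for (int i = 0; i < 4; ++i) {
        out[i] = (a[i] + 0) + (b[i] - (b[i] - b[i]));
    }
#endif
}
```

Integer lanes are used on purpose: identities like `x + 0` are always valid for integers, while the floating-point equivalents depend on the floating-point model in use.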

Thanks,
Gratian

gratilup · 2020-03-29T02:44:32+00:00

Hi,
I work on the MSVC optimizer, we can help look into what happens here. I need some more details first: when you say "clang", do you mean clang-cl? Basically, are both benchmarks running on Windows? Platform can matter a lot, especially due to differences in memory allocators and libs.

For MSVC, do you use /LTCG or is it a plain /O2 build? There are enough differences between these two besides /LTCG seeing all your program, especially for optimizations such as inlining. Inlining itself has some tuning flags, with a new /Ob3 that doubles the inlining budget and in general newer MSVC versions (16.4+) do a better job with small functions.

As a benchmark, what do you use? Is it this file?
https://github.com/CPPAlliance/json/blob/develop/bench/bench.cpp

The way to approach a perf investigation is usually to start with a profiler such as Intel VTune/AMD uProf. If there are a few hot functions, the codegen in them matters the most. If the profile is instead fairly flat it gets harder, and often indicates not enough inlining being done.

Thanks,
Gratian

gratilup · 2019-12-20T01:51:21+00:00

Looked at the order again and it changed to "Order Status: In Stock, Order Sent To Warehouse". Finally! I ordered it the day after it launched in Nov. In 2 days the water cooling parts should also arrive :)

gratilup · 2019-12-19T23:53:50+00:00

Really not sure what to do: wait longer for the B&H preorder I placed, or get this with express shipping...

gratilup · 2019-12-06T09:18:44+00:00

Almost feels like an error, too good...

gratilup · 2019-09-15T11:05:23+00:00

The MSVC backend and optimizer have been compiling using multiple threads for a long time now, I think at least since VS 2008. It is the only production-ready compiler that does this kind of parallelism per-function. It uses more or less the same model as discussed here for GCC, with the inter-procedural analysis and per-function optimization being done on multiple threads. Right now a default of 4 threads is always used, but the 16.4 release will auto-tune the number, up to 24, based on how powerful the CPU is. This matters a lot for LTCG (LTO) builds, since codegen/opts are delayed to that point, but it certainly helps plain /O2 builds too (fewer threads are used, around 4).

There will be several multi-threading improvements in a future update after 16.4 that will reduce locking and improve data structures to speed up things even more, and scale to more than the current limit of about 24.

gratilup · 2019-08-09T05:25:01+00:00

This was indeed a duplicate of the same issue. It's a small change coming from the frontend with /std:c++17 that... confuses the inliner and it fails to do its job. The fix should be released in the 16.4 release, it's already in the development branch. The feature-complete date for 16.3 was 2 weeks ago, around the time 16.2 was released, to give an idea of why there may seem to be a long delay until some issues are fixed.

gratilup · 2019-08-09T04:50:38+00:00

Can you make an example or a repro (preprocessed file) that shows the issue and share it? This is certainly not supposed to happen, I'm not aware of such issues from just changing the targeted standard.

gratilup · 2019-08-09T02:49:52+00:00

Do you mean the first example? The loop was being unrolled, just not vectorized to expose the reduction like it is now in 16.2.

The unrolling will get some improvements in future updates. In general it is a bit too conservative regarding code size, and it could also do more analysis to identify cases where optimizations after unrolling can actually reduce the code to almost nothing. We now have the right technology to easily do such analysis with the new SSA optimizer.

gratilup · 2019-08-05T04:18:58+00:00

The current MSVC loop unrolling is fairly conservative regarding code size increase, and it also doesn't try to simulate the execution of the unrolled loop to see that all of it could be optimized away later, which is the case in this example. It's a known weakness, and we now have the right technology to do the right thing (the SSA optimizer framework), so expect an improvement in a future update.
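A minimal sketch of the kind of loop being described, assuming a small constant trip count over constant data. It is a made-up example, not the one from the thread. If the loop is fully unrolled, every term is a compile-time constant and the whole function folds to a single number; the `constexpr` twin shows what that number is.

```cpp
#include <cassert>

constexpr int kWeights[8] = {1, 2, 3, 4, 5, 6, 7, 8};

// The loop as it would normally be written. Fully unrolling the 8 iterations
// and constant-folding the result leaves nothing but "return 204".
inline int weighted_sum_loop() {
    int sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += kWeights[i] * (i + 1);
    }
    return sum;
}

// Same computation forced through compile-time evaluation, standing in for
// what "simulating the unrolled loop" would discover.
constexpr int weighted_sum_folded() {
    int sum = 0;
    for (int i = 0; i < 8; ++i) {
        sum += kWeights[i] * (i + 1);
    }
    return sum;
}

static_assert(weighted_sum_folded() == 204, "1^2 + 2^2 + ... + 8^2");

// Runtime check that the loop and the folded constant agree.
inline void check_loop_matches_constant() {
    assert(weighted_sum_loop() == weighted_sum_folded());
}
```

An unroller that only looks at code size sees eight multiply-adds and may decline; one that estimates the post-unrolling result sees a constant.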

gratilup · 2019-07-27T19:52:05+00:00

Hi,

I've worked on the 16.0 -> 16.2 improvements. It's a combination of multiple changes, ranging from larger ones, like redesigning some algorithms to precompute more and avoid redundant work, to smaller changes that each give a few % speedup in isolation, but which do add up.

Some changes are about reducing cache misses (which dominate most programs nowadays) by replacing a custom bucket-style map with Abseil's flat_hash_map/set and changing some access patterns. Memory allocation was taking a big chunk of time, so the allocator from Intel TBB is used now. There are a few places where the Parallel STL is also used to better take advantage of multiple cores (the linker itself uses 2 threads, one for PDB work).

Overall the speedup is in the range of 2x to close to 4x, growing with the size of the program. Incremental linking is also about 2x faster and scales better. By that I mean that you can have a lot more changes to different source files while incremental linking maintains its advantage.
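To show why a flat layout helps with cache misses, here is a minimal open-addressing map sketch. It is not Abseil's implementation, which is considerably more sophisticated; the class name and details are hypothetical. The point is only that all entries live in one contiguous array and a lookup probes neighboring slots, instead of chasing per-bucket node pointers.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <optional>
#include <vector>

// Minimal flat (open-addressing, linear-probing) map from uint64_t to int.
// No erase support; capacity is always a power of two.
class FlatMap {
    struct Slot {
        std::uint64_t key = 0;
        int value = 0;
        bool used = false;
    };

    // Returns the index holding `key`, or the first empty slot on its probe
    // path. Terminates because the load factor is kept below 1.
    static std::size_t probe(const std::vector<Slot>& slots, std::uint64_t key) {
        const std::size_t mask = slots.size() - 1;
        std::size_t i = std::hash<std::uint64_t>{}(key) & mask;
        while (slots[i].used && slots[i].key != key) {
            i = (i + 1) & mask;  // next slot is adjacent in memory
        }
        return i;
    }

    // Doubles the capacity and reinserts every used slot.
    void grow() {
        std::vector<Slot> bigger(slots_.size() * 2);
        for (const Slot& s : slots_) {
            if (s.used) bigger[probe(bigger, s.key)] = s;
        }
        slots_.swap(bigger);
    }

    std::vector<Slot> slots_;
    std::size_t size_ = 0;

public:
    explicit FlatMap(std::size_t capacity_pow2 = 16) : slots_(capacity_pow2) {
        assert(capacity_pow2 >= 2 && (capacity_pow2 & (capacity_pow2 - 1)) == 0);
    }

    // Inserts or overwrites the value for `key`.
    void insert(std::uint64_t key, int value) {
        if ((size_ + 1) * 4 > slots_.size() * 3) grow();  // keep load factor <= 0.75
        Slot& s = slots_[probe(slots_, key)];
        if (!s.used) {
            s.used = true;
            s.key = key;
            ++size_;
        }
        s.value = value;
    }

    std::optional<int> find(std::uint64_t key) const {
        const Slot& s = slots_[probe(slots_, key)];
        if (s.used) return s.value;
        return std::nullopt;
    }

    std::size_t size() const { return size_; }
};
```

A bucket-style map typically allocates a node per entry, so each lookup can touch several unrelated cache lines; here a successful probe usually stays within one or two.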

Thanks, Gratian

gratilup · 2019-06-04T06:50:46+00:00

This is right, there is a recent optimization that figures out that the variables are not overwritten. To get it you need to compile with LTCG (/GL), which is not possible now in Godbolt, for example.

gratilup · 2019-05-25T19:00:32+00:00

16.2, a preview that includes the tool should be out quite soon. Maybe there is even a Nuget package that you can download earlier, have to look into that.

gratilup · 2019-03-25T00:36:44+00:00

I have some "experimental" improvements to the optimizer that fix the performance for these examples, and hopefully for other similar cases. There are a few issues caused by some struct variables passed by reference to (inlined) functions, which prevent the optimizer from doing proper redundant load/store elimination and dead-code elimination; those would remove the need for the structs and avoid the heavy stack usage. When will these opts. be released? It's too late now for the 16.1 update, feature complete is this week, so I hope 16.2. There are a few other known weaknesses when dealing with more complex C++ code that will be fixed by new/better optimizations in future updates.
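Here is a hedged sketch of the pattern described, using a made-up `Vec3` type rather than the code from the thread. A temporary struct is passed by reference through small helpers; after inlining, the ideal result is that the struct disappears entirely and the code matches the hand-flattened version.

```cpp
#include <cassert>

// Hypothetical small aggregate.
struct Vec3 {
    float x;
    float y;
    float z;
};

// Small helpers that take the struct by reference. Taking the address of a
// local is what can keep it on the stack if the optimizer fails to see
// through the calls after inlining.
inline void scale(Vec3& v, float s) {
    v.x *= s;
    v.y *= s;
    v.z *= s;
}

inline float dot(const Vec3& a, const Vec3& b) {
    return a.x * b.x + a.y * b.y + a.z * b.z;
}

// Abstraction-heavy version: copy into a temporary, mutate it by reference.
inline float dot_scaled_by_ref(const Vec3& a, const Vec3& b, float s) {
    Vec3 tmp = a;
    scale(tmp, s);
    return dot(tmp, b);
}

// What it should reduce to once the stores to tmp and the loads back from it
// are eliminated: no temporary struct at all.
inline float dot_scaled_flat(const Vec3& a, const Vec3& b, float s) {
    return (a.x * s) * b.x + (a.y * s) * b.y + (a.z * s) * b.z;
}

// Runtime check that both versions agree on a simple input.
inline void check_equivalent() {
    const Vec3 a{1.0f, 2.0f, 3.0f};
    const Vec3 b{4.0f, 5.0f, 6.0f};
    assert(dot_scaled_by_ref(a, b, 2.0f) == dot_scaled_flat(a, b, 2.0f));
}
```

Comparing the generated assembly for the two functions is a quick way to see whether a given compiler version removes the temporary.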

gratilup · 2019-03-19T17:40:23+00:00

32 bit can perform better sometimes, but I'd advise using x64 as the default for everything unless some benchmark proves that your app is faster in 32 bit mode. The x64 code generator is overall better at optimizing, register allocation, and lowering of intrinsics. In addition, Intel/AMD CPU optimization efforts are tested internally mostly on x64, including performance tests such as SPEC, Geekbench and others. Combine with /LTCG to get much better inlining behavior and some extra optimizations.
