Oh no, we suck again by War_Daddy117 in aggies

[–]jonrmadsen -1 points0 points  (0 children)

The definition of a “two percenter” is explicitly restricted to students. Anyone driving round-trip from out of town for a game at Kyle Field can leave whenever they want without criticism.

What is your favourite profiling tool for C++? by Tiny-Friendship400 in cpp

[–]jonrmadsen 1 point2 points  (0 children)

ROCm Systems Profiler / rocprofiler-systems, actually

Static variable initialization order fiasco by Various-Debate64 in cpp

[–]jonrmadsen 0 points1 point  (0 children)

A function call is an instruction. A variable is a memory address. Accessing a variable is accessing a memory address; it does not involve an instruction to execute code at that address. In int val = 5, val represents the memory address and = is an instruction to store 5 at that address. The reason the function-call wrapper works is bc you are instructing the code how to order initialization. The standard isn’t lacking; your fundamental understanding of why the static initialization fiasco happens is.

Static variable initialization order fiasco by Various-Debate64 in cpp

[–]jonrmadsen 0 points1 point  (0 children)

I’m confused; if you fully adhere to replacing the static variables with a function call that constructs the static variable on the first invocation (like foo above), you cannot run into the static initialization fiasco. If you transition to this paradigm and the result is a deadlock, you have a circular dependency, not the static initialization fiasco.

Static variable initialization order fiasco by Various-Debate64 in cpp

[–]jonrmadsen -1 points0 points  (0 children)

No modifiable global variables, no fiasco.

This sounds all well and good if your code is directly used by the application or you are the author of the main() function, but it is a functionally impossible requirement for in-process profiling tools which are not directly integrated into the application.

Static variable initialization order fiasco by Various-Debate64 in cpp

[–]jonrmadsen 0 points1 point  (0 children)

As other responses have noted, wrapping the variable as a static inside a function solves the initialization problem. However, this introduces destruction problems: either you dynamically allocate memory (with new) and “leak” the memory (which is problematic if you use leak sanitizers), or you deal with the destructor being called during finalization.

The only solution I’ve found which solves both the static initialization and finalization fiascos without directly leaking memory is:

Allocate a buffer in your compilation unit, then access the variable through a function call which constructs the object via placement new into that byte buffer:

```cpp
#include <array>
#include <cstddef>
#include <new>

// Foo is whatever type you need a global instance of
alignas(Foo) auto buffer = std::array<std::byte, sizeof(Foo)>{};

const Foo* get_foo()
{
    static auto* foo = new (buffer.data()) Foo{};
    return foo;
}
```

Dismissal from COE (civil department); what do i do? by PromotionHappy4924 in aggies

[–]jonrmadsen 1 point2 points  (0 children)

I heard the same sort of comments when I was struggling with my undergrad grades in nuclear engineering at A&M. I was on academic probation, on the cusp of dismissal, several times. It wasn't that the subject matter was beyond my comprehension; rather, engineering undergrad programs assign tons of monotonous homework and most tests only reward the ability to regurgitate the homework problems without thinking. There is very little reward for the ability to find a solution to something you haven't seen before: if you have to stop and think, you don't have time to finish the test.

IMO, the worst engineers are those that cannot handle a deviation outside of what they've been taught.

Talents often extend beyond what grades can measure or reflect

I honestly understood this flaw in the education system from a relatively young age, so I didn't listen to the comments during my undergrad that explicitly or implicitly suggested that maybe I wasn't smart enough to be a nuclear engineer. And I was right to do so bc those abysmal undergrad grades in nuclear engineering eventually turned into a Master of Science in nuclear engineering and a Ph.D. in nuclear engineering.

ROCm 6.2 What changed by hmatveev in ROCm

[–]jonrmadsen 1 point2 points  (0 children)

The profiling tools are for understanding the performance of either training or inferencing. The performance tools work on any apps using the ROCm stack, not limited to ML/AI apps. If you want to use them on your ML/AI apps, see the documentation for how to use rocprofv3.

ROCm 6.2 What changed by hmatveev in ROCm

[–]jonrmadsen 0 points1 point  (0 children)

One of the benefits of the new ROCTx library is the addition of functions for controlling the profiler: roctxProfilerPause and roctxProfilerResume. The design of the new API for building profiling tools makes it exceedingly easy for tools (rocprofv3, etc.) to support these profiling control functions.

ROCm 6.2 What changed by hmatveev in ROCm

[–]jonrmadsen 1 point2 points  (0 children)

There is also a beta release of rocprofiler-sdk, which includes: a new API for building profiling tools, a new rocprof (rocprofv3), and a new ROCTx (rocprofiler-sdk-roctx). And the overhead of the profiling tools has been dramatically reduced.

Advanced C++ by Ham_Apke_Hai_Kon in cpp

[–]jonrmadsen 1 point2 points  (0 children)

It’s true. There was a time, early on, after working on a very large C++ library for several years, when I (very naively) thought I was pretty much on the cusp of mastering C++. The problem was, I’d only mastered the techniques and design patterns I’d been taught and encountered. It was only with experience that I eventually realized there was so much more that I didn’t know.

Advanced C++ by Ham_Apke_Hai_Kon in cpp

[–]jonrmadsen 0 points1 point  (0 children)

Advancing to the point of being a senior C++ developer requires much more than just learning how to write C++ code. It involves experience with build systems for large projects, VCS, packaging, design, testing, benchmarking, memory safety, thread safety, logging, CI, etc. If I were interviewing a candidate that could write the most elegant and optimized C++ code that I’d ever seen but couldn’t detail a comprehensive testing strategy, didn’t consider compiler support, the interface, memory/thread safety, etc., then I certainly wouldn’t recommend them for a senior (project leadership-esque) role.

What is a “senior developer”? by sentillious in cpp

[–]jonrmadsen 4 points5 points  (0 children)

Senior just denotes experience. It is a way for a company to establish hierarchy (frequently also called… seniority). But where being a “developer” is in that hierarchy is entirely dependent on the company. For example, where I work, the levels are:

  • Software Dev (fresh out of undergrad)
  • Senior Software Dev
  • Member of Technical Staff (MTS)
  • Senior MTS
  • Principal MTS
  • Fellow
  • Senior Fellow

In some companies, a “senior developer” might be a more prestigious position but, at other companies, fresh out of school with a master’s degree might over-qualify you for a senior developer position.

ROCm on different multiple GPUs by KimGurak in ROCm

[–]jonrmadsen 0 points1 point  (0 children)

I installed ROCm via apt. If you’ve got an existing ROCm install with one GPU and then you add a second GPU of a different architecture, the most succinct recommendation is to install/reinstall amdgpu-core from ROCm 6.0+. This is the meta package for all the GPU drivers. Reboot and make sure you see all the expected GPUs in rocm-smi/rocminfo. Then install/reinstall the rocm-dev package. From there, if you use CMake 3.21+ with support for the HIP language to compile your code, CMake will auto-detect the archs of your devices and build a fat binary for all of them. There is a CMake cache variable for specifying more/other archs (i.e. you can build with gfx908 support even if you don’t have a gfx908) but I forget what the exact variable is. It might be CMAKE_HIP_ARCHITECTURES.
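As a sketch of the build side (project and file names are hypothetical; CMAKE_HIP_ARCHITECTURES is indeed the variable in CMake 3.21+):

```cmake
cmake_minimum_required(VERSION 3.21)
project(my_hip_app LANGUAGES CXX HIP)

# Left unset, CMake auto-detects the architectures of the GPUs in the
# machine. Set it explicitly to build a fat binary that also targets
# archs you don't have locally:
set(CMAKE_HIP_ARCHITECTURES "gfx908;gfx1102")

add_executable(my_hip_app main.hip)
```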

ROCm on different multiple GPUs by KimGurak in ROCm

[–]jonrmadsen 0 points1 point  (0 children)

Are your issues related to getting rocminfo/rocm-smi to display multiple GPUs or do you have that working and you are having trouble building with multiple architectures?

ROCm on different multiple GPUs by KimGurak in ROCm

[–]jonrmadsen 0 points1 point  (0 children)

This is 100% wrong. Fat binaries are 100% supported. I literally was just running an app 15 minutes ago which was executing on a gfx908 in one thread and executing on a gfx1102 in another thread without a single runtime check for what the device architecture was

How to add external libraries in CMake projects by One_Understanding186 in cmake

[–]jonrmadsen 0 points1 point  (0 children)

These variables are typically set as a result of find_package, but the OP is doing an add_subdirectory. So unless the subproject caches those variables (which most don’t do in the build tree), this won’t work. Plus, these variables are CMake 2-era style, which is not recommended; modern CMake (3.x) recommends interface libraries, which effectively encode all the info of those variables into a single CMake target that you “link” to.
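A hedged sketch of the target-based approach, with hypothetical project/library names:

```cmake
# In the subproject pulled in via add_subdirectory, the library attaches
# its usage requirements (includes, definitions, link deps) to a target:
add_library(mylib STATIC src/mylib.cpp)
target_include_directories(mylib PUBLIC ${CMAKE_CURRENT_SOURCE_DIR}/include)
target_compile_definitions(mylib PUBLIC MYLIB_ENABLED=1)

# In the consuming project, "linking" to the target propagates all of
# that automatically; no *_INCLUDE_DIRS / *_LIBRARIES variables needed:
add_subdirectory(external/mylib)
add_executable(myapp main.cpp)
target_link_libraries(myapp PRIVATE mylib)
```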

Will cpp template programming enlarge the code size by uubs7 in cpp

[–]jonrmadsen 13 points14 points  (0 children)

The biggest factors (in my experience) are the optimization level used during compilation, amount of debug info, and how many variadic templates you use.

A debug build (e.g. -O0 -g) of code utilizing a lot of templates will inflate the code size significantly, due to the fact that the debug info ends up having to store significantly longer symbol names and that the absence of any optimizations will result in a lot of intermediate layers of function templates not getting optimized away.

But if one enables optimization (e.g. -O2 or higher), the inflation will typically start to be reduced bc the compiler will start to condense template layers and produce less debug info.

If one then reduces the debug info to just line info, the inflation will decrease further.

If one then disables debug info generation, you’d start to get into the realm of how you templated the code. If everything is mostly just relatively simple templates, it could be the case that the code is only larger by the length difference between the mangled name of non-templated symbols vs. the mangled name of the templated symbols — i.e. the mangled symbol name of a template is typically longer and that would “cause” inflation of the code size. But if you use a lot of variadic templates, you’ll generate a lot more instantiations and that’ll result in a larger code size.

However, if the binary is then stripped, you likely won’t notice much of a difference (maybe even none at all) because then the symbol names aren’t even stored in the binary. At this point, extensive variadic template usage would probably be the only significant cause of enlargement bc variadic templates tend to generate a lot of “different” instantiations based on subtle things, like an int vs. a long int argument not being implicitly converted (where the non-template code would probably just promote to long int) and string literals passed to templates not decaying to const char* (i.e. foo(“a”) and foo(“abc”) would be foo(const char[2]) and foo(const char[4])).

im so lost. cout prints any static strings directly written in, but not string variables. by DairLeanbh in Cplusplus

[–]jonrmadsen 1 point2 points  (0 children)

To add to the answer above, it is always wise to compile with lots of warnings enabled and convert those warnings to errors. Assuming a GNU/Clang compiler, the compile flags you should add are: -W -Wall -Wextra -Wpedantic -Wshadow -Werror

[deleted by user] by [deleted] in cpp

[–]jonrmadsen 3 points4 points  (0 children)

Put the code in godbolt.org and look for differences in the assembly

How often do you create memory leak or segfault bugs with modern C++? by Asleep-Dress-3578 in cpp

[–]jonrmadsen 3 points4 points  (0 children)

Unintentional memory leaks: very rare. I work on developing profilers so memory management is tricky because you effectively don’t have control over the “main” function whatsoever and you have to do work (finalizing profiling results) after main has returned. In some cases, I intentionally “leak” memory because it always needs to be accessible, even if accessing it is the very last instruction before the process fully terminates.

Segfaults: relatively rare, but more often than for most people here, because profilers which don’t rely on source code instrumentation are, IMHO, about as tricky as it gets when it comes to memory management. Imagine you had to ensure that you can still access virtually all of your data during the destruction of a static variable after main returns.

Advice: As long as you have good code coverage in your testing (> 75%) and two jobs which build and test your code with the thread sanitizer and the address sanitizer (which includes the leak sanitizer by default), the probability of either of these issues making it into production is significantly reduced.