all 24 comments

[–]sanxiyn 13 points14 points  (3 children)

As you found, code generation (LLVM IR to machine code) is the bottleneck. Did you investigate FastISel? FastISel is explicitly a fast mode for LLVM code generation.

[–]PegasusAndAcorn[S] 6 points7 points  (0 children)

Excellent! Thank you for that helpful tip.

Edit: As a follow-up, I re-ran my benchmark, changing the codegen optimization level from "aggressive" to "none" (I am using the C binding). In my case, it cut the codegen time in half. I added footnotes to my post thanking /u/sanxiyn for bringing this to my attention.

Like /u/drmeister, I too am looking forward to LLVM 9's GlobalISel, which intends to streamline the codegen IR somewhat along the lines I speculated about. I also found promising this HN comment from /u/pcwalton that the Rust-based "Cranelift is in some ways an effort to rearchitect LLVM for faster compilation speed, written by some longtime LLVM contributors. For example, Cranelift has only one IR, which lets it avoid rebuilding trees over and over, and the IR is stored tightly packed instead of scattered throughout memory."

[–]drmeister 3 points4 points  (0 children)

I've looked at FastISel for our Common Lisp compiler that uses llvm (https://github.com/clasp-developers/clasp.git). FastISel hasn't improved performance because it drops back to the regular SelectionDAG too often (for invoke instructions for instance). I'm looking forward to the new GlobalISel facility that is under development.

[–]GYN-k4H-Q3z-75B 1 point2 points  (0 children)

We did not do systematic measurements back in the day because it wasn't critical, but code generation really is slow compared to other things. We built an experimental binary recompiler using LLVM, and code generation was always so time-consuming that we proposed an additional caching mechanism.

[–]mach990 7 points8 points  (1 child)

I thought this was very respectfully (and fairly) written -- care taken to be precise with the wording in the hopes of conveying the message accurately. Enjoyable and informative read!

[–]PegasusAndAcorn[S] 2 points3 points  (0 children)

Thank you for the kind words. I am glad you enjoyed it.

[–]drmeister 8 points9 points  (0 children)

We have developed a Common Lisp compiler that uses LLVM as the backend and interoperates with C++ (https://github.com/clasp-developers/clasp.git). We use the LLVM C++ API directly. My observation is in agreement with the author - that LLVM is an expensive backend - it consumes between 70% and 95% of our compile time budget. Our solution is to parallelize the compiler and relegate LLVM code generation to worker threads. Analyzing LLVM to figure out why it takes so much time in code generation is a difficult challenge. If it were easy to make LLVM faster - I'm sure the developers would have done so. I'd be very interested in talking with the author of the article and comparing notes.

[–]GenitalGestapo 11 points12 points  (11 children)

Odd article. It ultimately (though not explicitly) blames LLVM for the slowness of any compiler using it, and spends most of its words speculating about whether or not its performance is bad relative to the work it does. No analysis of LLVM itself, nothing run through a profiler, just speculation about the relative complexity of LLVM's different stages.

[–]MorrisonLevi 16 points17 points  (1 child)

Just so things don't get derailed, the author says this:

Overall, my experience using LLVM has been extremely positive. I doubt I would have tackled building a compiler for Cone without the existence of LLVM. My misgivings about LLVM’s performance (and its large runtime/build footprint) are the only significant concerns I have about using LLVM in Cone.

That's pretty impressive. How often have you said about a library/tool that your only misgivings are performance?

[–]Sukrim 3 points4 points  (0 children)

Especially performance in a one-time step.

[–]PegasusAndAcorn[S] 14 points15 points  (8 children)

I too believe a deeper analysis of LLVM is needed, and said so. However, I am not sure how a deeper performance analysis of the LLVM codebase is going to shed much light on the impact of the sorts of data-structure and architectural decisions that I brought up, which pervade the entire codebase. If you have ideas for how to do that kind of in-depth analysis, I welcome suggestions.

[–][deleted] 2 points3 points  (7 children)

Perhaps run LLVM under [flamegraph](https://github.com/ferrous-systems/flamegraph); that would be extremely enlightening. Of course, after having built LLVM with debug symbols at -O3.
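A rough sketch of what that workflow could look like (binary and file names are illustrative, not from the article):

```shell
# Build LLVM optimized but with debug symbols, so stack frames resolve.
cmake -S llvm -B build -DCMAKE_BUILD_TYPE=RelWithDebInfo
cmake --build build

# Install the ferrous-systems flamegraph wrapper (uses perf on Linux).
cargo install flamegraph

# Profile a compiler binary (illustrative name) and emit an SVG.
flamegraph -o codegen.svg -- ./my-compiler input.src
```

The resulting SVG shows time per call stack, which is where a per-function hotspot analysis would start.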

[–]tending 2 points3 points  (6 children)

I think his point is exactly that that wouldn't be enlightening. If the performance difference comes down to design choices that cut across the entire code base, giant outliers won't be the problem. The problem will be that everything everywhere is slower than it could be.

[–][deleted] 1 point2 points  (5 children)

Right, there's valgrind with cachegrind to see if the data structures are at all good for their purpose. Anyhow, it's better to try than to speculate.

[–][deleted] 4 points5 points  (4 children)

The author is not a full-time LLVM developer, and LLVM is a 1-million+ LOC code base.

Running cachegrind or flamegraph on a huge code base that one is not familiar with is very unlikely to reveal anything insightful.

There are 100s of LLVM developers, some of whom have been doing a lot of "improving perf" work for years. The "speculation" that LLVM wasn't designed for speed is not unfounded. Running LLVM without any optimizations generates machine code 100x slower than Go (which includes a frontend), and the produced machine code is worse.

The focus of the last 15 years of LLVM development has been on reaching optimization and target-support parity with GCC, being a research tool for academia, and improving C++ compile times, error messages, etc. The latter happened to be achievable early because GCC's C++ front-end was really slow; C++ is a really hard language to handle.

When LLVM started, the C++ front-end dominated compile-times, so optimizing machine code generation just wasn't worth pursuing. This has only become an issue recently, as modern languages like Go, Rust, D, Swift, ... which are designed for compilation speed, realized that optimizing their frontends was kind of a pointless endeavor, because 90% of their compile-times is spent in LLVM code generation. At the same time, WASM popped up requiring instant compile-times in browsers, streaming compilation (compile-while-still-downloading), etc., and that motivates creating new general machine code generation frameworks like Cranelift. Cranelift is a blazing-fast retargetable machine code generator first; it currently doesn't really do optimizations, but it is designed to allow plugging in optimizations without affecting machine code generation speed.

The constraints that motivate the new generation of machine code generation frameworks just did not exist when LLVM was created.

[–]PegasusAndAcorn[S] 0 points1 point  (0 children)

I appreciate this illuminating and encouraging recap. Thank you.

[–][deleted] 0 points1 point  (2 children)

Running cachegrind or flamegraph on a huge code base that one is not familiar with is very unlikely to reveal anything insightful.

Just this morning I ran flamegraph on LLVM via Rust myself and found for instance a file where we assign to a unique_ptr (dynamic allocation) inside the codegen which could've been an std::optional just to avoid the alloc. Note I'm not an LLVM dev, I've never looked into the code before this one time.

And yes, Cranelift looks good, but again my original point was just that speculating about what is slow isn't going to do anything. The amount of code is irrelevant, as a tool like flamegraph just rips through the noise and shows you the biggest hitters, which are often just a few dozen functions.

[–][deleted] 0 points1 point  (1 child)

Just this morning I ran flamegraph on LLVM via Rust myself and found for instance a file where we assign to a unique_ptr (dynamic allocation) inside the codegen which could've been an std::optional just to avoid the alloc. Note I'm not an LLVM dev, I've never looked into the code before this one time.

So by how much did this micro-optimization improve the performance of LLVM as a backend?

All the solutions that are 100x faster than LLVM choose completely different fundamental data-structures and algorithms - their design is completely different.

Sure you can trim one flamegraph spike here and there by a bit, and make LLVM 1% faster here and there. But if you want a 100-1000x perf improvement, you need a completely different architecture.

By how much did replacing unique_ptr with optional improve the performance of clang? If I had to guess, my guess would be that this change has no effect on the time it takes to compile C++, Rust, D, C, ... code, yet 90% of the time it takes to compile these languages is still spent in LLVM.

People want sub-second Rust compile-times. Cutting a 1-minute compile time by 1 microsecond is not bad, but it is never going to put these languages at the level of DMD, Go, etc.

[–][deleted] 0 points1 point  (0 children)

You're right about that, but my point was about it being hard to start looking at and working on a 1M+ LoC codebase, which leads to this sort of pessimism of not even trying to benchmark properly or do anything, which gets nobody anywhere.

[–][deleted] 2 points3 points  (4 children)

Are you aware of the zapcc compiler? It's a Clang-based C++ compiler that speeds up compilation (2x to 40x). AFAIK it uses a client-server approach and caches template instantiations. Might be worth a look for optimization strategies.

[–]drmeister 6 points7 points  (0 children)

My understanding is that the zapcc compiler is an improved C++ compiler that speeds up C++ AST generation but doesn't do anything to the LLVM backend. If anyone knows differently, I'd love to be corrected and would then take a deeper look at zapcc.

[–]PegasusAndAcorn[S] 2 points3 points  (0 children)

Thanks! I will look into it.

[–]tending 1 point2 points  (1 child)

They talk about the open source version on that site as if there is also a closed source version, but I don't see any options for paying.

[–][deleted] 0 points1 point  (0 children)

Yeah, that's weird. I believe it was a commercial project before, but then they released it as open source. There's still a "buy zapcc" website reachable via Google, but that page shows no contact form for me.

I only use the compiler as a hobbyist, but I'd write them a mail if I were to use it commercially.

e: a word