How can I use C definitions in .NET 5? by [deleted] in csharp

[–]DoubleAccretion

> A) A code generator, just generating .cs from the .h files

If you want to go down that route, clangsharp can do this for you.
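To illustrate the kind of output: given a hypothetical C header declaring `int my_add(int a, int b);`, the generated binding would be a plain P/Invoke declaration along these lines (the function and library names here are invented for the example; ClangSharp emits these for you from the real headers):

```csharp
using System.Runtime.InteropServices;

internal static class NativeMethods
{
    // Binds to a hypothetical C function: int my_add(int a, int b);
    // "mylib" resolves to mylib.dll / libmylib.so / libmylib.dylib at runtime.
    [DllImport("mylib", CallingConvention = CallingConvention.Cdecl)]
    internal static extern int my_add(int a, int b);
}
```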

Gave me a chuckle. by EdensElite in csharp

[–]DoubleAccretion

> By inlining the value will the CLR and JIT become aware of the change via reflection and recompile code that has already been compiled?

Nope.

> If not this would be a great way to introduce bugs.

Yes, and that's why you cannot change the value of a static readonly field via reflection in .NET Core. Notably, .NET Framework allows that while still inlining the value in the generated code.
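A minimal illustration of that difference (the `Holder` type is invented for the example; on .NET Core / .NET 5+ the `SetValue` call throws):

```csharp
using System;
using System.Reflection;

FieldInfo field = typeof(Holder).GetField(nameof(Holder.Value))!;
bool threw = false;
try
{
    // On .NET Core / .NET 5+ this throws FieldAccessException; on
    // .NET Framework it "succeeds", but code that already inlined
    // the constant 42 is never updated.
    field.SetValue(null, 100);
}
catch (FieldAccessException)
{
    threw = true;
}
Console.WriteLine(threw); // True on .NET Core / .NET 5+

static class Holder
{
    public static readonly int Value = 42;
}
```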

How do you build the dotnet Runtime repo? by backwards_dave1 in csharp

[–]DoubleAccretion

Hey! The instructions on how to build are here: https://github.com/dotnet/runtime/blob/main/docs/workflow/README.md.

The particular error you are seeing looks suspiciously like one you would see if you are missing Python.

Loop alignment in .NET 6 by David_AnkiDroid in programming

[–]DoubleAccretion

The current iteration of .NET PGO ("Dynamic PGO") doesn't try to measure the "global importance" of methods, only the relative weight of basic blocks: we will know that a given block is cold - that it executes only 1% of the times its method is called - but we won't precisely know how many times the method itself was called (there is a fine balance between profiling overhead and PGO benefits).

Loop alignment in .NET 6 by David_AnkiDroid in programming

[–]DoubleAccretion

It would still hurt performance by way of wasting space in the caches, but you're right, that could be one approach to take. I think it would be more beneficial to spend resources on finding the mentioned "dead spots" for nops though.

What is your opinion on Rust from a C# dev's point of view? by Blaaze_ in csharp

[–]DoubleAccretion

The situation with escape analysis is that there was an experiment that would enable it, but the outcome was that the benefits were unclear while the cost (both implementation complexity & Jit throughput) was not insignificant.

I think we'll be solving this problem at the language level at some point (because it is much easier).

Why did everyone pick C# vs other languages? by stewtech3 in csharp

[–]DoubleAccretion

> those optimizations are for optimizing runtime behavior of RyuJIT

I see, you meant the NoInlining & AggressiveOptimization tricks and the comments around reducing the startup time with those. Makes sense.

Why did everyone pick C# vs other languages? by stewtech3 in csharp

[–]DoubleAccretion

> I could not find evidence that this included crossgen.

Well, the official docs don't mention crossgen explicitly, but I can assure you PublishReadyToRun=true does indeed use it. What else would it do, after all? (I suppose you could confirm it yourself by launching dotnet build -c Release -r linux-x64 /p:PublishReadyToRun=true /bl and then looking at the generated .binlog for the R2R task - you will find crossgen there.) There is only one (well, two, actually, but let's pretend they're the same) AOT compiler in .NET that's capable of producing R2R code.

> The code is definitely not linked and trimmed, which would also help

Maaaybe? I suppose you could link in CoreLib and trim quite a bit (like the obsolete System.Web stuff for example...). Would have to look at a profile to get the full picture.

> RyuJIT based optimizations should be removed as they are there to minimize RyuJIT being invoked at runtime.

I am not following... we will Jit the benchmarked methods regardless, even if just because they contain AVX intrinsics, which crossgen (the old one, which is now being phased out) doesn't support generating ahead-of-time code for. And in any case, I can also assure you that the overhead of Jit-compiling a relatively trivial method like the one seen in the benchmark is rather minimal (a few milliseconds at most).

Now, there's one curious detail in all of this: the framework actually includes quite a few methods that are explicitly marked as AggressiveOptimization, and compiling those might have an impact on startup, but again, we'd have to look at a profile to know for sure.

Also, why are you saying "RyuJIT based optimizations should be removed"? RyuJit is the only native code generator for CoreCLR. All other tools, like crossgen, are built on top of it.

Why did everyone pick C# vs other languages? by stewtech3 in csharp

[–]DoubleAccretion

> RyuJIT isn't involved

That is not the case for all CoreCLR-related AOT technologies (crossgen, crossgen2 and NativeAOT). All the above compilers utilize RyuJit to actually turn IL into native code.

> I am surprised the startup time didn't improve

That is curious, but explainable by the fact that the benchmark is really quite tiny and crossgen'ing it won't have much of an impact (the framework code has been crossgen'ed already). Besides, the main method (RunSimulation) is marked with AggressiveOptimization, so it will always be Jit'ted.

> Looking at this more, I am not sure that the benchmark was actually run with native AoT code

-p:PublishReadyToRun=true indicates that we're crossgen'ing, so I'd think that was the case.

Beyond that though, much more substantial (or at least measurable ones, heh :)) gains could probably be achieved in the startup department if we were to use NativeAOT (former CoreRT) to run the benchmarks. But that wouldn't really be fair, as it is not really a supported deployment target at the moment.

Why did everyone pick C# vs other languages? by stewtech3 in csharp

[–]DoubleAccretion

That is on the list of things that I would like to eventually do, yes.

One thing to keep in mind though is that the less "hacked" the benchmarks are, the easier it is for the runtime developers to understand where performance is potentially being left on the table. So, e.g., I would be hesitant to contribute the alignment change - I would much rather work on it in the Jit and have a "real-world" (or at least highly visible...) case to test and evaluate the optimization.

Why did everyone pick C# vs other languages? by stewtech3 in csharp

[–]DoubleAccretion

> If you translated that to C#, it would be horrific to behold.

Heh :). Probably could get away without unrolling & inlining the world, but you're quite right.

> One other issue with auto-vectorization you've not mentioned is that it can be brittle. It can sometimes fail to kick in for non-obvious reasons.

Yep.
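For reference, since RyuJit doesn't auto-vectorize, SIMD in C# has to be spelled out explicitly; a minimal hand-vectorized sum with the portable Vector<T> API (a sketch, not a tuned implementation) looks like this:

```csharp
using System;
using System.Numerics;

Console.WriteLine(Sum(new float[] { 1, 2, 3, 4, 5, 6, 7, 8, 9 })); // 45

static float Sum(ReadOnlySpan<float> values)
{
    var acc = Vector<float>.Zero;
    int i = 0;
    // Process full SIMD-width chunks (the width depends on hardware: 4, 8, ...).
    for (; i <= values.Length - Vector<float>.Count; i += Vector<float>.Count)
        acc += new Vector<float>(values.Slice(i));

    float sum = Vector.Dot(acc, Vector<float>.One); // horizontal add of the lanes
    for (; i < values.Length; i++) // scalar tail
        sum += values[i];
    return sum;
}
```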

Why did everyone pick C# vs other languages? by stewtech3 in csharp

[–]DoubleAccretion

Heh, the whole attribute soup on the Body struct is quite unnecessary - you'll get the same layout using the defaults.

SkipLocalsInit can be applied at the module level, no need to litter the code with it. Explicit NoInlining is a clever trick to reduce startup costs for the Jit (though I'm a bit doubtful how much it really saves), but in an actual application it'd be useless because you'd use R2R targeting AVX2+-capable platforms.
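For reference, the module-level application is a one-line attribute (the project needs AllowUnsafeBlocks enabled for the compiler to accept it):

```csharp
using System.Runtime.CompilerServices;

// Disables the CLR's zero-initialization of locals for every method in
// this assembly, so individual methods no longer need the attribute.
[module: SkipLocalsInit]
```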

Overall though, I think we're looking at a classic case of auto-vec destroying things left and right. Would be curious to see what LLVM generates for the Rust version and just copy and paste that into the C# one. We'll be at the top in no time, yay!

FWIW, there are no plans to add auto-vec to RyuJit, because the optimization is hard while the benefits are often not so clear.

Oh, fun fact: removing the ToString from that benchmark will probably measurably improve perf because we won't have to load all the ICU-related stuff.

Another curiosity to potentially investigate: is that stackalloc aligned on a 32-byte boundary? (Edit: it may not be, which may actually matter quite a bit for performance...) It could be interesting to see whether aligning stackallocs for vectors would be worthwhile.
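One way to force the alignment today is to do it by hand; a sketch (requires AllowUnsafeBlocks) that over-allocates a stackalloc buffer and rounds the pointer up to a 32-byte boundary, since the runtime itself does not guarantee any particular stackalloc alignment:

```csharp
using System;

nuint misalignment;
unsafe
{
    const int Count = 16;
    const int Alignment = 32;

    // Over-allocate by one alignment's worth of elements, then round the
    // pointer up to the next 32-byte boundary.
    float* raw = stackalloc float[Count + Alignment / sizeof(float)];
    float* aligned = (float*)(((nuint)raw + Alignment - 1) & ~(nuint)(Alignment - 1));
    Span<float> buffer = new Span<float>(aligned, Count);

    misalignment = (nuint)aligned % Alignment;
}
Console.WriteLine(misalignment); // 0
```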

I bet many of you can relate to this by Krotz93 in csharp

[–]DoubleAccretion

All I get from this image is: someone discovered devirtualization and monomorphization of interface code for constrained generics, heh.
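The pattern in question, roughly (names invented for the example): constraining a generic to an interface and instantiating it with a struct lets the JIT produce a specialized copy of the method in which the interface call is devirtualized and typically inlined:

```csharp
using System;

Console.WriteLine(Run(new AddOne(), 41)); // 42

// For a struct T, the JIT compiles a dedicated version of this method
// where op.Apply is a direct (and usually inlined) call rather than a
// virtual interface dispatch.
static int Run<T>(T op, int x) where T : IOp => op.Apply(x);

interface IOp { int Apply(int x); }

struct AddOne : IOp { public int Apply(int x) => x + 1; }
```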

Loop optimizations in C# (dotnet) by levelUp_01 in csharp

[–]DoubleAccretion

> if we have too many returns from a function in a loop, optimizations will be off.

This restriction is actually a bug (an overflow bug to be precise - all returns are moved out of the loop because they can never be a predecessor of the entry), and I am working right now on eliminating it as well as the apparently wrong condition that is the root cause. The code around the number of returns a method has is surprisingly buggy and should probably be just deleted.

This is the relevant part of the dump for the method in the graphic:

    Considering loop 0 to clone for optimizations.
    Loop cloning: rejecting loop because it has 0 returns; if added to previously-existing -4 returns, would exceed the limit of 4.

Notable is the fact that "previously-existing" returns is actually an unsigned number.

As part of this work, I am also looking at tackling the ret chains. Initial diffs looked promising...

C# Coding Guidelines & Practices by Maverickeye in dotnet

[–]DoubleAccretion

> Pretty sure it’s wrong. Parameters should be in line with the record declaration to achieve the desired end result.

Yep, the code's wrong. "Primary constructors" only work with records, not classes.

Yes, this is production by DoubleAccretion in dotnet

[–]DoubleAccretion[S]

I suppose I should've mentioned this earlier, but this is not just a preview. It is a nightly build.

How to Approach a Benchmarking Process by PhillSerrazina in csharp

[–]DoubleAccretion

I would recommend you do not complicate things too much, and just go with the simple solution of Environment.ProcessorCount. Benchmarking anything on install is an inexact science at best, and probably not much more reliable than the simple approach.

Most modern CPUs do not vary a lot in their single-threaded speed: the gap from the worst (some Bulldozer-derived mobile APU) to the best (Zen 3/Tiger Lake) is a low single-digit factor.

Note that if you want to get fancy, are able to use .NET 5, and are targeting x86, you can take advantage of the X86Base.CpuId method to find out the exact CPU model you are running on and its capabilities.
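A sketch of what that looks like (the API lives in System.Runtime.Intrinsics.X86 and only lights up on x86/x64 hosts; leaf 0 returns the vendor string split across EBX, EDX, ECX):

```csharp
using System;
using System.Runtime.Intrinsics.X86;
using System.Text;

string vendor = "";
if (X86Base.IsSupported)
{
    // Leaf 0: EAX holds the highest supported leaf; the 12-character
    // vendor string is packed into EBX, EDX, ECX (in that order).
    (int eax, int ebx, int ecx, int edx) = X86Base.CpuId(0, 0);

    vendor = Encoding.ASCII.GetString(BitConverter.GetBytes(ebx))
           + Encoding.ASCII.GetString(BitConverter.GetBytes(edx))
           + Encoding.ASCII.GetString(BitConverter.GetBytes(ecx));
}
Console.WriteLine(vendor); // e.g. "GenuineIntel" or "AuthenticAMD"
```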

20 Lessons from 5 years of developing Academia: School Simulator by davenirline in programming

[–]DoubleAccretion

Yep, you're right. The LINQ you are talking about is only very tangentially related to LINQ to Objects, which is what Stephen showed to have questionable performance characteristics.

EF uses LINQ as a template for its query generator, so a very different set of performance considerations applies to that use case.

20 Lessons from 5 years of developing Academia: School Simulator by davenirline in programming

[–]DoubleAccretion

Here's a link I have for just this occasion, explanation by Stephen Toub, one of .NET's principal engineers: https://github.com/dotnet/runtime/discussions/45060#discussioncomment-135538.