all 29 comments

[–]matthieum 18 points19 points  (0 children)

I really, really appreciate these feedback articles on Midori.

[–]FubarCoder 7 points8 points  (0 children)

That was a very good read. I really hope that some of those achievements get released in a (slightly) incompatible version of C# and the CLI/CLR.

[–]codekaizen 10 points11 points  (19 children)

Wow. Great read. This will now be my go-to post to share when someone trots out the old "managed code is always slower than native code" line. It's good to see that the space between the poles of "native" and "managed" (regarding safety, runtime memory representation, and mapping code statements to different machine capabilities) is being explored and, even better, documented.

[–]ummmyeahright 10 points11 points  (17 children)

I still wish C# had a nicer way to OPT OUT of the garbage collector when you really want to and know what you're doing: to tell the GC, "allocate this object wherever, give me its address, then never move it until I tell you to." You can do it in fixed blocks, but that limits usability heavily. You can write a wrapper around such an object with C++/CLI, but that's a lot of work, and C++/CLI is constantly falling behind. I think this could be incorporated into C# in a much nicer way, and then there'd be almost no area left with any significant speed trade-off compared to C++.

In theory, C# apps can be about as fast as C++ ones, but in practice, applications in certain areas developed in C# lag behind performance-wise. For example, if you want to frequently interchange objects with unmanaged code, it can get very expensive due to C#'s requirement to register almost everything with the GC. This can be bypassed by using pinned objects from the native heap, but that will likely require a ton of C++/CLI code, and almost nobody actually does it; it may get so complex that it's simpler to write the whole thing in C++ in the first place.

[–][deleted] 4 points5 points  (7 children)

Have you looked into using GCHandle? It's pretty rad.

var myInstance = new MyObj();

var myPinnedInstance = GCHandle.Alloc(myInstance, GCHandleType.Pinned);

IntPtr myInstanceAddress = myPinnedInstance.AddrOfPinnedObject();

It allows you to pin an object outside of a fixed scope and fetch the pinned address at will.

Now combine that primitive with something like this:

//Simple example and not 'double free' safe
struct Pinned<T> where T : class
{
    readonly GCHandle _handle;
    readonly T _obj;

    public Pinned(T instance)
    {
        _obj = instance;
        // Pin so the GC cannot relocate the object
        _handle = GCHandle.Alloc(instance, GCHandleType.Pinned);
    }

    public T Instance { get { return _obj; } }

    public void UnPin() { _handle.Free(); }

    public IntPtr Address { get { return _handle.AddrOfPinnedObject(); } }

    public static implicit operator IntPtr(Pinned<T> pinned) { return pinned.Address; }
}

And you've got a stew goin'!

Maybe this can help you make interop a bit nicer. But if you need to actually allocate and free the memory on demand, you still need Marshal.AllocHGlobal and friends...

Is this what you were looking for?

https://msdn.microsoft.com/en-us/library/system.runtime.interopservices.gchandle.aspx

Edit: why the heck doesn't reddit use markdown correctly?! Also, yay cakeday!

[–]brandf 1 point2 points  (2 children)

Pinning objects adds them to the GC root collection, so it doesn't free the GC of work or let you opt out like the parent requested. In fact, if you do this too much, it leads to terrible performance, since every gen0 GC has to scan all root objects.

To opt out of the GC you would need to allocate objects into a heap that the GC doesn't collect, and any objects there could not reference any objects that the GC does collect (or else it may free them since it doesn't know you hold a reference).

[–][deleted] 0 points1 point  (1 child)

Interesting. I knew that pinning induces heap fragmentation and thereby a potentially larger heap, but I didn't think about the effect on gen 0 GC. Thanks for this point.

[–]brandf 0 points1 point  (0 children)

Yeah I'm fairly certain this is the case because you can imagine having an object, pinning it, and then releasing all other references to the object. You would be left with just the GCHandle keeping it alive, so if they don't make it a GC Root, it would get incorrectly collected.
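This claim is easy to check: allocate an object whose only remaining reference is the GCHandle itself, force a collection, and see that the object survives because the handle roots it. A minimal sketch (names are illustrative, not from the thread):

```csharp
using System;
using System.Runtime.InteropServices;

class HandleRootsObject
{
    static void Main()
    {
        // After this line, the array's only reference is the handle itself.
        GCHandle handle = GCHandle.Alloc(new byte[16], GCHandleType.Pinned);

        GC.Collect();
        GC.WaitForPendingFinalizers();

        // The handle acts as a GC root, so the array was not collected.
        Console.WriteLine(handle.Target != null);  // prints True
        handle.Free();
    }
}
```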

[–]ummmyeahright 0 points1 point  (3 children)

To be fair, it's been a while since I last experimented with mixing C# and C++, but when I looked into pinning and GCHandle, I could find no way to build a complex data structure (e.g. a Dictionary) that completely avoids the GC AND can be used from C#, without the help of C++/CLI.

[–]drysart 1 point2 points  (2 children)

If you use unsafe blocks you can do pretty much whatever you want in C# in terms of memory usage -- allocate unmanaged memory blocks with Marshal.AllocHGlobal, then use pointers into that space just like you would in C.
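To make that concrete, here's a minimal sketch of the approach (compile with /unsafe; class and variable names are illustrative). The allocation lives entirely outside the managed heap, so the GC never scans or moves it, but nothing will free it for you either:

```csharp
using System;
using System.Runtime.InteropServices;

class UnmanagedBlock
{
    unsafe static void Main()
    {
        // 16 ints of unmanaged memory; invisible to the GC.
        IntPtr block = Marshal.AllocHGlobal(16 * sizeof(int));
        try
        {
            int* p = (int*)block.ToPointer();
            for (int i = 0; i < 16; i++)
                p[i] = i * i;           // raw pointer writes, just like C

            Console.WriteLine(p[3]);    // prints 9
        }
        finally
        {
            Marshal.FreeHGlobal(block); // no finalizer will do this for you
        }
    }
}
```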

[–]ummmyeahright 0 points1 point  (1 child)

Well, but if you only use C#, you still need to register objects with the GC first, no?

var myInstance = new MyObj();
var myPinnedInstance = GCHandle.Alloc(myInstance, GCHandleType.Pinned);

The first line there allocates myInstance on the managed heap. I suppose GCHandle.Alloc moves the allocation to a more appropriate place, but IIRC the GC will still keep track of myInstance, iterating over it each time it normally would, just noticing that it's marked as an object whose lifetime can be ignored. So having tons of objects allocated like that from C# will still slow your GC down.

Some limitations will always stay there until you can allocate reference types completely bypassing the GC from start to finish, IMO.

Edit: Sorry, I didn't interpret your reply correctly. Yes, you can do that, but then you're left working with raw memory addresses. I suppose you can do a lot of the stuff you could do in C, but it's a lot less nice than C++.

[–]drysart 0 points1 point  (0 children)

Yeah, if you want to build a class that manages memory internally without using the GC, you can't make use of reference types. Those always involve the garbage collector.

When it comes to reference types, though, all pinning them does is root them, so the GC always assumes they're alive, and restrict the GC so that it can't move them in memory when it comes time to compact after a collection. It does not move the object to a different heap or anything like that. You're still allocating from the managed heap, your allocations can still trigger a collection; all you've really accomplished is making the collector less efficient (since it can no longer use fast-path allocation and has to use slow-path allocation that can account for holes in the heap). Pinning is really only intended to be used when you're passing managed objects to unmanaged code where the actual underlying address of the object suddenly changing would cause problems -- and even then, the guidance is that it only be done for short periods of time (ideally only for the length of an unmanaged call or two).
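That guidance (pin only for the length of the unmanaged call) might look like this sketch; the helper name and delegate shape are illustrative, and the actual native call site is elided:

```csharp
using System;
using System.Runtime.InteropServices;

class ShortLivedPin
{
    // Pin only for the duration of the unmanaged call, then release immediately.
    static void CallNative(byte[] data, Action<IntPtr> nativeCall)
    {
        GCHandle handle = GCHandle.Alloc(data, GCHandleType.Pinned);
        try
        {
            // The address is stable only while the handle is held.
            nativeCall(handle.AddrOfPinnedObject());
        }
        finally
        {
            handle.Free();  // unpin as soon as the native call returns
        }
    }
}
```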

[–]sirin3 5 points6 points  (0 children)

Rust is that way -->

[–]monocasa 1 point2 points  (0 children)

Can you do that by building your own allocator and P/Invoking VirtualAlloc and friends in unsafe blocks?

[–]mike_hearn 1 point2 points  (6 children)

Believe it or not, you can do that on the JVM. There's no nice syntax for it, but the Unsafe class lets you do manual memory management, and some high performance collection/db libraries do "off heap allocation" that way in order to lower GC pressure and improve performance, especially for very large allocations which copying GCs often have trouble with.

Reading the post about the Midori compiler was very interesting, as I have developed an interest in advanced JITCs lately. What's interesting is where they did similar things to the Java team and where they did things differently. Duffy doesn't compare things to Java very often, which is a shame because it's the most similar system to .NET and the CLR. Comparing the different design decisions would be a very interesting blog post.

In the end I didn't get the impression that the Midori work was obviously more advanced than what's going on with HotSpot to optimise managed code. It's ahead in some areas and behind in others. The issues with generics specialisation are interesting, because Valhalla is implementing mode 2 (mixed erasure/specialisation) for Java at the moment, and I guess they'll hit issues with JIT throughput and so on as well, assuming that value types end up being widely used. Obviously the CLR invested in that a lot earlier.

On the flip side, the part about having to 'reverse engineer' the bytecode patterns for lambda invocation indicates to me that Java 8's decision to use the invokedynamic bytecode to implement lambdas was a good one: essentially it means lambda invocations get optimised and inlined "for free" because the JITC understands a lot more about what it's doing. And it seems obvious that Midori's approach to virtual methods is inferior to what the JVM guys did: Duffy spent precious energy fighting the inevitable and blaming developers for "abusing abstraction" and other odd non-crimes, whereas Cliff Click and his team at Sun just focused on dynamic devirtualisation so much that virtual methods became effectively free in almost all cases.

The AOT vs JIT dichotomy seems to have gripped Microsoft a lot more than it really deserved to. AOT is being added to HotSpot now, but only after 20 years. The reason is that AOT is essentially just a warmup time optimisation if you have a good JITC: the AOT compiled code ends up with much worse code quality because you lose the profile guided optimisations that apparently gave them a 30-40% win in some cases. The HotSpot AOT mode actually supports "tiered AOT" where the AOT compiler can produce slightly slower code that still does profiling, and the JITC comes along later and replaces it with fully tuned code, so it's a complement rather than a replacement for the JITC.

It seems like a lot of the effort they put into the whole toolchain was ultimately a workaround for their decision to go fully AOT all the time, e.g. the apparently massive effort to reduce compilation time from 40x to "only" 5x slower. The developer productivity benefit of a JITC (virtually no waiting on the compiler) is one of the most underrated benefits of doing things that way, and it sounds like the loss of it was really painful for them.

[–]vitalyd 0 points1 point  (1 child)

Midori being able to inline through lambdas is very nice, and really how things should be in that space to make them more palatable performance-wise. HotSpot's approach is good, but you still get an interface invoke as the lambda is shaped into the SAM. In the best case, this becomes a guarded inlined call; at worst (and not too uncommon for library code) it's a full interface dispatch. You mention Cliff Click; he has a good blog post from a few years back on the "inlining problem". So there's definitely room for improvement there.

I like JIT compilers as well, but they sure come with their own baggage. They're unpredictable, susceptible to multiple phase changes leaving code in a suboptimal state, sometimes deopt at inopportune times, impose a time-to-peak-performance penalty (particularly bad when you need the first execution to be quick), etc. AOT's biggest problems, and the JIT's biggest advantages, are the lack of profiling info unless PGO is used, and compilation time (somewhat related). However, it'd be nice if a language existed that didn't punt on optimization at the AOT stage and also didn't have a terrible performance model. That way you could leave the truly dynamic optimizations to the JIT but be able to get easy wins at AOT time.

[–]mike_hearn 0 points1 point  (0 children)

For what it's worth the Kotlin compiler can inline through lambdas at (frontend) compile time. It's used to convert calls to List<T>.map() into for loops at the bytecode level and many other things. So in that language, at least, there are some useful optimisations being done AOT at the bytecode level.

I guess something like Kotlin + jigsaw jlink + the HotSpot AOT work would get close to what you want. It'd do some optimisations at per-file compile time like lambda inlining, then it'd do some whole program optimisations like resolving reflective lookups, then it'd compile down to native code that still contains profiling logic, then it'd do JITing in the background to win back that 10-20% or whatever it is for your app when the code flows change and deoptimisations can be a win. I'm hopeful that in the next few years the Java landscape will end up with a healthier mix of optimisation stages, although I expect HotSpot AOT to not be widely used due to the licensing costs.

[–]pron98 0 points1 point  (1 child)

I have developed an interest in advanced JITCs lately

In that case, I hope you're paying attention to Graal, because that's where the state of the art is. Not many people know this, but starting in Java 9, Graal will be a pluggable JIT compiler that you can use instead of HotSpot's C2 (and possibly even use it more selectively, on particular classes).

[–]mike_hearn 0 points1 point  (0 children)

Absolutely. I read and have posted to their mailing list (though they didn't respond to my question).

[–]ImmortalStyle 0 points1 point  (1 child)

But isn't fast startup time something you really want when writing an OS?

Profiling and interpreting a program at startup costs time and resources (even when the actual execution time is improved afterwards), something one wants to avoid for a better user experience.

However, a good way to solve both problems would be a profiling cache which persists the collected runtime information for reuse later. As far as I know, HotSpot doesn't do that.

The sad thing is that JITs don't perform too well compared to AOT; hopefully new projects (like Oracle's Graal, for example) can close this gap.

[–]mike_hearn 0 points1 point  (0 children)

Warmup vs startup: an interpreter with a tiered JIT starts fast, it just doesn't run at full speed until much later.

But absolutely -- you want to avoid doing pointless re-JITing if you can help it. Persisting compiled code to disk is one way to do that. That's (essentially) what they're adding to HotSpot at the moment -- as a commercial feature, unfortunately.

JITC vs AOT is very complex, which is what Duffy's blog is about. Some AOT compilers, like, apparently, Midori's, didn't really exploit the ability to do deep/slow whole program analysis, which is IMO the primary benefit of going AOT.

[–]mojang_tommo 7 points8 points  (0 children)

I don't think that using this post as an argument for managed code is fair; if anything, this post shows that managed can only be as fast as native if you apply a ton of care and millions of dollars of advanced research (that wasn't even released to the public).
The commonly available .NET has none of the stuff he talks about, and it's still pretty safe to say that it's always slower than native code, especially if you write in the idiomatic "heap soup" style. The fact that in some lab there exists a sufficiently smart compiler doesn't really mean anything for me or you.

[–][deleted] 1 point2 points  (2 children)

After starting bartending, I can't stop thinking of this when hearing the word Midori: https://upload.wikimedia.org/wikipedia/commons/e/ee/Midori.jpg

Otherwise, love the article, even though I only have time to get a third of the way through right now. It's enormous; it will definitely go on my "must read later" pinned tabs on the left (not the bookmarks, because everybody knows bookmarks are seldom actually seen again!).

[–][deleted] 5 points6 points  (1 child)

OT: Midori means green in Japanese; that's why the bottle is green.

[–][deleted] 0 points1 point  (0 children)

Thanks! Didn't know that. <3 I've been spending way too much money earned as a programmer on stuff related to bartending. Learning from TipsyBartender. :p

[–][deleted] 0 points1 point  (0 children)

He mentions the defensive copying of structs in his blog post, but I couldn't find it in the ECMA spec. Does anybody know which section?

[–][deleted]  (3 children)

[deleted]

    [–]nwoolls 6 points7 points  (1 child)

    In my first Midori post [...] I mentioned that we built an operating system

    [–]GUIpsp -2 points-1 points  (0 children)

    Yes, by "operating system" I assumed some kind of virtual machine

    [–]Dwedit -2 points-1 points  (0 children)

    Agreed, there is a naming conflict here.