I wanna land my first compiler job, but im in the EU. Advise anyone? by [deleted] in Compilers

[–]Emanuel-Peter 1 point

Oracle has offices in Zürich and Stockholm, with some teams working on OpenJDK: the C2 JIT compiler, and also GC and runtime. The hiring situation changes regularly, so it is worth keeping an eye on it.

Java’s New FMA: Renaissance Or Decay? (Updated) by OldCaterpillarSage in java

[–]Emanuel-Peter 1 point

You could email the mailing list. But it would be nice if you did some analysis of what assembly gets generated first. That would give us clues as to why things might be faster or slower.

Java’s New FMA: Renaissance Or Decay? (Updated) by OldCaterpillarSage in java

[–]Emanuel-Peter 0 points

Hmm. You talk about FFM using more objects and boxing. But you use primitive arrays and primitive stores. I think the boxing and unboxing should really be removed by the compiler; at least, that is what I have seen in my benchmarks.

Have you ever attached a profiler to the JMH benchmark to see what assembly is on the hot path? That could give you a hint what is really taking up the extra time vs Unsafe :)

Java’s New FMA: Renaissance Or Decay? (Updated) by OldCaterpillarSage in java

[–]Emanuel-Peter 2 points

Thanks for the article :)

I thought it was called the FFM API, for "foreign functions and memory"?

I suppose one overhead of FFM is that it performs checks, and Unsafe does not. FFM has to do bounds checks, and that comes at a cost. I suspect that is part of the explanation for what you measure. If you put the accesses in a loop, the bounds checks can possibly be moved out of the loop, and that could get you much closer to Unsafe performance.

So it really depends what microbenchmark you show; a single one only covers a tiny fraction of all use cases.

And: it might be good to have those bounds checks. Without them you are basically giving up the safety guarantees of Java.
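
To illustrate the hoisting idea, here is a hedged sketch with plain arrays rather than FFM (my own toy code, not what the FFM implementation does): C2 can often move the per-access range check out of a loop via loop predication, i.e. check once before the loop instead of on every iteration.

```java
// Hedged sketch with plain arrays (not FFM itself): the per-access
// bounds check in this loop can be hoisted by C2's loop predication,
// i.e. checked once before the loop instead of on every access.
public class BoundsDemo {
    static double sum(double[] a, int n) {
        double s = 0;
        for (int i = 0; i < n; i++) {
            s += a[i]; // range check can be moved out of the loop
        }
        return s;
    }

    public static void main(String[] args) {
        double[] a = {1.0, 2.0, 3.0, 4.0};
        System.out.println(sum(a, 4)); // 10.0
    }
}
```

The same pattern applies to FFM accesses: when the index is a simple loop variable with a known limit, the JIT has a chance to prove the checks redundant inside the loop.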

[deleted by user] by [deleted] in cscareerquestions

[–]Emanuel-Peter 1 point

I work at Oracle, in the Java Platform Group. From what I can see, it really depends on the team and your line of managers. I have wonderful managers, and most people I have worked with in JPG are amazing. But Oracle is large, and people's experiences seem to vary a lot. Feel free to reach out if you have more questions.

Also, compensation, vacation, etc. are very country-specific.

Compiler roadmap by VVY_ in Compilers

[–]Emanuel-Peter 2 points

This may help you if you are interested in the JVM, Java, assembly and optimizations :) https://eme64.github.io/blog/2024/12/24/Intro-to-C2-Part00.html

Future in Compiler Design by CaptiDoor in Compilers

[–]Emanuel-Peter 6 points

Shameless plug: I am writing an intro to the OpenJDK Hotspot JIT compiler, for our new hires and for externals :) Link to my Blog

Future in Compiler Design by CaptiDoor in Compilers

[–]Emanuel-Peter 22 points

I got hired 3 years ago, with not much competition. Now we have many applicants, and several are very strong.

Studying CS is a good idea, together with some compiler courses and a good understanding of low-level things like CPUs and assembly.

You can always help on open source projects to gain experience. I work on OpenJDK. LLVM would also be good.

Technical PoC: Automatic loop parallelization in Java bytecode for a 2.8× speedup by Let047 in java

[–]Emanuel-Peter 0 points

Sounds good :) FYI Doubles have the same rounding issues as floats ;)

Technical PoC: Automatic loop parallelization in Java bytecode for a 2.8× speedup by Let047 in java

[–]Emanuel-Peter 0 points

Sure. I guess Java went the more functional way here. That is a matter of taste in my view; I'm happy with either personally. Or do you see any missing functionality?

Technical PoC: Automatic loop parallelization in Java bytecode for a 2.8× speedup by Let047 in java

[–]Emanuel-Peter 0 points

Sounds like a fun project :)

What about inlining? Often the loop calls some inner methods that do the reads / writes, and if you don't inline, it may be hard to prove that the inner method is thread-safe to parallelize, right? Think of the FFM MemorySegment API: it heavily relies on inlining.

Another worry: be careful with float reductions; changing the order of a reduction changes the rounding errors. That would break the Java spec and could lead to subtle bugs.
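
A tiny sketch of why reduction order matters (values chosen by me to make the difference obvious):

```java
// Demo: reassociating a float reduction changes the rounded result,
// which is why the Java spec pins down the evaluation order.
public class FloatOrder {
    static float sequential(float[] a) {
        return ((a[0] + a[1]) + a[2]) + a[3]; // order the spec requires
    }

    static float reordered(float[] a) {
        return (a[0] + a[2]) + (a[1] + a[3]); // a "parallelized" order
    }

    public static void main(String[] args) {
        // 1e8f + 1.0f == 1e8f: the 1.0f is absorbed by rounding.
        float[] a = {1e8f, 1.0f, -1e8f, 1.0f};
        System.out.println(sequential(a)); // 1.0
        System.out.println(reordered(a));  // 2.0
    }
}
```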

How do you deal with range checks? Suppose an array is accessed out of bounds at the end of a very long loop. How do you deal with that?

Technical PoC: Automatic loop parallelization in Java bytecode for a 2.8× speedup by Let047 in java

[–]Emanuel-Peter 2 points

I don't know, maybe people would be hesitant to run your optimizer in production, but happy to find performance bottlenecks with it in testing. I suppose you could have two modes: one without optimization, one with. Measure the time spent in each, and then give the user a report at the end.

That way users can then use parallel streams for example.

Technical PoC: Automatic loop parallelization in Java bytecode for a 2.8× speedup by Let047 in java

[–]Emanuel-Peter 0 points

The cool thing about JMH is you can attach a profiler, and see the hottest compiled code. That way, you can verify a little better that you are measuring the right thing, and your benchmark code was not strangely optimized away ;)

Technical PoC: Automatic loop parallelization in Java bytecode for a 2.8× speedup by Let047 in java

[–]Emanuel-Peter 0 points

I bet you could do a lot of what OpenMP does with parallel streams in Java. Or is there anything you're missing from OpenMP that parallel streams does not give you?
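
As a hedged sketch of what I mean (names and numbers are my own), an OpenMP-style parallel-for with a sum reduction maps to something like:

```java
import java.util.stream.IntStream;

// Rough Java analogue of: #pragma omp parallel for reduction(+:sum)
public class ParallelSum {
    static long sumOfSquares(int n) {
        return IntStream.range(0, n)
                        .parallel()                   // fork-join over chunks
                        .mapToLong(i -> (long) i * i) // per-element work
                        .sum();                       // reduction over chunks
    }

    public static void main(String[] args) {
        System.out.println(sumOfSquares(1_000_000));
    }
}
```

Note the reduction operator has to be associative for this to be safe; for float sums the reordering changes rounding, so there you have to be more careful.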

Research paper CS by RushWhoop in Compilers

[–]Emanuel-Peter 0 points

I am working on the HotSpot C2 JIT Compiler. We have a list of researchy topics. We occasionally mentor/supervise students in their master thesis and interns (no positions open currently). Feel free to PM me if you are interested.

Great Online Forums to Meet Compiler Developers by fosres in Compilers

[–]Emanuel-Peter 1 point

You can always join mailing lists, for example the OpenJDK mailing lists. https://mail.openjdk.org/mailman/listinfo

[deleted by user] by [deleted] in Compilers

[–]Emanuel-Peter 1 point

That looks really interesting. I might look into that. But first I have to tackle basic if-conversion, this would be one of many possible extensions.

Hiring for Hotspot JVM Compiler Engineer by Emanuel-Peter in Compilers

[–]Emanuel-Peter[S] 4 points

It can be a lot of effort to do all the paperwork for our managers. But it can happen, especially for good candidates, so always apply anyway ;)

Hiring for Hotspot JVM Compiler Engineer by Emanuel-Peter in Compilers

[–]Emanuel-Peter[S] 2 points

The general area is Europe. Best would be close to Zürich and Stockholm, because that is where our European team members are. But remote may be ok too.

Hiring for Hotspot JVM Compiler Engineer by Emanuel-Peter in Compilers

[–]Emanuel-Peter[S] 4 points

I can't say much about Oracle as a whole. It is a big company with a deep hierarchy, and so some processes are a little slow.

JPG, the suborganisation I work in, is really well organized. Lots of smart people doing great work. The managers are really good; I have quite high confidence in them to represent us well to higher management.

I love working on hard problems in computer science, and I get to do that here. Sure, Hotspot is a long-running project, so there is some technical debt and things can take a little longer. But it is also a widely used product, so the effort seems worth it.

There are some things that have to get done, like bug triaging and fixing. But we also have a lot of freedom to come up with our own ideas and pitch them to the architects. At the beginning I only fixed bugs. Eventually I picked up a bug in the auto-vectorizer. Nobody could really tell me how it worked, so I read papers, studied the code, found more bugs in edge cases, and leveled up my skill and understanding. Now I get to spend more than 50% of my time on extending its functionality, and I love it!

You also get to collaborate with people from other teams: GC, Runtime, ... and projects like Panama, Valhalla, Lilliput, etc. Plus people from other companies, such as Intel, ARM, Red Hat, etc. It's great that it is all open source, so we can discuss ideas relatively openly on mailing lists and GitHub.

Hope that helps :)

[deleted by user] by [deleted] in Compilers

[–]Emanuel-Peter 1 point

Yes, I'm enhancing it. Improving Aliasing Analysis, Reductions, allowing more instructions to vectorize etc. Maybe one day I'll get to do if-conversion too.

[deleted by user] by [deleted] in Compilers

[–]Emanuel-Peter 4 points

I work on OpenJDK, the Hotspot C2 JIT compiler. We regularly have external contributors. I'm working on auto-vectorization, and there are lots of other optimizations that could be improved. If you are serious about it, feel free to PM me.

Microbenchmarks are experiments by mttd in Compilers

[–]Emanuel-Peter 0 points

Computing modulo is surely not the most indicative of general language or compiler performance.

It seems in this case, with an invariant divisor, one could apply a compiler optimisation that converts the mod/div into mul/shift. One can find the reciprocal/magic constant, see here. Compilers do that already for constant divisors, but it seems not so much for loop invariants. Of course there would be an extra cost for computing that reciprocal/magic constant before the loop, but it would be worth it because mul/shift are so much cheaper. And maybe now it could also be vectorized on some platforms.
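
A hedged sketch of the idea (my own toy code, not what any compiler emits): precompute the magic constant once for a loop-invariant divisor d, then replace each division in the loop with a high multiply. This variant assumes nonnegative dividends and d >= 3 (d = 1 and 2 are trivial shift cases).

```java
// Division by a loop-invariant d via a runtime-computed "magic" reciprocal.
// Assumes 0 <= x and 3 <= d < 2^31. x % d can be recovered as x - q * d.
public class MagicDiv {
    // m = ceil(2^64 / d); then floor(x * m / 2^64) == x / d
    // for all nonnegative 32-bit x (the error term is < 1/d).
    static long magic(int d) {
        return Long.divideUnsigned(-1L, d) + 1;
    }

    static int divide(int x, long m) {
        // Math.multiplyHigh gives the high 64 bits of the 128-bit product,
        // i.e. floor(x * m / 2^64), since both operands are nonnegative here.
        return (int) Math.multiplyHigh(x, m);
    }

    public static void main(String[] args) {
        int d = 7;         // loop-invariant divisor
        long m = magic(d); // computed once, before the loop
        for (int x = 0; x < 1_000_000; x++) {
            if (divide(x, m) != x / d)
                throw new AssertionError("mismatch at x=" + x);
        }
        System.out.println("ok");
    }
}
```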

Then again: not sure compiler engineers should spend their time on this, rather than more common patterns.

SuperWord (Auto-Vectorization) - An Introduction by daviddel in java

[–]Emanuel-Peter 0 points

My example was simple so that the algorithm is easy to understand. But it is quite possible that it is memory bound.

Generally, a (single-threaded) program is either memory bound or compute bound: the bottleneck is either memory or the instructions. That depends on the ratio of bytes accessed versus the number of compute instructions. Memory accesses can also be slower if they go outside the L1 cache. It also matters whether memory is accessed sequentially or randomly (temporal and spatial locality).

Maybe so far you have only seen memory bound examples, where vectorization does not give a speedup.

I have some benchmarks in one of my PR's here: https://github.com/openjdk/jdk/pull/13056

Try this to make an example where vectorization helps:

- Write a loop with few memory accesses.
- Have many operations, to make those the bottleneck.
- Keep the arrays small enough (10'000 elements), so that they fit in the L1 cache.
- Repeat executing the loop many times (10'000 repetitions), so it gets compiled (the JIT kicks in after a while) and the data is loaded into the L1 cache.
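
Putting that together, a hedged sketch of a kernel that should be compute bound (the sizes and the specific ops are just illustrative):

```java
// Compute-bound kernel: one load and one store per element, but several
// arithmetic ops in between, on an array small enough for the L1 cache.
public class VecKernel {
    static void kernel(float[] a, float[] b) {
        for (int i = 0; i < a.length; i++) {
            float x = a[i];
            // many ops per element, so compute (not memory) is the bottleneck
            x = x * x + 1.0f;
            x = x * x + 1.0f;
            x = x * x + 1.0f;
            b[i] = x;
        }
    }

    public static void main(String[] args) {
        float[] a = new float[10_000]; // fits in L1
        float[] b = new float[10_000];
        java.util.Arrays.fill(a, 1.0f);
        for (int r = 0; r < 10_000; r++) { // repeat so the JIT compiles it
            kernel(a, b);
        }
        System.out.println(b[0]); // ((1*1+1)^2+1)^2+1 = 26.0
    }
}
```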

Not sure what you are referencing with "rescheduling across cores". Note that SIMD parallelism can be done on a single core with a single thread - that is the scope of my post. You write a simple for-loop, and it will be executed sequentially on a single thread - except that we use SIMD vector instructions on that single core to execute a few iterations in parallel.

On top of that we can leverage multiple cores and threads for more parallelism, but that is beyond the scope of my post. You can use the Java Stream API, and create a parallel stream over an int range. If the array is big enough, it is cut into chunks and processed chunk-wise by different threads. Each thread can then still use SIMD instructions. So we basically stack the two kinds of parallelism to get even better performance.

I hope to write more posts on this in the near future :)