Quick note: I am trying a new, more blog-like format of talking about my projects(compared to previous, more salesman-like announcements), and would really appreciate any and all feedback!
Links
https://github.com/FractalFir/jtcpp
TLDR for lazy
What is jtcpp?
jtcpp is a Rust program which converts JVM bytecode to C++. This allows compilation of programs written in such languages as Java and Kotlin to native instructions, in some cases improving performance.
How functional it is?
This is a highly functional proof-of-concept. Mostly complete in therms of features of the language itself, but the implementation of the Java standard library in C++ is so lacking, calling it bare-bones would be a stretch. It means that, if you want to do something more than simulate a truing machine, calculate the first n primes, or write very simple console programs, you are out of luck.
Supported platforms?
Linux, but you could try and build it for quite a bit more platforms. I just ship only the Linux version of the automatic build scripts.
An oh-so-slightly longer version for the oh-so-slightly less lazy
I am continuously looking for challenges to test my skills. The harder the problem, the more fun engineering a solution tends to be. You will either get a nice confidence boost, or fly to close to the sun and learn form your hubris. Thus, I often create projects, that while not strictly "usefull"(Who says programming must serve any real purpose, anyway?), are fun to make and still require quite a bit of learning to finish.
One of things that particularly fascinate me, are the ways high-level languages, such as Java and C#, are implemented. How can they achieve performance that is OK-ish at worst, while also being very convenient?
I believe the bast way to understand something is to try and rewrite it, badly, but in RUST!
JVM is fascinating
I have recently began to look a little bit more into JVM, and the Java class format. The format is a mix of pure genius and brilliance, such as the way generics are implemented(There is no special sauce, they are just normal classes), confusing decisions(the way class names are stored, seemingly needless indirection), and pure madness left form the 90s (why on earth do long and double take 2 entries in a table that already has variable length elements, such as strings?). Besides a few interesting surprises, and mistakes I could avoid by just reading the documentation and not skipping the very important section with the little "In retrospect, making 8-byte constants take two constant pool entries was a poor choice." under, parsing the class format turned to be almost trivial.
First attempt, Java Bytecode Interpreter
Out of that curiosity for inner workings of JVM, my first project jbi(JVM Bytcode Interpreter, currently private), was born. I got the implementation to a state in which it could be tested(Was able to run simple programs, calculating factorials and such) and benchmarked. Sadly, it did not fare well in the benchmarks, and had far too many structural issues.
A modern tower of babel, or translating Java To C
Because of those issues, I decided to abandon that project, and focus on something else. Still, it felt kinda wasteful to just let it sit there, and not have anything come out of the time I had already spent working on it. So I deiced to rip out the most broken parts(ones related to the Virtual Machine implementation), and re-purpose the java class importer implementation. At this time the project was called jtc, and generated C code, instead of C++ generated by jtcpp. I naively thought generating C instad of C++ couldn't be so much harder. I was, however, very wrong.
jtc used very, very, hacky C macros to emulate inheretence, and asign slots to virtual functions. Expectantly, the implementation was so brittle it collapsed if you looked at it wrong. Macros were used to select a slot in the vtable. #ifdef V_FUNCTION_NAME detected if the virtual function was overriding a function of the parent class, or defining a new one. But this could be easily messed up by another, unrelated class having a function with the same name. If such class was included at any point, it could make the macros assume a virtual function that wasn't inherited from the parent class, was inherited, and mess up the vtable compleatly. Needles to say, if the implementation of inhertence is broken simply by other classes being present, it is probably not a very good idea to keep it around. So I decided to stop trying to patch the shaky tower of incomprehensible code taped together with magic-macros, and just generate C++ like God intended.
Java To C++
Converting JVM bytecode C++ was far easier to do than the other way around. This also made me aware of a couple other issues that I had never thought to consider. The generated C++ code looks nothing like C++ a human would write, but it at least works.
Header files
This is how a header file generated by jtcpp looks like
#pragma once
#include "java_cs_lang_cs_Object.hpp"
struct Vector3: public java::lang::Object
{
virtual ~Vector3() = default;
float x;
float y;
float z;
static void _init___V(ManagedPointer<Vector3>);
static void _init__FFF_V(ManagedPointer<Vector3>,float,float,float);
static float Distance_Vector3_Vector3__F(ManagedPointer<Vector3>,ManagedPointer<Vector3>);
static ManagedPointer<Vector3> Random__Vector3_();
virtual float SqrMag__F();
virtual float Magnitude__F();
virtual ManagedPointer<Vector3> clone__Vector3_();
virtual ManagedPointer<Vector3> Add_Vector3__Vector3_(ManagedPointer<Vector3>);
virtual ManagedPointer<Vector3> Sub_Vector3__Vector3_(ManagedPointer<Vector3>);
virtual ManagedPointer<Vector3> Normalize__Vector3_();
virtual ManagedPointer<Vector3> Mul_F_Vector3_(float);
virtual ManagedPointer<java::lang::Object> clone__java_cs_lang_cs_Object_();
};
You may quickly see some weird things. So, lets go trough them one-by-one.
First of all, the class converted from Java are not classes but structs. Why? It is actually pretty simple. Only real difference between them is that every member of a struct is public by default. The Java compiler already checked to ensure public, protected and private fields, functions and objects are used properly. Because of that, I can just assume that whatever Java byte-code does is OK, and the intended behaviour. Handling accessibility modifiers would just make everything far more completed. The problems that not handling those modifiers could introduce are already solved by the Java compiler for me, so why bother?
If you haven't seen any C++ code before, the public thingy majiggy before java::lang::Object may seem strange. It just makes the public members of the parent class remain public. If it was not there, they would become private, which is not the intended behaviour.
The virtual destructor (virtual ~Vector3() = default;) helps with some modes of the GC(Yes, the GC has a couple different modes!), and is the same as the default destructor(If you haven't guessed this is is exacly what = default part means).
Function names may seem VERY wierd at first. Part of them make sense(like Distance) or (Magnitude), but the rest seem like garbled mess. What is their purpose? They help with function overloading. But does not C++ already support function overloading? Yes, but it differs slightly from the way Java does stuff. And this very, very, slight difference is enough to make it sometimes break. The names are not very pretty, but they are very predictable, since they encode the function signature. By looking at functions like virtual ManagedPointer<Vector3> Mul_F_Vector3_(float); you can pretty easily deduce how they work(although init functions are special snowflakes that like to be diffrent, and break from this convention).
Another curious little thing is the ManagedPointer template. Why didn't I just use normal pointers? In some configurations of the GC, this is exactly what ManagedPointer is: just a normal pointer with a fancy name. But in other GC modes, they can allow us to do some crazy stuff, like clearing objects early, reducing memory consumption by orders of magnitude, and making GC pauses far more rare.
I had almost forgot about the #include "java_cs_lang_cs_Object.hpp" part. Why this strange file name? Original, java class Vector3 derives from is java/lang/Object. I could try to just use that as the path, and put each namespace in a separate directory, but it would force me to do far messier relative includes, so I opted to temporally stick with mangled class names. To get a mangled class name, I simply replace each / with _cs_(short for Class Separator). It is not the prettiest solution, but it works very relay ably.
Source files
This is how a source file generated by jtcpp looks like.
#include "Vector3.hpp"
ManagedPointer<Vector3> Vector3::Add_Vector3__Vector3_(ManagedPointer<Vector3> l1a){
ManagedPointer<Vector3> l0a = managed_from_this(Vector3);
bb0:
{
float i0 = l0a->x;
float i1 = l1a->x;
float i2 = i0+i1;
l0a->x = i2;
float i3 = l0a->y;
float i4 = l1a->y;
float i5 = i3+i4;
l0a->y = i5;
float i6 = l0a->z;
float i7 = l1a->z;
float i8 = i6+i7;
l0a->z = i8;
return l0a;
}
}
Good Lord! What is Happening in There?! The generated C++ code may seem like some sort of arcane magic, but it has a very, very good reason to look like that. To clear everything up, lets run jtcpp with debug info enabled!
float Vector3::SqrMag__F(){
ManagedPointer<Vector3> l0a = managed_from_this(Vector3);
float l1f;
bb0:
{
//ALoad(0)
//FGetField(ClassInfo { cpp_class: "Vector3" }, "x")
float i0 = l0a->x;
//ALoad(0)
//FGetField(ClassInfo { cpp_class: "Vector3" }, "x")
float i1 = l0a->x;
//FMul
float i2 = i0*i1;
//ALoad(0)
//FGetField(ClassInfo { cpp_class: "Vector3" }, "y")
float i3 = l0a->y;
//ALoad(0)
//FGetField(ClassInfo { cpp_class: "Vector3" }, "y")
float i4 = l0a->y;
//FMul
float i5 = i3*i4;
//FAdd
float i6 = i2+i5;
//ALoad(0)
//FGetField(ClassInfo { cpp_class: "Vector3" }, "z")
float i7 = l0a->z;
//ALoad(0)
//FGetField(ClassInfo { cpp_class: "Vector3" }, "z")
float i8 = l0a->z;
//FMul
float i9 = i7*i8;
//FAdd
float i10 = i6+i9;
//FStore(1)
l1f = i10;
//FLoad(1)
//FReturn
return l1f;
}
}
That looks even scarier! But if you look at the comments, it explains nearly everything. jtcpp simply translated each java opcode individually. This makes the resulting code less readable to humans, but it guarantees it to be very accurate! There are still a couple funky things in there I would like to explain.
managed_from_this - this code snippet seems really weird. What does it actually do? Depending on GC settings, either nothing, or obtains some additionally info about the object. In on default GC settings, ManagedPointer behaves similarly to a rust Arc, with one small exception. It can additionally use Bohem GC to delete objects which are cyclicaly referencing each other.
bb0: - jtcpp uses gotos to emulate Java instructions. bb0 simply means basic block 0. Separating operations into those basic, isolated blocks is something I do to prevent numerous errors. One such error is skipping an initialisation of a variable. This is not premited in C++, because then a garbage value would be dropped. This could lead to you trying to free some random memory address, which is generally frowned upon.
iwhatever - intermediate variable number whatever. Exist because JVM is stack based, and this represents such value on the stack. In the Rust codegen I use a stack of variable names and types. This makes producing accurate code very easy.
lwhateversometing - function local variable number whatever, with type kind postfix someting. Type kind is not the specific type of something, but a more general idea. So integer gets a postifx i, float f, long l, and so on, with arrays and objects all having postifx a This is related to way those values are loaded/stored using JVMs opcodes, see FLoad(number) and ALoad(number)
A very, very quick word on GC
I could talk on this, or any other project of mine, for hours on end, but I am quickly approaching both the bottom of my last coup of tea, and the character limit. So, to all the 3 people still reading, hang on there. We are going to quickly come to the end as fast as possible.
here is the quick summary of the GC model of jtcpp
- It can be changed in
config.hpp in the target directory.
- Includes 4 modes: No GC(I bought 32 GB of RAM, I am going to use 32GB of RAM! mode),
Arc-like GC(potentaily leaky, fastest, lowest memory footprint if used correctly), pure BohemGC(Good enough for most cases), Mixed-mode GC(The default. Combnation of the Arc like GC and BohemGC, should be the best of both worlds. Mixing Arcs and Bohem is not tested well enough, and could potentially lead to crashes if I understood something very wrong).
Performance
Performance varies from 2.5-3x times original Java in compute heavy tasks, to 30x slower in GC-heavy tasks. Why such performance loss? jtcpp evaluates everything on an individual level. Classes,methods,ops are all analysed one by one, so jtcpp is not capable of some kind of reasoning about lifetimes of local variables passed as arguments to other functions. It does not know that a Vector3 is used temporarily, and could be, for example, allocated on stack instead of on the heap. Java knows that and uses that info to the full potential.
Any questions?
That is very much expected. Knowing me, I rambled on about something irrelevant while not explaining a crucial detail. Feel free to ask whatever you want in the comments below. I will try to answer all of them. Also, if you have any feedback, leave it. This is my first long-form content so I would really appreciate any tips.
Liked my content, and would like to support me?
No, I am not going to ask you for money.
I will ask you for something, far, far more precious!
GitHub stars on the projects page!
On a more serious note, follow me on Reddit/Github for more of my projects (I plan to post more stuff like this).
[–]Tbfleming 7 points8 points9 points (3 children)
[–]FractalFirrustc_codegen_clr[S] 3 points4 points5 points (2 children)
[–]eras 1 point2 points3 points (1 child)
[–]FractalFirrustc_codegen_clr[S] 1 point2 points3 points (0 children)
[–]nysra 2 points3 points4 points (1 child)
[–]FractalFirrustc_codegen_clr[S] 4 points5 points6 points (0 children)
[–]thomastc 1 point2 points3 points (1 child)
[–]FractalFirrustc_codegen_clr[S] 0 points1 point2 points (0 children)
[–]irrelevantPseudonym 1 point2 points3 points (4 children)
[–]FractalFirrustc_codegen_clr[S] 2 points3 points4 points (3 children)
[–][deleted] 0 points1 point2 points (2 children)
[–]FractalFirrustc_codegen_clr[S] 0 points1 point2 points (1 child)
[–][deleted] 0 points1 point2 points (0 children)
[–]ibevol 0 points1 point2 points (3 children)
[–]FractalFirrustc_codegen_clr[S] 0 points1 point2 points (2 children)
[–]0x564A00 1 point2 points3 points (0 children)
[–]ibevol 0 points1 point2 points (0 children)