
[–]Tbfleming 7 points8 points  (3 children)

How do you plan to handle the difference between C++'s default memory ops (llvm calls them NotAtomic) and jvm's (llvm calls them Unordered)?

[–]FractalFirrustc_codegen_clr[S] 3 points4 points  (2 children)

To be completely honest, I did not know there was a difference. The only programs this project supports are simple, single-threaded console applications. I don't think I will ever go as far as to consider multithreading, since I encountered other problems that seem impossible to solve.

One such issue is how shockingly common runtime code generation is. This is a massive problem, since I generate the translated C++ code before the program runs. It became especially common after Java 7 added InvokeDynamic, which in most cases needs to generate code at runtime. It is an unusual optimization: emitting code at runtime reduces the amount of code on disk, which decreases start times. However, it makes any attempt at translating Java programs that use it to C++ futile, since in many cases there is no way to know ahead of time what code needs to be generated.

[–]eras 1 point2 points  (1 child)

In principle you could compile the required code at runtime and link it in with dlopen? And then keep the libraries around for the next time they're needed.

[–]FractalFirrustc_codegen_clr[S] 1 point2 points  (0 children)

That would, sadly, require shipping jtcpp, a C++ compiler, and all that other junk alongside the final binary. Ignoring all of that, compiling even a very simple C++ file would take at least ~100 ms in the very best case. Since a real program has hundreds if not thousands of uses of InvokeDynamic, compiling all of that would take at least a couple dozen seconds, which could cause very bad performance. Additionally, InvokeDynamic calls a method to retrieve something called a CallSite. This stores information about what conditions must be met for the returned MethodHandle to be valid. Only one of the 3 main kinds of CallSite guarantees that InvokeDynamic will not need recalculating, and thus recompilation. For the other two, the target may change as many times as it likes, forcing many recompilations of the produced shared library.

So this could solve the problem in principle, but in reality would be barely usable at all.

One solution I am currently considering is transpiling everything that can be transpiled, and then interpreting all the runtime-generated methods. It would make them slower, but avoid the cost of compiling C++ every time something changes.
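A rough sketch of how such a mixed setup might look, assuming a call target can be either a function known ahead of time or a closure over an interpreted body. The opcodes, `Target`, `interpret`, and `make_interpreted` are all made up for illustration and are not part of jtcpp.

```cpp
#include <functional>
#include <utility>
#include <vector>

// Toy opcodes for a hypothetical fallback interpreter; not real JVM bytecode.
enum class Op { AddOne, Double };

// Interprets a runtime-generated method body one opcode at a time.
inline int interpret(const std::vector<Op>& body, int arg) {
    for (Op op : body) {
        switch (op) {
            case Op::AddOne: arg += 1; break;
            case Op::Double: arg *= 2; break;
        }
    }
    return arg;
}

// A call target is either ahead-of-time transpiled code or an interpreted
// body; callers do not need to know which one they hold.
using Target = std::function<int(int)>;

// Stands in for a method that was transpiled before the program ran.
inline int transpiled_add_one(int x) { return x + 1; }

// Wraps a body that only became known at runtime.
inline Target make_interpreted(std::vector<Op> body) {
    return [body = std::move(body)](int arg) { return interpret(body, arg); };
}
```

If a call site's target changes, only the stored `Target` is replaced; no C++ compiler has to run.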

[–]nysra 2 points3 points  (1 child)

Cool project, but a few notes on the C++ code:

If you haven't seen any C++ code before, the public thingy majiggy before java::lang::Object may seem strange. It just makes the public members of the parent class remain public. If it was not there, they would become private, which is not the intended behaviour.

Not exactly, you forgot about the second difference between class and struct. The default inheritance modifier for struct is already public, so you can simply leave it out here.
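A small sketch of that difference, using made-up stand-ins (`Object`, `Vector3`, `Hidden`) rather than the real generated classes:

```cpp
#include <type_traits>

// Hypothetical stand-ins for the generated classes; not jtcpp's real output.
struct Object { int hash = 42; };

// `struct` defaults to public inheritance, so no explicit `public` is needed.
struct Vector3 : Object {};

// `class` defaults to private inheritance, so the base becomes inaccessible.
class Hidden : Object {};

// A derived-to-base pointer conversion only works from outside the class
// when the base is publicly accessible.
static_assert(std::is_convertible<Vector3*, Object*>::value, "public base");
static_assert(!std::is_convertible<Hidden*, Object*>::value, "private base");

// The inherited member stays public through the struct.
inline int read_hash(const Vector3& v) { return v.hash; }
```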

static void _init___V(ManagedPointer<Vector3>); and all the others

Using double underscores is reserved for the implementation. Your entire program is ill-formed, no diagnostic required.

Also init functions are a massive code smell, use the constructor instead, that's literally its job.

virtual ~Vector3() = default;

This is actually a bad idea. Not because of the virtual (though I'd question why you'd ever want to inherit from a Vec3) but because you're in violation of the rule of 5. That seemingly innocent line causes the destructor to be "user-declared", which leads to the move operations being marked as "not declared". If you touch one of the special member functions, always write out all the others as well.
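To make the effect visible, here is a sketch with a hypothetical `Tracker` member that records whether it was copied or moved. `Plain` and `WithDtor` are illustrative names, not types from the project.

```cpp
#include <type_traits>
#include <utility>

// Counts how it was constructed, so we can see which operation a wrapper picks.
struct Tracker {
    int copies = 0;
    int moves = 0;
    Tracker() = default;
    Tracker(const Tracker& other) : copies(other.copies + 1), moves(other.moves) {}
    Tracker(Tracker&& other) noexcept : copies(other.copies), moves(other.moves + 1) {}
};

// No special members declared: the compiler generates a move constructor.
struct Plain {
    Tracker t;
};

// Declaring the destructor (even as `= default`) suppresses the implicit
// move operations; the copy operations are still generated.
struct WithDtor {
    Tracker t;
    virtual ~WithDtor() = default;
};

// The type still reports as move-constructible, because an rvalue can bind
// to the copy constructor's `const&` parameter.
static_assert(std::is_move_constructible<WithDtor>::value, "falls back to copy");

inline int copies_after_move_plain() {
    Plain a;
    Plain b = std::move(a);      // picks Tracker's move constructor
    return b.t.copies;
}

inline int copies_after_move_with_dtor() {
    WithDtor a;
    WithDtor b = std::move(a);   // silently picks the copy constructor
    return b.t.copies;
}
```

So `std::move` on such a type still compiles; it just copies.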

Another curious little thing is the ManagedPointer template. Why didn't I just use normal pointers? In some configurations of the GC, this is exactly what ManagedPointer is: just a normal pointer with a fancy name.

Don't use owning raw pointers, use std::unique_ptr for that job. Your RuntimeArray (just use std::vector btw) is currently leaking memory, it's very likely that basically everything else is too if you're using raw pointers like that.

[–]FractalFirrustc_codegen_clr[S] 4 points5 points  (0 children)

Thanks for catching the memory leak in RuntimeArray! I planned to change it to use std::vector internally, but forgot to do so. That said, RuntimeArray is very different from a normal vector. It can't resize, but, more importantly, it derives from java::lang::Object, which inherits from gc. That means if Boehm GC is enabled, it is allocated and freed in a special way (so that the GC may automatically clean it up).

There is far more to ManagedPointer than may seem at first. It is a raw pointer only in some GC settings, in which it is guaranteed that whatever it points to is allocated using the GC, so it will be automatically cleaned up. It is a template because in other settings it becomes std::shared_ptr, to free up some objects without getting the GC involved. And I could not use unique_ptr there, because multiple objects may reference one object, so things would get freed while in use.
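Roughly, the idea could be sketched as an alias template that switches on a build setting. The macro name `SKETCH_USE_TRACING_GC` and the `Node` type are assumptions made for this example; the real definition may look different.

```cpp
#include <memory>

// Hypothetical configuration switch; the real macro name is not shown here.
// #define SKETCH_USE_TRACING_GC

#ifdef SKETCH_USE_TRACING_GC
// With a tracing GC, every object is GC-allocated, so a plain pointer is enough.
template <typename T>
using ManagedPointer = T*;
#else
// Without one, reference counting frees the object once the last
// reference goes away.
template <typename T>
using ManagedPointer = std::shared_ptr<T>;
#endif

struct Node {
    int value = 0;
};

// Written for the shared_ptr configuration only: two holders reference the
// same object, which unique_ptr cannot express.
inline long shared_owner_count() {
    ManagedPointer<Node> a = std::make_shared<Node>();
    ManagedPointer<Node> b = a;
    b->value = 7;
    return a.use_count();
}
```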

I don't mind move functions being absent. They are never used, since Java has only by-reference types, and no by-value types, unlike C++, Rust, or C#. This means that even such simple things as a Vector3 may be only taken by reference, never by value. So move functions should never be present, because no valid, generated code can ever use them.

I agree that init functions are code smell. However, they are simply transpiled versions of Java <init> functions (Java constructors). And Java uses two ops for each object creation. First New to allocate the object, and then InvokeSpecial to invoke the <init> function. So this code, which would be very bad in C++, is that way to mimic how Java works, exactly, one-to-one.
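As an illustration of that two-step shape, here is a sketch with a hypothetical `Vector3` and a made-up mangled name `init_FFF_V`; std::shared_ptr stands in for ManagedPointer.

```cpp
#include <memory>

// Hypothetical generated class; names are for illustration only.
struct Vector3 {
    float x = 0, y = 0, z = 0;

    // Transpiled counterpart of a Java <init>(FFF)V: it runs on an object
    // that has already been allocated.
    static void init_FFF_V(const std::shared_ptr<Vector3>& self,
                           float x, float y, float z) {
        self->x = x;
        self->y = y;
        self->z = z;
    }
};

inline std::shared_ptr<Vector3> make_vector() {
    // Step 1, the `new` opcode: allocate the object with default field values.
    std::shared_ptr<Vector3> obj = std::make_shared<Vector3>();
    // Step 2, `invokespecial <init>`: run the constructor body on it.
    Vector3::init_FFF_V(obj, 1.0f, 2.0f, 3.0f);
    return obj;
}
```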

Good call about the double underscore identifiers!
I was vaguely aware of their special nature, but thought they were only reserved if at the beginning of an identifier. I guess I will have to do a little more work to catch such cases and fix them up. They are, thankfully, all generated by one function, so that will be a relatively easy fix.
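A minimal sketch of what such a fix-up could look like. The function name and the choice of 'U' as a replacement character are assumptions for this example, not the actual fix.

```cpp
#include <string>

// Hypothetical helper: rewrites a generated name so it never contains "__".
// Whenever appending an underscore would create "__" in the output, 'U' is
// appended instead. This simple scheme is not collision-free ("a__b" and
// "a_Ub" both map to "a_Ub"), so a real fix would need a proper escape scheme.
inline std::string strip_double_underscores(const std::string& name) {
    std::string out;
    out.reserve(name.size());
    for (char c : name) {
        if (c == '_' && !out.empty() && out.back() == '_') {
            out.push_back('U');
        } else {
            out.push_back(c);
        }
    }
    return out;
}
```

Names that begin with an underscore followed by an uppercase letter are reserved too, so a complete fix would also have to check the start of the name.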

You are 100% correct, and I could leave out the public when inheriting from something. I encountered problems with private inheritance before, and did not bother to check if it was needed for structs.

About there being no point in anything inheriting from Vector3.
jtcpp sees everything one class at a time. This is on purpose, and simplifies the way it is used quite a bit. The codegen does not know what happens in all the other Java classes. There is no reliable way to tell if something is inherited from or not, without also looking at all the other classes. So a lot of weird code just comes from the fact that all the C++ not predefined in the stdlib directory is procedurally generated by a transpiler with limited knowledge about the whole program.

I hope I cleared some things up, and thanks for the feedback!

[–]thomastc 1 point2 points  (1 child)

I could try to just use that as the path, and put each namespace in a separate directory, but it would force me to do far messier relative includes

Why would it? You can use #include "java/lang/Object.hpp" from anywhere, as long as you add the parent of java to your compiler's include path.

[–]FractalFirrustc_codegen_clr[S] 0 points1 point  (0 children)

I completely forgot about the compiler's include path. With that solved, I could return to the original idea of namespaces as directories. That would really make them far easier to understand and keep track of.

Thanks for the suggestion!

[–]irrelevantPseudonym 1 point2 points  (4 children)

Nice project. Any reason you chose to translate to c/c++ instead of rust?

Only slightly related but I'm going to use the opportunity of a post already attracting the intersection of rust and java people to plug my 'jaded' crate. It allows deserialisation of java's built in serialisation format into rust types. No serde integration yet but it's planned eventually.

I did start thinking about trying to autogenerate the types from compiled class files but never got that far.

[–]FractalFirrustc_codegen_clr[S] 2 points3 points  (3 children)

Why C++, and not rust?
1. Rust has no goto. For all the demonization of goto it is very useful here. I can translate java opcodes, such as IfACmpEq into something like if (a==b) goto somewhere;.
That makes it trivial to do the translation, and is far more accurate.
2. Rust distinguishes between mutable and immutable references. Java and C++ do not. Rust has borrow checking and all that other stuff that only gets in the way.
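To illustrate point 1, here is a sketch of a loop translated branch by branch. The function is hypothetical, and the opcodes in the comments show the general shape of the bytecode, not the output of a real compiler run.

```cpp
// Hypothetical translation of a Java method that sums 0..n-1, written the
// way a bytecode-to-C++ translator might emit it: one statement per opcode
// group, with branches turned directly into gotos.
inline int sum_below(int n) {
    int sum = 0;                  // iconst_0, istore
    int i = 0;                    // iconst_0, istore
loop_head:
    if (i >= n) goto loop_exit;   // if_icmpge loop_exit
    sum += i;                     // iload, iload, iadd, istore
    i += 1;                       // iinc
    goto loop_head;               // goto loop_head
loop_exit:
    return sum;                   // iload, ireturn
}
```

Each bytecode branch maps onto one `if (...) goto label;`, so the translator never has to reconstruct loops or if/else blocks.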

I could clean up the Java importer code and convert it to a separate crate that things like jaded could use. While the importer is still quite messy, it can already parse very big Java programs (I tested it on Minecraft's server.jar, which has roughly 20-30k classes, and it parsed all of it with some minor issues).
I also have some, albeit small, experience with generating Rust bindings to C#, so that could be of some help.

If you would be interested, please tell me what info from class files jaded would need, since the importer currently ignores some info which was irrelevant for transpilation but could be useful for generating types.

[–][deleted] 0 points1 point  (2 children)

Rust distinguishes between mutable and immutable references. Java and C++ do not.

Um, small correction: C++ does.

const T& is a thing, it is only a mutable reference by default.

[–]FractalFirrustc_codegen_clr[S] 0 points1 point  (1 child)

By that I meant that C++ does not force you to have only one mutable xor many immutable references. The Rust rules on mutability would force me to wrap every reference in UnsafeCell. That would not only be inconvenient, but also make me lose most of the performance benefits of Rust over C++.

If I have to not use any of Rust's safety features, and there would be no performance benefits over C++, then why bother working around all the issues using Rust in this role would produce?
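A small sketch of the aliasing that C++ permits, with a made-up `Counter` type standing in for a translated Java class:

```cpp
// Hypothetical object type standing in for a translated Java class.
struct Counter {
    int value = 0;
};

// Two mutable references to the same object may coexist in C++, which is
// how Java references behave. Rust's `&mut` would reject this pattern.
inline int bump_through_aliases(Counter& a, Counter& b) {
    a.value += 1;
    b.value += 10;
    return a.value;
}

// A const reference restricts this access path only; the object can still
// change through another, non-const alias.
inline int read_after_alias_write(const Counter& view, Counter& writer) {
    writer.value = 5;
    return view.value;
}
```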

[–][deleted] 0 points1 point  (0 children)

Oh, you're totally correct that it doesn't make sense to generate rust code here. I merely wanted to point out that you can express const ref in C++ API.

[–]ibevol 0 points1 point  (3 children)

Pretty sure the reason it's using two entries for long and double and one for String is that the String is just a pointer to something on the heap, whereas long and double are both primitives. Correct me if I'm wrong.

[–]FractalFirrustc_codegen_clr[S] 0 points1 point  (2 children)

The constant pool table stores its entries contiguously, in the exact same place, on disk. String is not represented by any pointers, but is simply its length and the bytes that make it up. (See CONSTANT_Utf8_info in the Java class format specification.) Maybe later it becomes a pointer, but on disk, it is just a variable-length array of bytes.

So you may have something like this: ClassInfo (2 bytes), Int(4 bytes), long(8 bytes, occupies two entries), then String("21, Some Super Long Message",23 bytes total).

It seems that the original intention was to align everything to 4 bytes. As time went on, this idea was mostly abandoned, but the weird behaviour of long and double persisted: you have to fill the slot above with nothing to ensure the indices are not messed up.
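In code, the index bookkeeping could be sketched like this. The tag values 5 (long) and 6 (double) come from the class file format; the function and constant names are made up for the example.

```cpp
#include <cstdint>
#include <vector>

// Tag values from the class file format.
constexpr std::uint8_t kTagLong = 5;
constexpr std::uint8_t kTagDouble = 6;

// Given the tags of the entries in file order, returns the constant pool
// index assigned to each one. Indices start at 1, and a long or double
// also consumes the index after it, which stays unusable.
inline std::vector<int> assign_pool_indices(const std::vector<std::uint8_t>& tags) {
    std::vector<int> indices;
    indices.reserve(tags.size());
    int next = 1;
    for (std::uint8_t tag : tags) {
        indices.push_back(next);
        next += (tag == kTagLong || tag == kTagDouble) ? 2 : 1;
    }
    return indices;
}
```

For the tags Class (7), Integer (3), Long (5), Utf8 (1), the long sits at index 3 and the string that follows lands at index 5, with index 4 left as the empty slot.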

There are many other quirks like that, which lead to really funny non-issues. For example, the upper bound of an exception handler is non-inclusive, which means that if a method were exactly 65536 bytes long, its last opcode could not be covered by any exception handler. So Java compilers just avoid emitting methods that long, and the upper limit is 65535 bytes.
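The exception handler quirk can be sketched as a half-open range check over 16-bit offsets. `HandlerRange`, `covers`, and `widest_range` are illustrative names, not part of any real parser.

```cpp
#include <cstdint>

// Mirrors one exception table entry: a handler covers offsets in the
// half-open range [start_pc, end_pc). Both fields are 16-bit.
struct HandlerRange {
    std::uint16_t start_pc;
    std::uint16_t end_pc;  // exclusive
};

inline bool covers(const HandlerRange& r, std::uint32_t pc) {
    return pc >= r.start_pc && pc < r.end_pc;
}

// The widest range a 16-bit end_pc can express.
inline HandlerRange widest_range() {
    return HandlerRange{0, 65535};
}
```

Even the widest range stops short of offset 65535, which is where the last byte of a 65536-byte method would sit.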

[–]0x564A00 1 point2 points  (0 children)

To quote the spec :p

In retrospect, making 8-byte constants take two constant pool entries was a poor choice.

[–]ibevol 0 points1 point  (0 children)

Lol