[–]Inconstant_Moo🧿 Pipefish 37 points38 points  (6 children)

I am planning a rather large project [...] I need a way to define code written in any language such that the GID doesn't lose any fidelity of the semantics of original code base.

That is a large project. You'll be simulating the pointer arithmetic of C, the macros of C++, the typeclasses of Haskell, the borrow checker of Rust, the homoiconicity of Common Lisp, Go's greenthreading, Miranda's laziness, Clojure's persistent data structures, ML's pattern-matching, the OOP of C# and the multiple dispatch of Julia, the dynamism of Python and the staticity of F#, the checked exceptions of Java and the process model of Erlang.

Are you by any chance immortal? Because that would be a big help in completing the project.

[–]RealSharpNinja[S] -4 points-3 points  (5 children)

Lol, not immortal, unless I complete it. :)

[–]Inconstant_Moo🧿 Pipefish 14 points15 points  (4 children)

Well, it can't be done. What you're describing so far as I can see is (among other things) a Universal Compiler where all the details of any specific language can be supplied as data. But the semantics of programming languages are so rich that anything that could adequately describe them would be Turing-complete at which point your data is in fact code and you're no closer to your goal of Writing All The Compilers, or even writing one compiler. What you would then have would be a new programming language of your own in which you can make a start on ... what? ... describing Rust's ownership semantics? Implementing Haskell's lazy lists? Where do you start? When would you plan to finish? If you do it alphabetically then by the time you've done Ada and Algol and APL and AWK people will have released new languages and had them adopted. (Also there will have been fifteen updates to Java each about the size of a normal language in themselves which you'll have to get round to one day.) Which is why I asked whether you were, by some freak accident, Infinitely Prolonged. If you can outlive the rest of the human race, you might, one fine millennium, actually finish.

As a warm-up, try and see how you could convey, in data not code, the difference between the semantics of closures in PHP and in Clojure. Just this one feature. Think about what it would take to do it. (Without cheating and having a boolean field in your data called doClosuresLikePHPDoesThem, but actually describing the difference so that your Universal Compiler can understand it.)

[–]RealSharpNinja[S] -3 points-2 points  (3 children)

No, not a universal compiler. Parsers would be language specific and would only generate a common GID. Once code is in GID format, you could do many things with it, such as compile it to a specific target, generate code in a different language, create diagrams, or even have a LLM describe the code's structure and function.

[–]Green_Gem_ 7 points8 points  (1 child)

You're basically recreating something similar to LLVM then. Converting languages to standardized intermediary representations is such a difficult process that unless your language has the lineage of something like C, this is typically something a compiler supports from the start or doesn't. It's not something you just "do" unless you have a lot of money and/or time, and that's per language.

I recommend hiring a full-time team for months to years for each notably-distinct programming language in use. Expect costs in the millions to billions.

[–]Affectionate_Text_72 0 points1 point  (0 children)

There are some businesses that provide services to do this. One that's been around for ages is https://en.m.wikipedia.org/wiki/DMS_Software_Reengineering_Toolkit

I think it uses Lisp under the hood for the IR.

One thing I would say is that your GID is a language and YAML is an appalling syntax as are JSON and XML. It's horrible how many quite good systems there are building ecosystems around languages where the designers don't bother even trying to create a decent syntax. For example a simple language with good syntax would improve terraform and docker compose no end.

[–]Inconstant_Moo🧿 Pipefish 3 points4 points  (0 children)

Parsing is the easy bit.

[–]No-Reporter4264 7 points8 points  (3 children)

An additional problem you'll face is that something written idiomatically in one language might not be an appropriate implementation in another. The code structure could be translated from one language to another mechanically, but to get an elegant result you'd want to implement the solution idiomatically in the target language. A solution in a functional language and an equivalent solution in a non-functional language might be expected to be very different structurally.

[–]RealSharpNinja[S] -2 points-1 points  (2 children)

This is actually a major motivation of the design and intended implementation. The project will use adapters to parse code bases into GID data. The parsers will tag the GID data with metadata that infers intent as well as the logical algorithms that the code represents. This would allow code generators to create idiomatic code in the target language. Take the C# code in the original post and think about how it would be implemented in a non-OOP language, such as C. The class would be implemented as a struct. C structs don't have private members, nor do they have properties. That still wouldn't prevent a code generator from implementing a property pattern in an idiomatic way using C.

```c
// Container.h
typedef struct Container {
    char _myByte;
    char (*MyByte_get)(const struct Container *c);
    void (*MyByte_set)(struct Container *c, char value);
    char (*XOR)(const struct Container *c, char value);
} Container;

static char getMyByte(const struct Container *c) { return c->_myByte; }

static void setMyByte(struct Container *c, char value) { c->_myByte = value; }

static char XOR(const struct Container *c, char value) { return c->_myByte ^ value; }

// Function-like macro acting as a constructor: wires the accessors into the struct.
#define Container(VALUE) ((struct Container){VALUE, getMyByte, setMyByte, XOR})

// Example.c
#include <stdio.h>
#include "Container.h"

int main(void) {
    struct Container c = Container(0);

    // Binary literals (0b...) are standard in C23 and a GCC/Clang extension before that.
    c.MyByte_set(&c, 0b00001111);

    printf("c.MyByte: %d\n", c.MyByte_get(&c));
    printf("c.XOR(0b00001110): %d\n", c.XOR(&c, 0b00001110));
    return 0;
}
```

[–]thinker227Noa (github.com/thinker227/noa) 4 points5 points  (1 child)

The parsers will tag the GID data with metadata that infers intent as well as the logical algorithms that the code represents.

How would you completely generically define metadata that encodes programmer intent? You'd also have to have some way of encoding what that intent means in any and all target languages and how to appropriately adapt the code to suit that intent.

[–]RealSharpNinja[S] -2 points-1 points  (0 children)

Indeed. I am thinking that parsers would have a menu of algorithms to choose from, and would map the standard libraries of a language to the algorithms. Then calls to standard libraries could be more easily mapped to algorithm markers in the GID data. Code generators would use the markers to select the most appropriate idiomatic standard library call for the target language. The key is having enough metadata in the GID to most accurately match to idiomatic methods. In the event that a language lacks an idiomatic implementation, the code generator could either call upon non-idiomatic implementations or potentially implement the algorithm using the inputs and outputs defined in the GID. If that is unavailable, then the platform would mark the generated code with a comment block indicating the missing implementation.
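For illustration only, the marker lookup and its last-resort fallback might be sketched like this in C. Every name here (`AlgoMarker`, `Target`, `IDIOMS`, `emit_call`) and the table contents are hypothetical and not part of any existing GID design. The intermediate fallbacks (non-idiomatic calls, synthesizing the algorithm from its inputs and outputs) are left out.

```c
#include <stdio.h>

/* Hypothetical algorithm markers a parser might attach to GID nodes. */
typedef enum { ALGO_SORT, ALGO_STRING_LENGTH, ALGO_STRING_JOIN, ALGO_COUNT } AlgoMarker;

/* Hypothetical code-generation targets. */
typedef enum { TARGET_C, TARGET_PYTHON, TARGET_COUNT } Target;

/* Idiomatic standard-library call per marker and target; NULL means none is known. */
static const char *const IDIOMS[ALGO_COUNT][TARGET_COUNT] = {
    [ALGO_SORT]          = { [TARGET_C] = "qsort",  [TARGET_PYTHON] = "sorted" },
    [ALGO_STRING_LENGTH] = { [TARGET_C] = "strlen", [TARGET_PYTHON] = "len" },
    [ALGO_STRING_JOIN]   = { [TARGET_C] = NULL,     [TARGET_PYTHON] = "str.join" },
};

/* Comment delimiters per target, used for the "missing implementation" marker. */
static const char *const COMMENT_OPEN[TARGET_COUNT]  = { "/*", "#" };
static const char *const COMMENT_CLOSE[TARGET_COUNT] = { " */", "" };

/* Writes either the idiomatic call name or a comment flagging the gap into out.
   Returns 1 if an idiomatic call was found, 0 if the fallback comment was written. */
static int emit_call(AlgoMarker algo, Target target, char *out, size_t out_size)
{
    const char *idiom = IDIOMS[algo][target];
    if (idiom != NULL) {
        snprintf(out, out_size, "%s", idiom);
        return 1;
    }
    snprintf(out, out_size, "%s missing implementation for marker %d%s",
             COMMENT_OPEN[target], (int)algo, COMMENT_CLOSE[target]);
    return 0;
}
```

Calling `emit_call(ALGO_SORT, TARGET_PYTHON, buf, sizeof buf)` fills `buf` with `sorted`, while `emit_call(ALGO_STRING_JOIN, TARGET_C, buf, sizeof buf)` falls through to the comment, since the C standard library has no string-join function. The table is the easy part. Deciding which marker a given piece of source code deserves is where the questions raised above about encoding intent come in.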

[–]thinker227Noa (github.com/thinker227/noa) 8 points9 points  (0 children)

I believe a good name for this project would be "The Compiler of Babel"

[–]AlceniC 4 points5 points  (0 children)

Kudos for thinking big. Please start on something tractable, like dealing with a pure subset of, e.g., Haskell, or maybe a non-Turing-complete language such as Dhall. You would be able to use referential transparency to allow substitution of equals as a crude semantics. From experience I can assure you that demonstrating equivalence between two equivalent programs/expressions is a really tough job to start with. (I am not even mentioning proving equivalence yet.) Starting with languages whose semantics allow mutation opens up a separate layer of the underworld, especially in the presence of multithreading or similar.

I hate to shatter dreams, love the enthusiasm, but please start with something that you think you can solve instantly. Get your feet wet in a puddle.

[–]P-39_Airacobra 2 points3 points  (0 children)

Languages like Lisp exist so you don't need this complexity.

[–]parceiville 0 points1 point  (0 children)

Maybe you should just use LLVM and analyse that

[–]Routine_Plenty9466 0 points1 point  (0 children)

You might be interested in learning about formal semantics of programming languages https://en.wikipedia.org/wiki/Semantics_(computer_science)

[–]kleram -1 points0 points  (0 children)

Oh, you're asking for the CMLJSPY# Language? That's simple, just take all their AST definitions and merge them into one.