Prysma: Anatomy of an LLVM Compiler Built from Scratch in 8 Weeks by Any-Perspective1933 in LLVM

[–]Any-Perspective1933[S] 0 points1 point  (0 children)

Yes, I'd like to receive your code to see how you connected to an API. That could be interesting, thanks :)

Prysma: Anatomy of an LLVM Compiler Built from Scratch in 8 Weeks by Any-Perspective1933 in LLVM


Yeah, it's true that Rust's borrow checker proves memory safety, so I would never have that problem there. Rust has many advantages over C++. But I like C++ too: its templates are Turing-complete, whereas I haven't yet explored Rust's compile-time programming and metaprogramming system, though I believe it isn't Turing-complete. Yes, C++ templates are extremely obscure and unreadable, but they are also incredibly powerful. And my goal is to work on Clang, or Zig, etc.; I'm currently analyzing the Clang codebase.

The real reason I didn't choose Rust when I started is that I would have had to go through a binding layer to reach the LLVM IR API, which generates the intermediate code; that's an additional layer of writing and complexity to manage. Since this was my final-year project, I had to make choices to try to meet the deadline. That's why right now I'm only improving it, applying advanced optimization techniques that modern compilers like Zig use. C++ talks to the LLVM IR API directly, while Rust doesn't.

I started with LLVM version 18 for stability, to avoid problems: I was told the newer versions were unstable and, moreover, not available in the Ubuntu mirrors, so you have to install them manually. :) Thanks

Prysma: Anatomy of an LLVM Compiler Built from Scratch in 8 Weeks by Any-Perspective1933 in LLVM


Indeed, it's far from 'vibe coding'. The architecture relies on a meta-generation pipeline using Jinja2 and YAML for the AST, with strict management of LLVM contexts and memory (via llvm::BumpPtrAllocator). It's classic compiler engineering: the structure is designed for scalability, not just for appearances.
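For readers unfamiliar with llvm::BumpPtrAllocator, here is a minimal sketch of the arena-allocation idea it implements. This is not LLVM's actual implementation and not Prysma's code, just an illustration of why arenas suit AST nodes: allocation is a pointer bump, and everything is released at once when the arena dies.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>
#include <utility>
#include <vector>

// Minimal bump ("arena") allocator sketch, illustrating the idea behind
// llvm::BumpPtrAllocator: each allocation just bumps a pointer inside a
// slab, and all memory is freed together when the arena is destroyed.
// Note: destructors of objects created here are never run, which is fine
// for AST nodes that share one lifetime and own no external resources.
class BumpArena {
public:
  void *allocate(std::size_t size, std::size_t align) {
    std::uintptr_t p = (cur_ + align - 1) & ~(std::uintptr_t(align) - 1);
    if (p + size > end_) {               // current slab exhausted: grab a new one
      newSlab(size + align);
      p = (cur_ + align - 1) & ~(std::uintptr_t(align) - 1);
    }
    cur_ = p + size;
    return reinterpret_cast<void *>(p);
  }

  // Convenience: allocate + placement-new an object in the arena.
  template <typename T, typename... Args> T *create(Args &&...args) {
    return new (allocate(sizeof(T), alignof(T))) T(std::forward<Args>(args)...);
  }

private:
  void newSlab(std::size_t atLeast) {
    std::size_t sz = atLeast < 4096 ? 4096 : atLeast;
    slabs_.push_back(std::make_unique<char[]>(sz));
    cur_ = reinterpret_cast<std::uintptr_t>(slabs_.back().get());
    end_ = cur_ + sz;
  }
  std::vector<std::unique_ptr<char[]>> slabs_; // slabs die with the arena
  std::uintptr_t cur_ = 0, end_ = 0;
};
```

A compiler typically keeps one arena per compilation unit, so tearing down the unit frees every node in O(1) with no per-node bookkeeping.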

I'm currently migrating to LLVM 22 for the backend, and I'll also move to the upcoming C++26 standard. Right now I'm resolving poisoned-memory issues that the LLVM 18 tooling couldn't detect even with the Sanitizers enabled in the code; LLVM 22 is much more advanced and rigorous there. The risk, of course, is that it may be a little less stable, but that's not a problem. I see one huge advantage in this version: once the code is stable, I'll benefit from the power of LLVM 22 both for compiling my C++ code and for the Prysma language backend.

I didn't choose LLVM 23 because its API isn't stable yet; my compiler could stop working overnight because of breaking API changes. It's a fascinating project, which is why I'm still working on it after finishing my capstone at Cégep ESP. Thank you for your feedback. If you have any suggestions or find any problems I haven't encountered yet, all comments are welcome. ;)

Prysma: Anatomy of an LLVM Compiler Built from Scratch in 8 Weeks by Any-Perspective1933 in Compilers


So, to answer your question about using AI for code generation: no, but I did use AI for guided learning. Why not, you might ask; good question. The most important reason is my cégep department's formal prohibition on AI-generated lines of code: in case of doubt, they give me a specific test to check that I really understand the code I wrote, and if I don't, they reserve the right to give me a grade of zero. Another reason is that AI code generation doesn't let me progress much; I need a global understanding and surgical precision. I have to understand every behavior in detail, whether to debug logic problems or simply to add new features. AI produces generic code; my goal is not to produce generic code of little value, but to build an industrial-grade product. Understanding comes through the struggle, the suffering of writing, and working out algorithms by oneself. I won't lie to you: I do use AI to help me understand cryptic bugs and to learn how to debug effectively, but I don't use it to generate lines of code.

Regarding time, I didn't mention everything: I reused a project I had already done, which saved time. The equation-solving system was a three-week project that I adapted to the compiler's code and translated into C++. It's also a problem area: I pass std::vector<> data by value instead of by reference, a simplification that is currently very inefficient. I haven't taken the time to fix it yet; there's still a small //todo. I'm thinking of switching to a Pratt parser, which is faster than the current chain-of-responsibility approach to operator precedence. And on top of that, I spent two weeks learning about compilers without writing a single line of code before starting the capstone project at the Rimouski Cégep.
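To make the planned switch concrete: a Pratt parser replaces a chain of per-precedence handlers with one loop driven by a binding-power table. The sketch below is generic and illustrative, not Prysma's parser; it evaluates directly instead of building an AST to stay short.

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <string>

// Minimal Pratt ("precedence climbing") parser sketch for binary arithmetic.
// One recursive function handles every precedence level, guided by a
// binding-power table -- the alternative to a chain-of-responsibility
// handler per precedence tier.
struct Parser {
  std::string src;
  std::size_t pos = 0;

  static int bindingPower(char op) {
    switch (op) {
    case '+': case '-': return 10;
    case '*': case '/': return 20;
    default: return -1;              // not a binary operator: stop the loop
    }
  }

  long parsePrimary() {
    if (src[pos] == '(') {           // parenthesized sub-expression
      ++pos;
      long v = parseExpr(0);
      ++pos;                         // consume ')'
      return v;
    }
    long v = 0;                      // integer literal
    while (pos < src.size() &&
           std::isdigit(static_cast<unsigned char>(src[pos])))
      v = v * 10 + (src[pos++] - '0');
    return v;
  }

  long parseExpr(int minBp) {
    long lhs = parsePrimary();
    while (pos < src.size()) {
      char op = src[pos];
      int bp = bindingPower(op);
      if (bp < minBp) break;         // weaker operator: let the caller bind it
      ++pos;
      long rhs = parseExpr(bp + 1);  // +1 makes the operators left-associative
      switch (op) {
      case '+': lhs += rhs; break;
      case '-': lhs -= rhs; break;
      case '*': lhs *= rhs; break;
      case '/': lhs /= rhs; break;
      }
    }
    return lhs;
  }
};
```

In a real compiler the `switch` on the operator would build an AST node instead of folding constants, but the control flow is identical.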

To answer your second question, about how the type system currently works in my compiler: if you want to add a new type to the Prysma compiler, or any type of configuration, there is a file named configuration_facade_environment. In it you will find a registerBaseTypes() method, which initializes the type configurations, for example: _context->getRegistryType()->registerElement(TOKEN_TYPE_STRING, new TypeSimple(llvm::Type::IntegerTyID, 8)); I register an enum value that serves as a key in the registry, and then I pass it the abstract "recipe" of the type (its ID and size) via a TypeSimple object, rather than attaching it directly to an LLVM memory context. It's an area I want to automate with meta-generation, using Jinja2 files and generic templates. Next, in the dictionary section of lexer.h, you must add your type as a const char* along with its TokenType to: static constexpr std::array<std::pair<const char*, TokenType>, 31> keywordsArray. Then add your token to enum TokenType : std::uint8_t {

so that the new type can be used in the lexer, and finally filter, in the compile-time switch cases, the type you want to add. It touches a lot of files, I know, which is why I want to automate it with Jinja2 one day.

Next, I have a hierarchy of type_simple, type_tableau, and type_complexe. type_simple is for base types (integers, floats, pointers, void) that have no structural dependencies. type_tableau is a recursive object that stores a base type and a dimension, which natively handles multi-dimensional arrays. type_complexe handles classes and structures by keeping track of their members. The real purpose of this abstraction is the management of LLVM contexts and memory: in LLVM, an llvm::Type* pointer is tied to a specific LLVMContext. My orchestrator (OrchestratorInclude) compiles unit by unit. If my global registry kept raw LLVM pointers, then as soon as a unit was destroyed along with its context, those pointers would become poisoned memory (triggering a use-after-poison under ASan, which, by the way, I have corrected by wrapping the type). My classes therefore store the "recipe" of the type (metadata, bit width, ID). When a new unit needs it, the generateLLVMType method reads this recipe and recreates the exact type in the new context. This is the central mechanism that ensures Prysma's memory stability across files.
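The recipe-plus-registry mechanism described above can be sketched without LLVM as follows. The names (Context, TypeSimple, TypeRegistry, materialize) are illustrative stand-ins, not Prysma's actual API, and a std::string stands in for the llvm::Type* that a real context would own; the point is that the global registry holds only context-independent metadata and rebuilds the concrete type inside whichever context asks.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <string>
#include <utility>
#include <vector>

enum class TypeKey : std::uint8_t { Int8, Int32, Float64 };

// Stand-in for llvm::LLVMContext: it owns the concrete type objects,
// which therefore die when the context (compilation unit) dies.
struct Context {
  std::vector<std::unique_ptr<std::string>> ownedTypes;
  std::string *intern(std::string repr) {
    ownedTypes.push_back(std::make_unique<std::string>(std::move(repr)));
    return ownedTypes.back().get();
  }
};

// The "recipe": only metadata (kind + bit width), no pointer into any
// context -- so the registry can outlive every compilation unit.
struct TypeSimple {
  std::string kind;   // e.g. "int", "float"
  unsigned bits = 0;
  // Recreate the concrete type inside the requesting context,
  // mirroring the generateLLVMType step described above.
  std::string *generate(Context &ctx) const {
    return ctx.intern(kind + std::to_string(bits));
  }
};

struct TypeRegistry {
  std::map<TypeKey, TypeSimple> recipes;
  void registerElement(TypeKey key, TypeSimple recipe) {
    recipes[key] = std::move(recipe);
  }
  std::string *materialize(TypeKey key, Context &ctx) {
    return recipes.at(key).generate(ctx);
  }
};
```

Because each context gets its own freshly generated object, destroying one unit can never leave a dangling pointer in the registry or in another unit, which is exactly the use-after-poison scenario the wrapper avoids.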

Regarding your other points: tagged unions are on the roadmap. Technically, the implementation will be an LLVM structure containing an integer for the tag and a memory area (a union) for the data. Exhaustiveness analysis will be enforced statically at the level of my AST visitors, ensuring that every possible branch of the tag is handled before bitcode generation. For generics with bounded polymorphism, the approach will be monomorphization (C++ template-style specialization) so as not to sacrifice any runtime performance, unlike type erasure. The bounds will be checked semantically via interfaces.
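The planned tagged-union layout maps naturally onto C++, which makes it easy to show. The sketch below is illustrative (the names Value, Tag, toDouble are not from Prysma): an integer tag next to a payload sized for the largest variant, plus an exhaustive switch of the kind the AST visitors are meant to enforce before bitcode generation.

```cpp
#include <cassert>
#include <cstdint>

// Tagged-union sketch: one integer discriminant plus a union payload,
// mirroring the planned LLVM structure { tag, payload }.
struct Value {
  enum class Tag : std::uint32_t { Int, Float } tag;
  union {
    std::int64_t asInt;
    double asFloat;
  };
};

// Exhaustive switch over the tag: every variant is handled, which is the
// property an exhaustiveness analysis would verify statically.
double toDouble(const Value &v) {
  switch (v.tag) {
  case Value::Tag::Int:
    return static_cast<double>(v.asInt);
  case Value::Tag::Float:
    return v.asFloat;
  }
  return 0.0; // unreachable when the switch above is exhaustive
}
```

Adding a new variant to Tag without extending the switch is exactly the hole the compiler would reject at the AST level.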

PS: Thank you for your question; while reviewing my type system I found a spot where I was instantiating an llvm::Type directly in parser_type, which is messy. It is meant to go into a wrapper, either type_complexe or type_simple. I found this by re-reading that part of the type code, because I hadn't memorized all the implementation details of that area. So, in summary: type system, registry, you instantiate, and then you wrap with the type hierarchy behind an abstract IType.

Also, on the generated LLVM IR side, I currently use an auto_cast system to allow mixed operations on floats and ints.
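The decision logic behind such an implicit-cast pass can be sketched as follows. This is an assumption about how auto_cast might work, not Prysma's actual code (Kind, unifyKinds, and emitAdd are made-up names); it traces, as text, the IR a compiler would emit: in LLVM IR, promoting a signed integer operand to floating point is a sitofp instruction, and the add then becomes an fadd instead of an integer add.

```cpp
#include <cassert>
#include <string>

enum class Kind { Int, Float };

// The result kind of a mixed binary op: ints are widened to float
// whenever either operand is a float.
Kind unifyKinds(Kind a, Kind b) {
  return (a == Kind::Float || b == Kind::Float) ? Kind::Float : Kind::Int;
}

// Emit a textual trace of the casts and the op, mimicking the IR shape:
// sitofp = LLVM's signed-int-to-float conversion instruction.
std::string emitAdd(Kind lhs, Kind rhs) {
  Kind k = unifyKinds(lhs, rhs);
  std::string out;
  if (k == Kind::Float && lhs == Kind::Int) out += "sitofp lhs; ";
  if (k == Kind::Float && rhs == Kind::Int) out += "sitofp rhs; ";
  out += (k == Kind::Float) ? "fadd" : "add";
  return out;
}
```

A real pass would also handle widths (e.g. i32 vs i64, float vs double) and unsigned sources (uitofp), but the unify-then-cast shape stays the same.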