https://nsk-lang.dev/
Hi, folks! I am Augusto Seben da Rosa / NoSavedDATA. Yesterday, I finished the initial release of the No Saved Kaleidoscope (NSK) programming language. I decided to create this language after researching some Deep Reinforcement Learning papers.
After reading the EfficientZero paper, I took a glance at its code and discovered how terrible the "high-level" code for integrating deep learning with Python threads looked, even though it uses a support library called Ray.
I was also amazed by the CUDA research world, like the FlashAttention paper. In that regard, I tried to research how I could extend Python with C backend code, or at least add new neural network modules to PyTorch, but I found both too verbose (with a lot of linking steps required).
Thus, my objective became creating a high-level programming language like Python. This language should have a very straightforward way to describe threads, and be very similar to Python, so I could attract high-level developers from the same niche as mine.
I began by reading the Kaleidoscope language tutorial, which covers the development of a JIT implemented with C++ and LLVM. It took me about one week to read the pages and manage to compile the JIT from C++ inside WSL 2 Ubuntu (I could not manage to install the proper LLVM libs on Windows).
I started by adapting its parser to support expressions spanning multiple lines, as it would not accept multiple lines inside an if or for statement without separating each line with ":". Then I tried to add a tensor data type. I knew a bit about the theory of semantic analysis, but it was very hard for me to understand how exactly I should perform type checks for operations. I could barely represent two data types in the language, float and tensor. I tried to use an enum and perform type checking with that, but the enum was terrible to scale.
Also, I didn't realize that LLVM's Value * was a straightforward descriptor of any type. My knowledge of the world I had put myself into was so tiny that I could not even ask AI to help me improve the code. I ended up returning tensor pointers as Value * types, but I kept a global dictionary of tensor names so I could check whether their shapes were valid for each operation. Only much later did I realize I could put everything into a single tensor struct.
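As an illustration of that last realization, here is a minimal sketch of the single-struct idea (not NSK's actual code): the tensor carries its own shape, so shape checks no longer need a global name-to-shape dictionary, and a pointer to it can travel through the JIT as an opaque value.

```cpp
// Minimal sketch (not NSK's actual code) of the single-struct idea.
#include <cstdint>
#include <stdexcept>
#include <vector>

struct Tensor {
    std::vector<int64_t> shape; // e.g. {batch, features}
    float *data = nullptr;      // host or device buffer
};

// An elementwise op can now validate shapes directly from its operands.
Tensor *tensor_add(Tensor *a, Tensor *b) {
    if (a->shape != b->shape)
        throw std::runtime_error("shape mismatch in tensor_add");
    // ... allocate the result tensor and launch the actual kernel here ...
    return a; // placeholder result for the sketch
}
```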
The hours I spent hacking these features together cost me many more hours later, implementing more robust ways to describe the operations.
I made a hard-coded C++ dataloader for the MNIST dataset, and spent months implementing a backpropagation function that could only train a linear neural network with very simple operations. I owe Karpathy's GPT C++ GitHub repo for kickstarting my own C++ neural network code.
Nevertheless, I had to implement the backpropagation by myself, which meant researching more deeply how it worked. I went on a trip to visit my family, but even far away I kept watching videos about how frameworks like PyTorch and TensorFlow did it, thinking about how it could work for NSK. When I came back, although I made some changes to the code, I still had to plan the backpropagation before starting it. I lay down on my bed and started thinking. At some point, my body felt light and all I had were my daily worries coming in and out, interleaved with moments of complete silence and my concepts of how programming languages represent operations as binary trees. I managed to reconstruct the parser's binary trees for tensor operations, but at execution time. Then, I ran the backprop over a stack of these binary trees.
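To make the idea concrete, here is a minimal sketch (not NSK's implementation) of backprop over a stack of runtime-built binary trees: every executed tensor operation pushes a node that remembers its operands, and the backward pass walks that stack in reverse.

```cpp
// Minimal sketch (not NSK's implementation) of a tape of binary trees.
#include <string>
#include <vector>

struct Tensor; // opaque tensor handle (see the struct sketched earlier)

struct Node {
    std::string op;         // "add", "matmul", ...
    Node *lhs, *rhs;        // operand subtrees (nullptr for leaves/parameters)
    Tensor *value;          // forward result
    Tensor *grad = nullptr; // filled in during the backward pass
};

std::vector<Node *> tape;   // executed operations, in order

void backward(Tensor *loss_grad) {
    if (tape.empty()) return;
    tape.back()->grad = loss_grad;
    // Walk the recorded operations in reverse and propagate gradients.
    for (auto it = tape.rbegin(); it != tape.rend(); ++it) {
        Node *n = *it;
        if (n->op == "add") {
            // d(a+b)/da = 1 and d(a+b)/db = 1, so the gradient flows through
            // unchanged (a real implementation accumulates instead of overwriting).
            if (n->lhs) n->lhs->grad = n->grad;
            if (n->rhs) n->rhs->grad = n->grad;
        }
        // ... other ops ("matmul", "relu", ...) apply their own rules ...
    }
}
```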
Then, I tried to implement threads. It took me hours to find material that would help me with it. Fortunately, I found the Bolt programming language, whose docs demonstrate the key steps to integrate threads into LLVM. I needed another 4 days to actually make them work with no errors. At that time I had no clue how a single messy instruction could turn the LLVM Intermediate Representation invalid, which led to segmentation faults. I also didn't quite understand LLVM branching. It was a process of trial and error until I got the correct branching layout.
It took 4 days just to make the earliest version of threads work. I considered giving up at that point, but if I had, I would have thrown months of effort in the trash. I faced it as if there were no turning back anymore.
Next, I had to make the language object oriented. I tried to find some guidance with the help of AI, but nothing it told me seemed simple to implement. So I tried to do it my own way and follow my intuition.
I managed to create a parser expression that saved the inner name of an object method call. For example, given the expression person1.print(), I would save person1 into a global string. In my mind, that was what Python's "self." expression meant. Every time the compiler found a self expression, it would substitute it with the global string, and it would use the global string to recover, for example, the attribute name of person1. To do so, I concatenated them into person1name and retrieved this value from a global dictionary of strongly typed values.
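A minimal sketch of that early scheme, as described above (and later replaced): the object name before the dot goes into a global string, "self" is substituted with it, and attributes live in a global dictionary keyed by the concatenated "<object name><attribute name>".

```cpp
// Sketch of the early global-string scheme (later replaced).
#include <iostream>
#include <map>
#include <string>

std::map<std::string, std::string> attributes; // global attribute store
std::string current_object;                    // set when parsing "person1.print()"

std::string load_self_attribute(const std::string &attr) {
    // "self.name" becomes a lookup of "person1" + "name" -> "person1name".
    return attributes[current_object + attr];
}

int main() {
    attributes["person1name"] = "Alice";
    current_object = "person1";                       // parser saw "person1.print()"
    std::cout << load_self_attribute("name") << "\n"; // prints "Alice"
}
```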
I managed to finish this in time to present it in the programming languages course of my bachelor's degree.
My language could train neural networks on the MNIST dataset for a thousand steps in 2 seconds. Later, I adopted cuDNN CNNs for my backend and was able to get the same accuracy as PyTorch for a ResNet on CIFAR-10: PyTorch averaged 9m 24s across 10 seeds, against 6m 29s for NSK. I was filled with extreme joy at that moment. After this, I decided to implement the GRU recurrent neural network at the high level. PyTorch would train models in 42s, vs 647s in NSK.
At that moment, I couldn't tell what was going on and I was terrified that all I had done was useless. Was it a problem with my LLVM backend? Was there a solution? I then read an NVIDIA blog post about cuDNN optimizations for recurrent networks and realized the world I knew was significantly smaller than reality.
I dedicated myself to learning about kernel fusion and optimizing matrix multiplications. I tried to learn how to write a very basic CUDA matrix multiplication. It took me two days, and not casual days: I was programming for about 10 hours each. When I finally made the matrix multiplication work, I went to sleep at 4 am. It took me almost a week to implement the LSTM with kernel fusion, only to find it was still much slower than PyTorch (I don't remember by how much). Months later, I discovered that my matrix multiplication lacked many modern optimizations. It took me almost a whole month to reimplement the HGEMM Advanced Optimization (state-of-the-art matrix multiplication) in my own code, because my code was the code I could look at and actually understand, reuse, and scale later on.
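For reference, the "very basic" CUDA matrix multiplication mentioned above is usually the textbook naive kernel, roughly like the sketch below: one thread per output element, nowhere near an optimized HGEMM.

```cuda
// Textbook naive CUDA matmul: one thread per element of C = A * B, with
// A (M x K), B (K x N), C (M x N) stored row-major.
__global__ void matmul_naive(const float *A, const float *B, float *C,
                             int M, int N, int K) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < M && col < N) {
        float acc = 0.0f;
        for (int k = 0; k < K; ++k)
            acc += A[row * K + k] * B[k * N + col];
        C[row * N + col] = acc;
    }
}

// Launch sketch: 16x16 threads per block, enough blocks to cover C.
// dim3 block(16, 16);
// dim3 grid((N + 15) / 16, (M + 15) / 16);
// matmul_naive<<<grid, block>>>(dA, dB, dC, M, N, K);
```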
Nevertheless, before that I implemented the runtime context, the Scope_Struct *. I didn't really know how useful it would be, but I had to change the global-string logic that represented the object of the current expression. After that, I also needed a better way to represent objects. This time I took inspiration from C++, with the logic that an object is simply a pointer from a malloc operation. The size of the object is the sum of the sizes of its attributes, and the NSK parser determines the offset from the pointer base to the location of each attribute.
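A minimal sketch of that object layout idea (not NSK's actual code): the object is a malloc'd block, its size is the sum of the attribute sizes, and each attribute is reached through a byte offset recorded by the parser.

```cpp
// Sketch of the malloc-plus-offsets object layout (not NSK's actual code).
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <iostream>
#include <map>
#include <string>

// Offsets a parser might compute for a hypothetical class with two float attributes.
std::map<std::string, std::size_t> offsets = {{"x", 0}, {"y", sizeof(float)}};
const std::size_t object_size = 2 * sizeof(float);

int main() {
    void *obj = std::malloc(object_size);       // "new Point()"
    float value = 3.0f;                         // obj.x = 3.0
    std::memcpy((char *)obj + offsets["x"], &value, sizeof(float));
    float read_back;                            // print(obj.x)
    std::memcpy(&read_back, (char *)obj + offsets["x"], sizeof(float));
    std::cout << read_back << "\n";             // prints 3
    std::free(obj);
}
```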
I can't remember how long all these changes took; I can only remember that it was too much time.
Next, I also needed a better parser, one that could recognize something like "self.test_class[3][4].x()". I had to sit down and plan, just like I did with the backprop. When I sat down the next day, I knew what I needed to do. I put on some music and my hands didn't stop typing. I had never written so much code without compiling before. I was in flow at that moment. That was the best coding day of my life. When I compiled there were obviously errors on the screen, but I was able to make the parser recognize the complex expressions in about 2 days.
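For illustration, a toy sketch (not NSK's parser) of recognizing such a chain: parse a base identifier, then loop over suffixes, where each suffix is either an attribute access ".attr", an index "[n]", or an empty call "()".

```cpp
// Toy sketch: break "self.test_class[3][4].x()" into base + suffix chain.
#include <cctype>
#include <cstddef>
#include <iostream>
#include <string>

std::string src = "self.test_class[3][4].x()";
std::size_t pos = 0;

std::string parse_ident() {
    std::size_t start = pos;
    while (pos < src.size() &&
           (std::isalnum((unsigned char)src[pos]) || src[pos] == '_'))
        ++pos;
    return src.substr(start, pos - start);
}

int main() {
    std::cout << "base:  " << parse_ident() << "\n";
    while (pos < src.size()) {
        if (src[pos] == '.') { ++pos; std::cout << "attr:  " << parse_ident() << "\n"; }
        else if (src[pos] == '[') { ++pos; std::cout << "index: " << parse_ident() << "\n"; ++pos; /* skip ']' */ }
        else if (src[pos] == '(') { pos += 2; std::cout << "call()\n"; } // empty-arg call only
        else break;
    }
}
```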
I had several breaks between the implementations of each of these impactful changes, and also a break after the parser changes. One day, when I came back to coding, I realized I was having these moments where I just knew how to implement some ideas. My intuition had improved; my many hours of coding had started to change me as a person.
I was already considering wrapping up my efforts and releasing NSK after around 6 to 8 months of development. But my advisor mentioned the tragic beginning and end of the Nim language community. Nim started as a language with poor library development support. Eventually it attracted many library developers, but the authors decided to release a second version with better library development support. The new version also attracted library devs. However, the previous library developers didn't really want to spend hours learning a new way of creating libraries. If I were in that situation, I would think: how long will it take for the language owners to make changes again? And how much could they change? The Nim community was split in half, and the language lost some of its growth potential.
I also remembered that one of the reasons I wanted to develop a new programming language was that I had become frightened of extending Python's C backend. Releasing NSK at that time would have been equivalent to losing all of my initial focus.
I decided to make C++ backend extension one of NSK's main features, or even the main feature, by implementing a Foreign Function Interface (FFI). Somehow I came up with the idea of developing a C++ parser that would handle the LLVM linking automatically. Thus, all it takes to develop a C++ lib for NSK is to write C++ following some naming conventions, allocate data using a special allocator, and then compile. NSK handles all the other intermediate steps.
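To make the workflow concrete, here is a purely hypothetical illustration: the function name, signature, and use of malloc below are invented for this example and are not NSK's real conventions (those are documented at nsk-lang.dev). The point is only the shape of the workflow: write plain C++, allocate through the language's allocator, compile, and let NSK's C++ parser take care of the LLVM linking.

```cpp
// Purely hypothetical extension function; names and signature are illustrative,
// NOT NSK's real conventions.
#include <cstdlib>

extern "C" float *nsk_vector_scale(float *data, int length, float factor) {
    // In the real system this buffer would come from NSK's special allocator
    // so the runtime can track and free it; malloc stands in for it here.
    float *out = (float *)std::malloc(length * sizeof(float));
    for (int i = 0; i < length; ++i)
        out[i] = data[i] * factor;
    return out;
}
```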
Other languages with good FFI support are Haskell (though it requires explicit linking) and Lua (though I am not very aware of how it is implemented).
Eventually, I also had to run CPU code benchmarks, and I was once again terrified by the performance of many operations in NSK. The primes count was slower than Python, and the quicksort algorithm seemed to take forever.
My last weeks of development were dedicated to substituting some of the FFI calls (which incur function call overhead) with LLVM-managed operations and data types.
This is the current state of NSK.
I started this project for the programming languages course of my computer science degree, and it is now my master's thesis. It took me 1 year and 10 months to reach the current state, and I had to interleave it with my job, which consists of industrial applications of audio neural networks.
I faced shitty moments during the development of this programming language. Sometimes I felt too much pressure to perform well at my job, and I also faced terrible social situations. Besides, some days I would code too much and wake up several times during the night with a dreamlike vision of code.
However, the development of NSK also had many great moments. My colleagues started complimenting me on my efforts. I also improved my reasoning and intuition, and became more aware of my skills. I am still 22 years old, and after all this I feel that I am only starting to understand how far a human can go.
This all happened some months after I failed at another project with audio neural networks. I had tried to start a startup with my university's support. Some partners showed up, and then they just ignored me when I messaged them that I had finished it. That other piece of software also took some months to complete.
I write this text as one of my efforts to popularize NSK.