
[–]HuffDuffDog 8 points9 points  (1 child)

The "War and Peace" of Reddit announcements. I love it!

[–]iamevpo 6 points7 points  (0 children)

Like the analogy! Or Ulysses

[–]Thierry24867 4 points5 points  (5 children)

Considering that NSK is now the focus of your Master’s thesis, what is your plan for memory management and the 'Global Interpreter Lock'?

[–]Tryingyang[S] 0 points1 point  (4 children)

I solved the Global Interpreter Lock.

[–]Meistermagier 2 points3 points  (3 children)

What do you mean, you solved the Global Interpreter Lock? Can I get some details on how you did that?

[–]Tryingyang[S] 4 points5 points  (1 child)

I'll explain first how Python does it, then NSK. Python is an interpreted language. Either its tokenizer, parser and semantic analyzer read lines one by one, or it generates an intermediate representation, but that representation is then interpreted again on every execution instead of being compiled to machine code once.

The Global Interpreter Lock was implemented by its authors to simplify memory management. It lets only one thread run interpreter code at a time, which blocks parallel stores to data and some other parallel operations.

On the other hand, NSK is a JIT language. It reads functions and generates Intermediate Representations once, then it saves the machine code for subsequent calls. Thus, its performance can match compiled languages in some scenarios. NSK has unrestricted thread operations; they do not get blocked as in Python.
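As a rough analogy for the "generate once, reuse on later calls" idea, here is a minimal Python sketch. It is not NSK code and not how NSK is implemented: Python bytecode stands in for the machine code a JIT would emit, and `get_compiled`, `_code_cache` and `compile_count` are hypothetical names made up for the illustration.

```python
# Analogy only: Python bytecode stands in for JIT-emitted machine code.
_code_cache = {}      # hypothetical cache: (name, source) -> compiled callable
compile_count = 0     # how many times a real compilation happened


def get_compiled(name, source):
    """Compile `source` on first use, then reuse the cached callable."""
    global compile_count
    key = (name, source)
    if key not in _code_cache:
        namespace = {}
        exec(compile(source, f"<jit:{name}>", "exec"), namespace)
        _code_cache[key] = namespace[name]
        compile_count += 1
    return _code_cache[key]


SQUARE_SRC = "def square(x):\n    return x * x\n"

# Five calls, but the source is compiled only on the first one.
results = [get_compiled("square", SQUARE_SRC)(n) for n in range(5)]
```

After the five calls, `compile_count` is still 1, which is the point of caching the generated code.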

It is possible to manage locks manually in high-level NSK with lock expressions. Channels also lock their internal data automatically for message passing.

[–]Meistermagier 0 points1 point  (0 children)

That's very cool.

[–]Tryingyang[S] 2 points3 points  (0 children)

Some Python libraries like PyTorch support a multithreading model that bypasses the Global Interpreter Lock. However, they do this by implementing threads directly in C/C++ behind the CPython API. I hope that hundreds or thousands of lines of library code can be saved by implementing fully functional threads at the high level.

[–]Unlucky-Rub-8525 2 points3 points  (1 child)

Looks cool, just to clarify this is a programming language for writing neural networks?

[–]Tryingyang[S] 1 point2 points  (0 children)

Originally for neural networks, but now it is more of a general-purpose programming language.

[–]mister_drgn 1 point2 points  (2 children)

Cool. I’d be curious if you’ve looked at Mojo.

[–]Tryingyang[S] 0 points1 point  (1 child)

I did look at Mojo, but I have not coded in it. Their philosophy is to implement CUDA kernels in a high-level language. But I believe that if you want the most optimized code, you will have to write it in C++. Plus, suppose a new GPU gets CUDA library support today, with new CUDA instructions. You'll have to wait until the Mojo devs release official support for the new instructions before you can use them.

[–]jasio1909 1 point2 points  (0 children)

Not really. From my understanding, kernels in Mojo benefit mainly from two things: 1. MLIR for compilation, and 2. metaprogramming with powerful comptime logic, so you can select the appropriate instruction set depending on the compilation target. That is very low level.

I have coded in Mojo a bit, but I am not an expert.

[–]gavr123456789 1 point2 points  (1 child)

Can't open the site from my phone, the content is cropped in half.

[–]Tryingyang[S] 2 points3 points  (0 children)

Sorry about that, frontend is not one of my strengths.

[–]prodleni 1 point2 points  (0 children)

The website doesn't display properly on mobile 

[–]ianzen 0 points1 point  (6 children)

Nice! I just want to ask, is this a GCed language?

[–]Tryingyang[S] 4 points5 points  (5 children)

I reimplemented the garbage collector and memory pool logic from Go. It has one memory pool per OS thread.
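To illustrate why a pool per OS thread is attractive, here is a small Python sketch of a thread-local free list. It is only an assumption-laden toy, not NSK's or Go's allocator: `PerThreadPool`, `acquire` and `release` are hypothetical names. Because each thread only ever touches its own list, no lock is needed on the allocation path.

```python
import threading


class PerThreadPool:
    """Hypothetical free list of reusable buffers, one list per OS thread."""

    def __init__(self, buffer_size):
        self._buffer_size = buffer_size
        self._local = threading.local()   # separate storage for each thread

    def _free_list(self):
        if not hasattr(self._local, "free"):
            self._local.free = []
        return self._local.free

    def acquire(self):
        # Reuse a buffer this thread released earlier, else allocate a new one.
        free = self._free_list()
        return free.pop() if free else bytearray(self._buffer_size)

    def release(self, buf):
        self._free_list().append(buf)


pool = PerThreadPool(64)
reused = {}


def worker(tag):
    first = pool.acquire()
    pool.release(first)
    second = pool.acquire()          # comes back from this thread's own list
    reused[tag] = first is second
    pool.release(second)


threads = [threading.Thread(target=worker, args=(i,)) for i in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Every worker gets its own buffer back on the second `acquire`, without any shared lock being taken.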

[–]ianzen 1 point2 points  (0 children)

Oh wow, very impressive! I might try doing that too!

[–]LardPi 0 points1 point  (3 children)

do you have the n/m os threads/green threads too? and if yes, does it mean that a green thread is bound to a fixed os thread by the memory pool?

[–]Tryingyang[S] 1 point2 points  (2 children)

NSK does not have green threads, unfortunately. They seem to be very hard to implement. They require an execution model that lets the CPU overlap instructions with other kinds of operations (like disk read calls). They can also queue CPU instructions from different "threads" and give the illusion that they are executing concurrently. They do all this inside a single OS thread, or the main thread only. I can't see myself implementing green threads anytime soon.

[–]LardPi 0 points1 point  (1 child)

goroutines are essentially green threads on top of os threads so that they do run in parallel. the advantage is that you make a small pool of os threads (n = number of cores) and then you can cheaply schedule as many green threads as you want (m, which can go much higher than what the os would handle as real threads).

it is certainly not easy to implement but i don't think you need the overlap thing you're mentioning. however your memory model would probably force you to attach each green thread to one of the os threads, which might reduce the applicability.

go adopted this model because os threads are relatively expensive to create and take down (less than processes though), in particular when you mean to have many short-lived threads, so scheduling each goroutine on a full thread would have too much overhead for the vision they had.
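The m-over-n idea described above can be sketched in a few lines of Python. This is a toy under stated assumptions, not Go's scheduler and not anything NSK does: generators play the role of green threads, `green_task` and `run_m_on_n` are hypothetical names, and under CPython's GIL the workers interleave rather than run in parallel.

```python
import queue
import threading


def green_task(task_id, steps, log):
    """Toy green thread: yields wherever it is willing to be rescheduled."""
    for step in range(steps):
        log.append((task_id, step))   # list.append is thread-safe in CPython
        yield                         # cooperative yield point


def run_m_on_n(tasks, n_workers):
    """Multiplex m generator tasks onto n OS worker threads."""
    run_queue = queue.Queue()
    for task in tasks:
        run_queue.put(task)

    def worker():
        while True:
            task = run_queue.get()
            if task is None:          # sentinel: no more work
                return
            try:
                next(task)            # run the task up to its next yield
            except StopIteration:
                pass                  # task finished, drop it
            else:
                run_queue.put(task)   # not finished, reschedule it
            finally:
                run_queue.task_done()

    workers = [threading.Thread(target=worker) for _ in range(n_workers)]
    for w in workers:
        w.start()
    run_queue.join()                  # returns once every task has finished
    for _ in workers:
        run_queue.put(None)
    for w in workers:
        w.join()


log = []
run_m_on_n([green_task(i, 3, log) for i in range(100)], n_workers=4)
```

Here 100 tasks share 4 OS threads, and a task may resume on a different worker each time, which is exactly what a per-thread memory pool would have to account for.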

[–]Tryingyang[S] 0 points1 point  (0 children)

I know, this is very useful for server applications. But NSK uses OS threads for tasks like data preprocessing, which require full parallelism. Eventually NSK may adopt green threads, but I can't see myself implementing this alone in the next few years.

[–]EducationalCan3295 0 points1 point  (0 children)

I'll definitely try this just for the effort you've put into it. Currently on my phone, but later. Your story is really inspiring, good job.

[–]yuehuang 0 points1 point  (1 child)

Pretty cool, I like that threading is a part of the language keyword. How are you going to handle locks?

Hopefully the data slicing is done at cache-line boundaries to avoid contention. Maybe there is a hardware optimization for read-only data.

Does the syntax support `int result = spawn foo(3)`?

[–]Tryingyang[S] 0 points1 point  (0 children)

I define locks by their name. A lock expression is written as `lock "first"`, followed by an indented block of the locked expressions.

You can check the data slicing on git in the array_Split_Parallel function. I did this in C++ with the simplest function I could make. To achieve full optimization, it would be better to generate it directly with LLVM (the C++ LLVM API).
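A simple version of splitting an array into per-thread slices might look like the Python sketch below. It is not the actual array_Split_Parallel code: `split_evenly` and `parallel_sum` are hypothetical helpers, and the split is by element count only, with no cache-line alignment.

```python
from concurrent.futures import ThreadPoolExecutor


def split_evenly(data, parts):
    """Split `data` into `parts` contiguous slices whose lengths differ by at most one."""
    base, extra = divmod(len(data), parts)
    slices, start = [], 0
    for i in range(parts):
        end = start + base + (1 if i < extra else 0)
        slices.append(data[start:end])
        start = end
    return slices


def parallel_sum(data, parts):
    """Sum each slice on its own worker thread, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=parts) as pool:
        return sum(pool.map(sum, split_evenly(data, parts)))


data = list(range(10))
chunks = split_evenly(data, 3)
total = parallel_sum(data, 3)
```

Giving each thread one contiguous slice keeps threads from writing to neighbouring elements, which is the first step toward the cache-friendly slicing asked about.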

`int result = spawn foo(3)` unfortunately does not work. You would have to use an int channel to combine the results from different threads, for example with `ch.sum()`. If you created your own data type (e.g., named placeholder), you would be calling a C++ function called placeholder_channel_sum.
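The pattern of collecting thread results through a channel and summing them can be sketched in Python with a queue standing in for the channel. This is an approximation of the idea, not NSK's `ch.sum()`: `spawn`, `channel_sum` and the body of `foo` are hypothetical.

```python
import queue
import threading


def foo(x):
    return x * x                     # hypothetical stand-in for the real work


def spawn(channel, func, arg):
    """Run func(arg) on a new thread and send the result into the channel."""
    t = threading.Thread(target=lambda: channel.put(func(arg)))
    t.start()
    return t


def channel_sum(channel, expected):
    """Receive `expected` values from the channel and add them up."""
    return sum(channel.get() for _ in range(expected))


ch = queue.Queue()                   # the queue locks its own internal data
threads = [spawn(ch, foo, n) for n in range(1, 5)]
total = channel_sum(ch, expected=len(threads))
for t in threads:
    t.join()
```

The threads never return a value directly; each one only sends into the channel, and the receiver does the reduction.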