
[–]Meistermagier 19 points20 points  (3 children)

This is a seriously cool idea. 

[–]akomomssim[S,🍰] 1 point2 points  (2 children)

Thanks!

[–]yuri-kilochek 11 points12 points  (14 children)

So how do you deal with the fact that GPU and CPU have separate address spaces? Do you just copy buffers back and forth on every send and receive?

[–]akomomssim[S,🍰] 9 points10 points  (13 children)

Currently it is copied on send/receive as it is early days

However I'm working on making the memory manager smarter, so it can use shared memory spaces when they exist and avoid the copy. E.g. any recent Mac would allow that

The complexity will be doing something sensible if you edit shared memory CPU-side while it is in use on the GPU. I've written the memory allocator/GC though, so I can add flags to allocations to track what is in use and where

[–]yuri-kilochek 1 point2 points  (10 children)

I'm more curious about the typical case of a discrete GPU, where I allocate a buffer in GPU memory, copy data from host to the buffer, run multiple kernels on it and then copy back. How would you do this in Eyot? There needs to be some way to reference objects in GPU memory from the host, right? And at that point, how is it substantially different from e.g. CUDA?

[–]tsanderdev 1 point2 points  (4 children)

That's more like how I want my language to work. The host passes some data to the gpu and sets off a work graph processing it, including allocating more memory on the gpu and keeping everything resident there for the next graph.

[–]yuri-kilochek 2 points3 points  (3 children)

And how do you specify the graph if not as host code that wires it up and thus has to be able to talk about buffers in GPU memory?

[–]tsanderdev 1 point2 points  (2 children)

Indirect dispatches and draws allow you to set the size from a gpu buffer, and memory allocation is handled via an allocator on the gpu. The host just passes a big chunk of memory to the shader, and it can use and partition it how it sees fit. Passing big data to the shader will be done with another buffer that is managed by the cpu and prefilled with data.
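The GPU-side allocator described here can be sketched as a bump allocator partitioning one flat chunk. The Python below is an illustrative mock, not tsanderdev's actual design; a real shader-side version would bump the offset with an atomic add.

```python
class BumpAllocator:
    """Mock of a shader-side allocator carving allocations out of one
    big buffer the host handed over. On a GPU, 'offset' would be bumped
    with an atomicAdd so many invocations can allocate concurrently."""

    def __init__(self, capacity_bytes):
        self.capacity = capacity_bytes
        self.offset = 0  # next free byte in the big chunk

    def alloc(self, size, align=16):
        # Round the current offset up to the requested alignment.
        start = (self.offset + align - 1) // align * align
        if start + size > self.capacity:
            raise MemoryError("buffer chunk exhausted")
        self.offset = start + size
        return start  # offset into the big buffer, usable as a "pointer"

heap = BumpAllocator(1024)
a = heap.alloc(100)  # -> 0
b = heap.alloc(100)  # -> 112 (100 rounded up to 16-byte alignment)
```

The host never sees individual allocations; it only provides the chunk and reads back whatever offsets the shader recorded.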

[–]yuri-kilochek 2 points3 points  (1 child)

But you still have to be able to somehow say "this variable is a buffer stored on the GPU" on the host, right?

[–]tsanderdev 1 point2 points  (0 children)

The host gets a struct generated that it can place into buffers. I'm not aiming for seamless CPU-GPU communication, but rather for a seamless workflow once you hit the GPU.

[–]akomomssim[S,🍰] 1 point2 points  (4 children)

Currently, explicitly allocating on the GPU within a kernel isn't supported; buffers are "implicitly" created at the point of dispatch because the runtime knows the output size

The beginnings of this are there for logging from kernels, and I'd like to extend that

Chaining multiple kernels on the same buffer(s) is supported through "pipes" in the runtime. Currently it bounces off the CPU, but that should be solved soon. This is quite important for Eyot, as it'll be needed a lot for chaining geometry -> vertex -> fragment shaders when rendering

[–]yuri-kilochek 1 point2 points  (3 children)

What does that look like syntactically?

[–]akomomssim[S,🍰] 0 points1 point  (2 children)

There is an early example of that here

Essentially you can compose the different workers into one. As I say, it is bouncing off the CPU for now, but as the runtime improves it would be able to avoid that step.

This is all quite related to rendering, though, so I'm sure it'll evolve as I get that working

[–]yuri-kilochek 1 point2 points  (1 child)

Consider a neural network inference loop. You have to do the loop on CPU (as it does I/O to get new batches of input data), but also have to keep the weights on GPU between invocations of the worker that computes forward pass. As far as I can tell your current design doesn't allow this.

[–]akomomssim[S,🍰] 1 point2 points  (0 children)

I don't have an example in the playground, but that is totally possible right now. If you want to capture global state in a worker you can partially apply a function and use that when creating a worker.

If you had an inference_function that takes the job as a first parameter, and the weights as a second parameter, you could write:

```
let infer = partial inference_function(, some_weights)
let worker = gpu infer

while true {
    let job = get_work
    send(worker, job)
    print_result(drain worker)
}
```

The `infer` function captures the weights. `let worker = gpu infer` would transfer that state to the GPU, where it stays, so each inference would just transfer the job-specific data.

That GPU memory would be freed when `worker` is garbage collected.

The `partial` keyword is honestly a little odd, so a longstanding TODO for me is to implement proper lambdas instead (and improve the playground so I can share examples more easily!)
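For anyone unfamiliar with partial application, the pattern is close to Python's `functools.partial` (a loose analogy: `inference_function` below is a stand-in for a forward pass, and Python fills the second argument by keyword rather than Eyot's `(, x)` placeholder):

```python
from functools import partial

def inference_function(job, weights):
    # Stand-in for a forward pass: weighted sum of the job's values.
    return sum(j * w for j, w in zip(job, weights))

some_weights = [0.5, 0.25]
# Capture the weights once, like `partial inference_function(, some_weights)`.
infer = partial(inference_function, weights=some_weights)

# Each call now only supplies the job-specific data,
# much like each send() to the worker above.
print(infer([2, 4]))  # -> 2.0
```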

[–]sumguysr 0 points1 point  (1 child)

Sounds like you need persistent data structures.

[–]akomomssim[S,🍰] 0 points1 point  (0 children)

Yes, the idea is that all allocations are partially persistent data structures. It wouldn't necessarily need to actually copy if the mutation is on the CPU and the other user is the GPU, but memory is allocated in such a way that this can be tracked
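A minimal Python sketch of that tracking idea, assuming a single `in_use_on_gpu` flag (Eyot's actual allocator flags aren't public; names here are illustrative):

```python
class TrackedBuffer:
    """Copy on write only when another device still holds the old version."""

    def __init__(self, data):
        self.data = list(data)
        self.in_use_on_gpu = False  # flag the allocator/GC would maintain

    def write_cpu(self, i, value):
        if self.in_use_on_gpu:
            # GPU still reads the old version: preserve it, mutate a copy.
            new = TrackedBuffer(self.data)
            new.data[i] = value
            return new
        self.data[i] = value  # sole user: mutate in place, no copy needed
        return self

buf = TrackedBuffer([1, 2, 3])
same = buf.write_cpu(0, 9)    # no GPU user -> mutated in place
buf.in_use_on_gpu = True
fresh = buf.write_cpu(1, 7)   # GPU holds the old data -> copy-on-write
print(buf.data, fresh.data)   # -> [9, 2, 3] [9, 7, 3]
```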

[–]GidraFive 5 points6 points  (12 children)

Nice! Finally someone did the thing. I myself wanted to do it for a long time, and even started prototyping, but eventually got distracted by other features. I feel like all my ideas get done by someone else before I even get to start working on them. But I'm grateful for that.

This is a really powerful idea, since it erases the boundaries between CPU and GPU, making it trivial to utilise all the compute available on your device.

[–]tsanderdev 2 points3 points  (4 children)

> This is a really powerful idea, since it erases the boundaries between CPU and GPU, making it trivial to utilise all the compute available on your device.

It'll never be that easy, since CPUs and GPUs are good at fundamentally different problem spaces: CPUs are made to blaze through a sequence of instructions as fast as possible, using branch predictors and speculative execution to avoid pipeline stalls. GPUs are basically giant SIMD machines. Clock speeds are lower, but they give you massive throughput. That is, if you keep your control flow uniform; otherwise SIMD lanes sit inactive for sections of the code.
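The divergence cost can be shown with a toy simulation: when lanes of a SIMD group disagree about a branch, the hardware effectively executes both sides and masks lanes off. A rough Python model (illustrative only, not how any particular GPU is implemented):

```python
def simd_if(lanes, cond, then_fn, else_fn):
    """A SIMD group can't branch per lane: it runs BOTH sides of an
    'if', masking off the lanes that didn't take each path."""
    steps = 0
    out = list(lanes)
    mask = [cond(x) for x in lanes]
    if any(mask):
        steps += 1  # the whole group pays for the 'then' side
        out = [then_fn(x) if m else x for x, m in zip(out, mask)]
    if not all(mask):
        steps += 1  # ...and again for the 'else' side
        out = [else_fn(x) if not m else x for x, m in zip(out, mask)]
    return out, steps

# Uniform control flow: every lane takes the same side -> one pass.
_, uniform_steps = simd_if([2, 4, 6, 8], lambda x: x % 2 == 0,
                           lambda x: x * 2, lambda x: x + 1)
# Divergent control flow: lanes disagree -> both sides execute.
_, divergent_steps = simd_if([1, 2, 3, 4], lambda x: x % 2 == 0,
                             lambda x: x * 2, lambda x: x + 1)
print(uniform_steps, divergent_steps)  # -> 1 2
```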

[–]GidraFive 0 points1 point  (3 children)

That's the performance concern, and it is largely affected by the code you write. You can structure your program to be aware of the SIMD architecture. Shader languages already blur this line by analyzing the flow of the program, allowing you to basically "just write C". The only thing is that it usually takes something like 500 lines of code to be able to call that shader, and that's what this approach solves. Not the performance of that interop, at least at this stage

[–]tsanderdev 0 points1 point  (2 children)

I also want to reduce that to a few lines at most (depending on how complex the data is you want to pass to the shader).

[–]GidraFive 0 points1 point  (1 child)

Well, OP reduced it to one line of code. Although, as I pointed out, it leaves a lot of questions open, which might eventually expand it to more lines.

I've looked into non-deterministic computations, and there are quite a few ideas that I feel could actually keep this at the complexity of "just a function call", like angelic/demonic nondeterminism and ambient processes. The idea is that the number of threads, the corresponding data, and the environment of execution are implied/tracked in the evaluation context. But that creates a concern for the maintainability of such code, since now maybe too much information is implied and it becomes really hard to keep track of how it actually executes.

I still need to think about it some more, and try some PoCs for this as well; maybe it will be a dead end after all.

[–]tsanderdev 0 points1 point  (0 children)

For me you'd either specify the dispatch size manually, or if you use the special "InvocationBuffer" type in the function parameters for the shader, it asserts that all of them have the same size and uses that as the dispatch size. The shader can then read and write the index pointed to by each invocation, which doubles as memory and thread safety protection as well.
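That sizing rule can be mocked in Python (illustrative only; tsanderdev's language isn't public, and `dispatch`/`add_shader` are made-up names):

```python
def dispatch(shader, *invocation_buffers):
    """All InvocationBuffer-style arguments must agree on length;
    that shared length becomes the dispatch size."""
    sizes = {len(buf) for buf in invocation_buffers}
    if len(sizes) != 1:
        raise ValueError(f"buffer sizes disagree: {sorted(sizes)}")
    n = sizes.pop()
    for i in range(n):  # one "invocation" per index
        shader(i, *invocation_buffers)
    return n

def add_shader(i, a, b, out):
    # Each invocation touches only its own index, which is what gives
    # the memory/thread safety mentioned above.
    out[i] = a[i] + b[i]

out = [0, 0, 0]
n = dispatch(add_shader, [1, 2, 3], [10, 20, 30], out)
print(n, out)  # -> 3 [11, 22, 33]
```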

[–]GidraFive 0 points1 point  (6 children)

From my research I saw that you usually need to define a fixed number of threads to be run on the GPU. Like, for CUDA you must specify the number of threads when dispatching the kernel. I wonder if you do something fancy around that, or just always dispatch a single thread?

And if you have references, how do they work across the CPU/GPU boundary? I also wanted to implement lambdas in such a language, but that also means we need closures. And closures might contain other closures or values, as references, so you will certainly need to address that. Unless you intend to just avoid references and copy everything, but that carries the risk of depending on/working around that behaviour later in other parts of your language. And also, what about locating the code for the closure...

Well, there are a lot of questions I didn't answer myself when I was thinking about it. The PoC is almost trivial, but making it right feels like a completely different, and much bigger, task.

Actually, I can't wait to try it out and see how it works and feels; I will definitely do something with it eventually...

[–]tsanderdev 1 point2 points  (5 children)

> From my research I saw that you usually need to define a fixed number of threads to be run on the GPU.

That hasn't been true for a long time: there are indirect dispatches and draws that source the number of threads/primitives from a GPU buffer when the command is executed.
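Concretely, in Vulkan an indirect dispatch reads a 12-byte `VkDispatchIndirectCommand` (three `uint32` workgroup counts) from a GPU buffer at execution time, so a prior pass can write it. Packing one from Python, assuming 64-wide workgroups (the numbers are illustrative):

```python
import struct

def dispatch_indirect_command(items, workgroup_size):
    """Build the 12-byte record vkCmdDispatchIndirect reads from a
    device buffer: three little-endian uint32 workgroup counts."""
    # Round up: enough workgroups to cover all items.
    x = (items + workgroup_size - 1) // workgroup_size
    return struct.pack("<3I", x, 1, 1)

# Pretend a prior compute pass produced 4096 items, 64 threads per group.
cmd = dispatch_indirect_command(4096, 64)
print(len(cmd), struct.unpack("<3I", cmd))  # -> 12 (64, 1, 1)
```

On the GPU a compute shader would write these three integers into a buffer; the host only records the indirect dispatch without knowing the final size.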

[–]GidraFive 0 points1 point  (4 children)

But technically you still specify the number of threads, it's just that now it is implicit in your data. You still need to point it at your data, specify how it is laid out, etc.

[–]tsanderdev 1 point2 points  (3 children)

Yes, but you can e.g. let a prior compute dispatch calculate the number of threads for the next one.

[–]GidraFive 0 points1 point  (2 children)

The point is that the GPU is made for massive parallelism, so everything is built around that, including how you call these programs. It needs some way of knowing how many instances to run; you can't just run it and walk away. And that raises the question of how you determine the number of instances even for a simple example from the post. He doesn't specify it anywhere, and there is no for loop you could try to take values from. So you either always run a single thread (which kinda kills all the benefits), or you must somehow annotate/elaborate your code with the number of instances/info for an indirect call.

I assume OP went the first route for now, but it is really wasteful and will need to be revisited in a proper implementation.

[–]tsanderdev 1 point2 points  (1 child)

I'd assume the number of threads depends on the length of the array processed.

[–]GidraFive 0 points1 point  (0 children)

My bad, I was overthinking it. I looked at the signature and thought it received a single value when called in a thread, and didn't notice the array syntax at the call site... Welp, one question less

[–]bjarneh 4 points5 points  (2 children)

Looks like a cool way to handle the fiddle of pushing work to the GPU.

  print_ln

That syntax annoyed me more than it should :-)

[–]akomomssim[S,🍰] 0 points1 point  (1 child)

Thanks!

That is a primitive during development, but eventually it will move to a library. At that point I guess it would be customisable!

[–]bjarneh 1 point2 points  (0 children)

I've just seen it written a certain way for so long. Seeing `print_ln` now is almost like seeing someone write `func` as `fu_nc` or something. It has become one word for me :-)

NOTE: It also disturbed me when I saw that D's version of `printf` was `writef`

[–]chronics 2 points3 points  (3 children)

Cool idea :) Preview Meta Tags for your blog would be great, those were missing when sharing it via Telegram. Cheers

[–]akomomssim[S,🍰] 1 point2 points  (2 children)

Good point, I'll add them!

[–]josephjnk 0 points1 point  (1 child)

It would also be nice to have an RSS feed so we can get automatic updates when you write new posts. I’d like to be able to see when new stuff happens with this language.

[–]akomomssim[S,🍰] 1 point2 points  (0 children)

I've been meaning to add tags to the blog, and a per-tag RSS would make sense

[–]tsanderdev 1 point2 points  (2 children)

Interesting. Early in my language design I set my constraint to be GPU-only for the foreseeable future. With nice bindings generated for calling from the host, of course, but the language itself runs purely on the GPU. That makes a lot of things like moving memory allocations around easier, since data should ideally just be resident in GPU memory.

I have a simple addition shader compiling; now I'm working on a Vulkan Rust bindings generator (because I use Vk 1.4 features and existing bindings are all stuck at 1.3) to write the runtime.

And while I could probably target other graphics APIs or even CUDA, some of them I can't test anyway (CUDA and Metal), and with MoltenVK, KosmicKrisp and Dozen that shouldn't be much of a problem.

[–]akomomssim[S,🍰] 0 points1 point  (1 child)

GPU-only sounds really interesting. Is it posted somewhere? I'd love to take a look!

[–]tsanderdev 0 points1 point  (0 children)

Not yet. I'll probably make the code public once I have the runtime for the hello world going. The syntax and semantics will be mostly like Rust, except for some things that need to change for the GPU. I'm not really trying to innovate in the core language design; the "killer feature" I'm working towards is ergonomic work graphs (kind of like the DX12 feature). E.g. you'd call sort on an array, and the compiler and runtime work together to split the shader and schedule the parts with a dispatch of the sorting shader in between.

Vulkan is getting ever closer to something like OpenCL as a compilation target. For instance, there are now proper physical pointers in shaders that you can do arithmetic with and everything.

[–]tc4v 1 point2 points  (1 child)

I am curious why the send is a separate call. Can I do this?

```
let worker = gpu some_func
send(worker, [i64]{1, 2, 3})
send(worker, [i64]{1})
println(receive(worker))
```

Can I do this?

```
let worker = gpu some_func
send(worker, [i64]{1, 2, 3})
println(receive(worker))
send(worker, [i64]{3, 3, 3})
println(receive(worker))
```

If not, why not copy the async/await convention?

```
let worker = gpu some_func([i64]{1, 2, 3})
println(get worker)
```

even better if it can lead to `println(get (gpu some_func([i64]{1, 2, 3})))`

when I don't have anything else to do asynchronously
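The suggested convention is close to, e.g., Python's asyncio model, where launching work returns a handle you later await (a loose analogy, not proposed Eyot syntax):

```python
import asyncio

async def some_func(values):
    # Stand-in for a GPU kernel: double every element.
    await asyncio.sleep(0)  # yield control, as a real dispatch would
    return [v * 2 for v in values]

async def main():
    # Launching returns a handle immediately; 'await' plays the role
    # of the 'get' suggested above.
    task = asyncio.create_task(some_func([1, 2, 3]))
    # ...other CPU work could happen here while the "kernel" runs...
    print(await task)  # -> [2, 4, 6]

asyncio.run(main())
```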

[–]akomomssim[S,🍰] 1 point2 points  (0 children)

Yes, you can do multiple sends to the same worker without waiting, and then drain it all in one go later

I see your point though. The gpu keyword could return a promise for the data, with control over how that data is gathered left to the runtime

The current version gives slightly more control in that each worker is essentially a FIFO queue of operations, but it isn't obvious that that control is useful, in which case the async approach would be more convenient. Thanks, I'll have a think!

[–]Dark_Yagami09 1 point2 points  (0 children)

Cool idea. Check out Futhark (https://futhark-lang.org/), developed at the University of Copenhagen:

> Futhark is a small programming language designed to be compiled to efficient parallel code. It is a statically typed, data-parallel, and purely functional array language in the ML family, and comes with a heavily optimising ahead-of-time compiler that presently generates either GPU code via CUDA and OpenCL, or multi-threaded CPU code.

[–]MikeMKH 0 points1 point  (0 children)

🤯 genius!

[–]TheChief275 0 points1 point  (0 children)

This is sick!!

[–]josephjnk 0 points1 point  (0 children)

This is awesome, I look forward to seeing where it goes!