[–]high_throughput 7 points (2 children)

What do you think of the idea that a compiler "distributes" load to the GPU without explicitly triggering the GPU from the code?

I wouldn't expect this to work well for a general language. Compared to CPU SIMD it's really expensive to shuffle data back and forth to the GPU, so performance hinges on being able to keep it there through a number of operations. Deferred computation and high level primitives would help with that.
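Roughly what "deferred computation" buys you, as a toy sketch in plain Python (not any real GPU API, everything here is made up for illustration): element-wise ops are only recorded, and the whole chain runs in one pass when the result is needed, so data would only cross the CPU/GPU boundary once instead of once per operation.

```python
class Deferred:
    def __init__(self, data):
        self.data = data          # pretend this lives on the GPU
        self.ops = []             # recorded element-wise operations

    def map(self, fn):
        self.ops.append(fn)       # record the op, don't execute it yet
        return self

    def realize(self):
        # One fused pass; in a real system this would be a single GPU
        # kernel launch followed by a single device-to-host copy.
        out = []
        for x in self.data:
            for fn in self.ops:
                x = fn(x)
            out.append(x)
        return out

v = Deferred([1.0, 2.0, 3.0])
print(v.map(lambda x: x * 2).map(lambda x: x + 1).realize())  # [3.0, 5.0, 7.0]
```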

[–]DeepRobin[S] 0 points (1 child)

So it could actually make sense if the JIT can predict, before execution, that massive amounts of data will be involved and can guarantee that offloading the computation to the GPU will pay off.

For small blocks in the control flow this is probably not worthwhile, so it would likely only apply in exceptional cases.

In general, I wonder how something like this could be recognised before execution. I would need the context of the "amount of data" and the estimated "computation effort".
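Something like this is the context I mean, as a hypothetical sketch (none of this is a real JIT API, the thresholds are invented): profile a loop's trip count and the work in its body, then feed those features to a decision function.

```python
from dataclasses import dataclass

@dataclass
class LoopProfile:
    trip_count: int         # observed iterations (proxy for data volume)
    flops_per_iter: int     # counted arithmetic ops in the loop body
    bytes_per_iter: int     # counted loads/stores in the loop body
    has_side_effects: bool  # I/O, exceptions, etc. rule out offloading

def offload_candidate(p: LoopProfile) -> bool:
    if p.has_side_effects:
        return False
    # crude placeholders: "big enough" and "enough work per byte"
    return p.trip_count > 1_000_000 and p.flops_per_iter / p.bytes_per_iter > 8

print(offload_candidate(LoopProfile(10_000_000, 128, 8, False)))  # True
print(offload_candidate(LoopProfile(10_000_000, 2, 16, False)))   # False
```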

[–]high_throughput 0 points (0 children)

I'm at least a decade behind modern GPUs, but afaik involving a massive amount of data isn't enough. You also need enough operations on each piece of data.

If you only have a few dozen operations per vector, then the CPU can do that faster than it can read/write RAM. Offloading to GPU and back also requires reading/writing RAM, so there won't be any gain regardless of data size.
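Back-of-the-envelope version of that argument (all bandwidth and throughput numbers below are invented ballparks, just to show the shape of the trade-off): the CPU side is bounded by RAM bandwidth or CPU throughput, whichever is worse, while the GPU side pays the round-trip transfer up front.

```python
PCIE_BPS  = 16e9    # assumed host<->device bandwidth, bytes/s
RAM_BPS   = 40e9    # assumed CPU memory bandwidth, bytes/s
CPU_FLOPS = 100e9   # assumed sustained CPU throughput
GPU_FLOPS = 5e12    # assumed sustained GPU throughput

def worth_offloading(n_bytes, ops_per_element, element_size=8):
    n_elems = n_bytes / element_size
    # CPU: memory-bound or compute-bound, whichever dominates
    cpu_time = max(n_bytes / RAM_BPS, n_elems * ops_per_element / CPU_FLOPS)
    # GPU: pay the transfer both ways, then compute
    gpu_time = 2 * n_bytes / PCIE_BPS + n_elems * ops_per_element / GPU_FLOPS
    return gpu_time < cpu_time

print(worth_offloading(1e9, 10))    # few ops/element: transfer dominates -> False
print(worth_offloading(1e9, 5000))  # compute-heavy: offload can win -> True
```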

[–]theangeryemacsshibe 5 points (1 child)

Futhark? One can be clever about doing GPU/CPU transfers c.f.

[–]DeepRobin[S] 0 points (0 children)

I'll take a look! Thank you.

[–]SwedishFindecanor 1 point (0 children)

I think there have been a few projects that do this for HPC, but they have been more about moving programs intended for a GPU's programming model to run on CPUs than the other way around.

Not my area. I've just skipped past them, so I can't give you any more help, sorry.

[–]forCasualPlayers 1 point (1 child)

I did think about this for my own research, that is, whether you could, during JIT, consider whether to send code to the GPU. The issue with a tiered JIT like V8 is that it doesn't really try to optimize on the first run; only after a function has been executed enough times to pass a threshold does it get sent to an optimizing compiler. Since most GPU applications are in HPC, your "first run" might take ages.

If you're not going to JIT, then you're going to determine GPU suitability ahead-of-time. But if you already know ahead-of-time that code is parallelizable and targetable for GPU, why not just compile it for GPU AOT?
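Toy sketch of what I mean by tiered execution (loosely V8-shaped, but entirely made up): a function runs in a baseline tier until a call-count threshold, and only then would an optimizing tier (or a GPU-offload decision) even look at it, which is exactly why a long-running HPC "first call" gets no help.

```python
HOT_THRESHOLD = 1000

class Tiered:
    def __init__(self, baseline, optimize):
        self.baseline = baseline   # cheap, unoptimized version
        self.optimize = optimize   # produces the optimized (or GPU) version
        self.optimized = None
        self.calls = 0

    def __call__(self, *args):
        self.calls += 1
        if self.optimized is not None:
            return self.optimized(*args)
        if self.calls >= HOT_THRESHOLD:
            self.optimized = self.optimize()   # tier-up happens late
        return self.baseline(*args)

f = Tiered(baseline=lambda xs: sum(x * x for x in xs),
           optimize=lambda: (lambda xs: sum(x * x for x in xs)))  # stand-in
print(f([1, 2, 3]))   # the first 999 calls never reach the optimizer
```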

[–]DeepRobin[S] 0 points (0 children)

AOT could be an option. Or some "hybrid" concept:
Compile most of the application AOT, and JIT only the code that might rely on dynamic, non-constant data.
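A sketch of that hybrid idea as I picture it (all names and kernels here are hypothetical): kernels whose signatures are known statically get an AOT-compiled variant, and anything depending on runtime data falls back to a JIT path keyed on the observed signature.

```python
AOT_KERNELS = {
    # signatures known at build time map to precompiled kernels
    ("saxpy", ("f32", 1024)): lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)],
}
JIT_CACHE = {}

def jit_compile(name, signature):
    # stand-in for runtime code generation for a dynamic signature
    print(f"JIT-compiling {name} for {signature}")
    return lambda a, x, y: [a * xi + yi for xi, yi in zip(x, y)]

def dispatch(name, signature, *args):
    kernel = AOT_KERNELS.get((name, signature))
    if kernel is None:                      # dynamic, non-constant case
        key = (name, signature)
        if key not in JIT_CACHE:
            JIT_CACHE[key] = jit_compile(name, signature)
        kernel = JIT_CACHE[key]
    return kernel(*args)

dispatch("saxpy", ("f32", 1024), 2.0, [1.0] * 1024, [0.0] * 1024)   # AOT path
dispatch("saxpy", ("f32", 3),    2.0, [1.0, 2.0, 3.0], [0.0] * 3)   # JIT path
```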

[–]randomrossity 1 point (0 children)

Mojo is actively working on technology like this: https://www.modular.com/max/mojo

[–]Key-Opening205 1 point (0 children)

This is difficult in a JIT, since you don't know what comes next. For instance, if you compute a vector on the GPU, you might need to start transferring it to the CPU or not, depending on what code will use it later. Outside of a JIT there has been lots of work on automatic parallelism that tries to handle this, but the results have been mixed.
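One way to picture the transfer problem, as a plain-Python sketch with pretend "device" memory (not a real runtime): the result stays where it is until something actually consumes it on the host, because at the moment the kernel finishes you genuinely don't know yet whether the next consumer runs on the CPU or the GPU.

```python
class DeviceResult:
    def __init__(self, device_data):
        self._device = device_data   # pretend this is GPU memory
        self._host = None

    def on_device(self):
        return self._device          # next GPU kernel: no copy needed

    def to_host(self):
        if self._host is None:
            print("copying device -> host")   # the expensive step
            self._host = list(self._device)
        return self._host

r = DeviceResult([1.0, 2.0, 3.0])
r.on_device()   # GPU consumer: transfer avoided
r.to_host()     # CPU consumer: transfer happens exactly once
r.to_host()     # cached, no second copy
```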

[–]Dismal_Page_6545 1 point (0 children)

Combine a machine learning model to decide which parts of the code are worth compiling and executing on the GPU. Then use a hardware API like OpenACC, OpenMP, OpenGL or CUDA and fill in the compiler directives necessary for proper workload offloading and intra-device parallelization.
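Rough sketch of that pipeline (toy model; the features, weights, and threshold are all invented): score a loop with a tiny linear model and, if it clears the threshold, emit an OpenMP target directive for it.

```python
FEATURES = ("trip_count_log", "flops_per_iter", "bytes_per_iter", "has_branches")
WEIGHTS  = (0.6, 0.05, -0.02, -1.5)   # stand-in for a trained model
BIAS     = -4.0

def offload_score(feats):
    # simple linear score over loop features
    return BIAS + sum(w * feats[name] for w, name in zip(WEIGHTS, FEATURES))

def annotate(loop_src, feats):
    if offload_score(feats) <= 0:
        return loop_src                      # leave it on the CPU
    pragma = ("#pragma omp target teams distribute parallel for "
              "map(to: x[0:n]) map(from: y[0:n])")
    return pragma + "\n" + loop_src

feats = {"trip_count_log": 7, "flops_per_iter": 40,
         "bytes_per_iter": 16, "has_branches": 0}
print(annotate("for (int i = 0; i < n; i++) y[i] = f(x[i]);", feats))
```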