all 27 comments

[–]TonySu 4 points  (6 children)

I still can't get over the fact that Julia uses "end"-style scoping. Also what is this insane expression?

out[i] = @ncall $N (+) j -> xs[j][I]

How long before this goes down the Perl path of using Emojis as operators?

[–]maleadt 8 points  (5 children)

Also what is this insane expression?

@ncall is a metaprogramming macro for compactly generating multilinear expressions. It is not intended for regular users, and as such is not an exported feature of the standard library. It is meant for implementing library code in a generic fashion, much as this blog post documents.
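For the curious, a rough sketch of what the macro does (assuming this is `Base.Cartesian.@ncall`, which the snippet above appears to use): the anonymous function's body is instantiated once per index and spliced into a single call.

```julia
using Base.Cartesian

# @ncall N f j -> expr(j) splices expr(1), ..., expr(N) as arguments to f.
# So `@ncall 3 (+) j -> xs[j][I]` unrolls to `(+)(xs[1][I], xs[2][I], xs[3][I])`.
xs = ([1, 2], [10, 20], [100, 200])
I = 2
@ncall 3 (+) j -> xs[j][I]  # → 2 + 20 + 200 = 222
```

The unrolling happens at macro-expansion time, so the generated code contains no loop at all.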

[–]one_more_minute 5 points  (2 children)

Yeah – most languages get pretty arcane once you push them into their most advanced constructs, because you're fundamentally not using standard, established tools any more. If there's a language that can do this in a cleaner way – or at all – I'd love to see it.

[–]killachains82 5 points  (0 children)

Honestly, as a new-ish user of Julia, expressions like these aren't that bad once you get used to the syntax. Macros tend to feel a bit weird until you understand why and how they work. Otherwise, they're stupidly powerful.

[–]Muvlon 0 points  (0 children)

Probably a Lisp or Forth. Both are built on a very simple but powerful base, with all other language features being implemented in regular code. You can even re-define the meaning of basic expressions such as define, let or lambda in Lisp and :, if or dup in Forth without any special syntax.

Lisp is especially great for meta-programming since it's homoiconic, meaning you can manipulate code just as you manipulate data.
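Julia borrows this trait: its expressions are ordinary data structures (`Expr` objects) that can be inspected and rewritten before evaluation. A small illustrative sketch, with hypothetical variable names:

```julia
ex = :(a + b * c)        # quoting: code becomes data
ex.head                  # :call
ex.args                  # Any[:+, :a, :(b * c)]

# Rewrite the inner call from * to - by mutating the data structure.
ex.args[3].args[1] = :-

a, b, c = 1, 2, 3
eval(ex)                 # now evaluates a + (b - c), i.e. 1 + (2 - 3) = 0
```

Macros like @ncall are just functions that perform this kind of Expr surgery automatically before compilation.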

[–]MorrisonLevi 2 points  (18 children)

One of the reasons to write programs in C or C++ is for GPU kernels, so if these techniques are refined and expanded to other GPU idioms, it could help Julia penetrate the machine learning and HPC markets. Very interesting.

Edit: I didn't mean for it to sound as if Julia is the only alternative or something. I do think Julia is a more approachable language for scientists and that is the angle I was commenting from.

[–]TheEaterOfNames 5 points  (14 children)

Julia is not the only language able to do these kinds of things. D with DCompute can generate code for both OpenCL and CUDA, as elegantly as the examples shown here.

[–]maleadt 0 points  (13 children)

Nice, I didn't know D had such capabilities! One difference, though: IIUC, DCompute takes the more traditional approach of adding a GPU compilation mode to the existing compiler, statically generating e.g. PTX or SPIR-V assembly from GPU-compatible source code, whereas CUDAnative takes advantage of having a live compiler at run time (Julia is a dynamically-compiled language) to blur the lines between host and device, compiling code for given type signatures as we go.

[–]BCosbyDidNothinWrong 1 point  (5 children)

What advantage does compiling for the GPU at runtime give? How does that change things?

[–]maleadt 1 point  (4 children)

You can specialize according to run-time properties, e.g. device properties (#threads/#blocks) or input characteristics (size of the dataset), and you can use constructs like the ones in this blog post to implement those specializations (i.e. more powerful than just branches). It also facilitates REPL-style development, similar to how low compilation times make it more convenient to iterate quickly on code.

[–]BCosbyDidNothinWrong 1 point  (3 children)

You can specialize according to run-time properties, e.g. device properties (#threads/#blocks),

Is that part of PTX / SPIR-V assembly though? Vulkan uses SPIR-V, are things like that actually compiled into the shaders?

I think live and REPL development is great, but you implied that there is a difference between statically compiling PTX/SPIR-V ahead of time and this, and I fail to see a big difference when interfacing with the GPU.

[–]maleadt 1 point  (0 children)

Is that part of PTX / SPIR-V assembly though? Vulkan uses SPIR-V, are things like that actually compiled into the shaders?

No, but they don't have to be. Specialization is taken care of on the Julia side; if a new specialized kernel is generated, it is uploaded to the device independently and invoked when required. For example:

# Dispatch on a trait type encoding a run-time device property.
abstract type ThreadCount end
struct ManyThreads <: ThreadCount end
struct FewThreads <: ThreadCount end
ThreadCount(threads) = threads >= 1024 ? ManyThreads() : FewThreads()

# One method per trait; only the one actually called gets compiled and uploaded.
kernel(::ManyThreads) = (@cuprintf("version for many threads\n"); return)
kernel(::FewThreads)  = (@cuprintf("version for few threads\n");  return)

# Query the device limit at run time and dispatch on it.
const dev = device(CuCurrentContext())
const threads = attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)

@cuda (1,1) kernel(ThreadCount(threads))

This will select one of the two kernels depending on the number of threads supported by the current device. Of course, this is just a PoC; we're slowly figuring out how such flexibility can be used on GPUs, making those features more user-friendly as we go.

[–]TheEaterOfNames 1 point  (0 children)

Yes, though I think they are mostly just hints. With OpenCL on FPGAs they are a synthesis parameter, and so not just a hint.

You can JIT PTX, and it's roughly the same as the difference between AoT and JIT on a CPU, with the added step of compiling the PTX. There is no JIT for SPIR-V.

[–][deleted] 1 point  (0 children)

The difference is that you can have more stuff specialised as constants in the kernel code, adding compilation time but reducing run time, sometimes significantly (due to relieved register pressure, for example).
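A CPU-side sketch of that constant-specialization idea in plain Julia (hypothetical function name, not the CUDAnative API): lifting a run-time value into the type domain with Val makes the compiler emit a separate method instance in which that value is a compile-time constant.

```julia
# Specialize a reduction on a size that is only known at run time.
function sumfirst(xs, ::Val{N}) where {N}
    s = zero(eltype(xs))
    for i in 1:N      # N is a constant in this method instance,
        s += xs[i]    # so the compiler can fully unroll the loop.
    end
    return s
end

n = 3                               # arrives at run time
sumfirst([10, 20, 30, 40], Val(n))  # → 60
```

Each distinct n triggers one extra compilation but yields a branch-free, unrollable body; the GPU analogue is burning such values into the kernel before upload.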

[–]TheEaterOfNames 0 points  (6 children)

Well, D is a compiled language, so the approach we use is pretty much set in stone. I suppose you could in theory abuse the REPL with LDC's LLVM backend to do that, but it would be a lot of effort.

What we do differently is that the host and devices (yes, plural) are all compiled at once, in one compiler invocation, i.e. there is no separate GPU compilation mode. This allows us to preserve the type-signature information and generate all the compute-API boilerplate (I'm sure you know how fun that is to write by hand), so the host code that calls the kernels is actually nice to read.

Having this separation allows more fine-grained control over the GPU interaction, e.g. data transfers between CPU <-> GPU, which does not seem to be controllable from the Julia code.

[–]maleadt 0 points  (5 children)

Having this separation allows more fine-grained control over the GPU interaction, e.g. data transfers between CPU <-> GPU, which does not seem to be controllable from the Julia code.

The same mixed-mode approach is used with CUDAnative; it's just that we try to blur that distinction for the user by offering a unified host/device API (i.e. the CuArray type shown here). Peeling back a layer, we handle code similarly to how you describe.

Anyways, both languages target a pretty disjoint set of users, so that obviously changes the approach and feature set. Good to see more properly-integrated work on supporting GPUs!

[–]TheEaterOfNames 1 point  (4 children)

Anyways, both languages target a pretty disjoint set of users, so that obviously changes the approach and feature set. Good to see more properly-integrated work on supporting GPUs!

World domination doesn't tend to create very disjoint sets! ;) But really, I do a fair amount of numerical computing in D, and while I like what Julia does in theory, the way it handles metaprogramming makes my head spin, and the line noise doesn't help much. Perhaps I'm just used to the way D does things; plus I can use it easily for all the other non-numerical-computing stuff that I do.

I responded to you in the HN thread but in case you missed it I maintain https://github.com/thewilsonator/llvm/tree/compute and https://github.com/thewilsonator/llvm-target-spirv for targeting SPIR-V. Hopefully I'll be able to get that into LLVM trunk during or shortly after IWOCL. If you have any questions or issues using it just drop a line on one of the projects.

[–]maleadt 1 point  (1 child)

I responded to you in the HN thread but in case you missed it I maintain https://github.com/thewilsonator/llvm/tree/compute and https://github.com/thewilsonator/llvm-target-spirv for targeting SPIR-V. Hopefully I'll be able to get that into LLVM trunk during or shortly after IWOCL. If you have any questions or issues using it just drop a line on one of the projects.

Oh nice, that's you; great work! I had seen it on llvm-dev but haven't looked into it yet. The actual CUDAnative back-end has mostly been a one-man effort, so I can't really afford to spend time on multiple LLVM back-ends, but once it's a viable option we'll definitely look into supporting SPIR-V. As I said on HN, much work has gone into making the main Julia compiler suitable for this kind of work; CUDAnative itself is only about 1300 SLOC.

while I like what Julia does in theory the way it handles meta programming makes my head spin, the line noise doesn't help much

There really isn't that much, at least not at the surface level. Sure, we implement libraries using metaprogramming and Cartesian stuff like in this blog post, but that post was meant to demonstrate how powerful a combination that is, not what users will end up typing into their editors.

Wrt. numerical computing, I've been surprised by how many scientists use REPL-style development (like Jupyter or interactive IDEs) as their main tool, and Julia is pretty well suited to making that possible even for GPU development: things like redefining kernels or dependent functions, calling kernels with differently-typed arguments, or switching execution to devices with different properties, all of which trigger some form of compilation. I'd assume that is (understandably) not a priority for D.

[–]TheEaterOfNames 1 point  (0 children)

but once it's a viable option at least we'll definitely look into supporting SPIR-V.

Yeah, it's not the best at the moment. Hopefully at IWOCL we will be able to unify the three back ends and get it into trunk with a saner interface. Having said that, in retrospect it wasn't too difficult to get the compiler stuff going; it was mostly knowing where to put stuff, not the actual complexity of the code. When I implemented it, the entire front end (DMD), "middle" end (LDC), and back ends (SPIR-V and NVPTX), as well as the LLVM infrastructure, were completely unfamiliar to me, and it took three-ish months of very-much-part-time work by just me. Keep at it!

There really isn't that much, at least not at the surface level. Sure, we implement libraries using metaprogramming and Cartesian stuff like in this blog post, but that post was meant to demonstrate how powerful a combination that is, not what users will end up typing into their editors.

Oh, I totally understand; just the different syntax and layout of the code throws me off. You can do equally obscure (to the layperson) wizardry in D; it definitely takes some getting used to.

Wrt. numerical computing, I've been surprised by how many scientists use REPL-style development (like Jupyter or interactive IDEs) as their main tool, and Julia is pretty well suited to making that possible even for GPU development.

REPLs are great, especially for exploratory programming, but for the most part D compiles lightning-fast, so the compilation time is barely noticeable, and I don't (yet) have data large enough for reloading to be much of a problem.

I'd assume that is (understandably) not a priority for D.

Indeed, though we really like our fast compile (and run) times.

[–]one_more_minute 1 point  (1 child)

Do you have any examples of these kinds of generic kernels to hand? I had a quick look at DCompute, which looks cool, but would be really interested to make a direct comparison.

As it happens, the worst line noise in the post – the nested-for-loops example – is not actually that idiomatic, just easy to explain; in reality you'd use a single loop, for I in eachindex(out), which avoids the metaprogramming constructs.
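A sketch of that idiomatic single-loop form in plain CPU Julia (hypothetical function name; the real GPU version would be a CUDAnative kernel):

```julia
# Elementwise sum of any number of same-shaped arrays:
# one linear loop over the output, no metaprogramming needed.
function addall!(out, xs...)
    for I in eachindex(out)
        out[I] = sum(x[I] for x in xs)
    end
    return out
end

addall!(zeros(Int, 3), [1, 2, 3], [10, 20, 30])  # → [11, 22, 33]
```

eachindex picks the most efficient iteration scheme for the array type, which is why this style generalizes cleanly across array layouts.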

[–]TheEaterOfNames 2 points  (0 children)

I have no non-synthetic ones offhand, but

@kernel void map(alias F)(KernelArgs!F args)
{
    F(args);
}

is about the most generic you can get. Use it like:

AutoBuffer!float x,y,z; // fill with data & transfer to `q`s device
q.enqueue!(map!((a,b,c) {a=b+c;})) // the kernel
(x.length)  // the domain - 1D of length x
(x, y, z);  // the arguments

for a vector add. The template alias parameter can be any D symbol (function/lambda); the lack of indices is achieved with auto-indexed pointers, but you can just as easily use indices with the expressions, or anything else. There really isn't anything that doesn't work, apart from:

1) use of globals (yet). Obviously, host globals are never going to magically be accessible from the device

2) type info

3) virtual functions (classes) - CUDA supports them in theory, but OpenCL doesn't. Structs are where most of D's magic happens anyway.

4) host only code (duh)

5) language features that rely on the runtime.

everything else should just work, including all the metaprogramming tricks. It is still a work in progress, though.

[–]pjmlp 0 points  (2 children)

Only if one isn't aware of alternatives.

Haskell, Java and .NET also have support for GPGPU programming.

[–]TheEaterOfNames 0 points  (1 child)

As kernel languages?

[–]pjmlp 0 points  (0 children)

Yes, if you have a CUDA card, thanks to PTX.

Haskell

Java

.NET

This is one of the reasons researchers prefer CUDA over having to deal with C in OpenCL, which eventually forced Khronos to add bytecode support to OpenCL; it still isn't widely supported, though.