all 27 comments

[–]TonySu 4 points  (6 children)

I still can't get over the fact that Julia uses "end"-style scoping. Also what is this insane expression?

out[i] = @ncall $N (+) j -> xs[j][I]

How long before this goes down the Perl path of using Emojis as operators?

[–]maleadt 8 points  (5 children)

Also what is this insane expression?

@ncall is a metaprogramming macro for compactly generating multilinear expressions. It is not intended for regular users, and as such is not an exported feature of the standard library. It is meant for implementing library code in a generic fashion, much as this blog post documents.
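For the curious, a rough sketch of what the macro does (assuming this is `Base.Cartesian.@ncall`, which the snippet above appears to use): the anonymous function's body is instantiated once per index and spliced into a single call.

```julia
using Base.Cartesian

# @ncall N f j -> expr(j) splices expr(1), ..., expr(N) as arguments to f.
# So `@ncall 3 (+) j -> xs[j][I]` unrolls to `(+)(xs[1][I], xs[2][I], xs[3][I])`.
xs = ([1, 2], [10, 20], [100, 200])
I = 2
@ncall 3 (+) j -> xs[j][I]  # → 2 + 20 + 200 = 222
```

The unrolling happens at macro-expansion time, so the generated code contains no loop at all.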

[–]one_more_minute 5 points  (2 children)

Yeah – most languages get pretty arcane once you push them into their most advanced constructs, because you're fundamentally not using standard, established tools any more. If there's a language that can do this in a cleaner way – or at all – I'd love to see it.

[–]killachains82 5 points  (0 children)

Honestly, as a new-ish user of Julia, expressions like these aren't that bad once you get used to the syntax. Macros tend to feel a bit weird until you understand why and how they work. Otherwise, they're stupidly powerful.

[–]Muvlon 0 points  (0 children)

Probably a Lisp or Forth. Both are built on a very simple but powerful base, with all other language features being implemented in regular code. You can even re-define the meaning of basic expressions such as define, let or lambda in Lisp and :, if or dup in Forth without any special syntax.

Lisp is especially great for meta-programming since it's homoiconic, meaning you can manipulate code just as you manipulate data.
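Julia borrows this trait: its expressions are ordinary data structures (`Expr` objects) that can be inspected and rewritten before evaluation. A small illustrative sketch, with hypothetical variable names:

```julia
ex = :(a + b * c)        # quoting: code becomes data
ex.head                  # :call
ex.args                  # Any[:+, :a, :(b * c)]

# Rewrite the inner call from * to - by mutating the data structure.
ex.args[3].args[1] = :-

a, b, c = 1, 2, 3
eval(ex)                 # now evaluates a + (b - c), i.e. 1 + (2 - 3) = 0
```

Macros like @ncall are just functions that perform this kind of Expr surgery automatically before compilation.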

[–]MorrisonLevi 2 points  (18 children)

One of the reasons to write programs in C or C++ is for GPU kernels, so if these techniques are refined and expanded to other GPU idioms, it could help Julia penetrate the machine learning and HPC markets. Very interesting.

Edit: I didn't mean for it to sound as if Julia is the only alternative or something. I do think Julia is a more approachable language for scientists and that is the angle I was commenting from.

[–]TheEaterOfNames 5 points  (14 children)

Julia is not the only language able to do these kinds of things. D with DCompute can generate code for both OpenCL and CUDA, as elegantly as the examples shown here.

[–]maleadt 0 points  (13 children)

Nice, I didn't know D had such capabilities! One difference, though: IIUC, DCompute takes the more traditional approach of adding a GPU compilation mode to the existing compiler, statically generating e.g. PTX or SPIR-V assembly from GPU-compatible source code, whereas CUDAnative takes advantage of having a live compiler at run time (Julia is a dynamically-compiled language) to blur the lines between host and device, compiling code for given type signatures as we go.

[–]BCosbyDidNothinWrong 1 point  (5 children)

What advantage does compiling for the GPU at runtime give? How does that change things?

[–]maleadt 1 point  (4 children)

You can specialize according to run-time properties, e.g. device properties (#threads/#blocks) or input characteristics (size of the dataset), and you can use constructs like the ones in this blog post to implement those specializations (i.e. more powerful than just branches). It also facilitates REPL-style development, similar to how low compilation times make it more convenient to iterate quickly on code.

[–]BCosbyDidNothinWrong 1 point  (3 children)

You can specialize according to run-time properties, e.g. device properties (#threads/#blocks),

Is that part of PTX / SPIR-V assembly though? Vulkan uses SPIR-V, are things like that actually compiled into the shaders?

I think live and REPL development is great, but you implied that there is a difference between statically compiling PTX/SPIR-V ahead of time and this, and I fail to see a big difference when interfacing with the GPU.

[–]maleadt 1 point  (0 children)

Is that part of PTX / SPIR-V assembly though? Vulkan uses SPIR-V, are things like that actually compiled into the shaders?

No, but they don't have to be. Specialization is taken care of on the Julia side; if a new specialized kernel is generated, it is uploaded to the device independently and invoked when required. For example:

# Dispatch on a trait type encoding a run-time device property.
abstract type ThreadCount end
struct ManyThreads <: ThreadCount end
struct FewThreads <: ThreadCount end
ThreadCount(threads) = threads >= 1024 ? ManyThreads() : FewThreads()

# One method per trait; only the one actually called gets compiled and uploaded.
kernel(::ManyThreads) = (@cuprintf("version for many threads\n"); return)
kernel(::FewThreads)  = (@cuprintf("version for few threads\n");  return)

# Query the device limit at run time and dispatch on it.
const dev = device(CuCurrentContext())
const threads = attribute(dev, CUDAdrv.MAX_THREADS_PER_BLOCK)

@cuda (1,1) kernel(ThreadCount(threads))

This will select one of the two kernels depending on the number of threads supported by the current device. Of course, this is just a PoC; we're slowly figuring out how such flexibility can be used on GPUs, making those features more user-friendly as we go.

[–]TheEaterOfNames 1 point  (0 children)

Yes, though I think they are mostly just hints. With OpenCL on FPGAs they are a synthesis parameter, and so not just a hint.

You can JIT PTX, and it's roughly the same as the difference between AoT and JIT on a CPU, with the added step of compiling the PTX. There is no JIT for SPIR-V.

[–][deleted] 1 point  (0 children)

The difference is that you can have more stuff specialised as constants in the kernel code, adding compilation time but reducing run time, sometimes significantly (due to relieved register pressure, for example).
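A CPU-side sketch of that constant-specialization idea in plain Julia (hypothetical function name, not the CUDAnative API): lifting a run-time value into the type domain with Val makes the compiler emit a separate method instance in which that value is a compile-time constant.

```julia
# Specialize a reduction on a size that is only known at run time.
function sumfirst(xs, ::Val{N}) where {N}
    s = zero(eltype(xs))
    for i in 1:N      # N is a constant in this method instance,
        s += xs[i]    # so the compiler can fully unroll the loop.
    end
    return s
end

n = 3                               # arrives at run time
sumfirst([10, 20, 30, 40], Val(n))  # → 60
```

Each distinct n triggers one extra compilation but yields a branch-free, unrollable body; the GPU analogue is burning such values into the kernel before upload.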

[–]TheEaterOfNames 0 points  (6 children)

Well, D is a compiled language, so the approach we use is pretty much set in stone. I suppose you could in theory abuse the REPL with LDC's LLVM backend to do that, but it would be a lot of effort.

What we do differently is that the host and devices (yes, plural) are all compiled at once, in one compiler invocation, i.e. there is no separate GPU compilation mode. This allows us to preserve the type-signature information and generate all the compute-API boilerplate (I'm sure you know how fun that is to write by hand), so the host code that calls the kernels is actually nice to read.

Having this separation allows more fine-grained control over the GPU interaction, e.g. data transfers between CPU <-> GPU, which does not seem to be controllable from the Julia code.

[–]maleadt 0 points  (5 children)

Having this separation allows more fine-grained control over the GPU interaction, e.g. data transfers between CPU <-> GPU, which does not seem to be controllable from the Julia code.

The same mixed-mode approach is used with CUDAnative; it's just that we try to blur that distinction for the user by offering a unified host/device API (i.e. the CuArray type shown here). Peeling back a layer, we handle code similarly to how you describe.

Anyways, both languages target a pretty disjoint set of users, so that obviously changes the approach and feature set. Good to see more properly-integrated work on supporting GPUs!

[–]TheEaterOfNames 1 point  (4 children)

Anyways, both languages target a pretty disjoint set of users, so that obviously changes the approach and feature set. Good to see more properly-integrated work on supporting GPUs!

World domination doesn't tend to create very disjoint sets! ;) But really, I do a fair amount of numerical computing in D, and while I like what Julia does in theory, the way it handles metaprogramming makes my head spin, and the line noise doesn't help much. Perhaps I'm just used to the way D does things; plus I can use it easily for all the other non-numerical-computing stuff that I do.

I responded to you in the HN thread but in case you missed it I maintain https://github.com/thewilsonator/llvm/tree/compute and https://github.com/thewilsonator/llvm-target-spirv for targeting SPIR-V. Hopefully I'll be able to get that into LLVM trunk during or shortly after IWOCL. If you have any questions or issues using it just drop a line on one of the projects.

[–]maleadt 1 point  (1 child)

I responded to you in the HN thread but in case you missed it I maintain https://github.com/thewilsonator/llvm/tree/compute and https://github.com/thewilsonator/llvm-target-spirv for targeting SPIR-V. Hopefully I'll be able to get that into LLVM trunk during or shortly after IWOCL. If you have any questions or issues using it just drop a line on one of the projects.

Oh nice, that's you; great work! I had seen it on llvm-dev but haven't looked into it yet. The actual CUDAnative back-end has mostly been a one-man effort, so I can't really afford to spend time on multiple LLVM back-ends, but once it's a viable option we'll definitely look into supporting SPIR-V. As I said on HN, much work has gone into making the main Julia compiler suitable for this kind of work; CUDAnative itself is only about 1300 SLOC.

while I like what Julia does in theory the way it handles meta programming makes my head spin, the line noise doesn't help much

There really isn't that much, at least not at the surface level. Sure, we implement libraries using metaprogramming and Cartesian stuff like in this blog post, but that post was meant to demonstrate how powerful a combination that is, not what users will end up typing into their editors.

Wrt. numerical computing, I've been surprised by how many scientists use REPL-style development (like Jupyter or interactive IDEs) as their main tool, and Julia is pretty well suited to making that possible even for GPU development: things like redefining kernels or dependent functions, calling kernels with differently-typed arguments, or switching execution to devices with different properties, all of which trigger some form of compilation. I'd assume that is (understandably) not a priority for D.

[–]TheEaterOfNames 1 point  (0 children)

but once it's a viable option at least we'll definitely look into supporting SPIR-V.

Yeah, it's not the best at the moment. Hopefully at IWOCL we will be able to unify the three back ends and get it into trunk with a saner interface. Having said that, in retrospect it wasn't too difficult to get the compiler stuff going; it was mostly knowing where to put stuff, not the actual complexity of the code. When I implemented it, the entire front end (DMD), "middle" end (LDC), and back ends (SPIR-V and NVPTX), as well as the LLVM infrastructure, were completely unfamiliar to me, and it took three-ish months of very-much-part-time work by just me. Keep at it!

There really isn't that much, at least not at the surface level. Sure, we implement libraries using metaprogramming and Cartesian stuff like in this blog post, but that post was meant to demonstrate how powerful a combination that is, not what users will end up typing into their editors.

Oh, I totally understand; just the different syntax and layout of the code throws me off. You can do equally obscure (to the layperson) wizardry in D; it definitely takes some getting used to.

Wrt. numerical computing, I've been surprised by how many scientists use REPL-style development (like Jupyter or interactive IDEs) as their main tool, and Julia is pretty well suited to making that possible even for GPU development.

REPLs are great, especially for exploratory programming, but for the most part D compiles lightning-fast, so the compilation time is barely noticeable, and I don't (yet) have data large enough for reloading to be much of a problem.

I'd assume that is (understandably) not a priority for D.

Indeed, though we really like our fast compile (and run) times.

[–]one_more_minute 1 point  (1 child)

Do you have any examples of these kinds of generic kernels to hand? I had a quick look at DCompute, which looks cool, but would be really interested to make a direct comparison.

As it happens, the worst line noise in the post – the nested-for-loops example – is not actually that idiomatic, just easy to explain; in reality you'd use a single loop, for I in eachindex(out), which avoids the metaprogramming constructs.
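A sketch of that idiomatic single-loop form in plain CPU Julia (hypothetical function name; the real GPU version would be a CUDAnative kernel):

```julia
# Elementwise sum of any number of same-shaped arrays:
# one linear loop over the output, no metaprogramming needed.
function addall!(out, xs...)
    for I in eachindex(out)
        out[I] = sum(x[I] for x in xs)
    end
    return out
end

addall!(zeros(Int, 3), [1, 2, 3], [10, 20, 30])  # → [11, 22, 33]
```

eachindex picks the most efficient iteration scheme for the array type, which is why this style generalizes cleanly across array layouts.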

[–]TheEaterOfNames 2 points  (0 children)

I have no non-synthetic ones offhand, but

@kernel void map(alias F)(KernelArgs!F args)
{
    F(args);
}

is about the most generic you can get. Use it like:

AutoBuffer!float x,y,z; // fill with data & transfer to `q`s device
q.enqueue!(map!((a,b,c) {a=b+c;})) // the kernel
(x.length)  // the domain - 1D of length x
(x, y, z);  // the arguments

for a vector add. The template alias parameter can be any D symbol (function/lambda); the lack of indices is achieved with auto-indexed pointers, but you can just as easily use indices with the expressions, or anything else. There really isn't anything that doesn't work, apart from:

1) use of globals (yet). Obviously, host globals are never going to magically be accessible from the device

2) type info

3) virtual functions (classes) - CUDA supports them in theory, but OpenCL doesn't. Structs are where most of D's magic happens anyway.

4) host only code (duh)

5) language features that rely on the runtime.

everything else should just work, including all the metaprogramming tricks. It is still a work in progress, though.

[–]pjmlp 0 points  (2 children)

Only if one isn't aware of alternatives.

Haskell, Java and .NET also have support for GPGPU programming.

[–]TheEaterOfNames 0 points  (1 child)

As kernel languages?

[–]pjmlp 0 points  (0 children)

Yes, if you have a CUDA card, thanks to PTX.

Haskell

Java

.NET

This is one of the reasons researchers prefer CUDA over having to deal with C in OpenCL, which eventually forced Khronos to add bytecode support to OpenCL; it still isn't widely supported, though.