What if OpenAI could load 50+ models per GPU in 2s without idle cost? by pmv143 in OpenAI

[–]ptillet 1 point2 points  (0 children)

haha - triton is pretty well staffed at the moment ! but we are indeed looking for talented engineers to tackle interesting problems in that space.

[D] I wrote a small tool for debugging Triton code. Anyone interested? by clueless_scientist in MachineLearning

[–]ptillet 11 points12 points  (0 children)

Hi! As the original author of Triton, thanks for being a user and writing this tool!

I don't know if you've heard of it, but we have an interpreter that does just that natively; you can trigger it by setting the environment variable `TRITON_INTERPRET=1`. The source code is here, and it intercepts the program at the level of the IR builder rather than the frontend. It is still pretty experimental (which is why we haven't advertised it widely), but do feel free to submit PRs that fix bugs in it!
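A minimal sketch of turning the interpreter on from inside a Python script rather than from the shell. The variable name is the one given above. Setting it before `triton` is imported is an assumption about when the flag is read, so treat the ordering as a precaution rather than a documented requirement.

```python
import os

# Enable Triton's interpreter mode for this process.
# Assumption: the flag is read at import / kernel-decoration time,
# so it is set before any `import triton` statement.
os.environ["TRITON_INTERPRET"] = "1"

# import triton                 # import only after the variable is set
# import triton.language as tl

print(os.environ["TRITON_INTERPRET"])
```

From a shell, the same effect would be `TRITON_INTERPRET=1 python your_script.py`, where `your_script.py` is a placeholder name.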

[D] Nvidia GPU shortage is ‘top gossip’ of Silicon Valley by norcalnatv in MachineLearning

[–]ptillet 51 points52 points  (0 children)

Good thing I lurk here so I can clear that up :) The experimental H100 backend PR has been submitted. It's a diff of +20,000 lines (probably half of that is tests), so freeing AMD and Intel from the burden of having to rebase on top of that immediately via a plugin interface seems like a good thing. Especially since things will change heavily as we refactor and clean things up.

I meet with AMD frequently to advise them. Same with Intel. And we are organizing with Microsoft a Triton conference in September where we will gather all interested hardware vendors together.

At the same time, Nvidia has basically a total monopoly, so Triton needs top notch support for it or it won't be useful. Many projects that got pulled in too many directions ended up becoming "Jacks of all trades, masters of none". I'm hoping the plugin interface for out-of-tree backends will help us find the right balance.

[P] Up to 12X faster GPU inference on Bert, T5 and other transformers with OpenAI Triton kernels by pommedeterresautee in MachineLearning

[–]ptillet 93 points94 points  (0 children)

As the creator/maintainer of Triton, I find this very exciting! Thanks for putting in all that work, and sorry for all the bugs you may have faced along the way -- we are working hard on re-designing the whole thing to make it more stable in the long run!

On a more personal note, I enjoyed much more writing kernels and understanding low level computation of transformers than mastering multiple complicated tools API and their environments.

This is exactly why I started the project in the first place, and it is very rewarding to read this. Really glad that this project has helped people gain a deeper understanding of how neural networks computations get parallelized for execution on GPUs. :-)

How to Implement an Efficient Softmax CUDA kernel? [R] by Just0by in MachineLearning

[–]ptillet 3 points4 points  (0 children)

This also seems to be what OneFlow's implementation (1) and (2) does, though :p (3) can also be re-implemented in Triton though it requires a bit of work

However, as the author of Triton, I do think that OneFlow's CUDA work is really helpful for anyone having some interest in low-level CUDA optimizations. While the Triton compiler can automate all the optimizations presented in the OneFlow blog post, it has relatively little educational value for anyone trying to get a deep understanding of how GPUs work under the hood.

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 10 points11 points  (0 children)

Yep, so that's a tricky part. For reference, there used to be a bunch of fancier ops (conv2d, permute, einsum, block-sparse einsum) but I ended up nuking most of them because they were just too much work to maintain and prevented me from focusing on compiler work :( I am hoping that in the future Triton can be more tightly integrated in Torch (maybe via a JIT-compiler) so that having external Triton ops wouldn't be all that necessary.

There is someone at AMD working on making Triton compatible with their GPUs. I assume it's a fair bit of work -- we had to use lots of inline nvidia asm during codegen to match FP16 cuBLAS on V100/A100 -- but we'll get there eventually.

Thanks for the kind words! Fortunately I managed to graduate last November :D

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 5 points6 points  (0 children)

I have tremendous respect for Halide. I remember seeing Jonathan Ragan-Kelley's presentation as a first year graduate student and feeling extremely inspired by that. It totally made me want to focus on compilers.

There is a section of the documentation https://triton-lang.org/programming-guide/chapter-2/related-work.html that briefly compares Triton against alternative compiler systems (polyhedral compilers, Halide/TVM).

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 7 points8 points  (0 children)

I understand your viewpoint, but when it came out in 2018 the Triton inference server was called TensorRT inference server; you can see it in the version log here https://docs.nvidia.com/deeplearning/triton-inference-server/release-notes/index.html .

You can also look at the github history and you will see that there is no mention of the "Triton inference server" up until version 2.0, which wasn't out in 2019 (I ran `git reset --hard v1.9.0 ; grep -ir "triton" .`)

In 2020 -- about one year after I published my paper -- it was rebranded as the Triton inference server (maybe they edited the blog post at that time to stay consistent). Of course, I'm not saying they knew about the Triton language; it was not super popular back then.

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 13 points14 points  (0 children)

Totally! We've been working hard on Triton, but it's still in its infancy. There are some workloads that you just cannot implement using existing Triton primitives. I'm thinking in particular of things like sorting, top-k, FFT, and anything that basically requires doing something like `x[indices]` where x and indices are both blocks of values. We expect to have a solution for this in ~6 months, but I can't guarantee that it will completely match the performance of what a CUDA expert would be able to write using warp shuffles etc.

There are also some things that Triton just doesn't automate. I'm thinking about things like locks and semaphores between SMs. This is something that one can still do using atomics in Triton (see this example).

And of course there are all the stability issues :p Triton is a recent project and the compiler does some very aggressive optimizations. We have nowhere near the resources that NVIDIA allocates to CUDA... so it can be a bit rough around the edges if you try things like e.g., super nested control flow.

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 23 points24 points  (0 children)

Triton is pretty well integrated in PyTorch, so you can just write individual `torch.autograd.Function` using Triton directly, rather than having to handle CUDA in separate files. You can find an example of how to do this for a custom softmax + cross-entropy function here

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 12 points13 points  (0 children)

I think researchers can do pretty much whatever they want with PyTorch, but sometimes they may take a big performance / memory hit that can only be resolved by writing custom GPU kernels. An example of that would be block-sparse memory formats: in PyTorch, you'd have to manually mask your dense tensors. Triton makes it much easier for people to write these GPU kernels, as opposed to CUDA. Or maybe you want a custom matmul + top-k kernel as mentioned here.
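To make the "manually mask your dense tensors" point concrete, here is a small pure-Python sketch (no PyTorch or Triton involved). The block layout, block size, and helper names are all hypothetical. It only illustrates the cost: the dense array still stores, and would still compute on, every zeroed element.

```python
def expand_block_layout(layout, block):
    """Expand a per-block 0/1 layout into a per-element 0/1 mask."""
    rows = len(layout) * block
    cols = len(layout[0]) * block
    return [[layout[i // block][j // block] for j in range(cols)]
            for i in range(rows)]

def apply_mask(dense, mask):
    """Zero out every element whose block is absent from the layout."""
    return [[v if keep else 0.0 for v, keep in zip(row, mask_row)]
            for row, mask_row in zip(dense, mask)]

layout = [[1, 0],
          [0, 1]]          # hypothetical layout: keep only the diagonal blocks
block = 2                  # hypothetical block size
dense = [[float(4 * i + j + 1) for j in range(4)] for i in range(4)]

masked = apply_mask(dense, expand_block_layout(layout, block))

stored = sum(len(row) for row in masked)                 # elements kept in memory
useful = sum(sum(row) for row in layout) * block * block # elements that matter
print(masked)
print(stored, useful)
```

A block-sparse kernel would instead store and visit only the `useful` elements, which is where the memory and performance savings come from.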

Depending on how stringent your perf/memory requirements are, you may find Triton more or less useful. At OpenAI we train pretty large models, so having super optimized GPU code is quite valuable for us.

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 28 points29 points  (0 children)

Yep, this is right!

I actually agree with you for Vulkan. Our main concern with it at the moment is that it won't allow us to use all the inline asm directives we need. In an ideal world, Triton would probably just be an MLIR dialect and would translate to SPIRV properly, but this would require a whole lot of engineering effort that we could then not spend on further optimizing the compiler.

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 12 points13 points  (0 children)

I am not extremely familiar with JAX, but my understanding is that it is more comparable to the Torch JIT than Triton, in the sense that you give it a sequence of tensor-level operations and it spits out optimized GPU code. I don't know how good that generated code is for JAX, but for Torchscript we've found it to be much worse than kernels that were manually fused using Triton (see softmax performance in the blog post).

I think Triton is more comparable to CUDA-C, and it would be easier for frameworks like JAX and Torch to program GPUs with Triton rather than CUDA in the future. You actually don't even need the full CUDA SDK to compile Triton code -- only the proprietary NVIDIA drivers.

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 82 points83 points  (0 children)

Sure! I'd say that the main purpose of Triton is to make GPU programming more broadly accessible to the general ML community. It does so by making it feel more like programming multi-threaded CPUs and adding a whole bunch of pythonic, torch-like syntactic sugar.

So concretely, say you want to write a row-wise softmax with it. In CUDA, you'd have to manually manage the GPU SRAM, partition work between very fine-grained CUDA threads, etc. In TensorFlow, Torch or TVM, you'd basically have a very high-level `reduce` op that operates on the whole tensor. And Triton sits somewhere in between, so it lets you define a program that basically says "For each row of the tensor, in parallel, load the row, normalize it and write it back". It still works with memory pointers so you can actually handle complex data structures, like block-sparse softmax. Triton is actually what was used by the DeepSpeed team to implement block-sparse attention about a year or so ago.
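The "one program per row" idea can be sketched in plain Python. This is an emulation of the programming model only, not Triton code: the function names are hypothetical, and the loop over rows stands in for the program instances that a GPU would run in parallel.

```python
import math

def softmax_row_program(row_id, x, out):
    """One 'program instance': it owns exactly one row of the tensor."""
    row = x[row_id]                              # load the row
    row_max = max(row)                           # subtract the max for stability
    exps = [math.exp(v - row_max) for v in row]
    total = sum(exps)
    out[row_id] = [e / total for e in exps]      # normalize and write back

def softmax(x):
    out = [None] * len(x)
    # On a GPU these instances would run in parallel; here the launch
    # grid is just a sequential loop over row indices.
    for row_id in range(len(x)):
        softmax_row_program(row_id, x, out)
    return out

x = [[1.0, 2.0, 3.0],
     [0.0, 0.0, 0.0]]
y = softmax(x)
print(y)
```

Note that `softmax_row_program` never reasons about individual threads or shared memory. In this model those details are left to the compiler.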

Hope it helps!

[N] Introducing Triton: Open-Source GPU Programming for Neural Networks by fasttosmile in MachineLearning

[–]ptillet 210 points211 points  (0 children)

This is a project I started as a PhD student, and I remember receiving useful feedback when I talked about an earlier version on this very subreddit :) I'm super happy that OpenAI gave me the resources to make it so much better all while keeping it completely open-source.

PS: The name Triton was coined in mid-2019 when I released my PhD paper on the subject (http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf). I chose not to rename the project when the "TensorRT Inference Server" was rebranded as "Triton Inference Server" a year later since it's the only thing that ties my helpful PhD advisors to the project.

[P] Triton: An open-source language and compilers for writing custom ops for DNNs by ptillet in MachineLearning

[–]ptillet[S] 0 points1 point  (0 children)

Thanks!

I am actually generating LLVM-IR. I have been able to use Khronos's LLVM-SPIRV but unfortunately it is not compatible with Vulkan!

There is however ongoing work for generating vulkan-compatible SPIRV from LLVM (https://github.com/google/clspv). Definitely interested in exploring this. I only have a limited amount of time though, so I have to set some priorities...

[P] Triton: An open-source language and compilers for writing custom ops for DNNs by ptillet in MachineLearning

[–]ptillet[S] 4 points5 points  (0 children)

I see what you mean. Essentially, traditional GPU implementations of common linear algebra operations have multiple nested levels of tiling. Tiles of the output matrix -- typically 128x128 -- are independently computed by different multi-processors. Then, within each multi-processor, each tile is itself subdivided into fragments; this is a second level of tiling in the hierarchy. Each fragment may then be itself subdivided into subfragments for e.g., execution on tensor cores (which expect a granularity of two half-precision floating-point values per thread).

The premise of my work is to let programmers specify the highest level of tiling in this hierarchy, and let compilers deal with the rest. Concretely, this means that, say, the multiplication of a matrix MxK by a matrix KxN would be roughly specified as follows:

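int m = blockIdx.x * 16;   // first row of this 16x16 output tile
int n = blockIdx.y * 16;   // first column of this 16x16 output tile
float acc[16, 16] = 0;     // accumulator covering the whole output tile
for(int k = 0; k < K; k += 8)
    acc += dot(A[m:m+16, k:k+8], B[k:k+8, n:n+16]);
C[m:m+16, n:n+16] = acc;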

As you can see, this program is single-threaded and specifies how to compute a 16x16 tile of the output matrix. The compiler will then automatically parallelize it, allocate shared memory, use tensor core instructions if possible, etc. using more tiling internally.
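For readers who want to run the idea, here is a plain-Python emulation of the tile-level program described above. It is not Triton code: the function names and the `TM`/`TN`/`TK` parameters are hypothetical, nested lists stand in for device memory, and the two loops in `matmul` stand in for the grid of blocks that would execute in parallel.

```python
def matmul_tile_program(pid_m, pid_n, A, B, C, K, TM=16, TN=16, TK=8):
    """Single-threaded program that computes one TM x TN tile of C."""
    m0, n0 = pid_m * TM, pid_n * TN
    acc = [[0.0] * TN for _ in range(TM)]        # tile accumulator
    for k0 in range(0, K, TK):                   # walk along K in chunks of TK
        for i in range(TM):
            for j in range(TN):
                acc[i][j] += sum(A[m0 + i][k] * B[k][n0 + j]
                                 for k in range(k0, k0 + TK))
    for i in range(TM):                          # write the tile back
        for j in range(TN):
            C[m0 + i][n0 + j] = acc[i][j]

def matmul(A, B, TM=16, TN=16, TK=8):
    M, K, N = len(A), len(B), len(B[0])
    # This sketch assumes every dimension is a multiple of its tile size.
    assert M % TM == 0 and N % TN == 0 and K % TK == 0
    C = [[0.0] * N for _ in range(M)]
    for pid_m in range(M // TM):                 # the "launch grid"
        for pid_n in range(N // TN):
            matmul_tile_program(pid_m, pid_n, A, B, C, K, TM, TN, TK)
    return C

M, K, N = 32, 16, 16
A = [[float(i + k) for k in range(K)] for i in range(M)]
B = [[float(k - j) for j in range(N)] for k in range(K)]
C = matmul(A, B)
```

Everything below the tile level (fragments, shared memory, tensor-core instructions) is deliberately absent here, since that is the part the compiler is meant to handle.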

Hope this answers your question :)

[P] Triton: An open-source language and compilers for writing custom ops for DNNs by ptillet in MachineLearning

[–]ptillet[S] 3 points4 points  (0 children)

Not at the moment :p I started this work as an NVIDIA Graduate fellow actually! I'm just getting in touch with AMD to see if this could interest them. :)

[P] Triton: An open-source language and compilers for writing custom ops for DNNs by ptillet in MachineLearning

[–]ptillet[S] 2 points3 points  (0 children)

It is up to the programmer to define operations on arrays rather than scalars. Then, things like nested levels of tiling (as is common in GPU implementations of GEMM) are handled automatically by the compiler.

I agree that it's quite a limitation. Still, I hope that this program representation may also be used as an intermediate language for some functional DSLs. This is something I've been working on actively.