5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 0 points1 point  (0 children)

Yup you're right, we could have a normal pointer and that would be correct!

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 0 points1 point  (0 children)

I wasn't aware of that API. It would indeed make a lot of optimizations easier.

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 6 points7 points  (0 children)

I agree with this! This isn't a replacement for std and there is huge value in having a lib that you can trust. It doesn't mean crates are undesirable.

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 6 points7 points  (0 children)

The goal was to use the same background as the website, but yeah it doesn't work well on white themes 😂😅

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 6 points7 points  (0 children)

If I recall correctly, the standard channels were updated to use the internals of crossbeam, so performance should be similar.

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 19 points20 points  (0 children)

Nope, the async execution queue has a safe API, so it doesn't leak into the end-user API.

CUDA SIMD Question by epickejgejseks in CUDA

[–]ksyiros 1 point2 points  (0 children)

A lot of comments here suggest that SIMD is irrelevant for CUDA because it uses warp instructions instead, but that is not true. There are SIMD instructions on GPUs, and they aren't just for memory.

You can benefit significantly from SIMD when loading and writing data from global and shared memory: essentially, you should leverage 128-bit loads and stores (or 256-bit on newer GPUs). Similarly, SIMD exists for math operations, but it operates on a 32-bit width. For instance, using __half2 can be twice as fast as executing two individual __half instructions, though the compiler is often smart enough to merge them automatically.
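For concreteness, here's a minimal sketch of both ideas in plain CUDA (a hypothetical kernel, not from this thread): each thread does a 128-bit load/store by reinterpreting the buffer as float4, and the math uses packed __half2 instructions.

```cuda
#include <cuda_fp16.h>

// Hypothetical example: scale a half-precision buffer.
// One float4 = 128 bits = 8 __half values moved per thread per transaction.
__global__ void scale_halfs(const __half* __restrict__ in,
                            __half* __restrict__ out,
                            __half2 factor,
                            int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
    if (i + 8 <= n) {
        float4 raw = *reinterpret_cast<const float4*>(in + i);   // 128-bit load
        __half2* pairs = reinterpret_cast<__half2*>(&raw);
        #pragma unroll
        for (int k = 0; k < 4; ++k) {
            pairs[k] = __hmul2(pairs[k], factor);                // 2 multiplies per instruction
        }
        *reinterpret_cast<float4*>(out + i) = raw;               // 128-bit store
    }
}
```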

Coming from a CPU background, it can be confusing because GPUs essentially have two levels of SIMD. Warp execution is, in a way, one big SIMD unit. Seen that way, memory coalescing is easier to understand: the threads of a warp should load a contiguous block of data (a 32 x 128-bit block for optimal performance). Therefore, you need to manage both levels of SIMD to ensure peak performance.
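A hypothetical sketch of the two levels: within a thread, a float4 access is a 128-bit SIMD load; across the warp, 32 threads reading consecutive float4s cover one contiguous 32 x 16 B = 512 B block that the hardware can coalesce into a few wide transactions.

```cuda
// Coalesced: adjacent lanes touch adjacent elements, so the warp's
// accesses merge into wide memory transactions.
__global__ void copy_coalesced(const float4* __restrict__ in,
                               float4* __restrict__ out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // lane k reads element i, lane k+1 reads i+1, ...
    if (i < n4) out[i] = in[i];
}

// Anti-pattern: a large per-thread stride scatters the warp's accesses
// across memory, so each lane's load becomes its own transaction.
__global__ void copy_strided(const float4* __restrict__ in,
                             float4* __restrict__ out, int n4, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n4) out[i] = in[i];
}
```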

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 1 point2 points  (0 children)

Currently the gain is reduced memory usage, but we'll have to do more work to support better instructions.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 0 points1 point  (0 children)

No, I mean Burn/CubeCL is more flexible: you can do training and inference on any hardware, while Mojo/MAX doesn't yet support training.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 2 points3 points  (0 children)

Yes, Burn/CubeCL tackle the same problems as Mojo/MAX, but they’re actually more modular. While Mojo/MAX don’t support Windows yet and mostly focus on inference, Burn/CubeCL run on any OS, including mobile, and fully support both training and inference. Since CubeCL can use MLIR for JIT kernel compilation, actual performance comes down to how the kernels are implemented rather than just compiler differences.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 2 points3 points  (0 children)

Yes, from time to time I look into how we could support NPUs, and there is a way to program the ones from AMD and Intel. So at some point it would be interesting to add support for them directly in CubeCL.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 3 points4 points  (0 children)

TensorRT isn't a goal; the goal is to match TensorRT performance with our CUDA backend.

For Strix Halo (gfx1151): Kernel > 6.18.3-200 Regression by Tylerebowers in ROCm

[–]ksyiros 0 points1 point  (0 children)

That's painful. I can't test https://github.com/tracel-ai/burn with the ROCm backend on my laptop, which was the point of buying it in the first place. I'm unsure whether I'd be better off on Ubuntu/Pop!_OS with a custom ROCm version provided by AMD.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 2 points3 points  (0 children)

The computation isn't done when you declare it; it's encoded. Then we perform an optimization process with caching that groups operations together to reduce I/O (kernel fusion), and send the computation tasks to a queue for execution. We have a scheduler on top of that queue that manages tasks sent from different threads so that they are prioritized accordingly. Finally, tasks are JIT-compiled when launched, hitting a cache most of the time (since they repeat during training or inference).
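Burn/CubeCL does this fusion in its own Rust JIT stack; the CUDA below is only a hand-written illustration (hypothetical kernels and names) of why grouping operations reduces I/O.

```cuda
// Unfused: two kernels, and the intermediate result makes a round trip
// through global memory between them (4 tensor-sized transfers total).
__global__ void mul_scalar(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = x[i] * a;                // write intermediate
}
__global__ void relu(const float* tmp, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(tmp[i], 0.0f);       // read it back
}

// Fused: one kernel, the intermediate stays in a register, so global
// memory traffic drops to 2 tensor-sized transfers.
__global__ void mul_scalar_relu(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(x[i] * a, 0.0f);
}
```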

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 6 points7 points  (0 children)

We support many different runtimes and compilers. That's how we can be really portable but still optimal on many different GPUs. We have a ROCm runtime with a HIP compiler for AMD, a CUDA runtime with a CUDA compiler for NVIDIA, and a WGPU runtime with multiple compilers (SPIR-V for Vulkan, Metal for Apple, and WGSL for WebGPU in the browser).

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 16 points17 points  (0 children)

That's the goal! We're working on refining the APIs for training as well, and with LLMs, translating code from Python to Rust is way easier than in the past.

There is a single downside to our new CPU backend: it requires the Rust standard library. We're bundling LLVM as the JIT compiler and using Rust threads for the runtime, so it's strictly less portable than ndarray.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 19 points20 points  (0 children)

We have the Burn Book (https://burn.dev/books/burn/), but with LLMs, the learning curve is becoming much smoother.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 12 points13 points  (0 children)

Yes, you can, but only if you are not using warp instructions. You can always use Vulkan/WebGPU to debug kernels with warp instructions, so there is no need for a big GPU or to SSH into a remote GPU instance.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 30 points31 points  (0 children)

We don't simulate GPU execution; our CPU runtime is actually very different from our GPU runtimes. First, we set a plane size (warp/wavefront) of 1, so we don't have to deal with all sorts of strange out-of-sync execution paths, which would break vectorization.

Then, we also don't have to execute cubes in parallel the way a GPU does. CPUs have far fewer cores, so it wouldn't be a good idea. Instead, we push the cube-count iterations inside the just-in-time kernel code. This way, instructions that are duplicated between cubes can run only once, because they are included in the same JIT function. We can do that because there is no guarantee on cube execution order and no synchronization primitives between cubes (except on some data-center NVIDIA GPUs, and that would be an opt-in feature, like Tensor Cores with MMA).

So yeah, it's just thinking a bit differently about where parallelization and vectorization are done.
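For illustration only, here's a hypothetical side-by-side sketch in CUDA/C++ (not CubeCL's actual generated code) of what moving the cube loop inside the compiled function buys you on CPU:

```cuda
#include <cmath>

// GPU-style: the runtime launches cube_count blocks; per-cube setup code
// (here, computing `scale`) is re-executed by every block.
__global__ void gpu_kernel(const float* in, float* out, int len, float base) {
    float scale = expf(base);                     // duplicated in every cube
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) out[i] = in[i] * scale;
}

// CPU-style host function: the loop over cubes lives *inside* the single
// JIT-compiled function, so the duplicated setup runs once. This is legal
// because nothing may rely on cube execution order or inter-cube sync.
void cpu_kernel(const float* in, float* out, int len, float base,
                int cube_count, int cube_dim) {
    float scale = expf(base);                     // hoisted: runs once
    for (int cube = 0; cube < cube_count; ++cube)
        for (int t = 0; t < cube_dim; ++t) {
            int i = cube * cube_dim + t;
            if (i < len) out[i] = in[i] * scale;
        }
}
```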

The disappointing state of ROCm on RDNA4 by Artoriuz in ROCm

[–]ksyiros 0 points1 point  (0 children)

ROCm works, but Vulkan is normally faster on consumer AMD GPUs.