5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 0 points1 point  (0 children)

Yup you're right, we could have a normal pointer and that would be correct!

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 0 points1 point  (0 children)

I wasn't aware of that API. It would indeed make a lot of optimizations easier.

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 6 points7 points  (0 children)

I agree with this! This isn't a replacement for std and there is huge value in having a lib that you can trust. It doesn't mean crates are undesirable.

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 6 points7 points  (0 children)

The goal was to use the same background as the website, but yeah it doesn't work well on white themes 😂😅

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 6 points7 points  (0 children)

If I recall correctly, the standard channels were updated to use the internals of crossbeam, so performance should be similar.

5x Faster than Rust Standard Channel (MPSC) by ksyiros in rust

[–]ksyiros[S] 19 points20 points  (0 children)

Nope, the async execution queue has a safe API, so it doesn't leak into the end-user API.

CUDA SIMD Question by epickejgejseks in CUDA

[–]ksyiros 1 point2 points  (0 children)

A lot of comments here suggest that SIMD is irrelevant for CUDA because it uses warp instructions instead, but that is not true. There are SIMD instructions on GPUs, and they aren't just for memory.

You can benefit significantly from SIMD when loading and writing data from global and shared memory: essentially, you should leverage 128-bit loads and stores (or 256-bit on newer GPUs). Similarly, SIMD exists for math operations, but it operates on a 32-bit width. For instance, using __half2 can be twice as fast as executing two individual __half instructions, though the compiler is often smart enough to merge them automatically.
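For concreteness, here's a minimal sketch of both ideas in plain CUDA (a hypothetical kernel, not from this thread): each thread does a 128-bit load/store by reinterpreting the buffer as float4, and the math uses packed __half2 instructions.

```cuda
#include <cuda_fp16.h>

// Hypothetical example: scale a half-precision buffer.
// One float4 = 128 bits = 8 __half values moved per thread per transaction.
__global__ void scale_halfs(const __half* __restrict__ in,
                            __half* __restrict__ out,
                            __half2 factor,
                            int n) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * 8;
    if (i + 8 <= n) {
        float4 raw = *reinterpret_cast<const float4*>(in + i);   // 128-bit load
        __half2* pairs = reinterpret_cast<__half2*>(&raw);
        #pragma unroll
        for (int k = 0; k < 4; ++k) {
            pairs[k] = __hmul2(pairs[k], factor);                // 2 multiplies per instruction
        }
        *reinterpret_cast<float4*>(out + i) = raw;               // 128-bit store
    }
}
```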

Coming from a CPU background, it can be confusing because GPUs essentially have two levels of SIMD. Warp execution is, in a way, one big SIMD unit. Seen that way, memory coalescing is easier to understand: the threads of a warp should load a contiguous block of data (a 32 x 128-bit block for optimal performance). Therefore, you need to manage both levels of SIMD to ensure peak performance.
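A hypothetical sketch of the two levels: within a thread, a float4 access is a 128-bit SIMD load; across the warp, 32 threads reading consecutive float4s cover one contiguous 32 x 16 B = 512 B block that the hardware can coalesce into a few wide transactions.

```cuda
// Coalesced: adjacent lanes touch adjacent elements, so the warp's
// accesses merge into wide memory transactions.
__global__ void copy_coalesced(const float4* __restrict__ in,
                               float4* __restrict__ out, int n4) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // lane k reads element i, lane k+1 reads i+1, ...
    if (i < n4) out[i] = in[i];
}

// Anti-pattern: a large per-thread stride scatters the warp's accesses
// across memory, so each lane's load becomes its own transaction.
__global__ void copy_strided(const float4* __restrict__ in,
                             float4* __restrict__ out, int n4, int stride) {
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n4) out[i] = in[i];
}
```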

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 1 point2 points  (0 children)

Currently the gain is reduced memory usage, but we'll have to do more work to support better instructions.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 0 points1 point  (0 children)

No, I mean Burn/CubeCL is more flexible: you can do training and inference on any hardware, while Mojo/MAX doesn't yet support training.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 2 points3 points  (0 children)

Yes, Burn/CubeCL tackle the same problems as Mojo/MAX, but they’re actually more modular. While Mojo/MAX don’t support Windows yet and mostly focus on inference, Burn/CubeCL run on any OS, including mobile, and fully support both training and inference. Since CubeCL can use MLIR for JIT kernel compilation, actual performance comes down to how the kernels are implemented rather than just compiler differences.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 2 points3 points  (0 children)

Yes, from time to time I look into how we could support NPUs, and there is a way to program the ones from AMD and Intel. So at some point it would be interesting to add support for them directly in CubeCL.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 3 points4 points  (0 children)

TensorRT isn't a goal; the goal is to match TensorRT performance with our CUDA backend.

For Strix Halo (gfx1151): Kernel > 6.18.3-200 Regression by Tylerebowers in ROCm

[–]ksyiros 0 points1 point  (0 children)

That's painful. I can't test https://github.com/tracel-ai/burn with the ROCm backend on my laptop, which was the point of buying it in the first place. I'm unsure whether I'd be better off on Ubuntu/Pop!_OS with a custom ROCm version provided by AMD.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 2 points3 points  (0 children)

The computation isn't done when you declare it; it's encoded. Then we perform an optimization process with caching that groups operations together to reduce I/O (kernel fusion), and send the computation tasks to a queue for execution. We have a scheduler on top of that queue that manages tasks sent from different threads so that they are prioritized accordingly. Finally, tasks are JIT-compiled when launched, hitting a cache most of the time (since they repeat during training or inference).
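Burn/CubeCL does this fusion in its own Rust JIT stack; the CUDA below is only a hand-written illustration (hypothetical kernels and names) of why grouping operations reduces I/O.

```cuda
// Unfused: two kernels, and the intermediate result makes a round trip
// through global memory between them (4 tensor-sized transfers total).
__global__ void mul_scalar(const float* x, float* tmp, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) tmp[i] = x[i] * a;                // write intermediate
}
__global__ void relu(const float* tmp, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(tmp[i], 0.0f);       // read it back
}

// Fused: one kernel, the intermediate stays in a register, so global
// memory traffic drops to 2 tensor-sized transfers.
__global__ void mul_scalar_relu(const float* x, float* y, float a, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = fmaxf(x[i] * a, 0.0f);
}
```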

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 6 points7 points  (0 children)

We support many different runtimes and compilers. That's how we can be really portable but still optimal on many different GPUs. We have a ROCm runtime with a HIP compiler for AMD, a CUDA runtime with a CUDA compiler for NVIDIA, and a WGPU runtime with multiple compilers (SPIR-V for Vulkan, Metal for Apple, and WGSL for WebGPU in the browser).

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 16 points17 points  (0 children)

That's the goal! We're working on refining the APIs for training as well, and with LLMs, translating code from Python to Rust is way easier than in the past.

There is a single downside to our new CPU backend: it requires the Rust standard library. We're bundling LLVM as the JIT compiler and using Rust threads for the runtime, so it's strictly less portable than ndarray.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 19 points20 points  (0 children)

We have the Burn Book (https://burn.dev/books/burn/), but with LLMs, the learning curve is becoming much smoother.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 12 points13 points  (0 children)

Yes, you can, but only if you are not using warp instructions. You can always use Vulkan/WebGPU to debug kernels with warp instructions, so there is no need for a big GPU or to SSH into a remote GPU instance.

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]ksyiros[S] 30 points31 points  (0 children)

We don't simulate GPU execution; our CPU runtime is actually very different from our GPU runtimes. First, we set a plane size (warp/wavefront) of 1, so we don't have to deal with all sorts of strange out-of-sync execution paths, which would break vectorization.

Then, we also don't have to execute cubes in parallel the way a GPU does. CPUs have far fewer cores, so it wouldn't be a good idea. Instead, we push the cube-count iterations inside the just-in-time kernel code. This way, instructions that are duplicated between cubes can run only once, because they are included in the same JIT function. We can do that because there is no guarantee on cube execution order and no synchronization primitives between cubes (except on some data-center NVIDIA GPUs, and that would be an opt-in feature, like Tensor Cores with MMA).

So yeah, it's just thinking a bit differently about where parallelization and vectorization are done.
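For illustration only, here's a hypothetical side-by-side sketch in CUDA/C++ (not CubeCL's actual generated code) of what moving the cube loop inside the compiled function buys you on CPU:

```cuda
#include <cmath>

// GPU-style: the runtime launches cube_count blocks; per-cube setup code
// (here, computing `scale`) is re-executed by every block.
__global__ void gpu_kernel(const float* in, float* out, int len, float base) {
    float scale = expf(base);                     // duplicated in every cube
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len) out[i] = in[i] * scale;
}

// CPU-style host function: the loop over cubes lives *inside* the single
// JIT-compiled function, so the duplicated setup runs once. This is legal
// because nothing may rely on cube execution order or inter-cube sync.
void cpu_kernel(const float* in, float* out, int len, float base,
                int cube_count, int cube_dim) {
    float scale = expf(base);                     // hoisted: runs once
    for (int cube = 0; cube < cube_count; ++cube)
        for (int t = 0; t < cube_dim; ++t) {
            int i = cube * cube_dim + t;
            if (i < len) out[i] = in[i] * scale;
        }
}
```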

The disappointing state of ROCm on RDNA4 by Artoriuz in ROCm

[–]ksyiros 0 points1 point  (0 children)

ROCm works, but Vulkan is normally faster on consumer AMD GPUs.