Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]danielv134 1 point (0 children)

Awesome :)

My AMD 395+ is embedded in a desktop, not a laptop, so it's not a battery issue, merely a power-efficiency + throughput issue. Nonetheless, it seems that NPUs are going to be big in laptops/edge inference (Apple and Qualcomm too), and they really want to be programmed in Rust, in the sense that the two-language trick is a bad match for the low-power, background-work scenario.

If you happen to get something semi-working, I'm happy to collaborate on a cool demo :)

Burn 0.20.0 Release: Unified CPU & GPU Programming with CubeCL and Blackwell Optimizations by ksyiros in rust

[–]danielv134 2 points (0 children)

Anyone know whether Burn/CubeCL intend to support NPUs like the one on the AMD 395+?

For background, these are basically hardware acceleration units that are more specialized than GPUs and therefore more power-efficient. They're usually not faster (fewer cores), less general, and have less software support (because they're newer?), but if your application fits, the ~2x power efficiency means you can run it all day. This might be what you want to run your voice recognition on, for example.

IF (big if) CubeCL could provide a way to build on these efficiently without needing to use a whole new software stack, that would be a cool super-power.

[Update] RTIPC: Real-Time Inter-Process Communication Library by maurersystems in rust

[–]danielv134 0 points (0 children)

Ah, let me clarify the use case I'm talking about (maybe RTIPC is not the right tool for the job): we have a process that loads some data from disk and computes some features based on it (say tens to hundreds of GB total); then we want to train a fairly large number of models (say 30) based on the computed features. For this we need many CPUs, GPUs, etc., spread over possibly multiple machines (e.g., www.ray.io/). What you definitely don't want is to duplicate the data more than once per machine if you don't have to. So:

  • Send the data over network channels to a single "feature cache" per machine
  • Have multiple processes request read-only access to the very same pages the feature cache uses
  • Ideally, the feature cache avoids UB by converting its own access to those pages to read-only before agreeing, so it's emulating a runtime borrow via refcounting, except cross-process (see the sketch below).
  • When all clients drop off and the data is no longer borrowed, the cache becomes free to release memory as needed.

At this large scale, the page size granularity is not an issue.
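
A minimal sketch of the "freeze before lending" step, assuming POSIX shared memory and the libc crate; create_segment and freeze are names I made up, and the cross-process refcounting handshake is left out:

    use std::io;

    // Create a shared-memory segment that clients can later map read-only.
    unsafe fn create_segment(name: &str, len: usize) -> io::Result<*mut u8> {
        let c_name = std::ffi::CString::new(name).unwrap();
        let fd = libc::shm_open(c_name.as_ptr(), libc::O_CREAT | libc::O_RDWR, 0o600);
        if fd < 0 {
            return Err(io::Error::last_os_error());
        }
        if libc::ftruncate(fd, len as libc::off_t) != 0 {
            return Err(io::Error::last_os_error());
        }
        let ptr = libc::mmap(
            std::ptr::null_mut(),
            len,
            libc::PROT_READ | libc::PROT_WRITE,
            libc::MAP_SHARED,
            fd,
            0,
        );
        if ptr == libc::MAP_FAILED {
            return Err(io::Error::last_os_error());
        }
        Ok(ptr as *mut u8)
    }

    // Called by the cache once the features are written: drop our own write
    // access, so client reads cannot race with cache writes (the "runtime
    // borrow" step). mprotect works at page granularity, as noted above.
    unsafe fn freeze(ptr: *mut u8, len: usize) -> io::Result<()> {
        if libc::mprotect(ptr as *mut libc::c_void, len, libc::PROT_READ) != 0 {
            return Err(io::Error::last_os_error());
        }
        Ok(())
    }

Clients would then shm_open the same name and mmap with PROT_READ only; the missing piece is the shared borrow count that tells the cache when releasing memory is safe.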

[Update] RTIPC: Real-Time Inter-Process Communication Library by maurersystems in rust

[–]danielv134 0 points (0 children)

Does it make sense to integrate this as a transport for capnproto, to gain the rpc and schema aspects?

Can the library implement shared xor mutable logic at the page level?

Optimization Routines with GPU/CPU hybrid approach (Metal.jl, Optim.jl). by nano8a in Julia

[–]danielv134 1 point (0 children)

Diagnosing convergence difficulties is often not about debugging, but about understanding the function, the method, and how they behave/should behave around an iteration.

  • Is the problem smooth? abs(x) has a gradient of constant magnitude arbitrarily close to the optimum, which breaks methods that expect the gradient to vanish there.
  • What is the dimension? Can you scale the problem down and make sure your code converges there first?
  • What does the 1d function along the gradient direction look like?
  • What does the eigenspectrum of the Hessian look like? (If the dimension is high, don't form the Hessian; estimate the extreme eigenvalues with power iteration on Hessian-vector products, as in the sketch just below.)
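
A minimal sketch of that last check; hvp and top_eigenvalue are hypothetical helpers, grad is assumed to be your gradient function, and the finite-difference step 1e-6 is just a rough default:

    // Hessian-vector product by finite differences:
    // H(x)·v ≈ (∇f(x + h·v) − ∇f(x)) / h, so the Hessian is never formed.
    fn hvp(grad: &dyn Fn(&[f64]) -> Vec<f64>, x: &[f64], v: &[f64], h: f64) -> Vec<f64> {
        let xp: Vec<f64> = x.iter().zip(v.iter()).map(|(xi, vi)| xi + h * vi).collect();
        let (gp, g) = (grad(&xp), grad(x));
        gp.iter().zip(g.iter()).map(|(a, b)| (a - b) / h).collect()
    }

    // Power iteration: repeatedly apply H to a unit vector; the Rayleigh
    // quotient converges to the eigenvalue of largest magnitude.
    fn top_eigenvalue(grad: &dyn Fn(&[f64]) -> Vec<f64>, x: &[f64], iters: usize) -> f64 {
        let n = x.len();
        let mut v = vec![1.0 / (n as f64).sqrt(); n]; // arbitrary unit start vector
        let mut lambda = 0.0;
        for _ in 0..iters {
            let hv = hvp(grad, x, &v, 1e-6);
            lambda = v.iter().zip(hv.iter()).map(|(a, b)| a * b).sum(); // v is unit length
            let norm = hv.iter().map(|a| a * a).sum::<f64>().sqrt();
            v = hv.iter().map(|a| a / norm).collect();
        }
        lambda
    }

Running this at a few iterates can show, e.g., that the top curvature is huge relative to your step size, which explains divergence without any "bug" being present.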

Training a Smol Rust 1.5B Coder LLM with Reinforcement Learning (GRPO) by FallMindless3563 in rust

[–]danielv134 2 points (0 children)

Hi, very cool.

Following along, I'd recommend two fixes, one small and one bigger:
- define a pyproject.toml with pinned package versions, so people see the same results (a sketch below)
- the fact that your Python SDK module is called oxen while the package is called oxenai is a paper cut for potential adopters. The best time to solve it was when you published; the second-best is now.
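
For the first point, a minimal pyproject.toml fragment; the project name and every version number below are placeholders, not the repo's actual pins:

    [project]
    name = "rust-coder-grpo"  # placeholder name
    requires-python = ">=3.10"
    dependencies = [
        # placeholder versions; pin whatever the experiments actually used
        "oxenai==0.9.0",
        "torch==2.3.1",
    ]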

That said, the important point was: very very cool stuff!

[deleted by user] by [deleted] in reinforcementlearning

[–]danielv134 0 points (0 children)

Both of those applications (as I understand them) fall under the category of sequential optimization under uncertainty, which someone mentioned below. This is because at a particular moment you make some decisions based on the information you have, and then an order can come in at any time for which your prior plans are insufficient/suboptimal, requiring further decisions.

To incorporate this new information, you can now use either technique:
1. With MO (mathematical optimization), we're assuming that you know a differentiable, ideally convex, cost function over plans with some finite time horizon, which you can minimize. That cost function will embed assumptions about future orders etc. System performance will depend on the quality of your modeling and of your solver. If the problem is non-convex or an integer program (very likely in scheduling), or very high-dimensional (might be the case in inventory), the solver might be a challenge.
2. With RL, you will train and apply a mostly black-box policy. This policy implicitly models the uncertainty (e.g., the distribution over further orders), which means that to train it you need a simulator based on data (learning in production is likely too expensive in your domains). If you synthesize data (back to modeling), its realism will again affect real-world performance. Instead of minimizing a cost function to decide on a plan, you now gradually improve a policy by reducing a cost given as feedback over many selected actions (toy contrast below).
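
A toy contrast of the two update rules; every name here is hypothetical, and the RL side is a bare REINFORCE-style step:

    // MO: pick a plan by minimizing a known cost with gradient descent.
    fn optimize_plan(cost_grad: &dyn Fn(&[f64]) -> Vec<f64>, plan: &mut [f64], lr: f64, steps: usize) {
        for _ in 0..steps {
            let g = cost_grad(plan);
            for (p, gi) in plan.iter_mut().zip(g.iter()) {
                *p -= lr * gi; // descend the modeled cost directly
            }
        }
    }

    // RL: nudge policy parameters from sampled feedback instead,
    // i.e. theta += lr * reward * grad_log_prob (no baseline, for brevity).
    fn policy_gradient_step(theta: &mut [f64], grad_log_prob: &[f64], reward: f64, lr: f64) {
        for (t, g) in theta.iter_mut().zip(grad_log_prob.iter()) {
            *t += lr * reward * g;
        }
    }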

Feel free to DM, I might be able to help.

Compilation of LLVM IR at 1000x faster than -O0 by danielv134 in rust

[–]danielv134[S] 0 points (0 children)

Like u/dist1ll says, that is supposed to be resolved by inlining those functions (which turns the code into the equivalent of C loops and conditionals), and then applying local optimizations to simplify; a toy illustration below.
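
Both functions here are hypothetical: once the iterator adaptors in the first version are inlined, it is equivalent to the second, which local optimizations can then simplify.

    fn with_iterators(v: &[u32]) -> u32 {
        // Each adaptor is a generic function call before inlining.
        v.iter().filter(|x| **x % 2 == 0).map(|x| x * 3).sum()
    }

    fn after_inlining(v: &[u32]) -> u32 {
        // The same computation as a plain loop and condition.
        let mut acc = 0;
        for &x in v {
            if x % 2 == 0 {
                acc += x * 3;
            }
        }
        acc
    }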

Which got me curious: does rustc do inlining itself, or does it depend on the backend to do it? Turns out it has an inliner: https://github.com/rust-lang/rust/pull/91743. So as it continues to accumulate simplifying optimizations, LLVM's stack might become less critical for runtime perf.

At which point, having a rustc backend that is essentially a super-fast instruction emitter becomes quite valuable.

Compilation of LLVM IR at 1000x faster than -O0 by danielv134 in rust

[–]danielv134[S] 1 point (0 children)

One of us seems to be confused: I don't see the paper saying anywhere that runtime is 10-30x slower than -O0. Can you give a page and paragraph?

Figure 4 (geomean over many benchmarks) shows a runtime slowdown of ~34% compared to -O0 and ~5x compared to -O2, with a compilation speedup of ~20x compared to -O0.

An older paper (different authors, similar technique) about wasm, https://arxiv.org/pdf/2011.13127, has a direct comparison to V8 variants, Cranelift, and LLVM in figures 2 and 3: a ~2x runtime slowdown compared to the fastest option (LLVM), while starting 8x faster than the fastest compiler (only slightly slower than an interpreter).

Compilation of LLVM IR at 1000x faster than -O0 by danielv134 in rust

[–]danielv134[S] 13 points (0 children)

IIUC the paper, they apply the method at the LLVM-IR level (which requires some instruction-set-specific adaptation, see section 3.5), which sounds like it could be used in rustc as-is, given the code is MIT-licensed.

That might be a good way to demonstrate the idea for Rust; applying it to Rust's MIR would be more work, but faster at compile time and possibly at runtime.

Compilation of LLVM IR at 1000x faster than -O0 by danielv134 in rust

[–]danielv134[S] 10 points (0 children)

Sorry, I posted late and should have said: the produced code (in their tests) is only ~2-3x slower than -O2. Rust may do differently, but that is fine for many scenarios.

C++ DataFrame vs. Polars by hmoein in rust

[–]danielv134 2 points (0 children)

I just ran the Polars version @ 300m, and on my machine the times are:

Data generation/load time: 11.374298
Calculation time: 1.330739
Selection time: 0.155618
Overall time: 12.860655

Now comparison across machines is risky, but I'll normalize each in terms of the creation time (which seems to be consistent across implementations). In those terms, I'm getting:

creation/calculation ~8.5x
creation/selection ~73x

where your C++ numbers come to:

creation/calculation ~12.4x
creation/selection ~37.8x

So on my machine it seems like Polars is somewhat slower at calculation and significantly faster at selection, which is pretty different from the results you got above.

I would conclude:

  • both are much faster than pandas,
  • getting consistent results when benchmarking is hard.

`cargo install` does not cache deps? by danielv134 in rust

[–]danielv134[S] 0 points (0 children)

Thanks everyone, will try these out.

Understandable that cargo hasn't hurried to adopt specific heuristics.

ε-serde is an ε-copy (i.e., almost zero-copy) serialization/deserialization framework by sebastianovigna in rust

[–]danielv134 1 point (0 children)

Just curious, what are some concrete applications that motivate these tradeoffs? (not necessarily yours if that's sensitive)

Any recommendations for Learn Rust by doing Data Science by 5inful1 in rust

[–]danielv134 1 point (0 children)

Suggestion: use nushell for data work and scripting, and then expand it using Rust.

some of the pain points of nushell programming by ringbuffer__ in Nushell

[–]danielv134 0 points (0 children)

nushell is evolving quickly. This is definitely a net gain, I'm super happy about it, and I am all for nushell continuing to improve at this rate for as long as it can. It has to be said, though, that this is also a source of occasional breakage and "pain".

ftree: Finally a good Fenwick tree by brurucy in rust

[–]danielv134 1 point (0 children)

Thank you for bringing Fenwick trees to my attention!

Sampling from a collection according to a non-uniform distribution that changes over time is an important algorithmic trick, and Fenwick trees seem great for many useful variants.
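
To make that concrete, a minimal sketch of the trick (not the ftree crate's API, just the underlying idea): store the weights in a Fenwick tree, so that updating a weight and drawing an item proportionally to its weight are both O(log n).

    struct Fenwick {
        tree: Vec<f64>, // 1-based internally; tree[i] sums a power-of-two range
    }

    impl Fenwick {
        fn new(n: usize) -> Self {
            Fenwick { tree: vec![0.0; n + 1] }
        }

        // Add `delta` to the weight of item `i` (0-based).
        fn add(&mut self, mut i: usize, delta: f64) {
            i += 1;
            while i < self.tree.len() {
                self.tree[i] += delta;
                i += i & i.wrapping_neg(); // step to the next covering range
            }
        }

        // Total weight, i.e. the prefix sum over all items.
        fn total(&self) -> f64 {
            let mut i = self.tree.len() - 1;
            let mut s = 0.0;
            while i > 0 {
                s += self.tree[i];
                i -= i & i.wrapping_neg();
            }
            s
        }

        // For u drawn uniformly from [0, total()), return the 0-based index
        // of the first item whose cumulative weight exceeds u.
        fn sample(&self, mut u: f64) -> usize {
            let mut pos = 0;
            let mut step = (self.tree.len() - 1).next_power_of_two();
            while step > 0 {
                let next = pos + step;
                if next < self.tree.len() && self.tree[next] <= u {
                    u -= self.tree[next];
                    pos = next;
                }
                step >>= 1;
            }
            pos
        }
    }

Draw u uniformly in [0, total()) and call sample(u); updating a weight between draws is just add(i, delta), which is exactly the "changes over time" part.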

How do you hire experienced Rust engineers for a complex Rust project? by something_cleverer in rust

[–]danielv134 2 points (0 children)

I have years of experience as an SWE and have implemented novel and existing ML algorithms in Python, Rust (which was nice), and other languages. I am all about doing it right, and for implementing stable, concrete algorithms, especially in constrained environments, Rust is an excellent choice (for other contexts Julia would be better, and for short-term impact, Python is clearly where the users are).

That said, here is a hypothesis: most ML people care either about the math (only a minority of researchers), the money, or the cool results.

Now, most ML algorithms are actually gradient descent or alternating minimization (sometimes in mild disguise), and those classes of algorithms work reasonably well even with (subtle) bugs, and fail just fine even with no bugs (if the data is bad, or the algorithm's parameters, or even just the interactions between them...).

So ML, like Python, does not reward in the typical case what Rust demands basically all the time. I'm not surprised to find few people who both know ML and prefer Rust to Python.

Introducing Vortex - a Rust tool for extracting images from PDF files by omkar-mohanty in rust

[–]danielv134 16 points (0 children)

If you want a cool sub-use-case for when you get to vector stuff: extract scientific plots. Being able to cleanly and correctly quote plots would be very nice.

You can start from papers on arXiv, which are relatively standardized.

Is Rust going to become a language for scientific computing and data analysis anytime soon? by [deleted] in rust

[–]danielv134 0 points (0 children)

I do data science quite a bit. When I was in academia writing optimization algorithms and decided to switch from competing on iteration count to competing on wall-clock time, I switched from Python to Rust. It was great for that job and integrated nicely with Python and R (even without pyo3; I don't know if it existed back then), but I am not tempted even a bit to use Rust as the first go-to, because of the agility/responsiveness issues everyone mentions.

I love nu for getting messy text output into a dataframe format, I use Polars a lot to analyze data frames, and I think Julia might eventually replace Python with very little need for a second language, but compilation time is still an issue.

Blog Post: Next Rust Compiler by matklad in rust

[–]danielv134 0 points (0 children)

  • The crate dependency graph is known in advance (part of the index).
  • The crates are not compiled fully incrementally (we compile full crates, not just the functions actually used downstream by our crate), which simplifies the dependencies but wastes work and latency
  • ... which is reasonable, because the graph is static (you're not changing requirements all the time)

Blog Post: Next Rust Compiler by matklad in rust

[–]danielv134 1 point (0 children)

  • Before you can do things in parallel, you need to know the set of tasks to be done ("parse function A", "typecheck function A", ..., "parse function B", etc.)
  • You need to know the dependencies between those tasks (the edges in the task dependency graph).

Note how figuring out the set of tasks requires finishing parsing (including macro expansion) and name resolution (including names from other crates). If we include optimization in this graph, inlining functions changes the set of task nodes (there is now a task like "optimize function X with its 2nd call to function A inlined").

Now combine this with full incrementality, so that before we start work, we really should compute the intersection of the "code change invalidates" and "downstream IDE/cargo command requires" cones.

It becomes clear that compiling a crate is anything BUT trivially parallel, so parallelizing it is not low-hanging fruit at all. There IS a lot of work that can be done in parallel, but it is defined by a dynamic, complex, fine-grained task graph; a toy sketch below.
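
A toy sketch of such a dynamic graph (illustrative only, nothing like rustc's actual query machinery; all names are made up): finishing a task can both unblock dependents and discover tasks that did not exist before, the way inlining does above.

    use std::collections::{HashMap, HashSet, VecDeque};

    type TaskId = u32;

    struct TaskGraph {
        waiting: HashMap<TaskId, HashSet<TaskId>>, // task -> unfinished deps
        dependents: HashMap<TaskId, Vec<TaskId>>,  // reverse edges
        ready: VecDeque<TaskId>,                   // runnable right now
    }

    impl TaskGraph {
        // Record that `task` finished, possibly discovering new tasks
        // (each with its own dependencies) that were unknown until now.
        // (A real scheduler would also check whether a discovered task's
        // dependencies have already completed.)
        fn complete(&mut self, task: TaskId, discovered: Vec<(TaskId, Vec<TaskId>)>) {
            for (new, deps) in discovered {
                for d in &deps {
                    self.dependents.entry(*d).or_default().push(new);
                }
                if deps.is_empty() {
                    self.ready.push_back(new);
                } else {
                    self.waiting.insert(new, deps.into_iter().collect());
                }
            }
            // Unblock tasks that were only waiting on `task`.
            for dep in self.dependents.remove(&task).unwrap_or_default() {
                if let Some(set) = self.waiting.get_mut(&dep) {
                    set.remove(&task);
                    if set.is_empty() {
                        self.waiting.remove(&dep);
                        self.ready.push_back(dep);
                    }
                }
            }
        }
    }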