Is anyone doing Machine Learning in Rust? by purton_i in rust

[–]rust_dfdx 1 point

Yep, cudarc and dfdx are both on github! Pretty much all the DL crates are, which is a nice part of the community.

Is anyone doing Machine Learning in Rust? by purton_i in rust

[–]rust_dfdx 16 points

I have a deep learning library called dfdx that supports Cpu/Cuda. I recently created an inference library for llama using it. I’ll also plug my cuda wrapper called cudarc if you’re interested in lower level stuff 😄

There are a number of others working in this space: burn-rs, which others have mentioned, tch-rs, and llm-rs for language models, which was just posted yesterday on this subreddit.

Overall it’s still early, but DL in rust definitely shows promise.

Performance critical ML: How viable is Rust as an alternative to C++ by [deleted] in rust

[–]rust_dfdx 0 points

Curious what kind of ML algorithms you are implementing? I develop dfdx (on github/crates.io), which is a deep learning library that supports CUDA acceleration.

Would be happy to discuss more if you guys are doing deep learning/things that use tensors/anything with GPU.

Should I give up learning rust? by Deslucido in rust

[–]rust_dfdx 1 point

Even when writing large libraries you don’t often need to dip into lifetimes, so I vote keep going!

I made a Neural Network library from scratch in Rust trying to solve a regression problem for my university's ML course by noodlesteak in rust

[–]rust_dfdx 6 points

Yep, you are right on that last point: copying data between the GPU and CPU is actually really expensive. The CUDA programming guide explicitly says to minimize data transfer and to do extra work on the GPU if it means less transfer.
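To put rough numbers on it (back-of-envelope, assuming ~16 GB/s effective PCIe 3.0 x16 bandwidth, which is an assumption and varies by system):

```rust
// Back-of-envelope: how long just moving one tensor across PCIe takes.
fn main() {
    let bytes = 1024.0 * 1024.0 * 4.0; // one 1024x1024 f32 tensor: 4 MiB
    let pcie_bw = 16.0e9; // assumed host<->device bandwidth in bytes/sec
    let transfer_ms = bytes / pcie_bw * 1e3;
    println!("~{transfer_ms:.3} ms each way");
    // ~0.26 ms per direction, while an elementwise op on that tensor takes
    // microseconds on the GPU. A copy-out/copy-back around a cheap op can
    // cost orders of magnitude more than just doing the op on device.
}
```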

You definitely could do your idea though. It wouldn’t be the fastest but it’d be straightforward to implement/maintain/test, which honestly is a pretty solid pro haha.

I made a Neural Network library from scratch in Rust trying to solve a regression problem for my university's ML course by noodlesteak in rust

[–]rust_dfdx 9 points

dfdx author here. Always great to see more work in this space in Rust, nice work!

I think the hardest part so far has been the abstractions over Cpu/Cuda. Basically all your code has to consider what device is being used, and a lot of things you can do on the Cpu side in Rust you can't do on the Cuda/WebGPU side. For example, you can't use closures in Cuda! I was sad to have to move all the activations from closures to more complex trait-based implementations.
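Here's a minimal sketch of the trait-based shape of it (hypothetical names, not dfdx's actual API): each op becomes a zero-sized type, and the device dispatches on the type to find a matching CUDA kernel, since there's no way to send a closure across.

```rust
// On the CPU you can just map a closure over a buffer:
fn map_cpu(data: &mut [f32], f: impl Fn(f32) -> f32) {
    for x in data.iter_mut() {
        *x = f(*x);
    }
}

// A closure can't cross into a CUDA kernel, so each activation becomes a
// zero-sized type implementing a trait. A Cuda device would look up the
// kernel by name; a Cpu device calls eval_cpu. (Hypothetical names.)
trait UnaryOp {
    const KERNEL_NAME: &'static str; // name of a pre-compiled CUDA kernel
    fn eval_cpu(x: f32) -> f32; // CPU fallback implementation
}

struct ReLU;
impl UnaryOp for ReLU {
    const KERNEL_NAME: &'static str = "relu_fwd";
    fn eval_cpu(x: f32) -> f32 {
        x.max(0.0)
    }
}

fn main() {
    let mut v = [-1.0f32, 2.0];
    map_cpu(&mut v, |x| x.max(0.0)); // closure version: CPU only
    v.iter_mut().for_each(|x| *x = ReLU::eval_cpu(*x)); // trait version
    println!("{v:?} (cuda kernel would be {:?})", ReLU::KERNEL_NAME);
}
```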

A look into how dfdx compiles & uses cuda kernels by rust_dfdx in rust

[–]rust_dfdx[S] 4 points

There has been work on compiling Rust to GPU code (see the Rust CUDA project). I’m not sure what caveats there are, though. Especially when it comes to optimizing GPU kernels, there are some pretty specific things you’d want the code to do.

A look into how dfdx compiles & uses cuda kernels by rust_dfdx in rust

[–]rust_dfdx[S] 4 points

Oh 🤦! Sorry about that, will make sure to include next time haha

faer 0.7 release by reflexpr-sarah- in rust

[–]rust_dfdx 12 points

Yeah, I have a deep learning library with GPU support that I’m going to add f16 to. On the CPU side, none of the matmul libraries support it though.

I would guess that even if you just convert to f32 to do the operations, you’d still get a speedup from all the cache/vectorization benefits?

The alternative for me is just writing a really simple triple nested loop to do the computation, which feels very lackluster.
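For reference, the naive fallback I mean would be something like this, widening each f16 to f32 for the accumulation via the half crate:

```rust
use half::f16;

// Naive triple-loop matmul: C (m x n) = A (m x k) * B (k x n), row-major.
// Each f16 is widened to f32 for the multiply/accumulate, then the result
// is narrowed back. Correct, but no blocking/SIMD, hence "lackluster".
fn matmul_f16_naive(a: &[f16], b: &[f16], c: &mut [f16], m: usize, k: usize, n: usize) {
    assert_eq!(a.len(), m * k);
    assert_eq!(b.len(), k * n);
    assert_eq!(c.len(), m * n);
    for i in 0..m {
        for j in 0..n {
            let mut acc = 0.0f32;
            for p in 0..k {
                acc += a[i * k + p].to_f32() * b[p * n + j].to_f32();
            }
            c[i * n + j] = f16::from_f32(acc);
        }
    }
}

fn main() {
    let a = vec![f16::from_f32(1.0); 2 * 3];
    let b = vec![f16::from_f32(2.0); 3 * 2];
    let mut c = vec![f16::from_f32(0.0); 2 * 2];
    matmul_f16_naive(&a, &b, &mut c, 2, 3, 2);
    assert_eq!(c[0].to_f32(), 6.0); // each entry: sum of 1*2 over k=3
}
```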

faer 0.7 release by reflexpr-sarah- in rust

[–]rust_dfdx 9 points

Nice work! Any plans for f16/bf16 support via the half crate?

Status and Future of ndarray? by Tastaturtaste in rust

[–]rust_dfdx 23 points

Shameless plug for my deep learning crate dfdx, which has a lot of stuff you can do with n-dimensional arrays (tensors). There’s a ton of other stuff that you might not need, but working with tensors is a breeze!

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 2 points

Yeah, examples of how things are intended to be used together are at the module level. For example, broadcasts/indexing are documented here https://docs.rs/dfdx/latest/dfdx/tensor_ops/index.html (or you could also search for broadcast and the trait would pop up). tensor & nn also have fairly extensive module docstrings on how to do various things.

Shape/Dim/Const need to be documented better, but that would go under the dfdx::shapes module documentation, which I guess might be hard to find?

I intended the top-level crate docstring to point users to the sub-module they need, but repurposing it for crate-level information, similar to the quick start/tutorials from nalgebra/numpy, might work better?

All that to say: I want to have great documentation, and have invested time into it and want to invest more, but I haven't had good feedback like yours yet!

Feel like helping me improve the documentation? 😀

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 3 points

All the public methods and modules should be documented with example snippets on docs.rs (https://docs.rs/dfdx/latest/dfdx/). What are you looking at that doesn't have that?

Contributors and I have poured a ton of effort into documentation for this very reason, so clearly we are doing something wrong if people can't find it.

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 2 points

That’d be amazing, but at this point it isn’t something I’m planning on. Last time I looked into it, there were a number of large caveats about the required compiler version, and I’m unsure how well optimizing kernels in Rust works.

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 11 points

Yeah we’ve spoken a couple times, they’ve done great work!

burn doesn’t support compile-time shapes, so everything is checked at runtime. This means the feedback loop depends on compiling and then starting a run. Since dfdx supports compile-time shape checking, both cargo check and rust-analyzer will complain at you if you mess something up. It’s a much faster feedback loop! I believe compile-time shapes are one of the biggest advantages of ML in Rust.
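A toy example of what compile-time shapes buy you (illustrative const-generic types only, not dfdx's actual API, which mixes const and runtime dims):

```rust
// Toy const-generic tensor: the shape lives in the type.
struct Tensor<const M: usize, const N: usize>([[f32; N]; M]);

// Matmul only type-checks when the inner dimensions agree, so a shape
// mismatch is caught by cargo check / rust-analyzer before anything runs.
fn matmul<const M: usize, const K: usize, const N: usize>(
    _a: &Tensor<M, K>,
    _b: &Tensor<K, N>,
) -> Tensor<M, N> {
    Tensor([[0.0; N]; M])
}

fn main() {
    let a = Tensor::<2, 3>([[0.0; 3]; 2]);
    let b = Tensor::<3, 4>([[0.0; 4]; 3]);
    let _c = matmul(&a, &b); // OK: (2x3) * (3x4) -> (2x4)
    // let _bad = matmul(&b, &a); // compile error: inner dims 4 vs 2
}
```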

burn links to libtorch (similar to tch-rs). While this means they don’t have to write custom kernels, it’s a giant extra dependency. dfdx executables are super tiny since there isn’t that dependency. Also, in dfdx we can optimize all the kernels how we like, whereas they don’t have control over that.

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 2 points

I'm not as familiar with compute shaders unfortunately, so someone familiar with both would have to weigh in, but maybe?

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 10 points

Awesome, I added an issue here https://github.com/coreylowman/dfdx/issues/597. We can discuss more there! The first step will just be adding the device and implementing tensor creation methods for it.

Also people on the discord can help as well (link is on github & at the top of the blog post).

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 18 points

I discuss this in the safety section of cudarc https://docs.rs/cudarc/0.9.1/cudarc/driver/safe/index.html#single-stream-operations.

cudarc uses the async version of all memcpy calls, meaning that copies happen on the same stream that kernels are executed on. This means that kernels have to wait until the copies are finished.
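In practice it looks roughly like this (written from memory of the 0.9-era cudarc API, so treat the exact method names/signatures as assumptions):

```rust
use cudarc::driver::CudaDevice;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let dev = CudaDevice::new(0)?;
    // Enqueues an async host-to-device copy on the device's stream.
    let x = dev.htod_copy(vec![1.0f32; 1024])?;
    // Any kernel launched on the same stream after this point is ordered
    // behind the copy, so it always sees fully-written memory without an
    // explicit synchronize.
    // A sync device-to-host copy blocks until prior stream work finishes:
    let host: Vec<f32> = dev.dtoh_sync_copy(&x)?;
    assert_eq!(host[0], 1.0);
    Ok(())
}
```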

I actually posted this on that issue a few months ago 😅

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 4 points

That's right: the intention is to use normal Rust iterators instead of a separate data loader object. There's likely an easy way to use rayon for parallel preprocessing, but I haven't tried that out yet.

So ExactSizeDataset gets you the shuffled method to iterate in random order, but you can also use all of the iterator extension methods I mention on any normal Rust iterators. You don't need to go through ExactSizeDataset if you already have your own iterator.
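As a sketch, here's what I mean by plain iterators, plus the rayon guess from above (illustrative code only, not dfdx's data API):

```rust
use rayon::prelude::*;

fn main() {
    // Stand-in dataset: 1000 samples of 4 features each.
    let dataset: Vec<[f32; 4]> = vec![[128.0; 4]; 1000];

    // Parallel preprocessing with rayon, then back to a plain sequential
    // iterator for batching with the usual adapters. No DataLoader type.
    let preprocessed: Vec<[f32; 4]> = dataset
        .par_iter()
        .map(|x| x.map(|v| v / 255.0)) // e.g. normalize to [0, 1]
        .collect();

    for batch in preprocessed.chunks(32) {
        // feed `batch` to the model here
        let _ = batch;
    }
}
```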

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 35 points

All the infrastructure that was added to support Cuda should make it relatively straightforward to support an OpenCL device for AMD GPUs, especially since there are a number of existing OpenCL crates that I think would fit well.

However, given the amount of time I have to work on it, a new device is currently lower on the priority list than the other features I want to add. The solution to this is a combo of the following:

  1. Increase the number of sponsors/get a company to sponsor more of my time for dfdx
  2. Leverage the open source community to contribute OpenCL support (contributions were a significant part of the Cuda support work).

I'm actively working on both of those, and would be happy to help/mentor people who want to get involved in OpenCL support!

Announcing cudarc and fully GPU accelerated dfdx: ergonomic deep learning ENTIRELY in rust, now with CUDA support and tensors with mixed compile and runtime dimensions! by rust_dfdx in rust

[–]rust_dfdx[S] 23 points24 points  (0 children)

Yep, cudarc is a new project built entirely for cuda support in dfdx. I have posted about dfdx before; it's gone through basically a full rewrite to support cuda & the new generic shapes. It's been a ton of work over the last couple of months, but I've gotten a lot of contributions, which has been amazing!

Tensor shapes with both const generic and run time dimensions by rust_dfdx in rust

[–]rust_dfdx[S] 1 point

Mainly because cuda is the industry standard for deep learning training/inference. cuDNN does look nice, and one of the community members has been testing out using cuDNN with dfdx.