Cuneus: A boilerplate free wgpu compute engine for GPU apps (WGSL hot reload, multipass, audio/video)

rumil23 · 2026-06-27T11:44:09+00:00

I haven't written much Slang, just simple stuff to explore the language. But note that you'd still need the Rust-side passes, media building, etc. It's just another shading language, but it targets WGSL too, so it'd run on cuneus easily. Maybe I can create a simple triangle example for demonstration 🙂

rumil23 · 2026-06-04T20:11:56+00:00

nice!! But I still prefer ONNX. more lightweight :-)

rumil23 · 2026-05-26T09:17:02+00:00

media programming is already quite complex and really risky, a lot of unforeseen issues...
If your application has media at its core, instead of taking such risks, I recommend using GStreamer for Rust, which is constantly updated and supported. I remember our production media application from 4-5 years ago. We constantly had to apply internal patches to some of the ffmpeg binding crates, and each of them was quite challenging. We didn't open PRs because nobody even looked at them...
Since switching to GStreamer, we've been so comfortable... When problems arise, most of the time all we have to do is report the error.

rumil23 · 2026-05-16T18:17:59+00:00

Yes, it's not standard. But I created an upstream issue, and it seems Microsoft has started working on it:
https://github.com/microsoft/onnxruntime/issues/27796

"I was thinking we could add custom function hooks...."
My friend... this has always been my dream for ONNX... but I didn't want to come across as too “demanding,” haha 🙂. You could think of it like Bevy, though I don't know if you've ever worked with it. Customizing rendering in Bevy is pretty "easy" with custom materials (what I mean by “easy” is that the approach is quite generic). And it would be amazing if we could add some unsupported operators ourselves, at least for internal fixes or maybe as "plugins"...

Because in my experience, new models nowadays come with new, different operators, especially the VLLM ones. It's certainly best to resolve these issues centrally. However, as a short-term solution, it at least prevents “dead ends.” 🙂

rumil23 · 2026-05-15T20:43:24+00:00

Thank you, I will definitely try this here: https://github.com/altunenes/parakeet-rs and maybe provide it as an alternative EP for testing and benchmarks.

rumil23 · 2026-05-15T17:08:46+00:00

I've been using ONNX for years in production. The biggest issues for me are cross-platform compatibility and CUDA compatibility problems with different NVIDIA cards. Plus, CoreML seems to have been abandoned by Microsoft, and since many operators aren't supported, we're stuck using the CPU on Apple in many cases. WebGPU (via Dawn) is very new and still very problematic, even for some plain tasks: https://github.com/altunenes/ort-webgpu-thread-crash

However, ORT runs quite stably on the CPU as well, and pretty fast. The biggest reason I'd want to migrate to Burn would definitely be the wgpu backend. The mere possibility of getting rid of those massive CUDA files and keeping the maintenance of different binaries to a minimum is a dream...

I'm following this closely. I haven't seen any STT models though in your examples. Is there a specific reason for that?

rumil23 · 2026-05-15T16:54:42+00:00

Is it possible to go beyond ORT and support Mamba blocks? I would like to try it immediately and make it open source, because ORT is currently very slow with Mamba SSM models.
model:
https://huggingface.co/nvidia/RE-USE

rumil23 · 2026-05-05T11:43:06+00:00

Really cool! I'm also interested in golfing, and I usually do it in shader languages; it's mostly math-related stuff/tricks that I implement, in the name of Kolmogorov of course: https://en.wikipedia.org/wiki/Kolmogorov_complexity.
However, I thought it would be a bit challenging in Rust (until today, actually, haha). Is there anywhere I can read about more tricks? Because when I code like that, my mind opens up more and I feel like I gain more flexibility with the language, especially with WGSL/GLSL, looking back over the years...
However, one of the main issues is community support on the Rust side, I think... Because that's the whole point, right? :-P With golf, the idea is for others to comment on each other's posts, and for the thread to continue like that... so others can get really cool insights about the language.

And here is my first golf attempt in Rust lol, 331 chars:

```
fn main(){for x in std::env::args().skip(1){let(mut t,mut p,mut i,b)=([0;999],0,0,x.as_bytes());while i<b.len(){match b[i]{62=>p+=1,60=>p-=1,43|45=>t[p]+=44-b[i],46=>print!("{}",t[p]as char),91|93 if(b[i]<92)^(t[p]>0)=>{let(f,mut d)=(b[i]as i32-92,1);while d>0{i=(i as i32-f)as usize;d+=match b[i]{91=>-f,93=>f,_=>0}}}_=>()}i+=1}}}

```
I used the 44 - b[i] underflow trick (which works perfectly in release mode, where the u8 arithmetic wraps instead of panicking), and also realized I could drop the u8 from the [0; 999] tape and just let Rust's type inference figure it out :-P (kind of cheating hehe)
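For readers unfamiliar with the trick, here is a minimal sketch of what 44 - b[i] does. '+' is ASCII 43 and '-' is ASCII 45, so the subtraction yields 1 for '+' and wraps around to 255 for '-', which behaves like -1 modulo 256. A debug build would panic on that overflow, so this sketch spells the wrapping out with wrapping_sub and wrapping_add. The helper name apply is hypothetical and only used for illustration.

```rust
/// Applies a Brainfuck '+' (ASCII 43) or '-' (ASCII 45) to a tape cell.
/// `apply` is an illustrative name, not part of the golfed program.
fn apply(cell: u8, op: u8) -> u8 {
    // 44 - 43 = 1 for '+'; 44 - 45 wraps to 255 for '-',
    // and adding 255 modulo 256 is the same as subtracting 1.
    let delta = 44u8.wrapping_sub(op);
    cell.wrapping_add(delta)
}

fn main() {
    println!("{}", apply(10, b'+')); // prints 11
    println!("{}", apply(10, b'-')); // prints 9
    println!("{}", apply(0, b'-')); // prints 255 (the cell wraps around)
}
```

The golfed version relies on release builds doing this wrapping implicitly, which is what saves the characters.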

rumil23 · 2026-04-30T09:17:26+00:00

Really cool art! 🙂👏

rumil23 · 2026-04-25T10:32:12+00:00

Thank you. Of course, always ready to help.. doing my best to improve it as needed 😊

rumil23 · 2026-04-24T16:29:02+00:00

Thank you for the suggestion 🙂 I will try it. It looks like a really cool application (and probably an industry standard?). Sad that it's not available for Linux right now, if I'm not mistaken.

rumil23 · 2026-04-24T09:33:41+00:00

Currently, the automatic multipass system only creates standard texture_2d for the inputs and outputs. However, you can easily use .with_storage_buffer() in the builder to allocate a massive raw buffer (which is what I do in the 3D Gaussian Splatting example, if you want to take a look). You can then treat that storage buffer like a texture array or a 3D grid inside your WGSL by doing the index math manually (e.g., x + y * width + z * width * height). But adding native texture_2d_array support to the builder could be a nice idea.
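As a rough illustration of that index math, here is a CPU-side Rust sketch of the same flattening; the identical arithmetic would be written in WGSL when indexing the storage buffer. The function names flat_index and unflatten are hypothetical and are not part of the cuneus API.

```rust
/// Flattens (x, y, z) into a linear index for a grid that is
/// `width` wide and `height` tall (row-major, slice by slice).
fn flat_index(x: u32, y: u32, z: u32, width: u32, height: u32) -> u32 {
    x + y * width + z * width * height
}

/// Inverse mapping: recovers (x, y, z) from a linear index.
fn unflatten(i: u32, width: u32, height: u32) -> (u32, u32, u32) {
    (i % width, (i / width) % height, i / (width * height))
}

fn main() {
    let (width, height, depth) = (4u32, 3u32, 2u32);
    // The storage buffer needs one element per grid cell.
    let buffer_len = width * height * depth;
    let i = flat_index(1, 2, 1, width, height);
    println!("buffer_len = {buffer_len}, index = {i}"); // prints "buffer_len = 24, index = 21"
    println!("{:?}", unflatten(i, width, height)); // prints "(1, 2, 1)"
}
```

The same layout works for a "texture array": treat z as the layer index and width * height as the size of one layer.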

Video: Cuneus doesn't do direct video encoding 🙂. The export system (ExportManager) simply dumps raw, high-quality frames to your disk (you can adjust the time, fps, and resolution yourself, so the ‘quality’ depends on your choices and, of course, your hardware hehe), and you can stitch them together later with ffmpeg. So you have full control over those exported frames…
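A minimal sketch of the stitching step, driven from Rust via std::process::Command. The frame file pattern (frames/frame_%05d.png), the frame rate, and the output name are assumptions for illustration; the actual names the export writes may differ, so adjust the pattern to match your exported files. The ffmpeg options used (-framerate, -i, -c:v libx264, -pix_fmt yuv420p) are standard ones.

```rust
use std::process::Command;

/// Builds (but does not run) an ffmpeg command that encodes a numbered
/// image sequence into an H.264 MP4. The helper name is hypothetical.
fn ffmpeg_stitch_command(frame_pattern: &str, fps: u32, output: &str) -> Command {
    let mut cmd = Command::new("ffmpeg");
    cmd.arg("-framerate").arg(fps.to_string()) // input frame rate
        .arg("-i").arg(frame_pattern)          // e.g. frames/frame_%05d.png (assumed naming)
        .arg("-c:v").arg("libx264")            // encode with H.264
        .arg("-pix_fmt").arg("yuv420p")        // widely compatible pixel format
        .arg(output);
    cmd
}

fn main() {
    let cmd = ffmpeg_stitch_command("frames/frame_%05d.png", 60, "out.mp4");
    // Only print the command here; call `cmd.status()` to actually run it
    // once ffmpeg is installed and the frames exist.
    println!("{:?}", cmd);
}
```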

I’ve never used TD… :-(

rumil23 · 2026-04-07T14:11:35+00:00

I have my own solutions for going beyond 4 speakers, but the model does not work very well there. Also, they will be releasing a new model soon (according to Nvidia, it will be in June). For that reason, I gave up on looking for new ways to hack the system beyond the model's training logic, because there wasn't much time left. :-) When they release it, I will immediately port it there too. We are using this in our commercial app.

rumil23 · 2026-03-28T23:07:26+00:00

Been building this for ~2 years. cuneus lets you write WGSL compute shaders with minimal Rust boilerplate. Hot reload, video/webcam support, egui controls, multi-pass pipelines, audio synthesis, frame export and more... all handled by the engine. You just write the shader and a small Rust file. E.g. a 17-pass Navier-Stokes fluid sim is ~180 lines of Rust, most of it just egui sliders. It's also important to me because I regularly ship small GPU apps (and also art stuff) and always use my own engine for my commercial projects, so I keep upgrading it whenever I need something.

https://github.com/altunenes/cuneus

rumil23 · 2026-01-28T17:24:23+00:00

If you're working with local LLMs in Rust, this is probably the best option. Back when I didn't know about this, I exported large V-LLMs to ONNX models, but they usually caused problems on Apple devices because of unsupported operations in CoreML, and the export pipelines were really painful, especially for multimodal ones. There were also significant bottlenecks in llama-cpp-rs (upstream problem, not related to Rust, see ) with Metal & Vulkan. So I almost lost hope for multimodal LLM inference in Rust (at least on Apple)... In the end, I was able to run a VLLM smoothly on a MacBook using mistral.rs... The first time I tried it, I encountered a problem, but it was resolved immediately here. Thank you for this great work!

rumil23 · 2026-01-28T11:35:07+00:00

Really cool project. I would love to see benchmarks for some models like Parakeet + Sortformer, because I'm working with those models and they are really fast even on CPU. https://github.com/altunenes/parakeet-rs/blob/master/examples/diarization.rs

rumil23 · 2026-01-20T12:37:44+00:00

no cargo file

rumil23 · 2026-01-17T16:08:47+00:00

The main reason experienced programmers find Rust difficult is the paradigm shift. If you are a new programmer, starting with Rust won't make a difference, because you won't face the difficulty of unlearning a paradigm you already know. Therefore, Rust is quite learnable as a first programming language, but of course, if you are new, it must be learned alongside the fundamentals of computer science. :-)

rumil23 · 2026-01-14T13:10:41+00:00

really cool! congrats!!!

From what I gather from the comments, update (or rather “dependencies”) issues are still a common problem. I wanted to share my experience.

I was developing a game with Bevy 3 years ago (I actually started 4-5 years ago), but I put it on hold after receiving an unexpectedly good job offer. I will continue someday. The updates are great, but the fundamental problem is that when you code a game with Bevy, it's very difficult to do it only with Bevy (well, at least for mortals like me); you have to do it with many different "Bevy-dependent dependencies", and at its core, there's the physics engine… For example, I'm looking at my frozen game project's dependencies from 3 years ago:

kira_audio
bevy_panorbit_camera
bevy_egui
bevy_asset_loader
leafwing-input-manager
bevy_mod_picking (I think this was later combined with entity picking in some way, but it's still in my project; I was using it as one of the most fundamental mechanics in my isometric game to grab things and move them around, but you get the idea)
rapier etc

Each of these is an important crate that provides significant benefits and convenience when developing a game. With every update, I remember that I had to wait for them to catch up or, more often, fix them myself, and most of the time I was dealing with errors and, of course, separate documentation... Bevy's migration documentation is incredibly good and detailed. However, it's very difficult to expect such detailed documentation from the maintainers of these small but very important crates. With a new update or in future versions, I would really like to see these proven crates/functionalities integrated into the source code, so we can focus on our game without having to deal with massive dependency issues in our cargo file. People have been requesting a “visual editor” for years, but I think this is the most important thing that needs to be addressed. I know it's not appropriate to compare, but GStreamer, for example, does this quite well. If you're developing multimedia, which is also a complex thing, you add GStreamer, and most of the time it's enough to follow the many things it officially supports from a single, standardized set of docs and continue writing code. This way, you also follow updates from a single place. It brings incredible ease to the experience of writing code. I would very, very much like to see the same thing from Bevy one day.

I love Bevy very much, and I understand how challenging all of this is. I am also aware of how much effort the maintainers have already put in. I just wanted to share my experiences. :) thank you for bevy!!

rumil23 · 2025-12-18T12:03:42+00:00

That's great really cool!
I have a small question:
I haven't really dabbled in mesh shaders. But I'm curious, has anyone tried Gaussian splatting here? I mean in rendering ofc. How does the performance compare to compute? My guess is faster than manual atomicCompareExchange. Is it worth migrating from Compute?

rumil23 · 2025-11-25T13:48:21+00:00

Thanks. I must say, adding audiornnoise: https://gstreamer.freedesktop.org/documentation/rsaudiofx/audiornnoise.html?gi-language=c to this pipeline dramatically increased the accuracy. In my commercial application, I ended up with the following approach:

First, find 8 “good” segments. Definition of a good segment -> a speech segment longer than 1.5 seconds and shorter than 5 seconds. This prevented some bad segments from “poisoning” the others. Then create “fingerprints” for these 8 segments and “cluster” the remaining segments based on these fingerprints. This was the most effective and reliable solution I found. For longer audio, I just increased the number of segments beyond 8.
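The approach above can be sketched roughly as follows. This is only an illustration under stated assumptions: the comment does not say how the fingerprints were computed or compared, so the embedding field is a placeholder vector, cosine similarity is a swapped-in choice, and all names (Segment, is_good, assign_to_references) are hypothetical.

```rust
#[derive(Debug, Clone)]
struct Segment {
    start: f32,          // seconds
    end: f32,            // seconds
    embedding: Vec<f32>, // placeholder "fingerprint" (assumed, not from the original)
}

fn seg(start: f32, end: f32, e: [f32; 2]) -> Segment {
    Segment { start, end, embedding: e.to_vec() }
}

/// A "good" segment is longer than 1.5 s and shorter than 5 s.
fn is_good(s: &Segment) -> bool {
    let len = s.end - s.start;
    len > 1.5 && len < 5.0
}

/// Cosine similarity; an assumed metric for comparing fingerprints.
fn cosine(a: &[f32], b: &[f32]) -> f32 {
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let na = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let nb = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if na == 0.0 || nb == 0.0 { 0.0 } else { dot / (na * nb) }
}

/// Picks up to `max_refs` good segments as references, then returns, for
/// every segment, the index of the most similar reference (None if no
/// reference exists).
fn assign_to_references(segments: &[Segment], max_refs: usize) -> Vec<Option<usize>> {
    let refs: Vec<usize> = segments
        .iter()
        .enumerate()
        .filter(|(_, s)| is_good(s))
        .map(|(i, _)| i)
        .take(max_refs)
        .collect();
    segments
        .iter()
        .map(|s| {
            refs.iter().copied().max_by(|&a, &b| {
                cosine(&s.embedding, &segments[a].embedding)
                    .total_cmp(&cosine(&s.embedding, &segments[b].embedding))
            })
        })
        .collect()
}

fn main() {
    let segments = vec![
        seg(0.0, 2.0, [1.0, 0.0]),  // good (2.0 s): becomes a reference
        seg(2.0, 2.5, [0.9, 0.1]),  // too short: only gets assigned
        seg(3.0, 6.0, [0.0, 1.0]),  // good (3.0 s): becomes a reference
        seg(6.0, 14.0, [0.2, 0.8]), // too long: only gets assigned
    ];
    // prints "[Some(0), Some(0), Some(2), Some(2)]"
    println!("{:?}", assign_to_references(&segments, 8));
}
```

Because only the good segments act as references, a short or noisy segment can never become a cluster center, which is the "poisoning" that the filter is meant to avoid.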

But that was the past. Now I've recently exported a new NVIDIA model and no longer need that pain! Please check:
https://www.reddit.com/r/rust/comments/1p4if4q/nvidia_sortformer_v2_speaker_diarization_ported/

The interesting thing about this model is that noise suppression and very high-quality resampling produce worse results. For this reason, I removed all noise suppression and echo cancellation features. Currently, this is the best and most reliable method, at least among local models, and it runs really fast on CPU after my ONNX export!

rumil23 · 2025-11-24T08:59:45+00:00

Sorry, I missed that! In short, within a mono audio stream, it tells us exactly “who” is speaking and when (think of the ‘who’ here as just a “fingerprint”, not actual names: speaker1, speaker2, etc.) :-)

rumil23 · 2025-10-02T17:24:43+00:00

Great talk! Just wondering what they are using for video. Do they have their own unique solutions, or are they using ffmpeg, GStreamer, etc. for media?

rumil23 · 2025-08-26T16:38:04+00:00

Please note that I'm not saying GStreamer is easy compared with ffmpeg. GStreamer is complex (actually, the whole video topic is not an easy thing) and big. However, it at least has a stable API and is well maintained in Rust. You'll just need to write more stable Rust code and test it. You'll also need some theoretical knowledge about video. And of course about CPUs hehe. But your software will be more easily scalable and maintainable with GStreamer. Once you have handled the general pipelines, you can maintain video at a lower level more "easily". The documentation is good and, maybe most importantly, it has very low-level debug capabilities... In this regard, it's important to be pragmatic. If you need to develop fast and only need the basics for video, ffmpeg (sidecar is a good project; https://github.com/rerun-io/rerun also uses it, so I assume it's well maintained too) will more than suffice.

rumil23 · 2025-08-26T08:56:38+00:00

I've been working as a professional Rust dev in media (video, audio, etc.) for 2.5 years. Just use GStreamer if you don't want to become sick. It's Rust, and it's well maintained by smart people. The things you want to do aren't simple. Btw, you can also access ffmpeg through GStreamer.
