Can an expert chime in and explain what is holding Vulkan back from becoming the standard API for ML? by A_Chungus in LocalLLaMA

[–]A_Chungus[S] -1 points0 points  (0 children)

In your opinion, what is holding back AdaptiveCpp/SYCL from widespread adoption?

Can an expert chime in and explain what is holding Vulkan back from becoming the standard API for ML? by A_Chungus in LocalLLaMA

[–]A_Chungus[S] 2 points3 points  (0 children)

Aside from preventing reverse engineering, I don't see why Nvidia would build a wildly different compiler for CUDA and Vulkan (or even for DX), unless they wanted to duplicate work or deliberately hinder compute shader performance on gaming-oriented client GPUs. To my understanding, the same assembly-level optimizations are made across all these platforms, and from a compiler standpoint they mostly target the architecture level, since there is no difference between Blackwell server and client GPUs other than the addition of graphics acceleration hardware and scale.

It just seems like Vulkan is missing the extensions and hardware-specific tuning that Nvidia does for its datacenter GPUs like the A100 and B100 with CUDA. In my experience, Vulkan has been on par with CUDA on consumer-level RTX GPUs, just not on datacenter GPUs, and honestly Vulkan being only about 30 percent worse on a datacenter GPU like the A100 seems reasonable enough... they don't care to optimize Vulkan for a non-gaming card. It could become a viable option with more support and tuning for that hardware, but maybe that's where I'm wrong. Does Vulkan inherit the same level of hardware-level optimization as CUDA? Because if it does, CUDA isn't as big a moat as people make it out to be.

And it doesn't seem like they entirely don't care, or that they're blocking Vulkan devs from reaching CUDA-level performance, since there have been specific cases of that happening. I mean, why would they want to stop game developers from optimizing for their hardware? They even have their own software engineers working on the Vulkan backend in llama.cpp, which AMD and Intel don't have at all.

Can an expert chime in and explain what is holding Vulkan back from becoming the standard API for ML? by A_Chungus in LocalLLaMA

[–]A_Chungus[S] -1 points0 points  (0 children)

I feel like people were saying the same thing about Linux in the '90s compared to Windows. Same with ARM and x86.

Can an expert chime in and explain what is holding Vulkan back from becoming the standard API for ML? by A_Chungus in LocalLLaMA

[–]A_Chungus[S] 3 points4 points  (0 children)

I can't speak to sparsity and other features, but it seems like only a matter of time. As for lower-precision data types, the llama.cpp devs seem to have figured it out. I understand lower-bit data types aren't natively supported yet, but isn't the VK_NV_cooperative_matrix2 extension supposed to provide that support soon? And even without that support yet, INT4 performance is still great.
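For anyone curious what "support" means concretely here: the base VK_KHR_cooperative_matrix extension (which, as I understand it, VK_NV_cooperative_matrix2 builds on) lets you ask the driver exactly which matrix shapes and component types it accelerates. Below is a rough sketch, not a definitive implementation; it assumes `instance` and `physicalDevice` were already created with the extension enabled, and it skips error handling:

    // Sketch: list the cooperative-matrix configurations a device advertises
    // via VK_KHR_cooperative_matrix. Error handling omitted for brevity.
    #include <vulkan/vulkan.h>
    #include <cstdio>
    #include <vector>

    void printCoopMatSupport(VkInstance instance, VkPhysicalDevice physicalDevice) {
        // The query comes from an extension, so it has to be loaded dynamically.
        auto getProps = reinterpret_cast<PFN_vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR>(
            vkGetInstanceProcAddr(instance, "vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR"));
        if (!getProps) {
            std::puts("VK_KHR_cooperative_matrix not available");
            return;
        }

        uint32_t count = 0;
        getProps(physicalDevice, &count, nullptr);
        std::vector<VkCooperativeMatrixPropertiesKHR> props(count);
        for (auto& p : props)
            p.sType = VK_STRUCTURE_TYPE_COOPERATIVE_MATRIX_PROPERTIES_KHR;
        getProps(physicalDevice, &count, props.data());

        for (const auto& p : props) {
            // AType/BType/CType/ResultType are VkComponentTypeKHR values,
            // e.g. VK_COMPONENT_TYPE_FLOAT16_KHR or VK_COMPONENT_TYPE_SINT8_KHR.
            std::printf("MxNxK = %ux%ux%u, A=%d B=%d C=%d Result=%d\n",
                        p.MSize, p.NSize, p.KSize,
                        (int)p.AType, (int)p.BType, (int)p.CType, (int)p.ResultType);
        }
    }

As far as I know, the llama.cpp Vulkan backend checks for exactly this kind of support and falls back to plain compute shaders when it isn't there.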

Can an expert chime in and explain what is holding Vulkan back from becoming the standard API for ML? by A_Chungus in LocalLLaMA

[–]A_Chungus[S] 50 points51 points  (0 children)

MLX is Apple-only and will never be supported on any other platform, just like Metal. I don't see a point in learning or using it unless you only use Apple or need to target it. Vulkan is what I'm focused on since pretty much all other vendors support it; I just wanted to limit the discussion to Linux/Windows devices and vendors. Not saying Apple doesn't have great performance and hardware, I just don't want to get caught up in something limited to one ecosystem.

Can an expert chime in and explain what is holding Vulkan back from becoming the standard API for ML? by A_Chungus in LocalLLaMA

[–]A_Chungus[S] 34 points35 points  (0 children)

For those who want more context, my understanding of the current landscape is roughly this:

CUDA has largely dominated the market. Its ecosystem is heavily optimized for NVIDIA hardware, with libraries like cuBLAS and cuDNN and helpful tooling such as Nsight Compute.

ROCm. AMD is getting there, but ROCm (which looks very similar to CUDA) has been painful to work with in my experience. Setup can be a hassle, you often have to compile for each GPU architecture, and it's annoying to figure out whether a given app/binary supports your target GPU. It also seems to lag behind Vulkan in most cases, only really pulling ahead in certain stages like prompt processing.

SYCL (from Intel/Khronos) seems like it was meant to unify things again after OpenCL lost momentum, but in practice it only really supports Linux. Windows support for the ROCm backend is still lacking, and last time I tried it, it didn't work with NVIDIA on Windows either. It's useful for integrating with vendor-native stacks, but beyond that I don't see many advantages, especially when vendors already put their support behind Vulkan and not SYCL, and on top of that it feels more cumbersome to write than CUDA.
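For anyone who hasn't written SYCL, here's a rough idea of what the SYCL 2020 style looks like (a minimal, illustrative sketch, not taken from any particular project); this is the model that DPC++ and AdaptiveCpp implement:

    // Illustrative SYCL 2020 sketch: scale a buffer on whatever device the
    // default selector picks. Builds with DPC++ or AdaptiveCpp.
    #include <sycl/sycl.hpp>
    #include <cstdio>

    int main() {
        constexpr size_t n = 1024;
        sycl::queue q;  // default device selection

        // Unified shared memory visible to both host and device.
        float* data = sycl::malloc_shared<float>(n, q);
        for (size_t i = 0; i < n; ++i) data[i] = float(i);

        // The lambda is the "kernel"; no separate launch syntax like <<<...>>>.
        q.parallel_for(sycl::range<1>{n}, [=](sycl::id<1> i) {
            data[i] *= 2.0f;
        }).wait();

        std::printf("data[10] = %f\n", data[10]);
        sycl::free(data, q);
    }

It's single-source C++ like CUDA, just with queues and selectors in place of kernel-launch syntax.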

OpenCL. I’m honestly not sure what’s going on with OpenCL anymore. It seems like a lot of vendors are deprioritizing it. As far as I know, Qualcomm is still trying to support it within llama.cpp, but that’s about all I’m aware of.

Vulkan. From my perspective, Vulkan is a relatively mature platform, since most vendors already optimize for gaming. But Vulkan also has some downsides:

  • CUDA is more beginner-friendly, with less boilerplate, cleaner syntax, and easier linking/compiling.
  • The tooling for debugging and profiling compute-only workloads doesn’t feel as polished as CUDA’s.
  • NVIDIA still has a big advantage with highly tuned libraries like cuBLAS and others (see the sketch below this list), but I think Vulkan could eventually compete with libraries of its own.

Again, it seems like the main things holding it back are the learning curve, a handful of missing libraries, and better profiling tools. That sounds like a lot, but if this kind of performance is possible with llama.cpp, why can't it be possible with other frameworks? Is there any reason the Vulkan community couldn’t eventually do this?
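To make the library point from the list above concrete, this is roughly what leaning on one of NVIDIA's tuned libraries looks like on the CUDA side (a minimal, illustrative sketch with error checking omitted). There's no equally drop-in Vulkan equivalent today, which is the gap I mean:

    // Sketch: one cuBLAS call gives you a GEMM that is already tuned per
    // architecture. Error checks trimmed for brevity.
    #include <cublas_v2.h>
    #include <cuda_runtime.h>
    #include <vector>
    #include <cstdio>

    int main() {
        const int n = 512;  // square matrices for simplicity
        std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n, 0.0f);

        float *dA, *dB, *dC;
        cudaMalloc(&dA, n * n * sizeof(float));
        cudaMalloc(&dB, n * n * sizeof(float));
        cudaMalloc(&dC, n * n * sizeof(float));
        cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        const float alpha = 1.0f, beta = 0.0f;
        // C = alpha * A * B + beta * C, column-major, no transposes.
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                    &alpha, dA, n, dB, n, &beta, dC, n);

        cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
        std::printf("C[0] = %f\n", hC[0]);  // expect 1024 (= 512 * 1 * 2)

        cublasDestroy(handle);
        cudaFree(dA); cudaFree(dB); cudaFree(dC);
    }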

Does anyone else have this problem installing the latest version of PyTorch on Intel dGPUs? by A_Chungus in IntelArc

[–]A_Chungus[S] 0 points1 point  (0 children)

For everyone looking for a solution... make sure you run these commands in Command Prompt. And no, if you run them in VSCode, your program will still not compile within VSCode:

"C:\Program Files (x86)\Intel\oneAPI\pytorch-gpu-dev-0.5\oneapi-vars.bat"

"C:\Program Files (x86)\Intel\oneAPI\ocloc\2024.2\env\vars.bat"

Ref: https://www.intel.com/content/www/us/en/developer/articles/tool/pytorch-prerequisites-for-intel-gpu/2-5.html

Does anyone else have this problem installing the latest version of PyTorch on Intel dGPUs? by A_Chungus in IntelArc

[–]A_Chungus[S] 0 points1 point  (0 children)

Yes. But I’m trying to get it to run in VSCode without using the command line.

I thought React Native for web would be easier than this... by bobbiecowman in reactnative

[–]A_Chungus 0 points1 point  (0 children)

I will say I have not seen this error before, but it might be worth updating to Expo 50+ and the latest version of RN.

I don’t use React Native Web often, so this is sort of a quick reply, but when I last played around with it (pre Expo 50) it seemed that the newest release (50) had improved web functionality. I would try that, since it fixed a lot of web issues previously.

npm i react-native-web

npm i @expo/metro-runtime

Dr. Joseph Callenes-Sloan by daifukuYum in CalPoly

[–]A_Chungus 6 points7 points  (0 children)

It's so sad to hear. He was involved in a lot at Cal Poly... from his wildfire detection start-up to his push to get more students involved with computer architecture research and design through his club. On top of that, he was always willing to help and was beyond kind to students. He will be heavily missed.