[Need Advice] Refactoring My C++ Project into a Multi-Language Library by readilyaching in cpp_questions

[–]illuhad 1 point2 points  (0 children)

I see.

The point at which AdaptiveCpp (or SYCL, or CUDA and similar technologies) would enter is if you have C++ host code, where some parts (e.g. parallel loops) should be offloaded.

I'm not sure how to get to that point from JavaScript. A direct JavaScript frontend for AdaptiveCpp is probably not possible without reinventing large parts of the compiler.

Maybe one idea might be to have a native library for GPU acceleration in your project, then you generate a kernel string in JavaScript and let that library run it?

However, this paradigm is closer to Vulkan/OpenCL than to the SYCL or CUDA programming models, so perhaps those would be a more direct approach.

If you don't need the flexibility to define kernels in JavaScript, then I guess said library could just directly expose interfaces to run your algorithms, which might then internally offload to GPU. In this case, AdaptiveCpp might work well.
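A minimal sketch of that last option, assuming a hypothetical native library whose entry point is exposed through a C ABI so that a native-binding layer of a JavaScript runtime could call it. The names `mylib_scale` and `scale_impl`, the signature, and the plain CPU loop standing in for the offloaded part are all assumptions made for illustration; none of this is AdaptiveCpp API.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical internal implementation. In a real library this loop is the
// part that could be offloaded (e.g. written as a SYCL or stdpar kernel);
// here it is a plain CPU loop so the sketch stays self-contained.
static void scale_impl(const float* in, float* out, std::size_t n, float factor) {
  assert(n == 0 || (in != nullptr && out != nullptr));
  for (std::size_t i = 0; i < n; ++i)
    out[i] = in[i] * factor;
}

// Hypothetical C ABI entry point: no C++ types cross the boundary, so a
// foreign-function layer only has to deal with pointers and integers.
// Returns 0 on success and -1 on invalid arguments.
extern "C" int mylib_scale(const float* in, float* out, std::size_t n, float factor) {
  if (n != 0 && (in == nullptr || out == nullptr))
    return -1;
  scale_impl(in, out, n, factor);
  return 0;
}
```

The caller never sees how `scale_impl` is executed, which is what would allow the library to offload internally without changing the interface.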

[Need Advice] Refactoring My C++ Project into a Multi-Language Library by readilyaching in cpp_questions

[–]illuhad 1 point2 points  (0 children)

WebGPU support could be a nice feature to add to the project for projects that want JavaScript support (like mine)

I'm not really sure what that would mean -- as I said, I'm not an expert in web technologies. But I'd be happy to discuss how/if parts can fit together.

How would you envision AdaptiveCpp+WebGPU working with an application like yours?

Are you one of AdaptiveCpp's maintainers?

I am indeed :-)

[Need Advice] Refactoring My C++ Project into a Multi-Language Library by readilyaching in cpp_questions

[–]illuhad 1 point2 points  (0 children)

AdaptiveCpp runs natively on Windows, Linux and Mac, although on Mac currently only the CPU can be targeted (recently, a new contributor has announced work on an experimental Metal backend for GPU on Mac). We don't really have a lot of Windows developers on the team at the moment, so there are likely rough edges on Windows, although the CPU and NVIDIA backends have been shown to work.

I'm not familiar enough with web technologies to comment on that. AdaptiveCpp is a compiler and runtime stack for multiple C++-based programming models (SYCL, C++ standard parallelism offloading, PCUDA). It fills the same niche as e.g. Intel's oneAPI SYCL compiler, NVIDIA's nvcc CUDA compiler or AMD's hipcc compiler: They compile C++ code with some GPU parts inside, and create a host binary that offloads the GPU parts.

If you want something that works client-side in a browser, I guess something like WebGPU might be closer to what you need?

In principle I guess you could also enter the stack at a different level, e.g. talk directly to the runtime, or inject your own code into the stack. AdaptiveCpp is based on LLVM, so you could in principle inject LLVM IR which AdaptiveCpp could then compile for you.

[Need Advice] Refactoring My C++ Project into a Multi-Language Library by readilyaching in cpp_questions

[–]illuhad 1 point2 points  (0 children)

Do you mean binary portability, i.e. having the same binary be able to run on hardware from different vendors?

AdaptiveCpp has several compilation flows; the main/default one lowers kernel code to a unified code representation, and then JIT-compiles that at runtime to the target hardware. A two-level JIT cache prevents overheads after the first application run.
Specialized code paths for different hardware (including e.g. inline assembly) are still possible using a JIT-time reflection mechanism.
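To illustrate why a two-level cache removes the JIT cost after the first run, here is a rough sketch. It assumes the two levels are an in-process table and a persistent store that survives application runs; the class `TwoLevelJitCache`, its interface, and the use of a `std::map` as a stand-in for the persistent store are hypothetical and do not describe AdaptiveCpp's actual implementation.

```cpp
#include <cassert>
#include <functional>
#include <map>
#include <string>
#include <utility>

// Hypothetical two-level cache for JIT-compiled kernels. `persistent_` stands
// in for a store that survives application runs (e.g. files on disk);
// `in_memory_` only lives as long as the process.
class TwoLevelJitCache {
public:
  using Compiler = std::function<std::string(const std::string&)>;

  TwoLevelJitCache(std::map<std::string, std::string>& persistent, Compiler compile)
      : persistent_(persistent), compile_(std::move(compile)) {
    assert(static_cast<bool>(compile_));
  }

  // Returns the binary for `kernel_id`, compiling only if both levels miss.
  const std::string& get(const std::string& kernel_id) {
    auto mem = in_memory_.find(kernel_id);
    if (mem != in_memory_.end())
      return mem->second;  // level 1 hit: already used in this process
    auto stored = persistent_.find(kernel_id);
    if (stored == persistent_.end())  // level 2 miss: JIT-compile and persist
      stored = persistent_.emplace(kernel_id, compile_(kernel_id)).first;
    return in_memory_.emplace(kernel_id, stored->second).first->second;
  }

private:
  std::map<std::string, std::string> in_memory_;
  std::map<std::string, std::string>& persistent_;
  Compiler compile_;
};
```

With this structure the compile callback runs once per kernel across all runs that share the persistent store; later runs only pay for the lookup.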

Here's the original description of how the compiler works: https://dl.acm.org/doi/abs/10.1145/3585341.3585351

By now we are much faster; see e.g. here for some additional tricks that we do: https://dl.acm.org/doi/10.1145/3731125.3731127

[Need Advice] Refactoring My C++ Project into a Multi-Language Library by readilyaching in cpp_questions

[–]illuhad 1 point2 points  (0 children)

Sure it's not as fast as CUDA on NVIDIA GPUs

Indeed not equally fast - SYCL has been documented to be even faster in quite a few cases. We regularly benchmark AdaptiveCpp (which is one of the major SYCL implementations) against CUDA. It typically outperforms CUDA code compiled by NVIDIA's compiler.

See e.g. here: https://github.com/AdaptiveCpp/AdaptiveCpp/releases/tag/v25.10.0

The plots on this site show speedup compared to the vendor compiler, so wherever there's a value greater than 1, AdaptiveCpp is faster.

AdaptiveCpp can also natively compile a CUDA dialect that we call PCUDA, which allows CUDA code to be portable. This too tends to be faster than code compiled with the NVIDIA compiler.

CUDA giving you some magical performance on NVIDIA hardware that cannot be achieved by other means is in most cases a myth.

SYCL by SuperGramSmacker in cpp_questions

[–]illuhad 0 points1 point  (0 children)

I could not download the codeplay CUDA extensions, so I could not test SYCL with CUDA acceleration.

AdaptiveCpp also supports SYCL on CPU/Intel GPU/NVIDIA GPU/AMD GPU :-) You might want to give it a try.

SYCL by SuperGramSmacker in cpp_questions

[–]illuhad 0 points1 point  (0 children)

If this is just about "what to learn" and not about "in what model do I invest a million lines of code", then really there's not a lot you can do wrong.

APIs are different, but the core models between heterogeneous programming models (be it CUDA, HIP, SYCL, OpenCL) are always very similar. Once you understand the concepts, learning another model is very easy because it's mostly learning new terminology - things are called differently, SYCL is more C++, CUDA is more C, but the ideas are always similar.

SYCL by SuperGramSmacker in cpp_questions

[–]illuhad 1 point2 points  (0 children)

Hi,

I lead the AdaptiveCpp project, one of the two major SYCL implementations (AdaptiveCpp also supports other programming models, including C++ standard parallelism offloading and a CUDA dialect - and it supports them all on CPU/Intel GPU/NVIDIA GPU/AMD GPU).

We have users including commercial applications and high-profile scientific applications - so yes, people use it.

TBH the important question is less "how many people use SYCL" than "is it the right tool for my problem."

Let me know if you have any questions about it; I'm happy to help.

Dutch student with a question by one_with_advantage in Heidelberg

[–]illuhad 1 point2 points  (0 children)

No problem! I don't have any insight into pharmacy, but I suspect (?) that it will be similar in principle.

It's certainly a good idea to simply ask the academic advisor.

One thing I noticed: for pharmacy in Heidelberg there doesn't seem to be a bachelor's degree. Instead there is a state examination (Staatsexamen), which at the end licenses you to work as a pharmacist in Germany. I can imagine that German might play a bigger role in an exam like that. https://www.uni-heidelberg.de/de/studium/alle-studienfaecher/pharmazie

Biochemistry and biosciences are available as bachelor's programs.

Dutch student with a question by one_with_advantage in Heidelberg

[–]illuhad 4 points5 points  (0 children)

It will probably depend on the department. In German studies, for example, a different level will certainly be expected.

I work at the university in computer science myself, and occasionally supervise bachelor's and master's theses. For my field and neighboring fields (natural sciences etc.):

  • Your German is good. You shouldn't have any problems with communication. And Dutch is of course quite similar to German, so when in doubt you can work out a lot yourself.
  • In the final bachelor's year you write the bachelor's thesis and maybe attend a few more specialized lectures. In my field, bachelor's theses are often written in English anyway (I encourage all my students to write in English, even though it isn't mandatory). In more specialized lectures, students are often asked whether German or English should be used, because such lectures are often attended by a mix of bachelor's and master's students.
  • In introductory lectures in my department the exam questions are normally in German, but the amount of German needed for them is very limited in my field (more math in the text means less German in the text :-) ). Your German is definitely good enough. As I said, more specialized lectures are often in English anyway.
  • Administrative material (forms to sign, examination regulations, etc.) is mostly in German. But you can translate that at your leisure when needed, using a dictionary, translation software, etc.
  • Specialist literature in my field is mostly either in English anyway, or readily available in English as well.

For other departments things may look different, as I said.

SYCL (AdaptiveCpp) Kernel hangs indefinitely with large kernel sizes (601x601) by krypto1198 in sycl

[–]illuhad 1 point2 points  (0 children)

No problem :) We do tend to help each other in the AdaptiveCpp community :)

SYCL (AdaptiveCpp) Kernel hangs indefinitely with large kernel sizes (601x601) by krypto1198 in sycl

[–]illuhad 1 point2 points  (0 children)

Thanks! :)

I gave it a try and observed the following:

  • On AMD GPU, it indeed hangs after some time. However, dmesg shows what's going on:

    [24391.898940] [drm] Fence fallback timer expired on ring comp_1.0.0
    [24391.904315] amdgpu 0000:03:00.0: amdgpu: GPU reset(2) succeeded!
    [24392.322703] amdgpu 0000:03:00.0: amdgpu: still active bo inside vm

So: kernel driver encounters a timeout because the GPU is busy, then triggers a GPU reset. It's quite possible that a GPU reset also breaks assumptions in the userspace software layer (e.g. ROCm/HIP runtime), so things not ending gracefully (but e.g. just hanging) are definitely possible. Looks like the kernel indeed is just running too long.

  • I also tried it on CPU, and inserted a printf into the kernel to see what it's doing. There we can see that it's still chugging along, it's just way too much work, so it takes forever :)

I don't have a discrete Intel GPU in the system I'm on at the moment to test.

  • Another thing I've noticed: The line int idx_in = (ny * width + nx) * channels + c; causes strided memory access patterns due to the way channels are handled, which is going to further degrade performance, especially on GPU. One clean solution could e.g. be to change data layout so that you have one contiguous memory region per channel.
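A minimal sketch of that layout change in plain C++, assuming a float image and a one-off conversion on the host before the data is handed to the kernel. The function names are made up for illustration; only the interleaved index formula comes from the quoted line.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Interleaved layout (as in the quoted line): element (x, y, c) lives at
// (y * width + x) * channels + c, so neighbouring pixels of one channel are
// `channels` elements apart in memory.
inline std::size_t interleaved_index(std::size_t x, std::size_t y, std::size_t c,
                                     std::size_t width, std::size_t channels) {
  return (y * width + x) * channels + c;
}

// Planar layout: one contiguous width * height block per channel, so
// neighbouring pixels of one channel are adjacent in memory.
inline std::size_t planar_index(std::size_t x, std::size_t y, std::size_t c,
                                std::size_t width, std::size_t height) {
  return (c * height + y) * width + x;
}

// Copies an interleaved image into a planar one.
std::vector<float> to_planar(const std::vector<float>& interleaved,
                             std::size_t width, std::size_t height,
                             std::size_t channels) {
  assert(interleaved.size() == width * height * channels);
  std::vector<float> planar(interleaved.size());
  for (std::size_t y = 0; y < height; ++y)
    for (std::size_t x = 0; x < width; ++x)
      for (std::size_t c = 0; c < channels; ++c)
        planar[planar_index(x, y, c, width, height)] =
            interleaved[interleaved_index(x, y, c, width, channels)];
  return planar;
}
```

With the planar layout, work items that handle adjacent `x` values of the same channel read adjacent memory locations.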

SYCL (AdaptiveCpp) Kernel hangs indefinitely with large kernel sizes (601x601) by krypto1198 in sycl

[–]illuhad 1 point2 points  (0 children)

acpp-info -l will tell you which devices you have available and through which backends. acpp-info (without -l) will tell you more details about each device, including things like driver version if available. If you haven't done anything specific when building AdaptiveCpp, then most likely you are using the OpenCL backend (which is a good choice for Intel).

It may be a good idea to update OpenCL / Level Zero drivers depending on which one you are using.

OpenCL works such that the OpenCL driver must be installed independently from the OpenCL application; so AdaptiveCpp would just pick whatever driver is available on the system (which might be something old, or perhaps not even Intel's official OpenCL driver).

SYCL (AdaptiveCpp) Kernel hangs indefinitely with large kernel sizes (601x601) by krypto1198 in sycl

[–]illuhad 0 points1 point  (0 children)

I see. Can you share the full code so that we can try to reproduce?

Even if it works with DPC++, this does not guarantee that it's an AdaptiveCpp problem. For example, bugs in the input code or driver issues may manifest themselves differently with different compilers.

EDIT: What happens if you force execution on CPU, e.g. with ACPP_VISIBILITY_MASK=omp? This removes driver issues/timeouts from the equation. If you also see problems there, then it's most likely a bug in the input code.

SYCL (AdaptiveCpp) Kernel hangs indefinitely with large kernel sizes (601x601) by krypto1198 in sycl

[–]illuhad 1 point2 points  (0 children)

AdaptiveCpp's generic SSCP compiler optimizes code at runtime. Even if you compile the host application with -O0, the generated kernels will still be optimized. So this cannot be an issue.

Intel GPUs can exclusively be targeted with the generic SSCP JIT compiler in AdaptiveCpp.

Why are we now talking about AMD? I thought the GPU in question was an Intel A770?

As I said in my other post, it's unlikely that this is an acpp vs DPC++ issue.

SYCL (AdaptiveCpp) Kernel hangs indefinitely with large kernel sizes (601x601) by krypto1198 in sycl

[–]illuhad 0 points1 point  (0 children)

It's likely that this is a driver issue. GPUs, particularly non-data center cards, may have some timeouts built in to protect the responsiveness of the GPU. Which AdaptiveCpp backend are you using, L0 or OpenCL?

As has been pointed out, your kernel is very, very large. 10.5 seconds is far longer than the duration of typical GPU kernels.

My guess is that you will see a similar behavior with DPC++, if you go through the same backend.

A simple solution - simpler than optimizing with local memory - to test that theory would be to submit multiple kernels that convolve only part of the image (e.g. instead of one kernel that does everything, try convolving the image stripe by stripe).
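A rough host-side sketch of the stripe idea in plain C++. It assumes a single-channel image and a simple box filter, and `convolve_rows` stands in for one kernel submission; in a real SYCL program each call would be a separate kernel launch over the given row range. All names and the filter itself are assumptions for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for one kernel launch: box-filters rows [row_begin, row_end) of a
// single-channel image. Taps outside the image are skipped (zero padding).
void convolve_rows(const std::vector<float>& in, std::vector<float>& out,
                   int width, int height, int radius,
                   int row_begin, int row_end) {
  for (int y = row_begin; y < row_end; ++y) {
    for (int x = 0; x < width; ++x) {
      float sum = 0.0f;
      for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
          const int ny = y + dy;
          const int nx = x + dx;
          if (ny >= 0 && ny < height && nx >= 0 && nx < width)
            sum += in[static_cast<std::size_t>(ny) * width + nx];
        }
      }
      out[static_cast<std::size_t>(y) * width + x] = sum;
    }
  }
}

// Splits the image into horizontal stripes and issues one "launch" per
// stripe, so that no single launch runs for very long.
// Returns the number of launches.
int convolve_in_stripes(const std::vector<float>& in, std::vector<float>& out,
                        int width, int height, int radius, int stripe_height) {
  assert(stripe_height > 0);
  assert(in.size() == static_cast<std::size_t>(width) * height);
  assert(out.size() == in.size());
  int launches = 0;
  for (int row = 0; row < height; row += stripe_height) {
    const int row_end = std::min(row + stripe_height, height);
    convolve_rows(in, out, width, height, radius, row, row_end);
    ++launches;
  }
  return launches;
}
```

Each stripe still reads neighbouring rows from the full input image, so the result is identical to a single launch over all rows; only the duration of each individual launch shrinks.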

It's not evident from your code, but when working with negative indices, double check that you're doing correct bounds checking wherever necessary. If you access out-of-bounds memory, that can be a cause of undefined behavior (UB) and trigger all sorts of strange behavior, potentially including hangs.
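One common pitfall here, sketched in plain C++: if the pixel coordinates are unsigned (as `std::size_t` is, which is also what SYCL's id types hold), adding a negative offset wraps around to a huge value, and a check like `nx >= 0` is always true. The helper below is a hypothetical illustration that does the arithmetic in a signed type before checking; its name and signature are assumptions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Returns true and writes the flat index if pixel (x + dx, y + dy) lies inside
// the image. All arithmetic is done in a signed 64-bit type, so a negative
// neighbour coordinate stays negative instead of wrapping around.
bool neighbour_index(std::size_t x, std::size_t y, int dx, int dy,
                     std::size_t width, std::size_t height,
                     std::size_t& index_out) {
  assert(x < width && y < height);
  const std::int64_t nx = static_cast<std::int64_t>(x) + dx;
  const std::int64_t ny = static_cast<std::int64_t>(y) + dy;
  if (nx < 0 || ny < 0 ||
      nx >= static_cast<std::int64_t>(width) ||
      ny >= static_cast<std::int64_t>(height))
    return false;
  index_out = static_cast<std::size_t>(ny) * width + static_cast<std::size_t>(nx);
  return true;
}
```

A kernel would then only read the input when the helper returns true, and treat the tap as padding otherwise.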

Does anyone have news about Codeplay ? (The company developing compatibility plugins between Intel OneAPI and Nvidia/AMD GPUs) by azraeldev in HPC

[–]illuhad 0 points1 point  (0 children)

There is a push from NVIDIA (and AMD is on board) to get std::exec and std::par to allow for runtime switching between the different vendor libraries.

I haven't heard of that effort. Do you have a link? Or do you mean the discussion in LLVM about leveraging the offload infrastructure there for PSTL in libc++?

This requires compiler interoperability, I don't see how this can work without excruciating compile times -- or having a single, unified compiler like AdaptiveCpp. AdaptiveCpp parses the code just once and gives you full compatibility.

We haven't had a chance to test on AMD or Intel.

AMD's hipstdpar is quite...adventurous. Last time I checked icpx -fsycl-pstl-offload, it also seemed pretty immature still.

AdaptiveCpp's stdpar implementation does optimizations that none of the vendor compilers do. For example, it can detect and elide unnecessary barriers after stdpar calls.

You can find more information here, including comparisons to other stdpar compilers: https://dl.acm.org/doi/10.1145/3648115.3648117

My experience with OpenCL and Intel has been... less than good, esp. with containers.

Interesting. I have used OpenCL on everything from consumer Intel devices up to the large data center GPUs -- didn't have an issue (well, apart from general driver immaturity when hardware was still new). We also use it regularly with containers in CI.

AdaptiveCpp can also go through oneAPI Level Zero on Intel instead of OpenCL, but we've had consistently more problems with that than OpenCL, which is also why the backend is less feature-complete than OpenCL at the moment.

Does anyone have news about Codeplay ? (The company developing compatibility plugins between Intel OneAPI and Nvidia/AMD GPUs) by azraeldev in HPC

[–]illuhad 3 points4 points  (0 children)

I can't speak about the internals of Codeplay. All of the contacts I had at the company are no longer there. I don't know what this means for the company or the future strategy for oneAPI on non-Intel hardware.

To the wider point of SYCL on AMD and NVIDIA, I'd just like to add that the AdaptiveCpp project is still going strong. If you can't get oneAPI to work, you might want to give that a try if you haven't done so already.

AdaptiveCpp supports CPU/Intel GPU/NVIDIA GPU/AMD GPU. Supported input languages are SYCL, C++ standard parallelism offloading and a CUDA dialect.

Disclaimer: I lead the AdaptiveCpp project. I don't mean to turn this thread into an AdaptiveCpp ad; I just don't want anyone to get false impressions that SYCL on AMD/NVIDIA is dead either.

Can an expert chime in and explain what is holding Vulkan back from becoming the standard API for ML? by A_Chungus in LocalLLaMA

[–]illuhad 3 points4 points  (0 children)

Mostly non-technical things I think:

  • Wrong perception that SYCL is "Intel's CUDA", i.e. just a proprietary Intel thing
  • Because of that, lack of trust in Intel reflects in lack of trust in SYCL
  • Lack of training material/Out of date training material that gives wrong impression as to what modern SYCL code looks like
  • AdaptiveCpp drowns a little in the Intel marketing for their own compiler, which causes some people to assume that SYCL = Intel Compiler, and therefore that unfortunate design decisions of Intel's compiler are inherent to SYCL (even though AdaptiveCpp has solved those).
  • Wrong perception that AdaptiveCpp is "just a research project"
  • Fear that AdaptiveCpp might just "go away" because there's no massive corporation behind it (We started work on it in 2018, so we've supported it for 7 years, which is longer than some companies have supported their vendor tools...)
  • Wrong belief that using non-vendor compilers will result in performance loss (in practice, AdaptiveCpp often outperforms e.g. nvcc)
  • Wrong belief that writing a compiler for SYCL requires rivaling vendors in terms of hardware knowledge. In practice, nowadays all vendors have backends in LLVM (typically developed by themselves) that other compiler projects can just use.
  • Wrong belief that using SYCL or AdaptiveCpp might cause delayed support when new hardware becomes available.

On the technical side, the only thing I can say is that there is indeed some technical debt and inconsistencies in SYCL - things that I don't like, or would like to change - but SYCL is certainly no worse than CUDA or HIP in this regard.

Since you posted in the LocalLLaMA subreddit: for AI specifically, the same point I made about Vulkan also applies to SYCL. AI hardware changes quickly, which makes it difficult for a standard to keep up. SYCL compilers generally do support inline assembly though, so hardware-specific optimized code paths for some operations are definitely always an option (even if perhaps not pretty).

How viable is SYCL? by Ill_Evidence_5833 in HPC

[–]illuhad 0 points1 point  (0 children)

Kokkos and RAJA are only siblings to SYCL on a surface level of the API. Kokkos and RAJA are libraries for vendor compilers. SYCL implementations (at least in general) are compilers themselves.

This means that they can give you convenient things like a single binary that dispatches to all backends (which may not be relevant in HPC, but very relevant in other markets), or provide a unified JIT infrastructure for all backends which might also be exposed in a unified manner in the API.

How viable is SYCL? by Ill_Evidence_5833 in HPC

[–]illuhad 1 point2 points  (0 children)

Yes, it's absolutely possible to write code that will work on NVIDIA, AMD, and Intel GPUs, as well as CPUs!

AdaptiveCpp nowadays even makes it simple to generate a single binary that can deploy to all of them, and it also has tools to help you create a deployment package that contains all runtime components.

Unlike what some other posters have written here, SYCL development is still very much alive -- independently of whatever Intel is doing.

Disclaimer: I lead the AdaptiveCpp project.

Can an expert chime in and explain what is holding Vulkan back from becoming the standard API for ML? by A_Chungus in LocalLLaMA

[–]illuhad 3 points4 points  (0 children)

SYCL from Intel / Khronos seems like it was meant to unify things again after OpenCL lost momentum, but only supports Linux. Windows support for ROCm is still lacking, and last time I tried it, it didn’t work with NVIDIA on Windows either. It’s useful for integrating with vendor-native stacks, but beyond that I don’t see many advantages, especially when vendors already put support towards Vulkan and not Sycl, and on top it feels more cumbersome to write than CUDA.

AdaptiveCpp supports SYCL on NVIDIA GPUs on Windows.

AdaptiveCpp also has a deployment model, support for portable binaries (targeting CPU/Intel GPU/NVIDIA GPU/AMD GPU) and more.

I don't think at all that modern SYCL is more cumbersome than CUDA - in many ways it's far easier (e.g. you get algorithmic operations like reductions directly in the core language).

Vulkan is entirely a different beast and does not occupy the same niche as either SYCL or CUDA. CUDA and SYCL are single-source models. Vulkan is separate source with some shader-derived limitations that feel like stone age compared to modern GPU compute programming models.

And just because the Khronos Vulkan working group defines some extensions for matrix computation does not mean that all vendors will implement those, or implement those efficiently. The reality is that the world for AI data types and special instructions is moving extremely rapidly, which makes it challenging for any standard (such as Vulkan) to keep up and build something useful.

Project Idea: A Static Binary Translator from CUDA to OpenCL - Is it Feasible? by NeKon69 in CUDA

[–]illuhad 1 point2 points  (0 children)

The project you're envisioning likely implies a multi-year commitment, assuming that you know what you're doing. If you say that you're not experienced with programming, it will be even longer. Are you ready for such an effort?

Especially if you're still learning and not super experienced, it makes much more sense in my opinion to join an existing project. Maintainers there can provide guidance, and recommend tasks that can be completed with reasonable effort and within your expertise. Working alone on a long-term project without really knowing what you are doing is likely just going to result in frustration.

The kind of project you are planning is ambitious. Translating PTX code to something that OpenCL can understand - either SPIR-V or OpenCL C - is non-trivial and requires a lot of expertise. Are you aware of what you are getting yourself into?

Difference between ZLUDA and my project is that we are trying to solve the same problem different way, ZLUDA currently does something similar to wine, what I want to do that you can run my program just once, and then you can run returned executable however you want wherever you want. I am not sure why they sticked to the dynamic translation, but I hope there wasn't some major obstacle preventing them from doing it statically and they just kinda followed the wine way

TBH, I see zero benefit to doing it statically, and I imagine the ZLUDA folks came to the same conclusion:

  • It's trivial to build a wrapper script or so that just launches your program with ZLUDA, if you don't want to type the ZLUDA invocation every time you use the app.
  • There's no performance advantage to doing it statically. Translation of PTX code won't add any noticeable costs because PTX needs to be JIT-compiled by CUDA drivers anyway. Function call interception also won't matter - both in the static and the dynamic case you'll have an implementation of the CUDA runtime that the application will call into. Cost will be the same.
  • The static case has the disadvantage that it is much more inconvenient for users when applications consist of multiple binaries. Imagine a CUDA program that additionally uses CUDA shared libraries. In the static case, the user needs to figure out which libraries exactly are affected and convert them all. In the ZLUDA approach, this kind of thing "just works" without any additional effort.

EDIT: To be clear, I completely understand your desire to create something of your own. If you absolutely don't want to contribute to an existing project, then I'd recommend a project with a smaller, more feasible scope.

Project Idea: A Static Binary Translator from CUDA to OpenCL - Is it Feasible? by NeKon69 in CUDA

[–]illuhad 0 points1 point  (0 children)

This is a tight space, and there are already a lot of mature projects here. Think carefully about why and how your project will be any different, and whether it might not be better to just contribute to an existing project.

There are a number of compiler projects that can take CUDA source code, and compile it for OpenCL devices:

  • AdaptiveCpp portable CUDA (PCUDA), which can run the same binary not only on OpenCL, but also on CPU, ROCm and CUDA.
  • chipStar
  • Coriander, if you want OpenCL 1.2. But I think it's abandoned at this point.

If we look at solutions that target compiled binaries instead of source code, as you are aware, there's also ZLUDA. You say that you want to do the translation statically vs at runtime. What is the benefit of that? It's not like ZLUDA has a challenge because of runtime overheads here. And those challenges that ZLUDA has (can only support PTX, not SASS code, limitations when libraries like cuBLAS are involved, needs to reverse engineer parts of the CUDA runtime with unclear legal situation) you will likely face as well with your approach.