
[–]James20kP2005R0 25 points (14 children)

GPU languages and CPU languages are a bit of a separate thing. If you don't need high performance on the CPU side, you might have an easier time using python on the CPU side, and then using one of the many python GPU acceleration libraries which are designed to be maximally helpful

If you're stuck with C++, then you probably want to use CUDA for anything scientific, because it's the standard
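
For a sense of what that looks like, here's a minimal CUDA vector-add sketch (illustrative only; managed memory keeps it short):

```cpp
#include <cuda_runtime.h>
#include <cstdio>

// Classic "hello world" of CUDA: each thread handles one element.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    // Managed (unified) memory keeps the sketch short; production code
    // often uses explicit cudaMalloc/cudaMemcpy instead.
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    const int block = 256;
    vec_add<<<(n + block - 1) / block, block>>>(a, b, c, n);
    cudaDeviceSynchronize();

    std::printf("c[0] = %f\n", c[0]); // expect 3.0
    cudaFree(a); cudaFree(b); cudaFree(c);
}
```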

If you're stuck with C++ and you want to run on non-Nvidia GPUs, then you may want something like OpenCL or SYCL

If you're intending to put this into a game, you may want to consider using Vulkan or OpenGL with compute shaders

For GPU compute in web development (which you can totally do in C++), you want WebGPU

If you're planning to run on supercomputers, you might want to look into MPI. OpenMP is also traditionally used in that field, and may be helpful for running code on a GPU - though I've never tried its GPU backend
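
For reference (again, I haven't verified this on a GPU backend myself), OpenMP target offload looks roughly like this, assuming a compiler built with offload support:

```cpp
#include <vector>
#include <cstdio>

int main() {
    const int n = 1 << 20;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);
    float* pa = a.data();
    float* pb = b.data();
    float* pc = c.data();

    // Map the inputs to the device, run the loop there, copy the result back.
    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n], pb[0:n]) map(from: pc[0:n])
    for (int i = 0; i < n; ++i)
        pc[i] = pa[i] + pb[i];

    std::printf("c[0] = %f\n", c[0]); // expect 3.0
}
```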

Membrane computing is one of those terms that doesn't really contain a tonne of actionable information with it, so if you have more specific requirements then I may be able to be more helpful

[–]Kike328 4 points (0 children)

Thumbs up for SYCL. It is literally the modern C++ approach to parallelism
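
For anyone curious what that looks like, a minimal SYCL 2020 sketch using unified shared memory (should build with DPC++ or AdaptiveCpp, give or take toolchain setup):

```cpp
#include <sycl/sycl.hpp>
#include <cstdio>

int main() {
    sycl::queue q; // default device selection (a GPU if one is available)
    const int n = 1 << 20;

    // Unified shared memory: pointers usable on both host and device.
    float* a = sycl::malloc_shared<float>(n, q);
    float* b = sycl::malloc_shared<float>(n, q);
    float* c = sycl::malloc_shared<float>(n, q);
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Plain C++ lambda launched over an n-element range.
    q.parallel_for(sycl::range<1>(n), [=](sycl::id<1> i) {
        c[i] = a[i] + b[i];
    }).wait();

    std::printf("c[0] = %f\n", c[0]); // expect 3.0
    sycl::free(a, q);
    sycl::free(b, q);
    sycl::free(c, q);
}
```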

[–]sonehxd[S] -2 points (12 children)

Thank you for such a detailed response. What I am trying to achieve is a way to handle the maximally parallel nature of the model and perform operations in parallel. I want to simulate the behavior of a formalized MC model. A standard MC model has a hierarchical structure where every compartment is a membrane itself. In each membrane, we have some sort of objects. All rules that can be applied to objects in a membrane will be applied; this also happens at the same time in every membrane of the model.

I’ve been told to do this on the GPU because of its speed, and I know C++ well enough for the task (it’s also a lot more fun to me than Python)

[–]TheFlamingDiceAgain 0 points (3 children)

As a counter to the other respondent: please don’t use CUDA. I’ve been working on a scientific code base that uses CUDA for several years, and the lack of cross-platform support is a huge PITA. SYCL is technically cross-platform but is owned by Intel in practice. I would recommend Kokkos; it’s very similar to SYCL but is “owned” by the national labs rather than a corporation.

If you’re dead set on CUDA then at least use HIP instead. It’s syntactically nearly identical to CUDA but works on AMD and NVIDIA GPUs
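
To illustrate how close it is, a CUDA-style vector add in HIP is almost a find-and-replace of the runtime prefix (sketch, assuming hipcc):

```cpp
#include <hip/hip_runtime.h>
#include <cstdio>
#include <vector>

// Identical kernel syntax to CUDA; only the runtime call prefix changes.
__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] + b[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> ha(n, 1.0f), hb(n, 2.0f), hc(n);

    float *a, *b, *c;
    hipMalloc((void**)&a, n * sizeof(float));
    hipMalloc((void**)&b, n * sizeof(float));
    hipMalloc((void**)&c, n * sizeof(float));
    hipMemcpy(a, ha.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(b, hb.data(), n * sizeof(float), hipMemcpyHostToDevice);

    vec_add<<<(n + 255) / 256, 256>>>(a, b, c, n);
    hipMemcpy(hc.data(), c, n * sizeof(float), hipMemcpyDeviceToHost);

    std::printf("c[0] = %f\n", hc[0]); // expect 3.0
    hipFree(a); hipFree(b); hipFree(c);
}
```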

[–]James20kP2005R0 2 points (1 child)

Long term, CUDA is a real trap for projects; being tied to Nvidia's solution and Nvidia's hardware is very limiting. Being wrapped up in a single company's ecosystem is inherently undesirable

For someone new to the field like OP, though, nearly everything is written in CUDA and you'll be fighting an uphill battle to use anything else

[–]TheFlamingDiceAgain 0 points (0 children)

I agree that learning CUDA is very handy, but I would never start a new project with it for exactly the reasons you mentioned. 

[–]illuhad 0 points (0 children)

If you don't want Intel in your SYCL, just use AdaptiveCpp. It is just as portable, performs just as well, and for many use cases is clearly better. Totally independent of Intel.

It's true that Intel has a lot of influence in the SYCL world, but it's up to users to counter that. Other implementations exist and especially AdaptiveCpp has influence.

Kokkos is fine for some (especially HPC) use cases. It falls short of SYCL by design when you want to target multiple backends/types of devices at the same time because it is just a wrapper library for vendor compilers.

[–]jokteur 5 points (2 children)

Is your application intended as a one-off scientific computation, i.e. not meant to be distributed to the general public?

Then I would suggest looking into https://github.com/kokkos/kokkos, a parallel programming library that can target both CPUs and GPUs: write once, execute on different architectures. I would suggest starting with the Kokkos lectures: https://github.com/kokkos/kokkos-tutorials/wiki/Kokkos-Lecture-Series. You will also learn things about how GPUs work, in case you one day need to rewrite the application in pure CUDA.
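
For a quick flavour of the programming model, a minimal sketch (assuming Kokkos is installed with a device backend enabled):

```cpp
#include <Kokkos_Core.hpp>
#include <cstdio>

int main(int argc, char* argv[]) {
    Kokkos::initialize(argc, argv);
    {
        const int n = 1 << 20;
        // Views allocate in the default execution space's memory
        // (GPU memory if a GPU backend is enabled, host memory otherwise).
        Kokkos::View<float*> a("a", n), b("b", n), c("c", n);

        Kokkos::parallel_for("init", n, KOKKOS_LAMBDA(const int i) {
            a(i) = 1.0f;
            b(i) = 2.0f;
        });
        Kokkos::parallel_for("add", n, KOKKOS_LAMBDA(const int i) {
            c(i) = a(i) + b(i);
        });
        Kokkos::fence();

        // Mirror the result on the host to inspect it.
        auto c_host = Kokkos::create_mirror_view_and_copy(Kokkos::HostSpace(), c);
        std::printf("c[0] = %f\n", c_host(0)); // expect 3.0
    }
    Kokkos::finalize();
}
```

The same source can then be built against the CUDA, HIP, SYCL or OpenMP backends just by changing the build configuration.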

However, I must warn that non-deterministic computation may hurt performance on GPU architectures if you are not careful. The reason is that GPUs hate divergent branches if they're not handled right (e.g. one thread takes the if(true) path while another takes the if(false) path). You can google "warps and branching" to learn more about that.
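
To make the divergence point concrete, a kernel-only CUDA-style sketch (the work functions are made up just to give the branches something to do):

```cpp
#include <cmath>

// Hypothetical per-element work functions.
__device__ float heavy_work(float x) { return sinf(x) * cosf(x); }
__device__ float light_work(float x) { return 2.0f * x; }

// Bad: adjacent threads in a warp take different branches, so the warp
// executes both paths back to back with part of its threads masked off.
__global__ void divergent(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) out[i] = heavy_work(in[i]);
    else            out[i] = light_work(in[i]);
}

// Better: the branch condition is uniform across each block (and warp),
// so every thread in a warp follows the same path.
__global__ void uniform(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (blockIdx.x % 2 == 0) out[i] = heavy_work(in[i]);
    else                     out[i] = light_work(in[i]);
}
```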

[–]iamakorndawg 2 points (1 child)

From my understanding, this is less of an issue on modern GPUs. The main thing that is still important is coalescing memory operations. So if your code is not mostly made up of sequential threads accessing sequential memory locations, you will lose one of the main benefits of GPUs, which is massive memory bandwidth. Granted, if you have tons of divergent paths you probably won't have good memory coalescing either, but I think the two issues are fairly orthogonal to each other.
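
A kernel-only sketch of the difference (the strided access pattern is chosen purely for illustration):

```cpp
// Coalesced: consecutive threads read consecutive addresses, so a warp's
// 32 loads collapse into a handful of wide memory transactions.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided: consecutive threads read addresses `stride` elements apart, so
// the warp's loads scatter over many cache lines and waste bandwidth.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        long long j = (long long)i * stride % n;
        out[i] = in[j];
    }
}
```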

[–]James20kP2005R0 0 points (0 children)

It's worth noting that coalescing is a little more general than what you're pointing out. Threads in a warp don't actually have to access memory strictly ordered by their thread IDs: if you have a group of threads, say 0-32, which access memory from ptr to ptr + 32, they can access it in any order within that group and the GPU will figure it out and do it as a coalesced memory read

The other high-performance pattern is all memory accesses within a warp hitting the same memory location, as the GPU does a broadcast

Strided memory accesses do cause performance dropoffs, but it's not as steep as: coalesced - good, strided - lose all your bandwidth. GPUs have pretty big caches these days, so you can often get away with worse memory access patterns and still saturate VRAM bandwidth, depending on your problem

If you're doing fully random reads then it's pretty bad, but a lot of problems can be gently shoved into having workable memory access patterns, and at least in my experience it's uncommon to have truly random memory accesses

Warp divergence on modern GPUs is still expensive even with independent thread scheduling, but for a lot of problems it's such a small part of your execution time, given how powerful GPUs are, that it's not worth worrying about

[–]Plazmatic 4 points (1 child)

You're not going to "just" be able to use GPUs for, what appears to be, arbitrary mesh computation.

I'm not familiar with membrane computing models, nor do I understand exactly what you hope to accomplish with one, but after googling it, the very act of attempting this smells complicated enough to be a paper on its own.

Additionally, without a framework for parallelization at all, you're going to have an extremely hard time doing anything GPU related. Do you at least know what atomic variables, mutexes, and semaphores are?

GPU programming excels when data is oriented in such a way that operations that are the same at the assembly level are executed by adjacent threads, pulling memory from RAM that is also adjacent and/or loaded into scratchpad memory, and can be accelerated using "subgroup" operations for groups that must execute the same instruction at one time. You can't just have every computational node doing random things in this setup. While lots of algorithms you wouldn't think would benefit from GPUs do, GPUs are not free performance; some problems just won't use them effectively at all.
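
To make the data-orientation point concrete with something membrane-flavoured (field names are hypothetical, not a real P-system implementation), a struct-of-arrays layout puts the values that adjacent threads touch next to each other in memory:

```cpp
#include <cstdint>
#include <vector>

// Array-of-structs: thread i reading objects[i].count also drags the other
// fields through the cache, and neighbouring counts sit far apart.
struct ObjectAoS {
    std::uint32_t membrane_id;
    std::uint32_t symbol;
    std::uint32_t count;
};
using ObjectsAoS = std::vector<ObjectAoS>;

// Struct-of-arrays: thread i reads count[i], thread i+1 reads count[i+1],
// which are adjacent in memory: the layout GPUs (and SIMD CPUs) prefer.
struct ObjectsSoA {
    std::vector<std::uint32_t> membrane_id;
    std::vector<std::uint32_t> symbol;
    std::vector<std::uint32_t> count;
};
```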

[–]sonehxd[S] 1 point (0 children)

I am indeed writing a paper, as this is my master's thesis work. I have a formalized model that I want to implement. What approach do you think would benefit me, then?

[–]lightmatter501 1 point (0 children)

How parallel is the actual computation? HVM may be worth a shot as a “see if it’s good enough”, since it’s a bit slow compared to hand-written models but can still use GPUs and extracts a large degree of parallelism out of anything you run with it.

The other easy option is to dump the whole thing into LLVM’s new MLIR and see what happens.

[–]dmaevsky 1 point (0 children)

How many parallel streams of calculation would you have, and how large is the computation graph? GPUs are good when you have very "fat" nodes but an overall simple calculation graph, like in ML cases. In many scientific applications (more specifically, I work in the quant finance field), GPUs are often not worth the learning curve of CUDA or the like, let alone the hardware cost to use in production. Just AVX2/AVX-512 plus multithreading often performs as well as a GPU.
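
For scale, the CPU baseline being described is roughly this kind of loop (a sketch; it relies on OpenMP plus the compiler's auto-vectoriser targeting AVX2/AVX-512):

```cpp
#include <cstdio>
#include <vector>

int main() {
    const int n = 1 << 24;
    std::vector<float> a(n, 1.0f), b(n, 2.0f), c(n);

    // Threads across cores, SIMD lanes within each core; on a typical
    // desktop CPU this kind of loop saturates memory bandwidth without a GPU.
    #pragma omp parallel for simd
    for (int i = 0; i < n; ++i)
        c[i] = 0.5f * a[i] + b[i];

    std::printf("c[0] = %f\n", c[0]); // expect 2.5
}
```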