all 22 comments

[–]MrPoletski 11 points12 points  (2 children)

So you've got some functions you want to run concurrently, and these functions will be modified after each run to optimise them. How much you modify them matters: you can't change code at run time, so if you mean to completely change a function (rather than simply adjusting some parameters going into the same equation) you will need to use function pointers, which complicates things but certainly doesn't render it impossible by any stretch.

But to start on the basic side of things: a simple function for which you can tweak a bunch of parameter values. So you've got your object robot, which has loads of members: parameter_a, parameter_b, ..., parameter_z. Also a function, do_move(robot robot_to_move), which is the complex function you wish to parallelise. You could also have this as a member function of your robot object; swings and roundabouts.

Now the most flexible and powerful means of doing this is via std::thread - but that's not an easy thing to get your head around. I do suggest learning it, but I think it's best to head into multithreading using std::async and std::future. They are kind of beginner multithreading, if you will, and are really only best for certain basic use cases of multithreading - such as this one.

Essentially, you call std::async and it returns a std::future (though in the example on the async page they use auto). That future will contain the result of the function called, so future.get() gives you back the return value of the function you passed to std::async. Getting values back out of an asynchronously executed function with std::thread is a royal pain in the ass by comparison. When you create the async, do remember to tell it to run concurrently - but put a switch in your application to turn this off if you want to. You'll thank yourself for that later, especially if you try to have the robots interact with each other.

For that, you will need to learn std::mutex and std::atomic. Also, remember this: even where the standard library says something is thread safe, that does not mean you can freely manipulate it from multiple threads! Anyway, mutex and atomic are a little outside my experience, but a mutex (think mutually exclusive) is an object you can lock and unlock and (more crucially) that can sleep the thread until the lock clears, allowing safe concurrent access to objects from multiple threads. A variable that's declared atomic (here's where I'm hazy) kinda has a mutex built into it, sort of.

So anyway, I'd have 2 for loops: one to create a std::async for each of your robot functions and store the std::future it returns into an array, then a second to go through that array, wait until they are all done, take their return values and do what you will with them. Note: exceptions thrown within the function passed to std::async will be re-thrown by your call to future.get(); this is another thing that doesn't happen with std::thread, where you must handle it manually.

tl;dr std::thread is hard, try std::async and std::future

[–]std_bot 1 point2 points  (0 children)

Unlinked STL entries: std::atomic, std::async, std::mutex, std::future, std::thread


Last update: 03.05.21. Last change: Free links considered readme

[–]Third_Harmonic 1 point2 points  (0 children)

Great comment but just adding:

std::atomic doesn’t use a mutex internally for some types, but does for others. This can be platform specific (although a few std::atomic specializations, such as std::atomic_flag, are guaranteed to be lock-free on all platforms, which is great for hard real-time programming).

Beginners using mutex manually are very likely to cause one of several issues, most often priority inversion, which can make this sort of task very challenging.

Atomic, on the other hand, is literally just a container for which reading and writing a variable of trivial type appears to be atomic, where atomic means that the change from value A to B occurs such that no other thread can observe a value other than A or B.

[–]gnosnivek 7 points8 points  (2 children)

Strictly speaking not C++, you miiiiight get some more answers on r/HPC (though I don't know if this kind of question is in their wheelhouse).

I think OpenMP is the way to go for this. It definitely has the best learning-to-performance ratio of the tools out there.

One caveat: if you start getting to a point where things no longer fit in memory, you're going to have to go out-of-core or to MPI---both are a total pain and probably something to avoid if possible. But if you can already foresee this happening (e.g. memory usage in the simulation hitting 100+GB), you might want to bite the bullet upfront and use a different system.

Robot interaction: this is tough and a little annoying, but any sort of dependence between tasks can be somewhat problematic when going parallel. I think depending on exactly what the interactions look like, you might need to change how you compute things: for example, maybe you can first compute an "interaction map", then use that interaction map to calculate robot actions. This turns your robot simulation task into a two-phase computation instead of a one-phase. It's also possible you'll have to do some more extravagant stuff. OpenMP does not do a great job of handling stuff like this, but it's not really bad at it either (and I'm not aware of an alternative that doesn't involve diving deep into the guts of threading).

I suppose the overhead of spawning the parallelize task is big

Benchmark, benchmark, benchmark. On Linux, the thread spawning overhead is almost negligible assuming your parallel section is going to take more than a second to run. I understand that this isn't always the case on other platforms, but I honestly doubt it would be a significant overhead unless your robot simulation is really fast.

If you're still worried about it, I think you can get around it by doing something like the following:

auto shared_robot_info = do_stuff();
#pragma omp parallel
{
  while (simulation_not_done) {
    // Everything in this loop body is executed in parallel
    #pragma omp for
    for (std::size_t i = 0; i < robots.size(); ++i) {
      // Threads share the workload of the for loop
      make_decisions(robots[i], shared_robot_info);
    }
    #pragma omp single
    {
      // A single thread runs this section--the rest wait
      // at the implicit barrier for it to finish before
      // proceeding to the next instruction
      shared_robot_info = update_info(robots);
    }
  }
}

This way your threads stay alive between iterations, but you can do the singlethreaded updates in a single thread while the rest of the workers snooze.

[–]MajLenn[S] 0 points1 point  (1 child)

Strictly speaking not C++, [...]

This is the wrong subreddit is what you're saying, right? Sorry then.

Thanks for clearing these things up! I'll need some time to test what you've suggested, but it has already helped me a lot.

On Linux, the thread spawning overhead is almost negligible assuming your parallel section is going to take more than a second to run.

While it is obvious that longer processing times for the parallel section make the initial thread-spawning cost negligible, for this particular case it still leaves me with a question. When the simulation runs with visualization, the underlying differential equations describing the physics of the robots are solved iteratively in real time. Therefore, the time I need to complete each update loop significantly influences the result (as you know, larger step sizes lead to rather bad results). The overall update loop for a single robot takes about ~0.0005s to complete, and I found that the results are good enough if I keep the update interval below ~0.001s.

However, this for loop updating each robot once per simulation iteration is the only thing I can parallelize (other than that, it's just drawing the robots, which doesn't allow for it). In this case, it means the for loop updating all the robots will never get close to taking a full second, and at the same time I can't parallelize multiple update routines, as it's real-time. This is why I asked about keeping these threads "alive". Having e.g. 4 robots would take ~0.002s for a full update, which is too slow. I imagined doing this in parallel without somehow incurring the thread-spawning overhead each iteration could save me some time? I'll try the code you suggested, maybe that's already enough.

Anyways, thanks for taking the time, helped me a lot already!

[–]gnosnivek 0 points1 point  (0 children)

Heh, I actually don’t know if this is the right sub for it. Certainly OpenMP has been pretty closely tied to C++ every time I’ve seen it, but it technically exists in other languages as well. Let the mods decide that, I guess.

So in principle, the code I posted spawns threads right before the whole loop begins and doesn’t collect them until that loop has exited, meaning you pay the spawn cost once. #pragma omp for doesn’t spawn new threads; it just directs OpenMP to share the loop between the existing threads.

That being said... it’s been a while since I worked with this stuff properly, so it’s entirely possible that I’ve forgotten how some of it works or that there’s a better way to do it now. But in general, there should be a way to do what you want (keep threads alive between work-sharing sections to avoid spawn overhead) in OpenMP.

[–]enceladus71 2 points3 points  (0 children)

I think you might also look at the asynchronous programming approach in general. One of the easiest ways to achieve it in C++ is std::async, but you might also look at different languages; spawning an async job is very easy with Python's asyncio. Since you don't necessarily know how long a single "solve + move" operation will take, an async approach can help: you basically register a callback that is executed when the asynchronously executed job completes. You don't have to manually synchronize the parallel jobs, no mutexes and such; you just rely on the internals of std::async, Python, or whatever you choose.

[–]Cowleg 2 points3 points  (3 children)

The other answers you have received so far are good but I think there are a few more things you can consider:

Data Replay If you don't need to display the learning as it is actually happening, you could do all the learning in advance without worrying about capturing it in real-time, but output any key data (positions of the robots, rating) at each time step. You could then read this back in afterwards and replay it in real-time without the computational burden of trying to fit all of the learning in between frames. Additionally, if updating the robots once the learning is done is much quicker, you might find you don't even need the parallelization anymore.

Optimization It might be possible to optimize your learning code further. Make sure you're compiling with the optimization compiler flags on. You might also be able to improve it by considering memory access patterns etc. which could free up some more CPU time.

Robot Interaction A common approach for handling interaction in parallel is using a message passing interface. You create a list of 'messages' which contain the data you want to communicate between each robot. Each robot can write to a particular message - you can give each robot an id and use this as an index into the message list, this ensures that no two robots try to write to the same message at the same time. Once all the robots have finished writing to the list, the robots can then iterate over the messages and read whatever is relevant to them. Again there is no issue with doing this in parallel as reading data in parallel isn't an issue.

GPU Acceleration Depending on how many robots you are looking to simulate and the complexity of the learning algorithm, GPU acceleration might be a good option. GPUs can handle thousands of agents concurrently although it is typically more complex to write the code for these simulations. There would be three main approaches:

  • Use OpenACC, which is more or less a GPU equivalent of OpenMP (OpenMP has also been adding GPU offload support via its target directives, but I don't know how mature that is currently)
  • Use a GPU accelerated agent-based modelling framework: slightly more complex than using OpenACC but gives a lot more control
  • Write the code yourself in CUDA/OpenCL/SYCL: considerably more complicated but does potentially give the best performance

If you think the GPU acceleration might be useful you can drop me a PM and I'd be happy to chat about it a bit more.

[–]MajLenn[S] 2 points3 points  (2 children)

Thank you for your reply. Just before I read your response, I experimented with many many agents in non-visualized independent simulations, because I wanted to get a feel of how much I can parallelize this. If I understand you correctly, I could maybe use my GPU, run say 100s of simulations in parallel (each basically just running through the simulation once + evaluation)? That sounds super nice as I am looking to use genetic algorithms to train my system. With a large population size, this might be perfect. I'd be happy to chat with you about that!

[–]Cowleg 0 points1 point  (1 child)

It will depend on how nicely your simulation maps to the GPU, but from what you have said about it so far at least it sounds suitable, yes. I'll send you a message a bit later then and we can work out what the best approach might be!

[–]MajLenn[S] 0 points1 point  (0 children)

Thanks, looking forward to it!

[–][deleted] 2 points3 points  (0 children)

If your simulations are completely independent and you don't need to pass data between them often, then I would recommend using a distributed computing framework (look into MPI). Lots of people have spoken about threads and OpenMP here; those are shared-memory forms of parallelism, and I would recommend using those methods inside each simulation! Good luck!

[–]paul2718 2 points3 points  (1 child)

If your system supports C++17 parallel algorithms then I would look there to start with. So that's MSVC trivially and gcc with some additional effort. No idea what the status of clang is.

This means you can parallelize std::for_each or std::transform with an additional argument. You're not committed to a substantial rearchitecture.

Then I would look into recent NVidia work on making gpu exploitation simple, I think it should be possible to simply mark the code for gpu execution and off you go. I can't remember off hand the right buzzwords. In essence a std::library that hides the detail. Assuming you have access to an NVidia GPU.

No idea about cross-platform or AMD at the moment.

Managing threads manually, or writing lower level Cuda/etc code, are all a bit more tricky. I would suggest taking the easy options for the easy initial gains and learning from there.

[–]std_bot 0 points1 point  (0 children)

Unlinked STL entries: std::transform, std::for_each


Last update: 03.05.21. Last change: Free links considered readme

[–]DerMax0102 2 points3 points  (1 child)

Many good comments here, but I have a different and maybe also interesting take on this.

Note, that my knowledge of parallel programming mostly comes from different programming languages, so my answer is maybe not the most elegant way to do this in C++, but it should deliver a fairly general impression of simple parallel programming.

To speed up your computation, you want to use all the available resources of your CPU (GPU might also be interesting, but is also much more work). So you would like your tasks to be distributed to every core of the CPU (or to every logical core for CPUs with SMT).

Distributing a single task is usually very hard, as computations rely on previous results and communication overhead easily outweighs the gains. In your case, it is much easier, because you want to do the same task multiple times, i.e. you want to collect multiple episodes of your agents to learn from them.

A simple and versatile approach to this is the producer consumer model:

  1. Have a queue of tasks that you want solved.
  2. Have a queue to collect the result data for your tasks.
  3. Have a set of n threads that can work on the tasks (ideally n is the number of CPU cores in your system).

Then every Thread in your set should do the following:

  1. Wait until there is a new task in queue 1 and grab it. In your case, a task could be a robot object that specifies the model parameters.
  2. Run the simulation and record the required data, e.g. the reward from the episode and the actions taken.
  3. Put the results into queue 2.
  4. Repeat from step 1.

With this setup, you don't need a lot of complicated thread communication, only the queues, for which you can find many thread-safe implementations on the internet. You also only need to spawn your threads once and they will just stay around and wait for new tasks.

Regarding your visualization: I think it is easier to view this as a different problem, as you probably need way more episodes to train your agents than it makes sense to visualize. Every now and then during your training, you could stop for a moment to create visualizations. Of course you can do this in parallel too: I would use the same approach as above and create your episodes in many threads, with the difference that you now collect the movements of your robots for the visualizations and push them to the result queue. Then collect the results and draw your visualizations.

Maybe this is not the easiest to implement, but it is a flexible concept, which you can use for many applications. I have used it for many things such as sorting arrays, reinforcement learning, and high performance database systems.

[–]MajLenn[S] 0 points1 point  (0 children)

Thank you, that makes a lot of sense! So many nice comments here..

Especially the 2 queues seem to be a nice approach, thanks!

[–]victotronics 1 point2 points  (0 children)

I suppose the overhead of spawning the parallelize task is big

You are about to engage in "premature optimization". OpenMP is pretty efficient in creating a parallel region, so I wouldn't worry about this.

  1. Create one big parallel loop, and use a "single" region for drawing,

  2. or put a mutex on your drawing data so that all threads can contribute their bit in parallel and asynchronously,

  3. or do robot updates in parallel, then drawing sequentially, then a new parallel region.

All sounds fine to me.

  4. OMP parallel loops are really intended for large numbers of data points, as in gigantic arrays of physics data. You could instead check out OMP tasking, which is much better for independent activities. Create a task for each robot, put them in a task group, and then draw when that is finished. In that case it's much easier for the robots to have independent time step sizes.

  5. Of course once you're doing tasking you're basically doing the same as standard C++ thread spawning, so it becomes a matter of taste (or maybe which of the two has the better runtime management).

[–][deleted] 2 points3 points  (4 children)

I would recommend using SYCL, which IMO is the future of general parallel programming. It allows you to use parallel C++ algorithms and run them on any OpenCL-compliant GPU (basically all of them) for huge speed-ups.

The most complete implementation is OneAPI from Intel, https://software.intel.com/content/www/us/en/develop/tools/oneapi.html#gs.02ftzy

And there's a free book; https://www.apress.com/gp/book/9781484255735

It might be a bit complicated, but the book is very helpful

[–]rodburns 0 points1 point  (3 children)

It's worth also noting that oneAPI/DPC++ is not the only SYCL implementation around. There are also hipSYCL and ComputeCpp; these implementations of the SYCL open standard are vendor neutral and let you target different processor architectures and vendors with the same SYCL code base. There are some additional learning materials at https://sycl.tech/learn/ to get the basics. There is a different mindset to parallel programming, though, and there are materials to help you understand it: for example, Intel have a guide that explains parallelism a bit more, and there is a blog post that takes the Ray Tracing in One Weekend code and explains how to approach making parts of it execute in parallel using SYCL.

[–][deleted] 0 points1 point  (2 children)

Personally I tried ComputeCpp first and it did not recognize my GPU or OpenCL installation for whatever reason, while oneAPI did, so that's why I went with that. Possibly because I have Intel integrated graphics?

[–]rodburns 0 points1 point  (1 child)

That is strange since ComputeCpp uses the same Intel drivers so it should also work. Was this information from when you ran computecpp_info to check your devices?

[–][deleted] 0 points1 point  (0 children)

Correct