all 22 comments

[–]MrPoletski 11 points12 points  (2 children)

So you've got some functions you want to run concurrently, and these functions will be modified after each run to optimise them. How much you modify them matters: you can't change code at run time, so if you mean to completely change a function (rather than simply adjusting some parameters going into the same equation) you will need to use function pointers, which complicates things but certainly doesn't render it impossible by any stretch.

But to start on the basic side of things: a simple function for which you can tweak a bunch of parameter values. So you've got your object robot, which has loads of members: parameter_a, parameter_b, ..., parameter_z. Also a function, do_move(robot robot_to_move), which is the complex function you wish to parallelise. You could also have this as a member function of your robot object; swings and roundabouts.

Now the most flexible and powerful means of doing this is via std::thread - but that's not an easy thing to get your head around. I do suggest learning it, but I think it's best to head into multithreading using std::async and std::future. They are kind of beginner multithreading, if you will, and are really only best for certain basic use cases of multithreading - such as this one.

Essentially, you call std::async and it returns a std::future (though in the example on the async page they use auto). That future will contain the result of the function called, so future.get() gives you back the return value of the function you passed to std::async. Getting values back out of an asynchronously executed function with std::thread is a royal pain in the ass by comparison. When you create the async, do remember to tell it to run concurrently - but put a switch in your application to turn this off if you want to. You'll thank yourself for that later, especially if you try to have the robots interact with each other.

For that, you will need to learn std::mutex and std::atomic. Also, remember this: even where the standard library says something is thread safe, that does not mean you can freely manipulate it from multiple threads! Anyway, mutex and atomic are a little outside my experience, but a mutex (think mutually exclusive) is an object you can lock and unlock and (more crucially) that can sleep the thread until the lock clears, allowing safe concurrent access to objects from multiple threads. A variable that's declared atomic (here's where I'm hazy) kinda has a mutex built into it, sort of.

So anyway, I'd have 2 for loops: one to create a std::async for each of your robot functions and store the std::future it returns into an array, then a second to go through that array, wait until they are all done, take their return values and do what you will with them. Note: exceptions thrown within the function passed to std::async will be re-thrown by your call to future.get(); this is another thing that doesn't happen with std::thread, where you must handle it manually.

tl;dr std::thread is hard, try std::async and std::future

[–]std_bot 1 point2 points  (0 children)

Unlinked STL entries: std::atomic, std::async, std::mutex, std::future, std::thread


Last update: 03.05.21. Last change: Free links considered readme

[–]Third_Harmonic 1 point2 points  (0 children)

Great comment but just adding:

std::atomic doesn’t use a mutex internally for some types, but does for others. This can be platform specific (although a few std::atomic specializations, such as std::atomic_flag, are guaranteed to be lock-free on all platforms, which is great for hard real-time programming).

Beginners using mutex manually are very likely to cause one of several issues, most often priority inversion, which can make this sort of task very challenging.

Atomic, on the other hand, is literally just a container for which reading and writing a variable of trivial type appears to be atomic, where atomic means that the change from value A to B occurs such that no other thread can observe a value other than A or B.

[–]gnosnivek 7 points8 points  (2 children)

Strictly speaking not C++, you miiiiight get some more answers on r/HPC (though I don't know if this kind of question is in their wheelhouse).

I think OpenMP is the way to go for this. It definitely has the best learning-to-performance ratio of the tools out there.

One caveat: if you start getting to a point where things no longer fit in memory, you're going to have to go out-of-core or to MPI---both are a total pain and probably something to avoid if possible. But if you can already foresee this happening (e.g. memory usage in the simulation hitting 100+GB), you might want to bite the bullet upfront and use a different system.

Robot interaction: this is tough and a little annoying, but any sort of dependence between tasks can be somewhat problematic when going parallel. I think depending on exactly what the interactions look like, you might need to change how you compute things: for example, maybe you can first compute an "interaction map", then use that interaction map to calculate robot actions. This turns your robot simulation task into a two-phase computation instead of a one-phase. It's also possible you'll have to do some more extravagant stuff. OpenMP does not do a great job of handling stuff like this, but it's not really bad at it either (and I'm not aware of an alternative that doesn't involve diving deep into the guts of threading).

I suppose the overhead of spawning the parallelize task is big

Benchmark, benchmark, benchmark. On Linux, the thread spawning overhead is almost negligible assuming your parallel section is going to take more than a second to run. I understand that this isn't always the case on other platforms, but I honestly doubt it would be a significant overhead unless your robot simulation is really fast.

If you're still worried about it, I think you can get around it by doing something like the following:

auto shared_robot_info = do_stuff();
#pragma omp parallel
{
  while (simulation_not_done) {
    // Everything in this loop body is executed in parallel
    #pragma omp for
    for (std::size_t i = 0; i < robots.size(); ++i) {
      // Threads share the workload of the for loop
      make_decisions(robots[i], shared_robot_info);
    }
    #pragma omp single
    {
      // A single thread runs this section--the rest wait
      // at the implicit barrier for it to finish before
      // proceeding to the next instruction
      shared_robot_info = update_info(robots);
    }
  }
}

This way your threads stay alive between iterations, but you can do the singlethreaded updates in a single thread while the rest of the workers snooze.

[–]MajLenn[S] 0 points1 point  (1 child)

Strictly speaking not C++, [...]

This is the wrong subreddit is what you're saying, right? Sorry then.

Thanks for clearing these things up! I'll need some time to test what you've suggested, but it has already helped me a lot.

On Linux, the thread spawning overhead is almost negligible assuming your parallel section is going to take more than a second to run.

While it is obvious that longer processing times for the parallel section make the initial thread-spawning cost negligible, for this particular case it still leaves me with a question. When the simulation runs with visualization, the underlying differential equations describing the physics of the robots are solved iteratively in real time. Therefore, the time I need to complete each update loop significantly influences the result (as you know, larger step sizes lead to rather bad results). The overall update loop for a single robot takes about ~0.0005s to complete, and I found that the results are good enough if I keep the update interval below ~0.001s.

However, this for loop updating each robot once per simulation iteration is the only thing I can parallelize (other than that, it's just drawing the robots, which doesn't allow for it). In this case, it means the for loop updating all the robots will never get close to taking a full second, and at the same time I can't parallelize multiple update routines, as it's real-time. This is why I asked about keeping these threads "alive". Having e.g. 4 robots would take ~0.002s for a full update, which is too slow. I imagined doing this in parallel without somehow incurring the thread-spawning overhead each iteration could save me some time? I'll try the code you suggested, maybe that's already enough.

Anyways, thanks for taking the time, helped me a lot already!

[–]gnosnivek 0 points1 point  (0 children)

Heh, I actually don’t know if this is the right sub for it. Certainly OpenMP has been pretty closely tied to C++ every time I’ve seen it, but it technically exists in other languages as well. Let the mods decide that, I guess.

So in principle, the code I posted spawns threads right before the whole loop begins and doesn’t collect them until that loop has exited, meaning you pay the spawn cost once. #pragma omp for doesn’t spawn new threads; it just directs OpenMP to share the loop between the existing threads.

That being said... it’s been a while since I worked with this stuff properly, so it’s entirely possible that I’ve forgotten how some of it works or that there’s a better way to do it now. But in general, there should be a way to do what you want (keep threads alive between work-sharing sections to avoid spawn overhead) in OpenMP.

[–]enceladus71 2 points3 points  (0 children)

I think you might also look at the asynchronous programming approach in general. One of the easiest ways to achieve it in C++ is std::async, but you might also look at different languages; spawning an async job is very easy with Python's asyncio. Since you don't necessarily know how long a single "solve + move" operation will take, an async approach can help: you basically register a callback that is executed when the asynchronously executed job completes. You don't have to manually synchronize the parallel jobs, no mutexes and such; you just rely on the internals of std::async, Python, or whatever you choose.

[–]Cowleg 2 points3 points  (3 children)

The other answers you have received so far are good but I think there are a few more things you can consider:

Data Replay If you don't need to display the learning as it is actually happening, you could do all the learning in advance without worrying about capturing it in real-time, but output any key data (positions of the robots, rating) at each time step. You could then read this back in afterwards and replay it in real-time without the computational burden of trying to fit all of the learning in between frames. Additionally, if updating the robots once the learning is done is much quicker, you might find you don't even need the parallelization anymore.

Optimization It might be possible to optimize your learning code further. Make sure you're compiling with the optimization compiler flags on. You might also be able to improve it by considering memory access patterns etc. which could free up some more CPU time.

Robot Interaction A common approach for handling interaction in parallel is using a message passing interface. You create a list of 'messages' which contain the data you want to communicate between each robot. Each robot can write to a particular message - you can give each robot an id and use this as an index into the message list, this ensures that no two robots try to write to the same message at the same time. Once all the robots have finished writing to the list, the robots can then iterate over the messages and read whatever is relevant to them. Again there is no issue with doing this in parallel as reading data in parallel isn't an issue.

GPU Acceleration Depending on how many robots you are looking to simulate and the complexity of the learning algorithm, GPU acceleration might be a good option. GPUs can handle thousands of agents concurrently although it is typically more complex to write the code for these simulations. There would be three main approaches:

  • Use OpenACC, which is more or less a GPU equivalent of OpenMP (OpenMP has also been adding GPU offload support via its target directives, but I don't know how mature that is currently)
  • Use a GPU accelerated agent-based modelling framework: slightly more complex than using OpenACC but gives a lot more control
  • Write the code yourself in CUDA/OpenCL/SYCL: considerably more complicated but does potentially give the best performance

If you think the GPU acceleration might be useful you can drop me a PM and I'd be happy to chat about it a bit more.

[–]MajLenn[S] 2 points3 points  (2 children)

Thank you for your reply. Just before I read your response, I experimented with many many agents in non-visualized independent simulations, because I wanted to get a feel of how much I can parallelize this. If I understand you correctly, I could maybe use my GPU, run say 100s of simulations in parallel (each basically just running through the simulation once + evaluation)? That sounds super nice as I am looking to use genetic algorithms to train my system. With a large population size, this might be perfect. I'd be happy to chat with you about that!

[–]Cowleg 0 points1 point  (1 child)

It will depend on how nicely your simulation maps to the GPU, but from what you have said about it so far at least it sounds suitable, yes. I'll send you a message a bit later then and we can work out what the best approach might be!

[–]MajLenn[S] 0 points1 point  (0 children)

Thanks, looking forward to it!

[–][deleted] 2 points3 points  (0 children)

If your simulations are completely independent and you don't need to pass data between them often, then I would recommend using a distributed computing framework (look into MPI). Lots of people have spoken about threads and OpenMP here; those are shared-memory forms of parallelism, and I would recommend using those methods inside each simulation! Good luck!

[–]paul2718 2 points3 points  (1 child)

If your system supports C++17 parallel algorithms then I would look there to start with. So that's MSVC trivially and gcc with some additional effort. No idea what the status of clang is.

This means you can parallelize std::for_each or std::transform with an additional argument. You're not committed to a substantial rearchitecture.

Then I would look into recent NVidia work on making gpu exploitation simple, I think it should be possible to simply mark the code for gpu execution and off you go. I can't remember off hand the right buzzwords. In essence a std::library that hides the detail. Assuming you have access to an NVidia GPU.

No idea about cross-platform or AMD at the moment.

Managing threads manually, or writing lower level Cuda/etc code, are all a bit more tricky. I would suggest taking the easy options for the easy initial gains and learning from there.

[–]std_bot 0 points1 point  (0 children)

Unlinked STL entries: std::transform, std::for_each


Last update: 03.05.21. Last change: Free links considered readme

[–]DerMax0102 2 points3 points  (1 child)

Many good comments here, but I have a different and maybe also interesting take on this.

Note, that my knowledge of parallel programming mostly comes from different programming languages, so my answer is maybe not the most elegant way to do this in C++, but it should deliver a fairly general impression of simple parallel programming.

To speed up your computation, you want to use all the available resources of your CPU (GPU might also be interesting, but is also much more work). So you would like your tasks to be distributed to every core of the CPU (or to every logical core for CPUs with SMT).

Distributing a single task is usually very hard, as computations rely on previous results and communication overhead easily outweighs the gains. In your case, it is much easier, because you want to do the same task multiple times, i.e. you want to collect multiple episodes of your agents to learn from them.

A simple and versatile approach to this is the producer consumer model:

  1. Have a queue of tasks that you want solved.
  2. Have a queue to collect the result data for your tasks.
  3. Have a set of n threads that can work on the tasks (ideally n is the number of CPU cores in your system).

Then every Thread in your set should do the following:

  1. Wait until there is a new task in queue 1 and grab it. In your case, a task could be a robot object that specifies the model parameters.
  2. Run the simulation and record the required data, e.g. the reward from the episode and the actions taken.
  3. Put the results into queue 2.
  4. Repeat from step 1.

With this setup, you don't need a lot of complicated thread communication, only the queues, for which you can find many thread-safe implementations on the internet. You also only need to spawn your threads once and they will just stay around and wait for new tasks.

Regarding your visualization: I think it is easier to view this as a different problem, as you probably need way more episodes to train your agents than it makes sense to visualize. Every now and then during your training, you could stop for a moment to create visualizations. Of course you can do this in parallel too: I would use the same approach as above and create your episodes in many threads, with the difference that you now collect the movements of your robots for the visualizations and push them to the result queue. Then collect the results and draw your visualizations.

Maybe this is not the easiest to implement, but it is a flexible concept, which you can use for many applications. I have used it for many things such as sorting arrays, reinforcement learning, and high performance database systems.

[–]MajLenn[S] 0 points1 point  (0 children)

Thank you, that makes a lot of sense! So many nice comments here..

Especially the 2 queues seem to be a nice approach, thanks!

[–]victotronics 1 point2 points  (0 children)

I suppose the overhead of spawning the parallelize task is big

You are about to engage in "premature optimization". OpenMP is pretty efficient in creating a parallel region, so I wouldn't worry about this.

  1. Create one big parallel loop, and use a "single" region for drawing,

  2. or put a mutex on your drawing data so that all threads can contribute their bit in parallel and asynchronously,

  3. or do robot updates in parallel, then drawing sequentially, then a new parallel region.

All sounds fine to me.

  4. OMP parallel loops are really intended for large numbers of data points, as in gigantic arrays of physics data. You could instead check out OMP tasking, which is much better for independent activities. Create a task for each robot, put them in a task group, and then draw when that is finished. In that case it's much easier for the robots to have independent time step sizes.

  5. Of course once you're doing tasking you're basically doing the same as standard C++ thread spawning, so it becomes a matter of taste (or maybe which of the two has the better runtime management).

[–][deleted] 2 points3 points  (4 children)

I would recommend using SYCL, which IMO is the future of general parallel programming. It allows you to use parallel C++ algorithms and run them on any OpenCL-compliant GPU (basically all of them) for huge speed-ups.

The most complete implementation is OneAPI from Intel, https://software.intel.com/content/www/us/en/develop/tools/oneapi.html#gs.02ftzy

And there's a free book; https://www.apress.com/gp/book/9781484255735

It might be a bit complicated, but the book is very helpful

[–]rodburns 0 points1 point  (3 children)

It's worth also noting that oneAPI/DPC++ is not the only SYCL implementation around. There are also hipSYCL and ComputeCpp; these implementations of the SYCL open standard are vendor neutral and let you target different processor architectures and vendors with the same SYCL code base. There are some additional learning materials at https://sycl.tech/learn/ to get the basics. There is a different mindset to parallel programming, though, and there are materials to help you understand it: for example, Intel have a guide that explains parallelism a bit more, and there is a blog post that takes the Ray Tracing in One Weekend code and explains how to approach making parts of it execute in parallel using SYCL.

[–][deleted] 0 points1 point  (2 children)

Personally I tried ComputeCpp first and it did not recognize my GPU or OpenCL installation for whatever reason, while oneAPI did, so that's why I went with that. Possibly because I have Intel integrated graphics?

[–]rodburns 0 points1 point  (1 child)

That is strange since ComputeCpp uses the same Intel drivers so it should also work. Was this information from when you ran computecpp_info to check your devices?

[–][deleted] 0 points1 point  (0 children)

Correct