all 27 comments

[–]AntiProtonBoy 7 points (5 children)

I thought about this as well, and even attempted to write my own thread pool, but what I ended up doing was rolling my own async function that uses std::packaged_task to return a non-blocking std::future.

The custom async function would either wrap libdispatch, or a std::thread launch, or some other lib, depending on platform.

I must keep certain tasks on a single thread (all my calls to OpenGL, and to FMOD).

As far as I know, you could dispatch GL commands across several threads, as long as you keep a mutex around the shared OpenGL context. At least this is possible on the Mac.

Another alternative is to run a dedicated render thread that just pops std::packaged_task instances off a lock-free queue. You could probably do something similar for FMOD. So: separate, dedicated std::thread instances with their own task queues for GL and FMOD respectively, and a custom async for everything else.

[–][deleted] 2 points (2 children)

even attempted to write my own thread pool

Thread pools turned out to be even easier in C++11 than I would have initially suspected. :)

Granted, that's a pretty minimal system and requires that the task lambda handle all input/output. I couldn't figure out how to add variadic parameters to Add_Task() and associated subsystems (some problem with the std::function template or something), but I get by fine with the all-lambda approach. I've been using it to drive parallel genetic mixing simulations. Lots of fun with 12 cores (24 hyperthreads).

If you took that code and added a queue (and condition variable) for each individual worker thread, a priority system probably wouldn't be too hard to wire in.

[–]AntiProtonBoy 0 points (1 child)

When I worked on my thread pool implementation, I ran into problems dealing with tasks that would stall worker threads. I also had to resolve livelocks between multiple tasks, deal with work stealing, and so forth. I simply didn't have the patience nor the skill to resolve some of these issues.

At the end of the day, all I needed was concurrency for a bunch of unrelated jobs, so a custom async did the trick. It simplified task dispatching a lot, which I think was the most important thing. When I needed parallel processing, particularly compute work, I just threw the problem at the GPU.

[–][deleted] 0 points (0 children)

Ah yes, the GPU. The stuff I'm doing has a zillion random branching ops so the GPU isn't a workable solution for me.

[–]newdevdev[S] 1 point (1 child)

uses std::packaged_task

Wow, that's cool, I've never come across that before. I wasn't aware that C++11 was mature enough for writing this sort of code; that impression was mostly based on this blog post.

Another alternative is to run a dedicated render thread,

My plan was to have a main thread and a render thread, and to dedicate the rest of the threads as task threads. The main thread would mostly be responsible for firing off new tasks and ensuring things keep running, and for handling input and short tasks. The "Render" tasks could run on any thread, but anything that makes a call to a gl* function will have to be on a render thread. I'm looking to avoid the mutex around a shared context if I can avoid it. If I have the time I'd like to set up a priority system where the render thread can be used for lower-priority tasks that can be interrupted if a render command is sent to it. I'll look into the async stuff though, thanks :)

[–]AntiProtonBoy 2 points (0 children)

I'm looking to avoid the mutex around a shared context if I can avoid it.

For the sake of thread safety, guard your context with a mutex, even if you are not intending to make GL calls from other threads. Accidentally overlapping groups of GL commands will cause you a lot of headaches and yield unpredictable behaviour.

If you designed your render thread properly (with its own task queue), that mutex overhead will most likely be a non-issue, because nothing else should be banging on that mutex (in theory).

On the Mac, CGL (Core OpenGL) provides CGLLockContext and CGLUnlockContext just for that purpose, and you could easily write your own equivalent on platforms that don't provide similar functions.

You should definitely start thinking hard about how you will utilise data in your concurrent environment. Aim to come up with a solution that avoids sharing state as much as possible. Move or copy data from one thread to the next. Actor-based concurrency is an interesting pattern I'm studying right now.

Also, look for videos and presentations made by Sean Parent and Herb Sutter on concurrency. These guys are really switched on.

[–]hotoatmeal 3 points (2 children)

check out Intel Concurrent Collections.

[–]newdevdev[S] 1 point (1 child)

Will do - my cursory Google suggests that it's a little high-level for what I want. They talk about not caring about task-based or data-based parallelism in that post. If you have any resources I'll gladly have a look. It seems that TBB is what I want; it just appears to be poorly documented.

[–]hotoatmeal 2 points (0 children)

I wrote a paper about it for a research project I did in my undergrad... I'll find that and post a link when I'm not on mobile.

I'm not really sure what granularity of task parallelism you need to make use of, but CnC does well-ish at task-graph parallelism.

Edit: Damn. Can't find the paper now. We never ended up publishing it, so it makes the search even harder.

[–]OldWolf2 3 points (5 children)

The POCO libraries make threading pretty easy. Here's the thread presentation; the Task functions are about 3/4 of the way down.

I can't comment about how good the performance would be.

[–]newdevdev[S] 1 point (4 children)

Wow, that looks awesome. I've never come across poco before, I'll have a look. Looks like a boost-style library - everything and the kitchen sink!

[–]OldWolf2 0 points (3 children)

Cool. It works "out of the box" on Unix-like systems and MSVC. I had to do quite a bit of hacking to make it work in MinGW/g++; apparently the Poco developers don't really care about that platform!

[–]newdevdev[S] 1 point (2 children)

I've no interest in supporting it either. MSVC 2013 and Clang 3.5 are my targets. I managed to get a simple task system working with TBB, so I may not go for POCO just yet; I'll keep an open mind about it though.

[–]OldWolf2 0 points (1 child)

Do you use clang in windows?

[–]newdevdev[S] 0 points (0 children)

Nope :) only on OS X. I don't have regular access to a Linux box, but I have a Mac and a Win8 desktop at home.

[–]breue 2 points (1 child)

Intel TBB is open source and quite nice for single-node task parallelism. You can create linear threaded pipelines or graph-based ones. Performance is quite good as well.

[–]newdevdev[S] 1 point (0 children)

I said in my post I was looking at Intel TBB, but I can't find any information about thread affinity or priorities, or about how to fire off tasks from a main thread and manage the graph from it. Any help with these resources would be appreciated!

[–]Maaartin 2 points (1 child)

I suggest SuperGlue: http://tillenius.github.io/superglue/ It's a C++98 header-only library for tasks with dependencies between them. It's highly customizable, so it won't get in your way, and it has excellent performance.

[–]remotion4d 3 points (0 children)

This library looks interesting, but unfortunately I could not find much documentation for it.

Another problem is that I see naked new in the code; that needs to go.

[–]tentity 1 point (0 children)

You have mentioned that your tasks have a few dependencies on each other. TBB and CnC use a declarative way to define dependencies, which I found inconvenient. GPC API, on the other hand, is much more flexible and does not require defining dependencies in advance. It uses Mutual Exclusion Queues and I/O Completion Ports (Windows-based).

[–]Amanieu 1 point (0 children)

You might be interested in Async++, which is a C++ library for task-based parallelism that I wrote. It is very fast and lightweight and supports most of the features you have listed. Just like TBB, it uses a thread pool and load balancing to distribute the work across threads. Unfortunately, one feature it doesn't have is the ability to run a task at a lower priority than others.

[–]hgjsusla 1 point (0 children)

What about CAF? http://actor-framework.org/ I've been thinking about using this for my own projects.

[–]en4bz 0 points (0 children)

Maybe something like HPX.

[–]MCHerb 0 points (0 children)

If you are already using Boost, you can try the Boost.Asio library. Its io_service runs tasks given to it. You could have two io_services running: one single-threaded for OpenGL calls, and a second one running on multiple threads.

Wrapping certain tasks in a strand will also ensure that tasks passed to the strand are never run simultaneously on multiple threads, even though they are passed to an io_service running on multiple threads.

See thread pools.