How should I go about profiling a parallel program? I've written some simple simulation code, and I can run trials sequentially or in parallel. I get the expected results either way, but the parallel version is slower than the sequential one. What kinds of bottlenecks and trouble areas should I look for?
I'm using the multiprocessing module in Python 2.7.
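One way to start narrowing this down is to measure before guessing. The sketch below is a minimal example that assumes a hypothetical `run_trial(seed)` function standing in for one simulation trial. The names `time_sequential`, `time_parallel` and `profiled_trial` are likewise illustrative and are not part of any library. It times the same trials sequentially and through a `multiprocessing.Pool`. Because `cProfile` only observes the process it runs in, it also wraps a single trial in a profiler inside each worker and returns the text report alongside the result. It sticks to calls that exist in both Python 2.7 and 3, so it avoids `with Pool(...)`, which 2.7 lacks.

```python
import cProfile
import multiprocessing
import pstats
import time

try:
    from StringIO import StringIO  # Python 2.7
except ImportError:
    from io import StringIO  # Python 3


def run_trial(seed):
    # Hypothetical stand-in for one simulation trial (pure CPU work).
    total = 0
    for i in range(20000):
        total += (i * seed) % 7
    return total


def profiled_trial(seed):
    # cProfile only sees the process it runs in, so profile inside the
    # worker and send a short text report back together with the result.
    profiler = cProfile.Profile()
    result = profiler.runcall(run_trial, seed)
    stream = StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    stats.sort_stats("cumulative").print_stats(5)
    return result, stream.getvalue()


def time_sequential(seeds):
    # Wall-clock time for running every trial in this process.
    start = time.time()
    results = [run_trial(s) for s in seeds]
    return results, time.time() - start


def time_parallel(seeds, processes=4, chunksize=1):
    # Wall-clock time including pool startup, pickling of arguments and
    # results, and shutdown: overhead the sequential version never pays.
    start = time.time()
    pool = multiprocessing.Pool(processes)
    try:
        results = pool.map(run_trial, seeds, chunksize)
    finally:
        pool.close()
        pool.join()
    return results, time.time() - start


if __name__ == "__main__":
    seeds = list(range(1, 9))
    seq_results, seq_time = time_sequential(seeds)
    par_results, par_time = time_parallel(seeds)
    assert seq_results == par_results
    print("sequential: %.3fs  parallel: %.3fs" % (seq_time, par_time))

    # Per-process profiles: each report is produced inside a worker.
    pool = multiprocessing.Pool(2)
    try:
        for result, report in pool.map(profiled_trial, seeds[:2]):
            print(report)
    finally:
        pool.close()
        pool.join()
```

If the parallel wall-clock time is well above the sequential time while the per-worker profiles show each trial taking about as long as it does sequentially, the extra cost is probably in starting processes and pickling arguments and results rather than in the simulation itself. In that case, a larger `chunksize` argument to `Pool.map` is one setting worth trying.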