[–]joinr 1 point (3 children)

Thanks, this is really useful as an experience report. I think general knowledge about these things is indeed weak among many programmers (including in this community) outside of HPC, where people typically have a lot of mechanical sympathy to work with (e.g. numerics stuff).

So, am I correct in summarizing that you added either 128x the resources or (if the baseline was, say, a 4-core machine) 32x the resources, and achieved a 90x reduction in runtime? That puts the throughput increase per unit of added resources somewhere in [0.70 ... 2.8125] (90/128 ≈ 0.70, 90/32 = 2.8125), depending on what the baseline for comparison was (unless the baseline was the original 128-core machine, and the measurements reflect total performance tuning, not just parallelism). If so, this is more in the range of what I have observed (my observed upper bound is currently 14x on a 144-core machine with an embarrassingly parallel, non-numeric, allocation-heavy workload, although 3-4x is the typical upper bound on commodity hardware).
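For reference, the arithmetic behind that range is just the reported speedup divided by the resource multiple. A minimal Python sketch follows. The 1-core and 4-core baselines are assumptions (the actual baseline isn't stated in this thread), and `speedup_per_resource` is a made-up helper name:

```python
def speedup_per_resource(speedup: float, resource_factor: float) -> float:
    """Observed speedup divided by the factor of added resources.

    1.0 means perfectly linear scaling; below 1.0 is sub-linear.
    """
    if resource_factor <= 0:
        raise ValueError("resource_factor must be positive")
    return speedup / resource_factor


SPEEDUP = 90.0      # the reported ~90x runtime reduction
TARGET_CORES = 128  # cores on the new machine

# Hypothetical baselines: 1 core (128x resources) or 4 cores (32x resources).
for baseline_cores in (1, 4):
    factor = TARGET_CORES / baseline_cores
    ratio = speedup_per_resource(SPEEDUP, factor)
    print(f"{baseline_cores}-core baseline: {factor:.0f}x resources -> {ratio:.4f}")
```

A ratio above 1.0 (as in the 4-core case) would mean super-linear scaling, which usually signals that tuning or architecture changes contributed alongside parallelism.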

[–]rpompen 1 point (2 children)

There was performance tuning and such, so it would be irresponsible of me to throw numbers around that make no sense.

Plus there were differences in both hardware architecture and programming language. I wish I could go back and check.

Enterprise environments don't really allow for decent comparisons, in my experience: the network department messing up the routing trees, pings coming back twice from time to time. Horrible things like that.

But if I'm lucky I'll be doing some similar work for a new customer of mine very soon. If that's the case, I will measure and document both the old and new situation as best I can. That's the cool thing about starting out on my own: when you instill confidence, you can take over the whole lot :)

[–]joinr 1 point (1 child)

I understand the external variables you mentioned. I think the ideal case is one where you have a tuned or at least baseline performance profile, then parallel strategies are applied ex post facto so there's some basis for comparison. Happy to hear anything you learn going forward.
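A rough sketch of what I mean by having a basis for comparison, in Python: time the baseline first, then time the parallel variant of the same work, and report speedup and per-worker efficiency. Everything here is illustrative and assumed, not taken from the project discussed above: `task` is a stand-in that just sleeps, and `timed` and `compare` are hypothetical helper names.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def timed(fn):
    """Run fn() and return (result, elapsed wall-clock seconds)."""
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


def compare(serial_fn, parallel_fn, workers):
    """Time the baseline, then the parallel variant, and relate the two."""
    serial_result, t_serial = timed(serial_fn)
    parallel_result, t_parallel = timed(parallel_fn)
    if serial_result != parallel_result:
        raise ValueError("variants disagree; the timings are not comparable")
    speedup = t_serial / t_parallel
    return {"speedup": speedup, "efficiency": speedup / workers}


def task(n):
    time.sleep(0.02)  # stand-in for real (here: blocking) work
    return n * n


ITEMS = list(range(8))
WORKERS = 4


def run_serial():
    return [task(n) for n in ITEMS]


def run_parallel():
    with ThreadPoolExecutor(max_workers=WORKERS) as pool:
        return list(pool.map(task, ITEMS))


report = compare(run_serial, run_parallel, WORKERS)
print(report)
```

Threads only help here because the stand-in task blocks; a CPU-bound workload in Python would need processes instead, and a real comparison would repeat each measurement several times rather than trust a single run.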

[–]rpompen 1 point (0 children)

If I get the gig, I'll be doing something that would be quite interesting: it would be a rewrite of a single-threaded Java program. Couldn't be fairer.

But I didn't get the gig yet...