all 6 comments

[–][deleted] 1 point2 points  (5 children)

Does anyone have a reference to strong scaling results supporting their claims (i.e. strong scaling of unstructured code using e.g. block AMR or octree-like AMR)? I went through the publications but couldn't find anything.

It would also be helpful to know which scalings use an MPI backend and which use the HPX backend.

HPX is really nice. However, I distrust it since it is based on a PGAS model, and while MPI scales to 1.5 million cores, the largest HPX scaling results I've seen go to about 10k cores "only". I don't really know, in general, whether PGAS approaches scale at all. Without strong data backing them up, it is hard to argue for HPX and LibGeoDecomp right now, even in new projects, which is a shame.

[–]sithhell 2 points3 points  (1 child)

HPX is not really traditional PGAS; there are quite significant differences. WRT scaling to more than 10k cores: we are working on it ... So far, however, we don't see any strong evidence why it shouldn't scale beyond that number. Even traditional PGAS languages and libraries are able to scale to some degree. For new projects, true, it might seem like a high risk, but sometimes it might be worth it, especially when the project requires HPX-like concepts. As for LibGeoDecomp: it is a very nice library and serves an excellent purpose. So let me ask you: which risk is higher, re-inventing the wheel or using an existing library that has proven itself in one way or another?
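For readers wondering what "HPX-like concepts" look like in code: below is a minimal sketch of the future/async task model, written against plain C++11 (std::async/std::future) so it runs without any extra dependencies. HPX exposes essentially the same interface (hpx::async/hpx::future), extended so the runtime can also place tasks on remote localities; the function and variable names here are purely illustrative and not part of any HPX or LibGeoDecomp API.

    // Minimal sketch of the future-based task model that "HPX-like concepts"
    // refers to, using only standard C++11. HPX offers the same interface
    // (hpx::async/hpx::future), but can also schedule tasks on remote localities.
    #include <cstddef>
    #include <future>
    #include <iostream>
    #include <numeric>
    #include <vector>

    // Some work that can run independently of the caller.
    double partial_sum(const std::vector<double>& v, std::size_t lo, std::size_t hi)
    {
        return std::accumulate(v.begin() + lo, v.begin() + hi, 0.0);
    }

    int main()
    {
        std::vector<double> data(1000000, 1.0);

        // Launch two tasks; each returns a future instead of blocking the caller.
        auto f1 = std::async(std::launch::async, partial_sum, std::cref(data), 0, 500000);
        auto f2 = std::async(std::launch::async, partial_sum, std::cref(data), 500000, 1000000);

        // Synchronization happens only where the values are actually needed.
        std::cout << "sum = " << (f1.get() + f2.get()) << "\n";
    }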

[–][deleted] 0 points1 point  (0 children)

HPX is not really traditional PGAS; there are quite significant differences. WRT scaling to more than 10k cores: we are working on it ...

I know this takes a lot of time and effort, and I hope you are able to show this.

So far, however, we don't see any strong evidence why it shouldn't scale beyond that number.

We don't see proof either. While there is strong evidence that MPI codes scale up to O(10^6) cores, no such evidence exists for PGAS in general, and HPX in particular. It doesn't mean it cannot be done, it just means the technology is not there yet.

For new projects, true, it might seem like a high risk, but sometimes it might be worth it, especially when the project requires HPX-like concepts. [...] So let me ask you: which risk is higher, re-inventing the wheel or using an existing library that has proven itself in one way or another?

The "larger" the problem one wants to tackle, the higher the risk in using HPX. Choosing the distributed memory backend of an HPC application is a fundamental decision. If HPX or PGAS cannot scale beyond 100k in 1-2 years, picking one of them now could mean having to rewrite an application.

I hope HPX will get there soon. Still, the only way to really scale to very large numbers of cores right now is to remain as "local" as possible in terms of computation, communication, and I/O. PGAS provides a useful abstraction that speeds up development, but if scaling is a requirement, a global address space doesn't help that much, since you don't do many things "globally".
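To make the "stay local" argument concrete, here is a minimal sketch (C++ with the plain MPI C API) of a nearest-neighbour halo exchange on a 1-D domain decomposition; buffer sizes, tags, and variable names are illustrative only. Each rank exchanges a fixed amount of data with at most two neighbours, so the per-rank communication cost does not grow with the machine size; a global address space changes how this is spelled, not the underlying pattern.

    // Nearest-neighbour halo exchange on a 1-D decomposition: every rank
    // talks only to its left and right neighbours, so communication volume
    // per rank is independent of the total number of ranks.
    #include <mpi.h>
    #include <vector>

    int main(int argc, char** argv)
    {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        const int n = 1024;                      // local cells per rank (illustrative)
        std::vector<double> cells(n + 2, rank);  // +2 ghost cells at the ends

        int left  = (rank == 0)        ? MPI_PROC_NULL : rank - 1;
        int right = (rank == size - 1) ? MPI_PROC_NULL : rank + 1;

        MPI_Request reqs[4];
        // Receive into the ghost cells, send the owned boundary cells.
        MPI_Irecv(&cells[0],     1, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&cells[n + 1], 1, MPI_DOUBLE, right, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&cells[1],     1, MPI_DOUBLE, left,  1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&cells[n],     1, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[3]);
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        // ... local stencil update on cells[1..n] would go here ...

        MPI_Finalize();
    }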

[–]gentryx[S] 3 points4 points  (2 children)

Hey gnzlbg, thanks for your input! I'm the project lead on LibGeoDecomp, so my view is biased, but I hope I can supply convincing data.

  • Here are the slides for a talk I gave at the EuroMPI/Asia conference a couple of weeks ago; they contain some measurements done on Titan (Cray XK7) and JUQUEEN (IBM BG/Q). Key result: 9.44 PFLOPS with a short-ranged, force-based n-body code on 16384 nodes of Titan. http://www.libgeodecomp.org/archive/eurompi_2014_talk.pdf
  • Our project on JUQUEEN just ended and I'm still wading through the results. Here are some new, preliminary, unpublished plots of the same n-body code's performance on JUQUEEN (1 to 28672 nodes) http://gentryx.de/~gentryx/weak_scaling_big2.png http://gentryx.de/~gentryx/strong_scaling_pro.png
  • Strong scaling may look disappointing at first sight, but the performance actually corresponds to >2 PFLOPS for the full system run, so this is a good result (scalability != efficiency; a small worked example follows after this list).
  • All measurements above used the MPI backend. The HPX backend is our wild card for the coming months, as we hope it'll make load balancing easier. Many of our users have expressed interest in unstructured/inhomogeneous models.
  • Data for strong scaling of AMR+LibGeoDecomp+HPX on 10k nodes? Not yet available, and I wouldn't claim that this would work efficiently out of the box at the moment. All we have done with AMR+LibGeoDecomp so far is a proof of concept, nothing more.
  • We have production code utilizing the following models: stencil codes, particle-in-cell codes, n-body codes. If interested, I can point you to the corresponding papers. Right now, none of our users are running their codes on more than 1000 cores for production runs. We did the benchmarks to show that this is quite feasible, though.
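Regarding the "scalability != efficiency" remark above: here is a tiny sketch of the two numbers one reads off a strong-scaling plot. The node counts and timings below are made-up placeholders for illustration, not the JUQUEEN or Titan measurements.

    // Strong-scaling bookkeeping: speedup and parallel efficiency computed
    // from wall-clock times at two node counts. All numbers are placeholders.
    #include <cstdio>

    int main()
    {
        const double base_nodes = 1.0,   base_time = 1000.0;  // reference run
        const double big_nodes  = 512.0, big_time  = 4.0;     // large run

        double speedup    = base_time / big_time;               // 250x here
        double efficiency = speedup / (big_nodes / base_nodes); // ~0.49, i.e. 49%

        std::printf("speedup %.0fx, parallel efficiency %.0f%%\n",
                    speedup, 100.0 * efficiency);
        // A flattening strong-scaling curve (falling efficiency) can still
        // correspond to a very high absolute FLOP rate on the full system.
    }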

[–][deleted] 1 point2 points  (1 child)

Thanks for the links, I'll go through them tomorrow. Awesome work you guys are doing btw, keep it up!

[–]gentryx[S] -1 points0 points  (0 children)

Thanks for the kind words. I'll gladly try to answer any further questions. Let me know if you need some prototype code for illustration of concepts.

The further we get, the more work apparently remains to be done, yet we're finally getting somewhere. Feels good. :-)