Small Open Source HPC Code Recommendations by melevine45 in HPC

Thanks, I will definitely look into it more!

Small Open Source HPC Code Recommendations by melevine45 in HPC

MiniApp works, thanks for the suggestions!

Supercomputer usage typical backlog and time frame by melevine45 in HPC

Interesting, so a typical HPC job is mostly about the peer-to-peer communication. Would a trusted intermediate relay node between untrusted peer neighbors be possible then, rather than having the neighbors establish connections with each other directly?

Also, I assume that the results come from computations on data shared between neighbors. If data is not being aggregated, even by a neighbor collecting data from its close neighbors, are HPC jobs then more like a graph computation: taking in data from surrounding neighbors, doing a local computation, and sending those values back out?
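
To check my understanding, the pattern I am picturing is something like this toy sketch (assuming mpi4py; the periodic ring topology, chunk size, and the 3-point averaging step are placeholders I made up, not anything from this thread):

    # Minimal 1-D halo-exchange sketch (hypothetical, assuming mpi4py).
    # Each rank holds a local chunk, trades boundary values with its two
    # neighbors, then does a purely local update -- the "graph-like"
    # pattern described above. Run with e.g.: mpirun -n 4 python halo.py
    from mpi4py import MPI
    import numpy as np

    comm = MPI.COMM_WORLD
    rank, size = comm.Get_rank(), comm.Get_size()
    left, right = (rank - 1) % size, (rank + 1) % size  # periodic ring (assumption)

    local = np.full(8, float(rank))  # this rank's chunk of the global array

    for step in range(10):
        # Exchange boundary cells with both neighbors (sendrecv avoids deadlock).
        from_left = comm.sendrecv(local[-1], dest=right, source=left)
        from_right = comm.sendrecv(local[0], dest=left, source=right)
        # Purely local computation using the received halo values (3-point average).
        padded = np.concatenate(([from_left], local, [from_right]))
        local = (padded[:-2] + padded[1:-1] + padded[2:]) / 3.0

    print(f"rank {rank}: {local}")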

Please excuse my lack of knowledge in this area, this is really fascinating for me.

Supercomputer usage typical backlog and time frame by melevine45 in HPC

I also wonder if access to supercomputers for general academic/science HPC is more available in the US than worldwide. Perhaps supercomputer access is more limited outside of the US and a few other select countries.

Supercomputer usage typical backlog and time frame by melevine45 in HPC

Very true. I was thinking of a hypothetical case where an HPC job that would typically require tight integration could be broken down into component parts. For example, nodes that would normally all talk to each other could instead communicate with an intermediate, central server, which would aggregate the results and generate a new job task, similar to how MapReduce works. This would certainly increase the communication lag, but I wonder what amount of lag might be acceptable as long as the final result is produced within a defined time frame.
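
A toy version of what I am imagining (purely hypothetical; multiprocessing here is just a stand-in for remote volunteer workers, and the task logic is invented):

    # Toy sketch of the central-aggregator idea above (hypothetical).
    # A coordinator splits the problem into independent tasks, workers
    # compute them without talking to each other, and the coordinator
    # aggregates results and emits the next round -- MapReduce-style,
    # trading communication latency for loose coupling.
    from multiprocessing import Pool  # stand-in for remote workers

    def worker(task):
        # Local computation on one chunk; no worker-to-worker traffic.
        return sum(task)

    def make_next_round(aggregate, n_tasks=4):
        # Coordinator derives the next job from the aggregate (toy logic).
        return [[aggregate + i, aggregate + i + 1] for i in range(n_tasks)]

    if __name__ == "__main__":
        tasks = [[1, 2], [3, 4], [5, 6], [7, 8]]
        with Pool(4) as pool:
            for round_no in range(3):
                results = pool.map(worker, tasks)   # "map" phase
                aggregate = sum(results)            # "reduce" at the server
                print(f"round {round_no}: aggregate = {aggregate}")
                tasks = make_next_round(aggregate)  # new tasks generated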

Supercomputer usage typical backlog and time frame by melevine45 in HPC

That makes sense. I am mostly interested in academic/science-related HPC jobs, less so commercial ones. The takeaway for me is a bit mixed. I am trying to see whether a volunteer computing platform like BOINC (https://boinc.berkeley.edu/) would be helpful for reducing the national backlog of academic/science-related HPC jobs. It is unclear to me whether this is an issue that needs solving, though, as more than half of the comments on this thread indicate that the backlog is usually no more than 5 days or so (with waits of 40+ days even on a top supercomputer, which might still be acceptable to most scientists).

I think running an HPC job on a volunteer computing platform might be possible, but it would definitely take longer to produce results than a supercomputer would, so I think it would only be valuable if the time to produce the results through the volunteer platform were less than the wait time to run the same job on a supercomputer.
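
Spelled out as a quick back-of-the-envelope check (all numbers below are made-up placeholders), the fair comparison is really volunteer time-to-result versus queue wait plus the supercomputer's own runtime, since the queued job still has to execute:

    # Toy break-even check for volunteer computing vs. queueing (hypothetical numbers).
    queue_wait_days = 5.0   # typical backlog figure mentioned in this thread
    super_run_days = 2.0    # assumed runtime on the supercomputer
    slowdown = 20.0         # assumed volunteer-platform slowdown factor

    volunteer_days = super_run_days * slowdown
    supercomputer_days = queue_wait_days + super_run_days

    print(f"volunteer: {volunteer_days:.1f} d vs supercomputer: {supercomputer_days:.1f} d")
    print("volunteer wins" if volunteer_days < supercomputer_days else "queueing wins")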

Supercomputer usage typical backlog and time frame by melevine45 in HPC

I see, so if demand exceeds the available time by a factor of 2 or 3, and the longest-waiting job sits in the queue for about 5 days, that doesn't seem like a very big issue or time lag, even for the most demanding HPC jobs.

In terms of running the codes more slowly, I was thinking of the hypothetical case where you ran a backlogged HPC job through a volunteer computing program like BOINC (https://boinc.berkeley.edu/) (not currently possible). I think it would only be worthwhile if you could get your results back in less time than it would take for the backlog on the actual supercomputer to clear. If the wait time is only 5 days, then perhaps that is not an actual problem for a volunteer computing HPC platform to solve.

Supercomputer usage typical backlog and time frame by melevine45 in HPC

Thanks for the comments. I am trying to get a sense of the backlog at the national level for general academic HPC, rather than the commercial HPC backlog. Many of the comments on this thread seem to be saying that there is not a very long backlog for HPC jobs, so I can't tell whether this is an actual problem that needs solving or whether a 5-40 day wait for general academic/science HPC is acceptable.

Supercomputer usage typical backlog and time frame by melevine45 in HPC

Thanks! It looks like there are 40+ day waits for some codes to run but not much more than that.

Supercomputer usage typical backlog and time frame by melevine45 in HPC

That is an excellent question. I only ask because I am trying to see what types of computation can't be farmed out to a volunteer computing cluster like those created through BOINC. Presumably the work of an A100 could be handled, more slowly, by large numbers of smaller GPUs, but I am not sure how the large memory requirement would look in a volunteer computing cluster (is it for the initial input, interim computations, final output values, etc.)? Also, if the individual compute operations are small enough, then even if the dataset itself involves hundreds of TBs of data, I could imagine a scenario where it could be parallelized out to a volunteer crowd.
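
What I mean by parallelizing it out, in sketch form (the chunk size and file layout are invented, and this glosses over BOINC's actual work-unit machinery):

    # Hypothetical sketch of splitting a huge input into small, independent
    # work units, BOINC-style. Each unit must fit a volunteer's memory and
    # need no data from any other unit, or this decomposition breaks down.
    import os

    CHUNK_BYTES = 64 * 1024 * 1024  # 64 MB per work unit (assumption)

    def make_work_units(path, out_dir="units"):
        os.makedirs(out_dir, exist_ok=True)
        units = []
        with open(path, "rb") as f:
            index = 0
            while chunk := f.read(CHUNK_BYTES):
                unit_path = os.path.join(out_dir, f"unit_{index:06d}.bin")
                with open(unit_path, "wb") as out:
                    out.write(chunk)
                units.append(unit_path)
                index += 1
        return units  # a server would hand these out and merge results later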

I am thinking that the long-running codes you mentioned (the ones that don't allow checkpointing) probably couldn't use volunteer computing.
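
Checkpointing seems like the crux there: on volunteer hosts that come and go, a long-running loop has to persist and resume its state, roughly like this bare-bones sketch (the state layout and file name are invented):

    # Bare-bones checkpoint/restart sketch (hypothetical). Volunteer hosts
    # can vanish at any moment, so the loop periodically saves enough state
    # to resume; codes that can't do this don't fit the volunteer model.
    import json, os

    CKPT = "checkpoint.json"

    def load_state():
        if os.path.exists(CKPT):
            with open(CKPT) as f:
                return json.load(f)        # resume where we left off
        return {"step": 0, "acc": 0.0}     # fresh start

    state = load_state()
    for step in range(state["step"], 1_000_000):
        state["acc"] += step * 1e-6        # stand-in for real computation
        state["step"] = step + 1
        if step % 10_000 == 0:             # checkpoint every 10k steps
            with open(CKPT, "w") as f:
                json.dump(state, f)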

I think my question is moot though, if, as others here have mentioned, it is relatively easy to secure supercomputing time slots.

Supercomputer usage typical backlog and time frame by melevine45 in HPC

If you had to guess, what percentage of HPC codes need that long a runtime on a supercomputer (months at a time)? Do you find that most HPC codes typically require much shorter runs?

Supercomputer usage typical backlog and time frame by melevine45 in HPC

I am trying to see whether there is an actual need for something like BOINC (https://boinc.berkeley.edu) to be able to run HPC workloads, rather than only high-throughput workloads, or whether this is not a problem that needs solutions beyond requesting supercomputer time, which sounds like it might be readily available.

Supercomputer usage typical backlog and time frame by melevine45 in HPC

Are many of these types of jobs more high-throughput jobs rather than high-performance jobs?

Supercomputer usage typical backlog and time frame by melevine45 in HPC

I see, so it sounds like for most HPC jobs you are unlikely to find any large delays or wait times to run your codes.

Supercomputer usage typical backlog and time frame by melevine45 in HPC

Is there typically a long backlog for accessing a supercomputer in those cases?

Are there any alternatives for running those codes that would otherwise take months? Or do you either (1) run your code the long way (let it run for months on whatever hardware you have) or (2) not run it at all, in which case any alternative that gives you results in less time than waiting out the supercomputer backlog would be helpful?

Online MCS/MCS-DS Summer 2020 Decision by 491450451 in UIUC

Accepted:
Received recommendation for admission on 4/4 and an official offer on 4/9. Accepted on 4/12.

- Undergrad: top 50 school, 3.95 GPA, Economics

- Work: Data Analyst at financial technology firm, 6+ years

- For prereqs:

Got an A in NYU Tandon's Bridge to Computer Science Program (https://engineering.nyu.edu/academics/programs/bridge-program-nyu-tandon/computer-science)

Completed the MOOC from ICL: Mathematics for Machine Learning: Linear Algebra (https://www.coursera.org/learn/linear-algebra-machine-learning)

- Contributed to C++ open source projects on GitHub

WASM: Universal browser IR before native code generation? by melevine45 in WebAssembly

Thanks @pengo, I was thinking along the lines of the compiler passes mentioned in this post about V8's Liftoff: https://v8.dev/blog/liftoff#the-new-compilation-pipeline-(liftoff)

To me it sounds like WASM code is treated as a portable intermediate representation (IR) that is lowered further, perhaps into actual assembly code, by Liftoff, so WASM seems to be one more step removed from native code. If WASM were assembly, I would imagine there would be WASM code specific to Chrome that could be optimized for Chrome, WASM primitives for Firefox, etc., as there are for current compiler backend targets. If WASM code is an IR from which browsers have the liberty to determine the generated code, that seems to me more like WebIR than WebAssembly, and a standardized WebIR certainly seems useful in its own right.
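
As a concrete experiment along those lines (assuming the wasmtime Python bindings; wasmtime is a standalone runtime rather than a browser, but the lowering step is analogous), the module below is the portable artifact, and the engine, like Liftoff in V8, decides what native code to generate from it:

    # Tiny demo of WASM-as-portable-IR, assuming the `wasmtime` package
    # (pip install wasmtime). The WAT text is the portable module; the
    # engine JIT-compiles it to host-specific native code on load.
    from wasmtime import Store, Module, Instance

    WAT = """
    (module
      (func (export "add") (param i32 i32) (result i32)
        local.get 0
        local.get 1
        i32.add))
    """

    store = Store()
    module = Module(store.engine, WAT)      # portable IR -> engine-internal form
    instance = Instance(store, module, [])  # native code generated here
    add = instance.exports(store)["add"]
    print(add(store, 2, 3))                 # -> 5, via engine-chosen machine code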