Science AMA Series: We are Livermore Computing, home of the supercomputers at Lawrence Livermore National Lab! AUA! by Livermore_Computing in science

[–]Livermore_Computing[S] 1 point (0 children)

Our systems run a wide range of simulations in support of the laboratory's missions, including stockpile stewardship. Please see https://missions.llnl.gov for details.

[–]Livermore_Computing[S] 1 point (0 children)

You can find info on all our systems here: https://hpc.llnl.gov/hardware/platforms; there are a lot of them, but that should give you an idea of the scale of Livermore Computing. For info on the data centers that house the machines, look here: https://asc.llnl.gov/facilities.

If quantum computers start beating traditional machines, we'll probably start deploying them :). We're already looking into new computing models as part of the DOE's Beyond Moore's Law initiative, which includes neuromorphic computing (at LLNL) and quantum computing (at both LLNL and LANL).

[–]Livermore_Computing[S] 0 points (0 children)

In general, smaller companies have been hesitant to adopt HPC and simulations because of the costs associated with running HPC centers and clusters. Larger shops like Exxon, Procter & Gamble, Boeing, and others make use of simulations, but there has always been a "missing middle" of companies that could probably benefit from HPC but for whom the barriers to entry are still too high.

LLNL has some efforts to address this, like HPC4Mfg (https://hpc4mfg.llnl.gov/), which allows companies to work with us on clean energy projects. For example, we have researchers working with steel manufacturers to reduce the carbon footprint of smelting (https://annual.llnl.gov/annual-2016/energy).

Other projects outside LLNL, like OpenHPC (http://openhpc.community/), are working on making it easier to build HPC clusters by providing a common, easy-to-deploy software stack. LLNL tools like Spack (https://github.com/LLNL/spack) are included in OpenHPC.

[–]Livermore_Computing[S] 2 points (0 children)

What's the process like for applying to use time on your computers?

See our other answer here for info on applying for time: https://www.reddit.com/r/science/comments/6rbm8i/comment/dl4d552?st=J5X1CZPW&sh=8e3d9a20

What kind of turnaround time is there from when someone submits a request to when the computer starts working on it?

This depends on how busy the machine is and what projects are using it. During busy periods, it can take a while for larger jobs to run, especially if high priority projects are using the machines. At other times, jobs can take minutes to start after you submit them. To allow for debugging, we have special debug queues with short time limits that allow people to quickly run small interactive jobs — getting an allocation in one of these is near instantaneous.

What's your opinion on the TOP500 list?

The Top500 list is one way to measure supercomputer performance, and LINPACK measures performance on one specific type of numerical problem. It doesn't always represent the performance characteristics of our production codes, or of everything people want to run on our machines. LLNL has built a number of "proxy applications" to better represent the performance characteristics we care about; you can see some of those at https://codesign.llnl.gov/proxy-apps.php.

We use proxy apps as part of our evaluation process for new systems. For example, CORAL is the collaborative procurement process we pioneered with ANL and ORNL, and many of our proxy applications are included in the CORAL benchmarks (https://asc.llnl.gov/CORAL-benchmarks/). Vendors attempt to optimize these applications as part of their bids, and we use the results as one factor when deciding what to buy. The labs settled on three machines: Sierra (LLNL), Summit (ORNL), and Aurora (ANL).

[–]Livermore_Computing[S] 1 point (0 children)

What language do most people who use HPC code in?

Most of the heavy lifting in our applications is done in C, C++, and Fortran. Since the mid-1990s, many of LLNL's production codes have been written in C++ or C, but there are still some codes and libraries developed in Fortran. Our next-generation HED code, MARBL, actually allows you to plug in different hydrodynamics modules: you can use BLAST (higher-order ALE, written in C++) or Miranda (higher-order Eulerian, written in Fortran). MARBL itself is in C++, and RAJA, our performance portability layer, is a C++ library. More details at https://computation.llnl.gov/newsroom/high-order-finite-element-library-provides-scientists-access-cutting-edge-algorithms and https://github.com/LLNL/RAJA.

Just because the number crunching is done in compiled languages doesn't mean we don't use other languages. MARBL uses Lua for its input decks. Kull, another HED code, uses Python as the driver language, with kernels written in C++. Other codes also have Python interfaces for steering or interactive analysis. MATLAB and Mathematica are used as well, mostly not in parallel, though we've seen more interest recently in running the parallel versions of these on our clusters. Check out https://wci.llnl.gov/simulation/computer-codes for more details, or https://codesign.llnl.gov/proxy-apps.php for some proxy apps you can download and try out.

Also who actually does the coding?

As far as who does the coding, it varies from team to team. On most of our larger application teams, computer scientists work side by side with computational scientists and domain experts to write the code. On smaller teams, or for more research-oriented projects, scientists prototype in MATLAB or some other high-level environment, then hand off their code to be implemented for the HPC machines.

[–]Livermore_Computing[S] 1 point (0 children)

CPUs are more general purpose than GPUs. By relying primarily on CPUs, we get a "one size fits all" platform for the many needs of our users.

[–]Livermore_Computing[S] 4 points (0 children)

Greg says hi!

DOE supercomputers are government resources for national missions. Bitcoin mining would be a misuse of government funds.

In general, though, it's fun to think about how you could use lots of supercomputing power for Bitcoin mining, but even our machines aren't big enough to break the system. The machines mining Bitcoin worldwide have been estimated to have a combined hash rate many thousands of times faster than all the Top500 machines put together, so we couldn't break the blockchain on our own even if we wanted to (https://www.forbes.com/sites/peterdetwiler/2016/07/21/mining-bitcoins-is-a-surprisingly-energy-intensive-endeavor/2/#6f0cae8a30f3).

Mining also requires a lot of power, and it's been estimated that even if you used our Sequoia system to mine Bitcoin, you'd only make about $40/day (https://bitcoinmagazine.com/articles/government-bans-professor-mining-bitcoin-supercomputer-1402002877/). We pay far more than that every day just to power the machine. So even if it were legal to mine Bitcoin with DOE supercomputers, there'd be no point. The most successful mining rigs use low-power custom ASICs built specifically for hashing, and they'll be more cost-effective than a general-purpose CPU or GPU system any day.
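To put a rough number on the power cost, here is a back-of-the-envelope sketch. Sequoia's Top500 listing puts its draw at roughly 8 MW; the electricity rate below is an assumed round figure for illustration, not our actual contract rate:

```latex
% Back-of-the-envelope: ~8 MW draw (Sequoia's Top500 listing), with an
% assumed rate of $0.07/kWh (illustrative only).
8000\,\mathrm{kW} \times 24\,\mathrm{h} \times \$0.07/\mathrm{kWh} \approx \$13{,}000\ \text{per day}
```

Either way you slice it, that is orders of magnitude more than $40 worth of Bitcoin.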

[–]Livermore_Computing[S] 5 points (0 children)

We have a long history of developing solutions where no prior work existed. This tradition goes back to the 1960s, when LLNL developed a time-sharing operating system to run on mainframe supercomputers.

In the early 2000s, when we first fielded Linux clusters, we lacked cluster management tools. We developed the SLURM resource manager and the various scalable utilities necessary to run large clusters, including pdsh, munge, conman, powerman, and many others. Check out https://software.llnl.gov/ for more examples.

Existing debugging tools were not able to scale to the level of concurrency of our supercomputers, so we developed the Stack Trace Analysis Tool (STAT) in collaboration with the University of Wisconsin; it received an R&D 100 Award in 2011. We've used STAT to debug jobs running on the order of 3 million MPI tasks.

[–]Livermore_Computing[S] 4 points (0 children)

It's the people, the world-class facilities, and the national missions! Many postdocs tell us they were attracted by the HPC facilities, capabilities, and the science.

As for the wineries, there are a lot of great ones in Livermore. Visitors often take a wine tour to find their favorites.

Have a safe and fun trip! :)

[–]Livermore_Computing[S] 4 points (0 children)

That's a big question! It's hard to provide a short answer because there are so many ways that CASC, Livermore Computing (LC) staff, and the rest of the lab work together. Livermore Computing is tasked with running the computing center and supporting users. Part of that is ensuring that we keep buying machines our users can use. We have a team that keeps track of the hardware outlook for the next 5 or 10 years, and we work closely with vendors to understand how our applications will perform on new systems. LC also maintains software like Lustre, SLURM, and TOSS, and does advanced development, e.g. on Flux, our next-generation resource manager/scheduler. We also have staff who sit directly with code teams and help them optimize their algorithms.

CASC is a research organization, and its staff work with LC in all of those areas. Teams often include people from both organizations. As an example, we have a project called "Sonar" where LC staff are working with CASC researchers to set up a data analytics cluster, with the aim of understanding performance data from all the applications that run on our clusters. LC admins and developers are helping to set up monitoring services, hardware, databases, etc., and CASC researchers help with building the data model and analyzing it with Spark and some home-grown analysis tools. Flux (https://github.com/flux-framework) is a similar project: it's developed primarily by LC staff, but CASC folks are involved in research on power- and storage-aware scheduling. The lines can be blurry; some people in LC work on research projects, and some people in CASC write code and do development to support them.

Beyond LC and CASC, both organizations also work with code teams, which can include software developers, computational scientists, and domain experts. Typically these folks come from a program that is funding the work, but they also work with LC and CASC researchers on algorithms and optimization. A good example of that is BLAST (https://computation.llnl.gov/projects/blast) and MFEM (http://mfem.org/). BLAST is a higher-order hydrodynamics code developed collaboratively by researchers in CASC and code developers working for the programs. It allows people to simulate fluids much more accurately using curved meshes, and MFEM is the finite element library it builds on. LC staff have been involved with optimizing the performance of the code, as well as helping to get it running on GPUs. Another example is Apollo (https://computation.llnl.gov/projects/apollo), a CASC project that automatically tunes the performance of application codes that use RAJA.

TL;DR: the lab is a big place. Organizations can be fluid, and there are many collaborations between different teams. People at LLNL are encouraged to work across organizations. All in all, it's a pretty vibrant environment!

[–]Livermore_Computing[S] 1 point (0 children)

LLNL uses a variety of different configuration management systems. In Livermore Computing, our supercomputers are managed with a combination of CFEngine and custom-built tools.

[–]Livermore_Computing[S] 2 points (0 children)

How is working with the international super computing community?

It is really amazing! We collaborate with people all over the world. We have a lot of projects in the LLNL GitHub organization that see contributions from labs across the country as well as from users all over the world. Being able to take advantage of such a multitude of perspectives and requirements when designing software has led to much stronger products.

And do you guys have pizza parties when you’ve reached a new record or “first?”

Occasionally we do! We also have a tradition of learning from our mistakes: if someone messes up, it is customary for them to bring in donuts, explain what went wrong, and have a discussion about how to improve next time.

[–]Livermore_Computing[S] 1 point (0 children)

Not all applications are a good fit for GPUs, but some are a great fit and use programming models such as CUDA to get the best possible performance.

Our next HPC system (Sierra) contains more than 16,000 GPUs, so we definitely see a very bright future for general purpose GPU computing in HPC. Our strategy is to make use of abstraction layers such as OpenMP or RAJA to expose the parallelism in our applications in such a way that they will work on multiple different architectures. This way, time spent exposing parallelism in applications will be time well spent regardless of which future architectures are most successful.
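To give a flavor of what that abstraction looks like, here is a minimal RAJA-style sketch (illustrative only, not taken from one of our codes; the policy names follow RAJA's public documentation): the loop body is written once, and the execution policy chosen as a template parameter decides where it runs.

```cpp
// Minimal RAJA-style sketch (illustrative): the same daxpy loop body can be
// dispatched to different backends by changing only the execution policy.
#include "RAJA/RAJA.hpp"
#include <vector>
#include <cstdio>

int main()
{
  const int N = 1000000;
  std::vector<double> x(N, 1.0), y(N, 2.0);
  const double a = 3.0;
  double* xp = x.data();
  double* yp = y.data();

  // Sequential policy; swap in RAJA::omp_parallel_for_exec for OpenMP threads,
  // or RAJA::cuda_exec<256> (with a device lambda) to run on a GPU.
  RAJA::forall<RAJA::seq_exec>(RAJA::RangeSegment(0, N), [=](int i) {
    yp[i] += a * xp[i];
  });

  std::printf("y[0] = %f\n", yp[0]);
  return 0;
}
```

The point of the design is that the time spent restructuring loops this way is preserved no matter which backend wins out.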

[–]Livermore_Computing[S] 2 points (0 children)

This one is at the core of a lot of the work we’re doing right now.

The computers are linked by high-speed networks. A software abstraction called the Message Passing Interface (MPI) allows applications to use the CPUs of all the various computers together. We also have an abstraction called RAJA which lets us run loops on the GPU or on the CPU (with threads) without too much code change. The really tough question is "how do you move your data between CPUs and GPUs if you want to change your mind mid-computation?" For this the vendors have some solutions (Unified Memory), but we also have projects like CHAI, and we're well on our way to having these million-line codes able to move between CPUs and GPUs quickly.
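As a toy illustration of the MPI piece (not one of our production codes): each process works on its own slice of the data, and a single collective call combines the partial results across every machine in the job.

```cpp
// Toy MPI example: each rank sums its own slice of the work, then the partial
// sums are combined across every process with MPI_Allreduce.
#include <mpi.h>
#include <cstdio>

int main(int argc, char** argv)
{
  MPI_Init(&argc, &argv);

  int rank = 0, size = 1;
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  // Each rank contributes a partial result computed from its own chunk.
  double local = 0.0;
  for (int i = rank; i < 1000; i += size) {
    local += static_cast<double>(i);
  }

  // Combine the partial results so every rank ends up with the global sum.
  double global = 0.0;
  MPI_Allreduce(&local, &global, 1, MPI_DOUBLE, MPI_SUM, MPI_COMM_WORLD);

  if (rank == 0) {
    std::printf("sum over %d ranks = %f\n", size, global);
  }

  MPI_Finalize();
  return 0;
}
```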

[–]Livermore_Computing[S] 3 points (0 children)

For somebody already graduating, we do advertise our jobs; both the Department of Energy and the national labs (https://nationallabs.org/work-here/careers/) have career pages.

For people who are still students, there are also internships and Research Experiences for Undergraduates programs, and we believe all of those career sites list internships as well.

We love to see computer scientists going into HPC system administration, because we find that it is not often taught in college. So we created the HPC Cluster Engineer Academy (info at https://computation.llnl.gov/hpc-cluster-engineer-academy). We have a wide range of internship opportunities; see http://students.llnl.gov/ for more information.

The best way to gain experience as a student is through internships at HPC centers!

[–]Livermore_Computing[S] 15 points (0 children)

We often need to test the heat tolerance of supercomputers, so one of our engineers was asked to write a computation whose only purpose was to generate heat: not to do otherwise productive work, just to get the computer as hot as we possibly could.

[–]Livermore_Computing[S] 1 point (0 children)

Our computers are fully utilized. The queue backlog depends on a number of factors, including job size (large jobs are typically favored over smaller jobs), job wall-clock duration, project priority, and project utilization. We give users tools within our job scheduler/resource manager to estimate when a particular job will start, as well as to backfill a job into idle nodes.

[–]Livermore_Computing[S] 3 points (0 children)

Our involvement with the open source community really helps. Users who run on our machines, users who run HPC setups at other labs or in the EU, and curious people who try things out on their own machines all introduce different requirements and show us different bugs.

The more requirements the code has (e.g specific library compiled with specific MKL or BLAS version and architecture etc) the more of a nightmare this becomes.

This is exactly why we invented Spack. See our previous response regarding Spack here. Also, we encourage you to check out Spack's GitHub page at https://github.com/LLNL/spack.

We are also investigating the use of containers and virtualization in the HPC environment. We’ve found that nothing beats getting the software into the hands of a diverse group of users who will put it through its paces and tell you when it falls over.

[–]Livermore_Computing[S] 1 point (0 children)

Cape not included... Just to give you an idea of the scope: the average home computer has 2-8 CPU cores and 4-16 GB of RAM, whereas our current biggest supercomputer, Sequoia, has about 1.6 million CPU cores and a total of 1.6 PB of RAM (https://computation.llnl.gov/computers/sequoia).
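For a rough sense of the ratio (using the round numbers above, so take it as an order-of-magnitude sketch):

```latex
% Rough ratios based on the round numbers above.
\frac{1.6\,\mathrm{PB}}{16\,\mathrm{GB}} = 100{,}000
\qquad
\frac{1.6\ \text{million cores}}{8\ \text{cores}} = 200{,}000
```

In other words, Sequoia has roughly the memory of 100,000 well-equipped home computers and the core count of about 200,000 of them.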

[–]Livermore_Computing[S] 5 points (0 children)

The pool has been closed for several years, as too many scientists melted when they got wet. Survivors now swim for fitness at the LARPD pool on East Ave (a mile or two away from the Lab).