Simulation Scaling

Scalable Physics Simulations

The HEP-CCE has been working with experts at ALCF, OLCF, and NERSC to scale existing generator software to HPC resources. LHC event generators like Alpgen, Sherpa, MadGraph5_aMC@NLO, Pythia, and Herwig play an important role in producing simulated data for LHC experimental analyses. Some of this work is published HERE and has been presented at CHEP2015 (PRESENTATION).

Beyond Leading Order Calculations on HPCs

The CCE held a workshop in September 2016 that brought together experts from these generator communities, the perturbative and lattice QCD community, and the DOE HPC facilities. There was in-depth discussion of how generators perform their calculations and how these algorithms could be altered or improved to reach millions of parallel threads on DOE supercomputers. One of the main outcomes of the workshop was the decision that a common MC integrator that runs on desktops as well as supercomputers would greatly benefit both generator developers and perturbative QCD theorists.

Scaling Alpgen to Millions of Threads

Our first step into this area of R&D involved scaling the Alpgen leading-order event generator on the Mira supercomputer at the Argonne Leadership Computing Facility. Event generators typically consist of two phases, integration and event generation. We focused on wrapping the Alpgen event generation phase with MPI to handle random number seeds and file I/O. This ensured that the parallel ranks of Alpgen did not generate the same events and did not write to the same output files. In addition, we coupled the weighted event generation phase to the unweighting by storing intermediate data files in node-level RAM-disks, reducing the file I/O that was hindering performance. All of the details of this work have been published HERE. The work has also been presented at CHEP2015 (PRESENTATION).
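
As an illustration of the bookkeeping such a wrapper has to do, the minimal sketch below derives a distinct seed and a distinct output file name from an MPI rank. It is not the code that ran on Mira: the function name rank_config, the seed formula, and the file-name pattern are all hypothetical, and in a real job the rank would be obtained from the MPI library rather than passed in by hand.

```python
def rank_config(rank, base_seed=12345, prefix="alpgen"):
    """Return a (seed, output_file) pair that is unique to this MPI rank.

    Hypothetical scheme: offset a job-wide base seed by the rank and
    embed the zero-padded rank in the output file name.
    """
    if rank < 0:
        raise ValueError("rank must be non-negative")
    seed = base_seed + rank
    output_file = f"{prefix}.rank{rank:07d}.dat"
    return seed, output_file


if __name__ == "__main__":
    # Show the configuration for a few ranks of a large job.
    for rank in (0, 1, 1_048_575):
        print(rank, *rank_config(rank))
```

Because both values are pure functions of the rank, no communication between ranks is needed to guarantee that seeds and file names do not collide.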

Diagram of the workflow used to run parallel Alpgen event generation on Mira at the scale of millions of parallel MPI ranks.
ALCC usage at Mira during the 2014-2015 allocation, which was largely used for Alpgen event generation of processes with vector bosons in association with up to five jets.

Scaling Geant4 to Three Million Threads

This is a compilation of things we have learned scaling up Andrea Dotti’s HepExpMT application to full machine jobs on ALCF’s Mira.

Threads: In the job script, include the line SET G4FORCENUMBEROFTHREADS=64. This provides about 57 times the throughput of a single thread, or roughly 90% of perfect scaling. In general, the product of the number of ranks and threads per rank should equal the number of physical threads, and it’s best to run as many threads per rank as you can.
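
The arithmetic behind this advice can be checked with a short sketch. The figure of 64 hardware threads per node is inferred from the text (512 nodes correspond to 32K threads); the function names are illustrative only.

```python
HW_THREADS_PER_NODE = 64  # inferred from the text: 512 nodes <-> 32K threads


def rank_thread_layouts(hw_threads=HW_THREADS_PER_NODE):
    """List (ranks_per_node, threads_per_rank) pairs whose product fills the node."""
    return [(r, hw_threads // r)
            for r in range(1, hw_threads + 1)
            if hw_threads % r == 0]


def scaling_efficiency(speedup, threads):
    """Fraction of ideal linear scaling achieved by a measured speedup."""
    return speedup / threads


if __name__ == "__main__":
    print(rank_thread_layouts())
    print(f"{scaling_efficiency(57, 64):.1%}")
```

The first layout in the list, one rank with 64 threads, is the one recommended above; a measured speedup of 57 on 64 threads corresponds to an efficiency of 57/64, or about 89%.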

Random numbers: HepExpMT seeds its random number generators with random numbers. This is probably not the best practice: it doesn’t make the numbers any more random, and it does cause some problems. Fundamentally, RNGs produce integers, and if a floating-point number is desired, the integer is converted to one, usually by dividing by the largest allowed integer. Geant usually deals with double-precision numbers, but not all RNGs populate all 53 bits of the mantissa. If an RNG produces a 24-bit number, for example, there are only about 16.8 million (2^24) distinct values, so duplicate events are inevitable.
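
To see why duplicates are unavoidable, the sketch below applies the standard birthday-problem estimate to seeds drawn uniformly from a space of a given bit width. It assumes independent, uniformly distributed seeds and uses the usual Poisson approximation; the function names are illustrative.

```python
import math


def expected_duplicate_pairs(n_seeds, seed_bits):
    """Expected number of pairs of identical seeds among n_seeds uniform draws."""
    space = 2 ** seed_bits
    return n_seeds * (n_seeds - 1) / (2 * space)


def collision_probability(n_seeds, seed_bits):
    """Poisson approximation to the chance that at least two draws coincide."""
    return -math.expm1(-expected_duplicate_pairs(n_seeds, seed_bits))


if __name__ == "__main__":
    n_threads = 2 ** 20  # about one million threads
    for bits in (24, 53, 64):
        print(bits,
              expected_duplicate_pairs(n_threads, bits),
              collision_probability(n_threads, bits))
```

With about one million threads and 24-bit seeds (2^24 = 16,777,216 possible values), the estimate gives roughly 33,000 expected duplicate pairs, so a collision is essentially certain. With 53 or more bits the expected number drops well below one.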

We kept the style of seeding (despite this being of questionable benefit) but instead seed with three random 24-bit numbers combined to become a single 64-bit number to serve as the basis of the actual seed. This prevents the event duplication we saw.
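
A minimal sketch of this kind of seed construction follows. The text does not say how the three 24-bit values are combined, so the scheme here is an assumption: the values are concatenated into 72 bits and the low 64 bits are kept. The names combine_seed and draw_seed are hypothetical.

```python
import random

MASK_24 = (1 << 24) - 1
MASK_64 = (1 << 64) - 1


def combine_seed(a, b, c):
    """Pack three 24-bit integers into one 64-bit seed (assumed scheme)."""
    for value in (a, b, c):
        if not 0 <= value <= MASK_24:
            raise ValueError("each input must fit in 24 bits")
    packed = (a << 48) | (b << 24) | c  # 72 bits in total
    return packed & MASK_64             # keep only the low 64 bits


def draw_seed(rng):
    """Draw three 24-bit numbers from rng and combine them into one seed."""
    return combine_seed(*(rng.getrandbits(24) for _ in range(3)))


if __name__ == "__main__":
    rng = random.Random(2016)  # stand-in for the application's seeding RNG
    print(hex(draw_seed(rng)))
```

Under this assumed scheme the top 8 bits of the first value are discarded, leaving 64 bits of seed, which is far more than the 24 bits that led to duplicate events.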

Scaling Baseline: With only these changes you can run in 512-node partitions (32K threads, the minimum Mira partition) with about a 20% loss in performance relative to a single node. Going to 1024-node partitions causes an additional 30% loss, and going to 2048 nodes is slower than 1024.

Scaling Improvement 1: HepExpMT writes out a status message with every event. While the total output is not unusually large, the sheer number of threads, each trying to write simultaneously to its own file, drags the system down. Having each rank write to its own file is better, and placing all these files on a RAMdisk is better still. Doing this requires a file aggregation phase in which the RAMdisk files are combined and written to physical disk. We have found that 128-512 ranks per output file is best: 128 is most performant at small scales, likely because there are 128 ranks sharing an I/O node, and 512 works best for very large partitions because you are not writing too many files to a single directory.
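
The aggregation phase can be sketched as follows. This is a serial toy version under stated assumptions, not the procedure used on Mira: the per-rank file names, the output file names, and the functions aggregate_index and aggregate are hypothetical, and a production job would spread the copying over many ranks.

```python
import os


def aggregate_index(rank, ranks_per_file=128):
    """Index of the aggregated output file that this rank's data goes into."""
    return rank // ranks_per_file


def aggregate(ramdisk_dir, out_dir, n_ranks, ranks_per_file=128):
    """Concatenate per-rank RAM-disk files into grouped files in out_dir."""
    written = []
    for first in range(0, n_ranks, ranks_per_file):
        index = aggregate_index(first, ranks_per_file)
        target = os.path.join(out_dir, f"events.{index:05d}.out")
        last = min(first + ranks_per_file, n_ranks)
        with open(target, "wb") as dst:
            for rank in range(first, last):
                source = os.path.join(ramdisk_dir, f"rank{rank:07d}.out")
                with open(source, "rb") as src:
                    dst.write(src.read())
        written.append(target)
    return written
```

Changing ranks_per_file from 128 to 512 reduces the number of files written to the output directory by a factor of four, which is the trade-off described above for very large partitions.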

Once the output data is going elsewhere, the event status messages can be disabled. This is done by redirecting stdout and stderr to /dev/null. We do this for all but the first and last ranks. That allows us to ensure that the job is making progress without producing more output than can be handled.
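
The rank selection can be illustrated with a short sketch. It is only an analogy for what is done in the job: it redirects Python-level output, whereas a compiled application would be redirected at the file-descriptor level or in the job script. The names keeps_output and maybe_silenced are hypothetical.

```python
import contextlib
import os


def keeps_output(rank, n_ranks):
    """Only the first and last ranks keep their status messages."""
    return rank == 0 or rank == n_ranks - 1


@contextlib.contextmanager
def maybe_silenced(rank, n_ranks):
    """Send stdout and stderr to os.devnull unless this rank keeps its output."""
    if keeps_output(rank, n_ranks):
        yield
        return
    with open(os.devnull, "w") as sink, \
            contextlib.redirect_stdout(sink), \
            contextlib.redirect_stderr(sink):
        yield


if __name__ == "__main__":
    n_ranks = 4
    for rank in range(n_ranks):
        with maybe_silenced(rank, n_ranks):
            print(f"rank {rank}: event done")  # visible for ranks 0 and 3 only
```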

Scaling Improvement 2: Geant4 has a “just in time” database reading strategy. This has the unfortunate effect of causing many processes to try to read the same files at the same time, but out of sequence with one another, which undermines system-level attempts to minimize duplicate reads. Placing some of the databases on a RAMdisk solves this: it handles the multiple reads more sensibly, reduces the number of disk reads, and serves subsequent reads from memory. The Mira RAMdisk is limited to just over 1000 files, but fortunately only 244 files in the directories G4LEDATA, G4NEUTRONXSDATA, G4ENSDFSTATEDATA, and G4SAIDXSDATA are needed.
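
A staging step of this kind might look like the sketch below. It assumes that the four names are environment variables pointing at the data directories, copies each directory whole rather than selecting only the needed files, and uses a round limit of 1000 files as a placeholder for the actual RAMdisk limit. The functions count_files and stage_to_ramdisk are hypothetical.

```python
import os
import shutil

DATA_VARS = ("G4LEDATA", "G4NEUTRONXSDATA", "G4ENSDFSTATEDATA", "G4SAIDXSDATA")


def count_files(root):
    """Number of regular files below root."""
    return sum(len(files) for _, _, files in os.walk(root))


def stage_to_ramdisk(ramdisk, env, max_files=1000):
    """Copy each data directory named in env to ramdisk; return the updated env."""
    sources = {var: env[var] for var in DATA_VARS}
    total = sum(count_files(path) for path in sources.values())
    if total > max_files:
        raise RuntimeError(
            f"{total} files exceed the RAM-disk limit of {max_files}")
    staged = dict(env)  # leave the caller's mapping untouched
    for var, path in sources.items():
        target = os.path.join(ramdisk, var)
        shutil.copytree(path, target)  # target must not exist yet
        staged[var] = target           # point the variable at the RAM-disk copy
    return staged
```

The returned mapping would be used as the environment of the simulation process, so that its first read of each database hits the RAMdisk copy instead of the shared file system.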

Scaling Improvement 3: These two changes allow for good scaling up to 48K-node jobs. However, about one event in every few million never finishes. This has been tracked down to low-energy (keV-scale) particles, usually protons, in media of low density. Such particles take many steps with minimal energy loss, and thus take an inordinate amount of time for no real gain in precision. HepExpMT has a built-in track killer. Enabling it with a threshold of 3800 steps causes it to be applied to 0.1% of events. Tightening the threshold to 1800 steps causes it to be applied to 1% of events, with no visible effect on the deposited energy. This reduces the average time for a job of 200 events per thread from 101 minutes to 95 and, perhaps more importantly, reduces the standard deviation of this time from 143 seconds to 69 seconds.

With these changes, we can run at scale with highly predictable times. We’ve generated four billion events using 16K-node partitions (over one million threads) and about a billion using 48K-node partitions, which is all of Mira. There have been some segmentation violations at the one-per-billion level; the common factor in these failures was that they ran in a partition containing a particular rack, one that later went down with hardware problems. This is not conclusive, of course, but it is suggestive, particularly since similar behavior was observed on Cetus, where failing jobs succeeded when rerun on different partitions.
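
The idea of a step-count cut can be shown with a toy model. The sketch below is not Geant4 or HepExpMT code: it steps a fictitious particle that loses a fixed amount of energy per step and abandons the track once a step limit is reached. The function name and its parameters are hypothetical.

```python
def track_particle(energy_kev, loss_per_step_kev, max_steps=None):
    """Step a toy particle until it stops or reaches max_steps.

    Returns (steps_taken, killed), where killed is True if the track
    was abandoned because of the step limit.
    """
    steps = 0
    while energy_kev > 0.0:
        if max_steps is not None and steps >= max_steps:
            return steps, True
        energy_kev -= loss_per_step_kev
        steps += 1
    return steps, False


if __name__ == "__main__":
    # A particle with tiny energy loss per step, with and without the cut.
    print(track_particle(8.0, 2 ** -10))
    print(track_particle(8.0, 2 ** -10, max_steps=1800))
```

A track that loses energy quickly finishes long before the limit and is unaffected, while a track with minimal energy loss per step is cut off, which bounds the worst-case time per event.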

An HEP Collision Point