HEP Data Analysis with Big Data and HPC
To provide an easy-to-use programming interface for accessing and analyzing large-scale distributed datasets on HPC resources for data-intensive science.
Analyzing petabytes of experimental physics data requires repeated, expensive evaluation of selection and filtering criteria. These evaluations consist of several intermediate steps, each requiring storage of smaller, reduced data sets on disk, which can later be accessed interactively for end-user analysis. These steps can take anywhere from days to weeks. To improve time-to-physics and enable interactive analysis of petabyte-scale datasets without the significant overhead of bookkeeping and intermediate files, we need to explore alternative approaches such as in-memory data processing. New large-scale, state-of-the-art computing systems provide tremendous compute and memory resources, prompting us to rethink the traditional methods of HEP data analytics. For example, Cori, a supercomputer at NERSC, provides a total of 203 terabytes of memory across its Haswell nodes, with a sustained application performance of 83 TFlop/s. However, choosing programming abstractions to efficiently access and analyze HEP data on these large machines remains a challenge. In this project, using several use cases from experimental HEP analyses, our goal is to identify the most suitable technology for HEP. We are considering MPI (the traditional HPC approach), Spark (a de facto industry standard), and Dask (distributed processing in Python).
Use case 1: Dark Matter search
As a first major use case, we implemented a Dark Matter search from the CMS experiment at the LHC using Scala/Spark. The CMS detector records proton-proton collisions produced by the LHC at CERN, which can be used to search for Dark Matter. Our code is based on an existing C++ application that operates on data in ROOT format; however, the C++ application uses serial, event-based processing. The data consists of event information, including all the particles produced in each collision. Our implementation uses Spark SQL and Scala, i.e., a combination of declarative and functional programming, to describe operations on event data.
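The essence of such an analysis is applying selection cuts to event records. As a minimal illustrative sketch in plain Python (the actual implementation uses Spark SQL/Scala; the field names and cut values below are hypothetical, not the real CMS Dark Matter selection):

```python
# Sketch of cut-based event selection, the kind of operation expressed
# declaratively over DataFrame columns in the Spark/Scala implementation.
# Field names and cut values are hypothetical.

events = [
    {"met": 250.0, "n_jets": 3, "lead_jet_pt": 120.0},
    {"met": 80.0,  "n_jets": 1, "lead_jet_pt": 40.0},
    {"met": 310.0, "n_jets": 2, "lead_jet_pt": 95.0},
]

def passes_selection(event):
    # Example cuts: large missing transverse energy and a hard leading jet.
    return event["met"] > 200.0 and event["lead_jet_pt"] > 90.0

selected = [e for e in events if passes_selection(e)]
print(len(selected))  # number of events surviving the cuts
```

In the Spark version the same predicate becomes a column expression (e.g. a `filter` over a DataFrame), letting the engine parallelize the scan across the dataset instead of looping over events serially.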
The above figure shows the current flow of operations and data, the frequency of each operation, and the time taken by each. The proposed and implemented flow is shown in bold. The arrow marks the starting point of our setup, at which point the tabular data is converted to the HDF5 format. Once the data is loaded into memory in Spark DataFrames, subsequent query operations take a couple of seconds and show potential for interactive analysis. The size of the data in the first bold box does not represent conversion of the full 2 TB; it includes only the particle data sets needed for this particular analysis task.
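The key point is paying the load cost once and then serving repeated queries from memory, with no intermediate files. A stdlib-only Python sketch of that pattern (in the real system a cached Spark DataFrame plays the role of the in-memory list; the data below is synthetic):

```python
# Sketch of the in-memory analysis pattern: load once, then run repeated
# interactive queries against the cached data. Synthetic values stand in
# for the particle data sets held in Spark DataFrames.

cached_events = [{"met": float(i % 500)} for i in range(10_000)]  # loaded once

def query(threshold):
    # Each query scans in-memory data only; nothing is written to disk.
    return sum(1 for e in cached_events if e["met"] > threshold)

counts = [query(t) for t in (100.0, 200.0, 300.0)]  # repeated cheap queries
print(counts)
```

This is why the query step drops to seconds once the expensive HDF5 load has populated the DataFrames: every subsequent selection reuses the cached in-memory table.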
The above figure shows initial scaling results for the Spark implementation on Edison for the CMS Dark Matter analysis (360 million events). Steps 1-3 are the I/O time and Step 4 is the analysis time.
Details can be found at https://github.com/sabasehrish/spark-hdf5-cms
We would like to scale up to terabytes using LArTPC data sets.
- “Spark and HPC for HEP data analyses”, High Performance Big Data Computing workshop at IPDPS 2017
- “Exploring the Performance of Spark for a Scientific Use case”, High Performance Big Data Computing workshop at IPDPS 2016