PI: Xian-He Sun, Illinois Institute of Technology

ANL Contact:  Kalyan Kumaran, ALCF

Project Description: As the boundary between High-Performance Computing (HPC) and High-Performance Data Analytics (HPDA) continues to blur, conventional compute-centric HPC and newly emerged data-driven Big Data applications are converging. HPC is becoming more data-intensive, while HPDA requires more computing power. Software environments such as MapReduce and Spark were developed for HPDA and are popular there. However, these frameworks were not designed for HPC and are not compatible with HPC storage subsystems. This research proposes the design and development of a unified data access framework, named IRIS, for the integration of compute-centric and data-centric storage solutions.

The intellectual merit of this research is three-fold:

1) Mapping of incompatible structures: Efficiently mapping a file to key-value pairs, and vice versa, is a challenging task.

2) Maintaining metadata information: Since IRIS is a unified storage layer, it needs to maintain compatibility with legacy codes. IRIS will address this challenge with tunable consistency, whose design and implementation choices need to be carefully studied.

3) Minimizing overhead and memory footprint of IRIS solutions: Mapping of incompatible structures can cause excessive memory usage; this will be addressed in this research.
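
To make the first challenge concrete, the following is a minimal sketch of one possible file-to-key-value mapping. It is an illustration only and not the IRIS design: the fixed-size chunking scheme, the "<name>#<offset>" key format, and the names file_to_kv, kv_to_file, and CHUNK_SIZE are all assumptions introduced here.

```python
from typing import Dict

# Hypothetical chunk size in bytes; a real system would tune this.
CHUNK_SIZE = 4


def file_to_kv(name: str, data: bytes, chunk_size: int = CHUNK_SIZE) -> Dict[str, bytes]:
    """Split a file's bytes into fixed-size chunks keyed by '<name>#<offset>'."""
    return {
        f"{name}#{offset}": data[offset:offset + chunk_size]
        for offset in range(0, len(data), chunk_size)
    }


def kv_to_file(name: str, kv: Dict[str, bytes]) -> bytes:
    """Reassemble a file by ordering its chunks by byte offset."""
    chunks = []
    for key, value in kv.items():
        # Split on the last '#' so file names containing '#' still work.
        key_name, _, offset = key.rpartition("#")
        if key_name == name:
            chunks.append((int(offset), value))
    return b"".join(value for _, value in sorted(chunks))


if __name__ == "__main__":
    pairs = file_to_kv("out.dat", b"abcdefghij")
    print(pairs)
    print(kv_to_file("out.dat", pairs))
```

Even in this toy form, the third challenge is visible: the key-value view holds a second copy of the file's contents plus one key per chunk, so a smaller chunk size means more keys and more memory overhead.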

This project is expected to have significant impact, including bridging data generation and data analysis processes; promoting collaboration between the model simulation and data analysis communities; and building a foundational element for next-generation integrated storage systems. This research will create advanced solutions and technologies that directly improve the efficiency of data access and management at scale. Since Big Data is a national strategic infrastructure for science, engineering, and industry, this research will advance a broad range of fields. It aims to make significant progress toward a unified storage access system.

All data generated from this project will be stored in an electronic format and preserved on the server machines at the Illinois Institute of Technology (IIT): http://cs.iit.edu/~scs/. The server machines have hot-copy backup disks for backing up the primary copy of all data. A secondary copy of all data will be kept on the server machines in the computer science department at IIT on a per-semester basis. The data will be transferred to new storage devices every 2 years. All data will be retained for 3 years after the project completion date. This award reflects NSF’s statutory mission and has been deemed worthy of support through evaluation using the Foundation’s intellectual merit and broader impacts review criteria.

Testbed: Open Intel Resources, including KNL machines. Schedule: Starting 03/14/2019, for 6 months.