Non-blocking collective I/O

May 6th, 2015

The MPI standard defines non-blocking communication. It also defines non-blocking (independent) I/O. When it comes to collective I/O, the choices are blocking I/O or the little-developed and little-used “split collectives”.

The HDF Group pushed to add true non-blocking collective I/O to the MPI standard. MPI-3.1 finally incorporates this feature. The use cases are motivated by things the HDF5 library would like to do in a portable manner at scale:

  • Modifying metadata of a dataset: Each process has a cache of metadata, so updates are done collectively, which keeps everyone’s cache consistent between memory and file. When evicting entries from this cache, HDF5 could issue a non-blocking collective I/O request for these typically tiny elements, then go do other work.
  • Backgrounding data operations: HDF5 knows a bit about the structure of data on disk due to its file format. It also knows a bit about the data a user will want to operate on. A sufficiently clever HDF5 library could issue non-blocking collective I/O either to read ahead in anticipation of what a user will need, or to maintain a write-back cache.
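
To make the "issue the request, then go do other work" pattern concrete, here is a minimal sketch in plain Python. It is only an analogy: a thread pool stands in for the MPI library, nothing here is collective, and the helper names (write_at, evict_in_background, demo_background_eviction) are hypothetical. In a real MPI-3.1 program the request would come from a call such as MPI_File_iwrite_at_all and be completed with MPI_Wait.

```python
from concurrent.futures import ThreadPoolExecutor
import os
import tempfile


def write_at(path, offset, payload):
    """Write payload at a byte offset; stands in for the real I/O call."""
    with open(path, "r+b") as f:
        f.seek(offset)
        f.write(payload)
    return len(payload)


def evict_in_background(pool, path, dirty_entries):
    """Issue one write per evicted cache entry and return without waiting."""
    return [pool.submit(write_at, path, off, data) for off, data in dirty_entries]


def demo_background_eviction():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    with open(path, "wb") as f:
        f.write(b"\0" * 16)  # pre-size the file so offsets are valid

    # A single worker keeps the writes ordered (assumption made for simplicity).
    with ThreadPoolExecutor(max_workers=1) as pool:
        requests = evict_in_background(pool, path, [(0, b"meta"), (8, b"data")])
        other_work = sum(range(1000))  # overlap: compute while I/O is in flight
        written = sum(r.result() for r in requests)  # plays the role of a wait

    with open(path, "rb") as f:
        contents = f.read()
    os.remove(path)
    return other_work, written, contents
```

The point of the sketch is the ordering: the writes are issued, unrelated work proceeds, and the caller only blocks at the moment it needs the I/O to be complete.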

Sangmin Seo implemented the non-blocking collective I/O routines for ROMIO. Implementers might find it interesting that he used the extended generalized requests we added to MPICH way back in 2007.

This feature is available in mpich-master and in the last few pre-releases.  If you try out non-blocking collective I/O, let us know how it worked.


Deferred Open

August 5th, 2003

When I came to Argonne in 2002, my second project was to implement “deferred open”, where we would skip opening the file if certain hints were given. We never got around to writing a paper about this optimization, though. There’s a brief mention in the ROMIO users guide, but it wouldn’t hurt to have a bit more documentation about this feature.

First, some background. ROMIO has an optimization for collective I/O called “two-phase collective buffering”. When writing, ROMIO selects a subset of processes as “I/O aggregators”. These aggregators are the MPI processes that actually write data to the file, after collecting data from all the other processes. When reading, these I/O aggregators read the data in some file-system friendly way, then scatter the data out to the other MPI processes. Observe that in two-phase, the non-aggregator processes never touch the file. We use this observation to implement a deferred open strategy for non-aggregators.
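
The write side of two-phase can be sketched as a small single-process simulation. This is not ROMIO's code: the function name two_phase_write is hypothetical, ranks are just list indices, the aggregators are assumed to be ranks 0 through num_aggregators - 1, and the file is split into equal contiguous domains, which is only one possible partitioning.

```python
def two_phase_write(pieces, file_size, num_aggregators):
    """Simulate a two-phase collective write.

    pieces: one list per rank of (offset, bytes) pairs that rank wants written.
    Returns the resulting file contents and the set of ranks that touched the file.
    """
    # Each aggregator owns one contiguous file domain (ceiling division).
    domain = -(-file_size // num_aggregators)
    buffers = [bytearray(min(domain, max(0, file_size - a * domain)))
               for a in range(num_aggregators)]

    # Phase 1: every rank ships its data to the aggregator owning that range.
    for rank_pieces in pieces:
        for offset, data in rank_pieces:
            for i, byte in enumerate(data):
                agg, local = divmod(offset + i, domain)
                buffers[agg][local] = byte

    # Phase 2: only aggregators touch the file, one contiguous write each.
    file_contents = bytearray(file_size)
    writers = set()
    for agg, buf in enumerate(buffers):
        if buf:
            start = agg * domain
            file_contents[start:start + len(buf)] = buf
            writers.add(agg)
    return bytes(file_contents), writers


# Four ranks with interleaved 2-byte pieces, two aggregators.
pieces = [
    [(0, b"aa"), (8, b"AA")],
    [(2, b"bb"), (10, b"BB")],
    [(4, b"cc"), (12, b"CC")],
    [(6, b"dd"), (14, b"DD")],
]
contents, writers = two_phase_write(pieces, 16, 2)
```

In this example all four ranks contribute data, but only ranks 0 and 1 end up in writers. Ranks 2 and 3 never needed a file handle at all, which is exactly the observation deferred open exploits.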

To enable deferred open, two hint conditions must be true:

  • romio_cb_write and romio_cb_read must not be “disable”. No file system sets them to “disable” by default, so it’s rare to find this condition not met.
  • romio_no_indep_rw must be “true”. With this hint, the user has told ROMIO “I will not do any independent I/O”. ROMIO will then attempt to avoid opening the file on any non-aggregator processes.
  • optional: The cb_config_list and cb_nodes hints can be given to further control which processes act as aggregators.

The “deferred” part comes from the fact that MPI Info tunables are hints, not contracts. The user might lie to ROMIO, setting “romio_no_indep_rw” to “true” and then going right ahead and carrying out a bunch of independent I/O operations. In that case, ROMIO will open the file just before the independent I/O operation happens: we say the open has been deferred.
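
Here is a minimal sketch of that logic in plain Python, under stated assumptions. The FileHandle class and the deferred_open_enabled helper are hypothetical and are not ROMIO's data structures; hints are modeled as a simple dictionary, and an unset romio_cb_read or romio_cb_write is treated as "not disabled".

```python
import os
import tempfile


def deferred_open_enabled(hints):
    """Both hint conditions described above must hold."""
    cb_ok = (hints.get("romio_cb_write") != "disable"
             and hints.get("romio_cb_read") != "disable")
    return cb_ok and hints.get("romio_no_indep_rw") == "true"


class FileHandle:
    """Hypothetical per-process file handle used only for illustration."""

    def __init__(self, path, is_aggregator, hints):
        self.path = path
        self.fd = None
        # Aggregators always open; other processes skip it when hints allow.
        if is_aggregator or not deferred_open_enabled(hints):
            self._open()

    def _open(self):
        self.fd = os.open(self.path, os.O_RDWR)

    def independent_write(self, offset, data):
        # The user did independent I/O after all: perform the open now.
        if self.fd is None:
            self._open()
        os.lseek(self.fd, offset, os.SEEK_SET)
        return os.write(self.fd, data)

    def close(self):
        if self.fd is not None:
            os.close(self.fd)
            self.fd = None


def demo_deferred_open():
    fd, path = tempfile.mkstemp()
    os.close(fd)
    hints = {"romio_no_indep_rw": "true"}
    agg = FileHandle(path, True, hints)
    non_agg = FileHandle(path, False, hints)
    opened_at_start = (agg.fd is not None, non_agg.fd is not None)
    non_agg.independent_write(0, b"hi")  # breaks the "no independent I/O" promise
    opened_after_write = non_agg.fd is not None
    agg.close()
    non_agg.close()
    os.remove(path)
    return opened_at_start, opened_after_write
```

The design choice worth noting is that the check lives on the independent I/O path rather than in the open call, so a user who keeps the promise never pays for the open on non-aggregators, and a user who breaks it still gets correct behavior.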

