Archive for the ‘Uncategorized’ Category

Useful Environment Variables

February 20th, 2019

I was surprised to find I hadn’t written anything about the environment variables one can set in ROMIO.

I don’t think I have a well-thought-out theory of when to use an environment variable vs when to use an MPI_Info key.

In this post, I’ll talk about ROMIO environment variables in general. The GPFS, PVFS2, and XFS drivers also have environment variables one can set, but those variables need to be set only in very specific instances.

  • ROMIO_FSTYPE_FORCE: ROMIO normally picks a file system driver by calling stat(2) and looking at the fs type field. You can also prefix the path with a driver name (e.g. pvfs2:/path/to/file or ufs:/home/me/stuff), which bypasses the stat check. This environment variable provides a third way. Set the value to the prefix (e.g. export ROMIO_FSTYPE_FORCE="ufs:") and ROMIO will treat every file as if it resides on that file system. You are likely to get some strange behavior if, for instance, you try to make Lustre-specific ioctl() calls on a plain Unix file. I added the facility a while back for cases where it might be hard to modify the path, or where I wanted to rule out a bug in a ROMIO driver.
  • ROMIO_PRINT_HINTS: dump out the hints ROMIO is going to use on this file. Sometimes file systems will override user hints or otherwise communicate something back to the user through hints. Helpful to confirm what you think is going on (as in “hey, I requested this other optimization. Why is that optimization happening?”)
  • ROMIO_HINTS: used to select a custom “system hints” file. See system hints
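The three ways of choosing a driver described above can be sketched in a few lines of C. This is only an illustration and not ROMIO's actual code: the helper name pick_driver is made up, and the precedence shown (a path prefix wins over the environment variable) is an assumption.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not ROMIO's real code: copy the name of the driver
 * that should handle `path` into `driver` (at most len-1 characters).
 * Assumed precedence: an explicit "name:" prefix on the path, then the
 * ROMIO_FSTYPE_FORCE environment variable, then "stat", meaning
 * "fall back to asking the operating system". */
static void pick_driver(const char *path, char *driver, size_t len)
{
    const char *colon = strchr(path, ':');
    const char *slash = strchr(path, '/');
    const char *forced = getenv("ROMIO_FSTYPE_FORCE");
    const char *src;
    size_t n;

    if (colon != NULL && colon != path && (slash == NULL || colon < slash)) {
        /* A colon before the first slash: the text before it is the prefix. */
        src = path;
        n = (size_t) (colon - path);
    } else if (forced != NULL && forced[0] != '\0') {
        /* The variable holds the prefix, e.g. "ufs:"; drop a trailing colon. */
        src = forced;
        n = strlen(forced);
        if (forced[n - 1] == ':')
            n--;
    } else {
        src = "stat";
        n = 4;
    }
    if (n >= len)
        n = len - 1;            /* truncate rather than overflow */
    memcpy(driver, src, n);
    driver[n] = '\0';
}
```

With ROMIO_FSTYPE_FORCE="ufs:" exported, this sketch reports "ufs" for /home/me/stuff but still reports "pvfs2" for pvfs2:/path/to/file.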


New ROMIO features in MPICH-3.3

December 3rd, 2018

It has been a while since the last official MPICH release, but shortly before US Thanksgiving, the MPICH team released MPICH-3.3.

ROMIO’s most noteworthy changes include

  • wholesale reformatting of all the code to a single coding style. Sorry about your fork or branch.
  • use the MPL utility library already used in MPICH. Eliminates a lot of duplicated code
  • We started analyzing the code with Coverity. It found “a few” things for us to fix. Valgrind found a few things too.
  • Internal datatype representation (the “flattened” representation) now stored as an attribute on the datatype and not in an internal global linked list.
  • continued work to make ROMIO 64 bit clean
  • Deleted a bunch of unused file system drivers
  • Added DDN’s ‘IME’ driver.
  • Warn if a file view violates the MPI-IO rule “monotonically non-decreasing file offsets”
  • added support for Lustre lockahead optimization


ROMIO at SC 2018

November 6th, 2018

If you would like to learn more about I/O in High Performance Computing, come check out our SC 2018 tutorial Parallel I/O in Practice. We will cover the hardware and software that makes up the software stack on large parallel computers. MPI-IO takes up a big chunk of time, as do the I/O libraries which typically sit on top of MPI-IO.

If you are interested in ROMIO, you are hopefully familiar with Darshan for collecting statistics and otherwise characterizing your I/O patterns. These other Darshan-related events will likely be of interest to you:

And HDF5, which frequently sits atop ROMIO, will be having a Birds of a Feather session Wednesday.


Collective I/O to overlapping regions

September 6th, 2018

It is an error for multiple MPI processes to write to the same or to overlapping regions of a file. ROMIO will let you get away with this but if your processes are writing different data, I can’t tell you what will end up in the file in the end.

What about reads, though?

ROMIO’s two phase collective buffering algorithm handles overlapping read requests the way you would hope: I/O aggregators read from the file and send the data to the right MPI process. N processes reading a config file, for example, will result in one read followed by a network exchange.
As an aside, ROMIO’s two-phase algorithm is general and so not as good as a “read and broadcast”. If you, the application or library writer, know you are going to have every process read the file, here is one spot (maybe the only spot) where I’d encourage you to (independently) read from one process and broadcast to everyone else.

I bet you are excited to go try this out on some code. Maybe you will have every process read the same elements out of an array. Did you get the performance you expected? Probably not. ROMIO tries to be clever. If the requests from processes are interleaved, ROMIO will carry on with two phase collective I/O. If the requests are not interleaved, then ROMIO will fall back to independent I/O on the assumption that the expense of the two-phase algorithm will not be worth it.

You can see the check here:
In 2018, two-phase is almost always a good idea — even if the requests are well-formed, collective buffering will map request sizes to underlying file system quirks, reduce the overall client count thanks to I/O aggregators, and probably place those aggregators strategically.

You can force ROMIO to always use collective buffering by setting the hint “romio_cb_read” to “enable”. On Blue Gene systems, that is the default setting already. On other platforms, the default is “automatic”, which triggers that check we mentioned.


New driver for DDN’s “Infinite Memory Engine” device

November 2nd, 2017

Data Direct has a storage product called “Infinite Memory Engine”. You can access this accelerated storage through POSIX, but there is also a library-level native interface.

DDN recently contributed a ROMIO driver to use the IME “native” interfaces, and I have merged this into MPICH master for an upcoming release.  Thanks!

For most people, this new feature will only be exciting if you have a DDN storage device with IME.  If you do have such a piece of hardware, you can ask your DDN rep where to get the RPMs for the IME-native library.

I wrote a mocked version for anyone who wants to compile-time test this driver. All it does is directly invoke the POSIX versions. You can get ime-mockup from my gitlab repository.


Lustre tuning

March 7th, 2017

Getting the best performance out of Lustre can be a bit of a challenge. Here are some things to check if you tried out ROMIO on a Lustre file system and did not see the performance you were expecting.

The zeroth step in tuning Lustre is “consult your site-specific documentation”. Your admins will have information about how many Lustre servers are deployed, how much performance you should expect, and any site-specific utilities they have provided to make your life easier. Here are some of the more popular sites:

First, are you using the Lustre file system driver? Nowadays, you would have to go out of your way not to. One can read the romio_filesystem_type hint to confirm.

Next, what is the stripe count? Lustre typically defaults to a stripe count of 1, which means all reads and writes will go to just one server (OST in Lustre parlance). Most systems have tens of OSTs, so the default stripe count is really going to kill performance!

The ‘lfs’ utility can be used to get and set lustre file information.

$ lfs getstripe /path/to/directory
lmm_stripe_count:   1
lmm_stripe_size:    1048576
lmm_pattern:        1
lmm_layout_gen:     0
lmm_stripe_offset:  2
    obdidx       objid       objid       group
         2        14114525       0xd75edd      0x280000400

This directory has a stripe_count of 1. That means any files created in this directory will also have a stripe count of one. This directory would be good for hosting small config files, but large HPC input decks or checkpoint files will not see good performance.

When reading a file, there’s no way to adjust the stripe count. When the file is created, the striping is locked in place. You would have to create a directory with a large stripe count and copy the files into this new directory.

$ lfs setstripe -c 60  /my/new/directory

Now any new file created in “/my/new/directory” will have a stripe count of 60.

If you are creating a new file, you can set the stripe count in ROMIO with the “striping_factor” hint:

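    MPI_Info_create(&info);
    MPI_Info_set(info, "striping_factor", "32");
    /* MPI_MODE_CREATE must be combined with an access mode such as MPI_MODE_WRONLY */
    MPI_File_open(MPI_COMM_WORLD, "foo.chkpt", MPI_MODE_CREATE | MPI_MODE_WRONLY, info, &fh);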
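To see why a stripe count of 1 hurts, it can help to work out which stripe a given file offset lands on. The sketch below uses a simplified round-robin model; the function name stripe_index is hypothetical, and real Lustre layouts also have a starting OST (the lmm_stripe_offset field above) that this model ignores.

```c
#include <stdint.h>

/* Simplified round-robin striping model (an assumption for illustration):
 * the file is cut into stripe_size-byte chunks, and chunk k lands on
 * stripe number k % stripe_count. The starting OST is ignored. */
static uint64_t stripe_index(uint64_t offset, uint64_t stripe_size,
                             uint64_t stripe_count)
{
    return (offset / stripe_size) % stripe_count;
}
```

With a stripe size of 1048576 and a stripe count of 1, every offset maps to stripe 0, so a single OST serves the whole file. With a stripe count of 4, the chunk starting at 5 MiB maps to stripe 1 and its neighbors map to different stripes.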


HDF5-1.10.0 and more scalable metadata

June 9th, 2016

While not exactly ROMIO, the new HDF5 release comes with a nice optimization that benefits all MPI-IO implementations.

In order to know where the objects, datasets, and other information in an HDF5 file are located on disk, a process needs to read the HDF5 metadata. This metadata is scattered across the file, so to find out where everything is located a process will have to issue many tiny read requests. For a long time, each HDF5 process needed to issue these reads. There was no way for one process to examine the file and then tell the other processes about the file layout. When HDF5 programs ran on tens or hundreds of MPI processes, this read overhead was not so bad. As process counts grew, as on Blue Gene for example, these reads started taking up a huge amount of time.

The HDF Group has implemented collective metadata in HDF5-1.10.0. With collective metadata, only one process will read the metadata and broadcast to the other processes. This optimization has worked quite well for Parallel-NetCDF and we’re glad to see it in HDF5. Hopefully, other I/O libraries will learn this lesson and adopt similar scalable approaches.

If you do any reading of HDF5 datasets in parallel, go upgrade to HDF5-1.10.0.


Cleaning out old ROMIO file system drivers

January 5th, 2016

I’m itching to discard some of the little-used file system drivers in ROMIO.


  • GPFS: IBM’s GPFS file system still sees several key deployments
  • NFS: It’s everywhere, even though implementing MPI-IO consistency
    semantics over NFS is difficult at best

  • TESTFS: I find this debugging-oriented file system useful occasionally.
  • UFS: the generic Unix file system driver will be useful for as long as
    POSIX APIs are present.

  • PANFS: Panasas still contributes patches.
  • XFS: SGI’s XFS file system is part of SGI’s MPT, and they still contribute

  • PVFS2: Recent versions are called “OrangeFS”, but the API is still the same and still provides several optimizations not available in other file system drivers.
  • Lustre: deployed on a big chunk of the fastest supercomputers.


  • PIOFS: IBM’s old parallel file system for the SP/2 machine.
  • BlueGene/L: superseded by the BlueGene driver, itself superseded by GPFS.
  • BlueGene: The architecture-specific pieces were merged into a “flavor” of
    gpfs for Blue Gene.

  • BGLockless: this hack (see the bglockless page) lived on far longer than it
    should have.

  • GridFTP: I don’t know if this even compiles any more.
  • NTFS: MPICH has dropped Windows support for several years now.
  • PVFS: superseded by PVFS2 ten years ago.
  • HFS: The HP/Convex (remember Convex?) parallel file system. I found a
    mention of a machine deployed in 1995.

  • PFS: the Paragon (!) file system.
  • SFS: the “Supercomputing File System” from NEC.
  • ZoidFS: an old research project for a filesystem-independent protocol for
    I/O forwarding. While the ZoidFS driver might work, we know that folks
    trying to resurrect the old IOFSL project in 2015 are finding it…

Do you use a file system on the Deprecate/Delete list? Please let me know!


New ROMIO optimizations for Blue Gene /Q

June 5th, 2014

The IBM and Argonne teams have been digging into ROMIO’s collective I/O performance on the Mira supercomputer. These optimizations made it into the MPICH-3.1.1 release, so it seemed like a good time to write a bit about them.

No more “bglockless”: For Blue Gene /L and Blue Gene /P we wrote a ROMIO driver that never called fcntl-style user-space locks. This approach worked great for PVFS, which did not support locks anyway, but had a pleasant side effect of improving performance on GPFS too (as long as you did not care about specific workloads and MPI-IO features). Now, we have removed all the extraneous locks from the default I/O driver. Even better, we kept the locks in the few cases where they were needed: shared file pointers and data sieving writes. Now one does not need to prefix the file name with ‘bglockless:’ or set the BGLOCKLESSMPIO_F_TYPE environment variable. It’s the way it should have been 5 years ago.

Alternate Aggregator Selection:  Collective I/O on Blue Gene has long been the primary way to extract maximum performance.  One good optimization is how ROMIO’s two-phase optimization will deal with GPFS file system block alignment.   Even better is how it selects a subset of MPI processes to carry out I/O.  The other MPI processes route their I/O through these “I/O aggregators”.    On Blue Gene, there are some new ways to select which MPI processes should be aggregators:

  • Default: the N I/O aggregators are assigned depth-first based on connections to the I/O forwarding node.    If a file is not very large, we can end up with many active I/O aggregators assigned to one of these I/O nodes, and some I/O nodes with only idle I/O aggregators.
  • “Balanced”:  set the environment variable GPFSMPIO_BALANCECONTIG to 1 and the I/O aggregators will be selected in a more balanced fashion.  With this setting, even small files will be assigned I/O aggregators across as many I/O nodes as possible.  (there’s a limit: we don’t split file domains any smaller than the GPFS block size)
  • “Point-to-point”:  The general two-phase algorithm is built to handle the case where any process might want to send data to or receive data from  any I/O aggregator.  For simple I/O cases we want the benefits of collective I/O — aggregation to a subset of processes, file system alignment — but don’t need the full overhead of potential “all to all” traffic.   Set the environment variable “GPFSMPIO_P2PCONTIG”  to “1” and if certain workload conditions are met — contiguous data, ranks are writing to the file in order (lower mpi ranks write to earlier parts of the file), and data has no holes — then ROMIO will carry out point-to-point communication among an I/O aggregator and the much smaller subset of processes assigned to it.

We don’t have MPI Info hints for these yet, since they are so new.  Once we have some more experience using them, we can provide hints and guidance on when the hints might make sense.   For now, they are only used if  environment variables are set.
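The workload conditions listed for the “Point-to-point” setting (contiguous data, ranks writing in order, no holes) can be expressed as a small check. This is a hypothetical sketch and not ROMIO's code: the function name and the assumption that each rank writes exactly one contiguous byte range are mine.

```c
#include <stddef.h>

/* Hypothetical check, not ROMIO's code: st[i] and end[i] are the first
 * and last file offsets written by rank i, each rank writing a single
 * contiguous range. Returns 1 when the ranks write in rank order with
 * no holes and no overlap, and 0 otherwise. */
static int ranks_in_order_no_holes(const long long *st, const long long *end,
                                   size_t nprocs)
{
    for (size_t i = 0; i < nprocs; i++) {
        if (end[i] < st[i])
            return 0;           /* empty or malformed range */
        if (i > 0 && st[i] != end[i - 1] + 1)
            return 0;           /* hole, overlap, or out of order */
    }
    return 1;
}
```

Three ranks writing bytes 0-99, 100-199, and 200-299 pass this check. Swapping the last two ranks, or leaving a gap between ranges, fails it.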

Deferred Open revisited: The old “deferred open” optimization, where specifying some hints would have only the I/O aggregators open the file, has not seen a lot of testing over the years.  Turns out it was not working on Blue Gene. We re-worked the deferred open logic, and now it works again.   Codes that open a file only to do a small amount of I/O should see an improvement in open times with this approach.  Oddly, IOR does not show any benefit.  We’re still trying to figure that one out.

No more seeks: An individual lseek() system call is not so expensive on Blue Gene /Q. However, if you have tens of thousands of lseek() system calls, they interact with the outstanding read() and write() calls and can sometimes stall for a long time. We have replaced ‘lseek() + read()’ and ‘lseek() + write()’ with pread() and pwrite().
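The difference between the two styles is easy to see side by side. The following is a minimal POSIX sketch with made-up helper names, not an excerpt from ROMIO:

```c
#include <sys/types.h>
#include <unistd.h>

/* Two system calls: move the file position, then read from it. */
static ssize_t read_at_seek(int fd, void *buf, size_t count, off_t offset)
{
    if (lseek(fd, offset, SEEK_SET) == (off_t) -1)
        return -1;
    return read(fd, buf, count);
}

/* One system call: read at an explicit offset. The file position
 * associated with fd is left untouched. */
static ssize_t read_at_pread(int fd, void *buf, size_t count, off_t offset)
{
    return pread(fd, buf, count, offset);
}
```

Both helpers return the same bytes for the same arguments. The pread() version issues half as many system calls and does not change the file position.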





bglockless

August 5th, 2013

Update: in MPICH-3.1.1 we finally scrapped bglockless (see this writeup on 3.1.1 and Blue Gene enhancements), but it’s still part of the system software on any BG /L, BG /P, or BG /Q machines. The following writeup is perhaps of historical interest, but it will be a while (maybe never) before MPICH-3.1.1 is the default MPI on Blue Gene /Q.

The IBM BGP MPI-IO implementation is designed to the “lowest common denominator”: NFS. So they’re performing some very conservative locking in their ADIO file system driver in order to try to get correct MPI-IO semantics out of what might be an NFS volume underneath.  It’s possible, though, to select an alternate driver that gives better performance in most cases — and terrible, terrible performance in one specific case.

The MPI routine MPI_File_open takes a string “filename” argument. Normally, ROMIO does a stat of the file system to figure out what kind of file system that file lives on, and then selects a “file system driver” (one of the ADIO modules) that might contain file system specific optimizations.

If you provide a prefix, like “ufs:” for traditional unix files, or “pvfs2:” or even “gridftp:”, then that prefix overrides whatever magic detection routines ROMIO would run, and the corresponding “ADIO driver” will be selected.

For Blue Gene /L (L, I tell you!) I wrote a ROMIO driver that made no explicit fcntl() lock calls.  Those lock calls are normally not a big deal, but PVFS v2 did not support fcntl() locks.   I called this driver ‘bglockless’.

Our friends at IBM, in a conservative effort to ensure correctness for all possible file systems, wrapped every I/O operation in an fcntl() lock. 90% of these locks were unnecessary and served only to slow down I/O.

So, the half-day “driver with no locks” project I wrote for PVFS takes on a second life as the “make I/O go fast” driver.

Now here’s the catch, and why we can’t just make “bglockless” the default: certain I/O workloads, if locks are not available, must be carried out in an extremely inefficient manner. Specifically, strided independent writes to a file. Certain rarely used functionality, like shared file pointers and ordered mode operations, is not implemented when locks are disabled.

For Blue Gene /P and /Q, one can set the environment variable BGLOCKLESSMPIO_F_TYPE to 0x47504653 (the GPFS file system magic number). ROMIO will then pretend GPFS is like PVFS and not issue any fcntl() lock commands.
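The magic number is less arbitrary than it looks: read one byte at a time, 0x47504653 spells out “GPFS” in ASCII. A small sketch (helper names are made up for illustration) that parses the value as it would appear in the environment variable and decodes it:

```c
#include <stdint.h>
#include <stdlib.h>

/* Parse a magic number written the way it would appear in the
 * environment, e.g. "0x47504653". Base 0 lets strtoul accept the
 * 0x prefix. */
static uint32_t parse_magic(const char *text)
{
    return (uint32_t) strtoul(text, NULL, 0);
}

/* Unpack a 32-bit magic number into its four bytes, most significant
 * first, and NUL-terminate the result. */
static void magic_to_text(uint32_t magic, char out[5])
{
    out[0] = (char) ((magic >> 24) & 0xff);
    out[1] = (char) ((magic >> 16) & 0xff);
    out[2] = (char) ((magic >> 8) & 0xff);
    out[3] = (char) (magic & 0xff);
    out[4] = '\0';
}
```

Passing the result of parse_magic("0x47504653") to magic_to_text fills the buffer with the string "GPFS".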