Archive for the ‘development’ Category

lustre, pread/pwrite and caddr_t

February 26th, 2015

I just pushed some changes to MPICH master to help ROMIO deal with useful features that might not be available on all platforms and configurations.

The system calls pread(2) and pwrite(2) act just like read(2) and write(2), but they take an additional offset parameter. The intent was to make life easier for threaded applications, but these system calls help with ROMIO scalability, too. The lseek(2) system call isn’t terribly expensive most of the time, but on some platforms with system call forwarding (e.g. Blue Gene/Q), the extra system call can contribute to congestion at the system call forwarding interface.
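
To make the difference concrete, here is a minimal sketch (not ROMIO code) of a positioned write and read that never call lseek(2) to move the file offset. The function name and the scratch-file path are made up for illustration; the only lseek(2) calls are there to observe the offset.

```c
#define _XOPEN_SOURCE 700   /* exposes pread(), pwrite() and mkstemp() */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Writes ten bytes to a scratch file, reads three of them back with a
 * single pread() call, and checks that the descriptor's file offset
 * never moved.  Returns 1 on success, 0 on any failure. */
int demo_pread_keeps_offset(void)
{
    char path[] = "/tmp/pread_demo_XXXXXX";   /* arbitrary scratch file */
    char buf[3];
    int fd = mkstemp(path);
    assert(fd >= 0);                          /* demo only: /tmp must be usable */
    unlink(path);                             /* file goes away once fd is closed */

    int ok = pwrite(fd, "0123456789", 10, 0) == 10;
    off_t before = lseek(fd, 0, SEEK_CUR);    /* only inspects the offset */
    ok = ok && pread(fd, buf, 3, 4) == 3;     /* one call, no seek needed */
    off_t after = lseek(fd, 0, SEEK_CUR);
    close(fd);

    return ok && memcmp(buf, "456", 3) == 0 && before == 0 && after == 0;
}
```

Because the offset travels with each call instead of living in the descriptor, two threads sharing one descriptor cannot trip over each other's seeks.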

Perhaps foolishly, I defined _XOPEN_SOURCE to 600 in ROMIO to make the pread/pwrite prototypes visible. This worked fine on all platforms I could test, but some of our Lustre-using friends have run into problems. ROMIO’s Lustre driver includes lustre.h. That header includes quota.h, and quota.h tries to use the caddr_t datatype, which is not defined if _XOPEN_SOURCE is set to 600.

This problem can show up if you are building ROMIO, MPICH, MVAPICH, or OpenMPI with lustre, and looks like this:

In file included from /usr/include/linux/lustre_user.h:46,
                 from /usr/include/lustre/lustre_user.h:54,
                 from adio/ad_lustre/ad_lustre.h:30,
                 from adio/ad_lustre/ad_lustre_rwcontig.c:20:
/usr/include/sys/quota.h:221: error: expected declaration specifiers or '...' before 'caddr_t'

In this case, what I decided to do was simply provide a re-implemented pread/pwrite in ROMIO if the pread/pwrite prototypes don’t exist.
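
A fallback of that kind can be sketched as below. This is an illustration under my own assumptions, not the code that went into ROMIO: the names fallback_pread and fallback_pwrite are hypothetical, and unlike the real calls these versions move the file offset and are not atomic.

```c
#define _POSIX_C_SOURCE 200809L   /* for mkstemp() in the demo only */
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical stand-in for pread(): seek, then read. */
ssize_t fallback_pread(int fd, void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return read(fd, buf, len);
}

/* Hypothetical stand-in for pwrite(): seek, then write. */
ssize_t fallback_pwrite(int fd, const void *buf, size_t len, off_t offset)
{
    assert(buf != NULL);
    if (lseek(fd, offset, SEEK_SET) == (off_t)-1)
        return -1;
    return write(fd, buf, len);
}

/* Round trip through a scratch file; returns 1 on success, 0 otherwise. */
int demo_fallback_roundtrip(void)
{
    char path[] = "/tmp/fallback_demo_XXXXXX";   /* arbitrary scratch file */
    char out[5];
    int fd = mkstemp(path);
    if (fd < 0)
        return 0;
    unlink(path);                                /* removed once fd is closed */
    ssize_t w = fallback_pwrite(fd, "hello", 5, 100);
    ssize_t r = fallback_pread(fd, out, 5, 100);
    close(fd);
    return w == 5 && r == 5 && memcmp(out, "hello", 5) == 0;
}
```

The trade-off is the extra lseek(2) per operation, which is exactly the cost pread/pwrite avoid, so a fallback like this only makes sense where the real prototypes are unavailable.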

If you ran into this problem, sorry about that. If you found this page because you are facing a similar problem, please try the latest MPICH. The fix consists of two changes to MPICH master.


helpful GDB macro

August 7th, 2014

The fundamental data structure in ROMIO is the “flattened representation” of a datatype: this list of “offset-length” pairs describes any MPI datatype, though perhaps at the cost of memory and computational overhead. For years I have been linking in little utility functions to dump out these lists. Turns out GDB macros can do this for me. I added this to my .gdbinit:

define dump_flattened
    set $flat = $arg0
    set $i = 0
    while $i < $flat->count
        printf "(%ld, %ld)", $flat->indices[$i], $flat->blocklens[$i]
        set $i = $i + 1
    end
    printf "\n"
end
document dump_flattened
Display ROMIO flattened representations (offset-length pairs) of datatypes
example usage: dump_flattened flat_type
end

Then, in GDB, I can simply invoke ‘dump_flattened’ on the flat list node:

(gdb) p flat_buf
$1 = (ADIOI_Flatlist_node *) 0x638b68
(gdb) dump_flattened flat_buf
(0, 1296)(0, 1296)(0, 1296)
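
For comparison, the kind of linked-in utility function the macro replaces might look roughly like this. It is only a sketch: struct flat_node is a hypothetical stand-in that carries just the three fields the macro touches (count, indices, blocklens), with field types assumed for illustration rather than taken from ROMIO's headers.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical stand-in for a flattened-type node. */
struct flat_node {
    long count;        /* number of offset-length pairs */
    long *indices;     /* offsets */
    long *blocklens;   /* lengths */
};

/* Formats every "(offset, length)" pair into out.  Returns the number of
 * characters written (excluding the terminating NUL), or -1 if out is
 * too small or formatting fails. */
int format_flattened(const struct flat_node *flat, char *out, size_t outlen)
{
    size_t used = 0;
    assert(flat != NULL && out != NULL && outlen > 0);
    out[0] = '\0';
    for (long i = 0; i < flat->count; i++) {
        int n = snprintf(out + used, outlen - used, "(%ld, %ld)",
                         flat->indices[i], flat->blocklens[i]);
        if (n < 0 || (size_t)n >= outlen - used)
            return -1;          /* encoding error or buffer too small */
        used += (size_t)n;
    }
    return (int)used;
}
```

The GDB macro does the same walk without recompiling or relinking anything, which is the whole appeal.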

Hopefully future ROMIO hackers will benefit from this sooner than I did.

(note: an earlier version of this post had a bug in it that did not increment the blocklens[] index)


More headaches with 2 GiB I/O

July 11th, 2014

There are quite a few headaches when updating ROMIO and MPICH to be 64-bit clean. Once you’ve cleared all the hurdles to promote types and squash that last overflowing integer parameter, you are not done yet!

Take a look at what happens on OS X (Darwin) when you pass 2147483648 (2^31) bytes to the write system call:

dofilewrite(vfs_context_t ctx, struct fileproc *fp, user_addr_t bufp, user_size_t nbyte, off_t offset, int flags, user_ssize_t *retval)
{
    uio_t auio;
    long error = 0; 
    user_ssize_t bytecnt;
    char uio_buf[ UIO_SIZEOF(1) ];

    if (nbyte > INT_MAX)
        return (EINVAL);

It’s entirely within the standard for write or read to return less than requested, but this is not what Darwin is doing: it rejects the request outright with an error.

FreeBSD 9.2 and 10.0 exhibit similar behavior, which you can see in sys/kern/sys_generic.c:

sys_read(td, uap)
        struct thread *td;
        struct read_args *uap;
{
        struct uio auio;
        struct iovec aiov;
        int error;

        if (uap->nbyte > IOSIZE_MAX)
                return (EINVAL);

So in commit 5b674543 I simply ensure no single write is larger than a signed 32-bit integer can hold.
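
The capping step can be sketched as follows. This is an illustration of the idea, not the contents of that commit: the names MAX_SINGLE_IO, clamp_io_size and calls_needed are made up here, and INT_MAX is chosen as the cap because it matches the Darwin check quoted above.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>

/* The cap below only makes sense if size_t can represent INT_MAX. */
static_assert(sizeof(size_t) >= sizeof(int), "size_t must hold INT_MAX");

/* Largest request handed to a single write() or read() call. */
#define MAX_SINGLE_IO ((size_t)INT_MAX)

/* Trim one request so the kernel never sees more than MAX_SINGLE_IO. */
size_t clamp_io_size(size_t requested)
{
    return requested > MAX_SINGLE_IO ? MAX_SINGLE_IO : requested;
}

/* Calls needed for `total` bytes if every call transfers its full,
 * clamped amount (short writes would add more). */
size_t calls_needed(size_t total)
{
    return total / MAX_SINGLE_IO + (total % MAX_SINGLE_IO != 0);
}
```

With this cap, a request of exactly 2^31 bytes turns into two calls: one for 2^31 - 1 bytes and one for the single remaining byte.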


Large transfers in ROMIO

July 3rd, 2013

Let’s say you had a fat node with lots of memory, and you wanted to write 2 GiB of data out to a file with MPI-IO. How would you do that? You would naturally look at MPI_File_write_all or one of its variants. This is the prototype for MPI_File_write_all:

int MPI_File_write_all(MPI_File fh, const void *buf, int count,
                       MPI_Datatype datatype, MPI_Status *status)

Perfect! Even though C ‘int’ types are 32 bits on most platforms, any count up to 2^31 - 1 (2147483647) fits into a signed int, so 2 GiB can be described with, say, 2^29 four-byte integers. But until recently, when we tried to do this in ROMIO, we failed.

It turns out we were, despite ROMIO being 15 years old, using POSIX system calls incorrectly. The write(2) system call doesn’t actually have to write out all the data you asked it to. It’s perfectly legal to return success, but only write a few bytes of your request. For most transfers, though, a successful write and a “full write” were the same thing, and so we got by for years without testing for “short writes”. Until recently.
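
Handling short writes generally means looping until every byte is out. The following is a minimal sketch of that common idiom, not the ROMIO fix itself; write_fully is a name I made up for this example.

```c
#include <assert.h>
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Keep calling write() until all `len` bytes are written or an error
 * occurs.  Returns 0 on success, -1 on error (errno is set by write()). */
int write_fully(int fd, const void *buf, size_t len)
{
    const char *p = buf;
    assert(buf != NULL || len == 0);
    while (len > 0) {
        ssize_t n = write(fd, p, len);
        if (n < 0) {
            if (errno == EINTR)
                continue;       /* interrupted before writing anything: retry */
            return -1;
        }
        p += n;                 /* short write: skip what was already written */
        len -= (size_t)n;
    }
    return 0;
}
```

A caller that used to treat any non-negative return from write(2) as complete success can call this instead and only has to distinguish 0 from -1.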

I recently fixed this in git, so folks who were having difficulty with large-ish transfers (in 2013, 2 GiB isn’t that large) should enjoy the next MPICH release.

Note that this fix is not the same as transferring more than 2 GiB of data.  That work requires a bit more attention to the MPI type system.  I’ll write up a bit about that some other time.