% \documentstyle[11pt,psfig]{article}
\documentclass[11pt]{article}
\hoffset=-.7in
\voffset=-.6in
\textwidth=6.5in
\textheight=8.5in

\begin{document}
\vspace*{-1in}
\thispagestyle{empty}
\begin{center}
ARGONNE NATIONAL LABORATORY \\
9700 South Cass Avenue \\
Argonne, IL 60439
\end{center}
\vskip .5 in

\begin{center}
\rule{1.75in}{.01in} \\
\vspace{.1in}

ANL/MCS-TM-XXX \\

\rule{1.75in}{.01in} \\

\vskip 1.3 in
{\Large\bf A Guide to the ROMIO MPI-IO Implementation } \\
by \\ [2ex]
{\large\it Robert Ross, Robert Latham, and Rajeev Thakur}
\vspace{1in}

Mathematics and Computer Science Division

\bigskip

Technical Memorandum No.\ XXX


% \vspace{1.4in}
% Revised May 2004

\end{center}

\vfill

{\small
\noindent
This work was supported by the Mathematical, Information, and
Computational Sciences Division subprogram of the Office of Advanced
Scientific Computing Research, U.S. Department of Energy, under
Contract W-31-109-Eng-38; and by the Scalable I/O Initiative, a
multiagency project funded by the Defense Advanced Research Projects
Agency (Contract DABT63-94-C-0049), the Department of Energy, the
National Aeronautics and Space Administration, and the National
Science Foundation.}

\newpage


%% Line Spacing (e.g., \ls{1} for single, \ls{2} for double, even \ls{1.5})

\newcommand{\ls}[1]
   {\dimen0=\fontdimen6\the\font
    \lineskip=#1\dimen0
    \advance\lineskip.5\fontdimen5\the\font
    \advance\lineskip-\dimen0
    \lineskiplimit=.9\lineskip
    \baselineskip=\lineskip
    \advance\baselineskip\dimen0
    \normallineskip\lineskip
    \normallineskiplimit\lineskiplimit
    \normalbaselineskip\baselineskip
    \ignorespaces
   }
\renewcommand{\baselinestretch}{1}
\newcommand {\ix} {\hspace*{2em}}
\newcommand {\mc} {\multicolumn}

\tableofcontents
\thispagestyle{empty}
\newpage

\pagenumbering{arabic}
\setcounter{page}{1}
\begin{center}
{\bf Users Guide for ROMIO: A High-Performance,\\[1ex]
Portable MPI-IO Implementation} \\ [2ex]
by \\ [2ex]
{\it Rajeev Thakur, Robert Ross, Ewing Lusk, and William Gropp}

\end{center}
\addcontentsline{toc}{section}{Abstract}
\begin{abstract}
\noindent
ROMIO is a high-performance, portable implementation of MPI-IO (the
I/O chapter in \mbox{MPI-2}).
This document describes the internals of the ROMIO implementation.
\end{abstract}

\section{Introduction}

The ROMIO MPI-IO implementation, originally written by Rajeev Thakur, has been
in existence since XXX.

... Discussion of the evolution of ROMIO ...

Architecturally, ROMIO is divided into three layers: a layer implementing
the MPI I/O routines in terms of an abstract device for I/O (ADIO), a layer of
common code implementing a subset of the ADIO interface, and a set of
storage-system-specific functions that complete the ADIO implementation for
a particular storage type.  These three layers work together to provide I/O
support for MPI applications.

In this document we discuss the details of the ROMIO implementation,
including the major components, how those components are implemented, and
where those components are located in the ROMIO source tree.

\section{The Directory Structure}

The ROMIO directory structure consists of two main branches, the MPI-IO branch
(mpi-io) and the ADIO branch (adio).  The MPI-IO branch contains code that
implements the functions defined in the MPI-2 specification for I/O, such as
MPI\_File\_open.  These functions are written in terms of the ADIO functions,
which provide an abstract interface to I/O resources.
There is an additional glue subdirectory in the MPI-IO branch that defines
functions related to the MPI implementation as a whole, such as how to
allocate MPI\_File structures and how to report errors.

Code for the ADIO functions is located under the ADIO branch.  This code is
responsible for performing I/O operations on whatever underlying storage is
available.  There are two categories of directories in this branch.  The first
is the common directory.  This directory contains two distinct types of
source: source that is used by all ADIO implementations and source that is
common across many ADIO implementations.  This distinction will become more
apparent when we discuss file system implementations.

The second category of directory in the ADIO branch is the
file-system-specific directory (e.g., ad\_ufs, ad\_pvfs2).  These directories
provide code that is specific to a particular file system type and is built
only if that file system type is selected at configure time.

\section{The Configure Process}

... What can be specified, AIO stuff, where romioconf exists, how to add
another Makefile.in into the list.

\section{File System Implementations}

Each file system implementation exists in its own subdirectory under the adio
directory in the source tree.  Each of these subdirectories must contain at
least two files: a Makefile.in (describing how to build the code in the
directory) and a C source file describing the mapping of ADIO operations to C
functions.

The common practice is to name the C source file after the ADIO
implementation.  In the ad\_ufs implementation this file is called ad\_ufs.c,
and contains the following:

\begin{verbatim}
struct ADIOI_Fns_struct ADIO_UFS_operations = {
    ADIOI_UFS_Open, /* Open */
    ADIOI_GEN_ReadContig, /* ReadContig */
    ADIOI_GEN_WriteContig, /* WriteContig */
    ADIOI_GEN_ReadStridedColl, /* ReadStridedColl */
    ADIOI_GEN_WriteStridedColl, /* WriteStridedColl */
    ADIOI_GEN_SeekIndividual, /* SeekIndividual */
    ADIOI_GEN_Fcntl, /* Fcntl */
    ADIOI_GEN_SetInfo, /* SetInfo */
    ADIOI_GEN_ReadStrided, /* ReadStrided */
    ADIOI_GEN_WriteStrided, /* WriteStrided */
    ADIOI_GEN_Close, /* Close */
    ADIOI_GEN_IreadContig, /* IreadContig */
    ADIOI_GEN_IwriteContig, /* IwriteContig */
    ADIOI_GEN_IODone, /* ReadDone */
    ADIOI_GEN_IODone, /* WriteDone */
    ADIOI_GEN_IOComplete, /* ReadComplete */
    ADIOI_GEN_IOComplete, /* WriteComplete */
    ADIOI_GEN_IreadStrided, /* IreadStrided */
    ADIOI_GEN_IwriteStrided, /* IwriteStrided */
    ADIOI_GEN_Flush, /* Flush */
    ADIOI_GEN_Resize, /* Resize */
    ADIOI_GEN_Delete, /* Delete */
};
\end{verbatim}

The ADIOI\_Fns\_struct structure is defined in adio/include/adioi.h.  This
structure holds pointers to the appropriate functions for a given file system
type.  ``Generic'' functions, defined in adio/common, are denoted by the
``ADIOI\_GEN'' prefix, while file-system-specific functions use a prefix
derived from the file system name.  In this example, the only
file-system-specific function is ADIOI\_UFS\_Open.  All other operations use
the generic versions.

Typically a third file, a header with file-system-specific defines and
includes, is also provided and named after the ADIO implementation
(e.g., ad\_ufs.h).

Because the UFS implementation provides its own open function, that code must
be provided in the ad\_ufs subdirectory.  That function is implemented in
adio/ad\_ufs/ad\_ufs\_open.c.

\section{Generic Functions}

As we saw in the discussion above, generic ADIO function implementations are
used to minimize the amount of code in the ROMIO tree by sharing common
functionality between ADIO implementations.  As the ROMIO implementation has
grown, a few categories of generic implementations have developed.  At this
time, these are all lumped together in the adio/common subdirectory, which
can be confusing.

The easiest category of generic functions to understand consists of those
that implement functionality in terms of some other ADIO function.
ADIOI\_GEN\_ReadStridedColl is a good example of this type of function and is
implemented in adio/common/ad\_read\_coll.c.  This function implements
collective read operations (e.g., MPI\_File\_read\_at\_all).  We will discuss how
it works later in this document, but for the time being it is sufficient to
note that it is written in terms of ADIO ReadStrided or ReadContig calls.

A second category of generic functions consists of those that implement
functionality in terms of POSIX I/O calls.  ADIOI\_GEN\_ReadContig
(adio/common/ad\_read.c) is a good example of this type of function.  These
``generic'' functions exist because a large number of ADIO implementations
are largely POSIX I/O based, such as the UFS, XFS, and PANFS implementations.
We have discussed moving these functions into a separate common/posix
subdirectory and renaming them with ADIOI\_POSIX prefixes, but this has not
been done as of the writing of this document.

The next category of generic functions holds functions that do not actually
require I/O at all.  ADIOI\_GEN\_SeekIndividual (adio/common/ad\_seek.c) is a
good example of this.  Since we do not need to perform I/O at seek
time, we can just update local variables at each process.  In fact, one could
argue that we no longer need the ADIO SeekIndividual function at all: all the
ADIO implementations simply use this generic version (with the exception of
TESTFS, which prints the value as well).

The next category of generic functions consists of the ``FAKE'' functions
(e.g., ADIOI\_FAKE\_IODone, implemented in adio/common/ad\_done\_fake.c).  These
functions are all related to asynchronous I/O (AIO) operations.  They
implement the AIO operations in terms of blocking operations; in other words,
they follow the standard but do not allow for overlap of I/O and computation
or communication.  They are used in cases where AIO support is otherwise
unavailable or unimplemented.
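
The idea behind the ``FAKE'' functions can be sketched in a few lines of C.
This is only an illustration under stated assumptions: the in-memory
toy\_file, the toy\_request record, and all toy\_ function names are
hypothetical and are not ROMIO's types or API.  The ``nonblocking'' read
finishes the whole transfer before it returns, so the completion test has
nothing left to wait for.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory "file" standing in for a real file system. */
struct toy_file {
    const char *data;
    size_t size;
};

/* Hypothetical request record; ROMIO's own request handling differs. */
struct toy_request {
    size_t nbytes;   /* bytes transferred by the blocking call */
    int complete;    /* set to 1 as soon as the "nonblocking" call returns */
};

/* Blocking contiguous read: copy up to len bytes starting at offset. */
static size_t toy_read_contig(const struct toy_file *f, void *buf,
                              size_t len, size_t offset)
{
    if (offset >= f->size)
        return 0;
    if (len > f->size - offset)
        len = f->size - offset;
    memcpy(buf, f->data + offset, len);
    return len;
}

/* "Nonblocking" read built on the blocking one: the data has already been
 * moved when this returns, so no overlap with computation is possible. */
static void toy_fake_iread(const struct toy_file *f, void *buf, size_t len,
                           size_t offset, struct toy_request *req)
{
    assert(f != NULL && buf != NULL && req != NULL);
    req->nbytes = toy_read_contig(f, buf, len, offset);
    req->complete = 1;
}

/* Completion test: nothing is ever pending, so this reports "done". */
static int toy_fake_iodone(const struct toy_request *req)
{
    return req->complete;
}
```

A caller sees the usual start/test pattern of nonblocking I/O, which is why
such an implementation can satisfy the interface while offering none of the
overlap.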

The final category of generic functions consists of the ``na\"{\i}ve''
functions (e.g., ADIOI\_GEN\_WriteStrided\_naive in
adio/common/ad\_write\_str\_naive.c).  These functions avoid the use of certain
optimizations, such as data sieving.

\subsection{Other Things in adio/common}

... what else is in there?

\subsection{Calling ADIO Functions}

Throughout the code you will see calls to functions such as ADIO\_ReadContig.
There is no such function; this is actually a macro defined in
adio/include/adioi.h that calls the particular function out of the correct
ADIOI\_Fns\_struct for the file being accessed.  This is done for convenience.
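
A minimal sketch of this macro-over-function-table pattern follows.  It is
not the definition found in adioi.h: the toy\_fns table, the toy\_file
structure, the TOY\_ReadContig macro, and the ``xyz'' file system are all
hypothetical names chosen for illustration, and the table is reduced to a
single entry.

```c
#include <assert.h>
#include <stddef.h>

struct toy_file;   /* forward declaration for the function pointer type */

/* Hypothetical, much-reduced analogue of a per-file-system function table. */
struct toy_fns {
    int (*read_contig)(struct toy_file *fd, void *buf, int count);
};

/* Each open file carries a pointer to the table for its file system. */
struct toy_file {
    const struct toy_fns *fns;
    int last_impl;   /* records which implementation ran, for illustration */
};

/* What callers write looks like a function call, but it is a macro that
 * selects the entry from the table attached to the file. */
#define TOY_ReadContig(fd, buf, count) \
    ((*((fd)->fns->read_contig))((fd), (buf), (count)))

/* Generic implementation shared by several file systems. */
static int toy_gen_read_contig(struct toy_file *fd, void *buf, int count)
{
    assert(fd != NULL && count >= 0);
    (void)buf;
    fd->last_impl = 0;
    return count;
}

/* File-system-specific implementation for a made-up "xyz" file system. */
static int toy_xyz_read_contig(struct toy_file *fd, void *buf, int count)
{
    assert(fd != NULL && count >= 0);
    (void)buf;
    fd->last_impl = 1;
    return count;
}

static const struct toy_fns toy_gen_operations = { toy_gen_read_contig };
static const struct toy_fns toy_xyz_operations = { toy_xyz_read_contig };
```

The same call site, TOY\_ReadContig(fd, buf, count), reaches different code
depending only on which table the file was given when it was opened.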

Exceptions!!!  ADIO\_Open, ADIO\_Close...

\section{ROMIO Implementation Details}

The ROMIO implementation relies on some basic concepts in order to operate and
to optimize I/O access.  In this section we discuss these concepts and
how they are implemented within ROMIO.  Before we do that, though, we
discuss the core data structure of ROMIO, the ADIO\_File structure.

\subsection{ADIO\_File}

... discussion ...

\subsection{I/O Aggregation and Aggregators}

When performing collective I/O operations, it is often to our advantage to
combine operations or eliminate redundant operations altogether.  We call this
combining process ``aggregation'' and the processes that perform these
combined operations ``aggregators.''

Aggregators are defined at the time the file is opened.  A collection of MPI
hints can be used to tune which processes become aggregators for a given file
(see ROMIO User's Guide).  The aggregators then interact with the file
system during collective operations.

Note that it is possible to implement a system where \emph{all} I/O operations
pass exclusively through aggregators, including independent I/O operations
from non-aggregators.  However, this would require a guarantee of progress
from the aggregators, which for portability would mean adding a thread to
manage I/O.  We have chosen not to pursue this path at this time, so
independent operations continue to be serviced by the process making the call.

... how implemented ...

Rank 0 in the communicator opening a file \emph{always} processes the
cb\_config\_list hint using ADIOI\_cb\_config\_list\_parse.  A previous call to
ADIOI\_cb\_gather\_name\_array has collected the processor names from all hosts
into an array that is cached on the communicator (so we do not have to gather
it more than once).  Parsing the hint against this array creates an ordered
array of ranks (relative to the communicator used to open the file) that will
be aggregators.  This array is distributed to all processes using
ADIOI\_cb\_bcast\_rank\_map.  Aggregators are referenced by their rank in the
communicator used to open the file.  These ranks are stored in
fd->hints->ranklist[].
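
To make the shape of this computation concrete, the following C sketch
builds an ordered rank list from an array of processor names.  It assumes a
deliberately simple policy, one aggregator per distinct host, taking the
first rank seen on each; the actual result depends on the contents of the
cb\_config\_list hint, whose parsing is not reproduced here.  The function
toy\_pick\_aggregators is a hypothetical name, not a ROMIO routine.

```c
#include <assert.h>
#include <string.h>

/* Choose the first rank found on each distinct host as an aggregator.
 * names[r] is the processor name reported by rank r.  The chosen ranks are
 * written to ranklist in increasing order; the return value is their count
 * (at most nprocs). */
static int toy_pick_aggregators(const char *const names[], int nprocs,
                                int ranklist[])
{
    int naggs = 0;

    assert(names != NULL && ranklist != NULL && nprocs >= 0);
    for (int r = 0; r < nprocs; r++) {
        int seen = 0;

        /* Compare against the hosts of the aggregators chosen so far. */
        for (int a = 0; a < naggs; a++) {
            if (strcmp(names[r], names[ranklist[a]]) == 0) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            ranklist[naggs++] = r;
    }
    return naggs;
}
```

In this sketch only one process would need to run the selection; the
resulting array is what would then be broadcast so that every process agrees
on the same list.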

Note that this could be a big list for very large runs.  If we were to
restrict aggregators to a rank-ordered subset, we could use a bitfield instead.
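
As a rough illustration of the bitfield alternative, the sketch below keeps
one bit per rank instead of one integer per aggregator.  The toy\_bits\_
helpers are hypothetical and nothing like them is claimed to exist in ROMIO.
Note that a bitfield records only membership, not an ordering, which is why
it fits only if aggregators are always visited in rank order.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define TOY_WORD_BITS (sizeof(unsigned long) * CHAR_BIT)

/* Allocate a zeroed bitfield with one bit per rank; the caller frees it. */
static unsigned long *toy_bits_alloc(int nprocs)
{
    size_t nwords;

    assert(nprocs > 0);
    nwords = ((size_t)nprocs + TOY_WORD_BITS - 1) / TOY_WORD_BITS;
    return calloc(nwords, sizeof(unsigned long));
}

/* Mark rank as an aggregator. */
static void toy_bits_set(unsigned long *bits, int rank)
{
    assert(bits != NULL && rank >= 0);
    bits[(size_t)rank / TOY_WORD_BITS] |= 1UL << ((size_t)rank % TOY_WORD_BITS);
}

/* Return 1 if rank is marked as an aggregator, 0 otherwise. */
static int toy_bits_test(const unsigned long *bits, int rank)
{
    assert(bits != NULL && rank >= 0);
    return (int)((bits[(size_t)rank / TOY_WORD_BITS]
                  >> ((size_t)rank % TOY_WORD_BITS)) & 1UL);
}
```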

If the user specified hints and met the conditions for deferred open, then a
separate communicator (fd->agg\_comm) is also set up that contains all the
aggregators, in order of their original ranks (not their order in the rank
list).  Otherwise this communicator is set to MPI\_COMM\_NULL; for
non-aggregators it is always set to MPI\_COMM\_NULL.  This communicator is
currently used only at ADIO\_Close (adio/common/ad\_close.c), but could be
useful in two-phase I/O as well (discussed later).

\subsection{Deferred Open}

We do not always want all processes to attempt to actually open a file when
MPI\_File\_open is called.  We might want to avoid this open because
some processes (non-aggregators) cannot access the file at all and would get
an error, or because we want to avoid a storm of system calls
hitting the file system all at once.  In either case, ROMIO implements a
``deferred open'' mode that allows some processes to avoid opening the file
until such time as they perform an independent I/O operation on the file (see
ROMIO User's Guide).

Deferred open has a broad impact on the ROMIO implementation, because with its
addition there are now many places where we must first check whether we have
called the file-system-specific ADIO Open call before performing I/O.  This
impact is limited to the MPI-IO layer by semantically guaranteeing that the
file-system-specific ADIO Open call has been made by the process prior to
calling a read or write function.
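
The check described above can be sketched as follows.  This is an
illustrative model only: the toy\_file structure, its is\_open flag, and the
toy\_ functions are hypothetical stand-ins rather than ROMIO's data
structures, and the real open and read are replaced by stubs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-process file state (not ROMIO's ADIO_File). */
struct toy_file {
    int is_open;      /* has the file-system-specific open run here? */
    int open_calls;   /* counts real opens, for illustration */
};

/* Stand-in for the file-system-specific open. */
static int toy_fs_open(struct toy_file *fd)
{
    fd->open_calls++;
    fd->is_open = 1;
    return 0;
}

/* File open as seen by one process: under deferred open, only aggregators
 * call the file-system-specific open right away. */
static int toy_file_open(struct toy_file *fd, int is_aggregator)
{
    assert(fd != NULL);
    fd->is_open = 0;
    fd->open_calls = 0;
    return is_aggregator ? toy_fs_open(fd) : 0;
}

/* Independent read in the upper layer: complete the deferred open first,
 * so the lower-level read can assume the file is already open. */
static int toy_independent_read(struct toy_file *fd)
{
    assert(fd != NULL);
    if (!fd->is_open) {
        int err = toy_fs_open(fd);
        if (err != 0)
            return err;   /* an open failure surfaces at I/O time */
    }
    /* ... the actual read would be issued here ... */
    return 0;
}
```

Keeping the check in the upper layer means the lower-level read and write
routines never need to test whether the open has happened.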

... how implemented ...

\subsection{Two-Phase I/O}

Two-phase I/O is a technique for increasing the efficiency of I/O operations
by reordering data between processes, either before writes or after reads.

ROMIO implements two-phase I/O as part of the generic implementations of
ADIO\_WriteStridedColl and ADIO\_ReadStridedColl.  These implementations in turn
rely heavily on the aggregation code to determine which processes will actually
perform I/O on behalf of the application as a whole.
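
One ingredient of such a scheme is deciding which aggregator is responsible
for which part of the file.  The sketch below shows one plausible way to do
this, an even split of the accessed byte range into contiguous ``file
domains''; it is an assumption made for illustration, not a description of
the partitioning ROMIO actually performs, and all toy\_ names are
hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Contiguous byte range [start, end] assigned to one aggregator.
 * A domain with start > end is empty. */
struct toy_domain {
    int64_t start;
    int64_t end;
};

/* Split the accessed range [min_off, max_off] evenly among naggs
 * aggregators, rounding the domain size up. */
static void toy_calc_file_domains(int64_t min_off, int64_t max_off,
                                  int naggs, struct toy_domain dom[])
{
    assert(naggs > 0 && min_off <= max_off);
    int64_t total = max_off - min_off + 1;
    int64_t fd_size = (total + naggs - 1) / naggs;   /* ceiling division */

    for (int i = 0; i < naggs; i++) {
        dom[i].start = min_off + (int64_t)i * fd_size;
        dom[i].end = dom[i].start + fd_size - 1;
        if (dom[i].end > max_off)
            dom[i].end = max_off;   /* clamp the trailing domain(s) */
    }
}

/* Index of the aggregator whose domain contains a given file offset. */
static int toy_owner(int64_t offset, int64_t min_off, int64_t max_off,
                     int naggs)
{
    assert(naggs > 0 && offset >= min_off && offset <= max_off);
    int64_t fd_size = (max_off - min_off + naggs) / naggs;
    return (int)((offset - min_off) / fd_size);
}
```

With domains fixed, the reordering phase amounts to sending each piece of
data to (or receiving it from) the owner of its file offset, after which each
aggregator can access its own domain with large contiguous operations.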
\subsection{Data Sieving}

Data sieving is a single-process technique for reducing the number of I/O
operations used to service an MPI read or write operation by accessing a
contiguous region of the file that contains more than one desired region at
once.  Because I/O operations often require data movement across the network,
this is usually a more efficient way to access data.

Data sieving is implemented in the common strided I/O routines
(adio/common/ad\_write\_str.c and adio/common/ad\_read\_str.c).  These functions
use the contig read and write routines to perform actual I/O.  In the case of
a write operation, a read/modify/write sequence is used.  In that case, as
well as in the atomic mode case, locking is required on the region.  Some of
the ADIO implementations do not currently support locking, and in those cases
it would be erroneous to use the generic strided I/O routines.
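
The read side of data sieving can be sketched as below.  The sketch makes
simplifying assumptions that the real routines need not share: the ``file''
is an in-memory buffer that merely counts contiguous reads, the requested
regions are sorted and nonoverlapping, and a single sieve buffer covers the
whole extent rather than a bounded window.  All toy\_ names are hypothetical.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical in-memory "file" that counts contiguous read calls. */
struct toy_file {
    const char *data;
    size_t size;
    int nreads;
};

/* One desired piece of the file. */
struct toy_region {
    size_t offset;
    size_t len;
};

/* Contiguous read; each call stands for one I/O operation. */
static void toy_read_contig(struct toy_file *f, void *buf,
                            size_t offset, size_t len)
{
    assert(offset + len <= f->size);
    memcpy(buf, f->data + offset, len);
    f->nreads++;
}

/* Read n regions (sorted by offset, nonoverlapping) into out, packed back
 * to back, using one contiguous read that spans all of them.  Returns 0 on
 * success or -1 if the sieve buffer cannot be allocated. */
static int toy_sieve_read(struct toy_file *f, const struct toy_region *reg,
                          int n, char *out)
{
    assert(f != NULL && reg != NULL && out != NULL && n > 0);
    size_t lo = reg[0].offset;
    size_t hi = reg[n - 1].offset + reg[n - 1].len;   /* one past the end */
    assert(hi > lo);

    char *sieve = malloc(hi - lo);
    if (sieve == NULL)
        return -1;

    toy_read_contig(f, sieve, lo, hi - lo);   /* single large read */
    for (int i = 0; i < n; i++) {
        /* Keep the wanted bytes; the holes between regions are discarded. */
        memcpy(out, sieve + (reg[i].offset - lo), reg[i].len);
        out += reg[i].len;
    }
    free(sieve);
    return 0;
}
```

A write would follow the same outline but copy user data into the sieve
buffer and write the whole extent back, which is why the holes must be read
first and the extent locked while it is being modified.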

\subsection{Shared File Pointers}

Because no file system supported by ROMIO currently supports a shared file
pointer mode, ROMIO must implement shared file pointers under the covers on
its own.

Currently ROMIO implements shared file pointers by storing the file pointer
value in a separate file...

Note that the ROMIO team has devised a portable method for implementing shared
file pointers using only MPI-1 and MPI-2 functions.  However, this method has
not yet been implemented in ROMIO.

The name of this separate file is selected at the end of mpi-io/open.c.

\subsection{Error Handling}

\subsection{MPI and MPIO Requests}

\section*{Appendix A: ADIO Functions and Semantics}

ADIOI\_Open(ADIO\_File fd, int *error\_code)

Open is used in a strange way in ROMIO, as described previously.

The Open function is used to perform whatever operations are necessary prior
to actually accessing a file using read or write.  The file name for the file
is stored in fd->filename prior to Open being called.

Note that when deferred open is in effect, not all processes call Open
immediately at MPI\_File\_open time; some instead call Open only if they
perform independent I/O.  This can result in somewhat unusual error returns to
processes (e.g., learning that a file is not accessible at write time).

ADIOI\_ReadContig(ADIO\_File fd, void *buf, int count, MPI\_Datatype datatype,
int file\_ptr\_type, ADIO\_Offset offset, ADIO\_Status *status, int *error\_code)

ReadContig is used to read a contiguous region from a file into a contiguous
buffer.  The datatype (which refers to the buffer) can be assumed to be
contiguous.  The offset is in bytes and is an absolute offset if
ADIO\_EXPLICIT\_OFFSET was passed as the file\_ptr\_type or relative to the
current individual file pointer if ADIO\_INDIVIDUAL was passed as
file\_ptr\_type.  Open has been called by this process prior to the call to
ReadContig.  There is no guarantee that any other processes will call this
function at the same time.

ADIOI\_WriteContig(ADIO\_File fd, void *buf, int count, MPI\_Datatype datatype,
int file\_ptr\_type, ADIO\_Offset offset, ADIO\_Status *status, int *error\_code)

WriteContig is used to write a contiguous region to a file from a contiguous
buffer.  The datatype (which refers to the buffer) can be assumed to be
contiguous.  The offset is in bytes and is an absolute offset if
ADIO\_EXPLICIT\_OFFSET was passed as the file\_ptr\_type or relative to the
current individual file pointer if ADIO\_INDIVIDUAL was passed as
file\_ptr\_type.  Open has been called by this process prior to the call to
WriteContig.  There is no guarantee that any other processes will call this
function at the same time.

ADIOI\_ReadStridedColl

ADIOI\_WriteStridedColl

ADIOI\_SeekIndividual

ADIOI\_Fcntl

ADIOI\_SetInfo

ADIOI\_ReadStrided

ADIOI\_WriteStrided

ADIOI\_Close(ADIO\_File fd, int *error\_code)

Close is responsible for releasing any resources associated with an open file.
It is called on all processes that called the corresponding ADIOI Open, which
might not be all the processes that opened the file (due to deferred open).
Thus it is not safe to perform collective communication among all processes in
the communicator during Close, although collective communication between
aggregators would be safe (if desired).

For performance reasons ROMIO does not guarantee that all file data is written
to ``storage'' at MPI\_File\_close, instead performing synchronization
operations only at MPI\_File\_sync time.  As a result, our Close
implementations do not typically call a sync.  However, any locally cached
data should be passed on to the underlying storage system at this time.

Note that ADIOI\_GEN\_Close is implemented in adio/common/adi\_close.c;
ad\_close.c implements ADIO\_Close, which is called by all processes that opened
the file.

ADIOI\_IreadContig

ADIOI\_IwriteContig

ADIOI\_ReadDone

ADIOI\_WriteDone

ADIOI\_ReadComplete

ADIOI\_WriteComplete

ADIOI\_IreadStrided

ADIOI\_IwriteStrided

ADIOI\_Flush

ADIOI\_Resize(ADIO\_File fd, ADIO\_Offset size, int *error\_code)

Resize is called collectively by all processes that opened the file referenced
by fd.  It is not required that the Resize implementation block until all
processes have completed resize operations, but each process should be able to
see the correct size with a corresponding MPI\_File\_get\_size operation (an
independent operation that results in an ADIO Fcntl to obtain the file size).

ADIOI\_Delete(char *filename, int *error\_code)

Delete is called independently, and because only a filename is passed, there
is no opportunity to coordinate deletion if an application were to choose to
have all processes call MPI\_File\_delete.  This is unlikely to be an issue in
practice, though.

\section*{Appendix B: Status of ADIO Implementations}

... who wrote what, status, etc.

\section*{Appendix C: Adding a New ADIO Implementation}

\section*{References}

\end{document}