\charmpp{} offers two checkpoint/restart mechanisms. Each of these
targets a specific need in parallel programming, but both of them are
based on the same infrastructure.

Traditional chare-array-based \charmpp{} applications, including AMPI
applications, can be checkpointed to storage buffers (either files or
memory regions) and be restarted later from those buffers. The basic
idea behind this is straightforward: checkpointing an application is
like migrating its parallel objects from the processors onto buffers,
and restarting is the reverse. Thanks to migration utilities such as
PUP methods (Section~\ref{sec:pup}), users can decide what data to
save in checkpoints and how to save it. However, unlike migration
(where certain objects do not need a PUP method), checkpointing
requires all objects to implement a PUP method.

The two checkpoint/restart schemes implemented are:
\begin{itemize}
\item Shared filesystem: provides support for \emph{split execution}, where the
execution of an application is interrupted and later resumed.
\item Double local-storage: offers an online \emph{fault tolerance}
mechanism for applications running on unreliable machines.
\end{itemize}

\section{Split Execution}

There are several reasons for having to split the execution of an
application. These include protection against job failure, a single
execution needing to run beyond a machine's job time limit, and
resuming execution from an intermediate point with different
parameters. All of these scenarios are supported by a mechanism to
record the execution state and resume execution from it later.

Parallel machines are assembled from many complicated components, each
of which can potentially fail and interrupt execution
unexpectedly. Thus, parallel applications that take a long time to run
from start to completion need to protect themselves from losing work
and having to start over. They can achieve this by periodically taking
a checkpoint of their execution state from which they can later
resume.

Another use of checkpoint/restart is when the total execution time of
the application exceeds the maximum allocation time for a job on a
supercomputer. In that case, an application may checkpoint before the
allocation time expires and then restart from the checkpoint in a
subsequent allocation.

A third reason for splitting an execution is when an application
consists of \emph{phases} and each phase may be run a different number
of times with varying parameters. Consider, for instance, an
application with two phases where the first phase has only one
possible configuration (it is run only once), while the second phase
may have several configurations (for testing various algorithms). In
that case, once the first phase is complete, the application
checkpoints the result, and further executions of the second phase may
simply resume from that checkpoint.

An example of \charmpp{}'s support for split execution can be seen
in \testrefdir{chkpt/hello}.

\subsection{Checkpointing}

\label{sec:diskcheckpoint}
The API to checkpoint the application is:

\begin{alltt}
  void CkStartCheckpoint(char* dirname, const CkCallback& cb);
\end{alltt}

The string {\it dirname} is the destination directory where the
checkpoint files will be stored, and {\it cb} is the callback to be
invoked after the checkpoint is done, as well as when the restart is
complete. Here is an example of typical usage:

\begin{alltt}
  . . .
  CkCallback cb(CkIndex_Hello::SayHi(), helloProxy);
  CkStartCheckpoint("log", cb);
\end{alltt}

A chare array usually has a PUP routine for the sake of migration.
This PUP routine is also used in the checkpointing and restarting
process, so it is up to the programmer to decide what to save and
restore for the application. One illustration of this flexibility is a
complicated scientific computation application with 9 matrices, 8 of
which hold intermediate results and 1 of which holds the final results
of each timestep. To save resources, the PUP routine can omit the 8
intermediate matrices and checkpoint only the matrix with the final
results of each timestep.

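For instance, a minimal sketch of such a selective PUP routine might
look like the following (the array element, member names, and matrix
sizes are hypothetical, not part of the \charmpp{} API):

\begin{alltt}
// Hypothetical 1-D chare array element holding 9 matrices.
class Solver : public CBase_Solver \{
  double *finalResult;   // final results of each timestep
  double *scratch[8];    // intermediate matrices, cheap to recompute
  int n;                 // matrix dimension (n x n)
 public:
  . . .
  void pup(PUP::er &p) \{
    CBase_Solver::pup(p);     // PUP the superclass first
    p | n;
    if (p.isUnpacking()) finalResult = new double[n*n];
    PUParray(p, finalResult, n*n);
    // The 8 scratch matrices are deliberately not PUPed; they are
    // reallocated (and later recomputed) when unpacking.
    if (p.isUnpacking())
      for (int i = 0; i < 8; i++) scratch[i] = new double[n*n];
  \}
\};
\end{alltt}
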
Group, nodegroup (Section~\ref{sec:group}) and singleton chare objects
are normally not meant to be migrated. In order to checkpoint them,
however, the user has to write PUP routines for those groups and
chares and declare them as {\tt [migratable]} in the .ci file. Some
programs use {\it mainchares} to hold key control data like global
object counts, so mainchares need to be checkpointed too. To do this,
the programmer should write a PUP routine for the mainchare and
declare it as {\tt [migratable]} in the .ci file, just as in the case
of groups and nodegroups.

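A sketch of what this looks like for a group (the group name and its
data member are made up for illustration; the same pattern applies to
nodegroups and mainchares):

\begin{alltt}
// In the .ci file:
group [migratable] Counter \{
  entry Counter();
\};

// In the C++ code:
class Counter : public CBase_Counter \{
  int localCount;                   // key control data to preserve
 public:
  Counter() : localCount(0) \{\}
  Counter(CkMigrateMessage *m) \{\}   // migration constructor
  void pup(PUP::er &p) \{
    p | localCount;                 // saved and restored with the checkpoint
  \}
\};
\end{alltt}
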
The checkpoint must be recorded at a synchronization point in the
application, to ensure a consistent state upon restart. One easy way
to achieve this is to synchronize through a reduction to a single
chare (such as the mainchare used at startup) and have that chare make
the call to initiate the checkpoint.

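As a sketch of this pattern (the chare, entry method, and proxy names
below are illustrative, not prescribed by \charmpp{}), each array
element contributes to a reduction targeting the mainchare, which then
starts the checkpoint:

\begin{alltt}
// Called on every array element at a consistent point,
// e.g. at the end of a timestep.
void Worker::startSync() \{
  contribute(CkCallback(CkIndex_Main::doCheckpoint(NULL), mainProxy));
\}

// Mainchare entry method, declared in the .ci file as
//   entry void doCheckpoint(CkReductionMsg *m);
// It is reached once all elements have contributed.
void Main::doCheckpoint(CkReductionMsg *m) \{
  delete m;
  // This callback fires both after the checkpoint is written and
  // after a restart from it, resuming the workers in either case.
  CkCallback cb(CkIndex_Worker::resume(), workerProxy);
  CkStartCheckpoint("log", cb);
\}
\end{alltt}
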
After {\tt CkStartCheckpoint} is executed, a directory of the
designated name is created and a collection of checkpoint files are
written into it.

\subsection{Restarting}

The user can choose to run the \charmpp{} application in restart mode,
i.e., restarting execution from a previously-created checkpoint. The
command line option {\tt +restart DIRNAME} is required to invoke this
mode. For example:

\begin{alltt}
  > ./charmrun hello +p4 +restart log
\end{alltt}

Restarting is the reverse process of checkpointing. \charmpp{} allows
restarting from an old checkpoint on a different number of physical
processors. This provides the flexibility to expand or shrink your
application when the availability of computing resources changes.

Note that on restart, if an array or group reduction client was set to
a static function, the function pointer might be lost, and the user
needs to register it again. A better alternative is to always use an
entry method of a chare object as the reduction client. Since all
entry methods are registered inside the \charmpp{} system, the
reduction client will be restored automatically during the restart
phase.

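For example, instead of registering a plain C function as the client,
the reduction can target an entry method through a {\tt CkCallback}
(the entry method, variable, and proxy names here are illustrative):

\begin{alltt}
// Fragile across restarts: a static function as the reduction client.
// CkCallback cb(&printTotal, NULL);   // function pointer may be lost

// Robust across restarts: target an entry method of the mainchare,
// declared in the .ci file as: entry void reportTotal(CkReductionMsg *m);
CkCallback cb(CkIndex_Main::reportTotal(NULL), mainProxy);
contribute(sizeof(double), &localSum, CkReduction::sum_double, cb);
\end{alltt}
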
After a failure, the system may contain fewer or more
processors. Once the failed components have been repaired, some
processors may become available again. Therefore, the user may need
the flexibility to restart on a different number of processors than in
the checkpointing phase. This is allowed by giving a different {\tt
+pN} option at runtime. One thing to note is that the new load
distribution might differ from the previous one at checkpoint time, so
running a load balancer (see Section~\ref{loadbalancing}) after
restart is suggested.

If the restart is not done on the same number of processors, the
processor-specific data in a group/nodegroup branch cannot (and
usually should not) be restored individually. Instead, a copy from
processor 0 will be propagated to all the processors.

\subsection{Choosing What to Save}

In your programs, you may use chare groups for different purposes. For
example, groups holding read-only data can avoid excessive data
copying, while groups maintaining processor-specific information serve
as local managers of each processor. In the latter situation, the data
is sometimes too complicated to save and restore, but easy to
re-compute. For the read-only data, you want to save and restore it in
the PUP'er routine and leave the migration constructor (via which the
new object is created during restart) empty. For the easy-to-recompute
type of data, you can simply omit the PUP'er routine and reconstruct
the data in the group's migration constructor.

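A sketch contrasting the two cases (the group names and members below
are hypothetical):

\begin{alltt}
// Read-only data: save it in pup() and leave the migration
// constructor empty; the data comes back from the checkpoint.
class ConfigCache : public CBase_ConfigCache \{
  double tolerance;
  int maxIters;
 public:
  ConfigCache(double t, int m) : tolerance(t), maxIters(m) \{\}
  ConfigCache(CkMigrateMessage *m) \{\}            // nothing to rebuild here
  void pup(PUP::er &p) \{ p | tolerance; p | maxIters; \}
\};

// Easy-to-recompute data: skip it in pup() and rebuild it in the
// migration constructor, which is invoked during restart.
class CommScheduler : public CBase_CommScheduler \{
  std::vector<int> neighborPes;                  // derived from the PE layout
 public:
  CommScheduler() \{ computeNeighbors(); \}
  CommScheduler(CkMigrateMessage *m) \{ computeNeighbors(); \}
  void computeNeighbors() \{ /* recompute from CkMyPe()/CkNumPes() */ \}
\};
\end{alltt}
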
A similar example is the program mentioned above, with two types of
chare arrays: one maintaining intermediate results while the other
holds the final result for each timestep. The programmer can take
advantage of this flexibility by leaving the PUP'er routine empty for
the intermediate objects and doing the save/restore only for the
important objects.

\section{Online Fault Tolerance}
\label{sec:MemCheckpointing}
As supercomputers grow in size, their reliability decreases
correspondingly. This is because the number of components assembled
into a machine grows faster than the reliability of each individual
component improves. What we can expect in the future is that
applications will run on unreliable hardware.

The disk-based checkpoint/restart described above can be used as a
fault tolerance scheme. However, it would be a very basic scheme in
that, when a failure occurs, the whole program gets killed and the
user has to manually restart the application from the checkpoint
files. The double local-storage checkpoint/restart protocol described
in this section provides an automatic fault tolerance solution: when a
failure occurs, the program can automatically detect the failure and
restart from the checkpoint. Furthermore, this fault-tolerance
protocol does not rely on any reliable external storage (as needed in
the previous method). Instead, it stores two copies of the checkpoint
data in two different locations (which can be memory or local
disk). This double checkpointing ensures the availability of one
checkpoint in case the other is lost. The double in-memory
checkpoint/restart scheme is useful and efficient for applications
with a small memory footprint at the checkpoint state. The double
in-disk variant stores checkpoints on local disk and can thus be
useful for applications with a large memory footprint.
%Its advantage is to reduce the recovery
%overhead to seconds when a failure occurs.
%Currently, this scheme only supports Chare array-based Charm++ applications.

\subsection{Checkpointing}

The function that application developers can call to record a checkpoint in a
chare-array-based application is:
\begin{alltt}
  void CkStartMemCheckpoint(CkCallback &cb)
\end{alltt}
where {\it cb} has the same meaning as in
Section~\ref{sec:diskcheckpoint}. Just as with the disk-based
checkpoint described above, it is up to the programmer to decide what
to save. The programmer is also responsible for choosing when to
activate checkpointing so that the size of the global checkpoint
state, and consequently the time to record it, is minimized.

In AMPI applications, the user just needs to call the following
function to record a checkpoint:
\begin{alltt}
  void AMPI_MemCheckpoint()
\end{alltt}

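For instance, a minimal sketch of an AMPI time loop that checkpoints
every few iterations might look as follows (the checkpoint interval
and the {\tt doTimestep} routine are made up, and the call is assumed
to be reached by every rank):

\begin{alltt}
#include "mpi.h"

void timeLoop(int nSteps) \{
  for (int step = 0; step < nSteps; step++) \{
    doTimestep(step);            // hypothetical application work
    if (step % 10 == 0)
      AMPI_MemCheckpoint();      // record an in-memory checkpoint
  \}
\}
\end{alltt}
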
\subsection{Restarting}

When a processor crashes, the restart protocol will be automatically
invoked to recover all objects using the last checkpoints. The program
will continue to run on the surviving processors. This is based on the
assumption that there are no extra processors to replace the crashed
ones.

However, if there is a pool of extra processors to replace the crashed
ones, the fault-tolerance protocol can also take advantage of this to
grab a free processor and let the program run on the same number of
processors as before the crash. In order to achieve this, \charmpp{}
needs to be compiled with the macro option {\it CK\_NO\_PROC\_POOL}
turned off.

\subsection{Double in-disk checkpoint/restart}

A variant of the double memory checkpoint/restart scheme, {\it double
in-disk checkpoint/restart}, can be applied to applications with a
large memory footprint. In this scheme, checkpoints are stored on
local disk instead of in memory. The checkpoint files are named
``ckpt[CkMyPe]-[idx]-XXXXX'' and are stored under the /tmp directory.

Users can pass the runtime option {\it +ftc\_disk} to activate this
mode. For example:

\begin{alltt}
  ./charmrun hello +p8 +ftc_disk
\end{alltt}

\subsection{Building Instructions}
In order to have the double local-storage checkpoint/restart
functionality available, the parameter \emph{syncft} must be provided
at build time:

\begin{alltt}
  ./build charm++ net-linux-x86_64 syncft
\end{alltt}

At present, only a few of the machine layers underlying the \charmpp{}
runtime system support resilient execution. These include the
TCP-based \texttt{net} builds on Linux and Mac OS X. For clusters with
job schedulers that kill a job if a node goes down, a way to
demonstrate the killing of a process is shown in
Section~\ref{ft:inject}. The \charmpp{} runtime system can
automatically detect failures and restart from the checkpoint.

\subsection{Failure Injection}
To test that your application is able to successfully recover from
failures using the double local-storage mechanism, we provide a
failure injection mechanism that lets you specify which PEs will fail
at what point in time. You must create a text file with two columns:
the first column lists the PEs that will fail, and the second column
gives the time at which the corresponding PE will fail. Make sure all
the failures occur after the first checkpoint. The runtime parameter
\emph{kill\_file} has to be added to the command line along with the
file name:

\begin{alltt}
  ./charmrun hello +p8 +kill_file <file>
\end{alltt}

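For instance, a hypothetical kill file asking PE 1 to fail at time 10
and PE 3 at time 20 (the times are assumed here to be measured in
seconds from the start of execution) would contain:

\begin{alltt}
  1 10
  3 20
\end{alltt}
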
An example of this usage can be found in the \texttt{syncfttest}
targets in \testrefdir{jacobi3d}.

\subsection{Failure Demonstration}
\label{ft:inject}
On HPC clusters, the job scheduler usually kills a job if a node goes
down. To demonstrate restarting after failures on such clusters, the
\texttt{CkDieNow()} function can be used. You just need to place a
call to it anywhere in the code. When it is called by a processor, the
processor will hang and stop responding to any communication. A spare
processor will then replace the crashed processor and continue
execution after obtaining the checkpoint of the crashed processor. To
make this work, you need to add the command line option \emph{+wp};
the number following that option is the number of working processors,
and the remaining processors are spares.
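
For example, a run on 8 PEs that keeps 6 working processors and leaves
2 as spares (the application name and processor counts are arbitrary)
might look like:

\begin{alltt}
  ./charmrun hello +p8 +wp 6
\end{alltt}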