\charmpp{} offers two checkpoint/restart mechanisms. Each of these
targets a specific need in parallel programming, but both of them are
based on the same infrastructure.

Traditional chare-array-based \charmpp{} applications, including AMPI
applications, can be checkpointed to storage buffers (either files or
memory regions) and be restarted later from those buffers. The basic
idea behind this is straightforward: checkpointing an application is
like migrating its parallel objects from the processors onto buffers,
and restarting is the reverse. Thanks to migration utilities such as
PUP methods (Section~\ref{sec:pup}), users can decide what data to
save in checkpoints and how to save it. However, unlike migration
(where certain objects do not need a PUP method), checkpointing
requires all objects to implement a PUP method.

The two checkpoint/restart schemes implemented are:
\begin{itemize}
\item Shared filesystem: provides support for \emph{split execution}, where the
execution of an application is interrupted and later resumed.
\item Double local-storage: offers an online \emph{fault tolerance}
mechanism for applications running on unreliable machines.
\end{itemize}

\section{Split Execution}

There are several reasons for having to split the execution of an
application. These include protection against job failure, a single
execution needing to run beyond a machine's job time limit, and
resuming execution from an intermediate point with different
parameters. All of these scenarios are supported by a mechanism to
record the execution state and resume execution from it later.

Parallel machines are assembled from many complicated components, each
of which can potentially fail and interrupt execution
unexpectedly. Thus, parallel applications that take a long time to run
from start to completion need to protect themselves from losing work
and having to start over. They can achieve this by periodically taking
a checkpoint of their execution state from which they can later
resume.

Another use of checkpoint/restart is when the total execution time of
the application exceeds the maximum allocation time for a job on a
supercomputer. In that case, an application may checkpoint before the
allocation time expires and then restart from the checkpoint in a
subsequent allocation.

A third reason for splitting an execution is when an application
consists of \emph{phases} and each phase may be run a different number
of times with varying parameters. Consider, for instance, an
application with two phases where the first phase has only one
possible configuration (it is run only once), while the second phase
may have several configurations (for testing various algorithms). In
that case, once the first phase is complete, the application
checkpoints the result, and further executions of the second phase may
simply resume from that checkpoint.

An example of \charmpp{}'s support for split execution can be seen
in \testrefdir{chkpt/hello}.

\subsection{Checkpointing}

\label{sec:diskcheckpoint}
The API to checkpoint the application is:

\begin{alltt}
  void CkStartCheckpoint(char* dirname, const CkCallback& cb);
\end{alltt}

The string {\it dirname} is the destination directory where the
checkpoint files will be stored, and {\it cb} is the callback to be
invoked after the checkpoint is done, as well as when the restart is
complete. Here is an example of typical usage:

\begin{alltt}
  . . .
  CkCallback cb(CkIndex_Hello::SayHi(), helloProxy);
  CkStartCheckpoint("log", cb);
\end{alltt}

A chare array usually has a PUP routine for the sake of migration.
This PUP routine is also used in the checkpointing and restarting
process, so it is up to the programmer to decide what to save and
restore for the application. One illustration of this flexibility is a
complicated scientific computation application with 9 matrices, 8 of
which hold intermediate results and 1 of which holds the final results
of each timestep. To save resources, the PUP routine can omit the 8
intermediate matrices and checkpoint only the matrix with the final
results of each timestep.

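For instance, a minimal sketch of such a selective PUP routine might
look like the following (the array element, member names, and matrix
sizes are hypothetical, not part of the \charmpp{} API):

\begin{alltt}
// Hypothetical 1-D chare array element holding 9 matrices.
class Solver : public CBase_Solver \{
  double *finalResult;   // final results of each timestep
  double *scratch[8];    // intermediate matrices, cheap to recompute
  int n;                 // matrix dimension (n x n)
 public:
  . . .
  void pup(PUP::er &p) \{
    CBase_Solver::pup(p);     // PUP the superclass first
    p | n;
    if (p.isUnpacking()) finalResult = new double[n*n];
    PUParray(p, finalResult, n*n);
    // The 8 scratch matrices are deliberately not PUPed; they are
    // reallocated (and later recomputed) when unpacking.
    if (p.isUnpacking())
      for (int i = 0; i < 8; i++) scratch[i] = new double[n*n];
  \}
\};
\end{alltt}
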
Group, nodegroup (Section~\ref{sec:group}) and singleton chare objects
are normally not meant to be migrated. In order to checkpoint them,
however, the user has to write PUP routines for those groups and
chares and declare them as {\tt [migratable]} in the .ci file. Some
programs use {\it mainchares} to hold key control data like global
object counts, so mainchares need to be checkpointed too. To do this,
the programmer should write a PUP routine for the mainchare and
declare it as {\tt [migratable]} in the .ci file, just as in the case
of groups and nodegroups.

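A sketch of what this looks like for a group (the group name and its
data member are made up for illustration; the same pattern applies to
nodegroups and mainchares):

\begin{alltt}
// In the .ci file:
group [migratable] Counter \{
  entry Counter();
\};

// In the C++ code:
class Counter : public CBase_Counter \{
  int localCount;                   // key control data to preserve
 public:
  Counter() : localCount(0) \{\}
  Counter(CkMigrateMessage *m) \{\}   // migration constructor
  void pup(PUP::er &p) \{
    p | localCount;                 // saved and restored with the checkpoint
  \}
\};
\end{alltt}
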
The checkpoint must be recorded at a synchronization point in the
application, to ensure a consistent state upon restart. One easy way
to achieve this is to synchronize through a reduction to a single
chare (such as the mainchare used at startup) and have that chare make
the call to initiate the checkpoint.

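As a sketch of this pattern (the chare, entry method, and proxy names
below are illustrative, not prescribed by \charmpp{}), each array
element contributes to a reduction targeting the mainchare, which then
starts the checkpoint:

\begin{alltt}
// Called on every array element at a consistent point,
// e.g. at the end of a timestep.
void Worker::startSync() \{
  contribute(CkCallback(CkIndex_Main::doCheckpoint(NULL), mainProxy));
\}

// Mainchare entry method, declared in the .ci file as
//   entry void doCheckpoint(CkReductionMsg *m);
// It is reached once all elements have contributed.
void Main::doCheckpoint(CkReductionMsg *m) \{
  delete m;
  // This callback fires both after the checkpoint is written and
  // after a restart from it, resuming the workers in either case.
  CkCallback cb(CkIndex_Worker::resume(), workerProxy);
  CkStartCheckpoint("log", cb);
\}
\end{alltt}
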
After {\tt CkStartCheckpoint} is executed, a directory of the
designated name is created and a collection of checkpoint files are
written into it.

\subsection{Restarting}

The user can choose to run the \charmpp{} application in restart mode,
i.e., restarting execution from a previously-created checkpoint. The
command line option {\tt +restart DIRNAME} is required to invoke this
mode. For example:

\begin{alltt}
  > ./charmrun hello +p4 +restart log
\end{alltt}

Restarting is the reverse process of checkpointing. \charmpp{} allows
restarting from an old checkpoint on a different number of physical
processors. This provides the flexibility to expand or shrink your
application when the availability of computing resources changes.

Note that on restart, if an array or group reduction client was set to
a static function, the function pointer might be lost, and the user
needs to register it again. A better alternative is to always use an
entry method of a chare object as the reduction client. Since all
entry methods are registered inside the \charmpp{} system, the
reduction client will be restored automatically during the restart
phase.

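For example, instead of registering a plain C function as the client,
the reduction can target an entry method through a {\tt CkCallback}
(the entry method, variable, and proxy names here are illustrative):

\begin{alltt}
// Fragile across restarts: a static function as the reduction client.
// CkCallback cb(&printTotal, NULL);   // function pointer may be lost

// Robust across restarts: target an entry method of the mainchare,
// declared in the .ci file as: entry void reportTotal(CkReductionMsg *m);
CkCallback cb(CkIndex_Main::reportTotal(NULL), mainProxy);
contribute(sizeof(double), &localSum, CkReduction::sum_double, cb);
\end{alltt}
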
After a failure, the system may contain fewer or more
processors. Once the failed components have been repaired, some
processors may become available again. Therefore, the user may need
the flexibility to restart on a different number of processors than in
the checkpointing phase. This is allowed by giving a different {\tt
+pN} option at runtime. One thing to note is that the new load
distribution might differ from the previous one at checkpoint time, so
running a load balancer (see Section~\ref{loadbalancing}) after
restart is suggested.

If the restart is not done on the same number of processors, the
processor-specific data in a group/nodegroup branch cannot (and
usually should not) be restored individually. Instead, a copy from
processor 0 will be propagated to all the processors.

\subsection{Choosing What to Save}

In your programs, you may use chare groups for different purposes. For
example, groups holding read-only data can avoid excessive data
copying, while groups maintaining processor-specific information serve
as local managers of each processor. In the latter situation, the data
is sometimes too complicated to save and restore, but easy to
re-compute. For the read-only data, you want to save and restore it in
the PUP'er routine and leave the migration constructor (via which the
new object is created during restart) empty. For the easy-to-recompute
type of data, you can simply omit the PUP'er routine and reconstruct
the data in the group's migration constructor.

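A sketch contrasting the two cases (the group names and members below
are hypothetical):

\begin{alltt}
// Read-only data: save it in pup() and leave the migration
// constructor empty; the data comes back from the checkpoint.
class ConfigCache : public CBase_ConfigCache \{
  double tolerance;
  int maxIters;
 public:
  ConfigCache(double t, int m) : tolerance(t), maxIters(m) \{\}
  ConfigCache(CkMigrateMessage *m) \{\}            // nothing to rebuild here
  void pup(PUP::er &p) \{ p | tolerance; p | maxIters; \}
\};

// Easy-to-recompute data: skip it in pup() and rebuild it in the
// migration constructor, which is invoked during restart.
class CommScheduler : public CBase_CommScheduler \{
  std::vector<int> neighborPes;                  // derived from the PE layout
 public:
  CommScheduler() \{ computeNeighbors(); \}
  CommScheduler(CkMigrateMessage *m) \{ computeNeighbors(); \}
  void computeNeighbors() \{ /* recompute from CkMyPe()/CkNumPes() */ \}
\};
\end{alltt}
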
A similar example is the program mentioned above, with two types of
chare arrays: one maintaining intermediate results while the other
holds the final result for each timestep. The programmer can take
advantage of this flexibility by leaving the PUP'er routine empty for
the intermediate objects and doing the save/restore only for the
important objects.

\section{Online Fault Tolerance}
\label{sec:MemCheckpointing}
As supercomputers grow in size, their reliability decreases
correspondingly. This is because the number of components assembled
into a machine grows faster than the reliability of each individual
component improves. What we can expect in the future is that
applications will run on unreliable hardware.

The disk-based checkpoint/restart described above can be used as a
fault tolerance scheme. However, it would be a very basic scheme in
that, when a failure occurs, the whole program gets killed and the
user has to manually restart the application from the checkpoint
files. The double local-storage checkpoint/restart protocol described
in this section provides an automatic fault tolerance solution: when a
failure occurs, the program can automatically detect the failure and
restart from the checkpoint. Furthermore, this fault-tolerance
protocol does not rely on any reliable external storage (as needed in
the previous method). Instead, it stores two copies of the checkpoint
data in two different locations (which can be memory or local
disk). This double checkpointing ensures the availability of one
checkpoint in case the other is lost. The double in-memory
checkpoint/restart scheme is useful and efficient for applications
with a small memory footprint at the checkpoint state. The double
in-disk variant stores checkpoints on local disk and can thus be
useful for applications with a large memory footprint.
%Its advantage is to reduce the recovery
%overhead to seconds when a failure occurs.
%Currently, this scheme only supports Chare array-based Charm++ applications.

\subsection{Checkpointing}

The function that application developers can call to record a checkpoint in a
chare-array-based application is:
\begin{alltt}
  void CkStartMemCheckpoint(CkCallback &cb)
\end{alltt}
where {\it cb} has the same meaning as in
Section~\ref{sec:diskcheckpoint}. Just as with the disk-based
checkpoint described above, it is up to the programmer to decide what
to save. The programmer is also responsible for choosing when to
activate checkpointing so that the size of the global checkpoint
state, and consequently the time to record it, is minimized.

In AMPI applications, the user just needs to call the following
function to record a checkpoint:
\begin{alltt}
  void AMPI_MemCheckpoint()
\end{alltt}

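For instance, a minimal sketch of an AMPI time loop that checkpoints
every few iterations might look as follows (the checkpoint interval
and the {\tt doTimestep} routine are made up, and the call is assumed
to be reached by every rank):

\begin{alltt}
#include "mpi.h"

void timeLoop(int nSteps) \{
  for (int step = 0; step < nSteps; step++) \{
    doTimestep(step);            // hypothetical application work
    if (step % 10 == 0)
      AMPI_MemCheckpoint();      // record an in-memory checkpoint
  \}
\}
\end{alltt}
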
\subsection{Restarting}

When a processor crashes, the restart protocol will be automatically
invoked to recover all objects using the last checkpoints. The program
will continue to run on the surviving processors. This is based on the
assumption that there are no extra processors to replace the crashed
ones.

However, if there is a pool of extra processors to replace the crashed
ones, the fault-tolerance protocol can also take advantage of this to
grab a free processor and let the program run on the same number of
processors as before the crash. In order to achieve this, \charmpp{}
needs to be compiled with the macro option {\it CK\_NO\_PROC\_POOL}
turned off.

\subsection{Double in-disk checkpoint/restart}

A variant of the double memory checkpoint/restart scheme, {\it double
in-disk checkpoint/restart}, can be applied to applications with a
large memory footprint. In this scheme, checkpoints are stored on
local disk instead of in memory. The checkpoint files are named
``ckpt[CkMyPe]-[idx]-XXXXX'' and are stored under the /tmp directory.

Users can pass the runtime option {\it +ftc\_disk} to activate this
mode. For example:

\begin{alltt}
  ./charmrun hello +p8 +ftc_disk
\end{alltt}

\subsection{Building Instructions}
In order to have the double local-storage checkpoint/restart
functionality available, the parameter \emph{syncft} must be provided
at build time:

\begin{alltt}
  ./build charm++ net-linux-x86_64 syncft
\end{alltt}

At present, only a few of the machine layers underlying the \charmpp{}
runtime system support resilient execution. These include the
TCP-based \texttt{net} builds on Linux and Mac OS X. For clusters with
job schedulers that kill a job if a node goes down, a way to
demonstrate the killing of a process is shown in
Section~\ref{ft:inject}. The \charmpp{} runtime system can
automatically detect failures and restart from the checkpoint.

\subsection{Failure Injection}
To test that your application is able to successfully recover from
failures using the double local-storage mechanism, we provide a
failure injection mechanism that lets you specify which PEs will fail
at what point in time. You must create a text file with two columns:
the first column lists the PEs that will fail, and the second column
gives the time at which the corresponding PE will fail. Make sure all
the failures occur after the first checkpoint. The runtime parameter
\emph{kill\_file} has to be added to the command line along with the
file name:

\begin{alltt}
  ./charmrun hello +p8 +kill_file <file>
\end{alltt}

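For instance, a hypothetical kill file asking PE 1 to fail at time 10
and PE 3 at time 20 (the times are assumed here to be measured in
seconds from the start of execution) would contain:

\begin{alltt}
  1 10
  3 20
\end{alltt}
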
An example of this usage can be found in the \texttt{syncfttest}
targets in \testrefdir{jacobi3d}.

\subsection{Failure Demonstration}
\label{ft:inject}
On HPC clusters, the job scheduler usually kills a job if a node goes
down. To demonstrate restarting after failures on such clusters, the
\texttt{CkDieNow()} function can be used. You just need to place a
call to it anywhere in the code. When it is called by a processor, the
processor will hang and stop responding to any communication. A spare
processor will then replace the crashed processor and continue
execution after obtaining the checkpoint of the crashed processor. To
make this work, you need to add the command line option \emph{+wp};
the number following that option is the number of working processors,
and the remaining processors are spares.
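
For example, a run on 8 PEs that keeps 6 working processors and leaves
2 as spares (the application name and processor counts are arbitrary)
might look like:

\begin{alltt}
  ./charmrun hello +p8 +wp 6
\end{alltt}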