Resolve build failure due to inttypes.h inclusion in hrctimer.h
[charm.git] / doc / libraries / tram.tex
bloba11beead95cb5cb8dc7c18bfef301d96525ac289
1 \newcommand{\code}[1]{\textsf{#1}}
2 \label{TRAM}
3 \section{Overview}
5 Topological Routing and Aggregation Module is a library for optimization of
6 many-to-many and all-to-all collective communication patterns in Charm++
7 applications. The library performs topological routing and aggregation of
8 network communication in the context of a virtual grid topology comprising the
9 Charm++ Processing Elements (PEs) in the parallel run. The number of dimensions
10 and their sizes within this topology are specified by the user when initializing
11 an instance of the library.
13 TRAM is implemented as a Charm++ group, so an \emph{instance}
14 of TRAM has one object on every PE used in the run. We use
15 the term \emph{local instance} to denote a member of the TRAM
16 group on a particular PE.
18 Most collective communication patterns involve sending linear arrays
19 of a single data type. In order to more efficiently aggregate and
20 process data, TRAM restricts the data sent using the
21 library to a single data type specified by the user through a template
22 parameter when initializing an instance of the library. We use the
23 term \emph{data item} to denote a single object of this datatype
24 submitted to the library for sending. While the library is active
25 (i.e. after initialization and before termination), an arbitrary
26 number of data items can be submitted to the library at each PE.
28 On systems with an underlying grid or torus network topology, it can be
29 beneficial to configure the virtual topology for TRAM to match the physical
30 topology of the network. This can easily be accomplished using the Charm++
31 Topology Manager.
33 The next two sections explain the routing and aggregation techniques
34 used in the library.
36 \subsection{Routing}
38 Let the variables $j$ and $k$ denote PEs within an N-dimensional
39 virtual topology of PEs and $x$ denote a dimension of the grid. We
40 represent the coordinates of $j$ and $k$ within the grid as $\left
41 (j_0, j_1, \ldots, j_{N-1} \right) $ and $ \left (k_0, k_1, \ldots,
42 k_{N-1} \right) $. Also, let
44 f(x, j, k) =
45 \begin{cases}
46 0, & \text{if } j_x = k_x \\
47 1, & \text{if } j_x \ne k_x
48 \end{cases}
51 $j$ and $k$ are \emph{peers} if
53 \sum_{d=0}^{N-1} f(d, j, k) = 1 .
55 When using TRAM, PEs communicate directly only with their
56 peers. Sending to a PE which is not a peer is handled inside the
57 library by routing the data through one or more \emph{intermediate
58 destinations} along the route to the \emph{final destination}.
60 Suppose a data item destined for PE $k$ is submitted to the library at
61 PE $j$. If $k$ is a peer of $j$, the data item will be sent directly
62 to $k$, possibly along with other data items for which $k$ is the
63 final or intermediate destination. If $k$ is not a peer of $j$, the
64 data item will be sent to an intermediate destination $m$ along the
65 route to $k$ whose index is $\left (j_0, j_1, \ldots, j_{i-1}, k_i,
66 j_{i+1}, \ldots, j_{N-1} \right)$, where $i$ is the greatest value of
67 $x$ for which $f(x, j, k) = 1$.
69 Note that in obtaining the coordinates of $m$ from $j$, exactly one of
70 the coordinates of $j$ which differs from the coordinates of $k$ is
71 made to agree with $k$. It follows that m is a peer of $j$, and that
72 using this routing process at $m$ and every subsequent intermediate
73 destination along the route eventually leads to the data item being
74 received at $k$. Consequently, the number of messages $F(j, k)$ that
75 will carry the data item to the destination is
77 F(j,k) = \sum_{d=0}^{N-1}f(d, j, k) .
80 \subsection{Aggregation}
82 Communicating over the network of a parallel machine involves per
83 message bandwidth and processing overhead. TRAM amortizes
84 this overhead by aggregating data items at the source and every
85 intermediate destination along the route to the final destination.
87 Every local instance of the TRAM group buffers the data items that have been
88 submitted locally or received from another PE for forwarding. Because only peers
89 communicate directly in the virtual grid, it suffices to have a single buffer
90 per PE for every peer. Given a dimension d within the virtual topology, let
91 $s_d$ denote its \emph{size}, or the number of distinct values a coordinate for
92 dimension d can take. Consequently, each local instance allocates up to $s_d - 1
93 $ buffers per dimension, for a total of $\sum_{d=0}^{N-1} (s_d - 1) $
94 buffers. Note that this is normally significantly less than the total number of
95 PEs specified by the virtual topology, which is equal to $\prod_{d=0}^{N-1}
96 {s_d}$.
98 Sending with TRAM is done by submitting a data item and a
99 destination identifier, either PE or array index, using a function
100 call to the local instance. If the index belongs to a peer, the
101 library places the data item in the buffer for the peer's
102 PE. Otherwise, the library calculates the index of the intermediate
103 destination using the previously described algorithm, and places the
104 data item in the buffer for the resulting PE, which by design is
105 always a peer of the local PE. Buffers are sent out immediately
106 when they become full. When a message is received at an intermediate
107 destination, the data items comprising it are distributed into the
108 appropriate buffers for subsequent sending. In the process, if a data
109 item is determined to have reached its final destination, it is
110 immediately delivered.
112 The total buffering capacity specified by the user may be reached even
113 when no single buffer is completely filled up. In that case the buffer
114 with the greatest number of buffered data items is sent.
116 \section{Application User Interface}
118 A typical usage scenario for TRAM involves a start-up phase followed by one or
119 more \emph{communication steps}. We next describe the application user interface
120 and details relevant to usage of the library, which normally follows these
121 steps:
123 \begin{enumerate}
124 \item{\textbf{Start-up}} Creation of a TRAM group and set up of client
125 arrays and groups
126 \item{\textbf{Initialization}} Calling an initialization function,
127 which returns through a callback
128 \item{\textbf{Sending}} An arbitrary number of sends using the
129 \code{insertData} function call on the local instance of the library
130 \item{\textbf{Receiving}} Processing received data items through the
131 \code{process} function which serves as the delivery
132 interface for the library and must be defined by the user
133 \item{\textbf{Termination}} Termination of a communication step
134 \item{\textbf{Re-initialization}} After termination of a communication step, the
135 library instance is not active. However, re-initialization using step $2$
136 leads to a new communication step.
137 \end{enumerate}
139 \subsection{Start-Up}
141 Start-up is typically performed once in a program, often inside the \code{main}
142 function of the mainchare, and involves creating an aggregator instance. An
143 instance of TRAM is restricted to sending data items of a single user-specified
144 type, which we denote by \code{dtype}, to a single user-specified chare array or
145 group.
147 \subsubsection{Sending to a Group}
148 To use TRAM for sending to a group, a \code{GroupMeshStreamer} group should be
149 created. Either of the following two \code{GroupMeshStreamer} constructors can
150 be used for that purpose:
152 \begin{alltt}
153 template<class dtype, class ClientType, class RouterType>
154 GroupMeshStreamer<dtype, ClientType, RouterType>::
155 GroupMeshStreamer(int maxNumDataItemsBuffered,
156 int numDimensions,
157 int *dimensionSizes,
158 CkGroupID clientGID,
159 bool yieldFlag = 0,
160 double progressPeriodInMs = -1.0);
162 template<class dtype, class ClientType, class RouterType>
163 GroupMeshStreamer<dtype, ClientType, RouterType>::
164 GroupMeshStreamer(int numDimensions,
165 int *dimensionSizes,
166 CkGroupID clientGID,
167 int bufferSize,
168 bool yieldFlag = 0,
169 double progressPeriodInMs = -1.0);
171 \end{alltt}
173 \subsubsection{Sending to a Chare Array}
174 For sending to a chare array, an \code{ArrayMeshStreamer} group should be
175 created, which has a similar constructor interface to \code{GroupMeshStreamer}:
177 \begin{alltt}
178 template <class dtype, class itype, class ClientType,
179 class RouterType>
180 ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::
181 ArrayMeshStreamer(int maxNumDataItemsBuffered,
182 int numDimensions,
183 int *dimensionSizes,
184 CkArrayID clientAID,
185 bool yieldFlag = 0,
186 double progressPeriodInMs = -1.0);
188 template <class dtype, class itype, class ClientType,
189 class RouterType>
190 ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::
191 ArrayMeshStreamer(int numDimensions,
192 int *dimensionSizes,
193 CkArrayID clientAID,
194 int bufferSize,
195 bool yieldFlag = 0,
196 double progressPeriodInMs = -1.0);
198 \end{alltt}
200 Description of parameters:
201 \begin{itemize}
202 \item \code{maxNumDataItemsBuffered}: maximum number of items that the
203 library is allowed to buffer per PE
204 \item \code{numDimensions}: number of dimensions in grid of PEs
205 \item \code{dimensionSizes}: array of size \code{numDimensions} containing the
206 size of each dimension in the grid
207 \item \code{clientGID}: the group ID for the client group
208 \item \code{clientAID}: the array ID for the client array
209 \item \code{bufferSize}: size of the buffer for each peer,
210 in terms of number of data items
211 \item \code{yieldFlag}: when true, calls \code{CthYield()} after every $1024$
212 item insertions; setting it true requires all data items to be submitted from
213 threaded entry methods. Ensures that pending messages are sent out by the
214 runtime system when a large number of data items are submitted from a single
215 entry method.
216 \item \code{progressPeriodInMs}: number of milliseconds between periodic
217 progress checks; relevant only when periodic flushing is enabled (see
218 Section~\ref{sec:tram_termination})
219 \end{itemize}
221 Template parameters:
222 \begin{itemize}
223 \item \code{dtype}: data item type
224 \item \code{itype}: index type of client chare array (use \code{int} for
225 one-dimensional chare arrays and \code{CkArrayIndex} for all other index
226 types)
227 \item \code{ClientType}: type of client group or array
228 \item \code{RouterType}: the routing protocol to be used. The choices are: \\
229 (1) \code{SimpleMeshRouter} - original grid aggregation scheme; \\
230 (2) \code{NodeAwareMeshRouter} - base node-aware aggregation scheme; \\
231 (3) \code{AggressiveNodeAwareMeshRouter} - advanced node-aware aggregation scheme;
232 \end{itemize}
234 \subsection{Initialization}
236 A TRAM instance needs to be initialized before every communication step. There
237 are currently three main modes of operation, depending on the type of
238 termination used: \emph{staged completion}, \emph{completion detection}, or
239 \emph{quiescence detection}. The modes of termination are described later. Here,
240 we present the interface for initializing a communication step for each of the
241 three modes.
243 When using completion detection, each local instance of TRAM must be initialized
244 using the following variant of the overloaded \code{init} function:
246 \begin{alltt}
247 template <class dtype, class RouterType>
248 void MeshStreamer<dtype, RouterType>::
249 init(int numContributors,
250 CkCallback startCb,
251 CkCallback endCb,
252 CProxy_CompletionDetector detector,
253 int prio,
254 bool usePeriodicFlushing);
255 \end{alltt}
257 Description of parameters:
259 \begin{itemize}
260 \item \code{numContributors}: number of \code{done} calls expected globally
261 before termination of this communication step
262 \item \code{startCb}: callback to be invoked by the library after
263 initialization is complete
264 \item \code{endCb}: callback to be invoked by the library after termination
265 of this communication step
266 \item \code{detector}: an inactive \code{CompletionDetector} object to be used by TRAM
267 \item \code{prio}: Charm++ priority to be used for messages sent using TRAM in
268 this communication step
269 \item \code{usePeriodicFlushing}: specifies whether periodic flushing should
270 be used for this communication step
271 \end{itemize}
273 When using staged completion, a completion detector object is not required as
274 input, as the library performs its own specialized form of termination. In this
275 case, each local instance of TRAM must be initialized using a different
276 interface for the overloaded \code{init} function:
278 \begin{alltt}
279 template <class dtype, class RouterType>
280 void MeshStreamer<dtype, RouterType>::
281 init(int numLocalContributors,
282 CkCallback startCb,
283 CkCallback endCb,
284 int prio,
285 bool usePeriodicFlushing);
287 \end{alltt}
289 Note that \code{numLocalContributors} denotes the local number of \code{done}
290 calls expected, rather than the global as in the first interface of \code{init}.
292 A common case is to have a single chare array perform all the sends in a
293 communication step, with each element of the array as a contributor. For this
294 case there is a special version of \code{init} that takes as input the
295 \code{CkArrayID} object for the chare array that will perform the sends,
296 precluding the need to manually determine the number of client chares per PE:
298 \begin{alltt}
299 template <class dtype, class RouterType>
300 void MeshStreamer<dtype, RouterType>::
301 init(CkArrayID senderArrayID,
302 CkCallback startCb,
303 CkCallback endCb,
304 int prio,
305 bool usePeriodicFlushing);
306 \end{alltt}
308 The \code{init} interface for using quiescence detection is:
310 \begin{alltt}
311 template <class dtype, class RouterType>
312 void MeshStreamer<dtype, RouterType>::init(CkCallback startCb,
313 int prio);
314 \end{alltt}
316 After initialization is finished, the system invokes \code{startCb},
317 signaling to the user that the library is ready to accept data items
318 for sending.
321 %\textbf{proposed addition: add a non-streaming all-to-all interface,
322 % which should not require an init call}
324 \subsection{Sending}
326 Sending with TRAM is done through calls to \code{insertData} and
327 \code{broadcast}.
329 \begin{alltt}
330 template <class dtype, class RouterType>
331 void MeshStreamer<dtype, RouterType>::
332 insertData(const dtype& dataItem,
333 int destinationPe);
335 template <class dtype, class itype, class ClientType,
336 class RouterType>
337 void ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::
338 insertData(const dtype& dataItem,
339 itype arrayIndex);
341 template <class dtype, class RouterType>
342 void MeshStreamer<dtype, RouterType>::
343 broadcast(const dtype& dataItem);
344 \end{alltt}
346 \begin{itemize}
347 \item \code{dataItem}: reference to a data item to be sent
348 \item \code{destinationPe}: index of destination PE
349 \item \code{arrayIndex}: index of destination array element
350 \end{itemize}
352 Broadcasting has the effect of delivering the data item:
353 \begin{itemize}
354 \item {once on every PE involved in the computation for \code{GroupMeshStreamer}}
355 \item {once for every array element involved in the computation for
356 \code{ArrayMeshStreamer}}
357 \end{itemize}
359 %\textbf{proposed addition: add an interface for sending an array of
360 % data items}
362 \subsection{Receiving}
364 To receive data items sent using TRAM, the user must define
365 the \code{process} function for each client group and array:
367 \begin{alltt}
368 void process(const dtype &ran);
369 \end{alltt}
371 Each item is delivered by the library using a separate call to \code{process} on
372 the destination PE. The call is made locally, so process should not be an entry
373 method.
375 \subsection{Termination}
376 \label{sec:tram_termination}
378 Flushing and termination mechanisms are used in TRAM to prevent deadlock due to
379 indefinite buffering of items. Flushing works by sending out all buffers in a
380 local instance if no items have been submitted or received since the last
381 progress check. Meanwhile, termination detection is used to send out partially
382 filled buffers at the end of a communication step after it has been determined
383 that no additional items will be submitted.
385 Currently, three means of termination are supported: staged completion,
386 completion detection, and quiescence detection. Periodic flushing is a secondary
387 mechanism which can be enabled or disabled when initiating one of the primary
388 mechanisms.
390 Termination typically requires the user to issue a number of calls to the
391 \code{done} function:
392 \begin{alltt}
393 template <class dtype, class RouterType>
394 void MeshStreamer<dtype, RouterType>::
395 done(int numContributorsFinished = 1);
396 \end{alltt}
397 When using completion detection, the number of done calls that are expected
398 globally by the TRAM instance is specified using the \code{numContributors}
399 parameter to \code{init}. Safe termination requires that no calls to
400 \code{insertData} or \code{broadcast} are made after the last call to
401 \code{done} is performed globally. Because order of execution is uncertain in
402 parallel applications, some care is required to ensure the above condition is
403 met. A simple way to terminate safely is to set \code{numContributors} equal to
404 the number of senders, and call done once for each sender that is done
405 submitting items.
407 In contrast to using completion detection, using staged completion involves
408 setting the local number of expected calls to \code{done} using the
409 \code{numLocalContributors} parameter in the \code{init} function. To ensure
410 safe termination, no \code{insertData} or \code{broadcast} calls should be made
411 on any PE where \code{done} has been called the expected number of times.
413 Another version of \code{init} for staged completion, which takes a
414 \code{CkArrayID} object as an argument, provides a simplified interface in the
415 common case when a single chare array performs all the sends within a
416 communication step, with each of its elements as a contributor. For this version
417 of \code{init}, TRAM determines the appropriate number of local contributors
418 automatically. It also correctly handles the case of PEs without any
419 contributors by immediately marking those PEs as having finished the
420 communication step. As such, this version of \code{init} should be preferred by
421 the user when applicable.
423 Staged completion is not supported when array location data is not guaranteed to
424 be correct, as this can potentially violate the termination conditions used to
425 guarantee successful termination. In order to guarantee correct location data in
426 applications that use load balancing, Charm++ must be compiled with
427 \code{-DCMK\underline{\hspace{.2cm}}GLOBAL\underline{\hspace{.2cm}}LOCATION\underline{\hspace{.2cm}}UPDATE},
428 which has the effect of performing a global broadcast of location data for chare
429 array elements that migrate during load balancing. Unfortunately, this operation
430 is expensive when migrating large numbers of elements. As an alternative,
431 completion detection and quiescence detection modes will work properly without
432 the global location update mechanism, and even in the case of anytime migration.
434 When using quiescence detection, no end callback is used, and no \code{done}
435 calls are required. Instead, termination of a communication step is achieved
436 using the quiescence detection framework in \charm{}, which supports passing a
437 callback as parameter. TRAM is set up such that quiescence will not be detected
438 until all items sent in the current communication step have been delivered to
439 their final destinations.
441 The choice of which termination mechanism to use is left to the user. Using
442 completion detection mode is more convenient when the global number of
443 contributors is known, while staged completion is easier to use if the local
444 number of contributors can be determined with ease, or if sending is done from
445 the elements of a chare array. If either mode can be used with ease, staged
446 completion should be preferred. Unlike the other mechanisms, staged completion
447 does not involve persistent background communication to determine when the
448 global number of expected \code{done} calls is reached. Staged completion is
449 also generally faster at reaching termination due to not being dependent on
450 periodic progress checks. Unlike completion detection, staged completion does
451 incur a small bandwidth overhead ($4$ bytes) for every TRAM message, but in
452 practice this is more than offset by the persistent traffic incurred by
453 completion detection.
455 Periodic flushing is an auxiliary mechanism which checks at a regular interval
456 whether any sends have taken place since the last time the check was
457 performed. If not, the mechanism sends out all the data items buffered per local
458 instance of the library. The period is specified by the user in the TRAM
459 constructor. A typical use case for periodic flushing is when the submission of
460 a data item B to TRAM happens as a result of the delivery of another data item A
461 sent using the same TRAM instance. If A is buffered inside the library and
462 insufficient data items are submitted to cause the buffer holding A to be sent
463 out, a deadlock could arise. With the periodic flushing mechanism, the buffer
464 holding A is guaranteed to be sent out eventually, and deadlock is
465 prevented. Periodic flushing is required when using the completion detection or
466 quiescence detection termination modes.
468 \subsection{Re-initialization}
470 A TRAM instance that has terminated cannot be used for sending more data items
471 until it has been re-initialized. Re-initialization is achieved by calling
472 \code{init}, which prepares the instance of the library for a new communication
473 step. Re-initialization is useful for iterative applications, where it is often
474 convenient to have a single communication step per iteration of the application.
476 \subsection{Charm++ Registration of Templated Classes}
478 Due to the use of templates in TRAM, the library template instances must be
479 explicitly registered with the Charm++ runtime by the user of the library. This
480 must be done in the \code{.ci} file for the application, and typically involves
481 three steps.
483 For \code{GroupMeshStreamer} template instances, registration is done as
484 follows:
485 \begin{itemize}
486 \item{Registration of the message type:}
487 \begin{alltt}
488 message MeshStreamerMessage<dtype>;
489 \end{alltt}
490 \item{Registration of the base aggregator class}
491 \begin{alltt}
492 group MeshStreamer<dtype, RouterType>;
493 \end{alltt}
494 \item{Registration of the derived aggregator class}
495 \begin{alltt}
496 group GroupMeshStreamer<dtype, ClientType, RouterType>;
497 \end{alltt}
498 \end{itemize}
500 For \code{ArrayMeshStreamer} template instances, registration is done as
501 follows:
502 \begin{itemize}
503 \item{Registration of the message type:}
504 \begin{alltt}
505 message MeshStreamerMessage<ArrayDataItem<dtype, itype> >;
506 \end{alltt}
507 \item{Registration of the base aggregator class}
508 \begin{alltt}
509 group MeshStreamer<ArrayDataItem<dtype, itype>,
510 RouterType>;
511 \end{alltt}
512 \item{Registration of the derived aggregator class}
513 \begin{alltt}
514 group ArrayMeshStreamer<dtype, itype, ClientType,
515 RouterType>;
516 \end{alltt}
517 \end{itemize}
519 \section{Example}
521 For example code showing how to use TRAM, see examples/charm++/TRAM and
522 tests/charm++/streamingAllToAll in the Charm++ repository.
524 \thispagestyle{empty}
525 \pagestyle{empty}