doc/libraries/tram.tex

   1 \newcommand{\code}[1]{\textsf{#1}}
   2 \label{TRAM}
   3 \section{Overview}
   4
   5 Topological Routing and Aggregation Module is a library for optimization of
   6 many-to-many and all-to-all collective communication patterns in Charm++
   7 applications. The library performs topological routing and aggregation of
   8 network communication in the context of a virtual grid topology comprising the
   9 Charm++ Processing Elements (PEs) in the parallel run. The number of dimensions
  10 and their sizes within this topology are specified by the user when initializing
  11 an instance of the library.
  12
  13 TRAM is implemented as a Charm++ group, so an \emph{instance}
  14 of TRAM has one object on every PE used in the run. We use
  15 the term \emph{local instance} to denote a member of the TRAM
  16 group on a particular PE.
  17
  18 Most collective communication patterns involve sending linear arrays
  19 of a single data type. In order to more efficiently aggregate and
  20 process data, TRAM restricts the data sent using the
  21 library to a single data type specified by the user through a template
  22 parameter when initializing an instance of the library. We use the
  23 term \emph{data item} to denote a single object of this datatype
  24 submitted to the library for sending. While the library is active
  25 (i.e. after initialization and before termination), an arbitrary
  26 number of data items can be submitted to the library at each PE.
  27
  28 On systems with an underlying grid or torus network topology, it can be
  29 beneficial to configure the virtual topology for TRAM to match the physical
  30 topology of the network. This can easily be accomplished using the Charm++
  31 Topology Manager.
  32
  33 The next two sections explain the routing and aggregation techniques
  34 used in the library.
  35
  36 \subsection{Routing}
  37
  38 Let the variables $j$ and $k$ denote PEs within an N-dimensional
  39 virtual topology of PEs and $x$ denote a dimension of the grid. We
  40 represent the coordinates of $j$ and $k$ within the grid as $\left
  41 (j_0, j_1, \ldots, j_{N-1} \right) $ and $ \left (k_0, k_1, \ldots,
  42 k_{N-1} \right) $.  Also, let
  43 \[
  44 f(x, j, k) =
  45 \begin{cases}
  46 0, & \text{if } j_x = k_x \\
  47 1, & \text{if } j_x \ne k_x
  48 \end{cases}
  49 \]
  50
  51 $j$ and $k$ are \emph{peers} if
  52 \[
  53 \sum_{d=0}^{N-1} f(d, j, k) = 1 .
  54 \]
  55 When using TRAM, PEs communicate directly only with their
  56 peers. Sending to a PE which is not a peer is handled inside the
  57 library by routing the data through one or more \emph{intermediate
  58   destinations} along the route to the \emph{final destination}.
  59
  60 Suppose a data item destined for PE $k$ is submitted to the library at
  61 PE $j$. If $k$ is a peer of $j$, the data item will be sent directly
  62 to $k$, possibly along with other data items for which $k$ is the
  63 final or intermediate destination.  If $k$ is not a peer of $j$, the
  64 data item will be sent to an intermediate destination $m$ along the
  65 route to $k$ whose index is $\left (j_0, j_1, \ldots, j_{i-1}, k_i,
  66 j_{i+1}, \ldots, j_{N-1} \right)$, where $i$ is the greatest value of
  67 $x$ for which $f(x, j, k) = 1$.
  68
  69 Note that in obtaining the coordinates of $m$ from $j$, exactly one of
  70 the coordinates of $j$ which differs from the coordinates of $k$ is
  71 made to agree with $k$. It follows that m is a peer of $j$, and that
  72 using this routing process at $m$ and every subsequent intermediate
  73 destination along the route eventually leads to the data item being
  74 received at $k$. Consequently, the number of messages $F(j, k)$ that
  75 will carry the data item to the destination is
  76 \[
  77 F(j,k) = \sum_{d=0}^{N-1}f(d, j, k) .
  78 \]
  79
  80 \subsection{Aggregation}
  81
  82 Communicating over the network of a parallel machine involves per
  83 message bandwidth and processing overhead. TRAM amortizes
  84 this overhead by aggregating data items at the source and every
  85 intermediate destination along the route to the final destination.
  86
  87 Every local instance of the TRAM group buffers the data items that have been
  88 submitted locally or received from another PE for forwarding. Because only peers
  89 communicate directly in the virtual grid, it suffices to have a single buffer
  90 per PE for every peer. Given a dimension d within the virtual topology, let
  91 $s_d$ denote its \emph{size}, or the number of distinct values a coordinate for
  92 dimension d can take. Consequently, each local instance allocates up to $s_d - 1
  93 $ buffers per dimension, for a total of $\sum_{d=0}^{N-1} (s_d - 1) $
  94 buffers. Note that this is normally significantly less than the total number of
  95 PEs specified by the virtual topology, which is equal to $\prod_{d=0}^{N-1}
  96 {s_d}$.
  97
  98 Sending with TRAM is done by submitting a data item and a
  99 destination identifier, either PE or array index, using a function
 100 call to the local instance.  If the index belongs to a peer, the
 101 library places the data item in the buffer for the peer's
 102 PE. Otherwise, the library calculates the index of the intermediate
 103 destination using the previously described algorithm, and places the
 104 data item in the buffer for the resulting PE, which by design is
 105 always a peer of the local PE. Buffers are sent out immediately
 106 when they become full. When a message is received at an intermediate
 107 destination, the data items comprising it are distributed into the
 108 appropriate buffers for subsequent sending. In the process, if a data
 109 item is determined to have reached its final destination, it is
 110 immediately delivered.
 111
 112 The total buffering capacity specified by the user may be reached even
 113 when no single buffer is completely filled up. In that case the buffer
 114 with the greatest number of buffered data items is sent.
 115
 116 \section{Application User Interface}
 117
 118 A typical usage scenario for TRAM involves a start-up phase followed by one or
 119 more \emph{communication steps}. We next describe the application user interface
 120 and details relevant to usage of the library, which normally follows these
 121 steps:
 122
 123 \begin{enumerate}
 124 \item{\textbf{Start-up}} Creation of a TRAM group and set up of client
 125   arrays and groups
 126 \item{\textbf{Initialization}} Calling an initialization function,
 127   which returns through a callback
 128 \item{\textbf{Sending}} An arbitrary number of sends using the
 129   \code{insertData} function call on the local instance of the library
 130 \item{\textbf{Receiving}} Processing received data items through the
 131   \code{process} function which serves as the delivery
 132   interface for the library and must be defined by the user
 133 \item{\textbf{Termination}} Termination of a communication step
 134 \item{\textbf{Re-initialization}} After termination of a communication step, the
 135   library instance is not active. However, re-initialization using step $2$
 136   leads to a new communication step.
 137 \end{enumerate}
 138
 139 \subsection{Start-Up}
 140
 141 Start-up is typically performed once in a program, often inside the \code{main}
 142 function of the mainchare, and involves creating an aggregator instance. An
 143 instance of TRAM is restricted to sending data items of a single user-specified
 144 type, which we denote by \code{dtype}, to a single user-specified chare array or
 145 group.
 146
 147 \subsubsection{Sending to a Group}
 148 To use TRAM for sending to a group, a \code{GroupMeshStreamer} group should be
 149 created. Either of the following two \code{GroupMeshStreamer} constructors can
 150 be used for that purpose:
 151
 152 \begin{alltt}
 153 template<class dtype, class ClientType, class RouterType>
 154 GroupMeshStreamer<dtype, ClientType, RouterType>::
 155 GroupMeshStreamer(int maxNumDataItemsBuffered,
 156                   int numDimensions,
 157                   int *dimensionSizes,
 158                   CkGroupID clientGID,
 159                   bool yieldFlag = 0,
 160                   double progressPeriodInMs = -1.0);
 161
 162 template<class dtype, class ClientType, class RouterType>
 163 GroupMeshStreamer<dtype, ClientType, RouterType>::
 164 GroupMeshStreamer(int numDimensions,
 165                   int *dimensionSizes,
 166                   CkGroupID clientGID,
 167                   int bufferSize,
 168                   bool yieldFlag = 0,
 169                   double progressPeriodInMs = -1.0);
 170
 171 \end{alltt}
 172
 173 \subsubsection{Sending to a Chare Array}
 174 For sending to a chare array, an \code{ArrayMeshStreamer} group should be
 175 created, which has a similar constructor interface to \code{GroupMeshStreamer}:
 176
 177 \begin{alltt}
 178 template <class dtype, class itype, class ClientType,
 179           class RouterType>
 180 ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::
 181 ArrayMeshStreamer(int maxNumDataItemsBuffered,
 182                   int numDimensions,
 183                   int *dimensionSizes,
 184                   CkArrayID clientAID,
 185                   bool yieldFlag = 0,
 186                   double progressPeriodInMs = -1.0);
 187
 188 template <class dtype, class itype, class ClientType,
 189           class RouterType>
 190 ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::
 191 ArrayMeshStreamer(int numDimensions,
 192                   int *dimensionSizes,
 193                   CkArrayID clientAID,
 194                   int bufferSize,
 195                   bool yieldFlag = 0,
 196                   double progressPeriodInMs = -1.0);
 197
 198 \end{alltt}
 199
 200 Description of parameters:
 201 \begin{itemize}
 202 \item \code{maxNumDataItemsBuffered}: maximum number of items that the
 203   library is allowed to buffer per PE
 204 \item \code{numDimensions}: number of dimensions in grid of PEs
 205 \item \code{dimensionSizes}: array of size \code{numDimensions} containing the
 206   size of each dimension in the grid
 207 \item \code{clientGID}: the group ID for the client group
 208 \item \code{clientAID}: the array ID for the client array
 209 \item \code{bufferSize}: size of the buffer for each peer,
 210   in terms of number of data items
 211 \item \code{yieldFlag}: when true, calls \code{CthYield()} after every $1024$
 212   item insertions; setting it true requires all data items to be submitted from
 213   threaded entry methods. Ensures that pending messages are sent out by the
 214   runtime system when a large number of data items are submitted from a single
 215   entry method.
 216 \item \code{progressPeriodInMs}: number of milliseconds between periodic
 217   progress checks; relevant only when periodic flushing is enabled (see
 218   Section~\ref{sec:tram_termination})
 219 \end{itemize}
 220
 221 Template parameters:
 222 \begin{itemize}
 223 \item \code{dtype}: data item type
 224 \item \code{itype}: index type of client chare array (use \code{int} for
 225   one-dimensional chare arrays and \code{CkArrayIndex} for all other index
 226   types)
 227 \item \code{ClientType}: type of client group or array
 228 \item \code{RouterType}: the routing protocol to be used. The choices are: \\
 229   (1) \code{SimpleMeshRouter} - original grid aggregation scheme; \\
 230   (2) \code{NodeAwareMeshRouter} - base node-aware aggregation scheme; \\
 231   (3) \code{AggressiveNodeAwareMeshRouter} - advanced node-aware aggregation scheme;
 232 \end{itemize}
 233
 234 \subsection{Initialization}
 235
 236 A TRAM instance needs to be initialized before every communication step. There
 237 are currently three main modes of operation, depending on the type of
 238 termination used: \emph{staged completion}, \emph{completion detection}, or
 239 \emph{quiescence detection}. The modes of termination are described later. Here,
 240 we present the interface for initializing a communication step for each of the
 241 three modes.
 242
 243 When using completion detection, each local instance of TRAM must be initialized
 244 using the following variant of the overloaded \code{init} function:
 245
 246 \begin{alltt}
 247 template <class dtype, class RouterType>
 248 void MeshStreamer<dtype, RouterType>::
 249 init(int numContributors,
 250      CkCallback startCb,
 251      CkCallback endCb,
 252      CProxy_CompletionDetector detector,
 253      int prio,
 254      bool usePeriodicFlushing);
 255 \end{alltt}
 256
 257 Description of parameters:
 258
 259 \begin{itemize}
 260 \item \code{numContributors}: number of \code{done} calls expected globally
 261   before termination of this communication step
 262 \item \code{startCb}: callback to be invoked by the library after
 263   initialization is complete
 264 \item \code{endCb}: callback to be invoked by the library after termination
 265   of this communication step
 266 \item \code{detector}: an inactive \code{CompletionDetector} object to be used by TRAM
 267 \item \code{prio}: Charm++ priority to be used for messages sent using TRAM in
 268   this communication step
 269 \item \code{usePeriodicFlushing}: specifies whether periodic flushing should
 270   be used for this communication step
 271 \end{itemize}
 272
 273 When using staged completion, a completion detector object is not required as
 274 input, as the library performs its own specialized form of termination. In this
 275 case, each local instance of TRAM must be initialized using a different
 276 interface for the overloaded \code{init} function:
 277
 278 \begin{alltt}
 279 template <class dtype, class RouterType>
 280 void MeshStreamer<dtype, RouterType>::
 281 init(int numLocalContributors,
 282      CkCallback startCb,
 283      CkCallback endCb,
 284      int prio,
 285      bool usePeriodicFlushing);
 286
 287 \end{alltt}
 288
 289 Note that \code{numLocalContributors} denotes the local number of \code{done}
 290 calls expected, rather than the global as in the first interface of \code{init}.
 291
 292 A common case is to have a single chare array perform all the sends in a
 293 communication step, with each element of the array as a contributor. For this
 294 case there is a special version of \code{init} that takes as input the
 295 \code{CkArrayID} object for the chare array that will perform the sends,
 296 precluding the need to manually determine the number of client chares per PE:
 297
 298 \begin{alltt}
 299 template <class dtype, class RouterType>
 300 void MeshStreamer<dtype, RouterType>::
 301 init(CkArrayID senderArrayID,
 302      CkCallback startCb,
 303      CkCallback endCb,
 304      int prio,
 305      bool usePeriodicFlushing);
 306 \end{alltt}
 307
 308 The \code{init} interface for using quiescence detection is:
 309
 310 \begin{alltt}
 311 template <class dtype, class RouterType>
 312 void MeshStreamer<dtype, RouterType>::init(CkCallback startCb,
 313                                            int prio);
 314 \end{alltt}
 315
 316 After initialization is finished, the system invokes \code{startCb},
 317 signaling to the user that the library is ready to accept data items
 318 for sending.
 319 \\
 320
 321 %\textbf{proposed addition: add a non-streaming all-to-all interface,
 322 %  which should not require an init call}
 323
 324 \subsection{Sending}
 325
 326 Sending with TRAM is done through calls to \code{insertData} and
 327 \code{broadcast}.
 328
 329 \begin{alltt}
 330 template <class dtype, class RouterType>
 331 void MeshStreamer<dtype, RouterType>::
 332 insertData(const dtype& dataItem,
 333            int destinationPe);
 334
 335 template <class dtype, class itype, class ClientType,
 336           class RouterType>
 337 void ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::
 338 insertData(const dtype& dataItem,
 339            itype arrayIndex);
 340
 341 template <class dtype, class RouterType>
 342 void MeshStreamer<dtype, RouterType>::
 343 broadcast(const dtype& dataItem);
 344 \end{alltt}
 345
 346 \begin{itemize}
 347 \item \code{dataItem}: reference to a data item to be sent
 348 \item \code{destinationPe}: index of destination PE
 349 \item \code{arrayIndex}: index of destination array element
 350 \end{itemize}
 351
 352 Broadcasting has the effect of delivering the data item:
 353 \begin{itemize}
 354 \item {once on every PE involved in the computation for \code{GroupMeshStreamer}}
 355 \item {once for every array element involved in the computation for
 356        \code{ArrayMeshStreamer}}
 357 \end{itemize}
 358
 359 %\textbf{proposed addition: add an interface for sending an array of
 360 %  data items}
 361
 362 \subsection{Receiving}
 363
 364 To receive data items sent using TRAM, the user must define
 365 the \code{process} function for each client group and array:
 366
 367 \begin{alltt}
 368 void process(const dtype  &ran);
 369 \end{alltt}
 370
 371 Each item is delivered by the library using a separate call to \code{process} on
 372 the destination PE. The call is made locally, so process should not be an entry
 373 method.
 374
 375 \subsection{Termination}
 376 \label{sec:tram_termination}
 377
 378 Flushing and termination mechanisms are used in TRAM to prevent deadlock due to
 379 indefinite buffering of items. Flushing works by sending out all buffers in a
 380 local instance if no items have been submitted or received since the last
 381 progress check. Meanwhile, termination detection is used to send out partially
 382 filled buffers at the end of a communication step after it has been determined
 383 that no additional items will be submitted.
 384
 385 Currently, three means of termination are supported: staged completion,
 386 completion detection, and quiescence detection. Periodic flushing is a secondary
 387 mechanism which can be enabled or disabled when initiating one of the primary
 388 mechanisms.
 389
 390 Termination typically requires the user to issue a number of calls to the
 391 \code{done} function:
 392 \begin{alltt}
 393 template <class dtype, class RouterType>
 394 void MeshStreamer<dtype, RouterType>::
 395 done(int numContributorsFinished = 1);
 396 \end{alltt}
 397 When using completion detection, the number of done calls that are expected
 398 globally by the TRAM instance is specified using the \code{numContributors}
 399 parameter to \code{init}. Safe termination requires that no calls to
 400 \code{insertData} or \code{broadcast} are made after the last call to
 401 \code{done} is performed globally. Because order of execution is uncertain in
 402 parallel applications, some care is required to ensure the above condition is
 403 met. A simple way to terminate safely is to set \code{numContributors} equal to
 404 the number of senders, and call done once for each sender that is done
 405 submitting items.
 406
 407 In contrast to using completion detection, using staged completion involves
 408 setting the local number of expected calls to \code{done} using the
 409 \code{numLocalContributors} parameter in the \code{init} function. To ensure
 410 safe termination, no \code{insertData} or \code{broadcast} calls should be made
 411 on any PE where \code{done} has been called the expected number of times.
 412
 413 Another version of \code{init} for staged completion, which takes a
 414 \code{CkArrayID} object as an argument, provides a simplified interface in the
 415 common case when a single chare array performs all the sends within a
 416 communication step, with each of its elements as a contributor. For this version
 417 of \code{init}, TRAM determines the appropriate number of local contributors
 418 automatically. It also correctly handles the case of PEs without any
 419 contributors by immediately marking those PEs as having finished the
 420 communication step. As such, this version of \code{init} should be preferred by
 421 the user when applicable.
 422
 423 Staged completion is not supported when array location data is not guaranteed to
 424 be correct, as this can potentially violate the termination conditions used to
 425 guarantee successful termination. In order to guarantee correct location data in
 426 applications that use load balancing, Charm++ must be compiled with
 427 \code{-DCMK\underline{\hspace{.2cm}}GLOBAL\underline{\hspace{.2cm}}LOCATION\underline{\hspace{.2cm}}UPDATE},
 428 which has the effect of performing a global broadcast of location data for chare
 429 array elements that migrate during load balancing. Unfortunately, this operation
 430 is expensive when migrating large numbers of elements. As an alternative,
 431 completion detection and quiescence detection modes will work properly without
 432 the global location update mechanism, and even in the case of anytime migration.
 433
 434 When using quiescence detection, no end callback is used, and no \code{done}
 435 calls are required. Instead, termination of a communication step is achieved
 436 using the quiescence detection framework in \charm{}, which supports passing a
 437 callback as parameter. TRAM is set up such that quiescence will not be detected
 438 until all items sent in the current communication step have been delivered to
 439 their final destinations.
 440
 441 The choice of which termination mechanism to use is left to the user. Using
 442 completion detection mode is more convenient when the global number of
 443 contributors is known, while staged completion is easier to use if the local
 444 number of contributors can be determined with ease, or if sending is done from
 445 the elements of a chare array. If either mode can be used with ease, staged
 446 completion should be preferred. Unlike the other mechanisms, staged completion
 447 does not involve persistent background communication to determine when the
 448 global number of expected \code{done} calls is reached. Staged completion is
 449 also generally faster at reaching termination due to not being dependent on
 450 periodic progress checks. Unlike completion detection, staged completion does
 451 incur a small bandwidth overhead ($4$ bytes) for every TRAM message, but in
 452 practice this is more than offset by the persistent traffic incurred by
 453 completion detection.
 454
 455 Periodic flushing is an auxiliary mechanism which checks at a regular interval
 456 whether any sends have taken place since the last time the check was
 457 performed. If not, the mechanism sends out all the data items buffered per local
 458 instance of the library. The period is specified by the user in the TRAM
 459 constructor. A typical use case for periodic flushing is when the submission of
 460 a data item B to TRAM happens as a result of the delivery of another data item A
 461 sent using the same TRAM instance. If A is buffered inside the library and
 462 insufficient data items are submitted to cause the buffer holding A to be sent
 463 out, a deadlock could arise. With the periodic flushing mechanism, the buffer
 464 holding A is guaranteed to be sent out eventually, and deadlock is
 465 prevented. Periodic flushing is required when using the completion detection or
 466 quiescence detection termination modes.
 467
 468 \subsection{Re-initialization}
 469
 470 A TRAM instance that has terminated cannot be used for sending more data items
 471 until it has been re-initialized. Re-initialization is achieved by calling
 472 \code{init}, which prepares the instance of the library for a new communication
 473 step. Re-initialization is useful for iterative applications, where it is often
 474 convenient to have a single communication step per iteration of the application.
 475
 476 \subsection{Charm++ Registration of Templated Classes}
 477
 478 Due to the use of templates in TRAM, the library template instances must be
 479 explicitly registered with the Charm++ runtime by the user of the library. This
 480 must be done in the \code{.ci} file for the application, and typically involves
 481 three steps.
 482
 483 For \code{GroupMeshStreamer} template instances, registration is done as
 484 follows:
 485 \begin{itemize}
 486 \item{Registration of the message type:}
 487 \begin{alltt}
 488 message MeshStreamerMessage<dtype>;
 489 \end{alltt}
 490 \item{Registration of the base aggregator class}
 491 \begin{alltt}
 492 group MeshStreamer<dtype, RouterType>;
 493 \end{alltt}
 494 \item{Registration of the derived aggregator class}
 495 \begin{alltt}
 496 group GroupMeshStreamer<dtype, ClientType, RouterType>;
 497 \end{alltt}
 498 \end{itemize}
 499
 500 For \code{ArrayMeshStreamer} template instances, registration is done as
 501 follows:
 502 \begin{itemize}
 503 \item{Registration of the message type:}
 504 \begin{alltt}
 505 message MeshStreamerMessage<ArrayDataItem<dtype, itype> >;
 506 \end{alltt}
 507 \item{Registration of the base aggregator class}
 508 \begin{alltt}
 509 group MeshStreamer<ArrayDataItem<dtype, itype>,
 510                    RouterType>;
 511 \end{alltt}
 512 \item{Registration of the derived aggregator class}
 513 \begin{alltt}
 514 group ArrayMeshStreamer<dtype, itype, ClientType,
 515                         RouterType>;
 516 \end{alltt}
 517 \end{itemize}
 518
 519 \section{Example}
 520
 521 For example code showing how to use TRAM, see examples/charm++/TRAM and
 522 tests/charm++/streamingAllToAll in the Charm++ repository.
 523
 524 \thispagestyle{empty}
 525 \pagestyle{empty}