doc/libraries/manual.rst

   1 ==============================
   2 Converse and Charm++ Libraries
   3 ==============================
   4
   5 .. contents::
   6    :depth: 3
   7
   8 Introduction
   9 ============
  10
  11 This manual describes Charm++ and Converse libraries. This is a work in
  12 progress towards a standard library for parallel programming on top of
  13 the Converse and Charm++ system. All of these libraries are included in
  14 the source and binary distributions of Charm++/Converse.
  15
  16 liveViz Library
  17 ===============
  18
  19 .. _introduction-1:
  20
  21 Introduction
  22 ------------
  23
  24 If array elements compute a small piece of a large 2D image, then these
  25 image chunks can be combined across processors to form one large image
  26 using the liveViz library. In other words, liveViz provides a way to
  27 reduce 2D-image data, which combines small chunks of images deposited by
  28 chares into one large image.
  29
  30 This visualization library follows the client server model. The server,
  31 a parallel Charm++ program, does all image assembly, and opens a network
  32 (CCS) socket which clients use to request and download images. The
  33 client is a small Java program. A typical use of this is:
  34
  35 .. code-block:: bash
  36
  37         cd charm/examples/charm++/wave2d
  38         make
  39         ./charmrun ./wave2d +p2 ++server ++server-port 1234
  40         ~/ccs_tools/bin/liveViz localhost 1234
  41
  42 Use git to obtain a copy of ccs_tools (prior to using liveViz) and build
  43 it by:
  44
  45 .. code-block:: bash
  46
  47          cd ccs_tools;
  48          ant;
  49
  50 How to use liveViz with Charm++ program
  51 ---------------------------------------
  52
  53 The liveViz routines are in the Charm++ header “liveViz.h”.
  54
  55 A typical program provides a chare array with one entry method with the
  56 following prototype:
  57
  58 .. code-block:: c++
  59
  60      entry void functionName(liveVizRequestMsg *m);
  61
  62 This entry method is supposed to deposit its (array element’s) chunk of
  63 the image. This entry method has following structure:
  64
  65 .. code-block:: c++
  66
  67      void myArray::functionName (liveVizRequestMsg *m)
  68      {
  69        // prepare image chunk
  70           ...
  71
  72        liveVizDeposit (m, startX, startY, width, height, imageBuff, this);
  73
  74        // delete image buffer if it was dynamically allocated
  75      }
  76
  77 Here, “width” and “height” are the size, in pixels, of this array
  78 element’s portion of the image, contributed in “imageBuff” (described
  79 below). This will show up on the client’s assembled image at 0-based
  80 pixel (startX,startY). The client’s display width and height are stored
  81 in m->req.wid and m->req.ht.
  82
  83 By default, liveViz combines image chunks by doing a saturating sum of
  84 overlapping pixel values. If you want liveViz to combine image chunks by
  85 using max (i.e. for overlapping pixels in deposited image chunks, final
  86 image will have the pixel with highest intensity or in other words
  87 largest value), you need to pass one more parameter (liveVizCombine_t)
  88 to the “liveVizDeposit” function:
  89
  90 .. code-block:: c++
  91
  92     liveVizDeposit (m, startX, startY, width, height, imageBuff, this,
  93                     max_image_data);
  94
  95 You can also reduce floating-point image data using sum_float_image_data
  96 or max_float_image_data.
  97
  98 Format of deposit image
  99 -----------------------
 100
 101 “imageBuff” is run of bytes representing a rectangular portion of the
 102 image. This buffer represents image using a row-major format, so 0-based
 103 pixel (x,y) (x increasing to the right, y increasing downward in typical
 104 graphics fashion) is stored at array offset “x+y*width”.
 105
 106 If the image is gray-scale (as determined by liveVizConfig, below), each
 107 pixel is represented by one byte. If the image is color, each pixel is
 108 represented by 3 consecutive bytes representing red, green, and blue
 109 intensity.
 110
 111 If the image is floating-point, each pixel is represented by a single
 112 ‘float’, and after assembly colorized by calling the user-provided
 113 routine below. This routine converts fully assembled ‘float’ pixels to
 114 RGB 3-byte pixels, and is called only on processor 0 after each client
 115 request.
 116
 117 .. code-block:: c++
 118
 119   extern "C"
 120   void liveVizFloatToRGB(liveVizRequest &req,
 121       const float *floatSrc, unsigned char *destRgb,
 122       int nPixels);
 123
 124 liveViz Initialization
 125 ----------------------
 126
 127 liveViz library needs to be initialized before it can be used for
 128 visualization. For initialization follow the following steps from your
 129 main chare:
 130
 131 #. Create your chare array (array proxy object ’a’) with the entry
 132    method ’functionName’ (described above). You must create the chare
 133    array using a CkArrayOptions ’opts’ parameter. For instance,
 134
 135    .. code-block:: c++
 136
 137         CkArrayOptions opts(rows, cols);
 138         array = CProxy_Type::ckNew(opts);
 139
 140 #. Create a CkCallback object (’c’), specifying ’functionName’ as the
 141    callback function. This callback will be invoked whenever the client
 142    requests a new image.
 143
 144 #. Create a liveVizConfig object (’cfg’). LiveVizConfig takes a number
 145    of parameters, as described below.
 146
 147 #. Call liveVizInit (cfg, a, c, opts).
 148
 149 The liveVizConfig parameters are:
 150
 151 -  The first parameter is the pixel type to be reduced:
 152
 153    -  “false” or liveVizConfig::pix_greyscale means a greyscale image (1
 154       byte per pixel).
 155
 156    -  “true” or liveVizConfig::pix_color means a color image (3 RGB
 157       bytes per pixel).
 158
 159    -  liveVizConfig::pix_float means a floating-point color image (1
 160       float per pixel, can only be used with sum_float_image_data or
 161       max_float_image_data).
 162
 163 -  The second parameter is the flag “serverPush”, which is passed to the
 164    client application. If set to true, the client will repeatedly
 165    request for images. When set to false the client will only request
 166    for images when its window is resized and needs to be updated.
 167
 168 -  The third parameter is an optional 3D bounding box (type CkBbox3d).
 169    If present, this puts the client into a 3D visualization mode.
 170
 171 A typical 2D, RGB, non-push call to liveVizConfig looks like this:
 172
 173 .. code-block:: c++
 174
 175       liveVizConfig cfg(true,false);
 176
 177 Compilation
 178 -----------
 179
 180 A Charm++ program that uses liveViz must be linked with ’-module
 181 liveViz’.
 182
 183 Before compiling a liveViz program, the liveViz library may need to be
 184 compiled. To compile the liveViz library:
 185
 186 -  go to .../charm/tmp/libs/ck-libs/liveViz
 187
 188 -  make
 189
 190 Poll Mode
 191 ---------
 192
 193 In some cases you may want a server to deposit images only when it is
 194 ready to do so. For this case the server will not register a callback
 195 function that triggers image generation, but rather the server will
 196 deposit an image at its convenience. For example a server may want to
 197 create a movie or series of images corresponding to some timesteps in a
 198 simulation. The server will have a timestep loop in which an array
 199 computes some data for a timestep. At the end of each iteration the
 200 server will deposit the image. The use of LiveViz’s Poll Mode supports
 201 this type of server generation of images.
 202
 203 Poll Mode contains a few significant differences to the standard mode.
 204 First we describe the use of Poll Mode, and then we will describe the
 205 differences. liveVizPoll must get control during the creation of your
 206 array, so you call liveVizPollInit with no parameters.
 207
 208 .. code-block:: c++
 209
 210         liveVizPollInit();
 211         CkArrayOptions opts(nChares);
 212         arr = CProxy_lvServer::ckNew(opts);
 213
 214 To deposit an image, the server just calls liveVizPollDeposit. The
 215 server must take care not to generate too many images, before a client
 216 requests them. Each server generated image is buffered until the client
 217 can get the image. The buffered images will be stored in memory on
 218 processor 0.
 219
 220 .. code-block:: c++
 221
 222      liveVizPollDeposit(this,
 223                         startX,startY,            // Location of local piece
 224                         localSizeX,localSizeY,    // Dimensions of the piece I'm depositing
 225                         globalSizeX,globalSizeY,  // Dimensions of the entire image
 226                         img,                      // Image byte array
 227                         sum_image_data,           // Desired image combiner
 228                         3                         // Bytes/pixel
 229                        );
 230
 231 The last two parameters are optional. By default they are set to
 232 sum_image_data and 3 bytes per pixel.
 233
 234 A sample liveVizPoll server and client are available at:
 235
 236 .. code-block:: none
 237
 238               .../charm/examples/charm++/lvServer
 239               .../ccs_tools/bin/lvClient
 240
 241 This example server uses a PythonCCS command to cause an image to be
 242 generated by the server. The client also then gets the image.
 243
 244 LiveViz provides multiple image combiner types. Any supported type can
 245 be used as a parameter to liveVizPollDeposit. Valid combiners include:
 246 sum_float_image_data, max_float_image_data, sum_image_data, and
 247 max_image_data.
 248
 249 The differences in Poll Mode may be apparent. There is no callback
 250 function which causes the server to generate and deposit an image.
 251 Furthermore, a server may generate an image before or after a client has
 252 sent a request. The deposit function, therefore is more complicated, as
 253 the server will specify information about the image that it is
 254 generating. The client will no longer specify the desired size or other
 255 configuration options, since the server may generate the image before
 256 the client request is available to the server. The liveVizPollInit call
 257 takes no parameters.
 258
 259 The server should call Deposit with the same global size and combiner
 260 type on all of the array elements which correspond to the “this”
 261 parameter.
 262
 263 The latest version of liveVizPoll is not backwards compatable with older
 264 versions. The old version had some fundamental problems which would
 265 occur if a server generated an image before a client requested it. Thus
 266 the new version buffers server generated images until requested by a
 267 client. Furthermore the client requests are also buffered if they arrive
 268 before the server generates the images. Problems could also occur during
 269 migration with the old version.
 270
 271 Caveats
 272 -------
 273
 274 If you use the old version of “liveVizInit" method that only receives 3
 275 parameters, you will find a known bug caused by how “liveVizDeposit”
 276 internally uses a reduction to build the image.
 277
 278 Using that version of the “liveVizInit" method, its contribute call is
 279 handled as if it were the chare calling “liveVizDeposit” that actually
 280 contributed to the liveViz reduction. If there is any other reduction
 281 going on elsewhere in this chare, some liveViz contribute calls might be
 282 issued before the corresponding non-liveViz contribute is reached. This
 283 would imply that image data would be treated as if were part of the
 284 non-liveViz reduction, leading to unexpected behavior potentially
 285 anywhere in the non-liveViz code.
 286
 287 Multi-phase Shared Arrays Library
 288 =================================
 289
 290 The Multiphase Shared Arrays (MSA) library provides a specialized shared
 291 memory abstraction in Charm++ that provides automatic memory management.
 292 Explicitly shared memory provides the convenience of shared memory
 293 programming while exposing the performance issues to programmers and the
 294 “intelligent” ARTS.
 295
 296 Each MSA is accessed in one specific mode during each phase of
 297 execution: ``read-only`` mode, in which any thread can read any element
 298 of the array; ``write-once`` mode, in which each element of the array is
 299 written to (possibly multiple times) by at most one worker thread, and
 300 no reads are allowed and ``accumulate`` mode, in which any threads can
 301 add values to any array element, and no reads or writes are permitted. A
 302 ``sync`` call is used to denote the end of a phase.
 303
 304 We permit multiple copies of a page of data on different processors and
 305 provide automatic fetching and caching of remote data. For example,
 306 initially an array might be put in ``write-once`` mode while it is
 307 populated with data from a file. This determines the cache behavior and
 308 the permitted operations on the array during this phase. ``write-once``
 309 means every thread can write to a different element of the array. The
 310 user is responsible for ensuring that two threads do not write to the
 311 same element; the system helps by detecting violations. From the cache
 312 maintenance viewpoint, each page of the data can be over-written on it’s
 313 owning processor without worrying about transferring ownership or
 314 maintaining coherence. At the ``sync``, the data is simply merged.
 315 Subsequently, the array may be ``read-only`` for a while, thereafter
 316 data might be ``accumulate``\ ’d into it, followed by it returning to
 317 ``read-only`` mode. In the ``accumulate`` phase, each local copy of the
 318 page on each processor could have its accumulations tracked
 319 independently without maintaining page coherence, and the results
 320 combined at the end of the phase. The ``accumulate`` operations also
 321 include set-theoretic union operations, i.e. appending items to a set of
 322 objects would also be a valid ``accumulate`` operation. User-level or
 323 compiler-inserted explicit ``prefetch`` calls can be used to improve
 324 performance.
 325
 326 A software engineering benefit that accrues from the explicitly shared
 327 memory programming paradigm is the (relative) ease and simplicity of
 328 programming. No complex, buggy data-distribution and messaging
 329 calculations are required to access data.
 330
 331 To use MSA in a Charm++ program:
 332
 333 -  build Charm++ for your architecture, e.g. ``netlrts-linux``.
 334
 335 -  ``cd charm/netlrts-linux/tmp/libs/ck-libs/multiphaseSharedArrays/; make``
 336
 337 -  ``#include “msa/msa.h”`` in your header file.
 338
 339 -  Compile using ``charmc`` with the option ``-module msa``
 340
 341 The API is as follows: See the example programs in
 342 ``charm/pgms/charm++/multiphaseSharedArrays``.
 343
 344 3D FFT Library
 345 ==============
 346
 347 The previous 3D FFT library has been deprecated and replaced with this
 348 new 3D FFT library. The new 3D FFT library source can be downloaded with
 349 following command: *git clone
 350 https://charm.cs.illinois.edu/gerrit/libs/fft*
 351
 352 Introduction and Motivation
 353 ---------------------------
 354
 355 The 3D Charm-FFT library provides an interface to do parallel 3D FFT
 356 computation in a scalable fashion.
 357
 358 The parallelization is achieved by splitting the 3D transform into three
 359 phases, using 2D decomposition. First, 1D FFTs are computed over the
 360 pencils; then a ’transform’ is performed and 1D FFTs are done over
 361 second dimension; again a ’transform’ is performed and FFTs are computed
 362 over the last dimension. So this approach takes three computation phases
 363 and two ’transform’ phases.
 364
 365 This library allows users to create multiple instances of the library
 366 and perform concurrent FFTs using them. Each of the FFT instances run in
 367 background as other parts of user code execute, and a callback is
 368 invoked when FFT is complete.
 369
 370 Features
 371 --------
 372
 373 Charm-FFT library provides the following features:
 374
 375 -  *2D-decomposition*: Users can define fine-grained 2D-decomposition
 376    that increases the amount of available parallelism and improves
 377    network utilization.
 378
 379 -  *Cutoff-based smaller grid*: The data grid may have a cut off.
 380    Charm-FFT improves performance by avoiding communication and
 381    computation of the data beyond the cutoff.
 382
 383 -  *User-defined mapping of library objects*: The placement of objects
 384    that constitute the library instance can be defined by the user based
 385    on the application’s other concurrent communication and placement of
 386    other objects.
 387
 388 -  *Overlap with other computational work*: Given the callback-based
 389    interface and Charm++’s asynchrony, the FFTs are performed in the
 390    background while other application work can be done in parallel.
 391
 392 Compilation and Execution
 393 -------------------------
 394
 395 To install the FFT library, you will need to have charm++ installed in
 396 you system. You can follow the Charm++ manual to do that. Then, ensure
 397 that FFTW3 is installed. FFTW3 can be downloaded from
 398 *http://www.fftw.org*.  The Charm-FFT library source can be downloaded
 399 with following command: *git clone
 400 https://charm.cs.illinois.edu/gerrit/libs/fft*
 401
 402 Inside of Charm-FFT directory, you will find *Makefile.default*. Copy
 403 this file to *Makefile.common*, change the copy’s variable *FFT3_HOME*
 404 to point your FFTW3 installation and *CHARM_DIR* to point your Charm++
 405 installation then run *make*.  To use Charm-FFT library in an
 406 application, add the line *extern module fft_Charm;* to it charm
 407 interface (.ci) file and include *fft_charm.h* and *fftw3.h* in relevant
 408 C files. Finally to compile the program, pass *-lfft_charm* and -lfftw3
 409 as arguments to *charmc*.
 410
 411 Library Interface
 412 -----------------
 413
 414 To use Charm-FFT interface, the user must start by calling
 415 *Charm_createFFT* with following parameters.
 416
 417 .. code-block:: none
 418
 419        Charm_createFFT(N_x, N_y, N_z, z_x, z_y, y_x, y_z, x_yz, cutoff, hmati, fft_type, CkCallback);
 420
 421        Where:
 422        int N_x : X dimension of FFT calculation
 423        int N_y : Y dimension of FFT calculation
 424        int N_z : Z dimension of FFT calculation
 425        int z_x : X dimension of Z pencil chare array
 426        int z_y : Y dimension of Z pencil chare array
 427        int y_x : X dimension of Y pencil chare array
 428        int y_z : Z dimension of Y pencil chare array
 429        int x_yz: A dimension of X pencil chare array
 430        double cutoff: Cutoff of FFT grid
 431        double *hmati: Hamiltonian matrix representing cutoff
 432        FFT_TYPE: Type of FFT to perform. Either CC for complex-to-complex or RC for real-complex
 433        CkCallback: A Charm++ entry method for callback upon the completion of library initialization
 434
 435 This creates necessary proxies (Z,Y,X etc) for performing FFT of size
 436 :math:`N_x \times N_y * N_z` using 2D chare arrays (pencils) of size
 437 :math:`n_y \times n_x` (ZPencils), :math:`n_z \times n_x` (YPencils),
 438 and :math:`n_x \times n_y` (XPencils). When done, calls
 439 :math:`myCallback` which should receive :math:`CProxy\_fft2d\ id` as a
 440 unique identifier for the newly created set of proxies.
 441
 442 An example of Charm-FFT initialization using Charm_createFFT:
 443
 444 .. code-block:: c++
 445
 446   // .ci
 447   extern module fft_charm;
 448
 449   mainchare Main {
 450       entry Main(CkArgMsg *m);
 451   }
 452
 453   group Driver {
 454       entry Driver(FFT_Type fft_type);
 455       entry void proxyCreated(idMsg *msg);
 456       entry void fftDone();
 457   }
 458
 459   // .C
 460   Main::Main(CkArgMsg *m) {
 461       ...
 462       /* Assume FFT of size N_x, N_y, N_z */
 463       FFT_Type fft_type = CC
 464
 465       Charm_createFFT(N_x, N_y, N_z, z_x, z_y, y_x, y_z, x_yz, cutoff, hmati,
 466                       fft_type, CkCallback(CkIndex_Driver::proxyCreated(NULL), driverProxy));
 467   }
 468
 469   Driver::proxyCreated(idMsg *msg) {
 470       CProxy_fft2d fftProxy = msg->id;
 471       delete msg;
 472   }
 473
 474 In this example, an entry method *Driver::proxyCreated* will be called
 475 when an FFT instance has been created.
 476
 477 Using the newly received proxy, the user can identify whether a local PE
 478 has XPencils and/or ZPencils.
 479
 480 .. code-block:: c++
 481
 482        void Driver::proxyCreated(idMsg *msg) {
 483          CProxy_fft2d fftProxy = msg->id;
 484
 485          delete msg;
 486
 487          bool hasX = Charm_isOutputPE(fftProxy),
 488               hasZ = Charm_isInputPE(fftProxy);
 489
 490          ...
 491        }
 492
 493 Then, the grid’s dimensions on a PE can be acquired by using
 494 *Charm_getOutputExtents* and *Charm_getInputExtents*.
 495
 496 .. code-block:: c++
 497
 498        if (hasX) {
 499          Charm_getOutputExtents(gridStart[MY_X], gridEnd[MY_X],
 500                                gridStart[MY_Y], gridEnd[MY_Y],
 501                                gridStart[MY_Z], gridEnd[MY_Z],
 502                                fftProxy);
 503        }
 504
 505        if (hasZ) {
 506          Charm_getInputExtents(gridStart[MY_X], gridEnd[MY_X],
 507                                gridStart[MY_Y], gridEnd[MY_Y],
 508                                gridStart[MY_Z], gridEnd[MY_Z],
 509                                fftProxy);
 510        }
 511
 512        for(int i = 0; i < 3; i++) {
 513          gridLength[i] = gridEnd[i] - gridStart[i];
 514        }
 515
 516 With the grid’s dimension, the user must allocate and set the input and
 517 output buffers. In most cases, this is simply the product of the three
 518 dimensions, but for real-to-complex FFT calcaultion, FFTW-style storage
 519 for the input buffers is used (as shown below).
 520
 521 .. code-block:: c++
 522
 523        dataSize = gridLength[MY_X] * gridLength[MY_Y] * gridLength[MY_Z];
 524
 525        if (hasX) {
 526          dataOut = (complex*) fftw_malloc(dataSize * sizeof(complex));
 527
 528          Charm_setOutputMemory((void*) dataOut, fftProxy);
 529        }
 530
 531        if (hasZ) {
 532          if (fftType == RC) {
 533            // FFTW style storage
 534            dataSize = gridLength[MY_X] * gridLength[MY_Y] * (gridLength[MY_Z]/2 + 1);
 535          }
 536
 537          dataIn = (complex*) fftw_malloc(dataSize * sizeof(complex));
 538
 539          Charm_setInputMemory((void*) dataIn, fftProxy);
 540        }
 541
 542 Then, from *PE0*, start the forward or backward FFT, setting the entry
 543 method *fftDone* as the callback function that will be called when the
 544 FFT operation is complete.
 545
 546 For forward FFT
 547
 548 .. code-block:: c++
 549
 550        if (CkMyPe() == 0) {
 551            Charm_doForwardFFT(CkCallback(CkIndex_Driver::fftDone(), thisProxy), fftProxy);
 552        }
 553
 554 For backward FFT
 555
 556 .. code-block:: c++
 557
 558        if (CkMyPe() == 0) {
 559            Charm_doBackwardFFT(CkCallback(CkIndex_Driver::fftDone(), thisProxy), fftProxy);
 560        }
 561
 562 The sample program to run a backward FFT can be found in
 563 *Your_Charm_FFT_Path/tests/simple_tests*
 564
 565
 566 TRAM
 567 ====
 568
 569 Overview
 570 --------
 571
 572 Topological Routing and Aggregation Module is a library for optimization
 573 of many-to-many and all-to-all collective communication patterns in
 574 Charm++ applications. The library performs topological routing and
 575 aggregation of network communication in the context of a virtual grid
 576 topology comprising the Charm++ Processing Elements (PEs) in the
 577 parallel run. The number of dimensions and their sizes within this
 578 topology are specified by the user when initializing an instance of the
 579 library.
 580
 581 TRAM is implemented as a Charm++ group, so an *instance* of TRAM has one
 582 object on every PE used in the run. We use the term *local instance* to
 583 denote a member of the TRAM group on a particular PE.
 584
 585 Most collective communication patterns involve sending linear arrays of
 586 a single data type. In order to more efficiently aggregate and process
 587 data, TRAM restricts the data sent using the library to a single data
 588 type specified by the user through a template parameter when
 589 initializing an instance of the library. We use the term *data item* to
 590 denote a single object of this datatype submitted to the library for
 591 sending. While the library is active (i.e. after initialization and
 592 before termination), an arbitrary number of data items can be submitted
 593 to the library at each PE.
 594
 595 On systems with an underlying grid or torus network topology, it can be
 596 beneficial to configure the virtual topology for TRAM to match the
 597 physical topology of the network. This can easily be accomplished using
 598 the Charm++ Topology Manager.
 599
 600 The next two sections explain the routing and aggregation techniques
 601 used in the library.
 602
 603 Routing
 604 ~~~~~~~
 605
 606 Let the variables :math:`j` and :math:`k` denote PEs within an
 607 N-dimensional virtual topology of PEs and :math:`x` denote a dimension
 608 of the grid. We represent the coordinates of :math:`j` and :math:`k`
 609 within the grid as :math:`\left
 610 (j_0, j_1, \ldots, j_{N-1} \right)` and :math:`\left (k_0, k_1, \ldots,
 611 k_{N-1} \right)`. Also, let
 612
 613 .. math::
 614
 615    f(x, j, k) =
 616    \begin{cases}
 617    0, & \text{if } j_x = k_x \\
 618    1, & \text{if } j_x \ne k_x
 619    \end{cases}
 620
 621 :math:`j` and :math:`k` are *peers* if
 622
 623 .. math:: \sum_{d=0}^{N-1} f(d, j, k) = 1 .
 624
 625 When using TRAM, PEs communicate directly only with their peers. Sending
 626 to a PE which is not a peer is handled inside the library by routing the
 627 data through one or more *intermediate destinations* along the route to
 628 the *final destination*.
 629
 630 Suppose a data item destined for PE :math:`k` is submitted to the
 631 library at PE :math:`j`. If :math:`k` is a peer of :math:`j`, the data
 632 item will be sent directly to :math:`k`, possibly along with other data
 633 items for which :math:`k` is the final or intermediate destination. If
 634 :math:`k` is not a peer of :math:`j`, the data item will be sent to an
 635 intermediate destination :math:`m` along the route to :math:`k` whose
 636 index is :math:`\left (j_0, j_1, \ldots, j_{i-1}, k_i,
 637 j_{i+1}, \ldots, j_{N-1} \right)`, where :math:`i` is the greatest value
 638 of :math:`x` for which :math:`f(x, j, k) = 1`.
 639
 640 Note that in obtaining the coordinates of :math:`m` from :math:`j`,
 641 exactly one of the coordinates of :math:`j` which differs from the
 642 coordinates of :math:`k` is made to agree with :math:`k`. It follows
 643 that m is a peer of :math:`j`, and that using this routing process at
 644 :math:`m` and every subsequent intermediate destination along the route
 645 eventually leads to the data item being received at :math:`k`.
 646 Consequently, the number of messages :math:`F(j, k)` that will carry the
 647 data item to the destination is
 648
 649 .. math:: F(j,k) = \sum_{d=0}^{N-1}f(d, j, k) .
 650
 651 Aggregation
 652 ~~~~~~~~~~~
 653
 654 Communicating over the network of a parallel machine involves per
 655 message bandwidth and processing overhead. TRAM amortizes this overhead
 656 by aggregating data items at the source and every intermediate
 657 destination along the route to the final destination.
 658
 659 Every local instance of the TRAM group buffers the data items that have
 660 been submitted locally or received from another PE for forwarding.
 661 Because only peers communicate directly in the virtual grid, it suffices
 662 to have a single buffer per PE for every peer. Given a dimension d
 663 within the virtual topology, let :math:`s_d` denote its *size*, or the
 664 number of distinct values a coordinate for dimension d can take.
 665 Consequently, each local instance allocates up to :math:`s_d - 1`
 666 buffers per dimension, for a total of :math:`\sum_{d=0}^{N-1} (s_d - 1)`
 667 buffers. Note that this is normally significantly less than the total
 668 number of PEs specified by the virtual topology, which is equal to
 669 :math:`\prod_{d=0}^{N-1}
 670 {s_d}`.
 671
 672 Sending with TRAM is done by submitting a data item and a destination
 673 identifier, either PE or array index, using a function call to the local
 674 instance. If the index belongs to a peer, the library places the data
 675 item in the buffer for the peer’s PE. Otherwise, the library calculates
 676 the index of the intermediate destination using the previously described
 677 algorithm, and places the data item in the buffer for the resulting PE,
 678 which by design is always a peer of the local PE. Buffers are sent out
 679 immediately when they become full. When a message is received at an
 680 intermediate destination, the data items comprising it are distributed
 681 into the appropriate buffers for subsequent sending. In the process, if
 682 a data item is determined to have reached its final destination, it is
 683 immediately delivered.
 684
 685 The total buffering capacity specified by the user may be reached even
 686 when no single buffer is completely filled up. In that case the buffer
 687 with the greatest number of buffered data items is sent.
 688
 689 Application User Interface
 690 --------------------------
 691
 692 A typical usage scenario for TRAM involves a start-up phase followed by
 693 one or more *communication steps*. We next describe the application user
 694 interface and details relevant to usage of the library, which normally
 695 follows these steps:
 696
 697 #. Start-up Creation of a TRAM group and set up of client arrays and
 698    groups
 699
 700 #. Initialization Calling an initialization function, which returns
 701    through a callback
 702
 703 #. Sending An arbitrary number of sends using the insertData function
 704    call on the local instance of the library
 705
 706 #. Receiving Processing received data items through the process function
 707    which serves as the delivery interface for the library and must be
 708    defined by the user
 709
 710 #. Termination Termination of a communication step
 711
 712 #. Re-initialization After termination of a communication step, the
 713    library instance is not active. However, re-initialization using step
 714    :math:`2` leads to a new communication step.
 715
 716 Start-Up
 717 ~~~~~~~~
 718
 719 Start-up is typically performed once in a program, often inside the main
 720 function of the mainchare, and involves creating an aggregator instance.
 721 An instance of TRAM is restricted to sending data items of a single
 722 user-specified type, which we denote by dtype, to a single
 723 user-specified chare array or group.
 724
 725 Sending to a Group
 726 ^^^^^^^^^^^^^^^^^^
 727
 728 To use TRAM for sending to a group, a GroupMeshStreamer group should be
 729 created. Either of the following two GroupMeshStreamer constructors can
 730 be used for that purpose:
 731
 732 .. code-block:: c++
 733
 734    template<class dtype, class ClientType, class RouterType>
 735    GroupMeshStreamer<dtype, ClientType, RouterType>::
 736    GroupMeshStreamer(int maxNumDataItemsBuffered,
 737                      int numDimensions,
 738                      int *dimensionSizes,
 739                      CkGroupID clientGID,
 740                      bool yieldFlag = 0,
 741                      double progressPeriodInMs = -1.0);
 742
 743    template<class dtype, class ClientType, class RouterType>
 744    GroupMeshStreamer<dtype, ClientType, RouterType>::
 745    GroupMeshStreamer(int numDimensions,
 746                      int *dimensionSizes,
 747                      CkGroupID clientGID,
 748                      int bufferSize,
 749                      bool yieldFlag = 0,
 750                      double progressPeriodInMs = -1.0);
 751
 752 Sending to a Chare Array
 753 ^^^^^^^^^^^^^^^^^^^^^^^^
 754
 755 For sending to a chare array, an ArrayMeshStreamer group should be
 756 created, which has a similar constructor interface to GroupMeshStreamer:
 757
 758 .. code-block:: c++
 759
 760    template <class dtype, class itype, class ClientType,
 761              class RouterType>
 762    ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::
 763    ArrayMeshStreamer(int maxNumDataItemsBuffered,
 764                      int numDimensions,
 765                      int *dimensionSizes,
 766                      CkArrayID clientAID,
 767                      bool yieldFlag = 0,
 768                      double progressPeriodInMs = -1.0);
 769
 770    template <class dtype, class itype, class ClientType,
 771              class RouterType>
 772    ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::
 773    ArrayMeshStreamer(int numDimensions,
 774                      int *dimensionSizes,
 775                      CkArrayID clientAID,
 776                      int bufferSize,
 777                      bool yieldFlag = 0,
 778                      double progressPeriodInMs = -1.0);
 779
 780 Description of parameters:
 781
 782 -  maxNumDataItemsBuffered: maximum number of items that the library is
 783    allowed to buffer per PE
 784
 785 -  numDimensions: number of dimensions in grid of PEs
 786
 787 -  dimensionSizes: array of size numDimensions containing the size of
 788    each dimension in the grid
 789
 790 -  clientGID: the group ID for the client group
 791
 792 -  clientAID: the array ID for the client array
 793
 794 -  bufferSize: size of the buffer for each peer, in terms of number of
 795    data items
 796
 797 -  yieldFlag: when true, calls CthYield() after every :math:`1024` item
 798    insertions; setting it true requires all data items to be submitted
 799    from threaded entry methods. Ensures that pending messages are sent
 800    out by the runtime system when a large number of data items are
 801    submitted from a single entry method.
 802
 803 -  progressPeriodInMs: number of milliseconds between periodic progress
 804    checks; relevant only when periodic flushing is enabled (see
 805    Section :numref:`sec:tram_termination`).
 806
 807 Template parameters:
 808
 809 -  dtype: data item type
 810
 811 -  itype: index type of client chare array (use int for one-dimensional
 812    chare arrays and CkArrayIndex for all other index types)
 813
 814 -  ClientType: type of client group or array
 815
 816 -  | RouterType: the routing protocol to be used. The choices are:
 817    | (1) SimpleMeshRouter - original grid aggregation scheme;
 818    | (2) NodeAwareMeshRouter - base node-aware aggregation scheme;
 819    | (3) AggressiveNodeAwareMeshRouter - advanced node-aware aggregation
 820      scheme;
 821
 822 Initialization
 823 ~~~~~~~~~~~~~~
 824
 825 A TRAM instance needs to be initialized before every communication step.
 826 There are currently three main modes of operation, depending on the type
 827 of termination used: *staged completion*, *completion detection*, or
 828 *quiescence detection*. The modes of termination are described later.
 829 Here, we present the interface for initializing a communication step for
 830 each of the three modes.
 831
 832 When using completion detection, each local instance of TRAM must be
 833 initialized using the following variant of the overloaded init function:
 834
 835 .. code-block:: c++
 836
 837    template <class dtype, class RouterType>
 838    void MeshStreamer<dtype, RouterType>::
 839    init(int numContributors,
 840         CkCallback startCb,
 841         CkCallback endCb,
 842         CProxy_CompletionDetector detector,
 843         int prio,
 844         bool usePeriodicFlushing);
 845
 846 Description of parameters:
 847
 848 -  numContributors: number of done calls expected globally before
 849    termination of this communication step
 850
 851 -  startCb: callback to be invoked by the library after initialization
 852    is complete
 853
 854 -  endCb: callback to be invoked by the library after termination of
 855    this communication step
 856
 857 -  detector: an inactive CompletionDetector object to be used by TRAM
 858
 859 -  prio: Charm++ priority to be used for messages sent using TRAM in
 860    this communication step
 861
 862 -  usePeriodicFlushing: specifies whether periodic flushing should be
 863    used for this communication step
 864
 865 When using staged completion, a completion detector object is not
 866 required as input, as the library performs its own specialized form of
 867 termination. In this case, each local instance of TRAM must be
 868 initialized using a different interface for the overloaded init
 869 function:
 870
 871 .. code-block:: c++
 872
 873    template <class dtype, class RouterType>
 874    void MeshStreamer<dtype, RouterType>::
 875    init(int numLocalContributors,
 876         CkCallback startCb,
 877         CkCallback endCb,
 878         int prio,
 879         bool usePeriodicFlushing);
 880
 881 Note that numLocalContributors denotes the local number of done calls
 882 expected, rather than the global as in the first interface of init.
 883
 884 A common case is to have a single chare array perform all the sends in a
 885 communication step, with each element of the array as a contributor. For
 886 this case there is a special version of init that takes as input the
 887 CkArrayID object for the chare array that will perform the sends,
 888 precluding the need to manually determine the number of client chares
 889 per PE:
 890
 891 .. code-block:: c++
 892
 893    template <class dtype, class RouterType>
 894    void MeshStreamer<dtype, RouterType>::
 895    init(CkArrayID senderArrayID,
 896         CkCallback startCb,
 897         CkCallback endCb,
 898         int prio,
 899         bool usePeriodicFlushing);
 900
 901 The init interface for using quiescence detection is:
 902
 903 .. code-block:: c++
 904
 905    template <class dtype, class RouterType>
 906    void MeshStreamer<dtype, RouterType>::init(CkCallback startCb,
 907                                               int prio);
 908
 909 After initialization is finished, the system invokes startCb,
 910 signaling to the user that the library is ready to accept data items
 911 for sending.
 912
 913 Sending
 914 ~~~~~~~
 915
 916 Sending with TRAM is done through calls to insertData and broadcast.
 917
 918 .. code-block:: c++
 919
 920    template <class dtype, class RouterType>
 921    void MeshStreamer<dtype, RouterType>::
 922    insertData(const dtype& dataItem,
 923               int destinationPe);
 924
 925    template <class dtype, class itype, class ClientType,
 926              class RouterType>
 927    void ArrayMeshStreamer<dtype, itype, ClientType, RouterType>::
 928    insertData(const dtype& dataItem,
 929               itype arrayIndex);
 930
 931    template <class dtype, class RouterType>
 932    void MeshStreamer<dtype, RouterType>::
 933    broadcast(const dtype& dataItem);
 934
 935 -  dataItem: reference to a data item to be sent
 936
 937 -  destinationPe: index of destination PE
 938
 939 -  arrayIndex: index of destination array element
 940
 941 Broadcasting has the effect of delivering the data item:
 942
 943 -  once on every PE involved in the computation for GroupMeshStreamer
 944
 945 -  once for every array element involved in the computation for
 946    ArrayMeshStreamer
 947
 948 Receiving
 949 ~~~~~~~~~
 950
 951 To receive data items sent using TRAM, the user must define the process
 952 function for each client group and array:
 953
 954 .. code-block:: c++
 955
 956    void process(const dtype &ran);
 957
 958 Each item is delivered by the library using a separate call to process
 959 on the destination PE. The call is made locally, so process should not
 960 be an entry method.
 961
 962 .. _sec:tram_termination:
 963
 964 Termination
 965 ~~~~~~~~~~~
 966
 967 Flushing and termination mechanisms are used in TRAM to prevent deadlock
 968 due to indefinite buffering of items. Flushing works by sending out all
 969 buffers in a local instance if no items have been submitted or received
 970 since the last progress check. Meanwhile, termination detection is used
 971 to send out partially filled buffers at the end of a communication step
 972 after it has been determined that no additional items will be submitted.
 973
 974 Currently, three means of termination are supported: staged completion,
 975 completion detection, and quiescence detection. Periodic flushing is a
 976 secondary mechanism which can be enabled or disabled when initiating one
 977 of the primary mechanisms.
 978
 979 Termination typically requires the user to issue a number of calls to
 980 the done function:
 981
 982 .. code-block:: c++
 983
 984    template <class dtype, class RouterType>
 985    void MeshStreamer<dtype, RouterType>::
 986    done(int numContributorsFinished = 1);
 987
 988 When using completion detection, the number of done calls that are
 989 expected globally by the TRAM instance is specified using the
 990 numContributors parameter to init. Safe termination requires that no
 991 calls to insertData or broadcast are made after the last call to done is
 992 performed globally. Because order of execution is uncertain in parallel
 993 applications, some care is required to ensure the above condition is
 994 met. A simple way to terminate safely is to set numContributors equal to
 995 the number of senders, and call done once for each sender that is done
 996 submitting items.
 997
 998 In contrast to using completion detection, using staged completion
 999 involves setting the local number of expected calls to done using the
1000 numLocalContributors parameter in the init function. To ensure safe
1001 termination, no insertData or broadcast calls should be made on any PE
1002 where done has been called the expected number of times.
1003
1004 Another version of init for staged completion, which takes a CkArrayID
1005 object as an argument, provides a simplified interface in the common
1006 case when a single chare array performs all the sends within a
1007 communication step, with each of its elements as a contributor. For this
1008 version of init, TRAM determines the appropriate number of local
1009 contributors automatically. It also correctly handles the case of PEs
1010 without any contributors by immediately marking those PEs as having
1011 finished the communication step. As such, this version of init should be
1012 preferred by the user when applicable.
1013
1014 Staged completion is not supported when array location data is not
1015 guaranteed to be correct, as this can potentially violate the
1016 termination conditions used to guarantee successful termination. In
1017 order to guarantee correct location data in applications that use load
1018 balancing, Charm++ must be compiled with -DCMKGLOBALLOCATIONUPDATE,
1019 which has the effect of performing a global broadcast of location data
1020 for chare array elements that migrate during load balancing.
1021 Unfortunately, this operation is expensive when migrating large numbers
1022 of elements. As an alternative, completion detection and quiescence
1023 detection modes will work properly without the global location update
1024 mechanism, and even in the case of anytime migration.
1025
1026 When using quiescence detection, no end callback is used, and no done
1027 calls are required. Instead, termination of a communication step is
1028 achieved using the quiescence detection framework in Charm++, which
1029 supports passing a callback as parameter. TRAM is set up such that
1030 quiescence will not be detected until all items sent in the current
1031 communication step have been delivered to their final destinations.
1032
1033 The choice of which termination mechanism to use is left to the user.
1034 Using completion detection mode is more convenient when the global
1035 number of contributors is known, while staged completion is easier to
1036 use if the local number of contributors can be determined with ease, or
1037 if sending is done from the elements of a chare array. If either mode
1038 can be used with ease, staged completion should be preferred. Unlike the
1039 other mechanisms, staged completion does not involve persistent
1040 background communication to determine when the global number of expected
1041 done calls is reached. Staged completion is also generally faster at
1042 reaching termination due to not being dependent on periodic progress
1043 checks. Unlike completion detection, staged completion does incur a
1044 small bandwidth overhead (:math:`4` bytes) for every TRAM message, but
1045 in practice this is more than offset by the persistent traffic incurred
1046 by completion detection.
1047
1048 Periodic flushing is an auxiliary mechanism which checks at a regular
1049 interval whether any sends have taken place since the last time the
1050 check was performed. If not, the mechanism sends out all the data items
1051 buffered per local instance of the library. The period is specified by
1052 the user in the TRAM constructor. A typical use case for periodic
1053 flushing is when the submission of a data item B to TRAM happens as a
1054 result of the delivery of another data item A sent using the same TRAM
1055 instance. If A is buffered inside the library and insufficient data
1056 items are submitted to cause the buffer holding A to be sent out, a
1057 deadlock could arise. With the periodic flushing mechanism, the buffer
1058 holding A is guaranteed to be sent out eventually, and deadlock is
1059 prevented. Periodic flushing is required when using the completion
1060 detection or quiescence detection termination modes.
1061
1062 Re-initialization
1063 ~~~~~~~~~~~~~~~~~
1064
1065 A TRAM instance that has terminated cannot be used for sending more data
1066 items until it has been re-initialized. Re-initialization is achieved by
1067 calling init, which prepares the instance of the library for a new
1068 communication step. Re-initialization is useful for iterative
1069 applications, where it is often convenient to have a single
1070 communication step per iteration of the application.
1071
1072 Charm++ Registration of Templated Classes
1073 ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
1074
1075 Due to the use of templates in TRAM, the library template instances must
1076 be explicitly registered with the Charm++ runtime by the user of the
1077 library. This must be done in the .ci file for the application, and
1078 typically involves three steps.
1079
1080 For GroupMeshStreamer template instances, registration is done as
1081 follows:
1082
1083 -  Registration of the message type:
1084
1085    .. code-block:: c++
1086
1087       message MeshStreamerMessage<dtype>;
1088
1089 -  Registration of the base aggregator class
1090
1091    .. code-block:: c++
1092
1093       group MeshStreamer<dtype, RouterType>;
1094
1095 -  Registration of the derived aggregator class
1096
1097    .. code-block:: c++
1098
1099       group GroupMeshStreamer<dtype, ClientType, RouterType>;
1100
1101 For ArrayMeshStreamer template instances, registration is done as
1102 follows:
1103
1104 -  Registration of the message type:
1105
1106    .. code-block:: c++
1107
1108       message MeshStreamerMessage<ArrayDataItem<dtype, itype> >;
1109
1110 -  Registration of the base aggregator class
1111
1112    .. code-block:: c++
1113
1114       group MeshStreamer<ArrayDataItem<dtype, itype>,
1115                          RouterType>;
1116
1117 -  Registration of the derived aggregator class
1118
1119    .. code-block:: c++
1120
1121       group ArrayMeshStreamer<dtype, itype, ClientType,
1122                               RouterType>;
1123
1124 Example
1125 -------
1126
1127 For example code showing how to use TRAM, see ``examples/charm++/TRAM`` and
1128 ``benchmarks/charm++/streamingAllToAll`` in the Charm++ repository.
1129
1130 .. _gpumanager:
1131
1132 GPU Manager Library
1133 ===================
1134
1135 .. _overview-1:
1136
1137 Overview
1138 --------
1139
1140 GPU Manager is a task offload and management library for efficient use
1141 of CUDA-enabled GPUs in Charm++ applications. CUDA code can be
1142 integrated in Charm++ just like any C program, but the resulting
1143 performance is likely to be far from ideal. This is because
1144 overdecomposition, a core concept of Charm++, creates fine-grained
1145 objects and tasks which causes problems on the GPU.
1146
1147 GPUs are throughput-oriented devices with peak computational
1148 capabilities that greatly surpass equivalent-generation CPUs but with
1149 limited control logic. This currently constrains them to be used as
1150 accelerator devices controlled by code on the CPU. Traditionally,
1151 programmers have had to either (a) halt the execution of work on the CPU
1152 whenever issuing GPU work to simplify synchronization or (b) issue GPU
1153 work asynchronously and carefully manage and synchronize concurrent GPU
1154 work in order to ensure progress and good performance. The latter
1155 option, which is practically a requirement in Charm++ to preserve
1156 asynchrony, becomes significantly more difficult with numerous
1157 concurrent objects that issue kernels and data transfers to the GPU.
1158
1159 The Charm++ programmer is strongly recommended to use CUDA streams to
1160 mitigate this problem, by assigning separate streams to chares. This
1161 allows operations in different streams to execute concurrently. It
1162 should be noted that concurrent data transfers are limited by the number
1163 of DMA engines, and current GPUs have one per direction of the transfer
1164 (host-to-device, device-to-host). The concurrent kernels feature of CUDA
1165 allows multiple kernels to execute simultaneously on the device, as long
1166 as resources are available.
1167
1168 An important factor of performance with using GPUs in Charm++ is that
1169 the CUDA API calls invoked by chares to offload work should be
1170 non-blocking. The chare that just offloaded work to the GPU should yield
1171 the PE so that other chares waiting to be executed can do so.
1172 Unfortunately, many CUDA API calls used to wait for completion of GPU
1173 work, such as ``cudaStreamSynchronize`` and ``cudaDeviceSynchronize``,
1174 are blocking. Since the PEs in Charm++ are implemented as persistent
1175 kernel-level threads mapped to each CPU core, this means other chares
1176 cannot run until the GPU work completes and the blocked chare finishes
1177 executing. To resolve this issue, GPU Manager provides Hybrid API (HAPI)
1178 to the Charm++ user, which includes new functions to implement the
1179 non-blocking features and a set of wrappers to the CUDA runtime API
1180 functions. The non-blocking API allows the user to specify a Charm++
1181 callback upon offload which will be invoked when the operations in the
1182 CUDA stream are complete.
1183
1184 Building GPU Manager
1185 --------------------
1186
1187 GPU Manager is not included by default when building Charm++. In order
1188 to use GPU Manager, the user must build Charm++ using the ``cuda``
1189 option, e.g.
1190
1191 .. code-block:: bash
1192
1193    $ ./build charm++ netlrts-linux-x86_64 cuda -j8
1194
1195 Building GPU Manager requires an installation of the CUDA toolkit on the
1196 system.
1197
1198 Using GPU Manager
1199 -----------------
1200
1201 As explained in the Overview section, use of CUDA streams is strongly
1202 recommended. This allows kernels offloaded by chares to execute
1203 simultaneously on the GPU, which boosts performance if the kernels are
1204 small enough for the GPU to be able to allocate resources.
1205
1206 In a typical Charm++ application using CUDA, ``.C`` and ``.ci`` files
1207 would contain the Charm++ code, whereas a ``.cu`` file would include the
1208 definition of CUDA kernels and a function that serves as an entry point
1209 from the Charm++ application to use GPU capabilities. CUDA/HAPI calls
1210 for data transfers or kernel invocations would be placed inside this
1211 function, although they could also be put in a ``.C`` file provided that
1212 the right header files are included (``<cuda_runtime.h> or "hapi.h"``).
1213 The user should make sure that the CUDA kernel definitions are compiled
1214 by ``nvcc``, however.
1215
1216 After the necessary data transfers and kernel invocations,
1217 ``hapiAddCallback`` would be placed where typically
1218 ``cudaStreamSynchronize`` or ``cudaDeviceSynchronize`` would go. This
1219 informs the runtime that a chare has offloaded work to the GPU, allowing
1220 the provided Charm++ callback to be invoked once it is complete. The
1221 non-blocking API has the following prototype:
1222
1223 .. code-block:: c++
1224
1225      void hapiAddCallback(cudaStream_t stream, CkCallback* callback);
1226
1227 Other HAPI calls:
1228
1229 .. code-block:: c++
1230
1231      void hapiCreateStreams();
1232      cudaStream_t hapiGetStream();
1233
1234      cudaError_t hapiMalloc(void** devPtr, size_t size);
1235      cudaError_t hapiFree(void* devPtr);
1236      cudaError_t hapiMallocHost(void** ptr, size_t size);
1237      cudaError_t hapiFreeHost(void* ptr);
1238
1239      void* hapiPoolMalloc(int size);
1240      void hapiPoolFree(void* ptr);
1241
1242      cudaError_t hapiMemcpyAsync(void* dst, const void* src, size_t count,
1243                                  cudaMemcpyKind kind, cudaStream_t stream = 0);
1244
1245      hapiCheck(code);
1246
1247 ``hapiCreateStreams`` creates as many streams as the maximum number of
1248 concurrent kernels supported by the GPU device. ``hapiGetStream`` hands
1249 out a stream created by the runtime in a round-robin fashion. The
1250 ``hapiMalloc`` and ``hapiFree`` functions are wrappers to the
1251 corresponding CUDA API calls, and ``hapiPool`` functions provides memory
1252 pool functionalities which are used to obtain/free device memory without
1253 interrupting the GPU. ``hapiCheck`` is used to check if the input code
1254 block executes without errors. The given code should return
1255 ``cudaError_t`` for it to work.
1256
1257 Example Charm++ applications using CUDA can be found under
1258 ``examples/charm++/cuda``. Codes under #ifdef USE_WR use the
1259 hapiWorkRequest scheme, which is now deprecated.