\setcounter{secnumdepth}{3}

\renewcommand{\code}[1]{\texttt{\textbf{#1}}}

\newcommand{\cuda}{\code{CUDA}}

\section{Overview}

GPU Manager is a task offload and management library for efficient use of
CUDA-enabled GPUs in \charmpp{} applications. CUDA code can be integrated
in \charmpp{} just as in any \CC{} program, but the resulting performance
is likely to be far from ideal. This is because overdecomposition, a core
concept of \charmpp{}, creates fine-grained objects and tasks, which
translate into many small kernels and data transfers that are difficult
to execute efficiently on the GPU.

GPUs are throughput-oriented devices with peak computational capabilities that
greatly surpass those of equivalent-generation CPUs, but with limited control
logic. This currently constrains them to be used as accelerator devices
controlled by code on the CPU. Traditionally, programmers have had to either
(a) halt the execution of work on the CPU whenever issuing GPU work, to
simplify synchronization, or (b) issue GPU work asynchronously and carefully
manage and synchronize concurrent GPU work in order to ensure progress and
good performance. The latter option, which is practically a requirement in
\charmpp{} to preserve asynchrony, becomes significantly more difficult with
numerous concurrent objects that issue kernels and data transfers to the GPU.

The \charmpp{} programmer is strongly encouraged to mitigate this problem by
using CUDA streams and assigning separate streams to chares, which allows
operations in different streams to execute concurrently. It should be noted
that concurrent data transfers are limited by the number of DMA engines:
current GPUs have one per transfer direction (host-to-device and
device-to-host). The concurrent kernels feature of CUDA allows multiple
kernels to execute simultaneously on the device, as long as sufficient
resources are available.
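
For example, each chare could create and retain its own CUDA stream and issue
all of its GPU operations to it. The following minimal sketch uses only
standard CUDA API calls; \code{myKernel} and the buffer variables are
hypothetical:

\begin{alltt}
// In the chare's constructor: a dedicated stream for this chare
cudaStreamCreate(&stream_);

// When offloading: operations in different chares' streams may overlap
cudaMemcpyAsync(d_in, h_in, size, cudaMemcpyHostToDevice, stream_);
myKernel<<<grid, block, 0, stream_>>>(d_in, d_out);
cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost, stream_);
\end{alltt}
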
An important performance consideration when using GPUs in \charmpp{} is that
the CUDA API calls invoked by chares to offload work should be non-blocking:
a chare that has just offloaded work to the GPU should yield the PE so that
other chares waiting to be executed can do so. Unfortunately, many CUDA API
calls used to wait for the completion of GPU work, such as
\code{cudaStreamSynchronize} and \code{cudaDeviceSynchronize}, are blocking.
Since the PEs in \charmpp{} are implemented as persistent kernel-level threads
mapped to each CPU core, this means that other chares cannot run until the GPU
work completes and the blocked chare finishes executing. To resolve this
issue, GPU Manager provides the Hybrid API (HAPI) to the \charmpp{} user,
which includes new functions that implement the non-blocking features and a
set of wrappers for the CUDA runtime API functions. The non-blocking API
allows the user to specify a \charmpp{} callback upon offload, which is
invoked when the operations in the CUDA stream have completed.
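
The blocking pattern to avoid looks like the following sketch
(\code{myKernel} and its arguments are hypothetical); with HAPI, the
synchronization call is replaced by \code{hapiAddCallback}, described in
the Using GPU Manager section:

\begin{alltt}
myKernel<<<grid, block, 0, stream>>>(d_data, n);
// Blocks the entire PE until the GPU work finishes, preventing
// other chares scheduled on this PE from running
cudaStreamSynchronize(stream);
\end{alltt}
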
\section{Building GPU Manager}

GPU Manager is not included by default when building \charmpp{}. In order to
use GPU Manager, the user must build \charmpp{} using the \cuda{} option, e.g.

\begin{alltt}
./build charm++ netlrts-linux-x86_64 cuda -j8
\end{alltt}

Building GPU Manager requires an installation of the CUDA toolkit on the system.

\section{Using GPU Manager}

As explained in the Overview section, the use of CUDA streams is strongly
recommended. It allows kernels offloaded by different chares to execute
simultaneously on the GPU, which boosts performance when the kernels are
small enough for the GPU to allocate resources to more than one of them.

In a typical \charmpp{} application using CUDA, the \code{.C} and \code{.ci}
files contain the \charmpp{} code, whereas a \code{.cu} file includes the
definitions of the CUDA kernels and a function that serves as an entry point
from the \charmpp{} application to the GPU capabilities. CUDA/HAPI calls for
data transfers or kernel invocations are placed inside this function, although
they can also be put in a \code{.C} file provided that the right header file
is included (\code{<cuda\_runtime.h>} or \code{"hapi.h"}). The user should
make sure that the CUDA kernel definitions are compiled by \code{nvcc},
however.
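
As an illustration, a hypothetical \code{kernels.cu} file might contain a
kernel and its entry-point function, with the function declared in a shared
header so that it can be called from the \code{.C} file. This is only a
sketch, not part of the library:

\begin{alltt}
// kernels.cu -- compiled by nvcc
#include "hapi.h"

__global__ void vecAdd(const float* a, const float* b, float* c, int n) \{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
\}

// Entry point called from the Charm++ code in the .C file
void invokeVecAdd(const float* d_a, const float* d_b, float* d_c, int n,
                  cudaStream_t stream) \{
  vecAdd<<<(n + 255) / 256, 256, 0, stream>>>(d_a, d_b, d_c, n);
\}
\end{alltt}
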
After the necessary data transfers and kernel invocations,
\code{hapiAddCallback} is placed where \code{cudaStreamSynchronize} or
\code{cudaDeviceSynchronize} would typically go. This informs the runtime
that a chare has offloaded work to the GPU, allowing the provided \charmpp{}
callback to be invoked once the work is complete. The non-blocking API has
the following prototype:

\begin{alltt}
void hapiAddCallback(cudaStream_t stream, CkCallback* callback);
\end{alltt}
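
For example, a chare member function might offload work and register a
callback as follows. This is a sketch with hypothetical names;
\code{kernelDone} would be declared as an entry method of \code{MyChare}
in the \code{.ci} file:

\begin{alltt}
// Issue asynchronous transfers and a kernel on this chare's stream
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_data, n);
cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost, stream);

// Instead of cudaStreamSynchronize: invoke kernelDone() on this chare
// once all preceding operations in the stream have completed
hapiAddCallback(stream,
                new CkCallback(CkIndex_MyChare::kernelDone(), thisProxy));
\end{alltt}
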
Other HAPI calls:

\begin{alltt}
void hapiCreateStreams();
cudaStream_t hapiGetStream();

cudaError_t hapiMalloc(void** devPtr, size_t size);
cudaError_t hapiFree(void* devPtr);
cudaError_t hapiMallocHost(void** ptr, size_t size);
cudaError_t hapiFreeHost(void* ptr);

void* hapiPoolMalloc(int size);
void hapiPoolFree(void* ptr);

cudaError_t hapiMemcpyAsync(void* dst, const void* src, size_t count,
                            cudaMemcpyKind kind, cudaStream_t stream = 0);

hapiCheck(code);
\end{alltt}

\code{hapiCreateStreams} creates as many streams as the maximum number of
concurrent kernels supported by the GPU device. \code{hapiGetStream} hands
out a stream created by the runtime in a round-robin fashion. The
\code{hapiMalloc} and \code{hapiFree} functions are wrappers to the
corresponding CUDA API calls, and the \code{hapiPool} functions provide
memory pool functionality, which is used to obtain and free device memory
without interrupting the GPU. \code{hapiCheck} is used to check whether the
given code executes without errors; the given code should return
\code{cudaError_t} for it to work.
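
Putting several of these together, a chare might obtain a runtime-managed
stream and pooled device memory as follows (a sketch; \code{myKernel} and
the sizes are hypothetical):

\begin{alltt}
cudaStream_t stream = hapiGetStream();  // stream managed by the runtime
float* d_buf = (float*)hapiPoolMalloc(n * sizeof(float));

// hapiCheck verifies the cudaError_t returned by the wrapped call
hapiCheck(hapiMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                          cudaMemcpyHostToDevice, stream));
myKernel<<<grid, block, 0, stream>>>(d_buf, n);

// After the offloaded work completes (e.g. in the callback),
// return the buffer to the pool
hapiPoolFree(d_buf);
\end{alltt}
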
Example \charmpp{} applications using CUDA can be found under
\code{examples/charm++/cuda}. Codes under \code{\#ifdef USE\_WR} use the
\code{workRequest} scheme, which is now deprecated.