\setcounter{secnumdepth}{3}

\renewcommand{\code}[1]{\texttt{\textbf{#1}}}

\newcommand{\cuda}{\code{CUDA}}

\section{Overview}

GPU Manager is a task offload and management library for efficient use of
CUDA-enabled GPUs in \charmpp{} applications. CUDA code can be integrated
in \charmpp{} just as in any \CC{} program, but the resulting performance
is likely to be far from ideal. This is because overdecomposition, a core
concept of \charmpp{}, creates fine-grained objects and tasks, which
translate into many small kernels and data transfers that are difficult
to execute efficiently on the GPU.

GPUs are throughput-oriented devices with peak computational capabilities that
greatly surpass those of equivalent-generation CPUs, but with limited control
logic. This currently constrains them to be used as accelerator devices
controlled by code on the CPU. Traditionally, programmers have had to either
(a) halt the execution of work on the CPU whenever issuing GPU work, to
simplify synchronization, or (b) issue GPU work asynchronously and carefully
manage and synchronize concurrent GPU work in order to ensure progress and
good performance. The latter option, which is practically a requirement in
\charmpp{} to preserve asynchrony, becomes significantly more difficult with
numerous concurrent objects that issue kernels and data transfers to the GPU.

The \charmpp{} programmer is strongly encouraged to mitigate this problem by
using CUDA streams and assigning separate streams to chares, which allows
operations in different streams to execute concurrently. It should be noted
that concurrent data transfers are limited by the number of DMA engines:
current GPUs have one per transfer direction (host-to-device and
device-to-host). The concurrent kernels feature of CUDA allows multiple
kernels to execute simultaneously on the device, as long as sufficient
resources are available.
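
For example, each chare could create and retain its own CUDA stream and issue
all of its GPU operations to it. The following minimal sketch uses only
standard CUDA API calls; \code{myKernel} and the buffer variables are
hypothetical:

\begin{alltt}
// In the chare's constructor: a dedicated stream for this chare
cudaStreamCreate(&stream_);

// When offloading: operations in different chares' streams may overlap
cudaMemcpyAsync(d_in, h_in, size, cudaMemcpyHostToDevice, stream_);
myKernel<<<grid, block, 0, stream_>>>(d_in, d_out);
cudaMemcpyAsync(h_out, d_out, size, cudaMemcpyDeviceToHost, stream_);
\end{alltt}
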
An important performance consideration when using GPUs in \charmpp{} is that
the CUDA API calls invoked by chares to offload work should be non-blocking:
a chare that has just offloaded work to the GPU should yield the PE so that
other chares waiting to be executed can do so. Unfortunately, many CUDA API
calls used to wait for the completion of GPU work, such as
\code{cudaStreamSynchronize} and \code{cudaDeviceSynchronize}, are blocking.
Since the PEs in \charmpp{} are implemented as persistent kernel-level threads
mapped to each CPU core, this means that other chares cannot run until the GPU
work completes and the blocked chare finishes executing. To resolve this
issue, GPU Manager provides the Hybrid API (HAPI) to the \charmpp{} user,
which includes new functions that implement the non-blocking features and a
set of wrappers for the CUDA runtime API functions. The non-blocking API
allows the user to specify a \charmpp{} callback upon offload, which is
invoked when the operations in the CUDA stream have completed.
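
The blocking pattern to avoid looks like the following sketch
(\code{myKernel} and its arguments are hypothetical); with HAPI, the
synchronization call is replaced by \code{hapiAddCallback}, described in
the Using GPU Manager section:

\begin{alltt}
myKernel<<<grid, block, 0, stream>>>(d_data, n);
// Blocks the entire PE until the GPU work finishes, preventing
// other chares scheduled on this PE from running
cudaStreamSynchronize(stream);
\end{alltt}
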
\section{Building GPU Manager}

GPU Manager is not included by default when building \charmpp{}. In order to
use GPU Manager, the user must build \charmpp{} using the \cuda{} option, e.g.

\begin{alltt}
./build charm++ netlrts-linux-x86_64 cuda -j8
\end{alltt}

Building GPU Manager requires an installation of the CUDA toolkit on the system.

\section{Using GPU Manager}

As explained in the Overview section, the use of CUDA streams is strongly
recommended. It allows kernels offloaded by different chares to execute
simultaneously on the GPU, which boosts performance when the kernels are
small enough for the GPU to allocate resources to more than one of them.

In a typical \charmpp{} application using CUDA, the \code{.C} and \code{.ci}
files contain the \charmpp{} code, whereas a \code{.cu} file includes the
definitions of the CUDA kernels and a function that serves as an entry point
from the \charmpp{} application to the GPU capabilities. CUDA/HAPI calls for
data transfers or kernel invocations are placed inside this function, although
they can also be put in a \code{.C} file provided that the right header file
is included (\code{<cuda\_runtime.h>} or \code{"hapi.h"}). The user should
make sure that the CUDA kernel definitions are compiled by \code{nvcc},
however.
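
As an illustration, a hypothetical \code{kernels.cu} file might contain a
kernel and its entry-point function, with the function declared in a shared
header so that it can be called from the \code{.C} file. This is only a
sketch, not part of the library:

\begin{alltt}
// kernels.cu -- compiled by nvcc
#include "hapi.h"

__global__ void vecAdd(const float* a, const float* b, float* c, int n) \{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
\}

// Entry point called from the Charm++ code in the .C file
void invokeVecAdd(const float* d_a, const float* d_b, float* d_c, int n,
                  cudaStream_t stream) \{
  vecAdd<<<(n + 255) / 256, 256, 0, stream>>>(d_a, d_b, d_c, n);
\}
\end{alltt}
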
After the necessary data transfers and kernel invocations,
\code{hapiAddCallback} is placed where \code{cudaStreamSynchronize} or
\code{cudaDeviceSynchronize} would typically go. This informs the runtime
that a chare has offloaded work to the GPU, allowing the provided \charmpp{}
callback to be invoked once the work is complete. The non-blocking API has
the following prototype:

\begin{alltt}
void hapiAddCallback(cudaStream_t stream, CkCallback* callback);
\end{alltt}
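
For example, a chare member function might offload work and register a
callback as follows. This is a sketch with hypothetical names;
\code{kernelDone} would be declared as an entry method of \code{MyChare}
in the \code{.ci} file:

\begin{alltt}
// Issue asynchronous transfers and a kernel on this chare's stream
cudaMemcpyAsync(d_data, h_data, size, cudaMemcpyHostToDevice, stream);
myKernel<<<grid, block, 0, stream>>>(d_data, n);
cudaMemcpyAsync(h_data, d_data, size, cudaMemcpyDeviceToHost, stream);

// Instead of cudaStreamSynchronize: invoke kernelDone() on this chare
// once all preceding operations in the stream have completed
hapiAddCallback(stream,
                new CkCallback(CkIndex_MyChare::kernelDone(), thisProxy));
\end{alltt}
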
Other HAPI calls:

\begin{alltt}
void hapiCreateStreams();
cudaStream_t hapiGetStream();

cudaError_t hapiMalloc(void** devPtr, size_t size);
cudaError_t hapiFree(void* devPtr);
cudaError_t hapiMallocHost(void** ptr, size_t size);
cudaError_t hapiFreeHost(void* ptr);

void* hapiPoolMalloc(int size);
void hapiPoolFree(void* ptr);

cudaError_t hapiMemcpyAsync(void* dst, const void* src, size_t count,
                            cudaMemcpyKind kind, cudaStream_t stream = 0);

hapiCheck(code);
\end{alltt}

\code{hapiCreateStreams} creates as many streams as the maximum number of
concurrent kernels supported by the GPU device. \code{hapiGetStream} hands
out a stream created by the runtime in a round-robin fashion. The
\code{hapiMalloc} and \code{hapiFree} functions are wrappers to the
corresponding CUDA API calls, and the \code{hapiPool} functions provide
memory pool functionality, which is used to obtain and free device memory
without interrupting the GPU. \code{hapiCheck} is used to check whether the
given code executes without errors; the given code should return
\code{cudaError_t} for it to work.
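
Putting several of these together, a chare might obtain a runtime-managed
stream and pooled device memory as follows (a sketch; \code{myKernel} and
the sizes are hypothetical):

\begin{alltt}
cudaStream_t stream = hapiGetStream();  // stream managed by the runtime
float* d_buf = (float*)hapiPoolMalloc(n * sizeof(float));

// hapiCheck verifies the cudaError_t returned by the wrapped call
hapiCheck(hapiMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                          cudaMemcpyHostToDevice, stream));
myKernel<<<grid, block, 0, stream>>>(d_buf, n);

// After the offloaded work completes (e.g. in the callback),
// return the buffer to the pool
hapiPoolFree(d_buf);
\end{alltt}
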
Example \charmpp{} applications using CUDA can be found under
\code{examples/charm++/cuda}. Codes under \code{\#ifdef USE\_WR} use the
\code{workRequest} scheme, which is now deprecated.