From c925ff9485b53e906d8e88e1d3cc550425a808aa Mon Sep 17 00:00:00 2001 From: Jaemin Choi Date: Mon, 27 Nov 2017 15:53:18 -0600 Subject: [PATCH] Cleanup #1491: Update documentation of GPUManager Change-Id: I9d1fc90f8556c14b868015fa2fc0ea127c439c3d --- doc/libraries/gpumanager.tex | 482 +++++++++++-------------------------------- 1 file changed, 117 insertions(+), 365 deletions(-) rewrite doc/libraries/gpumanager.tex (95%) diff --git a/doc/libraries/gpumanager.tex b/doc/libraries/gpumanager.tex dissimilarity index 95% index ce74426e4b..158ed364f3 100644 --- a/doc/libraries/gpumanager.tex +++ b/doc/libraries/gpumanager.tex @@ -1,365 +1,117 @@ -\setcounter{secnumdepth}{3} - -\renewcommand{\code}[1]{\texttt{\textbf{#1}}} - -\newcommand{\cuda}{\code{CUDA}} - -GPU Manager is a task offload and management library for efficient use of -CUDA-enabled GPUs in \charmpp{} applications. Compared to direct use of CUDA -(through issuing kernel invocation and GPU data transfer calls in user -code) GPU Manager provides the following advantages: -\begin{enumerate} -\item Automatic management and synchronization of tasks -\item Automatic overlap of data transfer and kernel invocation for concurrent tasks -\item A simplified work flow mechanism using CkCallback to return to user code after completion of each work request -\item Reduced synchronization overhead through centralized management of all GPU tasks -\end{enumerate} - -\section{Building GPU Manager} - -GPU Manager is not included by default when building \charmpp{}. In order to use -GPU Manager, the user must build \charmpp{} using the \cuda{} option, e.g. - -\begin{alltt} -./build charm++ netlrts-linux-x86_64 cuda -j8 -\end{alltt} -Building GPU Manager requires an installation of the CUDA toolkit on the system. -Charm will search \$CUDATOOLKIT\_HOME, /usr/local/cuda, and -/usr/lib/nvidia-cuda-toolkit, in that order, to locate the CUDA installation. - -\section{Overview and Work Flow} - -GPUs are throughput-oriented devices with peak computational capabilities that -greatly surpass equivalent-generation CPUs but with limited control logic that -constraints them to use as accelerator devices controlled by code executing on -the CPU. - -The GPU's dependence on the CPU for dispatch and synchronization of -coarse-grained data transfer and kernel execution has traditionally required -programmers to either (a) halt the execution of work on the CPU whenever issuing -GPU work to simplify synchronization or (b) issue GPU work asynchronously and -carefully manage and synchronize concurrent GPU work in order to ensure -satisfactory progress and good performance. Further, the latter option becomes -significantly more difficult in the context of a parallel program with numerous -concurrent objects that all issue kernel and data transfer calls to the same GPU. - -The \charmpp{} GPU Manager is a library designed to address this issue by -automating the management of GPUs. Users of GPU Manager define \emph{work requests} -that specify the GPU kernel and any data transfer operations required before -and after completion of the kernel. The system controls the execution of the -work requests submitted by all the chares on a particular processor. This allows -it to effectively manage execution of work requests and overlap CPU-GPU data -transfer with kernel execution. 
In steady-state operation, GPU Manager overlaps -kernel execution of one work request with data transfer out of GPU memory for a -preceding work request and the data transfer into GPU memory for a subsequent -work request. This approach avoids blocking the CUDA DMA engine by only submitting -data transfers when they are ready to execute. When using GPU Manager, the user -does not need to poll for completion of GPU operations. The system manages -execution of a work request throughout its life cycle and returns control to the -user upon completion of a work request through a CkCallback object specified by -the user per work request. Another advantage of using GPU Manager is that the -system polls only for a handful of currently executing operations, which avoids -the problem of multiple chares all polling the GPU when using CUDA streams -directly. GPU Manager has options for recording profiling data for kernel -execution and data transfer which can be visualized using the \charmpp{} -Projections profiler. - -\subsection{Execution Model and Progress Engine} - -Like any \charmpp{} application, programs using GPU Manager typically consist of a -large number of concurrently executing objects. Each object executes code in -response to active messages received from some object within the parallel run, -during which it can send its own active messages or issue one or more work -requests to the GPU Manager for asynchronous execution. Work requests are always -submitted to the local GPU Manager instance at the processing element where the -call is issued. Incoming GPU work requests are simply copied into the GPU -Manager's scheduling queue, at which point the library returns and the caller -can continue with other work. - -\charmpp{} employs a message driven programming model. This includes a runtime -system scheduler that is triggered every time a method finishes execution. Under -typical CPU-only execution the scheduler examines the queue of incoming messages -and selects one based on priority and location in the queue. In a \cuda{} -build of \charmpp{}, the scheduler is also programmed to periodically invoke the -GPU Manager progress engine. - -GPU Manager contains a queue of all pending work requests. When its progress -function is called, GPU Manager determines whether pending GPU work has -completed since the last time the progress function was called, and whether -additional work requests can begin executing. A \code{workRequest} undergoes the -following stages during its execution: - -\begin{enumerate} -\item Device memory allocation and data transfer from host to device -\item Kernel execution -\item Data transfer back to host from device -\item Invocation of a callback function (specified in the \code{workRequest}) -\end{enumerate} - -Based on the instructions contained in each work request, the GPU Manager will -allocate the required buffers in GPU global memory and issue asynchronous CUDA -data transfer operations directly. In order to execute kernels, the GPU Manager -calls the \code{runKernel} function that must be defined by the user. This -function specifies the CUDA kernel call for your work request. - -Under steady state execution with multiple concurrent work requests, as -one \code{workRequest} progresses to the execution stage, GPU Manager will -initiate the data transfer for the second -\code{workRequest} in the queue, and so on. 
- -In a typical application, the work request definition, -kernel run functions, CUDA kernel definitions, and code for submission -of work requests would all go in a \code{.cu} file that is compiled with -\code{nvcc} separately from the other files (e.g. \code{.C}, \code{.ci}) in the -\charmpp{} application. -We make a function call to \code{createWorkRequest} from a .C file to create -and enqueue the workRequest. -The various resulting object files of the application are then to be linked -together into the final executable. - - -\section{API} - -Using GPU Manager involves: -\begin{enumerate} -\item Defining CUDA kernels as in a regular CUDA application -\item Defining work requests and their callback functions -\item Defining the \code{void runMyKernel(workRequest *wr, cudaStream\_t kernelStream, void **deviceBuffers)} functions, used by the GPU Manager to issue a kernel call based on the kernel identifier defined in the work request -\item Submitting work requests to the GPU Manager -\end{enumerate} - -\subsection{Work Request} -\code{workRequest} is a simple structure which contains the necessary parameters -for CUDA kernel execution along with some additional members for automating -data transfer between the host and the device. -A work request consists of the following data members: - -\begin{description} -\item[dim3 dimGrid]- a triple which defines the grid structure for the kernel; -in the example below \code{dimGrid.x} specifies the number of blocks. -\code{dimGrid.y} and \code{dimGrid.z} are unused. - -\item[dim3 dimBlock]- a triple defining each block's structure; -specifies the number of threads in up to three dimensions. - -\item[int smemSize]- the number of bytes in shared memory to be dynamically -allocated per block for kernel execution. - -\item[int nBuffers]- number of buffers used by the work request. - -\item[dataInfo *bufferInfo]- array of \code{dataInfo} structs containing buffer -information for the execution of the work request. This array must be of size \code{nBuffers}, e.g. -\begin{alltt} -code{wr->bufferInfo = (dataInfo *) malloc(wr->nBuffers * sizeof(dataInfo))} -\end{alltt} -We explain the contents of \code{dataInfo} struct below. - -\item[void *callbackFn]- a pointer to a \code{CkCallback} object specified by the user; -executed after the kernel and memory transfers have finished. - -\item[const char *traceName]- A short identifier used for tracing and logging. - -\item[function runKernel]- A user defined host function to run the kernel. - We will pass this function three parameters: - \begin{description} - \item[workRequest]- The workrequest being run. - \item[kernelStream]- The cuda stream to run the kernel in. - \item[deviceBuffers]- An array of device pointers, indexed by bufferID. - \end{description} - -\item[int state]- the stage of a \code{workRequest}'s execution, set and used internally by the GPU Manager - -\item[void *userData]- may be used to pass scalar values to kernel -calls, such as the size of an array. - -\end{description} - -\subsubsection{dataInfo} -\begin{description} -\item[int bufferID]- the ID of a buffer in the runtime system's buffer table. -May be specified by the user if direct control over the buffer space is -desired. Otherwise, if it is left unset or set to a negative value, the GPU Manager will -assign a valid buffer ID. 
- -\item[int transferToDevice, transferFromDevice]- flags to indicate if the buffer -should be transferred to the device prior to the execution of a kernel, and/or -transferred out after the kernel - -\item[int freeBuffer]- a flag to indicate if the device buffer memory should be -freed after execution of \code{workRequest}. - -\item[void *hostBuffer]- pointer to host data buffer. In order to allow -asynchronous memory transfer and data computation on device this buffer must -be allocated from page-locked memory. -\begin{alltt} -void *hostBuffer = hapi\_poolMalloc(size); -\end{alltt} -% code around memPool? -This returns the buffer of required size from the GPU Manager's pool of pinned memory on the host. -Direct allocation of pinned memory (e.g. using \code{cudaMallocHost}) is discouraged, -as it will block the CPU until pending GPU work has finished executing. -The user must add the \code{-DGPU\_MEMPOOL} flag while compiling CUDA files. -This is required to enable fetching of page-locked memory from GPU Manager. -You may add it with your \code{NVCC\_FLAGS}. - -\item[size\_t size]- size of buffer in bytes. -\end{description} - -\subsubsection{Work Request Example} - -Here is an example method for creating a \code{workRequest} of the addition of -two vectors A and B. - -\begin{alltt} -#include "wr.h" -#define BLOCK_SIZE 256 -void createWorkRequest(int vectorSize, float *h_A, float *h_B, float **h_C, int myIndex, CkCallback *cb) -\{ - dataInfo *info; - workRequest *vecAdd = new workRequest; - int size = vectorSize * sizeof(float); - - vecAdd->dimGrid.x = (vectorSize - 1) / BLOCK_SIZE + 1; - vecAdd->dimBlock.x = BLOCK_SIZE; - vecAdd->smemSize = 0; - vecAdd->nBuffers = 3; - vecAdd->bufferInfo = new dataInfo[vecAdd->nBuffers]; - - /* Buffer A */ - info = &(vecAdd->bufferInfo[0]); - - /* The Buffer ID will be set by the library - if unset or set to a negative number, - or it can be set by the user. */ - info->bufferID = -1; - - info->transferToDevice = YES; - info->transferFromDevice = NO; - info->freeBuffer = YES; - - /* This fetches the pinned host memory already allocated by the library, - required for asynchronous data transfers. */ - info->hostBuffer = hapi_poolMalloc(size); - - /* Copy the data to the workRequest's buffer. */ - memcpy(info->hostBuffer, h_A, size); - - info->size = size; - - /* Buffer B will be same as A.*/ - - /* Buffer C */ - info = &(vecAdd->bufferInfo[2]); - info->transferFromDevice = YES; - info->hostBuffer = hapi_poolMalloc(size) - - / * We change the address to the address returned by the library - to read the copied result */ - *h_C = (float *)info->hostBuffer; - - /* a CkCallback pointer */ - vecAdd->callbackFn = cb; - - vecAdd->traceName = "add"; - - /* kernel run function */ - vecAdd->runKernel = run_add; - - vecAdd->userData = new int; - memcpy(vecAdd->userData, &vectorSize, sizeof(int)); - - /* enqueue the workRequest in the workRequestQueue. */ - enqueue(vecAdd); -\} -\end{alltt} - -\subsection{Writing Kernels} - -Writing a kernel is unchanged from normal CUDA programs. -Kernels are written in one (or more) .cu files. -Here is an example of \code{vectorAdd.cu}. -The full example can be found in \code{examples/charm++/cuda/vectorAdd/}. 
- -\begin{alltt} -__global__ void vecAdd(float *a, float *b, float *c, int n) -\{ - // Get our global thread ID - int id = blockIdx.x * blockDim.x + threadIdx.x; - - // Make sure we do not go out of bounds - if (id < n) - c[id] = a[id] + b[id]; -\} -\end{alltt} - -\subsection{Launching Kernels} - -Kernel launches are identical to regular kernel launches in normal CUDA -programs, run in a small dedicated function. - -\begin{alltt} -void run_add(workRequest *wr, cudaStream_t kernelStream, void **deviceBuffers) -\{ - printf("Add KERNEL \textbackslash{n}"); -/* - * devBuffers is declared by our library during the init phase on every processor. - * It jumps to the correct array index with the help of bufferID, - * which is supplied by the library or user. - */ - vecAdd<<< wr->dimGrid, wr->dimBlock, wr->smemSize, kernelStream>>> - ((float *) deviceBuffers[wr->bufferInfo[0].bufferID], - (float *) deviceBuffers[wr->bufferInfo[1].bufferID], - (float *) deviceBuffers[wr->bufferInfo[2].bufferID], - *((int *) wr->userData)); -\} -\end{alltt} - -\section{Compiling} - -As mentioned earlier, there are no changes to the \code{.ci} and \code{.C} -files. Therefore there are no changes in compiling them. -CUDA code, however, must be compiled using \code{nvcc}. -You can use the following example makefile to compile a .cu file. -More example codes can be found in the \code{examples/charm++/cuda} directory. - -\begin{alltt} -CUDA_LEVEL=35 - -NVCC = /usr/local/cuda/bin/nvcc - -NVCC_FLAGS = -O3 -c -use_fast_math -DGPU_MEMPOOL - -NVCC_FLAGS += -arch=compute_\$(CUDA_LEVEL) -code=sm_\$(CUDA_LEVEL) - -NVCC_INC = -I/usr/local/cuda/include - -CHARMINC = -I\$\{CHARMDIR\}/include - -LD_LIBS= -lcublas - -all: vectorAdd -\qquad\$(NVCC) \$(NVCC_FLAGS) \$(NVCC_INC) \$(CHARMINC) -o vectorAddCU.o vectorAdd.cu -\end{alltt} - -GPU Manager also supports \texttt{CuBLAS} or other GPU libraries in exactly the same way. -Call \texttt{CuBLAS} or the other GPU library directly from a kernel run function; -creating the \code{workRequest} works the same as any other kernel. - - -\section{Debugging} - -A few useful things for debugging: - -\begin{enumerate} -\item Enabling the \code{GPU\_MEMPOOL\_DEBUG} flag -(using \code{-DGPU\_MEMPOOL\_DEBUG}) during execution prints debug -statements, including when buffers are allocated and freed. - -\item Define the \code{GPU\_DEBUG} flag (as above) to output more verbose -debugging information during execution. - -\item When debugging, add the flag \code{-g} during compilation. - -\item For more infromation on these and other flags, see the comment at the top -of \code{cuda-hybrid-api.cu}. -\end{enumerate} +\setcounter{secnumdepth}{3} + +\renewcommand{\code}[1]{\texttt{\textbf{#1}}} + +\newcommand{\cuda}{\code{CUDA}} + +\section{Overview} + +GPU Manager is a task offload and management library for efficient use of +CUDA-enabled GPUs in \charmpp{} applications. CUDA code can be integrated +in \charmpp{} just like any \CC{} program, but the resulting performance +is likely to be far from ideal. This is because overdecomposition, a core +concept of \charmpp{}, creates fine-grained objects and tasks which causes +problems on the GPU. + +GPUs are throughput-oriented devices with peak computational capabilities that +greatly surpass equivalent-generation CPUs but with limited control logic. This +currently constrains them to be used as accelerator devices controlled by code +on the CPU. 
Traditionally, programmers have had to either (a) halt the execution
of work on the CPU whenever issuing GPU work to simplify synchronization or (b)
issue GPU work asynchronously and carefully manage and synchronize concurrent
GPU work in order to ensure progress and good performance. The latter option,
which is practically a requirement in \charmpp{} to preserve asynchrony, becomes
significantly more difficult with numerous concurrent objects that issue kernels
and data transfers to the GPU.

The \charmpp{} programmer is strongly recommended to use CUDA streams to
mitigate this problem, by assigning separate streams to chares. This allows
operations in different streams to execute concurrently. Note that concurrent
data transfers are limited by the number of DMA engines, and current GPUs have
one engine per transfer direction (host-to-device and device-to-host).
The concurrent kernels feature of CUDA allows multiple kernels to execute
simultaneously on the device, as long as sufficient resources are available.

An important performance consideration when using GPUs in \charmpp{} is that the
CUDA API calls invoked by chares to offload work should be non-blocking. The
chare that has just offloaded work to the GPU should yield the PE so that other
chares waiting to be executed can do so. Unfortunately, many CUDA API calls used
to wait for completion of GPU work, such as \code{cudaStreamSynchronize} and
\code{cudaDeviceSynchronize}, are blocking. Since the PEs in \charmpp{} are
implemented as persistent kernel-level threads mapped to each CPU core, this
means other chares cannot run until the GPU work completes and the blocked
chare finishes executing. To resolve this issue, GPU Manager provides the Hybrid
API (HAPI) to the \charmpp{} user, which includes new functions that provide
non-blocking behavior as well as a set of wrappers around CUDA runtime API
functions. The non-blocking API allows the user to specify a \charmpp{} callback
upon offload which will be invoked when the operations in the CUDA stream are
complete.

\section{Building GPU Manager}

GPU Manager is not included by default when building \charmpp{}. In order to use
GPU Manager, the user must build \charmpp{} using the \cuda{} option, e.g.

\begin{alltt}
./build charm++ netlrts-linux-x86_64 cuda -j8
\end{alltt}

Building GPU Manager requires an installation of the CUDA toolkit on the system.

\section{Using GPU Manager}

As explained in the Overview section, use of CUDA streams is strongly recommended.
This allows kernels offloaded by chares to execute simultaneously on the GPU,
which boosts performance when the kernels are small enough that the GPU can
allocate resources to several of them at once.

In a typical \charmpp{} application using CUDA, \code{.C} and \code{.ci} files
would contain the \charmpp{} code, whereas a \code{.cu} file would include the
definition of CUDA kernels and a function that serves as an entry point from
the \charmpp{} application to use GPU capabilities. CUDA/HAPI calls for data
transfers or kernel invocations would be placed inside this function, although
they could also be put in a \code{.C} file provided that the right header files
are included (\code{<hapi.h>} or \code{"hapi.h"}). The user should make sure
that the CUDA kernel definitions are compiled by \code{nvcc}, however.

After the necessary data transfers and kernel invocations, \code{hapiAddCallback}
would be placed where \code{cudaStreamSynchronize} or
\code{cudaDeviceSynchronize} would typically go, as in the sketch below.
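The following is a minimal sketch of such a \code{.cu} entry point, shown here
for illustration only; it is not taken from the distribution. It assumes a
hypothetical kernel \code{vecAddKernel} and wrapper function \code{vecAddGPU},
uses the HAPI calls listed later in this section, and omits error handling
beyond \code{hapiCheck} as well as the eventual freeing of the device buffers.

\begin{alltt}
#include "hapi.h"

/* Hypothetical kernel definition, compiled by nvcc. */
__global__ void vecAddKernel(float* a, float* b, float* c, int n) \{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
\}

/* Entry point called from a chare's .C code; cb is invoked by the
 * runtime once all operations enqueued in the stream have completed. */
void vecAddGPU(float* h_A, float* h_B, float* h_C, int n, CkCallback* cb) \{
  size_t size = n * sizeof(float);
  cudaStream_t stream = hapiGetStream(); /* stream managed by the runtime */

  float *d_A, *d_B, *d_C;
  hapiCheck(hapiMalloc((void**)&d_A, size));
  hapiCheck(hapiMalloc((void**)&d_B, size));
  hapiCheck(hapiMalloc((void**)&d_C, size));

  hapiCheck(hapiMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream));
  hapiCheck(hapiMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice, stream));

  int blockSize = 256;
  int gridSize = (n + blockSize - 1) / blockSize;
  vecAddKernel<<<gridSize, blockSize, 0, stream>>>(d_A, d_B, d_C, n);

  hapiCheck(hapiMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost, stream));

  /* Non-blocking: returns immediately; the device buffers would be
   * freed (e.g. with hapiFree) after the callback has been delivered. */
  hapiAddCallback(stream, cb);
\}
\end{alltt}

For the host-device transfers to be fully asynchronous, the host buffers would
be allocated from pinned memory, e.g. with \code{hapiMallocHost} or
\code{hapiPoolMalloc}.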
The call to \code{hapiAddCallback} informs the runtime that a chare has
offloaded work to the GPU, allowing the provided \charmpp{} callback to be
invoked once that work is complete. The non-blocking API has the following
prototype:

\begin{alltt}
  void hapiAddCallback(cudaStream_t stream, CkCallback* callback);
\end{alltt}

Other HAPI calls:

\begin{alltt}
  void hapiCreateStreams();
  cudaStream_t hapiGetStream();

  cudaError_t hapiMalloc(void** devPtr, size_t size);
  cudaError_t hapiFree(void* devPtr);
  cudaError_t hapiMallocHost(void** ptr, size_t size);
  cudaError_t hapiFreeHost(void* ptr);

  void* hapiPoolMalloc(int size);
  void hapiPoolFree(void* ptr);

  cudaError_t hapiMemcpyAsync(void* dst, const void* src, size_t count,
                              cudaMemcpyKind kind, cudaStream_t stream = 0);

  hapiCheck(code);
\end{alltt}

\code{hapiCreateStreams} creates as many streams as the maximum number of
concurrent kernels supported by the GPU device. \code{hapiGetStream} hands
out a stream created by the runtime in round-robin fashion. The
\code{hapiMalloc} and \code{hapiFree} functions are wrappers around the
corresponding CUDA API calls, and the \code{hapiPool} functions provide a
memory pool from which pinned host memory can be obtained and freed without
blocking on pending GPU work. \code{hapiCheck} checks whether the given code
executes without errors; the code should return \code{cudaError\_t} for the
check to work.

Example \charmpp{} applications using CUDA can be found under
\code{examples/charm++/cuda}. Code guarded by \code{\#ifdef USE\_WR} uses the
\code{workRequest} scheme, which is now deprecated.
-- 
2.11.4.GIT