From c925ff9485b53e906d8e88e1d3cc550425a808aa Mon Sep 17 00:00:00 2001 From: Jaemin Choi Date: Mon, 27 Nov 2017 15:53:18 -0600 Subject: [PATCH] Cleanup #1491: Update documentation of GPUManager Change-Id: I9d1fc90f8556c14b868015fa2fc0ea127c439c3d --- doc/libraries/gpumanager.tex | 482 +++++++++++-------------------------------- 1 file changed, 117 insertions(+), 365 deletions(-) rewrite doc/libraries/gpumanager.tex (95%) diff --git a/doc/libraries/gpumanager.tex b/doc/libraries/gpumanager.tex dissimilarity index 95% index ce74426e4b..158ed364f3 100644 --- a/doc/libraries/gpumanager.tex +++ b/doc/libraries/gpumanager.tex @@ -1,365 +1,117 @@ -\setcounter{secnumdepth}{3} - -\renewcommand{\code}[1]{\texttt{\textbf{#1}}} - -\newcommand{\cuda}{\code{CUDA}} - -GPU Manager is a task offload and management library for efficient use of -CUDA-enabled GPUs in \charmpp{} applications. Compared to direct use of CUDA -(through issuing kernel invocation and GPU data transfer calls in user -code) GPU Manager provides the following advantages: -\begin{enumerate} -\item Automatic management and synchronization of tasks -\item Automatic overlap of data transfer and kernel invocation for concurrent tasks -\item A simplified work flow mechanism using CkCallback to return to user code after completion of each work request -\item Reduced synchronization overhead through centralized management of all GPU tasks -\end{enumerate} - -\section{Building GPU Manager} - -GPU Manager is not included by default when building \charmpp{}. In order to use -GPU Manager, the user must build \charmpp{} using the \cuda{} option, e.g. - -\begin{alltt} -./build charm++ netlrts-linux-x86_64 cuda -j8 -\end{alltt} -Building GPU Manager requires an installation of the CUDA toolkit on the system. -Charm will search \$CUDATOOLKIT\_HOME, /usr/local/cuda, and -/usr/lib/nvidia-cuda-toolkit, in that order, to locate the CUDA installation. - -\section{Overview and Work Flow} - -GPUs are throughput-oriented devices with peak computational capabilities that -greatly surpass equivalent-generation CPUs but with limited control logic that -constraints them to use as accelerator devices controlled by code executing on -the CPU. - -The GPU's dependence on the CPU for dispatch and synchronization of -coarse-grained data transfer and kernel execution has traditionally required -programmers to either (a) halt the execution of work on the CPU whenever issuing -GPU work to simplify synchronization or (b) issue GPU work asynchronously and -carefully manage and synchronize concurrent GPU work in order to ensure -satisfactory progress and good performance. Further, the latter option becomes -significantly more difficult in the context of a parallel program with numerous -concurrent objects that all issue kernel and data transfer calls to the same GPU. - -The \charmpp{} GPU Manager is a library designed to address this issue by -automating the management of GPUs. Users of GPU Manager define \emph{work requests} -that specify the GPU kernel and any data transfer operations required before -and after completion of the kernel. The system controls the execution of the -work requests submitted by all the chares on a particular processor. This allows -it to effectively manage execution of work requests and overlap CPU-GPU data -transfer with kernel execution. 
In steady-state operation, GPU Manager overlaps -kernel execution of one work request with data transfer out of GPU memory for a -preceding work request and the data transfer into GPU memory for a subsequent -work request. This approach avoids blocking the CUDA DMA engine by only submitting -data transfers when they are ready to execute. When using GPU Manager, the user -does not need to poll for completion of GPU operations. The system manages -execution of a work request throughout its life cycle and returns control to the -user upon completion of a work request through a CkCallback object specified by -the user per work request. Another advantage of using GPU Manager is that the -system polls only for a handful of currently executing operations, which avoids -the problem of multiple chares all polling the GPU when using CUDA streams -directly. GPU Manager has options for recording profiling data for kernel -execution and data transfer which can be visualized using the \charmpp{} -Projections profiler. - -\subsection{Execution Model and Progress Engine} - -Like any \charmpp{} application, programs using GPU Manager typically consist of a -large number of concurrently executing objects. Each object executes code in -response to active messages received from some object within the parallel run, -during which it can send its own active messages or issue one or more work -requests to the GPU Manager for asynchronous execution. Work requests are always -submitted to the local GPU Manager instance at the processing element where the -call is issued. Incoming GPU work requests are simply copied into the GPU -Manager's scheduling queue, at which point the library returns and the caller -can continue with other work. - -\charmpp{} employs a message driven programming model. This includes a runtime -system scheduler that is triggered every time a method finishes execution. Under -typical CPU-only execution the scheduler examines the queue of incoming messages -and selects one based on priority and location in the queue. In a \cuda{} -build of \charmpp{}, the scheduler is also programmed to periodically invoke the -GPU Manager progress engine. - -GPU Manager contains a queue of all pending work requests. When its progress -function is called, GPU Manager determines whether pending GPU work has -completed since the last time the progress function was called, and whether -additional work requests can begin executing. A \code{workRequest} undergoes the -following stages during its execution: - -\begin{enumerate} -\item Device memory allocation and data transfer from host to device -\item Kernel execution -\item Data transfer back to host from device -\item Invocation of a callback function (specified in the \code{workRequest}) -\end{enumerate} - -Based on the instructions contained in each work request, the GPU Manager will -allocate the required buffers in GPU global memory and issue asynchronous CUDA -data transfer operations directly. In order to execute kernels, the GPU Manager -calls the \code{runKernel} function that must be defined by the user. This -function specifies the CUDA kernel call for your work request. - -Under steady state execution with multiple concurrent work requests, as -one \code{workRequest} progresses to the execution stage, GPU Manager will -initiate the data transfer for the second -\code{workRequest} in the queue, and so on. 
- -In a typical application, the work request definition, -kernel run functions, CUDA kernel definitions, and code for submission -of work requests would all go in a \code{.cu} file that is compiled with -\code{nvcc} separately from the other files (e.g. \code{.C}, \code{.ci}) in the -\charmpp{} application. -We make a function call to \code{createWorkRequest} from a .C file to create -and enqueue the workRequest. -The various resulting object files of the application are then to be linked -together into the final executable. - - -\section{API} - -Using GPU Manager involves: -\begin{enumerate} -\item Defining CUDA kernels as in a regular CUDA application -\item Defining work requests and their callback functions -\item Defining the \code{void runMyKernel(workRequest *wr, cudaStream\_t kernelStream, void **deviceBuffers)} functions, used by the GPU Manager to issue a kernel call based on the kernel identifier defined in the work request -\item Submitting work requests to the GPU Manager -\end{enumerate} - -\subsection{Work Request} -\code{workRequest} is a simple structure which contains the necessary parameters -for CUDA kernel execution along with some additional members for automating -data transfer between the host and the device. -A work request consists of the following data members: - -\begin{description} -\item[dim3 dimGrid]- a triple which defines the grid structure for the kernel; -in the example below \code{dimGrid.x} specifies the number of blocks. -\code{dimGrid.y} and \code{dimGrid.z} are unused. - -\item[dim3 dimBlock]- a triple defining each block's structure; -specifies the number of threads in up to three dimensions. - -\item[int smemSize]- the number of bytes in shared memory to be dynamically -allocated per block for kernel execution. - -\item[int nBuffers]- number of buffers used by the work request. - -\item[dataInfo *bufferInfo]- array of \code{dataInfo} structs containing buffer -information for the execution of the work request. This array must be of size \code{nBuffers}, e.g. -\begin{alltt} -code{wr->bufferInfo = (dataInfo *) malloc(wr->nBuffers * sizeof(dataInfo))} -\end{alltt} -We explain the contents of \code{dataInfo} struct below. - -\item[void *callbackFn]- a pointer to a \code{CkCallback} object specified by the user; -executed after the kernel and memory transfers have finished. - -\item[const char *traceName]- A short identifier used for tracing and logging. - -\item[function runKernel]- A user defined host function to run the kernel. - We will pass this function three parameters: - \begin{description} - \item[workRequest]- The workrequest being run. - \item[kernelStream]- The cuda stream to run the kernel in. - \item[deviceBuffers]- An array of device pointers, indexed by bufferID. - \end{description} - -\item[int state]- the stage of a \code{workRequest}'s execution, set and used internally by the GPU Manager - -\item[void *userData]- may be used to pass scalar values to kernel -calls, such as the size of an array. - -\end{description} - -\subsubsection{dataInfo} -\begin{description} -\item[int bufferID]- the ID of a buffer in the runtime system's buffer table. -May be specified by the user if direct control over the buffer space is -desired. Otherwise, if it is left unset or set to a negative value, the GPU Manager will -assign a valid buffer ID. 
- -\item[int transferToDevice, transferFromDevice]- flags to indicate if the buffer -should be transferred to the device prior to the execution of a kernel, and/or -transferred out after the kernel - -\item[int freeBuffer]- a flag to indicate if the device buffer memory should be -freed after execution of \code{workRequest}. - -\item[void *hostBuffer]- pointer to host data buffer. In order to allow -asynchronous memory transfer and data computation on device this buffer must -be allocated from page-locked memory. -\begin{alltt} -void *hostBuffer = hapi\_poolMalloc(size); -\end{alltt} -% code around memPool? -This returns the buffer of required size from the GPU Manager's pool of pinned memory on the host. -Direct allocation of pinned memory (e.g. using \code{cudaMallocHost}) is discouraged, -as it will block the CPU until pending GPU work has finished executing. -The user must add the \code{-DGPU\_MEMPOOL} flag while compiling CUDA files. -This is required to enable fetching of page-locked memory from GPU Manager. -You may add it with your \code{NVCC\_FLAGS}. - -\item[size\_t size]- size of buffer in bytes. -\end{description} - -\subsubsection{Work Request Example} - -Here is an example method for creating a \code{workRequest} of the addition of -two vectors A and B. - -\begin{alltt} -#include "wr.h" -#define BLOCK_SIZE 256 -void createWorkRequest(int vectorSize, float *h_A, float *h_B, float **h_C, int myIndex, CkCallback *cb) -\{ - dataInfo *info; - workRequest *vecAdd = new workRequest; - int size = vectorSize * sizeof(float); - - vecAdd->dimGrid.x = (vectorSize - 1) / BLOCK_SIZE + 1; - vecAdd->dimBlock.x = BLOCK_SIZE; - vecAdd->smemSize = 0; - vecAdd->nBuffers = 3; - vecAdd->bufferInfo = new dataInfo[vecAdd->nBuffers]; - - /* Buffer A */ - info = &(vecAdd->bufferInfo[0]); - - /* The Buffer ID will be set by the library - if unset or set to a negative number, - or it can be set by the user. */ - info->bufferID = -1; - - info->transferToDevice = YES; - info->transferFromDevice = NO; - info->freeBuffer = YES; - - /* This fetches the pinned host memory already allocated by the library, - required for asynchronous data transfers. */ - info->hostBuffer = hapi_poolMalloc(size); - - /* Copy the data to the workRequest's buffer. */ - memcpy(info->hostBuffer, h_A, size); - - info->size = size; - - /* Buffer B will be same as A.*/ - - /* Buffer C */ - info = &(vecAdd->bufferInfo[2]); - info->transferFromDevice = YES; - info->hostBuffer = hapi_poolMalloc(size) - - / * We change the address to the address returned by the library - to read the copied result */ - *h_C = (float *)info->hostBuffer; - - /* a CkCallback pointer */ - vecAdd->callbackFn = cb; - - vecAdd->traceName = "add"; - - /* kernel run function */ - vecAdd->runKernel = run_add; - - vecAdd->userData = new int; - memcpy(vecAdd->userData, &vectorSize, sizeof(int)); - - /* enqueue the workRequest in the workRequestQueue. */ - enqueue(vecAdd); -\} -\end{alltt} - -\subsection{Writing Kernels} - -Writing a kernel is unchanged from normal CUDA programs. -Kernels are written in one (or more) .cu files. -Here is an example of \code{vectorAdd.cu}. -The full example can be found in \code{examples/charm++/cuda/vectorAdd/}. 
- -\begin{alltt} -__global__ void vecAdd(float *a, float *b, float *c, int n) -\{ - // Get our global thread ID - int id = blockIdx.x * blockDim.x + threadIdx.x; - - // Make sure we do not go out of bounds - if (id < n) - c[id] = a[id] + b[id]; -\} -\end{alltt} - -\subsection{Launching Kernels} - -Kernel launches are identical to regular kernel launches in normal CUDA -programs, run in a small dedicated function. - -\begin{alltt} -void run_add(workRequest *wr, cudaStream_t kernelStream, void **deviceBuffers) -\{ - printf("Add KERNEL \textbackslash{n}"); -/* - * devBuffers is declared by our library during the init phase on every processor. - * It jumps to the correct array index with the help of bufferID, - * which is supplied by the library or user. - */ - vecAdd<<< wr->dimGrid, wr->dimBlock, wr->smemSize, kernelStream>>> - ((float *) deviceBuffers[wr->bufferInfo[0].bufferID], - (float *) deviceBuffers[wr->bufferInfo[1].bufferID], - (float *) deviceBuffers[wr->bufferInfo[2].bufferID], - *((int *) wr->userData)); -\} -\end{alltt} - -\section{Compiling} - -As mentioned earlier, there are no changes to the \code{.ci} and \code{.C} -files. Therefore there are no changes in compiling them. -CUDA code, however, must be compiled using \code{nvcc}. -You can use the following example makefile to compile a .cu file. -More example codes can be found in the \code{examples/charm++/cuda} directory. - -\begin{alltt} -CUDA_LEVEL=35 - -NVCC = /usr/local/cuda/bin/nvcc - -NVCC_FLAGS = -O3 -c -use_fast_math -DGPU_MEMPOOL - -NVCC_FLAGS += -arch=compute_\$(CUDA_LEVEL) -code=sm_\$(CUDA_LEVEL) - -NVCC_INC = -I/usr/local/cuda/include - -CHARMINC = -I\$\{CHARMDIR\}/include - -LD_LIBS= -lcublas - -all: vectorAdd -\qquad\$(NVCC) \$(NVCC_FLAGS) \$(NVCC_INC) \$(CHARMINC) -o vectorAddCU.o vectorAdd.cu -\end{alltt} - -GPU Manager also supports \texttt{CuBLAS} or other GPU libraries in exactly the same way. -Call \texttt{CuBLAS} or the other GPU library directly from a kernel run function; -creating the \code{workRequest} works the same as any other kernel. - - -\section{Debugging} - -A few useful things for debugging: - -\begin{enumerate} -\item Enabling the \code{GPU\_MEMPOOL\_DEBUG} flag -(using \code{-DGPU\_MEMPOOL\_DEBUG}) during execution prints debug -statements, including when buffers are allocated and freed. - -\item Define the \code{GPU\_DEBUG} flag (as above) to output more verbose -debugging information during execution. - -\item When debugging, add the flag \code{-g} during compilation. - -\item For more infromation on these and other flags, see the comment at the top -of \code{cuda-hybrid-api.cu}. -\end{enumerate} +\setcounter{secnumdepth}{3} + +\renewcommand{\code}[1]{\texttt{\textbf{#1}}} + +\newcommand{\cuda}{\code{CUDA}} + +\section{Overview} + +GPU Manager is a task offload and management library for efficient use of +CUDA-enabled GPUs in \charmpp{} applications. CUDA code can be integrated +in \charmpp{} just like any \CC{} program, but the resulting performance +is likely to be far from ideal. This is because overdecomposition, a core +concept of \charmpp{}, creates fine-grained objects and tasks which causes +problems on the GPU. + +GPUs are throughput-oriented devices with peak computational capabilities that +greatly surpass equivalent-generation CPUs but with limited control logic. This +currently constrains them to be used as accelerator devices controlled by code +on the CPU. 
Traditionally, programmers have had to either (a) halt the execution
of work on the CPU whenever issuing GPU work to simplify synchronization or (b)
issue GPU work asynchronously and carefully manage and synchronize concurrent
GPU work in order to ensure progress and good performance. The latter option,
which is practically a requirement in \charmpp{} to preserve asynchrony, becomes
significantly more difficult with numerous concurrent objects that issue kernels
and data transfers to the GPU.

The \charmpp{} programmer is strongly recommended to use CUDA streams to
mitigate this problem, by assigning separate streams to chares. This allows
operations in different streams to execute concurrently. Note that concurrent
data transfers are limited by the number of DMA engines, and current GPUs have
one engine per transfer direction (host-to-device and device-to-host).
The concurrent kernels feature of CUDA allows multiple kernels to execute
simultaneously on the device, as long as sufficient resources are available.

An important performance consideration when using GPUs in \charmpp{} is that the
CUDA API calls invoked by chares to offload work should be non-blocking. The
chare that has just offloaded work to the GPU should yield the PE so that other
chares waiting to be executed can do so. Unfortunately, many CUDA API calls used
to wait for completion of GPU work, such as \code{cudaStreamSynchronize} and
\code{cudaDeviceSynchronize}, are blocking. Since the PEs in \charmpp{} are
implemented as persistent kernel-level threads mapped to each CPU core, this
means other chares cannot run until the GPU work completes and the blocked
chare finishes executing. To resolve this issue, GPU Manager provides the Hybrid
API (HAPI) to the \charmpp{} user, which includes new functions that provide
non-blocking behavior as well as a set of wrappers around CUDA runtime API
functions. The non-blocking API allows the user to specify a \charmpp{} callback
upon offload which will be invoked when the operations in the CUDA stream are
complete.

\section{Building GPU Manager}

GPU Manager is not included by default when building \charmpp{}. In order to use
GPU Manager, the user must build \charmpp{} using the \cuda{} option, e.g.

\begin{alltt}
./build charm++ netlrts-linux-x86_64 cuda -j8
\end{alltt}

Building GPU Manager requires an installation of the CUDA toolkit on the system.

\section{Using GPU Manager}

As explained in the Overview section, use of CUDA streams is strongly recommended.
This allows kernels offloaded by chares to execute simultaneously on the GPU,
which boosts performance when the kernels are small enough that the GPU can
allocate resources to several of them at once.

In a typical \charmpp{} application using CUDA, \code{.C} and \code{.ci} files
would contain the \charmpp{} code, whereas a \code{.cu} file would include the
definition of CUDA kernels and a function that serves as an entry point from
the \charmpp{} application to use GPU capabilities. CUDA/HAPI calls for data
transfers or kernel invocations would be placed inside this function, although
they could also be put in a \code{.C} file provided that the right header files
are included (\code{<hapi.h>} or \code{"hapi.h"}). The user should make sure
that the CUDA kernel definitions are compiled by \code{nvcc}, however.

After the necessary data transfers and kernel invocations, \code{hapiAddCallback}
would be placed where \code{cudaStreamSynchronize} or
\code{cudaDeviceSynchronize} would typically go, as in the sketch below.
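The following is a minimal sketch of such a \code{.cu} entry point, shown here
for illustration only; it is not taken from the distribution. It assumes a
hypothetical kernel \code{vecAddKernel} and wrapper function \code{vecAddGPU},
uses the HAPI calls listed later in this section, and omits error handling
beyond \code{hapiCheck} as well as the eventual freeing of the device buffers.

\begin{alltt}
#include "hapi.h"

/* Hypothetical kernel definition, compiled by nvcc. */
__global__ void vecAddKernel(float* a, float* b, float* c, int n) \{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) c[i] = a[i] + b[i];
\}

/* Entry point called from a chare's .C code; cb is invoked by the
 * runtime once all operations enqueued in the stream have completed. */
void vecAddGPU(float* h_A, float* h_B, float* h_C, int n, CkCallback* cb) \{
  size_t size = n * sizeof(float);
  cudaStream_t stream = hapiGetStream(); /* stream managed by the runtime */

  float *d_A, *d_B, *d_C;
  hapiCheck(hapiMalloc((void**)&d_A, size));
  hapiCheck(hapiMalloc((void**)&d_B, size));
  hapiCheck(hapiMalloc((void**)&d_C, size));

  hapiCheck(hapiMemcpyAsync(d_A, h_A, size, cudaMemcpyHostToDevice, stream));
  hapiCheck(hapiMemcpyAsync(d_B, h_B, size, cudaMemcpyHostToDevice, stream));

  int blockSize = 256;
  int gridSize = (n + blockSize - 1) / blockSize;
  vecAddKernel<<<gridSize, blockSize, 0, stream>>>(d_A, d_B, d_C, n);

  hapiCheck(hapiMemcpyAsync(h_C, d_C, size, cudaMemcpyDeviceToHost, stream));

  /* Non-blocking: returns immediately; the device buffers would be
   * freed (e.g. with hapiFree) after the callback has been delivered. */
  hapiAddCallback(stream, cb);
\}
\end{alltt}

For the host-device transfers to be fully asynchronous, the host buffers would
be allocated from pinned memory, e.g. with \code{hapiMallocHost} or
\code{hapiPoolMalloc}.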
The call to \code{hapiAddCallback} informs the runtime that a chare has
offloaded work to the GPU, allowing the provided \charmpp{} callback to be
invoked once that work is complete. The non-blocking API has the following
prototype:

\begin{alltt}
  void hapiAddCallback(cudaStream_t stream, CkCallback* callback);
\end{alltt}

Other HAPI calls:

\begin{alltt}
  void hapiCreateStreams();
  cudaStream_t hapiGetStream();

  cudaError_t hapiMalloc(void** devPtr, size_t size);
  cudaError_t hapiFree(void* devPtr);
  cudaError_t hapiMallocHost(void** ptr, size_t size);
  cudaError_t hapiFreeHost(void* ptr);

  void* hapiPoolMalloc(int size);
  void hapiPoolFree(void* ptr);

  cudaError_t hapiMemcpyAsync(void* dst, const void* src, size_t count,
                              cudaMemcpyKind kind, cudaStream_t stream = 0);

  hapiCheck(code);
\end{alltt}

\code{hapiCreateStreams} creates as many streams as the maximum number of
concurrent kernels supported by the GPU device. \code{hapiGetStream} hands
out a stream created by the runtime in round-robin fashion. The
\code{hapiMalloc} and \code{hapiFree} functions are wrappers around the
corresponding CUDA API calls, and the \code{hapiPool} functions provide a
memory pool from which pinned host memory can be obtained and freed without
blocking on pending GPU work. \code{hapiCheck} checks whether the given code
executes without errors; the code should return \code{cudaError\_t} for the
check to work.

Example \charmpp{} applications using CUDA can be found under
\code{examples/charm++/cuda}. Code guarded by \code{\#ifdef USE\_WR} uses the
\code{workRequest} scheme, which is now deprecated.
-- 
2.11.4.GIT