CUDA

CUDA is the programming model used for general purpose programming of NVIDIA GPUs. It's an extension of C++ that adds support for transferring data between the CPU (host) and GPU (device), and using the GPU's massive parallelism for arbitrary computation.
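
To give a concrete feel for what that extension looks like, here's a minimal sketch (not part of this repo; the names are made up for illustration). It allocates an array on the device, copies data over, doubles every element on the GPU, and copies the result back.

#include <cstdio>

// __global__ marks a kernel: a function that runs on the device (GPU)
// but is launched from the host (CPU).
__global__ void double_elements(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] *= 2.0f;
    }
}

int main() {
    int const n = 1024;
    float host_data[n];
    for (int i = 0; i < n; ++i) {
        host_data[i] = static_cast<float>(i);
    }

    // Allocate device memory and copy the host data over.
    float* device_data;
    cudaMalloc(&device_data, n * sizeof(float));
    cudaMemcpy(device_data, host_data, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch the kernel: 4 blocks of 256 threads each (4 * 256 = 1024 threads).
    // The triple angle brackets are CUDA's addition to C++ syntax.
    double_elements<<<4, 256>>>(device_data, n);

    // Copy the results back and clean up.
    cudaMemcpy(host_data, device_data, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(device_data);

    printf("host_data[10] = %f\n", host_data[10]);
    return 0;
}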

Why CUDA?

There are a couple layers to this. First, why are we considering GPUs next to microcontrollers like the humble SAMD11? While GPUs started as a tool for actual graphics processing, they quickly became popular in the high performance and scientific computing worlds for their sheer FLOPS. Nowadays many machine learning applications are moving their way out of the datacenter and into physical devices, to accelerate tasks like object detection or speech interfaces. As a result there are an increasing number of small GPU modules designed for integration into robotics, autonomous systems, etc. They're still a little pricey now, but they get cheaper each year.

Second, why CUDA? It's a proprietary system that only works for NVIDIA hardware. OpenCL is the open equivalent. It works on pretty much any GPU, and unlike CUDA you can look at all the code behind it. This is a major win for OpenCL in my book, but at this point in time CUDA is still the de facto standard for most scientific and machine learning GPU code. NVIDIA was the first to market in that space, and they've kept their lead since then. Hopefully we start to see a more diverse ecosystem soon.

How to think about GPU programming

The main point of a GPU is to run many threads at once -- often thousands at a time, far more than you can run on any single CPU. But threads behave differently on a GPU than they do on a CPU, so writing GPU code is quite different from writing CPU code, even if you have a big compute cluster and could launch an equivalent number of CPU threads.

The biggest difference is that GPU threads are less independent than CPU threads. On a CPU, two threads might compete for resources -- like RAM, or a floating point unit -- but they can execute completely separate applications. On (almost all) GPUs, threads run in groups of 32 called warps. For the most part, all threads in a warp have to execute the same instruction at the same time! So say you have some code like this:

if (condition) {
    do_this();
} else {
    do_that();
}

If some threads need to do_this(), and others need to do_that(), they won't call these functions independently. Instead, first all the threads that need to take the first branch will execute do_this(), and the other threads just wait. Then all the threads that need to take the second branch will execute do_that() while the first group of threads waits. In total it takes as much time as if all threads executed both branches. So you don't want to have large blocks of code that only a few threads need to execute, or loops that a few threads will run way more times than others, since these things effectively hold the remaining threads hostage. On the other hand, if all the threads in the warp end up taking the same branch, then the warp can skip the other branch completely.
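
Here's a hypothetical illustration (not code from this repo; expensive_a and expensive_b are just stand-ins for two different chunks of work). Both kernels below do the same total work, but the first forces every warp through both branches, while the second keeps each warp's threads on the same branch.

// Stand-ins for two different chunks of work.
__device__ float expensive_a(float x) { return x * x; }
__device__ float expensive_b(float x) { return x + 1.0f; }

// Diverges: within each warp of 32 threads, the even and odd threads take
// different branches, so every warp pays for both.
__global__ void diverging(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i % 2 == 0) {
        data[i] = expensive_a(data[i]);
    } else {
        data[i] = expensive_b(data[i]);
    }
}

// Doesn't diverge: the condition is uniform within each warp (threads 0-31
// all take one branch, threads 32-63 the other, and so on), so each warp
// executes only one of the two branches.
__global__ void uniform(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((i / 32) % 2 == 0) {
        data[i] = expensive_a(data[i]);
    } else {
        data[i] = expensive_b(data[i]);
    }
}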

Warps, Blocks, and Latency Hiding

Conceptually, knowing that threads run in groups rather than individually is the most important thing to understand. But if you want to write the fastest GPU code, you need to know some more about how the hardware works. There's a whole hierarchy between individual threads and the GPU's global resources (like VRAM), and structuring your code to fit in this hierarchy neatly is how you get the best performance.

Warps are grouped together into blocks. Each block shares some memory, cache, and other resources. All the threads in a block can use this shared memory to communicate with each other; communication between threads in separate blocks is much more limited. When you launch a GPU kernel, you have to say how many blocks you want to run, and how many threads you want in each block (and this latter number is almost always a multiple of 32, otherwise you'll end up wasting threads in some warps).
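
As a hypothetical example of what shared memory buys you (this isn't one of this repo's programs), here's a kernel in which each block cooperatively sums its chunk of an array, with a possible launch configuration shown in the comment at the bottom.

// Each block sums 256 input values using shared memory and writes one
// partial sum to the output array.
__global__ void block_sum(const float* input, float* partial_sums, int n) {
    __shared__ float scratch[256]; // visible to every thread in this block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    scratch[tid] = (i < n) ? input[i] : 0.0f;
    __syncthreads(); // wait until every thread in the block has written its value

    // Tree reduction within the block.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            scratch[tid] += scratch[tid + stride];
        }
        __syncthreads();
    }

    if (tid == 0) {
        partial_sums[blockIdx.x] = scratch[0];
    }
}

// A launch might look like this: 256 threads per block (a multiple of 32),
// and enough blocks to cover all n elements.
// int threads_per_block = 256;
// int n_blocks = (n + threads_per_block - 1) / threads_per_block;
// block_sum<<<n_blocks, threads_per_block>>>(device_input, device_partial_sums, n);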

Finally, all threads can access global memory. Global memory is much larger than the shared memory in each block, but it's also much slower. For data intensive applications, basically all of the runtime comes from transferring data from global memory and back again; the processing each thread does takes negligible time. Certain memory access patterns are much more efficient than others -- generally speaking you want all threads in a warp to access adjacent memory locations at the same time, so that the warp as a whole can load one contiguous block of data. This is often the single most important thing to get right if you want to write fast GPU code.
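
To make that concrete, here's a hypothetical pair of copy kernels (not from this repo). Both move the same amount of data, but the first lets each warp load one contiguous run of memory, while the second scatters each warp's loads; on most GPUs the strided version is much slower.

// Coalesced: consecutive threads read consecutive floats, so each warp's
// load covers one contiguous chunk of global memory.
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[i];
    }
}

// Strided: with stride = 32, consecutive threads read locations 32 floats
// apart, so each warp's load is scattered across many memory transactions.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = in[(i * stride) % n];
    }
}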

The last important concept to understand is latency hiding. At the end of the day, threads are executed by CUDA cores that reside in streaming multiprocessors. These multiprocessors can quickly switch between running different warps. So while most GPUs have a thousand or so physical cores, you'll commonly launch kernels with tens of thousands if not millions of threads, spread across many blocks. Execution of these threads is interleaved, so at any given moment the GPU is likely to be working on several times as many threads as it has cores. The point of this is to not have to wait. Say one warp executes a costly (i.e. slow) read from global memory. Rather than stall for a few hundred clock cycles (roughly speaking) before executing that warp's next instruction, the streaming multiprocessor tries to find a different warp that's ready to execute its next instruction right away. When the first warp's data finally arrives, the streaming multiprocessor will pick it back up again and execute its next instruction.

So generally speaking you want to run a whole lot of threads at once, so that the streaming multiprocessor has the best odds of always finding some warp that's ready to do something useful. But each streaming multiprocessor only has so many registers and so much shared memory (these get divvied up among all the blocks it's processing). So if your blocks each need lots of resources, each streaming multiprocessor will only be able to host a few of them at a time, and you'll end up in situations where no resident warp is ready to execute its next instruction. This shows up as low occupancy: the ratio of warps resident on a streaming multiprocessor to the maximum number it could host. Higher occupancy means more chances of finding a warp with useful work ready to go. Balancing the number of threads with resource usage to increase occupancy is one of the most important concerns for writing really fast GPU code.
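
You don't have to reason about this entirely by hand: the CUDA runtime can report how many blocks of a given kernel fit on one streaming multiprocessor at a time. Here's a rough sketch of how you might query it (my_kernel is just a stand-in for whatever kernel you're tuning).

#include <cstdio>

__global__ void my_kernel(float* data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    data[i] *= 2.0f;
}

int main() {
    int threads_per_block = 256;

    // Ask the runtime how many blocks of my_kernel (at this block size, with no
    // dynamically allocated shared memory) can be resident on one SM at a time.
    int blocks_per_sm = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocks_per_sm, my_kernel, threads_per_block, 0);

    cudaDeviceProp props;
    cudaGetDeviceProperties(&props, 0);

    // Occupancy is the ratio of resident warps to the maximum the SM can host.
    int resident_warps = blocks_per_sm * threads_per_block / 32;
    int max_warps = props.maxThreadsPerMultiProcessor / 32;
    printf("occupancy: %d / %d warps per SM\n", resident_warps, max_warps);
    return 0;
}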

Setup

The first step is to figure out what GPU you're going to use. Most desktops have GPUs, though to run CUDA code you'll have to make sure you have an NVIDIA GPU. Higher end laptops (especially gaming ones) also often have dedicated GPUs. Note that recent Macs don't have NVIDIA GPUs; the two companies have a bit of a feud going on. You can also rent time on GPU-equipped machines in the cloud. Amazon's P3 instances have up to eight V100 GPUs, which together can deliver up to a petaflop(!). Finally, you could purchase an NVIDIA Jetson development kit.

Next you need to install CUDA. If you're using an Amazon GPU instance or a Jetson, this will be set up for you. If you're setting up your own computer, you can download and install CUDA from NVIDIA's website. CUDA comes with a whole suite of tools, the most important of which is nvcc, the CUDA compiler.

CUDA assumes you already have a C++ compiler installed. So if you're setting up your own computer you may also need to install gcc. While you're at it, it's probably a good idea to install make as well. We'll use it to build the examples in this repo. There are more detailed instructions for installing these tools here.

Building and Running

Once everything is installed, you just need to run make from this directory. This should build two example programs.

The first, get_gpu_info, just looks for NVIDIA GPUs in your system and prints some stats on each one. It doesn't actually run anything on any of them.
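
The repo's version prints more detail, but the core of a program like this is just a couple of CUDA runtime calls; roughly:

#include <cstdio>

int main() {
    int device_count = 0;
    cudaGetDeviceCount(&device_count);

    for (int i = 0; i < device_count; ++i) {
        cudaDeviceProp props;
        cudaGetDeviceProperties(&props, i);
        printf("GPU %d: %s\n", i, props.name);
        printf("  multiprocessors:    %d\n", props.multiProcessorCount);
        printf("  global memory:      %zu MiB\n", props.totalGlobalMem / (1024 * 1024));
        printf("  compute capability: %d.%d\n", props.major, props.minor);
    }
    return 0;
}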

The second, saxpy, runs a basic linear algebra routine on the CPU and GPU, and compares the run times.
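
SAXPY stands for "single-precision a times x plus y": it overwrites a vector y with a*x + y. The saxpy.cu in this repo may be organized differently (it also has to time the CPU and GPU versions), but the kernel at the heart of such a program typically looks like this, with the host-side allocation and copies following the same pattern as the earlier example.

__global__ void saxpy(int n, float a, const float* x, float* y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        y[i] = a * x[i] + y[i];
    }
}

// A typical launch, after copying x and y to the device:
// int threads_per_block = 256;
// int n_blocks = (n + threads_per_block - 1) / threads_per_block;
// saxpy<<<n_blocks, threads_per_block>>>(n, 2.0f, device_x, device_y);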