Let’s look at how to program GPUs using CUDA, with a simple example: vector addition.

CUDA lets you write C code that runs on the GPU, in the form of functions known as *kernels*. Kernels are called from the CPU (host) side much like regular functions, but they execute on the GPU (device).

The idea:

  1. Define the CUDA kernel β€” a function where each thread adds a pair of elements from vectors A and B. Use __global__ to mark a function as a kernel.
  2. Set up the data the kernel needs β€” input arrays and an output array. CUDA functions like cudaMalloc and cudaMemcpy handle GPU memory allocation and data transfer.
  3. Launch the kernel from the CPU using this syntax:
    myKernel<<<numBlocks, numThreadsPerBlock>>>(args);
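
Putting the three steps together, a minimal sketch might look like the following (error checking omitted for brevity; the names `vecAdd`, `hA`, `dA`, etc. are illustrative, not from the text above):

```cuda
#include <stdlib.h>

// Step 1: the kernel. Each thread adds one pair of elements.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: grid may overshoot n
        C[i] = A[i] + B[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host-side input and output arrays
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    // Step 2: allocate GPU memory and copy the inputs over
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Step 3: launch with enough blocks to cover all n elements
    int numThreadsPerBlock = 256;
    int numBlocks = (n + numThreadsPerBlock - 1) / numThreadsPerBlock;
    vecAdd<<<numBlocks, numThreadsPerBlock>>>(dA, dB, dC, n);

    // Copy the result back to the host, then clean up
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Note the bounds check `if (i < n)` in the kernel: the launch rounds the block count up, so the last block may contain threads with no element to process.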

Threads are grouped into blocks. The total number of threads is:

numBlocks Γ— numThreadsPerBlock

Other notes: