Over the past two decades, GPUs have evolved well beyond their original purpose of rendering pixels on a screen. Today, they're powerful parallel processors at the heart of many technical fields: graphics, computer vision, simulations, signal processing, bioinformatics, and (of course) AI and machine learning. Because of this, understanding how GPUs work and how to use them efficiently has become a valuable skill for any programmer.
In this article, I'd like us to explore how GPUs work from the ground up (from SIMD units to real GPU architectures) and learn how to program, debug, and profile the code that runs on them. This is (hopefully) a beginner-friendly tutorial, with occasional deep dives into technical details. We'll be using CUDA and NVIDIA architectures for the examples, but the concepts apply to other platforms as well.
Table of Contents
- Brief Introduction to GPUs
- GPU Programming Overview
- Vector Addition Example
- GPU Architectures
- Resources
Brief Introduction to GPUs
Explaining exactly what a GPU is isn't that easy (just too many layers of metal and silicon to uncover), but the main idea is simple: GPUs excel at executing many instructions simultaneously. One of the easiest ways to picture this is by comparing them to something you're already familiar with: the CPU.
CPU vs GPU
We can think of it like this:
A CPU usually comes with only a few powerful cores, each able to execute any type of instruction the hardware supports. A GPU, on the other hand, is packed with thousands of smaller cores, all designed to work in parallel. That's what makes GPUs so good at handling thousands of tasks simultaneously.
To give you an idea, a Ryzen Zen 4 CPU might have 6 cores, while NVIDIA's Ada architecture packs in over 16,384 CUDA cores plus 512 Tensor Cores. (Note: the NVIDIA chip is also about 9 times bigger than the Ryzen CPU, but the difference is still huge.)
Now, comparing CPU cores to GPU cores isn't exactly fair. They're designed for very different purposes. A CPU core is usually much more complex, capable of juggling a wide range of tasks, from running the operating system to handling input/output operations. GPU cores, on the other hand, are generally simpler and often specialized for specific kinds of calculations, like NVIDIA's Tensor Cores, which are built specifically to accelerate AI and machine learning workloads.
Another important distinction is in how GPU cores are organized. GPUs typically follow a SIMD (Single Instruction, Multiple Data) model, meaning that many cores execute the same instruction at the same time, but on different pieces of data. This is perfect for problems like graphics or matrix math, where you need to repeat the same calculation over and over across huge datasets. We'll talk about this topic in more depth in the following sections, but for now let's focus on how GPUs are actually used.
GPU Programming Overview
What we, as programmers, want is to execute work on GPUs. But GPUs don't operate on their own: the CPU acts as the control station, orchestrating the flow. Basically, the CPU sends commands to the GPU, and the GPU executes those commands across its many cores.
CPU sends commands
To make this workflow possible, hardware vendors provide different APIs. Before 2006, the main way to program GPUs was through graphics APIs like OpenGL or Direct3D, which required reframing every problem in terms of graphics operations (an approach known as GPGPU). Then NVIDIA introduced CUDA, allowing programmers to use GPUs for general-purpose computation. Around the same time, other platforms emerged: OpenCL (2009) provided a cross-vendor framework for writing code that runs on GPUs from different manufacturers, while AMD's ROCm offers a similar platform for AMD GPUs.
For the rest of this article, I'll focus on CUDA, but the concepts are similar in other environments. You can find some alternatives listed at the bottom of this page.
Introduction to CUDA
CUDA is more than just a driver and an API for programming GPUs. It's a complete development platform that gives you the tools and libraries needed to write, profile, optimize, and run code on NVIDIA GPUs.
To get started with CUDA, the easiest approach is to follow NVIDIAâs official installation guide here. It will walk you through all the necessary steps.
System Requirements
| Requirement | Description |
|---|---|
| GPU | CUDA-capable NVIDIA GPU |
| OS | Supported version of Microsoft Windows |
| IDE | Supported version of Microsoft Visual Studio |
| Driver | NVIDIA driver |
Setting Up CUDA on Windows
- Go to NVIDIA's CUDA downloads page (at the moment of writing, the latest available version is CUDA 13.0)
- Select your Windows version and download the toolkit
- Run the installer. I recommend installing all components (including Nsight Compute, Visual Studio integration, etc.) so you have the full set of tools available
CUDA Installer Setup
- At the end of the installation, you will see the list of components that have been installed. There are lots of them: libraries, profiling tools, and the most important one, NVCC, the NVIDIA CUDA Compiler
CUDA Installer Setup
At this point, you should be able to run nvcc --version in a terminal and get the CUDA version installed.
nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2025 NVIDIA Corporation
Cuda compilation tools, release 13.0, V13.0.48
The CUDA Compiler (NVCC)
One nice feature of CUDA is that you can write both CPU and GPU code in the same file, using the same programming language. This is in contrast to graphics APIs, where you usually have to keep shaders separate from the application code.
The CUDA Compiler (nvcc) is NVIDIA's tool for building CUDA programs. Its job is to take your source code, separate the CPU parts from the GPU parts, and compile them with the appropriate compiler. The CPU code (the host) is compiled with your regular system compiler (like GCC or MSVC), while the GPU code (the device) is compiled with NVIDIA's own compiler.
// hello.cu
#include <stdio.h>
// Code that runs on the GPU
__global__ void my_gpu_kernel() {
printf("Hello from the GPU!\n");
}
// Code that runs on the CPU
int main() {
// Call the GPU function
my_gpu_kernel<<<1, 1>>>();
return 0;
}
You can compile the code by running:
nvcc hello.cu -o hello
How NVCC works under the hood
When you compile a CUDA program with nvcc, it goes through several stages to handle the CPU and GPU parts of your code.
- NVCC preprocesses the GPU-specific code (the device code) and compiles it into a CUDA binary (cubin) or PTX intermediate code. These binaries are packaged into a fat binary, which can contain multiple GPU architectures in one file.
- The entire program is preprocessed again for the CPU (host) and compiled with the systemâs native compiler.
- The fat binary containing the GPU code is embedded into the host code, generating a final host object.
For simplicity, for the rest of this article we'll use Microsoft Visual Studio as our development environment, but you can always compile directly with nvcc; Visual Studio does the same thing behind the scenes.
Hello, GPU
If you open Visual Studio, you should see the option to create a CUDA project; let's go ahead and select that.
Visual Studio Project Template
By default, Visual Studio will generate a sample CUDA program for you, but to keep things simple, we'll start from scratch and leave only the main function.
#include <iostream>
int main() {
std::cout << "Hello from CPU!\n";
return 0;
}
Build and run the program, and you should see the message "Hello from CPU!" in your console. Amazing!
CPU output
If you take a closer look at what Visual Studio does under the hood when you compile the project, you'll see that it simply calls nvcc with a set of arguments.
nvcc.exe kernel.cu -o kernel.cu.obj -ccbin HOST_COMPILER_PATH -gencode=arch=compute_75,code=\"sm_75,compute_75\"
-o // Specifies the output file name.
-ccbin // Tells nvcc which C++ compiler to use for compiling the CPU (host) code
-gencode // This flag tells nvcc which GPU architectures to target.
// arch=compute_75 → the compute capability of the GPU you want to target (here, Turing architecture, e.g., RTX 20 series)
PS: If we know which architecture we're targeting (e.g., Ada, Maxwell, Hopper), we can adjust these flags to take full advantage of the capabilities that the architecture supports.
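If you're not sure what compute capability your GPU has, one way to check it at runtime is via cudaGetDeviceProperties. Here's a minimal sketch (assuming a single GPU, device 0):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // query the first GPU in the system
    // prop.major / prop.minor form the compute capability, e.g. 7.5 for Turing
    printf("%s: compute capability %d.%d\n", prop.name, prop.major, prop.minor);
    return 0;
}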
The key takeaway is that nvcc separates CPU and GPU code and prepares the GPU code to run on the appropriate architecture. But how do we actually tell CUDA what code is going to run on the GPU?
Defining CUDA Kernels
That's where CUDA kernels come into play. Kernels are functions that run on the GPU, and in CUDA we mark them with the __global__ keyword.
__global__ void my_gpu_kernel() {
printf("Hello from GPU!\n");
}
Marking a function with __global__ tells nvcc that it's a kernel: code meant to run on the GPU. Inside the kernel, you simply write the instructions you want the GPU to execute.
Launching Work on the GPU
To tell the GPU to execute this kernel, we call it by its name, using the special CUDA launch syntax <<< , >>>. For now, let's keep it simple and just pass 1 for both values inside the brackets (<<<1, 1>>>). We'll dive deeper in a moment to see why this syntax is important and how it controls the execution of your GPU code.
#include <iostream>
__global__ void my_gpu_kernel() {
printf("Hello from GPU!\n");
}
int main() {
// Launch kernel on the GPU
my_gpu_kernel<<<1, 1>>>();
std::cout << "Hello from CPU!\n";
return 0;
}
Believe it or not, that's it! The my_gpu_kernel function runs on the GPU. If you build and run the project, you should see two messages printed in the console:
Hello from CPU!
Hello from GPU!
Printing on the GPU, under the hood.
Of course, the GPU can't directly print to the console. GPU printf output is captured and buffered by the CUDA runtime. The actual output is sent back to the host (CPU) and displayed in the console when the GPU completes its work.
Note: The order of the messages might vary. The GPU executes asynchronously by default, which means the CPU can continue running while the GPU is still working. Using cudaDeviceSynchronize() ensures the CPU waits for the GPU to finish before moving to the next instruction.
Profiling and Debugging with NVIDIA Nsight
You don't have to take my word for it: we can actually verify that our code runs on the GPU using NVIDIA Nsight Compute. This tool, installed with the CUDA toolkit, lets us profile and debug GPU kernels to see exactly what's happening on the GPU.
Setting it up is pretty simple: open Nsight Compute, click the Start Activity button, locate your executable, choose an output file, and then launch the profiling session.
Setup NVIDIA Nsight Compute
The main panel will show all the GPU function calls (kernels) executed on the GPU. In our case, there's just a single kernel running: my_gpu_kernel.
For each kernel, you can find useful information such as how long it took to execute (22 microseconds in my case), how many threads were launched, and more.
Kernels that run on the GPU
You can also inspect the assembly code (PTX or SASS) for each kernel by double-clicking the kernel in the main panel. This lets you view the high-level C code we wrote side by side with the PTX/SASS intermediate code that nvcc generated.
GPU Assembly
We'll explore NVIDIA Nsight Compute in more depth in future posts, so for now it's enough to know you can use it to measure and inspect GPU kernel execution.
Steps to Run CUDA Kernels
Let's wrap up what we need to do to execute code using CUDA:
- Define a kernel with __global__
- Launch it with <<< >>> from the host code
- Compile the code using nvcc
- Run
Source code:
#include <stdio.h>
// A simple GPU kernel
__global__ void helloGPU() {
printf("Hello from the GPU!\n");
}
int main() {
// Launch the GPU kernel
helloGPU<<<1, 1>>>();
// Wait for GPU to finish
cudaDeviceSynchronize();
// CPU code
printf("Hello from the CPU!\n");
return 0;
}
Vector Addition Example
Vector addition example
Now, let's tackle a real problem. We'll start with a basic and very common operation: adding two vectors on the GPU. We have two input vectors, A and B, and we want C = A + B.
The code can look like this:
// Kernel: iterate over all elements, add them together
__global__ void vectorAdd(const int* A, const int* B, int* C, int N) {
for (int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}
Notice that the function takes as arguments the three arrays we need (A, B, and C), as well as the number of elements N.
But there's a problem we need to solve first: this kernel will run on the GPU, and the GPU has its own memory (VRAM). It can't directly access data stored in the CPU's main RAM.
So, before we can run our kernel, we need to allocate memory for the three vectors in the GPU's VRAM and copy the input data over.
Allocating GPU Memory
To allocate memory on the GPU, CUDA provides the cudaMalloc function. The concept is simple: you tell CUDA how many bytes you need, and it gives you back a pointer to that space in the GPU's memory.
CUDA Malloc
The function definition looks like:
cudaError_t cudaMalloc(void** devPtr, size_t size);
// devPtr → address of a pointer where the device memory address will be stored
// size â the number of bytes to allocate
In our case, what we need to do is:
int* d_A, *d_B, *d_C;
// Allocate space for the input and output vectors on the GPU, N is the number of elements
cudaMalloc((void**)&d_A, N * sizeof(int));
cudaMalloc((void**)&d_B, N * sizeof(int));
cudaMalloc((void**)&d_C, N * sizeof(int));
This allocates space in the GPU memory and stores the device pointers in d_A, d_B and d_C.
We can now use these pointers and pass them to the kernel when executing it.
int main() {
// Launch the GPU kernel
vectorAdd<<<1, 1>>>(d_A, d_B, d_C, N);
}
Freeing the allocated memory
After you're done using GPU memory, it's important to free it to avoid memory leaks. The CUDA counterpart to malloc is cudaFree, which releases memory previously allocated with cudaMalloc.
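For our vector addition example, once the kernel has finished and the results have been copied back, that simply means releasing the three buffers:
// Release the GPU memory previously allocated with cudaMalloc
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);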
Copying Data to the GPU
We've allocated memory on the GPU for our input vectors, but that memory isn't initialized yet: we allocated it, but we haven't filled it. So the next step is to send the input data from the CPU over to the GPU.
CUDA provides the cudaMemcpy function for this purpose. It copies data between host (CPU) and device (GPU) memory. It looks like this:
cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, cudaMemcpyKind kind);
// dst → destination pointer
// src → source pointer
// count → number of bytes to copy
// kind → direction of the copy, e.g., cudaMemcpyHostToDevice or cudaMemcpyDeviceToHost (CPU to GPU or GPU to CPU)
For our vector addition example, we can copy the input vectors like this:
// Copy input vectors from host to device
// h_A and h_B are the CPU-allocated vectors, holding the data we want
cudaMemcpy(d_A, h_A, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(int), cudaMemcpyHostToDevice);
Now the GPU memory holds the actual input data, ready for our kernel to use.
Vector Addition Kernel (Single Thread, 1M Elements)
Let's now put everything together:
#include <iostream>
#include <cuda_runtime.h>
// GPU kernel: single thread iterates over all elements
__global__ void vectorAdd(const int* A, const int* B, int* C, int N) {
for (int i = 0; i < N; i++) {
C[i] = A[i] + B[i];
}
}
int main() {
int N = 1024000; // number of elements
size_t size = N * sizeof(int);
// Host arrays
int* h_A = new int[N];
int* h_B = new int[N];
int* h_C = new int[N];
// Initialize input vectors
for (int i = 0; i < N; i++) {
h_A[i] = i;
h_B[i] = i * 2;
}
// Device pointers
int *d_A, *d_B, *d_C;
// Allocate GPU memory
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
cudaMalloc((void**)&d_C, size);
// Copy input data to GPU
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
// Launch kernel
vectorAdd<<<1, 1>>>(d_A, d_B, d_C, N);
// Copy result back to host
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);
// Free GPU memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
// Free host memory
delete[] h_A;
delete[] h_B;
delete[] h_C;
return 0;
}
Awesome! Build and run, and you should get the sum of 1 million elements computed on the GPU.
CUDA Debugger
Now, if you want to verify that the results are correct, there are a couple of options:
- Copy the results back to the CPU and inspect the values of vector C in Visual Studio, or print them to the console (see the sketch after this list)
C++ Debugging
- Or, use the NVIDIA Nsight CUDA Debugger, which integrates into Visual Studio when you install the CUDA toolkit
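For the first option, a minimal host-side check could look like this (a small sketch, meant to go right after the cudaMemcpy that copies d_C back into h_C in the program above):
// Verify on the CPU that every element of C is the sum of A and B
bool ok = true;
for (int i = 0; i < N; i++) {
    if (h_C[i] != h_A[i] + h_B[i]) {
        std::cout << "Mismatch at index " << i << "\n";
        ok = false;
        break;
    }
}
std::cout << (ok ? "Results are correct!\n" : "Results are wrong!\n");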
To set up the Nsight CUDA Debugger in Visual Studio:
- Go to Build → Rebuild Solution (for your project)
- Add a breakpoint in your kernel
- Go to Extensions → Nsight → Start CUDA Debugging
That's it. You can inspect the local variables in the Locals window and visualize GPU memory in the Memory window (Debug → Windows → Memory → Memory View).
CUDA Debugger
In the above screenshot I added a breakpoint in the kernel and stepped through a few elements. In the Memory window at the bottom, you can see the values of vectors A, B, and C, correctly showing the sum of A and B. For readability, I configured the memory views to display 1 element per column and to interpret memory as 4-byte integers.
Note: you can drag and drop an address from the Locals window into the Memory window to directly inspect the memory region that interests you.
The Nsight CUDA Debugger can do much more than that: you can inspect SASS assembly, view warp registers, monitor thread states, and more. We'll soon see these "advanced features" in practice.
If you want to find out more about it, check out the official docs: NVIDIA Nsight Visual Studio Edition - CUDA Debugger.
CUDA Error Checking
It's important to check for errors after launching kernels or calling CUDA runtime APIs. Otherwise, your program might silently fail without telling you why. CUDA provides a mechanism to report errors. Here's a simple pattern you can use:
// Wrap CUDA calls to report errors; the do/while and local variable
// make sure the wrapped expression is evaluated exactly once
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            printf("CUDA Error: %s (at %s:%d)\n", cudaGetErrorString(err_), __FILE__, __LINE__); \
            exit(EXIT_FAILURE); \
        } \
    } while (0)

int main() {
    int* d_a = nullptr;
    // Allocate memory
    CUDA_CHECK(cudaMalloc(&d_a, 10 * sizeof(int)));
    // Launch kernel
    my_gpu_kernel<<<1, 1>>>();
    // Check for kernel launch errors
    CUDA_CHECK(cudaGetLastError());
}
That being said, this type of error checking isn't perfect. It doesn't tell you anything about runtime issues in your GPU code (like out-of-bounds memory accesses, race conditions, or performance bottlenecks); it only checks CUDA API calls. There are other tools (such as NVIDIA Nsight, Compute Sanitizer, or cuda-memcheck) that can help us identify and debug GPU-specific issues, but we'll cover those in future posts.
Measuring GPU Code Performance
Let's see how much time it takes to compute the sum of our 1 million elements. For profiling, you can use Nsight Compute (as we did earlier) and check the duration of each kernel.
In my case, Nsight Compute reports 50 ms (50!!!).
Vector Add Kernel Duration
CUDA also offers a way to record events, which work kind of like C++'s chrono library. The idea is simple: you create two events, sandwich your kernel launch between them, and then measure the elapsed time.
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
// Record Start
cudaEventRecord(start);
// Launch kernel
vectorAdd<<<1, 1>>>(d_A, d_B, d_C, N);
// Record End
cudaEventRecord(stop);
cudaEventSynchronize(stop);
// Calculate the elapsed time
float milliseconds = 0;
cudaEventElapsedTime(&milliseconds, start, stop);
printf("Kernel execution time: %f ms\n", milliseconds);
So now we have two ways of timing the kernel:
- Nsight Compute, which gives us detailed profiling info
- CUDA events, which give us just the raw elapsed time
Note: the two don't show the same kind of information. Nsight gives you detailed info (memory transfers, stalls, etc.), while events only measure the time between two points.
Going back to the numbers: it takes ~50 ms to add 1 million elements on the GPU. That's a lot. I did the same operation on the CPU and it took around 1 ms.
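For reference, the CPU version is just a plain loop; timed with std::chrono it could look roughly like this (a minimal sketch, the exact numbers will vary by machine):
#include <chrono>
#include <iostream>
#include <vector>

int main() {
    const int N = 1024000;
    std::vector<int> A(N), B(N), C(N);
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = i * 2; }

    auto start = std::chrono::high_resolution_clock::now();
    for (int i = 0; i < N; i++) {
        C[i] = A[i] + B[i];   // same work as the GPU kernel, on a single CPU thread
    }
    auto stop = std::chrono::high_resolution_clock::now();

    std::chrono::duration<double, std::milli> elapsed = stop - start;
    std::cout << "CPU execution time: " << elapsed.count() << " ms\n";
    return 0;
}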
So, why the huge difference?
Well… the answer lies in the fact that we're only using a single GPU core: we're doing all the work on a single thread. That's a complete waste of the GPU's compute power. (It also highlights the difference between a CPU core and a GPU core; as we discussed earlier, they're designed for very different workloads.)
Launching Multiple Threads, CUDA Thread Hierarchy
So, instead of wasting all that parallel power with just one thread, we can launch many threads at once. In our case, we can launch 1 million threads and map one thread to each element (C[i] = A[i] + B[i]).
Multiple threads
This way, instead of one thread looping over a million elements, we'll have a million threads, each doing a single addition.
In CUDA, we achieve this using the special kernel launch syntax.
When you launch a kernel, the angle brackets <<< >>> let you specify how many threads you want to launch.
But there's a catch: you can't just say "launch one million threads". CUDA breaks the kernel launch into a well-defined hierarchy (or a CTA grid, as NVIDIA calls it).
- At the top level, you have the Grid
- A Grid is made up of Thread Blocks
- Each Thread Block contains a number of Threads
The important detail is: while you can have as many blocks as you want, each block is limited to a maximum of 1024 threads (You will get an error if you try to make it bigger than that).
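You can query this limit on your own GPU; here's a minimal sketch using cudaGetDeviceProperties (again assuming device 0):
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);   // 1024 on current NVIDIA GPUs
    printf("Max block dimensions: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    return 0;
}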
So, to handle 1,024,000 elements, we split the work into:
- 1,000 blocks
- 1,024 threads per block
- = 1,024,000 threads in total.
The number of blocks and the block size (threads per block) are specified in CUDA's special kernel launch syntax, inside the angle brackets <<< >>>.
vectorAdd<<<1000, 1024>>>(d_A, d_B, d_C, N);
// 1000 → the number of blocks in the grid
// 1024 → the number of threads in each block
// The syntax is:
kernel_name<<<gridDim, blockDim>>>
// gridDim  → how many blocks to launch
// blockDim → how many threads in each block (max 1024)
Note: Both gridDim and blockDim can be specified in 1D, 2D, or 3D. This makes it easy to map threads to your data: 1D for vectors, 2D for images, 3D for voxels, and so on.
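For example, a 2D launch for an image could be set up with dim3 like this (a hedged sketch; the 1920x1080 size and my_image_kernel are made up for illustration):
// Hypothetical 2D launch covering a 1920x1080 image, one thread per pixel
dim3 blockDim(16, 16);                               // 256 threads per block, arranged as 16x16
dim3 gridDim((1920 + blockDim.x - 1) / blockDim.x,   // enough blocks to cover the width
             (1080 + blockDim.y - 1) / blockDim.y);  // enough blocks to cover the height
// my_image_kernel<<<gridDim, blockDim>>>(...);      // hypothetical kernel launch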
Then, inside the kernel, each thread figures out "which element am I responsible for?" by combining its block index and thread index. CUDA provides these values as built-in variables (blockIdx, threadIdx, blockDim) that are set automatically at runtime by the driver and hardware:
// threadIdx.x → ID of the thread inside its block
// blockIdx.x  → ID of the block inside the grid
// blockDim.x  → number of threads in a block
__global__ void vectorAdd(int* A, int* B, int* C, int N) {
// Compute the global thread ID
int i = blockIdx.x * blockDim.x + threadIdx.x;
// Each thread adds one element
if (i < N) {
C[i] = A[i] + B[i];
}
}
Experimenting with the grid size.
I encourage you to experiment with the concept of grid:
- How many threads are launched in total if you set the thread block size to 512 and launch 2 blocks?
- How many threads are launched if dimGrid = (2, 2) and blockDim = (32, 32)?
- What about dimGrid = (4, 1, 1) and blockDim = (16, 1, 1)?
Profiling with 1M threads
Let's profile our new approach (using Nsight Compute again): select the executable, click Start Activity, and select the kernel.
Source code:
#include <iostream>
#include <cuda_runtime.h>
__global__ void vectorAdd(const int* A, const int* B, int* C, int N) {
int idx = blockIdx.x * blockDim.x + threadIdx.x;
if (idx < N) {
C[idx] = A[idx] + B[idx];
}
}
int main() {
const int N = 1024000;
// Host memory allocation
int* h_A = new int[N];
int* h_B = new int[N];
int* h_C = new int[N];
// Initialize host arrays
for (int i = 0; i < N; i++) {
h_A[i] = i;
h_B[i] = i;
}
// Device memory allocation
int *d_A, *d_B, *d_C;
cudaMalloc((void**)&d_A, N * sizeof(int));
cudaMalloc((void**)&d_B, N * sizeof(int));
cudaMalloc((void**)&d_C, N * sizeof(int));
// Copy input data from host to device
cudaMemcpy(d_A, h_A, N * sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(d_B, h_B, N * sizeof(int), cudaMemcpyHostToDevice);
// Kernel launch configuration
int blockSize = 1024;
int numBlocks = (N + blockSize - 1) / blockSize;
// Launch kernel
vectorAdd<<<numBlocks, blockSize>>>(d_A, d_B, d_C, N);
// Copy results back to host
cudaMemcpy(h_C, d_C, N * sizeof(int), cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(d_A);
cudaFree(d_B);
cudaFree(d_C);
// Free host memory
delete[] h_A;
delete[] h_B;
delete[] h_C;
return 0;
}
0.02 ms!!! Down from 50 ms. Huge improvement, right?
But here's something that you might find interesting:
- If I set the thread block size to 256 (instead of 1024), the result is actually slightly faster
- If I set the thread block size to 32, the result becomes a bit worse
So, the thread block size matters for performance. It isn't just a software concept: it ties directly into the GPU hardware itself. We'll dig deeper into why this happens in the following articles.
| Test Scenario | Duration |
|---|---|
| 1 CPU thread | 1 ms |
| 1 GPU thread | 50 ms |
| 1M GPU threads, block size = 1024 | 0.020 ms |
| 1M GPU threads, block size = 256 | 0.018 ms |
| 1M GPU threads, block size = 32 | 0.022 ms |
The difference isn't very big here, but our use case is also super simple. In more complex scenarios, thread block size can have a much greater impact on performance.
Debugging multiple threads
Let's now see what happens when we run multiple threads and how we can debug them. For simplicity, I'll reduce the number of threads we launch to 128, split into 4 blocks of 32 threads.
const int N = 128; // number of elements
const int THREADS_PER_BLOCK = 32;
const int blocks = (N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
vectorAdd<<<blocks, THREADS_PER_BLOCK>>>(d_A, d_B, d_C, N);
Next, I'll add a breakpoint and attach the CUDA Debugger. Now, if you open Nsight → Windows, you'll notice some interesting options. One of them is the Warp Info Window. Let's open it, but first, a quick detour.
What is a warp?
In short, a warp is a group of 32 threads that always execute the same instruction. That's how GPUs work under the hood: they split the threads we launch into fixed-size chunks of 32. We'll talk more about warps later, but for now just keep this in mind.
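To build some intuition, here's a hedged sketch (warpInfoDemo is a made-up kernel name) of how a thread can compute which warp and lane it belongs to within its block:
__global__ void warpInfoDemo() {
    int lane   = threadIdx.x % warpSize;   // position of this thread inside its warp (0-31)
    int warpId = threadIdx.x / warpSize;   // which warp of the block this thread belongs to
    if (lane == 0) {
        printf("Block %d, warp %d starts at thread %d\n", blockIdx.x, warpId, threadIdx.x);
    }
}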
Warp Info Window
Warp info window
The Warp Info Window shows a few interesting things:
- The number of warps currently running (that's at the top right of the window): in our case, it's 4. We launched 128 threads (4 blocks × 32 threads), which maps to 4 warps
- Then, we have the "Shader Info", showing the ID of the block each warp belongs to. Here it's simple: one warp (32 threads) per block, so warp IDs map perfectly to block IDs: 0, 1, 2, 3
- Next, in the "Threads" section we can see all 32 threads that are part of a warp. You'll also see a little yellow arrow indicating the thread you're currently debugging. You can double-click any rectangle to switch threads and see how variables (like i) change
Warp Watch Window
Let's now inspect the Warp Watch Window.
This window is even cooler: it shows the values of your variables across the entire warp, all 32 threads.
Warp watch window
For example, here I'm debugging the thread with index 80 in our grid. Notice how:
- blockIdx.x = 2
- threadIdx.x = 16 (its index inside the block, whose size matches the warp size)
- So i = blockIdx.x * blockDim.x + threadIdx.x = 2 * 32 + 16 = 80
If you check the arrays, you can see the results of the additions for the elements of A and B.
PS: Try experimenting with different configurations:
- Use a thread block size of 64 and see how the block IDs change
- Run just one thread: what happens then?
- Add a branch like if (i < 16) and watch what happens when not all threads in a warp take the same path (this is called divergence; see the sketch below)
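For that last experiment, here's a hedged sketch of what such a divergent kernel could look like (divergenceDemo is a made-up name; the two branches are arbitrary):
// Threads 0-15 of each warp take one branch, threads 16-31 take the other,
// so the warp has to execute both paths one after the other
__global__ void divergenceDemo(int* out, int N) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        if (threadIdx.x % 32 < 16) {
            out[i] = i * 2;   // first half of the warp
        } else {
            out[i] = i * 3;   // second half of the warp
        }
    }
}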
So, behind the scenes, the GPU groups threads into warps, which adds another hidden layer to our launch hierarchy:
Grid → Blocks → Warps → Threads
Next, let's dig into why GPUs do this.
GPU Architectures
We've now seen how to program the GPU, but let's take a step back and understand how GPUs actually work under the hood. Whether it's pixels, matrix multiplications, or signal processing, the GPU's main bet is that the same instruction will be performed on many data elements at once. To achieve this efficiently, GPUs use execution units specialized for exactly this type of operation.
Understanding SIMD Architecture
At the core of GPU hardware stands the SIMD unit: Single Instruction, Multiple Data. A SIMD unit is essentially an array of cores controlled by a single instruction pointer.
SIMD Unit
It is important to understand that the same instruction is shared by all the cores in the SIMD unit, because it means that all the threads mapped to these cores will always execute the same instruction. This is different from a traditional multi-core CPU, where each core has its own instruction stream and can execute different instructions independently.
Multi Core CPU vs SIMD Unit
Remember, the GPU's design is based on the assumption that the same instruction will be executed across many data elements. So, instead of managing each thread individually (which would get expensive in terms of performance), threads are packed into groups of N, where N is the number of cores in a SIMD unit (also referred to as the vector length).
Now, in practice, SIMD units don't have just 4 cores, as in my drawings; they typically have 32 cores (*). This means that threads are grouped into chunks of 32. NVIDIA calls this set of 32 threads a warp.
Warps
A warp is a set of 32 threads that run the same instruction at the same time. These warps are executed on the GPUâs SIMD units.
Warps, explained
But, as you may have noticed, as a programmer you don't work directly with warps when writing CUDA code; the hardware and driver handle that for you. The work (threads) you launch on the GPU is automatically grouped into warps behind the scenes. However, understanding how warps work is crucial if you want your code to run efficiently. For example:
- Warp divergence (when threads in the same warp follow different execution paths) can lead to major performance issues
- Intra-warp instructions (like shuffle and ballot) allow threads within a warp to share data efficiently (see the quick sketch below)
We will discuss more about these concepts in future posts.
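Just to give you a taste before then, here's a hedged sketch of a warp-level sum using __shfl_down_sync (warpSumDemo is a made-up example, assuming exactly one warp of 32 threads is launched):
__global__ void warpSumDemo(const int* in, int* out) {
    int val = in[threadIdx.x];   // each of the 32 threads loads one value
    // At each step, every thread adds the value held by the lane `offset` positions away
    for (int offset = 16; offset > 0; offset /= 2) {
        val += __shfl_down_sync(0xffffffff, val, offset);   // 0xffffffff = all 32 lanes participate
    }
    if (threadIdx.x == 0) {
        *out = val;   // lane 0 now holds the sum of all 32 inputs
    }
}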
To summarize:
- A SIMD unit contains multiple cores sharing one instruction pointer
- Threads are grouped into warps of 32, and all 32 threads in a warp run in lockstep
- This simplifies hardware design: managing groups of 32 threads is easier than managing thousands individually
- It makes instruction dispatch and fetch cheaper and easier to optimize: one instruction per warp instead of one per thread
Recap on launch hierarchy
We've seen that on the software side, we launch threads in blocks, and multiple blocks form a grid.
But a thread block's threads are actually split into warps by the driver and the hardware.
The hierarchy is:
Grid → Block → Warp → Thread
You might wonder why we even need the concept of a thread block. The answer lies in the hardware itself: SIMD units play a key role in how threads are grouped and executed.
NVIDIA Streaming Multiprocessor
In practice, real-world SIMD units are far more complex than our simple conceptual model and come with many more capabilities. A well-known example is NVIDIA's Fermi Streaming Multiprocessor (2010):
NVIDIA Fermi's Streaming Multiprocessor
Some important characteristics of the Streaming Multiprocessor (SM):
- It can keep up to 48 warps in flight. These are warps that have been scheduled to the SM and are ready to execute. The dispatch units choose which warps to execute each cycle
- Warps execute on one of the 4 execution blocks; each execution block is specialized for certain instruction types. You can think of an execution block as a SIMD unit, where multiple cores execute the same instruction together
- Having 4 execution blocks means that 4 warps can execute at the same time on an SM
- Each SM also comes with a Shared Memory block: a small, on-chip memory that enables fast communication between threads in a thread block. This is exactly why we have the software concept of a thread block: an SM can't keep an infinite number of threads in flight at once, and we want to maximize the use of this very fast memory. For that reason, GPU architectures tie Shared Memory to a thread block (and also impose a maximum size limit).
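To make the shared memory idea a bit more concrete, here's a hedged sketch (sharedMemoryDemo and the sizes are made up; it assumes a launch with 256 threads per block and N a multiple of the block size):
__global__ void sharedMemoryDemo(const int* in, int* out, int N) {
    __shared__ int tile[256];                  // on-chip memory, visible to all threads of the block
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < N) {
        tile[threadIdx.x] = in[i];             // each thread stages one element in shared memory
    }
    __syncthreads();                           // wait until the whole block has finished loading
    if (i < N) {
        int next = (threadIdx.x + 1) % blockDim.x;
        out[i] = tile[threadIdx.x] + tile[next];   // read a neighbour from fast shared memory
    }
}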
Of course, things are much more complex than that; I just wanted to give you a feel for what a real SIMD unit looks like. In future articles we'll dig into how warps compete for SM resources and how code that ignores hardware capabilities can degrade performance.
One more thing to mention: GPUs don't have just one SM; they have many. For example, Fermi (2010, so pretty old) had 16 SMs. If we calculate the maximum number of threads potentially in flight at the same time, we get: 16 SMs × 48 warps × 32 threads = 24,576 threads potentially in flight.
That's a lot. However, on newer architectures (like Hopper, 2022), we have up to 144 SMs, and each SM can hold up to 64 warps in flight: 144 SMs × 64 warps × 32 threads = 294,912 threads potentially in flight.
So, an order of magnitude bigger.
NVIDIA Fermi Architecture
That being said, we've covered a lot! We explored how to use CUDA to run code on the GPU, learned about CUDA's thread hierarchy and how threads are launched, looked at profiling and debugging techniques, and touched on some of the theory behind GPU architectures: from SIMD units and warps to NVIDIA's Streaming Multiprocessors.
In the next articles, we'll dive deeper into techniques for writing high-performance GPU code while keeping hardware capabilities in mind. Meanwhile, I'll share a list of useful resources to help you continue learning. Thank you for reading!
Resources
| Resource | Description |
|---|---|
| ETH Zurich GPU Architecture & Programming (YouTube) | Excellent lecture series on GPU hardware and programming. The first two lessons cover SIMD units and a CUDA intro; highly recommended! |
| NVIDIA Tesla Whitepaper | Release paper for the Tesla architecture, the first CUDA-enabled GPU with Streaming Multiprocessors. Great historical insight. |
| CUDA C++ Programming Guide | Official NVIDIA documentation for CUDA C++, the go-to reference for any CUDA developer. |
| Zen vs CUDA Cores (Computer Enhance) | A comparison between AMD Zen4 CPU cores and NVIDIA Ada CUDA cores, showing architectural differences. |
| NVIDIA Donut Samples | NVIDIA Donut is a graphics API abstraction framework that supports DirectX, Vulkan, and OpenGL. If you want to experiment with compute using graphics APIs, this is a great starting point. Keep in mind, though, that the setup differs from CUDA: you can't just mix CPU and GPU code in the same file. Instead, you'll need to separate them, compile GPU code with the right compiler, and then use dispatch functions to launch work on the GPU. |
| NVIDIA CUDA Samples | Probably the best repository for learning CUDA. Itâs packed with examples, covering everything from the basics to advanced features. Highly recommended if you want hands-on learning. |