Threads are launched in what's called a 𝙜𝙧𝙞𝙙: more specifically, a grid of thread blocks. In other words, you can't simply say, "launch 1 million threads that execute this kernel." You need to break them into blocks and specify how many blocks to launch. This gives you the grid dimensions.
📌 A good mental model to visualize the structure is:
𝙂𝙧𝙞𝙙 → 𝘽𝙡𝙤𝙘𝙠 → 𝙒𝙖𝙧𝙥 → 𝙏𝙝𝙧𝙚𝙖𝙙
CUDA gives us access to this entire hierarchy directly in code through built-in variables like 𝙗𝙡𝙤𝙘𝙠𝙄𝙙𝙭, 𝙩𝙝𝙧𝙚𝙖𝙙𝙄𝙙𝙭, 𝙗𝙡𝙤𝙘𝙠𝘿𝙞𝙢, and 𝙜𝙧𝙞𝙙𝘿𝙞𝙢. These let us figure out where each thread sits in the larger structure (across blocks, and among threads within a block) and write code that maps work accordingly.
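As a quick sketch (my own illustration, not from the original post; the kernel name, sizes, and the d_data pointer are placeholders), this is how those built-ins are typically combined to give every thread a unique global index, and how the block count, i.e. the grid dimension, is derived at launch time:

__global__ void scale(float *data, int n)
{
    // Which block we are in, times the block width, plus our position inside the block.
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)                      // guard: the grid may cover slightly more than n threads
        data[i] = 2.0f * data[i];
}

// You cannot ask for "1 million threads" directly:
// pick a block size, then derive how many blocks (the grid) you need.
int n = 1 << 20;                    // ~1 million elements
int threadsPerBlock = 256;
int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;
scale<<<blocks, threadsPerBlock>>>(d_data, n);   // d_data: device pointer, assumed already allocated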
📌 The concept of a 𝙩𝙝𝙧𝙚𝙖𝙙 𝙗𝙡𝙤𝙘𝙠 isn't just an artificial abstraction: it's closely tied to the hardware architecture and software requirements. For instance, one thing that we as programmers might want is to be able to communicate efficiently between threads. To support this, GPU SIMD units include a small, on-chip memory area called 𝙜𝙧𝙤𝙪𝙥 𝙨𝙝𝙖𝙧𝙚𝙙 𝙢𝙚𝙢𝙤𝙧𝙮.
This memory lives directly inside the SIMD unit and is exclusive to it, meaning only threads running on the same SIMD unit can access and share this memory efficiently. That's exactly why thread blocks exist: a block is a group of threads that are scheduled to run on the same SIMD unit, allowing them to take full advantage of this shared memory.
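To make that concrete, here is a small sketch of a block cooperating through this memory (my own example; the kernel and buffer names are made up). In CUDA the per-block memory is exposed as shared memory, declared with __shared__, and __syncthreads() is the barrier that keeps the block's threads in step:

#define BLOCK_SIZE 256

__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[BLOCK_SIZE];            // per-block (group) shared memory

    int i = blockIdx.x * blockDim.x + threadIdx.x;
    tile[threadIdx.x] = (i < n) ? in[i] : 0.0f;   // each thread loads one element
    __syncthreads();                              // wait until the whole block has loaded

    // Tree reduction inside the block, entirely in shared memory.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = tile[0];          // one partial sum per block
}

Launch it with a block size of BLOCK_SIZE (a power of two, so the halving loop works), e.g. blockSum<<<blocks, BLOCK_SIZE>>>(d_in, d_blockSums, n).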
⚠️ Hardware constraints: a SIMD unit can only keep a limited number of warps active at once (Fermi: 24, AMD GCN: 40, newer NVIDIA architectures: up to 48). Because of this, the number of threads in a block is limited (typically up to 1024, i.e. 32 warps), and choosing the right block size becomes a crucial performance decision.
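If you'd rather not guess, the CUDA runtime can suggest a block size that maximizes occupancy for a specific kernel; a minimal sketch, reusing the hypothetical scale kernel from above:

int minGridSize = 0, blockSize = 0;
// Ask the runtime for an occupancy-maximizing block size for this kernel.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);

int blocks = (n + blockSize - 1) / blockSize;
scale<<<blocks, blockSize>>>(d_data, n);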
Examples:
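One possible example (my own sketch, since the original examples aren't reproduced here; invert, d_img, width, and height are placeholders): image-style work maps naturally onto a 2D grid of 2D blocks, configured with dim3.

__global__ void invert(unsigned char *img, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row
    if (x < width && y < height)
        img[y * width + x] = 255 - img[y * width + x];
}

dim3 block(16, 16);                                  // 16 x 16 = 256 threads per block
dim3 grid((width  + block.x - 1) / block.x,
          (height + block.y - 1) / block.y);
invert<<<grid, block>>>(d_img, width, height);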
To remember: a kernel launch defines a grid of thread blocks; each block runs on a single SIMD unit, so its threads can cooperate through group shared memory; and because a block is capped (typically 1024 threads, or 32 warps), picking the block size is one of the first performance decisions you make.