Threads are launched in what's called a 𝙜𝙧𝙞𝙙, more specifically a grid of thread blocks. In other words: you can't simply say, "launch 1 million threads that execute this kernel." You need to break them into blocks and specify how many blocks to launch. This gives you the grid dimensions.
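
A minimal sketch of what that looks like in practice (the kernel name scale and the block size of 256 are illustrative choices, not anything prescribed by CUDA):

#include <cuda_runtime.h>

// Hypothetical element-wise kernel, used only to illustrate the launch.
__global__ void scale(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // this thread's global index
    if (i < n)              // the grid usually overshoots n, so guard the tail
        data[i] *= 2.0f;
}

int main()
{
    const int n = 1 << 20;                 // ~1 million elements
    float *d_data = nullptr;
    cudaMalloc(&d_data, n * sizeof(float));

    // You never ask for "n threads" directly: pick a block size,
    // then derive how many blocks are needed to cover n.
    const int threadsPerBlock = 256;
    const int blocksPerGrid   = (n + threadsPerBlock - 1) / threadsPerBlock;

    scale<<<blocksPerGrid, threadsPerBlock>>>(d_data, n);
    cudaDeviceSynchronize();

    cudaFree(d_data);
    return 0;
}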

👉 A good mental model to visualize the structure is:
𝙂𝙧𝙞𝙙 → 𝘽𝙡𝙤𝙘𝙠 → 𝙒𝙖𝙧𝙥 → 𝙏𝙝𝙧𝙚𝙖𝙙

CUDA gives us access to this entire hierarchy directly in code through built-in variables like 𝙗𝙡𝙤𝙘𝙠𝙄𝙙𝙭, 𝙩𝙝𝙧𝙚𝙖𝙙𝙄𝙙𝙭, 𝙗𝙡𝙤𝙘𝙠𝘿𝙞𝙢, and 𝙜𝙧𝙞𝙙𝘿𝙞𝙢. These let us figure out where each thread sits in the larger structure, across blocks and across threads within a block, and write code that maps work accordingly.
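
For instance, here is a sketch of a kernel that uses all four built-ins; the saxpy name and the grid-stride pattern are just one common way to write it:

// Each thread computes its global index, then strides by the total
// number of threads in the grid, so any grid size can cover any n.
__global__ void saxpy(float a, const float *x, float *y, int n)
{
    int idx    = blockIdx.x * blockDim.x + threadIdx.x;  // position in the whole grid
    int stride = gridDim.x  * blockDim.x;                // total threads in the grid

    for (int i = idx; i < n; i += stride)
        y[i] = a * x[i] + y[i];
}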

🌟 The concept of a 𝙩𝙝𝙧𝙚𝙖𝙙 𝙗𝙡𝙤𝙘𝙠 isn't just an artificial abstraction; it's closely tied to the hardware architecture and to software requirements. For instance, one thing we as programmers often want is to communicate efficiently between threads. To support this, GPU SIMD units include a small, on-chip memory area called 𝙂𝙧𝙤𝙪𝙥 𝙎𝙝𝙖𝙧𝙚𝙙 𝙈𝙚𝙢𝙤𝙧𝙮.

This memory lives directly inside the SIMD unit and is exclusive to it, meaning only threads running on the same SIMD unit can access and share this memory efficiently. That's exactly why thread blocks exist: a block is a group of threads that are scheduled to run on the same SIMD unit, allowing them to take full advantage of this shared memory.
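
As a sketch of the kind of cooperation this enables, here is a per-block partial sum built on shared memory. It assumes the kernel is launched with 256 threads per block (a power of two); that number is an illustrative choice, not a CUDA requirement:

// Each block loads its slice of the input into shared memory, then its
// threads cooperate on a tree reduction. Only threads of the same block
// can see `tile`, which is exactly what thread blocks are for.
__global__ void blockSum(const float *in, float *blockSums, int n)
{
    __shared__ float tile[256];          // one slot per thread in the block

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + threadIdx.x;
    tile[tid] = (i < n) ? in[i] : 0.0f;  // pad the tail with zeros
    __syncthreads();                     // wait until the whole block has loaded

    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s)
            tile[tid] += tile[tid + s];  // halve the active threads each step
        __syncthreads();
    }

    if (tid == 0)
        blockSums[blockIdx.x] = tile[0]; // one partial sum per block
}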

โš ๏ธ Hardware constraints: SIMD units can only handle a limited number of active warps (Fermi 24, AMD GCN 40, Newer NVIDIA architectures: up to 48). Because of this, the number of threads in a block is limited (typically up to 1024, or 32 warps)โ€”choosing the right size becomes a crucial performance decision.

Examples:

To remember: