In CUDA, a thread block is the basic building unit of execution. You always launch your kernels as a collection of thread blocks.
Remember:
ππ§ππ → π½π‘π€ππ → πππ§π₯ → πππ§πππ
π A thread block is a group of threads that:
• run on the same Streaming Multiprocessor (SM),
• share fast on-chip shared memory,
• and can synchronize with each other (e.g. via __syncthreads()).

Although it feels like a software concept, it's deeply tied to the hardware: shared memory lives inside the SM, and its capacity helps determine how many blocks can execute concurrently on an SM.
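To make those three properties concrete, here is a minimal sketch of a block-wide sum: threads in one block cooperate through shared memory and synchronize between steps (the kernel name and the fixed block size of 256 are illustrative, not from the original post):

```cuda
// Hypothetical kernel: each block sums 256 of its own input elements.
__global__ void blockSum(const float* in, float* out) {
    __shared__ float buf[256];               // lives in the SM's shared memory
    int tid = threadIdx.x;
    buf[tid] = in[blockIdx.x * blockDim.x + tid];
    __syncthreads();                         // wait for every thread in the block

    // Tree reduction: halve the number of active threads each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) buf[tid] += buf[tid + stride];
        __syncthreads();                     // all partial sums ready before next step
    }
    if (tid == 0) out[blockIdx.x] = buf[0];  // one partial sum per block
}
```

Note that __syncthreads() only synchronizes threads within the same block, which is exactly why this pattern works at block scope and not across the whole grid.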
π When launching a kernel, you choose the block size: how many threads it contains and in what shape (1D, 2D, or 3D) so you can map your data layout naturally.
dim3 block(8, 8, 8);
The total number of threads = x × y × z and cannot exceed 1024 threads per block on modern GPUs.
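To see a full launch configuration, here is a typical 2D setup, one thread per pixel of an image (the 1024 × 768 size and the kernel name myKernel are made-up placeholders):

```cuda
// Host-side launch configuration for a hypothetical 1024 x 768 image.
dim3 block(16, 16);                        // 16 x 16 = 256 threads per block
dim3 grid((1024 + block.x - 1) / block.x,  // ceiling division so the grid
          (768  + block.y - 1) / block.y); // covers every pixel
myKernel<<<grid, block>>>(d_image);        // launch: a grid of blocks of threads

// Inside the kernel, each thread recovers its global coordinates:
//   int x = blockIdx.x * blockDim.x + threadIdx.x;
//   int y = blockIdx.y * blockDim.y + threadIdx.y;
// and should bounds-check (x < 1024 && y < 768) because the grid
// may slightly overshoot the image.
```

The ceiling division is the standard trick to guarantee the grid covers the whole input even when the image size is not a multiple of the block size.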
π₯ But here's the catch: block size affects SM occupancy, i.e., how many warps (groups of 32 threads) can be resident on an SM and run in parallel.
π§© Finding the sweet spot is like solving a puzzle: you need to balance threads per block, number of blocks per SM, and resource usage per block (registers, shared memory) to get the best performance.
π‘ A good starting point: 128–256 threads per block, then fine-tune with profiling.
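For that fine-tuning step, the CUDA runtime can even suggest a block size that maximizes theoretical occupancy for a given kernel, a useful first estimate before profiling (sketch; myKernel is a placeholder for your own kernel):

```cuda
int minGridSize = 0, blockSize = 0;
// Ask the runtime which block size maximizes occupancy for myKernel,
// given its register and shared-memory usage.
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
printf("Suggested block size: %d threads\n", blockSize);
```

Treat the suggestion as a starting point, not a verdict: maximum theoretical occupancy does not always mean maximum throughput, so confirm with a profiler such as Nsight Compute.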