In CUDA, the thread block is the basic unit of execution: you always launch a kernel as a grid of thread blocks.

Remember:
π™‚π™§π™žπ™™ β†’ π˜½π™‘π™€π™˜π™  β†’ 𝙒𝙖𝙧π™₯ β†’ π™π™π™§π™šπ™–π™™

🌟 A thread block is a group of threads that:

β€’ run together on the same Streaming Multiprocessor (SM),
β€’ can share data through fast on-chip shared memory,
β€’ can synchronize with one another (e.g., via __syncthreads()).

Although it feels like a software concept, a thread block is deeply tied to the hardware: shared memory lives inside the SM, and the SM's resource limits determine how many blocks can be resident on an SM at once.
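
That hardware tie is easiest to see with shared memory plus block-wide synchronization. A hedged sketch of a per-block sum, assuming each block has exactly 256 threads (a power of two) and the input length is a multiple of the block size:

__global__ void blockSum(const float *in, float *out) {
    __shared__ float tile[256];                 // allocated in the SM's on-chip memory
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * blockDim.x + t];  // each thread loads one element
    __syncthreads();                            // every thread in the block waits here

    // tree reduction: halve the number of active threads each step
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) tile[t] += tile[t + stride];
        __syncthreads();
    }
    if (t == 0) out[blockIdx.x] = tile[0];      // one partial sum per block
}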

πŸ‘‰ When launching a kernel, you choose the block size: how many threads it contains and in what shape (1D, 2D, or 3D), so the block maps naturally onto your data layout.

π™™π™žπ™’3 π™—π™‘π™€π™˜π™ π˜Ώπ™žπ™’(8, 8, 8);

The total number of threads per block is x Γ— y Γ— z, and it cannot exceed 1024 on modern GPUs (here 8 Γ— 8 Γ— 8 = 512, so this fits).
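
Putting it together, a typical 1D launch looks like this (N, d_data, and the scale kernel from the earlier sketch are illustrative placeholders; d_data is an assumed device allocation):

int N = 1 << 20;                          // example problem size
dim3 block(256);                          // threads per block
dim3 grid((N + block.x - 1) / block.x);   // ceil-divide so every element gets a thread
scale<<<grid, block>>>(d_data, N);        // launch the grid of blocks

The ceil-divide is exactly why the kernel needs its idx < n guard: the last block may contain threads past the end of the data.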

πŸ”₯ But here's the catch: block size affects SM occupancy, i.e., how many warps (groups of 32 threads) can be resident and running on an SM at the same time.

🧩 Finding the sweet spot is like solving a puzzle: you need to balance threads per block, blocks per SM, and per-block resource usage (registers, shared memory) to get the best performance.

πŸ’‘ A good starting point: 128–256 threads per block, then fine-tune with profiling.
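
If you'd rather start from a measurement than a rule of thumb, the CUDA runtime can suggest a block size that maximizes theoretical occupancy for a specific kernel. A minimal sketch, reusing the hypothetical scale kernel from above:

int minGridSize = 0, blockSize = 0;
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, scale, 0, 0);
// blockSize now holds a suggested threads-per-block for this kernel on this GPU.
// Treat it as a starting point: theoretical occupancy is not the same as real
// throughput, so confirm with a profiler such as Nsight Compute.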