In CUDA, a thread block is the basic building unit of execution. You always launch your kernels as a collection of thread blocks.
Remember: Grid → Block → Warp → Thread
👉 A thread block is a group of threads that:
• Execute together on the same Streaming Multiprocessor (SM)
• Share fast on-chip shared memory
• Can synchronize with each other during execution
Although it feels like a software concept, it's deeply tied to the hardware: shared memory lives inside the SM, and its limits define how many blocks can execute on an SM.
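A minimal sketch of those three properties in action, a per-block sum reduction (the kernel name, array names, and the 256-thread size are my own illustrative choices, not from the post):

```cuda
#include <cstdio>

// Each block sums its own 256-element slice of `in` into one value of `out`,
// using shared memory and a block-wide barrier.
__global__ void blockSum(const float* in, float* out) {
    __shared__ float tile[256];               // fast on-chip memory, visible to the whole block
    int t = threadIdx.x;
    tile[t] = in[blockIdx.x * blockDim.x + t];
    __syncthreads();                          // every thread in the block waits here

    // Tree reduction within the block
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride) tile[t] += tile[t + stride];
        __syncthreads();                      // barrier after each reduction step
    }
    if (t == 0) out[blockIdx.x] = tile[0];    // one result per block
}
```

Note that __syncthreads() only synchronizes threads within one block; blocks cannot synchronize with each other this way.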
👉 When launching a kernel, you choose the block size: how many threads it contains and in what shape, 1D, 2D, or 3D (to easily map your data layout).
dim3 blockDim(8, 8, 8);
The total number of threads = x × y × z and cannot exceed 1024 threads per block on modern GPUs.
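A hedged launch sketch using that 8×8×8 block over a 3D volume (the kernel name and the nx/ny/nz extents are placeholders for illustration):

```cuda
// Illustrative launch configuration; myKernel and nx/ny/nz are assumptions.
dim3 block(8, 8, 8);        // 8 * 8 * 8 = 512 threads per block (within the 1024 limit)
dim3 grid((nx + 7) / 8,     // round up so the grid covers the whole
          (ny + 7) / 8,     // nx * ny * nz volume even when the extents
          (nz + 7) / 8);    // are not multiples of 8
myKernel<<<grid, block>>>(/* args */);
```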
🔥 But here's the catch: block size affects SM occupancy, i.e. how many warps (groups of 32 threads) can be resident on the SM at once.
• Too big (1024 threads) → fewer blocks fit per SM → lower occupancy
• Too small (32 threads) → you hit the SM's block limit before using all its available warps → lower occupancy
🧩 Finding the sweet spot is like solving a puzzle: you need to balance threads per block, number of blocks per SM, and resource usage per block to get the best performance.
💡 A good starting point: 128–256 threads per block, then fine-tune with profiling.
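Alongside profiling, the CUDA runtime can also suggest an occupancy-maximizing block size for a given kernel. A sketch (myKernel stands in for your own __global__ function):

```cuda
#include <cstdio>

// Ask the runtime for the block size that maximizes occupancy of myKernel,
// given its register and shared-memory usage on the current device.
int minGridSize = 0;   // smallest grid that can still fill the device
int blockSize   = 0;   // suggested threads per block
cudaOccupancyMaxPotentialBlockSize(&minGridSize, &blockSize, myKernel, 0, 0);
printf("suggested block size: %d threads\n", blockSize);
```

This gives a good initial value to feed into your launch configuration, which you can then confirm or adjust with a profiler.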
📱 Don't forget that you can also find my posts on Instagram -> https://lnkd.in/dbKdgpE8
#GPU #GPUProgramming #GPUArchitecture #CUDA #NVIDIA #AMD