Shared memory: one of the most important concepts in GPU programming and architecture.
🧠 The main idea?
Pretty straightforward: accessing main memory is extremely slow compared to how fast GPU SIMD units can execute instructions; we're talking about an order of magnitude or more.
For example, the NVIDIA A100 can stream up to 2 TB/s from memory to its cores, while those cores can execute up to 19.5 TFLOPS in FP32 or 312 TFLOPS in FP16!
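A quick back-of-the-envelope check (my arithmetic, not a figure from any spec sheet):

2 TB/s ÷ 4 bytes per FP32 value ≈ 0.5 trillion floats delivered per second
19.5 TFLOPS ÷ 0.5 trillion floats/s ≈ 39 FP32 operations per float loaded

So unless each value fetched from memory is reused dozens of times, the compute units sit idle.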
👉 To overcome this bottleneck, GPU vendors designed a hierarchy of caches to exploit data reuse:
L2, L1, and, most importantly, a small (48–164 KB), very fast, on-chip memory inside each Streaming Multiprocessor (SM), called Shared Memory.
🏷️ AMD calls it LDS (Local Data Share): same concept, different name.
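If you want to see the actual numbers on your own card, here's a minimal sketch using the standard CUDA runtime API (cudaGetDeviceProperties; assumes device 0 is your GPU):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device: %s\n", prop.name);
    // Shared memory a single thread block may use by default
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    // Total shared memory available on each SM
    printf("Shared memory per SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}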
👉 Why is Shared Memory so important?
In CUDA, you place a variable in shared memory by simply adding the
__shared__
qualifier in front of its declaration. This memory is allocated per thread block:
Once a thread block is scheduled on an SM, it occupies the required amount of shared memory until it finishes execution.
Example:
__shared__ int s_mem[256]; // statically reserves 256 ints (1 KB) of shared memory per block
⚠️ Whether it's 32, 256, or 1024 threads, they all share the same block of shared memory.
👉 Finding the right balance between threads per block and shared memory usage is crucial for performance: the more shared memory each block needs, the fewer blocks an SM can keep resident at once.
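One knob for tuning that balance is dynamic shared memory: here's a sketch of the standard extern __shared__ idiom, where the per-block allocation is chosen at launch time (the kernel and sizes are illustrative, not from the post):

__global__ void scaleViaShared(float *out, const float *in, float f, int n) {
    extern __shared__ float tile[];   // size chosen at launch, not at compile time
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;
    tile[threadIdx.x] = in[gid];      // each thread stages its own element
    out[gid] = tile[threadIdx.x] * f; // reads only its own slot, so no barrier needed
}

// The third launch parameter is shared memory bytes per block.
// With 256 threads: 256 * sizeof(float) = 1 KB per block. Bigger blocks
// ask for more shared memory, so fewer blocks fit on one SM at a time;
// that's exactly the balance in question.
// scaleViaShared<<<numBlocks, 256, 256 * sizeof(float)>>>(d_out, d_in, 2.0f, n);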
Because all threads in a block read and write the same shared memory, they need a way to coordinate. CUDA provides the
__syncthreads()
barrier function for this purpose.
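To make the barrier concrete, a classic textbook sketch (not from the post): each block reverses its 256-element chunk in shared memory, and __syncthreads() guarantees every thread's write has landed before any thread reads a neighbor's element. Assumes a launch with exactly 256 threads per block and an array length that is a multiple of 256:

__global__ void reverseEachBlock(int *data) {
    __shared__ int s_mem[256];                   // one tile per thread block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    s_mem[t] = data[base + t];                   // every thread writes one slot
    __syncthreads();                             // barrier: the tile is fully written

    data[base + t] = s_mem[blockDim.x - 1 - t];  // now safe to read others' writes
}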