Shared memory: clearly one of the most important concepts in GPU programming and architecture.

🧠 The main idea?
Pretty straightforward: accessing main memory is extremely slow compared to how fast GPU SIMD units can execute instructions - we're talking about more than an order of magnitude.
For example, the NVIDIA A100 can stream up to 2 TB/s of data from memory to its cores, while those cores can execute up to 19.5 TFLOPS FP32 or 312 TFLOPS FP16 (Tensor Cores)!

👉 To overcome this bottleneck, GPU vendors designed a hierarchy of caches to exploit data reuse:
L2, L1, and most importantly, a small (48–164 KB), very fast, on-chip memory inside the Streaming Multiprocessor (SM), called Shared Memory.
🗒️ AMD calls it LDS (Local Data Share) - same concept, different name.

🌟 Why is Shared Memory so important?

  1. Programmers are allowed to use it in code
    You can explicitly tell the compiler: "Allocate this chunk in shared memory." Just use the
    __shared__
    keyword in front of the variable you want to place in shared memory.

This memory is reserved per thread block: once a thread block is scheduled on an SM, it occupies the required amount of shared memory until it finishes execution.

Example:

__shared__ int s_mem[256];
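A slightly fuller sketch (kernel names are illustrative, not from the original post): CUDA actually offers two ways to place data in shared memory, a static declaration like the one above, and a dynamic one whose size is chosen at launch time.

```cuda
// Hedged sketch: static vs. dynamic shared memory allocation.
__global__ void static_smem_kernel(void)
{
    __shared__ int s_mem[256];          // size fixed at compile time
    s_mem[threadIdx.x] = threadIdx.x;   // each thread writes its own slot
}

__global__ void dynamic_smem_kernel(void)
{
    extern __shared__ int s_dyn[];      // size supplied at launch time
    s_dyn[threadIdx.x] = threadIdx.x;
}

// Launch side: the third launch parameter is the dynamic size in bytes.
// static_smem_kernel<<<grid, 256>>>();
// dynamic_smem_kernel<<<grid, 256, 256 * sizeof(int)>>>();
```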
  2. Shared across the thread block
    Every thread in a thread block has access to the same shared memory space.
    This allows fast communication and synchronization between threads, something GPU programmers always want.
    That also means all threads of a thread block have to execute on the same SM.

⚠️ Whether it's 32, 256, or 1024 threads, they all share the same block of shared memory.
👉 Finding the right balance between threads per block and shared memory usage is crucial for performance.
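One way to reason about that balance is CUDA's occupancy API. A hedged sketch (the kernel is a placeholder; the 48 KB request is an arbitrary example):

```cuda
#include <cstdio>

// Placeholder kernel that requests dynamic shared memory.
__global__ void my_kernel(float *data)
{
    extern __shared__ float tile[];
    tile[threadIdx.x] = data[threadIdx.x];
}

int main()
{
    int max_blocks = 0;
    // How many blocks of my_kernel can be resident on one SM,
    // given this block size and shared-memory request?
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &max_blocks, my_kernel,
        /*blockSize=*/256,
        /*dynamicSMemSize=*/48 * 1024);
    printf("resident blocks per SM: %d\n", max_blocks);
    return 0;
}
```

A larger shared-memory request per block lowers the number of resident blocks per SM, and with it, occupancy.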

  3. You must manually synchronize
    Access must be explicitly synchronized. Warps might compete for shared memory access, so it's on the programmer to manage that. CUDA provides the
    __syncthreads()
    function for this purpose (HLSL's equivalent is GroupMemoryBarrierWithGroupSync()).
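A minimal sketch of why the barrier matters (kernel name is illustrative; it assumes the grid exactly covers the input): the block stages data in shared memory, and no thread may read another thread's slot until every write has landed.

```cuda
// Hedged sketch: block-wide reversal through shared memory.
// Without __syncthreads(), a thread could read a slot that a
// neighboring thread has not written yet.
__global__ void reverse_block(const int *in, int *out)
{
    __shared__ int s_mem[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    s_mem[tid] = in[gid];                    // every thread writes its own slot

    __syncthreads();                         // barrier: all writes now visible

    out[gid] = s_mem[blockDim.x - 1 - tid];  // safe to read another thread's slot
}
```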
  4. Atomic operations
    Shared memory supports fast atomic operations at the thread block level, which enables the optimization known as privatization.
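A hedged sketch of privatization, using the classic histogram example (names and bin count are illustrative): each block accumulates into a private shared-memory copy with cheap block-level atomics, then merges into the global result once.

```cuda
#define NUM_BINS 256

// Each block builds a private histogram in shared memory, so atomic
// contention stays inside the block; global atomics happen only once
// per bin per block.
__global__ void histogram(const unsigned char *data, int n,
                          unsigned int *global_hist)
{
    __shared__ unsigned int local_hist[NUM_BINS];

    // Cooperatively zero the private histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        local_hist[i] = 0;
    __syncthreads();

    // Accumulate with fast shared-memory atomics.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        atomicAdd(&local_hist[data[i]], 1u);
    __syncthreads();

    // Merge the private copy into the global histogram.
    for (int i = threadIdx.x; i < NUM_BINS; i += blockDim.x)
        atomicAdd(&global_hist[i], local_hist[i]);
}
```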
  5. Huge performance benefits
    Used correctly, it unlocks powerful optimizations: