Shared memory: one of the most important concepts in GPU programming and architecture.
🧠 The main idea?
Pretty straightforward: accessing main memory is extremely slow compared to how fast GPU SIMD units can execute instructions; we're talking about an order of magnitude or more.
For example, the NVIDIA A100 can stream up to 2 TB/s from memory to its cores, while those cores can execute up to 19.5 TFLOPS in FP32 or 312 TFLOPS in FP16!
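A quick back-of-the-envelope check (my arithmetic, not a figure from any spec sheet):

2 TB/s ÷ 4 bytes per FP32 value ≈ 0.5 trillion floats delivered per second
19.5 TFLOPS ÷ 0.5 trillion floats/s ≈ 39 FP32 operations per float loaded

So unless each value fetched from memory is reused dozens of times, the compute units sit idle.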
👉 To overcome this bottleneck, GPU vendors designed a hierarchy of caches to exploit data reuse:
L2, L1, and, most importantly, a small (48–164 KB), very fast, on-chip memory inside each Streaming Multiprocessor (SM), called Shared Memory.
🏷️ AMD calls it LDS (Local Data Share): same concept, different name.
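If you want to see the actual numbers on your own card, here's a minimal sketch using the standard CUDA runtime API (cudaGetDeviceProperties; assumes device 0 is your GPU):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // query device 0

    printf("Device: %s\n", prop.name);
    // Shared memory a single thread block may use by default
    printf("Shared memory per block: %zu KB\n", prop.sharedMemPerBlock / 1024);
    // Total shared memory available on each SM
    printf("Shared memory per SM: %zu KB\n", prop.sharedMemPerMultiprocessor / 1024);
    return 0;
}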
👉 Why is Shared Memory so important?
In CUDA, you place a variable in shared memory by simply adding the
__shared__
qualifier in front of its declaration. This memory is allocated per thread block:
Once a thread block is scheduled on an SM, it occupies the required amount of shared memory until it finishes execution.
Example:
__shared__ int s_mem[256]; // statically reserves 256 ints (1 KB) of shared memory per block
⚠️ Whether it's 32, 256, or 1024 threads, they all share the same block of shared memory.
👉 Finding the right balance between threads per block and shared memory usage is crucial for performance: the more shared memory each block needs, the fewer blocks an SM can keep resident at once.
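One knob for tuning that balance is dynamic shared memory: here's a sketch of the standard extern __shared__ idiom, where the per-block allocation is chosen at launch time (the kernel and sizes are illustrative, not from the post):

__global__ void scaleViaShared(float *out, const float *in, float f, int n) {
    extern __shared__ float tile[];   // size chosen at launch, not at compile time
    int gid = blockIdx.x * blockDim.x + threadIdx.x;
    if (gid >= n) return;
    tile[threadIdx.x] = in[gid];      // each thread stages its own element
    out[gid] = tile[threadIdx.x] * f; // reads only its own slot, so no barrier needed
}

// The third launch parameter is shared memory bytes per block.
// With 256 threads: 256 * sizeof(float) = 1 KB per block. Bigger blocks
// ask for more shared memory, so fewer blocks fit on one SM at a time;
// that's exactly the balance in question.
// scaleViaShared<<<numBlocks, 256, 256 * sizeof(float)>>>(d_out, d_in, 2.0f, n);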
Because all threads in a block read and write the same shared memory, they need a way to coordinate. CUDA provides the
__syncthreads()
barrier function for this purpose.
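To make the barrier concrete, a classic textbook sketch (not from the post): each block reverses its 256-element chunk in shared memory, and __syncthreads() guarantees every thread's write has landed before any thread reads a neighbor's element. Assumes a launch with exactly 256 threads per block and an array length that is a multiple of 256:

__global__ void reverseEachBlock(int *data) {
    __shared__ int s_mem[256];                   // one tile per thread block
    int t = threadIdx.x;
    int base = blockIdx.x * blockDim.x;

    s_mem[t] = data[base + t];                   // every thread writes one slot
    __syncthreads();                             // barrier: the tile is fully written

    data[base + t] = s_mem[blockDim.x - 1 - t];  // now safe to read others' writes
}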