Shared Memory. Clearly, one of the most important concepts in GPU programming and architecture.
🧠 The main idea? Pretty straightforward: accessing main memory is extremely slow compared to how fast GPU SIMD units can execute instructions — we're talking about an order of magnitude. For example, the NVIDIA A100 can retrieve up to 2 TB/s of data from memory to cores, while its cores can execute up to 19.5 TFLOPS FP32 or 312 TFLOPS FP16!
To overcome this bottleneck, GPU vendors designed a hierarchy of caches to exploit data reuse: L2, L1, and most importantly, a small (48–164 KB), very fast, on-chip memory inside the Streaming Multiprocessor (SM) — called Shared Memory. AMD calls it LDS (Local Data Share) — same concept, different name.
Why is Shared Memory so important?

1️⃣ Programmers are allowed to use it in code
You can explicitly tell the compiler: "Allocate this chunk in shared memory." Just put the __shared__ qualifier in front of the variable you want to place in shared memory.
This memory is:
• Statically allocated
• The size is known at compile time
• The size info is packed by the compiler into a metadata block that is sent to the GPU along with the kernel
• The SM uses this information to decide whether it can schedule a thread block, based on its available shared memory
Once a thread block is scheduled on an SM, it occupies the required amount of shared memory until it finishes execution.
Example: __shared__ int s_mem[256];
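To make point 1️⃣ concrete, here's a minimal sketch of a statically allocated shared array inside a kernel (the kernel name and the 256-thread block size are my own illustrative assumptions, not from a specific codebase):

__global__ void scaleWithSharedStaging(const int* in, int* out, int factor)
{
    __shared__ int s_mem[256];          // one copy per thread block, on-chip in the SM

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    s_mem[tid] = in[gid];               // stage the value in fast shared memory
    out[gid]  = s_mem[tid] * factor;    // each thread reads only its own slot, so no barrier is needed here
}

The 256-element size is a compile-time constant, so the compiler can pack it into the kernel's metadata and the SM can budget its shared memory per thread block before scheduling.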
2️⃣ Shared across the thread block
Every thread in a thread block has access to the same shared memory space. This allows fast communication and synchronization between threads — something GPU programmers always want. That also means that all the threads of a thread block have to execute on the same SM.
⚠️ Whether it's 32, 256, or 1024 threads — they all share the same block of shared memory. Finding the right balance between threads per block and shared memory usage is crucial for performance.
3️⃣ You must manually synchronize
Access must be explicitly synchronized: warps within a block make progress independently, so it's on the programmer to make sure data written by one thread is ready before another thread reads it. CUDA provides the __syncthreads() barrier for this; the HLSL equivalent is GroupMemoryBarrierWithGroupSync().
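A minimal sketch of why the barrier matters (hypothetical kernel, assuming exactly 256 threads per block and an input sized to match): each thread writes one slot, then reads a slot written by a different thread, so everyone must reach __syncthreads() before any read happens.

__global__ void reverseWithinBlock(const int* in, int* out)
{
    __shared__ int s_mem[256];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + tid;

    s_mem[tid] = in[gid];                    // every thread writes its own slot
    __syncthreads();                         // barrier: all writes become visible to the whole block

    out[gid] = s_mem[blockDim.x - 1 - tid];  // read a slot written by another thread (likely another warp)
}

Without the barrier, a thread could read a slot that another warp hasn't written yet.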
4️⃣ Atomic operations
Shared memory supports fast atomic operations at the thread-block level, which leads to the concept of privatization: each block accumulates partial results in its own shared-memory copy before merging them into global memory.
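Here's what privatization can look like in practice: a sketch of a histogram kernel (my own illustrative example, assuming 256 bins, 8-bit input values, and 256 threads per block). Each block accumulates into its own private histogram in shared memory using fast block-level atomics, and only the final per-block totals are merged into global memory.

__global__ void histogramPrivatized(const unsigned char* data, int n, unsigned int* globalHist)
{
    __shared__ unsigned int localHist[256];   // private, per-block histogram

    int tid = threadIdx.x;

    localHist[tid] = 0;                       // zero the private bins (assumes blockDim.x == 256)
    __syncthreads();

    // Grid-stride loop: each thread processes a strided subset of the input.
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&localHist[data[i]], 1u);   // fast shared-memory atomic, no global contention
    __syncthreads();

    atomicAdd(&globalHist[tid], localHist[tid]);  // one global atomic per bin per block
}

The expensive global atomics drop from one per input element to one per bin per block.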
5️⃣ Huge performance benefits
Used correctly, it unlocks powerful optimizations:
• Tiling techniques for image filtering or matrix multiplication (see the sketch below)
• Histograms
• Sorting algorithms
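As a taste of the tiling idea, here's a simplified tiled matrix multiplication sketch (illustrative only: it assumes square matrices whose dimension N is a multiple of a 16×16 tile and skips bounds checks; real implementations handle edge tiles and tune the tile size). Each tile of A and B is loaded from global memory once and then reused 16 times from shared memory by the 256 threads of the block.

#define TILE 16

__global__ void matmulTiled(const float* A, const float* B, float* C, int N)
{
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t) {
        // Cooperatively load one tile of A and one tile of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Every value loaded above is reused TILE times from fast shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}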
Coming up: We'll see how to practically use shared memory for the most common use cases.
📱 Don't forget that you can also find my posts on Instagram -> https://lnkd.in/dbKdgpE8
#GPU #GPUProgramming #GPUArchitecture #ParallelComputing #CUDA #NVIDIA #AMD #ComputerArchitecture #GPUDemystified