#Linkedin

Shared Memory
Clearly, one of the most important concepts in GPU programming and architecture.

🧠 The main idea? Pretty straightforward: accessing main memory is extremely slow compared to how fast GPU SIMD units can execute instructions - we're talking about an order-of-magnitude gap. For example, the NVIDIA A100 can move data from memory to the cores at up to 2 TB/s, while its cores can execute up to 19.5 TFLOPS FP32 or 312 TFLOPS FP16!
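Back-of-the-envelope, using the FP32 numbers above: 2 TB/s is roughly 0.5 trillion 4-byte floats per second, while the cores can execute 19.5 trillion FP32 operations per second. Every float fetched from main memory therefore has to feed roughly 40 operations just to keep the SIMD units busy - any less and the cores sit idle, waiting on memory.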

👉 To overcome this bottleneck, GPU vendors designed a hierarchy of on-chip memories to exploit data reuse: the L2 cache, the L1 cache, and most importantly a small (48–164 KB), very fast, on-chip memory inside the Streaming Multiprocessor (SM) called Shared Memory. 🗒️ AMD calls it LDS (Local Data Share) - same concept, different name.

🌟 Why is Shared Memory so important?

1️⃣ Programmers are allowed to use it in code
You can explicitly tell the compiler: "Allocate this chunk in shared memory." In CUDA, just put the __shared__ qualifier in front of the variable you want to place in shared memory.

This memory is:
• Statically allocated
• Its size is known at compile time
• The size info is packed by the compiler into a metadata block that is sent to the GPU along with the kernel
• The SM uses this information to decide whether it can schedule a thread block, based on its available shared memory

Once a thread block is scheduled on an SM, it occupies the required amount of shared memory until it finishes execution.

Example: __shared__ int s_mem[256];
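In context, a (hypothetical) kernel might look like the sketch below. Note that the array size has to be a compile-time constant - that is exactly what lets the compiler pack it into the kernel's metadata:

#define TILE_SIZE 256 // must be known at compile time

// Assumes a launch with TILE_SIZE threads per block and an input of gridDim.x * TILE_SIZE elements.
__global__ void copyThroughShared(const float* in, float* out)
{
    // Statically allocated: 256 * 4 bytes = 1 KB of shared memory per thread block.
    __shared__ float s_tile[TILE_SIZE];

    int idx = blockIdx.x * TILE_SIZE + threadIdx.x;
    s_tile[threadIdx.x] = in[idx];  // stage data coming from slow global memory
    out[idx] = s_tile[threadIdx.x]; // ...and read it back from fast shared memory
}

Each thread only touches its own element here, so no barrier is needed yet (that is point 3️⃣), and every thread block gets its own private 1 KB copy of s_tile.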

2️⃣ Shared across the thread block
Every thread in a thread block has access to the same shared memory space. This allows fast communication and synchronization between threads - something GPU programmers always want. It also means that all the threads of a thread block have to execute on the same SM.

⚠️ Whether it's 32, 256, or 1024 threads - they all share the same block of shared memory. 👉 Finding the right balance between threads per block and shared memory usage is crucial for performance.
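To make that balance concrete (illustrative numbers, not a rule): on an SM with 164 KB of shared memory, blocks that each request 32 KB can have at most 5 blocks resident at once, while blocks that request 8 KB could have up to 20 - subject, of course, to the SM's other limits on threads, warps, and registers. Shared memory per block directly caps how many blocks an SM can keep in flight.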

3️⃣ You must manually synchronize
Access must be explicitly synchronized. Warps within a block execute independently, so one warp may try to read a shared location before another warp has written it - it's on the programmer to prevent such races. CUDA provides the __syncthreads() barrier for this purpose (HLSL's equivalent is GroupMemoryBarrierWithGroupSync()).
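Points 2️⃣ and 3️⃣ together in one classic pattern: a block-wide sum in shared memory. A minimal sketch, assuming 256 threads per block and a hypothetical blockSums output buffer with one slot per block:

__global__ void blockSum(const float* in, float* blockSums)
{
    __shared__ float s_data[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[threadIdx.x] = in[idx];
    __syncthreads(); // the whole tile must be loaded before the reduction starts

    // Tree reduction: each step halves the number of active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        __syncthreads(); // threads are about to read values written by OTHER threads
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = s_data[0];
}

Drop either __syncthreads() and the result can silently go wrong: one warp races ahead and reads data another warp has not written yet.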

4️⃣ Atomic operations
Shared memory supports fast atomic operations at thread-block scope, leading to the concept of privatization: each block accumulates into its own private copy in shared memory and merges it into the global result only at the end.
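Here is what privatization looks like for a simple 256-bin byte histogram - a sketch, not a tuned implementation:

__global__ void histogram256(const unsigned char* data, int n, unsigned int* globalHist)
{
    // Private per-block histogram lives in shared memory.
    __shared__ unsigned int s_hist[256];

    // Cooperatively zero the private copy.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_hist[i] = 0;
    __syncthreads();

    // Cheap shared-memory atomics instead of heavily contended global atomics.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&s_hist[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block to merge the private copy into the final result.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&globalHist[i], s_hist[i]);
}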

5️⃣ Huge performance benefits
Used correctly, it unlocks powerful optimizations:
• Tiling techniques for image filtering or matrix multiplication (a quick taste below)
• Histograms
• Sorting algorithms
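As a quick taste of tiling, here is a bare-bones sketch of a tiled matrix multiply. It assumes square N×N matrices with N a multiple of the tile width and a 16×16 thread block; real implementations add bounds checks and much more tuning:

#define TILE 16

__global__ void matmulTiled(const float* A, const float* B, float* C, int N)
{
    __shared__ float s_A[TILE][TILE];
    __shared__ float s_B[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t)
    {
        // Each thread loads one element of the A tile and one element of the B tile.
        s_A[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        s_B[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Every element loaded above is reused TILE times from fast shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += s_A[threadIdx.y][k] * s_B[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}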

🔥 Coming up: We'll see how to practically use shared memory for the most common use cases.

📱 Don't forget that you can also find my posts on Instagram -> https://lnkd.in/dbKdgpE8

#GPU #GPUProgramming #GPUArchitecture #ParallelComputing #CUDA #NVIDIA #AMD #ComputerArchitecture #GPUDemystified

#Instagram

Shared Memory
Clearly, one of the most important concepts in GPU programming and architecture.

🧠 The main idea? Pretty straightforward: accessing main memory is extremely slow compared to how fast GPU SIMD units can execute instructions. For example, the NVIDIA A100 can move data from memory to the cores at up to 2 TB/s, while its cores can execute up to 19.5 TFLOPS FP32 or 312 TFLOPS FP16!

👉 To overcome this bottleneck, GPU vendors designed a hierarchy of on-chip memories: the L2 cache, the L1 cache, and most importantly a small (48–164 KB), very fast, on-chip memory inside the SM called Shared Memory.

🌟 Why is Shared Memory so important?

1️⃣ Programmers are allowed to use it in code
Just put the __shared__ qualifier in front of the variable you want to place in shared memory.

This memory is:
• Statically allocated
• Its size is known at compile time
• The size info is packed by the compiler into a metadata block that is sent to the GPU along with the kernel
• The SM uses this information to decide whether it can schedule a thread block, based on its available shared memory

Once a thread block is scheduled on an SM, it occupies the required amount of shared memory until it finishes execution.

2️⃣ Shared across the thread block
Every thread in a thread block has access to the same shared memory space. This allows fast communication and synchronization between threads. It also means that all the threads of a thread block have to execute on the same SM.

⚠️ Whether it's 32, 256, or 1024 threads - they all share the same block of shared memory. 👉 Finding the right balance between threads per block and shared memory usage is crucial for performance.

3️⃣ You must manually synchronize
Access must be explicitly synchronized. Warps within a block execute independently, so it's on the programmer to prevent races on shared data. CUDA provides the __syncthreads() barrier for this purpose (HLSL's equivalent is GroupMemoryBarrierWithGroupSync()).

4️⃣ Atomic operations
Shared memory supports fast atomic operations at thread-block scope, leading to the concept of privatization.

5️⃣ Huge performance benefits
Used correctly, it unlocks powerful optimizations:
• Tiling
• Histograms
• Sorting

👉 Follow for more GPU insights! #GPU #GPUProgramming #CUDA #NVIDIA #AMD