#Linkedin

Shared Memory
Clearly, one of the most important concepts in GPU programming and architecture.

🧠 The main idea? Pretty straightforward: accessing main memory is extremely slow compared to how fast GPU SIMD units can execute instructions - we're talking about an order-of-magnitude gap. For example, the NVIDIA A100 can move data from memory to the cores at up to 2 TB/s, while its cores can execute up to 19.5 TFLOPS FP32 or 312 TFLOPS FP16!
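Back-of-the-envelope, using the FP32 numbers above: 2 TB/s is roughly 0.5 trillion 4-byte floats per second, while the cores can execute 19.5 trillion FP32 operations per second. Every float fetched from main memory therefore has to feed roughly 40 operations just to keep the SIMD units busy - any less and the cores sit idle, waiting on memory.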

👉 To overcome this bottleneck, GPU vendors designed a hierarchy of on-chip memories to exploit data reuse: the L2 cache, the L1 cache, and most importantly a small (48–164 KB), very fast, on-chip memory inside the Streaming Multiprocessor (SM) called Shared Memory. 🗒️ AMD calls it LDS (Local Data Share) - same concept, different name.

🌟 Why is Shared Memory so important?

1️⃣ Programmers are allowed to use it in code
You can explicitly tell the compiler: "Allocate this chunk in shared memory." In CUDA, just put the __shared__ qualifier in front of the variable you want to place in shared memory.

This memory is:
• Statically allocated
• Its size is known at compile time
• The size info is packed by the compiler into a metadata block that is sent to the GPU along with the kernel
• The SM uses this information to decide whether it can schedule a thread block, based on its available shared memory

Once a thread block is scheduled on an SM, it occupies the required amount of shared memory until it finishes execution.

Example: __shared__ int s_mem[256];
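In context, a (hypothetical) kernel might look like the sketch below. Note that the array size has to be a compile-time constant - that is exactly what lets the compiler pack it into the kernel's metadata:

#define TILE_SIZE 256 // must be known at compile time

// Assumes a launch with TILE_SIZE threads per block and an input of gridDim.x * TILE_SIZE elements.
__global__ void copyThroughShared(const float* in, float* out)
{
    // Statically allocated: 256 * 4 bytes = 1 KB of shared memory per thread block.
    __shared__ float s_tile[TILE_SIZE];

    int idx = blockIdx.x * TILE_SIZE + threadIdx.x;
    s_tile[threadIdx.x] = in[idx];  // stage data coming from slow global memory
    out[idx] = s_tile[threadIdx.x]; // ...and read it back from fast shared memory
}

Each thread only touches its own element here, so no barrier is needed yet (that is point 3️⃣), and every thread block gets its own private 1 KB copy of s_tile.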

2️⃣ Shared across the thread block
Every thread in a thread block has access to the same shared memory space. This allows fast communication and synchronization between threads - something GPU programmers always want. It also means that all the threads of a thread block have to execute on the same SM.

⚠️ Whether it's 32, 256, or 1024 threads - they all share the same block of shared memory. 👉 Finding the right balance between threads per block and shared memory usage is crucial for performance.
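To make that balance concrete (illustrative numbers, not a rule): on an SM with 164 KB of shared memory, blocks that each request 32 KB can have at most 5 blocks resident at once, while blocks that request 8 KB could have up to 20 - subject, of course, to the SM's other limits on threads, warps, and registers. Shared memory per block directly caps how many blocks an SM can keep in flight.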

3️⃣ You must manually synchronize
Access must be explicitly synchronized. Warps within a block execute independently, so one warp may try to read a shared location before another warp has written it - it's on the programmer to prevent such races. CUDA provides the __syncthreads() barrier for this purpose (HLSL's equivalent is GroupMemoryBarrierWithGroupSync()).
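Points 2️⃣ and 3️⃣ together in one classic pattern: a block-wide sum in shared memory. A minimal sketch, assuming 256 threads per block and a hypothetical blockSums output buffer with one slot per block:

__global__ void blockSum(const float* in, float* blockSums)
{
    __shared__ float s_data[256];

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    s_data[threadIdx.x] = in[idx];
    __syncthreads(); // the whole tile must be loaded before the reduction starts

    // Tree reduction: each step halves the number of active threads.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2)
    {
        if (threadIdx.x < stride)
            s_data[threadIdx.x] += s_data[threadIdx.x + stride];
        __syncthreads(); // threads are about to read values written by OTHER threads
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = s_data[0];
}

Drop either __syncthreads() and the result can silently go wrong: one warp races ahead and reads data another warp has not written yet.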

4️⃣ Atomic operations
Shared memory supports fast atomic operations at thread-block scope, leading to the concept of privatization: each block accumulates into its own private copy in shared memory and merges it into the global result only at the end.
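Here is what privatization looks like for a simple 256-bin byte histogram - a sketch, not a tuned implementation:

__global__ void histogram256(const unsigned char* data, int n, unsigned int* globalHist)
{
    // Private per-block histogram lives in shared memory.
    __shared__ unsigned int s_hist[256];

    // Cooperatively zero the private copy.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        s_hist[i] = 0;
    __syncthreads();

    // Cheap shared-memory atomics instead of heavily contended global atomics.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
        atomicAdd(&s_hist[data[i]], 1u);
    __syncthreads();

    // One global atomic per bin per block to merge the private copy into the final result.
    for (int i = threadIdx.x; i < 256; i += blockDim.x)
        atomicAdd(&globalHist[i], s_hist[i]);
}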

5️⃣ Huge performance benefits
Used correctly, it unlocks powerful optimizations:
• Tiling techniques for image filtering or matrix multiplication (a quick taste below)
• Histograms
• Sorting algorithms
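As a quick taste of tiling, here is a bare-bones sketch of a tiled matrix multiply. It assumes square N×N matrices with N a multiple of the tile width and a 16×16 thread block; real implementations add bounds checks and much more tuning:

#define TILE 16

__global__ void matmulTiled(const float* A, const float* B, float* C, int N)
{
    __shared__ float s_A[TILE][TILE];
    __shared__ float s_B[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < N / TILE; ++t)
    {
        // Each thread loads one element of the A tile and one element of the B tile.
        s_A[threadIdx.y][threadIdx.x] = A[row * N + t * TILE + threadIdx.x];
        s_B[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        // Every element loaded above is reused TILE times from fast shared memory.
        for (int k = 0; k < TILE; ++k)
            acc += s_A[threadIdx.y][k] * s_B[k][threadIdx.x];
        __syncthreads();
    }

    C[row * N + col] = acc;
}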

🔥 Coming up: We'll see how to practically use shared memory for the most common use cases.

📱 Don't forget that you can also find my posts on Instagram -> https://lnkd.in/dbKdgpE8

#GPU #GPUProgramming #GPUArchitecture #ParallelComputing #CUDA #NVIDIA #AMD #ComputerArchitecture #GPUDemystified

#Instagram

Shared Memory
Clearly, one of the most important concepts in GPU programming and architecture.

🧠 The main idea? Pretty straightforward: accessing main memory is extremely slow compared to how fast GPU SIMD units can execute instructions. For example, the NVIDIA A100 can move data from memory to the cores at up to 2 TB/s, while its cores can execute up to 19.5 TFLOPS FP32 or 312 TFLOPS FP16!

👉 To overcome this bottleneck, GPU vendors designed a hierarchy of on-chip memories: the L2 cache, the L1 cache, and most importantly a small (48–164 KB), very fast, on-chip memory inside the SM called Shared Memory.

🌟 Why is Shared Memory so important?

1️⃣ Programmers are allowed to use it in code
Just put the __shared__ qualifier in front of the variable you want to place in shared memory.

This memory is:
• Statically allocated
• Its size is known at compile time
• The size info is packed by the compiler into a metadata block that is sent to the GPU along with the kernel
• The SM uses this information to decide whether it can schedule a thread block, based on its available shared memory

Once a thread block is scheduled on an SM, it occupies the required amount of shared memory until it finishes execution.

2️⃣ Shared across the thread block
Every thread in a thread block has access to the same shared memory space. This allows fast communication and synchronization between threads. It also means that all the threads of a thread block have to execute on the same SM.

⚠️ Whether it's 32, 256, or 1024 threads - they all share the same block of shared memory. 👉 Finding the right balance between threads per block and shared memory usage is crucial for performance.

3️⃣ You must manually synchronize
Access must be explicitly synchronized. Warps within a block execute independently, so it's on the programmer to prevent races on shared data. CUDA provides the __syncthreads() barrier for this purpose (HLSL's equivalent is GroupMemoryBarrierWithGroupSync()).

4️⃣ Atomic operations
Shared memory supports fast atomic operations at thread-block scope, leading to the concept of privatization.

5️⃣ Huge performance benefits
Used correctly, it unlocks powerful optimizations:
• Tiling
• Histograms
• Sorting

👉 Follow for more GPU insights! #GPU #GPUProgramming #CUDA #NVIDIA #AMD