Linkedin

π˜½π™‘π™ͺ𝙧 π™›π™žπ™‘π™©π™šπ™§ β€” one of the most widely used techniques in image processing β€” applied in visual effects, downsampling, smoothing, motion blur, privacy & obfuscation, and more. There are different types of blur filters such as box blur, gaussian blur, and radial blur β€” each with its own advantages and trade-offs, and various strategies for GPU implementation like single-pass, hardware downsampling, and 2-tap passes.

There are plenty of articles online covering various blur techniques, so I won’t go into that here. In this post, I’ll focus on one of the simplest forms: the box blur. We’ll implement it in two ways β€” a basic version without shared memory, and an optimized version that leverages the GPU’s shared memory for better performance β€” and compare the results.

πŸ‘‰ Box Blur
The idea is simple: 𝙛𝙀𝙧 π™šπ™–π™˜π™ π™₯π™žπ™­π™šπ™‘, π™œπ™žπ™«π™šπ™£ 𝙖 π™§π™–π™™π™žπ™ͺ𝙨, π™¬π™š π™§π™šπ™–π™™ π™žπ™©π™¨ π™£π™šπ™žπ™œπ™π™—π™€π™§π™¨, π™–π™«π™šπ™§π™–π™œπ™š π™©π™π™šπ™žπ™§ π™˜π™€π™‘π™€π™§ 𝙫𝙖𝙑π™ͺπ™šπ™¨, 𝙖𝙣𝙙 π™¬π™§π™žπ™©π™š π™©π™π™š π™§π™šπ™¨π™ͺ𝙑𝙩. The larger the radius, the stronger the blur effect.

To apply the blur across the image, we need to launch one thread per pixel. Remember, CUDA doesn’t simply launch a flat number of threadsβ€”you define a grid of thread blocks and specify how many threads each block contains.

In this example, we use a 2D π™—π™‘π™€π™˜π™ π™Žπ™žπ™―π™š = 16Γ—16 to match the image’s 2D nature. Then, we compute how many blocks (2D, again) are needed to cover the image: π™œπ™§π™žπ™™π™Žπ™žπ™―π™š = π™žπ™’π™–π™œπ™šπ™Žπ™žπ™―π™š / π™—π™‘π™€π™˜π™ π™Žπ™žπ™―π™š, rounded up so the image is still fully covered when its dimensions aren’t exact multiples of the block size.
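On the host side, that launch might look roughly like this (a minimal sketch; the kernel and buffer names are placeholders, not the code from the post):

dim3 blockSize(16, 16);
dim3 gridSize((width  + blockSize.x - 1) / blockSize.x,   // round up so the last partial block is launched
              (height + blockSize.y - 1) / blockSize.y);
boxBlurKernel<<<gridSize, blockSize>>>(d_input, d_output, width, height, radius);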

Inside the kernel, each thread calculates its corresponding pixel position from the block ID, block dimensions, and thread ID within the block, and applies the blur logic. That’s all you need for a working blur.
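A minimal sketch of such a kernel, assuming a grayscale image (one byte per pixel) to keep it short; the border clamping and the exact signature are my own choices for illustration, not necessarily the code that will land on GitHub:

__global__ void boxBlurKernel(const unsigned char* in, unsigned char* out,
                              int width, int height, int radius)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;            // threads outside the image do nothing

    int sum = 0, count = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = min(max(x + dx, 0), width  - 1); // clamp reads to the image border
            int ny = min(max(y + dy, 0), height - 1);
            sum += in[ny * width + nx];               // every read goes to global memory
            ++count;
        }
    }
    out[y * width + x] = (unsigned char)(sum / count);
}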

🌟 But here’s the catch: each thread reads a (2Β·RADIUS + 1) Γ— (2Β·RADIUS + 1) window of pixels from global memory. This is slow, and across thousands of threads it adds up to significant memory bandwidth usage. πŸ”₯ And here’s the opportunity: neighboring threads within a block read overlapping regions.

Instead of having each thread fetch its own data from global memory, we load a tile of the image (block size + padding) into shared memory. All threads in the block cooperate to read this data, then synchronize, and apply the blur using the shared memory. This significantly reduces global memory traffic and can result in up to a ~33% performance boost, depending on the hardware.
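A sketch of that tiled variant, assuming 16Γ—16 blocks and a compile-time radius so the shared-memory tile can be sized statically (again illustrative, not the exact code that will be on GitHub):

#define RADIUS 4
#define TILE (16 + 2 * RADIUS)                        // block size plus padding on each side

__global__ void boxBlurSharedKernel(const unsigned char* in, unsigned char* out,
                                    int width, int height)
{
    __shared__ unsigned char tile[TILE][TILE];

    // Cooperative load: the padded tile is larger than the block,
    // so each thread may load more than one element.
    for (int ty = threadIdx.y; ty < TILE; ty += blockDim.y) {
        for (int tx = threadIdx.x; tx < TILE; tx += blockDim.x) {
            int gx = min(max((int)(blockIdx.x * blockDim.x) + tx - RADIUS, 0), width  - 1);
            int gy = min(max((int)(blockIdx.y * blockDim.y) + ty - RADIUS, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    }
    __syncthreads();                                  // wait until the whole tile is in shared memory

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;

    int sum = 0;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            sum += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];  // shared-memory reads only

    out[y * width + x] = (unsigned char)(sum / ((2 * RADIUS + 1) * (2 * RADIUS + 1)));
}

Note that the out-of-bounds check comes after __syncthreads(), so every thread in the block reaches the barrier before any of them starts reading the tile.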

As we’ll see, thread block size and blur radius directly affect the performance. Can you guess what would happen if we use 32Γ—32 blocks instead of 16Γ—16?

This is why understanding the GPU architecture matters β€” small changes, like proper use of shared memory, can lead to major performance gains.

The full source code from this post will be available on GitHub.

In the following posts, we’ll revisit the blur filter and explore how to make it even faster using techniques like intra-warp communication and new hardware features such as distributed shared memory.

πŸ“± Don’t forget that you can also find my posts on Instagram -> https://lnkd.in/dbKdgpE8

#GPU #GPUProgramming #GPUArchitecture #ParallelComputing #CUDA #NVIDIA #AMD #ComputerArchitecture #GPUDemystified

Instagram

π˜½π™‘π™ͺ𝙧 π™›π™žπ™‘π™©π™šπ™§ β€” one of the most widely used techniques in image processing.

There are different types of blur filters such as box blur, Gaussian blur, and radial blur β€” each with its own advantages and trade-offs, and various strategies to implement them.

In this post, I’ll focus on the box blur. We’ll implement it in two ways β€” a basic version without shared memory, and an optimized version that leverages the GPU’s shared memory for better performance β€” and compare the results.

πŸ‘‰ Box Blur
The idea is simple: for each pixel, given a radius, we read its neighbors, average their color values, and write the result. The larger the radius, the stronger the blur.

To apply the blur across the image, we need to launch one thread per pixel. Remember, CUDA doesn’t simply launch a flat number of threadsβ€”you define a grid of thread blocks and specify how many threads each block contains.

In this example, we use a 2D π™—π™‘π™€π™˜π™ π™Žπ™žπ™―π™š = 16Γ—16. Then, we compute how many blocks (2D, again) are needed to cover the image: π™œπ™§π™žπ™™π™Žπ™žπ™―π™š = π™žπ™’π™–π™œπ™šπ™Žπ™žπ™―π™š / π™—π™‘π™€π™˜π™ π™Žπ™žπ™―π™š (rounded up)

Inside the kernel, each thread calculates its corresponding pixel position from the block ID, block dimensions, and thread ID within the block, and applies the blur logic. That’s all you need for a working blur.

🌟 But here’s the catch: each thread reads a (2Β·RADIUS + 1) Γ— (2Β·RADIUS + 1) window of pixels from global memory. This is slow, and across thousands of threads it adds up to significant memory bandwidth usage. πŸ”₯ And here’s the opportunity: neighboring threads within a block read overlapping regions.

Instead of having each thread fetch its own data from global memory, we load a tile of the image (block size + padding) into shared memory. All threads in the block cooperate to read this data, then synchronize, and apply the blur using the shared memory. This significantly reduces global memory traffic and can result in up to a ~33% performance boost, depending on the hardware.

Thread block size and blur radius directly affect the performance. Can you guess what would happen if we use 32Γ—32 blocks instead of 16Γ—16?

Source code -> GitHub.

πŸ‘‰ Follow for more GPU insights! #gpu #gpuprogramming #cuda #nvidia