π½π‘πͺπ§ πππ‘π©ππ§ β one of the most widely used techniques in image processing β applied in visual effects, downsampling, smoothing, motion blur, privacy and obfuscation, and more. There are different types of blur filters, such as box blur, Gaussian blur, and radial blur, each with its own advantages and trade-offs, and various strategies for GPU implementation like single-pass filtering, hardware downsampling, and repeated 2-tap passes.
There are plenty of articles online covering various blur techniques, so I wonβt go into that here. In this post, Iβll focus on one of the simplest forms: the box blur. Weβll implement it in two ways β a basic version without shared memory, and an optimized version that leverages the GPUβs shared memory for better performance β and compare the results.
π Box Blur The idea is simple: ππ€π§ ππππ π₯ππππ‘, πππ«ππ£ π π§ππππͺπ¨, π¬π π§πππ ππ©π¨ π£ππππππ€π§π¨, ππ«ππ§πππ π©ππππ§ ππ€π‘π€π§ π«ππ‘πͺππ¨, ππ£π π¬π§ππ©π π©ππ π§ππ¨πͺπ‘π©. The larger the radius, the stronger the blur effect.
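The steps above fit in a few lines. Here is a plain CPU reference of that idea (a sketch: the single-channel float image, row-major layout, and function name are my assumptions, not code from this post):

```c
#include <assert.h>

/* Box blur on a single-channel, row-major float image.
   For each pixel, average all in-bounds neighbors within `r`. */
void box_blur(const float *in, float *out, int w, int h, int r)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            int count = 0;
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx) {
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                        sum += in[ny * w + nx];
                        ++count;   /* count only pixels inside the image */
                    }
                }
            out[y * w + x] = sum / count;
        }
}
```

Borders are handled here by averaging only the in-bounds neighbors; clamp-to-edge or wrapping are equally valid policies.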
To apply the blur across the image, we need to launch one thread per pixel. Remember, CUDA doesnβt simply launch a flat number of threadsβyou define a grid of thread blocks and specify how many threads each block contains.
In this example, we use a 2D ππ‘π€ππ πππ―π = 16Γ16 to match the imageβs 2D nature. Then, we compute how many blocks (2D, again) are needed to cover the image, rounding up so edge pixels are still covered when the image size is not a multiple of the block size: ππ§πππππ―π = ππ’ππππππ―π / ππ‘π€ππ πππ―π (rounded up)
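In code, that grid computation is a round-up integer division; the image dimensions below are just example values, not from this post:

```c
#include <assert.h>

/* How many blocks of size `block` are needed to cover `size` pixels.
   Adding (block - 1) before dividing rounds the result up. */
int ceil_div(int size, int block) { return (size + block - 1) / block; }

/* Host-side sketch: a 1920x1080 image with a 16x16 thread block gives a
   grid of ceil_div(1920, 16) x ceil_div(1080, 16) = 120 x 68 blocks.
   In CUDA the launch would use dim3 and kernel<<<grid, block>>>(...). */
```

Because the grid rounds up, some threads in edge blocks fall outside the image; the kernel must bounds-check before writing.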
Inside the kernel, each thread calculates its corresponding pixel position from the block ID, block dimensions, and thread ID within the block, then applies the blur logic. Thatβs all you need for a working blur.
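A sketch of that naive kernel (single-channel float image; the names, parameters, and border policy are my assumptions, not the post's exact code):

```cuda
// Naive box blur: each thread averages its neighborhood straight from
// global memory. Out-of-image threads from the rounded-up grid exit early.
__global__ void boxBlur(const float* in, float* out, int w, int h, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;            // guard: grid may overshoot

    float sum = 0.0f;
    int count = 0;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                sum += in[ny * w + nx];      // global memory read
                ++count;
            }
        }
    out[y * w + x] = sum / count;
}
```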
π But hereβs the catch: each thread reads (2Β·RADIUS + 1)Β² pixels from global memory. This is slow, and across thousands of threads it adds up to significant memory bandwidth usage. π₯ And hereβs the opportunity: neighboring threads within a block read overlapping regions.
Instead of having each thread fetch its own data from global memory, we load a tile of the image (block size + padding) into shared memory. All threads in the block cooperate to read this data, then synchronize, and apply the blur using the shared memory. This significantly reduces global memory traffic and can result in up to a ~33% performance boost, depending on the hardware.
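A sketch of the tiled version, assuming a compile-time radius and clamp-to-edge borders (names and details are illustrative, not the post's exact code):

```cuda
#define BLOCK  16
#define RADIUS 4
#define TILE   (BLOCK + 2 * RADIUS)   // block plus halo padding on each side

// Tiled box blur: the block cooperatively stages a TILE x TILE region of the
// image in shared memory, then every thread averages from the fast tile.
__global__ void boxBlurShared(const float* in, float* out, int w, int h)
{
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * BLOCK + threadIdx.x;
    int y = blockIdx.y * BLOCK + threadIdx.y;

    // Each thread loads one or more tile elements; the halo requires
    // extra strided loads beyond one element per thread.
    for (int ty = threadIdx.y; ty < TILE; ty += BLOCK)
        for (int tx = threadIdx.x; tx < TILE; tx += BLOCK) {
            int gx = blockIdx.x * BLOCK + tx - RADIUS;
            int gy = blockIdx.y * BLOCK + ty - RADIUS;
            gx = min(max(gx, 0), w - 1);   // clamp-to-edge at borders
            gy = min(max(gy, 0), h - 1);
            tile[ty][tx] = in[gy * w + gx];
        }
    __syncthreads();                       // tile must be complete before use

    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            sum += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];
    out[y * w + x] = sum / ((2 * RADIUS + 1) * (2 * RADIUS + 1));
}
```

Note the barrier sits before any thread can return, so every thread in the block reaches it; returning before a `__syncthreads()` that others still execute would be undefined behavior.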
As weβll see, thread block size and blur radius directly affect the performance. Can you guess what would happen if we use 32Γ32 blocks instead of 16Γ16?
This is why understanding the GPU architecture matters β small changes, like proper use of shared memory, can lead to major performance gains.
The full source code from this post will be available on GitHub.
In the following posts, weβll revisit the blur filter and explore how to make it even faster using techniques like intra-warp communication and new hardware features such as distributed shared memory.
π± Donβt forget that you can also find my posts on Instagram -> https://lnkd.in/dbKdgpE8
#GPU #GPUProgramming #GPUArchitecture #ParallelComputing #CUDA #NVIDIA #AMD #ComputerArchitecture #GPUDemystified