Blurring is one of the most widely used techniques in image processing, applied in visual effects, downsampling, smoothing, motion blur, privacy and obfuscation, and more.

There are different types of blur filters, such as box blur, Gaussian blur, and radial blur, each with its own advantages and trade-offs, and various strategies for GPU implementation, such as single-pass, hardware downsampling, and 2-tap passes.

👉 Box Blur
The idea is simple: for each pixel, given a radius, we read its neighbors, average their color values, and write the result. The larger the radius, the stronger the blur effect.
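For a single pixel, the averaging step might look like the following sketch (assuming a single-channel 8-bit image in row-major layout; the function name and parameters are illustrative, not the exact code from this post):

```cuda
// Sketch: average the (2*radius + 1) x (2*radius + 1) neighborhood of one pixel.
// Out-of-bounds neighbors are clamped to the image border.
__device__ unsigned char boxAverage(const unsigned char* img,
                                    int width, int height,
                                    int x, int y, int radius)
{
    int sum = 0;
    int count = 0;
    for (int dy = -radius; dy <= radius; ++dy) {
        for (int dx = -radius; dx <= radius; ++dx) {
            int nx = min(max(x + dx, 0), width  - 1);  // clamp to border
            int ny = min(max(y + dy, 0), height - 1);
            sum += img[ny * width + nx];
            ++count;
        }
    }
    return (unsigned char)(sum / count);
}
```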

To apply the blur across the image, we need to launch one thread per pixel.
Remember, CUDA doesn’t simply launch a flat number of threads—you define a grid of thread blocks and specify how many threads each block contains.

In this example, we use a 2D blockSize = 16×16 to match the image’s 2D nature.
Then, we compute how many blocks (2D, again) are needed to cover the image, rounding up so that pixels in partial blocks at the edges are still covered:
gridSize = ceil(imageSize / blockSize)
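On the host side, the launch configuration could look like this sketch (boxBlurKernel, d_in, d_out, width, height, and radius are assumed names):

```cuda
dim3 blockSize(16, 16);  // 256 threads per block, matching the image's 2D nature
dim3 gridSize((width  + blockSize.x - 1) / blockSize.x,   // round up so partial
              (height + blockSize.y - 1) / blockSize.y);  // edge blocks are included
boxBlurKernel<<<gridSize, blockSize>>>(d_in, d_out, width, height, radius);
```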

Inside the kernel, each thread calculates its corresponding pixel position from its block index, the block dimensions, and its thread index within the block, then applies the blur logic.
That’s all you need for a working blur.
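A naive kernel along these lines might look like the sketch below, reusing the boxAverage helper from the earlier sketch (again, the names and the single-channel format are assumptions):

```cuda
__global__ void boxBlurKernel(const unsigned char* in, unsigned char* out,
                              int width, int height, int radius)
{
    // Each thread derives its pixel coordinate from block and thread indices.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= width || y >= height) return;  // guard threads past the image edge

    // boxAverage reads the neighborhood from global memory and averages it.
    out[y * width + x] = boxAverage(in, width, height, x, y, radius);
}
```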

🌟 But here’s the catch: each thread reads (2·RADIUS + 1) × (2·RADIUS + 1) pixels from global memory. This is slow, and across thousands of threads it adds up to significant memory bandwidth usage.
🔥 And here’s the opportunity: neighboring threads within a block read overlapping regions.
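To make the overlap concrete, take a 16×16 block and a radius of 4 (illustrative numbers): each thread reads 9 × 9 = 81 pixels, so the block issues 256 × 81 = 20,736 global reads, yet the region those threads actually need is only a (16 + 2·4) × (16 + 2·4) = 24 × 24 = 576-pixel tile, roughly 36× fewer values.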

Instead of having each thread fetch its own data from global memory, we load a tile of the image (block size plus padding) into shared memory. All threads in the block cooperate to load this data, then synchronize and apply the blur by reading from shared memory.
This significantly reduces global memory traffic and can result in up to a ~33% performance boost, depending on the hardware.
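A tiled version might look like the following sketch, with the block size and radius fixed at compile time so the shared-memory tile can be statically sized (the macro names, the kernel name, and the requirement that the kernel is launched with 16×16 blocks are all assumptions):

```cuda
#define BLOCK  16                    // thread block dimension (kernel must be launched 16x16)
#define RADIUS  4                    // blur radius, fixed so the tile size is known at compile time
#define TILE   (BLOCK + 2 * RADIUS)  // block plus a RADIUS-wide halo on each side

__global__ void boxBlurSharedKernel(const unsigned char* in, unsigned char* out,
                                    int width, int height)
{
    __shared__ unsigned char tile[TILE][TILE];

    int x = blockIdx.x * BLOCK + threadIdx.x;  // this thread's output pixel
    int y = blockIdx.y * BLOCK + threadIdx.y;

    // Cooperatively load the tile (block + halo) from global memory.
    // Each thread may load more than one element because TILE > BLOCK.
    for (int ty = threadIdx.y; ty < TILE; ty += BLOCK) {
        for (int tx = threadIdx.x; tx < TILE; tx += BLOCK) {
            int gx = (int)blockIdx.x * BLOCK + tx - RADIUS;  // global coords of this tile element
            int gy = (int)blockIdx.y * BLOCK + ty - RADIUS;
            gx = min(max(gx, 0), width  - 1);                // clamp halo samples to the border
            gy = min(max(gy, 0), height - 1);
            tile[ty][tx] = in[gy * width + gx];
        }
    }
    __syncthreads();  // every load must finish before any thread reads the tile

    if (x >= width || y >= height) return;

    // Average the neighborhood entirely from shared memory.
    int sum = 0;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            sum += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];

    out[y * width + x] = (unsigned char)(sum / ((2 * RADIUS + 1) * (2 * RADIUS + 1)));
}
```

Global memory is now touched once per tile element during the load phase; the averaging loop itself only reads shared memory.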

As we’ll see, thread block size and blur radius directly affect performance.
Can you guess what would happen if we used 32×32 blocks instead of 16×16?

This is why understanding the GPU architecture matters — small changes, like proper use of shared memory, can lead to major performance gains.

The full source code from this post will be available on GitHub.

In the following posts, we’ll revisit the blur filter and explore how to make it even faster using techniques like intra-warp communication and new hardware features such as distributed shared memory.