π½π‘πͺπ§ πππ‘π©ππ§ β one of the most widely used techniques in image processing β applied in visual effects, downsampling, smoothing, motion blur, privacy and obfuscation, and more. There are different types of blur filters, such as box blur, Gaussian blur, and radial blur, each with its own advantages and trade-offs, and various strategies for GPU implementation like single-pass filtering, hardware downsampling, and repeated 2-tap passes.
There are plenty of articles online covering various blur techniques, so I wonβt go into that here. In this post, Iβll focus on one of the simplest forms: the box blur. Weβll implement it in two ways β a basic version without shared memory, and an optimized version that leverages the GPUβs shared memory for better performance β and compare the results.
π Box Blur The idea is simple: ππ€π§ ππππ π₯ππππ‘, πππ«ππ£ π π§ππππͺπ¨, π¬π π§πππ ππ©π¨ π£ππππππ€π§π¨, ππ«ππ§πππ π©ππππ§ ππ€π‘π€π§ π«ππ‘πͺππ¨, ππ£π π¬π§ππ©π π©ππ π§ππ¨πͺπ‘π©. The larger the radius, the stronger the blur effect.
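The steps above fit in a few lines. Here is a plain CPU reference of that idea (a sketch: the single-channel float image, row-major layout, and function name are my assumptions, not code from this post):

```c
#include <assert.h>

/* Box blur on a single-channel, row-major float image.
   For each pixel, average all in-bounds neighbors within `r`. */
void box_blur(const float *in, float *out, int w, int h, int r)
{
    for (int y = 0; y < h; ++y)
        for (int x = 0; x < w; ++x) {
            float sum = 0.0f;
            int count = 0;
            for (int dy = -r; dy <= r; ++dy)
                for (int dx = -r; dx <= r; ++dx) {
                    int nx = x + dx, ny = y + dy;
                    if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                        sum += in[ny * w + nx];
                        ++count;   /* count only pixels inside the image */
                    }
                }
            out[y * w + x] = sum / count;
        }
}
```

Borders are handled here by averaging only the in-bounds neighbors; clamp-to-edge or wrapping are equally valid policies.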
To apply the blur across the image, we need to launch one thread per pixel. Remember, CUDA doesnβt simply launch a flat number of threadsβyou define a grid of thread blocks and specify how many threads each block contains.
In this example, we use a 2D ππ‘π€ππ πππ―π = 16Γ16 to match the imageβs 2D nature. Then, we compute how many blocks (2D, again) are needed to cover the image, rounding up so edge pixels are still covered when the image size is not a multiple of the block size: ππ§πππππ―π = ππ’ππππππ―π / ππ‘π€ππ πππ―π (rounded up)
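In code, that grid computation is a round-up integer division; the image dimensions below are just example values, not from this post:

```c
#include <assert.h>

/* How many blocks of size `block` are needed to cover `size` pixels.
   Adding (block - 1) before dividing rounds the result up. */
int ceil_div(int size, int block) { return (size + block - 1) / block; }

/* Host-side sketch: a 1920x1080 image with a 16x16 thread block gives a
   grid of ceil_div(1920, 16) x ceil_div(1080, 16) = 120 x 68 blocks.
   In CUDA the launch would use dim3 and kernel<<<grid, block>>>(...). */
```

Because the grid rounds up, some threads in edge blocks fall outside the image; the kernel must bounds-check before writing.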
Inside the kernel, each thread calculates its corresponding pixel position from the block ID, block dimensions, and thread ID within the block, then applies the blur logic. Thatβs all you need for a working blur.
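A sketch of that naive kernel (single-channel float image; the names, parameters, and border policy are my assumptions, not the post's exact code):

```cuda
// Naive box blur: each thread averages its neighborhood straight from
// global memory. Out-of-image threads from the rounded-up grid exit early.
__global__ void boxBlur(const float* in, float* out, int w, int h, int r)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x >= w || y >= h) return;            // guard: grid may overshoot

    float sum = 0.0f;
    int count = 0;
    for (int dy = -r; dy <= r; ++dy)
        for (int dx = -r; dx <= r; ++dx) {
            int nx = x + dx, ny = y + dy;
            if (nx >= 0 && nx < w && ny >= 0 && ny < h) {
                sum += in[ny * w + nx];      // global memory read
                ++count;
            }
        }
    out[y * w + x] = sum / count;
}
```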
π But hereβs the catch: each thread reads (2Β·RADIUS + 1)Β² pixels from global memory. This is slow, and across thousands of threads it adds up to significant memory bandwidth usage. π₯ And hereβs the opportunity: neighboring threads within a block read overlapping regions.
Instead of having each thread fetch its own data from global memory, we load a tile of the image (block size + padding) into shared memory. All threads in the block cooperate to read this data, then synchronize, and apply the blur using the shared memory. This significantly reduces global memory traffic and can result in up to a ~33% performance boost, depending on the hardware.
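A sketch of the tiled version, assuming a compile-time radius and clamp-to-edge borders (names and details are illustrative, not the post's exact code):

```cuda
#define BLOCK  16
#define RADIUS 4
#define TILE   (BLOCK + 2 * RADIUS)   // block plus halo padding on each side

// Tiled box blur: the block cooperatively stages a TILE x TILE region of the
// image in shared memory, then every thread averages from the fast tile.
__global__ void boxBlurShared(const float* in, float* out, int w, int h)
{
    __shared__ float tile[TILE][TILE];

    int x = blockIdx.x * BLOCK + threadIdx.x;
    int y = blockIdx.y * BLOCK + threadIdx.y;

    // Each thread loads one or more tile elements; the halo requires
    // extra strided loads beyond one element per thread.
    for (int ty = threadIdx.y; ty < TILE; ty += BLOCK)
        for (int tx = threadIdx.x; tx < TILE; tx += BLOCK) {
            int gx = blockIdx.x * BLOCK + tx - RADIUS;
            int gy = blockIdx.y * BLOCK + ty - RADIUS;
            gx = min(max(gx, 0), w - 1);   // clamp-to-edge at borders
            gy = min(max(gy, 0), h - 1);
            tile[ty][tx] = in[gy * w + gx];
        }
    __syncthreads();                       // tile must be complete before use

    if (x >= w || y >= h) return;
    float sum = 0.0f;
    for (int dy = -RADIUS; dy <= RADIUS; ++dy)
        for (int dx = -RADIUS; dx <= RADIUS; ++dx)
            sum += tile[threadIdx.y + RADIUS + dy][threadIdx.x + RADIUS + dx];
    out[y * w + x] = sum / ((2 * RADIUS + 1) * (2 * RADIUS + 1));
}
```

Note the barrier sits before any thread can return, so every thread in the block reaches it; returning before a `__syncthreads()` that others still execute would be undefined behavior.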
As weβll see, thread block size and blur radius directly affect the performance. Can you guess what would happen if we use 32Γ32 blocks instead of 16Γ16?
This is why understanding the GPU architecture matters β small changes, like proper use of shared memory, can lead to major performance gains.
The full source code from this post will be available on GitHub.
In the following posts, weβll revisit the blur filter and explore how to make it even faster using techniques like intra-warp communication and new hardware features such as distributed shared memory.
π± Donβt forget that you can also find my posts on Instagram -> https://lnkd.in/dbKdgpE8
#GPU #GPUProgramming #GPUArchitecture #ParallelComputing #CUDA #NVIDIA #AMD #ComputerArchitecture #GPUDemystified