Another fundamental pattern in GPU programming: the histogram, used to categorize data into bins — from text processing to image analysis.
Examples include counting how often each character appears in a text, or computing the intensity distribution of an image's pixels.
👉 The simplest approach:
For each input element, compute its category (or bin) and increment that bin's counter.
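On the CPU this is just a sequential loop. A minimal sketch, assuming byte-valued input and 256 bins (the function and variable names are illustrative):

```cuda
#include <cstddef>

// Sequential baseline: each byte's value is its bin index.
// Assumes the caller has zero-initialized bins[0..255].
void histogram_cpu(const unsigned char *input, size_t n, unsigned int bins[256])
{
    for (size_t i = 0; i < n; ++i)
        ++bins[input[i]];   // compute the bin and increment its counter
}
```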
To use the GPU’s capabilities effectively, we can launch one thread per element (e.g., one per character or pixel). However, this introduces a classic concurrency challenge: multiple threads may try to update the same bin simultaneously, causing data races.
🔧 The fix? Atomic operations, which make each read-modify-write of a bin indivisible so that concurrent updates are never lost — CUDA has supported atomics since the Tesla architecture (2006).
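For example, a naive kernel might look like this — a sketch, again assuming byte-valued input and 256 bins (the names are mine, not from the post): one thread per element, one global atomicAdd per element.

```cuda
// Naive GPU histogram: every increment is an atomic on global memory.
__global__ void histogram_naive(const unsigned char *input, size_t n,
                                unsigned int *bins /* 256 global counters */)
{
    size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[input[i]], 1u);  // serialized whenever threads hit the same bin
}
```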
However, atomic operations are notoriously slow: access to the memory location has to be serialized when multiple threads update it, and the only point where atomics from different Streaming Multiprocessors (where thread blocks execute) can be resolved is the L2 cache — which, while faster than global memory, is still relatively slow.
🔥 This is where 𝙨𝙝𝙖𝙧𝙚𝙙 𝙢𝙚𝙢𝙤𝙧𝙮 plays a key role.
We’ve seen that shared memory supports atomic operations—there are dedicated hardware instructions for this (see the post on shared memory)—and that it is far faster to access than global memory or the L2 cache.
A common optimization pattern is privatization: each thread block builds a private copy of the histogram in shared memory, updates it with fast shared-memory atomics, and only at the end merges ("publishes") its partial counts into the global histogram.
This drastically reduces the number of global atomics and improves performance (by up to 93% in some use cases).
Basically, instead of every thread block fighting over the L2 cache, we’ve created "local contention zones" (one per thread block) that scale better — moving most atomic operations to shared memory.
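Here is a minimal sketch of what such a privatized kernel can look like, under the same assumptions as before (byte-valued input, 256 bins, illustrative names). A grid-stride loop lets each block process many elements before it publishes, so the per-block merge cost is amortized:

```cuda
#define NUM_BINS 256

__global__ void histogram_privatized(const unsigned char *input, size_t n,
                                     unsigned int *bins /* global histogram */)
{
    // Per-block private histogram in shared memory.
    __shared__ unsigned int local_bins[NUM_BINS];

    // Cooperatively zero the private copy.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        local_bins[b] = 0;
    __syncthreads();

    // Grid-stride loop: contention stays inside the block, on shared memory.
    for (size_t i = blockIdx.x * (size_t)blockDim.x + threadIdx.x;
         i < n;
         i += (size_t)blockDim.x * gridDim.x)
        atomicAdd(&local_bins[input[i]], 1u);
    __syncthreads();

    // "Publish": merge the private counts into the global histogram.
    // Only NUM_BINS global atomics per block, regardless of n.
    for (int b = threadIdx.x; b < NUM_BINS; b += blockDim.x)
        atomicAdd(&bins[b], local_bins[b]);
}
```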
🗒️ Note: "publishing" — merging the results from shared memory back to global memory — is not free. In some cases, especially when the number of atomic operations per block is low, this overhead can exceed the cost of performing the atomics directly in global memory.
And there’s more: