NVIDIA Fermi marks a major milestone in the evolution of the Streaming Multiprocessor (SM), bringing it closer to what we have in today’s GPUs. Manufactured on a 40 nm process node, Fermi roughly doubled to quadrupled the computational capabilities of Tesla.

🌟 The new SM architecture can now keep up to 48 warps in flight and packs 4× as many cores as a Tesla SM: 32 CUDA cores, alongside 16 load/store units and 4 SFUs. It also features larger shared memory and can schedule 2 warps per cycle for higher throughput.
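
As a quick sanity check on these per-SM figures, here is a minimal Python sketch of the arithmetic. The 48 warps, 32 cores, and 2 warps per cycle come from the text; the warp size of 32 threads and the 8 cores per Tesla SM are assumptions taken from NVIDIA's published specifications, and the function names are purely illustrative.

```python
# Per-SM figures quoted in the text for Fermi.
FERMI_WARPS_IN_FLIGHT = 48
FERMI_CUDA_CORES = 32
FERMI_WARPS_SCHEDULED_PER_CYCLE = 2

# Assumptions not stated in the text (from NVIDIA's published specs).
WARP_SIZE = 32           # threads per warp
TESLA_CORES_PER_SM = 8   # scalar cores in a Tesla-generation SM


def resident_threads(warps: int, warp_size: int = WARP_SIZE) -> int:
    """Threads an SM can keep in flight for a given number of resident warps."""
    return warps * warp_size


def core_ratio(new_cores: int, old_cores: int) -> float:
    """How many times more cores the newer SM has than the older one."""
    return new_cores / old_cores


if __name__ == "__main__":
    threads = resident_threads(FERMI_WARPS_IN_FLIGHT)
    print("Resident threads per Fermi SM:", threads)
    print("Resident threads per CUDA core:", threads // FERMI_CUDA_CORES)
    print("Fermi vs. Tesla cores per SM:",
          core_ratio(FERMI_CUDA_CORES, TESLA_CORES_PER_SM))
```

Keeping far more threads resident than there are cores is what lets the schedulers switch to another ready warp while one waits on memory.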

Another game-changing improvement: atomic operations are now handled in L2 cache, rather than DRAM, giving a 5×–20× performance boost for atomic-heavy workloads.

βš™οΈ Fermi in numbers:

Fermi clearly paved the way for modern GPU architectures. Next up, we’ll explore Kepler, and if you thought Fermi transformed the Streaming Multiprocessor, just wait to see what Kepler brings to the table.