NVIDIA Fermi marks a major milestone in the evolution of the Streaming Multiprocessor (SM), bringing it closer to what we have in today's GPUs. Manufactured on a 40nm process node, Fermi delivered roughly two to four times the computational capability of Tesla.
The new SM architecture can now keep up to 48 warps in flight and packs 32 CUDA cores, 4× the 8 cores of a Tesla SM, alongside 16 load/store units and 4 special function units (SFUs). It also features larger shared memory (up to 48 KB per SM, versus 16 KB on Tesla) and can schedule 2 warps per cycle for higher throughput.
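To see where the 48-warp figure comes from, a small device query can help. The sketch below uses the CUDA runtime's `cudaGetDeviceProperties` and divides the maximum resident threads per SM by the warp size. On a Fermi-class part (compute capability 2.x) this should work out to 1536 / 32 = 48 warps; newer GPUs will report different numbers, and recent CUDA toolkits no longer target Fermi, so treat this as an illustration rather than something tuned for that hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess || deviceCount == 0) {
        std::printf("No CUDA device available\n");
        return 1;
    }

    // Query device 0; picking the first device is an assumption of this sketch.
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("cudaGetDeviceProperties failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Warps that can be resident ("in flight") on one SM at the same time.
    int maxWarpsPerSM = prop.maxThreadsPerMultiProcessor / prop.warpSize;

    std::printf("Device:               %s (compute capability %d.%d)\n",
                prop.name, prop.major, prop.minor);
    std::printf("SM count:             %d\n", prop.multiProcessorCount);
    std::printf("Max threads per SM:   %d\n", prop.maxThreadsPerMultiProcessor);
    std::printf("Warp size:            %d\n", prop.warpSize);
    std::printf("Max resident warps:   %d per SM\n", maxWarpsPerSM);
    std::printf("Shared mem per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```

Compile with `nvcc` and run on any CUDA-capable machine to compare your GPU's limits against Fermi's.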
Another game-changing improvement: atomic operations are now handled in the L2 cache rather than in DRAM, giving a 5× to 20× performance boost for atomic-heavy workloads.
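A histogram is a classic example of an atomic-heavy workload: many threads increment a small set of counters in global memory, so the speed of each atomic update dominates the runtime. The following is a minimal sketch, not a tuned implementation. The kernel name `histogramKernel`, the 256-bin layout, and the launch configuration are all illustrative choices.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kNumBins = 256;  // one bin per possible byte value

// Each thread walks the input with a grid-stride loop and bumps the
// matching bin with a global-memory atomic. Many threads hit the same
// bins, so this kernel is bound by atomic throughput.
__global__ void histogramKernel(const unsigned char* data, size_t n,
                                unsigned int* bins) {
    size_t idx = static_cast<size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
    size_t stride = static_cast<size_t>(gridDim.x) * blockDim.x;
    for (size_t i = idx; i < n; i += stride) {
        atomicAdd(&bins[data[i]], 1u);
    }
}

int main() {
    const size_t n = 1 << 20;
    std::vector<unsigned char> host(n);
    for (size_t i = 0; i < n; ++i) {
        host[i] = static_cast<unsigned char>(i % kNumBins);  // uniform test data
    }

    unsigned char* dData = nullptr;
    unsigned int* dBins = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dBins, kNumBins * sizeof(unsigned int));
    cudaMemcpy(dData, host.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, kNumBins * sizeof(unsigned int));

    histogramKernel<<<64, 256>>>(dData, n, dBins);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("Kernel failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    unsigned int bins[kNumBins];
    cudaMemcpy(bins, dBins, sizeof(bins), cudaMemcpyDeviceToHost);

    // With the uniform input above, every bin should hold n / kNumBins items.
    std::printf("bin[0] = %u, expected %zu\n", bins[0], n / kNumBins);

    cudaFree(dData);
    cudaFree(dBins);
    return 0;
}
```

Production histograms usually accumulate per-block counts in shared memory first and merge them into global memory at the end, which reduces contention on the global atomics that this simple version leans on.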
Fermi clearly paved the way for modern GPU architectures. Next up, we'll explore Kepler, and if you thought Fermi transformed the Streaming Multiprocessor, just wait to see what Kepler brings to the table.