Kepler took efficiency to the next level. While keeping roughly the same number of SMs as Fermi, NVIDIA packed way more cores into each SMX, increased shared memory, and boosted throughput — all while consuming much less power — achieved by lowering the clock and unifying the core clock with the card clock.
🌟 The new generation of the Streaming Multiprocessor, the SMX (Streaming Multiprocessor eXtension), is a beast. It could keep 64 warps in flight, with 192 CUDA cores, 64 Double Precision Cores, 32 SFUs, and 32 load/store units per SMX (GK110). Each SMX has 4 warp schedulers, with 2 dispatch units per scheduler, allowing multiple instructions to be issued per cycle. Deterministic instruction latencies also let the compiler provide scheduling info, reducing hardware complexity and saving power.
Warp scheduling, shuffle instructions, and dynamic parallelism made the GPU smarter, faster, and more autonomous.