Kepler pushed efficiency further. While keeping roughly the same number of SMs as Fermi, NVIDIA packed far more cores into each SMX, enlarged the on-chip memory, and boosted throughput, all while drawing much less power. Much of the saving came from lowering the clock: Kepler dropped Fermi's double-rate shader clock, so the cores run at the same clock as the rest of the GPU.

🌟 The new generation of the Streaming Multiprocessor, the SMX (Streaming Multiprocessor eXtension), is a beast. On GK110, each SMX keeps up to 64 warps in flight and contains 192 CUDA cores, 64 double-precision units, 32 SFUs (special function units), and 32 load/store units. Each SMX has 4 warp schedulers with 2 dispatch units per scheduler, so every scheduler can issue two independent instructions from its selected warp in a single cycle. Because math instruction latencies are deterministic, the compiler can supply scheduling information ahead of time, which reduces hardware complexity and saves power.
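
A quick back-of-the-envelope check suggests why the second dispatch unit per scheduler matters. The sketch below uses the per-SMX figures from the paragraph above and assumes a warp width of 32 threads (standard for CUDA, but not stated here). It also assumes that one warp-wide single-precision instruction occupies 32 cores for one cycle, which is a simplification of the real pipeline.

```python
# Back-of-the-envelope issue-rate check for one GK110 SMX.
WARP_SIZE = 32              # assumption: standard CUDA warp width
CUDA_CORES = 192            # per SMX, from the text
SCHEDULERS = 4              # per SMX, from the text
DISPATCH_PER_SCHEDULER = 2  # per scheduler, from the text

# Warp-wide instructions the SMX can dispatch each cycle.
max_issue = SCHEDULERS * DISPATCH_PER_SCHEDULER

# Warp-wide SP instructions needed each cycle to keep every core busy.
needed_for_cores = CUDA_CORES // WARP_SIZE

# What the SMX could dispatch if each scheduler issued only one instruction.
single_issue = SCHEDULERS * 1

print(f"dispatch slots per cycle: {max_issue}")
print(f"warp instructions to fill the cores: {needed_for_cores}")
print(f"single issue would fill all cores: {single_issue >= needed_for_cores}")
```

Under these assumptions, four single-issue schedulers could feed at most 128 of the 192 cores per cycle, so keeping the SMX busy depends on finding independent instructions within a warp.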

Warp scheduling, shuffle instructions, and dynamic parallelism made the GPU smarter, faster, and more autonomous.

💡 Key Features

Dynamic Parallelism: A kernel running on the GPU can launch further kernels, manage their dependencies, and synchronize on their results without CPU involvement.
Warp Shuffle (SHFL): Threads within a warp can exchange register values directly instead of going through shared memory. This reduces latency and saves power, and it benefits algorithms such as FFTs and convolutions.
Advanced Warp Scheduling: Each SMX has 4 warp schedulers with 2 dispatch units per scheduler, so two independent instructions from the same warp can be dispatched in the same cycle.
More cores, more memory on the SMX: A large gain in performance per watt; NVIDIA claimed up to 3x over the Fermi SM.
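
To make the warp shuffle idea concrete, here is a minimal host-side model of a warp-wide sum built on a shuffle-down exchange. It is plain Python that simulates only the data movement; it is not GPU code. The helper names `shfl_down` and `warp_reduce_sum` are hypothetical, chosen to echo the CUDA intrinsic `__shfl_down` (`__shfl_down_sync` in newer toolkits), and the warp width of 32 is an assumption.

```python
# Host-side model of a warp-level sum using a shuffle-down exchange.
# This simulates the data movement only; it is not GPU code.
WARP_SIZE = 32  # assumption: standard CUDA warp width


def shfl_down(values, delta):
    """Each lane i reads the value held by lane i + delta.

    Lanes whose source would fall outside the warp keep their own value,
    mirroring the behaviour of the CUDA shuffle-down intrinsic.
    """
    n = len(values)
    return [values[i + delta] if i + delta < n else values[i]
            for i in range(n)]


def warp_reduce_sum(values):
    """Tree reduction: the offset halves each step until lane 0 holds the total."""
    vals = list(values)
    offset = len(vals) // 2
    while offset > 0:
        received = shfl_down(vals, offset)
        vals = [v + r for v, r in zip(vals, received)]
        offset //= 2
    return vals[0]  # only lane 0 is guaranteed to hold the full sum


lanes = list(range(WARP_SIZE))  # lane i starts with the value i
total = warp_reduce_sum(lanes)
print(total)  # 496
```

For 32 lanes the loop runs five exchange steps (offsets 16, 8, 4, 2, 1), and no value ever passes through a shared buffer, which is the property the shuffle instruction provides in hardware.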

⚙️ Kepler in Numbers

SMXs: 15 (full GK110)
Threads: 64 warps/SMX → 2,048 threads/SMX → 30,720 resident threads total
Execution units: 192 CUDA cores + 64 DP units + 32 SFUs + 32 LD/ST units per SMX
Peak compute: Up to 4.7 TFLOPS single precision
Shared memory: 64 KB per SMX on GK110, configurable between shared memory and L1 cache (the later GK210 doubles this to 128 KB)
L2 cache: 1.5 MB shared
Global memory: 6 GB GDDR5, ~288 GB/s
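
The headline figures follow from the per-SMX numbers, and the short sketch below re-derives them. Two inputs are assumptions not stated in the list: a warp width of 32 threads, and 2 floating-point operations per core per cycle (one fused multiply-add). The clock it prints is only the value implied by the quoted peak, not a specification; actual clocks vary by board.

```python
# Re-deriving the headline figures from the per-SMX numbers above.
SMX_COUNT = 15
WARPS_PER_SMX = 64
CORES_PER_SMX = 192
WARP_SIZE = 32             # assumption: standard CUDA warp width
FLOPS_PER_CORE_CYCLE = 2   # assumption: one fused multiply-add per cycle
PEAK_SP_FLOPS = 4.7e12     # quoted peak, from the list above

threads_per_smx = WARPS_PER_SMX * WARP_SIZE
total_threads = threads_per_smx * SMX_COUNT
total_cores = CORES_PER_SMX * SMX_COUNT

# Clock implied by the quoted peak: peak = cores * flops_per_cycle * clock.
implied_clock_mhz = PEAK_SP_FLOPS / (total_cores * FLOPS_PER_CORE_CYCLE) / 1e6

print(threads_per_smx)           # 2048
print(total_threads)             # 30720
print(total_cores)               # 2880
print(round(implied_clock_mhz))  # 816
```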