Let’s look at how to program GPUs using CUDA, with a simple example: vector addition.

CUDA lets you write C code that runs on the GPU, in the form of functions known as *kernels*. Kernels are called from the CPU (host) side much like regular functions, but they execute on the GPU (device).

The idea:

  1. Define the CUDA kernel β€” a function where each thread adds a pair of elements from vectors A and B. Use __global__ to mark a function as a kernel.
  2. Set up the data the kernel needs β€” input arrays and an output array. CUDA functions like cudaMalloc and cudaMemcpy handle GPU memory allocation and data transfer.
  3. Launch the kernel from the CPU using this syntax:
    myKernel<<<numBlocks, numThreadsPerBlock>>>(args);
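
Putting the three steps together, a minimal sketch might look like the following (error checking omitted for brevity; the names `vecAdd`, `hA`, `dA`, etc. are illustrative, not from the text above):

```cuda
#include <stdlib.h>

// Step 1: the kernel. Each thread adds one pair of elements.
__global__ void vecAdd(const float *A, const float *B, float *C, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
    if (i < n)                                      // guard: grid may overshoot n
        C[i] = A[i] + B[i];
}

int main(void) {
    const int n = 1 << 20;
    size_t bytes = n * sizeof(float);

    // Host-side input and output arrays
    float *hA = (float *)malloc(bytes);
    float *hB = (float *)malloc(bytes);
    float *hC = (float *)malloc(bytes);
    for (int i = 0; i < n; i++) { hA[i] = (float)i; hB[i] = 2.0f * i; }

    // Step 2: allocate GPU memory and copy the inputs over
    float *dA, *dB, *dC;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, hA, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB, bytes, cudaMemcpyHostToDevice);

    // Step 3: launch with enough blocks to cover all n elements
    int numThreadsPerBlock = 256;
    int numBlocks = (n + numThreadsPerBlock - 1) / numThreadsPerBlock;
    vecAdd<<<numBlocks, numThreadsPerBlock>>>(dA, dB, dC, n);

    // Copy the result back to the host, then clean up
    cudaMemcpy(hC, dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    free(hA); free(hB); free(hC);
    return 0;
}
```

Note the bounds check `if (i < n)` in the kernel: the launch rounds the block count up, so the last block may contain threads with no element to process.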

Threads are grouped into blocks. The total number of threads is:

numBlocks Γ— numThreadsPerBlock

Other notes: