CUDA Fundamentals For Systems Interviews
This is the compact version of my CUDA basics notes: enough vocabulary to explain a kernel, reason about memory behavior, and know where synchronization actually happens.
Thirty-second answer. CUDA programs have host code and device code. Host code runs on the CPU: it allocates device memory, copies inputs, launches kernels, checks errors, synchronizes when it needs results, and copies data back. Device code runs on the GPU as kernels. A launch creates a grid of thread blocks; blocks contain threads; NVIDIA hardware executes threads in warps of 32 using SIMT execution. Performance usually comes down to memory traffic, coalescing, occupancy, divergence, and synchronization overhead.
Host, Device, Kernel
A CUDA kernel is a function that runs on the GPU. The CPU launches it with the triple-chevron syntax:
add_one<<<blocks, threads_per_block>>>(d_x, d_y, n);
The host call is asynchronous with respect to the CPU thread. The CPU can continue immediately while the GPU executes queued work. The host only knows the work is complete after an explicit wait or an operation that implies a wait.
add_one<<<blocks, threads>>>(d_x, d_y, n);
cudaDeviceSynchronize();
cudaMemcpy(h_y, d_y, n * sizeof(int), cudaMemcpyDeviceToHost);
Common synchronization points are `cudaDeviceSynchronize()`, `cudaStreamSynchronize(stream)`, event synchronization, and blocking device-to-host copies in the same stream.
Grid, Block, Thread, Warp
A kernel launch creates a grid. A grid contains blocks. A block contains threads. Blocks are assigned to streaming multiprocessors, or SMs, and each block stays on one SM until it finishes. Inside the SM, warp schedulers issue instructions from ready warps.
int i = blockIdx.x * blockDim.x + threadIdx.x;
For 2D data, the same idea maps naturally onto rows and columns:
int col = blockIdx.x * blockDim.x + threadIdx.x;
int row = blockIdx.y * blockDim.y + threadIdx.y;
int idx = row * width + col;
Grid-stride loops are useful when the problem is larger than the total number of launched threads:
for (int i = blockIdx.x * blockDim.x + threadIdx.x;
i < n;
i += blockDim.x * gridDim.x) {
y[i] = x[i] + 1;
}
The important correctness rule is that blocks are independent. They may run in parallel, serially, or in any order. A normal kernel should not depend on block execution order.
Occupancy And Resources
More than one block can reside on an SM if resources allow. The limiting resource might be threads, warps, blocks, registers, or shared memory. For example, if a block uses 256 threads, 32 registers per thread, and 16 KB of shared memory, the SM must have room for all of that before another block can be resident.
Occupancy is useful because ready warps can hide latency. It is not a goal by itself. A kernel with very high occupancy can still be slow if every warp makes scattered memory requests or waits on heavy synchronization.
Memory Hierarchy
A practical CUDA memory model:
global memory / VRAM
- cached through shared L2 and often per-SM L1
shared memory
- on-chip scratchpad
- allocated per block
- explicit and programmer-managed
registers
- private logical storage per thread
- allocated from each SM's physical register file
Caches are mostly automatic. You influence them through access patterns. Shared memory is different: the program explicitly stages data there and synchronizes before other threads use it.
__shared__ float tile[256];
int tid = threadIdx.x;
tile[tid] = input[blockIdx.x * blockDim.x + tid];
__syncthreads();
float x = tile[tid];
Dynamic shared memory is declared as an unsized array and sized at launch:
extern __shared__ int temp[];
block_scan<<<blocks, threads, 2 * threads * sizeof(int)>>>(d_in, d_out);
Coalescing And Warps
A warp has 32 threads. Coalescing is reasoned about at warp level. Neighboring threads should usually access neighboring memory addresses so the hardware can service requests with fewer aligned memory transactions.
// Good: neighboring threads read neighboring floats.
float v = x[blockIdx.x * blockDim.x + threadIdx.x];
// Usually bad: neighboring threads jump by stride.
float w = x[(blockIdx.x * blockDim.x + threadIdx.x) * stride];
This is cache-line-like thinking, but the CUDA term is global-memory coalescing.
Managed Memory And Explicit Copies
`cudaMallocManaged` creates one pointer usable by CPU and GPU. The runtime migrates pages between CPU memory and GPU memory as needed. That is convenient, but it is not free. If the CPU and GPU repeatedly alternate touching the same pages, migration can dominate runtime.
For predictable movement, explicit ownership is often clearer:
cudaMalloc(&d_A, bytes);
cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
kernel<<<blocks, threads>>>(d_A);
cudaMemcpy(h_A, d_A, bytes, cudaMemcpyDeviceToHost);
cudaFree(d_A);
In C++, wrap device allocations with RAII and a custom deleter if possible. A normal `std::unique_ptr<T[]>` calls `delete[]`, not `cudaFree`.
Compilation Model
`nvcc` separates host and device compilation. Host code becomes normal CPU machine code. Device code goes through NVIDIA's compiler pipeline:
CUDA C++ device code
-> NVVM IR
-> optimized IR
-> PTX
-> ptxas or driver JIT
-> SASS / cubin
PTX is virtual GPU assembly. SASS is architecture-specific GPU machine code. A fat binary can contain native cubins for some architectures plus PTX fallback for driver JIT compilation.
Review Checklist
- What runs on the CPU, and what runs on the GPU?
- How does a thread compute its global index?
- Why should neighboring threads access neighboring addresses?
- What is the difference between cache and shared memory?
- What does `cudaDeviceSynchronize()` wait for?
- Why check kernel launch errors?
- How can high register use reduce occupancy?