GPU Performance Primitives: Tiling, Scan, Reduction, Roofline

June 1, 2026 • Interview Review

For GPU interview practice, I want every kernel explanation to answer four questions: what each thread computes, what memory pattern it creates, where synchronization or atomics enter, and what bottleneck I would test first.

The Four-Bullet Kernel Drill

Thread/data mapping: Which element, row, tile, or partial result does a thread own?
Memory access pattern: Are global loads/stores coalesced? Is reuse staged in shared memory?
Synchronization or atomics: Is there block-local cooperation, cross-block staging, or a contended update?
Bottleneck hypothesis: Is the first suspect bandwidth, compute, occupancy, bank conflicts, divergence, launch overhead, or atomics?

Shared-Memory Transpose

Naive matrix transpose has an unavoidable-looking global-memory problem: either the read is coalesced and the write is strided, or the read is strided and the write is coalesced. Shared-memory tiling changes the problem.

Read a tile from global memory into shared memory with coalesced access.
Synchronize the block.
Read the tile transposed from shared memory.
Write to global memory with coalesced access.

The catch is shared-memory bank conflicts. For a 32-by-32 float tile, transposed access can make many threads in a warp hit the same bank. Padding changes the stride:

__shared__ float tile[32][33];

The extra column spreads accesses over banks and often fixes the worst conflict pattern.

Tiled Matrix Multiplication

Naive GPU matrix multiplication reloads values from `A` and `B` from global memory for many output elements. Tiling uses shared memory for reuse.

Thirty-second answer. In naive GPU matmul, each output element reloads many values from global memory. With tiling, each block cooperatively loads a tile of `A` and `B` into shared memory, synchronizes, and reuses those values for many multiply-adds. The tradeoff is that tile size consumes shared memory and registers, so it affects occupancy. The main win is reduced global-memory traffic.

Transpose and matmul both use shared memory, but for different reasons. Transpose mainly uses shared memory to make global writes coalesced. Matmul mainly uses shared memory to reuse loaded data.

Reduction

A sum reduction maps each thread to one or more input elements, then reduces partial sums inside a block. Larger reductions usually write block-level partial sums, then launch another stage or use a library primitive.

thread/data mapping:
  each thread loads one or more elements

memory pattern:
  coalesced global loads, then shared-memory partials

synchronization:
  __syncthreads() between tree levels

bottleneck hypothesis:
  memory bandwidth first, then synchronization and occupancy

Atomics can be correct for some reductions, but a global atomic hot spot often serializes too much work. Use them carefully and measure.

Prefix Scan

Prefix scan converts values into running totals. The exclusive version starts with zero:

input:     [3, 1, 7, 0, 4, 1, 6, 3]
exclusive: [0, 3, 4, 11, 11, 15, 16, 22]

A sequential scan has a loop-carried dependency. GPU scan restructures the work as a tree:

Load a chunk into shared memory.
Up-sweep to build partial sums.
Set the final total to zero for exclusive scan.
Down-sweep to distribute prefixes.
Store the scanned result.

For arrays larger than one block, scan each block independently, write block sums, scan those block sums, and add scanned block offsets back to each chunk.

Scan matters because it turns independent local decisions into compact output positions. That is why it shows up in stream compaction, radix sort, groupby output placement, and hash join materialization.

Roofline Model

The roofline model asks whether a kernel is limited by memory bandwidth or compute throughput. The key quantity is arithmetic intensity:

I = FLOPs / bytes moved

If peak memory bandwidth is `B`, then memory bandwidth can support at most `I * B` FLOPs per second. If peak compute throughput is `P_max`, the bound is:

P <= min(I * B, P_max)

Low arithmetic intensity usually means memory-bound. High arithmetic intensity can become compute-bound. Real kernels often land below the roof because of bad coalescing, cache misses, register spills, low occupancy, branch divergence, synchronization overhead, or insufficient parallelism.

Production Instinct

For standard primitives such as scan, sort, reduction, and dense linear algebra, start with tuned libraries such as CUB, Thrust, or cuBLAS. Hand-written kernels are worth discussing when the goal is learning, fusion with a custom operator, unusual data layout, or a performance experiment with a clear measurement plan.

Review sentence. I would first state the thread mapping and memory pattern, then identify the likely bottleneck, then say what I would profile and change. That keeps the answer grounded in measurement instead of GPU folklore.