The three patterns behind most real GPU kernels: combining many values into one with a shared-memory tree reduction, letting threads update a shared location safely with atomics, and reusing data through shared-memory tiles in matrix multiplication.
One-thread-per-element kernels are the easy case: every thread writes its own output and nobody steps on anyone else. Real workloads are harder because threads must combine their results — summing an array, building a histogram, multiplying matrices — and the moment two threads want to update the same location, you have a coordination problem.
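This final chapter covers the three patterns that solve it and appear in almost every serious CUDA kernel: shared-memory tree reductions, atomic operations, and shared-memory tiling for matrix multiplication.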
Together these turn the basics from the first two chapters into kernels that actually compete with library code.
Combining many values into one creates a race when threads write to a shared target; because operations such as addition are associative, the work can be regrouped into a parallel tree.
Halve the stride each step, with a barrier between rounds: N values collapse to one in log₂ N parallel steps.
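The halving loop can be sketched as a small CUDA kernel. This is a minimal illustration under stated assumptions, not the lesson's own code: the names `blockSum`, `sumOnGpu`, and `BLOCK` are hypothetical, the block size is assumed to be a power of two, error checking is omitted, and the handful of per-block partial sums is finished on the CPU for brevity.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // threads per block; assumed to be a power of two

// Each block sums BLOCK consecutive inputs into one partial sum.
__global__ void blockSum(const float* in, float* partial, int n) {
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;
    s[tid] = (i < n) ? in[i] : 0.0f;  // pad the tail with 0, the identity for +
    __syncthreads();

    // Halve the stride each round; the barrier keeps rounds from overlapping.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();  // outside the if, so every thread reaches it
    }
    if (tid == 0) partial[blockIdx.x] = s[0];
}

// Hypothetical host helper: returns the sum of `host` (no error checking).
float sumOnGpu(const std::vector<float>& host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    int blocks = (n + BLOCK - 1) / BLOCK;

    float* dIn = nullptr;
    float* dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockSum<<<blocks, BLOCK>>>(dIn, dPartial, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial) total += p;  // combine the per-block sums on the CPU
    return total;
}
```

With 256 threads the loop runs 8 rounds (strides 128, 64, ..., 1), which is the log₂ N step count for a single block. Floating-point addition is only approximately associative, so the tree order can give a slightly different result than a sequential sum.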
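An atomic performs its read-modify-write as one indivisible operation, so no update is lost; contention on a single address serializes the threads, so reduce within each block first and issue one atomic per block.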
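A histogram is a common place to see the reduce-then-atomic idea. The sketch below is one possible arrangement, assuming byte-valued input and 256 bins: each block accumulates into a private shared-memory histogram, then merges it into the global result with at most one `atomicAdd` per bin. The names `histogram`, `histogramOnGpu`, and `BINS`, along with the cap of 64 blocks, are illustrative choices, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BINS = 256;  // one bin per possible byte value

__global__ void histogram(const unsigned char* data, int n, unsigned int* bins) {
    __shared__ unsigned int local[BINS];  // per-block private histogram
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) local[b] = 0;
    __syncthreads();

    // Grid-stride loop: atomics here only contend within one block.
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        atomicAdd(&local[data[i]], 1u);
    }
    __syncthreads();

    // Merge: at most one global atomic per bin per block.
    for (int b = threadIdx.x; b < BINS; b += blockDim.x) {
        if (local[b] != 0) atomicAdd(&bins[b], local[b]);
    }
}

// Hypothetical host helper: returns the 256-bin histogram of `host`.
std::vector<unsigned int> histogramOnGpu(const std::vector<unsigned char>& host) {
    std::vector<unsigned int> result(BINS, 0);
    int n = static_cast<int>(host.size());
    if (n == 0) return result;

    unsigned char* dData = nullptr;
    unsigned int* dBins = nullptr;
    cudaMalloc(&dData, n);
    cudaMalloc(&dBins, BINS * sizeof(unsigned int));
    cudaMemcpy(dData, host.data(), n, cudaMemcpyHostToDevice);
    cudaMemset(dBins, 0, BINS * sizeof(unsigned int));

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    if (blocks > 64) blocks = 64;  // arbitrary cap; the grid-stride loop covers the rest
    histogram<<<blocks, threads>>>(dData, n, dBins);

    cudaMemcpy(result.data(), dBins, BINS * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(dData);
    cudaFree(dBins);
    return result;
}
```

Replacing `atomicAdd(&local[data[i]], 1u)` with a plain `local[data[i]]++` would reintroduce the lost-update race, since two threads could read the same old count and both write back the same incremented value.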
Shared-memory tiles load each input value from global memory once per tile instead of once per output element, raising arithmetic intensity.
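The tiling idea can be sketched for square, row-major matrices. This is an illustrative kernel rather than a tuned one: `matmulTiled`, `matmulOnGpu`, and the tile width `TILE = 16` are assumptions made for the example, and error checking is omitted. Each block stages one tile of A and one tile of B in shared memory, and every staged value is then reused `TILE` times before the next global load.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // tile width; each block is TILE x TILE threads

// C = A * B for square n x n row-major matrices.
__global__ void matmulTiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Each thread loads one element of each tile; out-of-range slots get 0.
        As[threadIdx.y][threadIdx.x] =
            (row < n && aCol < n) ? A[row * n + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (bRow < n && col < n) ? B[bRow * n + col] : 0.0f;
        __syncthreads();  // tiles fully loaded before anyone reads them

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // everyone done reading before the next overwrite
    }
    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper: returns A * B for n x n inputs.
std::vector<float> matmulOnGpu(const std::vector<float>& A,
                               const std::vector<float>& B, int n) {
    size_t bytes = static_cast<size_t>(n) * n * sizeof(float);
    std::vector<float> C(static_cast<size_t>(n) * n, 0.0f);
    if (n == 0) return C;

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 threads(TILE, TILE);
    dim3 blocks((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmulTiled<<<blocks, threads>>>(dA, dB, dC, n);

    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

Without tiling, each thread would read a full row of A and a full column of B from global memory for its one output. With tiles, those reads are shared across the block, which cuts global loads by roughly a factor of `TILE`.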