Parallel Patterns: Reductions, Atomics & Tiling

One-thread-per-element kernels are the easy case: every thread writes its own output and nobody steps on anyone else. Real workloads are harder because threads must combine their results — summing an array, building a histogram, multiplying matrices — and the moment two threads want to update the same location, you have a coordination problem.
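To make the coordination problem concrete, here is a minimal histogram sketch in which many threads may increment the same bin. The names `histogram_kernel` and `histogram_gpu` are hypothetical, error checking is omitted for brevity, and the launch configuration (256 threads per block) is an assumption rather than a tuned choice. The key line uses `atomicAdd`; a plain `bins[b] += 1` in its place would be a data race and could silently lose counts.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each thread classifies one input byte. Several threads can target the
// same bin at once, so the increment must be an atomic read-modify-write.
__global__ void histogram_kernel(const unsigned char* data, int n,
                                 unsigned int* bins, int num_bins) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // A plain `bins[...] += 1` here would race with other threads.
        atomicAdd(&bins[data[i] % num_bins], 1u);
    }
}

// Hypothetical host helper: copies data to the device, runs the kernel,
// and returns the bin counts. Error checking omitted for brevity.
std::vector<unsigned int> histogram_gpu(const std::vector<unsigned char>& data,
                                        int num_bins) {
    std::vector<unsigned int> bins(num_bins, 0u);
    int n = static_cast<int>(data.size());
    if (n == 0 || num_bins <= 0) return bins;

    unsigned char* d_data = nullptr;
    unsigned int* d_bins = nullptr;
    cudaMalloc(&d_data, n * sizeof(unsigned char));
    cudaMalloc(&d_bins, num_bins * sizeof(unsigned int));
    cudaMemcpy(d_data, data.data(), n * sizeof(unsigned char),
               cudaMemcpyHostToDevice);
    cudaMemset(d_bins, 0, num_bins * sizeof(unsigned int));

    int threads = 256;                          // assumed block size
    int blocks = (n + threads - 1) / threads;   // round up to cover all inputs
    histogram_kernel<<<blocks, threads>>>(d_data, n, d_bins, num_bins);

    // This copy waits for the kernel to finish before reading the results.
    cudaMemcpy(bins.data(), d_bins, num_bins * sizeof(unsigned int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_bins);
    return bins;
}
```

Calling `histogram_gpu(data, 4)` on a host vector returns one count per bin. With only a handful of bins, most threads contend for the same few addresses, which is the contention cost that atomics carry.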

This final chapter covers the three patterns that solve it and appear in almost every serious CUDA kernel:

The Reduction Problem: why combining values is fundamentally different from mapping them
Tree Reduction in Shared Memory: collapsing N values to one in log N parallel steps
Atomics: race-free read-modify-write on a shared location, and the price of contention
Tiled Matrix Multiplication: shared-memory tiles that reuse each loaded value many times, turning a memory-bound matmul into a fast one
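As a preview of the tree reduction pattern in this list, the sketch below sums an array by halving the number of active threads at each step, so a block of 256 values collapses in 8 steps. It is a simplified illustration, not an optimized kernel: `block_sum`, `sum_gpu`, and the `BLOCK` constant are hypothetical names, the block size is assumed to be a power of two, the per-block partial sums are added on the host, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int BLOCK = 256;  // assumed block size; must be a power of two

// Each block reduces up to BLOCK inputs to a single partial sum.
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float s[BLOCK];
    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Load one element per thread; pad out-of-range slots with 0.
    s[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction: the active half adds in the inactive half, then the
    // stride is halved. The barrier sits outside the `if` so every thread
    // in the block reaches it.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (tid < stride) s[tid] += s[tid + stride];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = s[0];
}

// Hypothetical host helper: one kernel pass, then a short host-side sum
// over the per-block results. Error checking omitted for brevity.
float sum_gpu(const std::vector<float>& host) {
    int n = static_cast<int>(host.size());
    if (n == 0) return 0.0f;
    int blocks = (n + BLOCK - 1) / BLOCK;

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, host.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    block_sum<<<blocks, BLOCK>>>(d_in, d_out, n);

    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), d_out, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float p : partial) total += p;
    return total;
}
```

Finishing on the host keeps the sketch short. For large inputs, the partial sums could instead be reduced by a second kernel launch or combined with `atomicAdd`.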

Together these turn the basics from the first two chapters into kernels that actually compete with library code.
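For a sense of where the tiling pattern leads, here is a minimal tiled matrix multiplication sketch. It assumes square, row-major `n x n` matrices and a 16 x 16 tile; `matmul_tiled`, `matmul_gpu`, and `TILE` are hypothetical names, and error checking is omitted. It illustrates the data reuse idea only and makes no claim to match library performance.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int TILE = 16;  // assumed tile width; the block is TILE x TILE threads

// C = A * B for square n x n row-major matrices.
__global__ void matmul_tiled(const float* A, const float* B, float* C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    // Walk across the shared dimension one tile at a time.
    for (int t = 0; t < (n + TILE - 1) / TILE; ++t) {
        int a_col = t * TILE + threadIdx.x;
        int b_row = t * TILE + threadIdx.y;

        // Each thread loads one element of each tile; out-of-range slots
        // are zero so they contribute nothing to the dot product.
        As[threadIdx.y][threadIdx.x] =
            (row < n && a_col < n) ? A[row * n + a_col] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] =
            (b_row < n && col < n) ? B[b_row * n + col] : 0.0f;
        __syncthreads();  // tiles fully loaded before anyone reads them

        // Every value loaded above is reused by TILE threads.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();  // finish reading before the next load overwrites
    }

    if (row < n && col < n) C[row * n + col] = acc;
}

// Hypothetical host helper. Error checking omitted for brevity.
std::vector<float> matmul_gpu(const std::vector<float>& A,
                              const std::vector<float>& B, int n) {
    std::vector<float> C(static_cast<size_t>(n) * n, 0.0f);
    if (n <= 0) return C;
    size_t bytes = C.size() * sizeof(float);

    float *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, A.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, B.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((n + TILE - 1) / TILE, (n + TILE - 1) / TILE);
    matmul_tiled<<<grid, block>>>(dA, dB, dC, n);

    cudaMemcpy(C.data(), dC, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return C;
}
```

In a naive kernel each thread reads a full row of `A` and a full column of `B` from global memory. Here each global load is shared by `TILE` threads, which cuts global-memory traffic by roughly a factor of `TILE`.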
