Write Fast GPU Code in Python

GPU Basics with Triton Study Plan

An 8-chapter curriculum that takes you from "what is a warp?" to writing fused, tiled GPU kernels in Triton that approach the performance of hand-tuned CUDA, all in Python.

8 Chapters · 8 Weeks · Interactive Demos

Recommended Study Path

Phase 1

Prerequisites

Foundations Study Plan

Python Foundations
NumPy & Arrays
Basic Linear Algebra

Complete the Foundations study plan first →

Phase 2

GPU Fundamentals

Weeks 1-2

Ch 1: Why GPUs?
Ch 2: Architecture & Memory

Warps, SMs, the memory hierarchy

Phase 3

Writing Kernels

Weeks 2-5

Ch 3-4: Triton & Vector Add
Ch 5: Autotuning & Benchmarks

program_id, offsets, masking, BLOCK_SIZE

Phase 4

High Performance

Weeks 5-8

Ch 6-7: Fusion & Matmul
Ch 8: Advanced Kernels

Fused softmax, tiled matmul, attention

Tip: Each chapter includes interactive demos and theory exercises.

Pro chapters (3-8) require a premium subscription.

All Chapters

Why GPUs? The Parallel Computing Mindset

Why GPUs exist, CPU vs GPU design philosophy, latency vs throughput, the SIMT execution model, when parallel hardware wins, and Amdahl's law.

CPUs vs GPUs · Latency vs Throughput · The SIMT Execution Model · +2 more
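Amdahl's law, mentioned in the chapter summary above, can be made concrete with a few lines of Python. This is a minimal sketch, not material from the chapter itself; the function names `amdahl_speedup` and `amdahl_limit` are illustrative assumptions.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Overall speedup when `parallel_fraction` of the runtime is split across `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be in [0, 1]")
    if workers < 1:
        raise ValueError("workers must be >= 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part runs at full cost; only the parallel part shrinks.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    serial_fraction = 1.0 - parallel_fraction
    return float("inf") if serial_fraction == 0.0 else 1.0 / serial_fraction


if __name__ == "__main__":
    for workers in (1, 10, 1000):
        print(workers, round(amdahl_speedup(0.9, workers), 3))
    print("limit", round(amdahl_limit(0.9), 3))
```

Even with thousands of workers, a program that is 90% parallel cannot exceed roughly a 10x speedup, which is why the serial fraction matters when deciding whether parallel hardware will help.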

GPU Architecture & the Memory Hierarchy

Threads, warps, blocks and streaming multiprocessors, the register→SRAM→DRAM hierarchy, memory coalescing, bandwidth vs latency, and the golden rule of keeping data high in the hierarchy.

Threads, Warps & Blocks · Streaming Multiprocessors · The Memory Hierarchy · +2 more
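To illustrate why memory coalescing matters, the sketch below counts how many memory segments a single warp touches for different access strides. It is a simplified model under stated assumptions: 32 threads per warp, 4-byte elements, and 128-byte memory segments. Real hardware behaviour depends on the GPU generation and cache configuration, and `segments_touched` is a hypothetical helper, not a real API.

```python
WARP_SIZE = 32        # threads per warp (assumed, as on NVIDIA GPUs)
ELEMENT_BYTES = 4     # float32 elements (assumed)
SEGMENT_BYTES = 128   # size of one memory transaction (assumed)


def segments_touched(stride_elements: int,
                     warp_size: int = WARP_SIZE,
                     element_bytes: int = ELEMENT_BYTES,
                     segment_bytes: int = SEGMENT_BYTES) -> int:
    """Count distinct memory segments read when lane i loads element i * stride."""
    addresses = [lane * stride_elements * element_bytes for lane in range(warp_size)]
    return len({address // segment_bytes for address in addresses})


if __name__ == "__main__":
    for stride in (1, 2, 32):
        print(f"stride {stride}: {segments_touched(stride)} segment(s)")
```

In this model, a unit-stride access is served by one segment, while a stride of 32 elements forces a separate segment per thread, so the same 128 useful bytes cost 32 times the memory traffic.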

PRO

Introduction to Triton

What Triton is and why it exists, Triton vs CUDA vs PyTorch, the block-oriented programming model, programs, grids and tiles, and the @triton.jit decorator.

What Is Triton? · Triton vs CUDA vs PyTorch · Block-Oriented Programming · +2 more

PRO

Your First Kernel: Vector Addition

Building the canonical Triton kernel step by step: program ids, computing offsets, loading and storing with pointers, masking the ragged tail, choosing BLOCK_SIZE, and launching the grid.

Anatomy of a Triton Kernel · Program IDs & Offsets · Loads, Stores & Masking · +2 more
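The steps listed in the summary above (program ids, offsets, masking, BLOCK_SIZE, and the grid) can be previewed without a GPU. The following is a plain-Python emulation of the index arithmetic only, not Triton code: each call to the hypothetical `add_program` stands in for one program instance, and the `for` loop over `pid` stands in for the grid launch, which on a GPU would run the instances in parallel.

```python
def cdiv(a: int, b: int) -> int:
    """Ceiling division: how many blocks of size b are needed to cover a elements."""
    return (a + b - 1) // b


def add_program(x, y, out, n_elements, pid, block_size):
    """Emulate one program instance handling one block of the vectors."""
    block_start = pid * block_size
    offsets = [block_start + i for i in range(block_size)]
    # The mask disables lanes that would run past the end of the data
    # (the "ragged tail" when n_elements is not a multiple of block_size).
    mask = [offset < n_elements for offset in offsets]
    for offset, in_bounds in zip(offsets, mask):
        if in_bounds:
            out[offset] = x[offset] + y[offset]


def vector_add(x, y, block_size=4):
    """Launch one emulated program per block and return the elementwise sum."""
    n_elements = len(x)
    out = [0] * n_elements
    grid = cdiv(n_elements, block_size)
    for pid in range(grid):  # sequential stand-in for a parallel grid launch
        add_program(x, y, out, n_elements, pid, block_size)
    return out


if __name__ == "__main__":
    print(vector_add(list(range(10)), [100] * 10, block_size=4))
```

With 10 elements and a block size of 4, the grid has 3 programs and the last one has two lanes masked off, which is the case the mask exists to handle.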

PRO

Memory Access, Autotuning & Benchmarking

Writing memory-bound kernels that hit peak bandwidth: coalesced access patterns, autotuning over block sizes and num_warps, benchmarking with triton.testing, and reading a roofline.

Coalesced Memory Access · Autotuning Kernels · Benchmarking with triton.testing · +2 more
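The arithmetic behind "hitting peak bandwidth" and "reading a roofline" can be sketched in plain Python. This is a rough model under stated assumptions, not the chapter's benchmark code: the function names are illustrative, and the peak figures in the usage section are made-up placeholders rather than the specs of any real GPU.

```python
def vector_add_bytes(n_elements: int, element_bytes: int = 4) -> int:
    """Bytes moved by an elementwise add: two input reads plus one output write."""
    return 3 * n_elements * element_bytes


def achieved_bandwidth_gbps(bytes_moved: int, seconds: float) -> float:
    """Effective memory bandwidth in GB/s for a measured kernel time."""
    return bytes_moved / seconds / 1e9


def roofline_gflops(arithmetic_intensity: float,
                    peak_gflops: float,
                    peak_bandwidth_gbps: float) -> float:
    """Attainable throughput: the lower of the compute roof and the memory roof.

    arithmetic_intensity is measured in FLOPs per byte moved.
    """
    return min(peak_gflops, arithmetic_intensity * peak_bandwidth_gbps)


if __name__ == "__main__":
    moved = vector_add_bytes(1_000_000)             # float32 vectors
    print(achieved_bandwidth_gbps(moved, 1e-3))     # hypothetical 1 ms runtime
    # Vector add performs 1 FLOP per 12 bytes, so it sits on the memory roof.
    print(roofline_gflops(1 / 12, peak_gflops=10_000.0, peak_bandwidth_gbps=1_200.0))
```

Comparing the achieved bandwidth against the device's peak shows how close a memory-bound kernel is to its ceiling; for such kernels, reducing bytes moved helps more than reducing arithmetic.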

PRO

Kernel Fusion: Fused Softmax

Why fusion is the central optimization on GPUs: counting DRAM round-trips, the row-wise reduction pattern, numerically stable softmax, keeping a row in SRAM, and the fused-vs-unfused speedup.

The Cost of DRAM Round-Trips · What Is Kernel Fusion? · The Row-Wise Reduction Pattern · +2 more
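The numerically stable, row-wise softmax named above follows a short recipe: reduce the row to its maximum, subtract it, exponentiate, reduce to a sum, and divide. The sketch below shows that recipe on the CPU in plain Python. It illustrates the math only and is not a fused kernel; in a fused GPU version, all of these steps would operate on a row held in on-chip memory.

```python
import math


def softmax_row(row):
    """Numerically stable softmax of one row."""
    row_max = max(row)                                   # reduction 1: row maximum
    exps = [math.exp(value - row_max) for value in row]  # largest exponent is exp(0) = 1
    denominator = sum(exps)                              # reduction 2: row sum
    return [e / denominator for e in exps]


def softmax(matrix):
    """Apply softmax independently to every row of a list-of-lists matrix."""
    return [softmax_row(row) for row in matrix]


if __name__ == "__main__":
    # A direct math.exp(1000.0) would overflow; subtracting the max avoids that.
    print(softmax([[1000.0, 1001.0, 1002.0], [0.0, 0.0, 0.0]]))
```

Subtracting the row maximum does not change the result, because the common factor cancels between numerator and denominator, but it keeps every exponent at or below zero.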

PRO

Matrix Multiplication

The kernel that powers deep learning: blocked/tiled matmul, accumulating in SRAM, the inner-k loop, super-grouping program ids for L2 cache reuse, and approaching cuBLAS performance.

Naive vs Tiled Matmul · Blocking & SRAM Accumulation · The Inner-K Loop · +2 more
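The loop structure of a blocked matmul (one output tile per program, an accumulator that stays local, and an inner loop over K) can be sketched in plain Python. This CPU version is an illustration of the tiling pattern only: the block sizes are arbitrary assumptions, the local `acc` list stands in for an on-chip accumulator, and L2-friendly program ordering is not modelled.

```python
def matmul_naive(a, b):
    """Reference matmul on list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def matmul_tiled(a, b, block_m=2, block_n=2, block_k=2):
    """Blocked matmul: each (m0, n0) pair plays the role of one program instance."""
    m, k = len(a), len(a[0])
    if len(b) != k:
        raise ValueError("inner dimensions must match")
    n = len(b[0])
    c = [[0.0] * n for _ in range(m)]
    for m0 in range(0, m, block_m):
        for n0 in range(0, n, block_n):
            # Clamping the tile size plays the role of masking at the matrix edges.
            rows = min(block_m, m - m0)
            cols = min(block_n, n - n0)
            acc = [[0.0] * cols for _ in range(rows)]  # accumulator tile, kept local
            for k0 in range(0, k, block_k):            # the inner-K loop
                k_end = min(k0 + block_k, k)
                for i in range(rows):
                    for j in range(cols):
                        for kk in range(k0, k_end):
                            acc[i][j] += a[m0 + i][kk] * b[kk][n0 + j]
            for i in range(rows):                      # write the finished tile once
                for j in range(cols):
                    c[m0 + i][n0 + j] = acc[i][j]
    return c


if __name__ == "__main__":
    print(matmul_tiled([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

Each output tile is written to the result exactly once, after all K blocks have been accumulated, which is the property that lets a GPU kernel avoid repeated round-trips to slow memory.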

PRO

Advanced Kernels & Optimization

Putting it together: fused layer normalization, low-memory dropout with seeded RNG, the ideas behind fused attention (FlashAttention), persistent kernels, and where to go next.

Fused Layer Normalization · Low-Memory Dropout · Fused Attention (FlashAttention) · +2 more
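The idea behind low-memory dropout with a seeded RNG is that the keep/drop mask never needs to be stored: given the same seed, it can be regenerated whenever it is needed again, for example in the backward pass. The sketch below shows that idea on the CPU, with Python's `random.Random` standing in for a GPU random number generator. The function names are illustrative assumptions, and this is not the chapter's kernel.

```python
import random


def dropout_mask(seed: int, n: int, p: float):
    """Regenerate the keep mask from the seed alone; True means the element is kept."""
    rng = random.Random(seed)
    return [rng.random() >= p for _ in range(n)]


def seeded_dropout(x, p: float, seed: int):
    """Zero each element with probability p and rescale survivors by 1 / (1 - p)."""
    if not 0.0 <= p < 1.0:
        raise ValueError("p must be in [0, 1)")
    keep = dropout_mask(seed, len(x), p)
    scale = 1.0 / (1.0 - p)
    return [value * scale if kept else 0.0 for value, kept in zip(x, keep)]


if __name__ == "__main__":
    x = [1.0] * 8
    first = seeded_dropout(x, p=0.5, seed=123)
    second = seeded_dropout(x, p=0.5, seed=123)
    print(first == second)  # the same seed reproduces the same mask
```

Only the integer seed has to be kept between passes, instead of a mask as large as the input tensor.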

Practice Problem Sets

Sharpen your skills with coding challenges and system design problems.

This curriculum is designed to take you from GPU fundamentals to writing high-performance Triton kernels that approach cuBLAS-level speed.
