Learn by Interaction

Reinforcement Learning Study Plan

An 8-chapter journey from the foundations of sequential decision-making to modern deep RL. Master MDPs, dynamic programming, Monte Carlo and TD learning, Q-learning, DQN, policy gradients, and actor-critic methods like PPO.

8 Chapters | 8 Weeks | Math Intuition & Examples

Recommended Study Path

Phase 1

Foundations

Weeks 1-3

Ch 1: Introduction to RL
Ch 2: Markov Decision Processes
Ch 3: Dynamic Programming

The vocabulary, the MDP formalism, and planning with a known model

Phase 2

Model-Free Learning

Weeks 4-5

Ch 4: Monte Carlo & TD
Ch 5: Q-Learning & SARSA

Learning value functions directly from experience

Phase 3

Deep RL

Week 6

Ch 6: Deep Q-Networks (DQN)

Function approximation, experience replay, target networks

Phase 4

Policy Optimization

Weeks 7-8

Ch 7: Policy Gradients
Ch 8: Actor-Critic & PPO

Optimizing policies directly, up to modern PPO

Tip: RL builds on probability and gradients, so the Foundations and ML plans are useful prerequisites.

Pro chapters (4-8) require a premium subscription.

All Chapters

Introduction to Reinforcement Learning

Learning from interaction, the agent-environment loop, rewards and returns, policies and value functions, and the exploration-exploitation tradeoff.

What Is Reinforcement Learning? | The Agent-Environment Interface | Rewards and Returns | +2 more
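As a small illustration of the "rewards and returns" topic, the sketch below computes a discounted return for one episode. The function name `discounted_return`, the reward list, and the discount factor are illustrative assumptions, not material taken from the chapter itself.

```python
def discounted_return(rewards, gamma=0.9):
    """Discounted return G = r_1 + gamma*r_2 + gamma^2*r_3 + ... for one episode.

    `rewards` and `gamma` are illustrative inputs (assumptions for this sketch).
    """
    g = 0.0
    # Work backwards so each step applies G_t = r_{t+1} + gamma * G_{t+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


rewards = [1.0, 0.0, 2.0]
print(discounted_return(rewards, gamma=0.5))  # 1 + 0.5*0 + 0.25*2 = 1.5
```

A smaller gamma makes the agent weigh near-term rewards more heavily; a gamma close to 1 makes distant rewards count almost as much as immediate ones.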

Coming Soon

Markov Decision Processes

Formalizing sequential decisions: states, actions, transition dynamics, the Markov property, return, and the Bellman equations.

The Markov Property | MDP Definition | Transition Dynamics | +2 more

Coming Soon

Dynamic Programming

Solving known MDPs with planning: policy evaluation, policy improvement, policy iteration, value iteration, and generalized policy iteration.

Policy Evaluation | Policy Improvement | Policy Iteration | +2 more
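To make "planning with a known model" concrete, here is a minimal value iteration sketch on a hypothetical two-state MDP. The transition table `P`, the state and action names, and the discount factor are all invented for illustration; the chapter may use different notation and examples.

```python
def q_value(P, V, s, a, gamma):
    """Expected one-step return of taking action a in state s under values V."""
    return sum(p * (r + gamma * V[s2]) for p, s2, r in P[s][a])


def value_iteration(P, gamma=0.9, tol=1e-8):
    """Repeat Bellman optimality backups until the values stop changing."""
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            best = max(q_value(P, V, s, a, gamma) for a in P[s])
            delta = max(delta, abs(best - V[s]))
            V[s] = best
        if delta < tol:
            break
    # Read off a greedy policy from the converged values.
    policy = {s: max(P[s], key=lambda a: q_value(P, V, s, a, gamma)) for s in P}
    return V, policy


# Hypothetical MDP: P[state][action] = [(probability, next_state, reward), ...]
P = {
    "A": {"stay": [(1.0, "A", 0.0)], "go": [(1.0, "B", 1.0)]},
    "B": {"stay": [(1.0, "B", 2.0)], "go": [(1.0, "A", 0.0)]},
}
V, policy = value_iteration(P, gamma=0.5)
print(V, policy)
```

In this toy problem, staying in B forever is worth 2 / (1 - 0.5) = 4, so the greedy policy moves from A to B and then stays there.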

Coming Soon

Monte Carlo & Temporal-Difference Learning

Learning from experience without a model: Monte Carlo prediction, TD(0), bootstrapping, and the bias-variance tradeoff between Monte Carlo and TD methods.

Monte Carlo Prediction | Monte Carlo Control | TD(0) Prediction | +2 more

Coming Soon

Q-Learning & SARSA

Model-free control with action-value methods: SARSA (on-policy), Q-learning (off-policy), epsilon-greedy exploration, and convergence.

Action-Value Methods | SARSA (On-Policy) | Q-Learning (Off-Policy) | +2 more
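The on-policy versus off-policy distinction comes down to which next-state value goes into the update target. The sketch below contrasts the two targets on a hypothetical tabular Q function; the table, state names, and step size are assumptions made up for this example.

```python
def sarsa_target(Q, reward, next_state, next_action, gamma):
    """On-policy target: uses the action the agent actually takes next."""
    return reward + gamma * Q[next_state][next_action]


def q_learning_target(Q, reward, next_state, gamma):
    """Off-policy target: uses the greedy (max-value) next action."""
    return reward + gamma * max(Q[next_state].values())


def td_update(Q, state, action, target, alpha):
    """Move Q(state, action) a fraction alpha toward the target."""
    Q[state][action] += alpha * (target - Q[state][action])


# Hypothetical Q-table with two states and two actions.
Q = {
    "s": {"left": 0.0, "right": 0.0},
    "s2": {"left": 1.0, "right": 3.0},
}

# Suppose the agent took "right" in "s", got reward 1.0, landed in "s2",
# and its exploratory policy then picked "left".
print(sarsa_target(Q, 1.0, "s2", "left", gamma=0.5))  # 1 + 0.5*1 = 1.5
print(q_learning_target(Q, 1.0, "s2", gamma=0.5))     # 1 + 0.5*3 = 2.5

td_update(Q, "s", "right", q_learning_target(Q, 1.0, "s2", gamma=0.5), alpha=0.5)
print(Q["s"]["right"])  # 0 + 0.5 * (2.5 - 0) = 1.25
```

The two targets coincide whenever the next action happens to be the greedy one, which is why the methods differ only while the agent is exploring.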

Coming Soon

Deep Q-Networks (DQN)

Scaling RL with function approximation: neural network value functions, experience replay, target networks, and DQN improvements.

Function Approximation | The DQN Architecture | Experience Replay | +2 more

Coming Soon

Policy Gradient Methods

Optimizing policies directly: the policy gradient theorem, REINFORCE, baselines and variance reduction, and continuous action spaces.

Why Policy Gradients | The Policy Gradient Theorem | REINFORCE | +2 more

Coming Soon

Actor-Critic & PPO

Combining value and policy learning: actor-critic architectures, advantage estimation (GAE), A2C, and Proximal Policy Optimization.

Actor-Critic Architecture | Advantage Estimation (GAE) | A2C / A3C | +2 more
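For a feel of what advantage estimation involves, the following is a minimal sketch of generalized advantage estimation (GAE) over a single trajectory. It assumes the trajectory contains no episode terminations, and the function name `gae`, its arguments, and the sample numbers are illustrative rather than taken from the chapter.

```python
def gae(rewards, values, last_value, gamma=0.99, lam=0.95):
    """Generalized advantage estimates for one trajectory.

    Assumption for this sketch: no terminal states inside the trajectory,
    so every step bootstraps from the next state's value estimate.
    `values[t]` is the critic's estimate V(s_t); `last_value` is V(s_T).
    """
    advantages = [0.0] * len(rewards)
    next_value = last_value
    running = 0.0
    for t in reversed(range(len(rewards))):
        # One-step TD error: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * next_value - values[t]
        # Recursive form: A_t = delta_t + gamma * lam * A_{t+1}
        running = delta + gamma * lam * running
        advantages[t] = running
        next_value = values[t]
    return advantages


rewards = [1.0, 1.0]
values = [0.0, 0.0]
print(gae(rewards, values, last_value=0.0, gamma=1.0, lam=1.0))  # [2.0, 1.0]
print(gae(rewards, values, last_value=0.0, gamma=1.0, lam=0.0))  # [1.0, 1.0]
```

With lam = 1 the estimate sums every future TD error along the trajectory (low bias, higher variance), while lam = 0 keeps only the one-step TD error (higher bias, lower variance).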

Curriculum inspired by Sutton & Barto's Reinforcement Learning: An Introduction, taking you from RL foundations to modern deep reinforcement learning.