Week 1-2

Chapter 1: Introduction to LLMs

Understand what Large Language Models are, how they evolved from simple n-gram models to billion-parameter transformers, and why scaling changed everything. Explore key architectures, real-world applications, and the fundamental limitations that shape modern AI research.

Chapter Overview

Large Language Models (LLMs) represent a paradigm shift in artificial intelligence. Rather than hand-coding rules for language understanding, we train massive neural networks on vast corpora of text, and they learn to generate, translate, summarize, and reason about language with remarkable fluency.

The story of LLMs is one of scale. Early language models used statistical methods like n-grams and hidden Markov models. The introduction of neural language models (Bengio et al., 2003), followed by recurrent architectures (LSTMs, GRUs), and ultimately the Transformer (Vaswani et al., 2017) set the stage. But it was the discovery that scaling up model size, data, and compute yields predictable reductions in loss (the so-called scaling laws) that ignited the modern LLM revolution.

Today, models like GPT-4, Claude, Llama, and Gemini power applications from code generation to medical diagnosis. Understanding how these systems work, what they can and cannot do, and where the field is heading is essential for any ML practitioner.

This chapter covers:

What are LLMs? Defining the class of models and what makes them "large"
History of Language Models: From n-grams to transformers
Key Architectures: Encoder-only, decoder-only, and encoder-decoder designs
Scaling Laws: How performance improves predictably with scale
Applications: Real-world uses across industries
Limitations: Hallucinations, reasoning gaps, and alignment challenges

Chapter Roadmap

What Are LLMs

Language modeling as next-token prediction — how a simple objective produces emergent intelligence at scale.

Topics: Language Modeling as Next-Token Prediction; What Makes an LLM 'Large'; Emergent Capabilities; Foundation Models

Evolution and divergence

How language models evolved into distinct architectural families

History of Language Models

From n-gram counting to neural nets to transformers — each era solved the previous era's bottleneck.

Topics: N-gram Models; Neural Language Models; Recurrent Neural Networks (RNNs & LSTMs); The Transformer Revolution (2017)

Key Architectures

Encoder-only, decoder-only, encoder-decoder, and MoE — the design space of modern LLMs.

Topics: Encoder-Only (BERT-style); Decoder-Only (GPT-style); Encoder-Decoder (T5-style); Mixture of Experts (MoE)

Quantifying progress

Scaling Laws

Power-law relationships between model size, data, compute, and loss — the physics of training LLMs.

Topics: Kaplan Scaling Laws (OpenAI, 2020); Chinchilla Scaling Laws (DeepMind, 2022); Compute-Optimal Training; Inference-Time Scaling

Where LLMs work and where they fail

Real-world impact and fundamental constraints

Applications

Code generation, conversation, extraction, and scientific discovery — where LLMs deliver real value.

Topics: Code Generation & Software Engineering; Conversational AI & Assistants; Information Extraction & Summarization; Scientific Research & Discovery

Limitations

Hallucination, reasoning gaps, knowledge cutoff, and bias — the boundaries of current LLM capabilities.

Topics: Hallucination; Reasoning Limitations; Knowledge Cutoff & Staleness; Bias & Safety

A Large Language Model is a neural network trained on massive text data to predict the next token in a sequence. The "large" refers to both parameter count (billions) and training data (trillions of tokens). Despite this simple objective, LLMs develop emergent capabilities like translation, summarization, and code generation.

Key insight: LLMs are fundamentally next-token predictors. All their impressive capabilities arise from learning statistical patterns in language at an enormous scale.

In this topic

1. Language Modeling as Next-Token Prediction
2. What Makes an LLM 'Large'
3. Emergent Capabilities
4. Foundation Models

Language Modeling as Next-Token Prediction

$P(x_t \mid x_1, x_2, \ldots, x_{t-1}) = \text{softmax}(W h_t)$

At its core, a language model estimates the probability of the next token given all previous tokens. The model processes the context through layers of computation to produce a hidden state $h_t$, which is projected to a vocabulary-sized vector and passed through softmax to get probabilities. Training maximizes the likelihood of the actual next token across billions of examples.

Mathematical Intuition

The chain rule decomposes $P(x_1, \ldots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_{<t})$, meaning any joint distribution over sequences can be modeled autoregressively. The softmax output $P(x_t \mid x_{<t}) = \text{softmax}(W h_t)$ maps a $d$-dimensional hidden state to a point on the probability simplex over the $V$ vocabulary tokens. Training minimizes cross-entropy: $\mathcal{L} = -\sum_t \log P(x_t \mid x_{<t})$. In expectation over the data, this is equivalent to minimizing the KL divergence from the true data distribution to the model, because the two quantities differ only by the entropy of the data, which does not depend on the model parameters.
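
The chain-rule factorization and the cross-entropy loss can be made concrete with a very small sketch. The bigram table below is a hypothetical stand-in for a trained model (its probabilities are invented for illustration, and a real LLM conditions on the whole prefix rather than only the previous token); the point is only to show that the sequence log-probability is a sum of per-token log-probabilities and that the loss is its negative.

```python
import math

# Hypothetical toy autoregressive model: the next-token distribution depends
# only on the previous token. "<s>" marks the start of the sequence.
BIGRAM = {
    "<s>": {"the": 0.6, "a": 0.4},
    "the": {"cat": 0.5, "dog": 0.5},
    "a": {"cat": 0.3, "dog": 0.7},
    "cat": {"sat": 1.0},
    "dog": {"sat": 1.0},
}

def sequence_log_prob(tokens):
    """Chain rule: log P(x_1..x_T) = sum over t of log P(x_t | x_<t)."""
    total = 0.0
    prev = "<s>"
    for tok in tokens:
        total += math.log(BIGRAM[prev][tok])
        prev = tok
    return total

def cross_entropy(tokens):
    """Negative log-likelihood of the sequence, in nats."""
    return -sequence_log_prob(tokens)

seq = ["the", "cat", "sat"]
joint = math.exp(sequence_log_prob(seq))  # 0.6 * 0.5 * 1.0
loss = cross_entropy(seq)                 # -ln(0.3)
print(f"P(sequence) = {joint:.2f}, loss = {loss:.3f} nats")
```

Lowering the loss on a sequence is the same as raising the probability the model assigns to it, which is why likelihood maximization and cross-entropy minimization are used interchangeably.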

Example:

Given the context "The capital of France is", what does the LLM compute?
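
For the prompt above, the final step of the computation can be sketched as follows. Everything here is an illustrative assumption: the five-word vocabulary, the hidden size of 4, and the random values standing in for the learned matrix $W$ and the hidden state $h_t$. With random weights the ranking is meaningless; a trained model would be expected to put most of the probability mass on "Paris".

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def project(W, h):
    """Compute W h: one logit per vocabulary entry."""
    return [sum(w * x for w, x in zip(row, h)) for row in W]

# Hypothetical tiny vocabulary and hidden size, for illustration only.
vocab = ["Paris", "London", "Berlin", "the", "banana"]
d = 4

random.seed(0)
W = [[random.gauss(0, 1) for _ in range(d)] for _ in vocab]  # V x d, random stand-in
h_t = [random.gauss(0, 1) for _ in range(d)]                 # stand-in for the context encoding

logits = project(W, h_t)   # V-dimensional vector of scores
probs = softmax(logits)    # P(next token | context)

for tok, p in sorted(zip(vocab, probs), key=lambda kv: -kv[1]):
    print(f"{tok:>7s}  {p:.3f}")
```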

What Makes an LLM 'Large'

Three dimensions of scale define LLMs: (1) Parameter count, from about a hundred million (GPT-1, 117M) to over a trillion (Switch Transformer, a sparse mixture-of-experts model). (2) Training data: trillions of tokens from web crawls, books, and code. (3) Compute: thousands of GPUs for weeks or months. The interplay of these three factors determines model capability.

Mathematical Intuition

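The parameter count $N$ typically scales as $N \approx 12 L d^2$ for a transformer with $L$ layers and model dimension $d$ (about $4d^2$ per layer for attention and $8d^2$ for the feed-forward block, ignoring embeddings), so doubling $d$ quadruples the parameter count. The training compute budget follows $C \approx 6ND$ FLOPs, where $D$ is the number of training tokens. A 70B model trained on 2T tokens requires approximately $6 \times 70 \times 10^9 \times 2 \times 10^{12} = 8.4 \times 10^{23}$ FLOPs, which is on the order of hundreds of GPU-years on A100-class hardware at realistic utilization.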

Example:

GPT-3 has 175B parameters trained on 300B tokens. Llama 2 has 70B parameters trained on 2T tokens. Which factors differ?
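
One way to compare the two models is to plug the numbers from the text into the approximations $N \approx 12 L d^2$ and $C \approx 6ND$. The sketch below does that; the function names are illustrative, and the GPT-3 shape used in the last line (96 layers, $d = 12288$) is taken from its published configuration rather than from this chapter.

```python
def transformer_params(num_layers, d_model):
    """Approximate non-embedding parameter count, N ~ 12 * L * d^2."""
    return 12 * num_layers * d_model ** 2

def training_flops(num_params, num_tokens):
    """Approximate training compute, C ~ 6 * N * D."""
    return 6 * num_params * num_tokens

# Parameter and token counts quoted in the text.
models = {
    "GPT-3": {"params": 175 * 10**9, "tokens": 300 * 10**9},
    "Llama 2 70B": {"params": 70 * 10**9, "tokens": 2 * 10**12},
}

for name, m in models.items():
    flops = training_flops(m["params"], m["tokens"])
    ratio = m["tokens"] / m["params"]
    print(f"{name}: {flops:.2e} FLOPs, {ratio:.1f} tokens per parameter")

# Sanity check of N ~ 12 L d^2 against GPT-3's published shape.
print(f"12 * L * d^2 = {transformer_params(96, 12288):.3e}")
```

Under these approximations GPT-3 comes out at about $3.15 \times 10^{23}$ FLOPs and Llama 2 70B at about $8.4 \times 10^{23}$ FLOPs: the smaller model used roughly 2.7 times more training compute because it saw far more tokens per parameter.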

Emergent Capabilities

As LLMs scale, they develop capabilities not explicitly trained for. Below certain scale thresholds, performance on tasks like arithmetic, translation, or code generation is near zero. Above the threshold, performance jumps dramatically. These emergent abilities include in-context learning, chain-of-thought reasoning, and instruction following.

Mathematical Intuition

Emergence can be described as a phase transition in performance: below a threshold scale $N^*$, task accuracy $A(N)$ stays near chance, and above it $A(N)$ rises sharply. Some researchers argue this is an artifact of nonlinear or discontinuous evaluation metrics such as exact match: under continuous metrics such as per-token log-likelihood, performance often improves smoothly with scale. The debate centers on whether $A(N)$ is truly discontinuous or only appears so under certain metrics.

Example:

A 1B parameter model cannot do 3-digit addition. A 100B model can. Why?
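
The metric argument can be illustrated with a toy calculation. The per-digit accuracy curve below is entirely hypothetical (a made-up power law, not measured data), and digit errors are assumed to be independent. Under those assumptions, exact-match accuracy on an answer with $k$ digits is the per-digit accuracy raised to the power $k$, so it stays near zero at small scale and rises later and more steeply than the per-digit score.

```python
def per_digit_accuracy(num_params):
    """Hypothetical smooth curve: per-digit error falls as a power law in N."""
    error = 0.9 * (num_params / 1e8) ** -0.4
    return 1.0 - min(error, 0.9)  # floor at 10%, the chance rate for one digit

def exact_match(num_params, num_digits=4):
    """Whole answer correct; assumes independent errors across digits.

    The sum of two 3-digit numbers has up to 4 digits, hence the default.
    """
    return per_digit_accuracy(num_params) ** num_digits

for n in [1e8, 1e9, 1e10, 1e11]:
    print(
        f"N={n:.0e}  per-digit={per_digit_accuracy(n):.3f}  "
        f"exact-match={exact_match(n):.3f}"
    )
```

The same underlying improvement looks gradual under the per-digit metric and much more abrupt under exact match, which is the core of the claim that some apparent emergence comes from the choice of metric.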

Foundation Models

LLMs are a type of foundation model: a large model trained on broad data that can be adapted to many downstream tasks. The term, coined in 2021 by researchers at Stanford's Center for Research on Foundation Models (part of Stanford HAI), emphasizes that one pretrained model serves as the foundation for countless applications, from chatbots to code completion to scientific discovery.

Mathematical Intuition

Transfer learning can be read through the identity $P_{\text{task}}(y \mid x) = P_{\text{pretrained}}(y \mid x) \cdot \frac{P_{\text{task}}(y \mid x)}{P_{\text{pretrained}}(y \mid x)}$. The pretrained model already supplies the first factor, so fine-tuning only has to learn the ratio term, which typically takes far fewer examples than learning $P_{\text{task}}$ from scratch. LoRA approximates the update to a $d \times d$ weight matrix as $\Delta W = BA$, where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$ with rank $r \ll d$, reducing trainable parameters per matrix from $d^2$ to $2dr$.

Example:

Why is it more efficient to fine-tune a foundation model than train from scratch for each task?
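
Part of the answer shows up in a simple parameter count for the low-rank update $\Delta W = BA$. The sketch below is a minimal illustration in plain Python, not a LoRA implementation: the sizes `d = 4096` and `r = 8` are assumed values chosen for the example, and the small random matrices only demonstrate the shapes involved.

```python
import random

def matmul(X, Y):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*Y)] for row in X]

def full_param_count(d):
    """Trainable parameters when updating a d x d matrix directly."""
    return d * d

def lora_param_count(d, r):
    """Trainable parameters for B (d x r) plus A (r x d)."""
    return 2 * d * r

# Shape demo on a tiny example. Values are random for illustration;
# LoRA as published initializes B to zeros so the update starts at 0.
random.seed(0)
d_small, r_small = 6, 2
B = [[random.gauss(0, 0.1) for _ in range(r_small)] for _ in range(d_small)]
A = [[random.gauss(0, 0.1) for _ in range(d_small)] for _ in range(r_small)]
delta_W = matmul(B, A)  # d x d update with rank at most r
print(f"delta_W shape: {len(delta_W)} x {len(delta_W[0])}")

# Assumed realistic sizes for one weight matrix.
d, r = 4096, 8
full = full_param_count(d)
lora = lora_param_count(d, r)
print(f"full: {full:,}  LoRA: {lora:,}  reduction: {full // lora}x")
```

With these assumed sizes the update needs 65,536 trainable parameters instead of 16,777,216, a 256-fold reduction for that matrix.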

Theory Exercise

Problem:

Explain why next-token prediction, a seemingly simple objective, leads to models that can perform complex tasks like writing code or solving math problems.

Hints:

Think about what a model must understand to predict the next token accurately
Consider the diversity of text in training data
Think about what it means to predict the next token in a math proof

Papers

Attention Is All You Need (Vaswani et al., 2017)
Language Models are Few-Shot Learners (Brown et al., 2020; GPT-3)

Blogs

The Illustrated Transformer, Jay Alammar
What Are Large Language Models, Hugging Face
