Understand what Large Language Models are, how they evolved from simple n-gram models to billion-parameter transformers, and why scaling changed everything. Explore key architectures, real-world applications, and the fundamental limitations that shape modern AI research.
Large Language Models (LLMs) represent a paradigm shift in artificial intelligence. Rather than hand-coding rules for language understanding, we train massive neural networks on vast corpora of text, and they learn to generate, translate, summarize, and reason about language with remarkable fluency.
The story of LLMs is one of scale. Early language models used statistical methods like n-grams and hidden Markov models. The introduction of neural language models (Bengio et al., 2003), followed by recurrent architectures (LSTMs, GRUs), and ultimately the Transformer (Vaswani et al., 2017) set the stage. But it was the discovery that simply scaling up model size, data, and compute leads to predictable capability gains---the so-called scaling laws---that ignited the modern LLM revolution.
Today, models like GPT-4, Claude, Llama, and Gemini power applications from code generation to medical diagnosis. Understanding how these systems work, what they can and cannot do, and where the field is heading is essential for any ML practitioner.
This chapter covers:
Language modeling as next-token prediction — how a simple objective produces emergent intelligence at scale.
How language models evolved into distinct architectural families
From n-gram counting to neural nets to transformers — each era solved the previous era's bottleneck.
Encoder-only, decoder-only, encoder-decoder, and MoE — the design space of modern LLMs.
Power-law relationships between model size, data, compute, and loss — the physics of training LLMs.
Real-world impact and fundamental constraints
Code generation, conversation, extraction, and scientific discovery — where LLMs deliver real value.
Hallucination, reasoning gaps, knowledge cutoff, and bias — the boundaries of current LLM capabilities.
A Large Language Model is a neural network trained on massive text data to predict the next token in a sequence. The "large" refers to both parameter count (billions) and training data (trillions of tokens). Despite this simple objective, LLMs develop emergent capabilities like translation, summarization, and code generation.
Key insight: LLMs are fundamentally next-token predictors. All their impressive capabilities arise from learning statistical patterns in language at an enormous scale.
At its core, a language model estimates the probability of the next token given all previous tokens, $P(x_t \mid x_1, \dots, x_{t-1})$. The model processes the context through layers of computation to produce a hidden state $h_t$, which is projected to a vocabulary-sized vector and passed through softmax to get probabilities. Training maximizes the likelihood of the actual next token across billions of examples.
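The chain rule decomposes $P(x_1, \dots, x_T) = \prod_{t=1}^{T} P(x_t \mid x_1, \dots, x_{t-1})$, meaning any joint distribution over sequences can be modeled autoregressively. The softmax output maps a $d$-dimensional hidden state to a point on the probability simplex over the $|V|$ tokens of the vocabulary. Training minimizes cross-entropy: $\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log P_\theta(x_t \mid x_1, \dots, x_{t-1})$, which in expectation over the data is equivalent to minimizing KL divergence from the true data distribution.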
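The softmax and cross-entropy steps described above can be sketched in a few lines of plain Python. This is a minimal illustration, not a real model: the four-token vocabulary and the logit values are made up for the example, and the names `softmax` and `next_token_loss` are hypothetical helpers.

```python
import math

def softmax(logits):
    """Turn raw scores (logits) into probabilities that sum to 1."""
    m = max(logits)  # subtracting the max keeps exp() numerically stable
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def next_token_loss(logits, target_index):
    """Cross-entropy at one position: -log P(target token | context)."""
    probs = softmax(logits)
    return -math.log(probs[target_index])

# Hypothetical toy vocabulary and logits for the context
# "The capital of France is" (values invented for illustration).
vocab = ["Paris", "London", "pizza", "the"]
logits = [5.0, 2.0, -1.0, 0.5]

probs = softmax(logits)
loss = next_token_loss(logits, vocab.index("Paris"))

for token, p in zip(vocab, probs):
    print(f"{token:>7}: {p:.3f}")
print(f"loss when the true next token is 'Paris': {loss:.3f}")
```

Training lowers this loss by pushing probability mass toward the token that actually came next; summing it over every position of every training sequence gives the cross-entropy objective.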
Given the context "The capital of France is", what does the LLM compute?
Three dimensions of scale define LLMs: (1) Parameter count --- from millions (GPT-1, 117M) to trillions (Switch Transformer). (2) Training data --- trillions of tokens from web crawls, books, code. (3) Compute --- thousands of GPUs for weeks or months. The interplay of these three factors determines model capability.
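The parameter count typically scales as $N \approx 12 L d^2$ for a transformer with $L$ layers and dimension $d$ (ignoring embeddings). Doubling $d$ quadruples parameters. The compute budget follows $C \approx 6ND$ FLOPs, where $D$ is the number of training tokens. A 70B model trained on 2T tokens requires approximately $6 \times (7 \times 10^{10}) \times (2 \times 10^{12}) = 8.4 \times 10^{23}$ FLOPs, on the order of hundreds of GPU-years on A100-class hardware.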
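The arithmetic can be checked with a short script. It assumes the common rules of thumb of roughly 12·L·d² non-embedding parameters and roughly 6·N·D training FLOPs; the function names and the 32-layer, 4096-dimension configuration are hypothetical and do not describe any specific released model.

```python
def transformer_params(n_layers, d_model):
    """Rule-of-thumb non-embedding parameter count: N ~ 12 * L * d^2."""
    return 12 * n_layers * d_model ** 2

def training_flops(n_params, n_tokens):
    """Rule-of-thumb training compute: C ~ 6 * N * D."""
    return 6 * n_params * n_tokens

# Hypothetical configuration, chosen only to illustrate the formula.
base = transformer_params(n_layers=32, d_model=4096)
wide = transformer_params(n_layers=32, d_model=8192)
print(f"L=32, d=4096: {base / 1e9:.2f}B parameters")
print(f"doubling d multiplies parameters by {wide / base:.0f}x")

# The 70B-parameter, 2T-token example from the text.
flops_70b = training_flops(70e9, 2e12)
print(f"70B params x 2T tokens: {flops_70b:.2e} FLOPs")
```

Both formulas are approximations: they ignore embedding parameters and attention FLOPs that grow with sequence length, so treat the results as order-of-magnitude estimates.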
GPT-3 has 175B parameters trained on 300B tokens. Llama 2 has 70B parameters trained on 2T tokens. Which factors differ?
As LLMs scale, they develop capabilities not explicitly trained for. Below certain scale thresholds, performance on tasks like arithmetic, translation, or code generation is near zero. Above the threshold, performance jumps dramatically. These emergent abilities include in-context learning, chain-of-thought reasoning, and instruction following.
Emergence can be formalized as a phase transition in performance: below a threshold scale $N^*$, task accuracy $A(N) \approx 0$, and above it accuracy jumps sharply. Some researchers argue this is an artifact of nonlinear evaluation metrics: under log-linear metrics, performance often improves smoothly. The debate centers on whether $A(N)$ is truly discontinuous or just appears so under certain metrics.
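A toy calculation shows how the choice of metric alone can produce an apparent jump. The sketch below is purely illustrative and uses no real model data: it assumes per-token accuracy rises linearly with the log of model size, and that an answer counts as correct only if all 10 of its tokens are right, with token errors treated as independent. Both assumptions and all names are hypothetical.

```python
def per_token_accuracy(log10_params):
    """Hypothetical smooth curve: 0.50 at 1e8 params, rising 0.12 per decade."""
    return 0.5 + 0.12 * (log10_params - 8)

def exact_match_accuracy(p_token, answer_length):
    """Probability that every token is right, assuming independent errors."""
    return p_token ** answer_length

ANSWER_LENGTH = 10            # hypothetical answer length in tokens
scales = [8, 9, 10, 11, 12]   # log10 of parameter count

per_token = [per_token_accuracy(s) for s in scales]
exact_match = [exact_match_accuracy(p, ANSWER_LENGTH) for p in per_token]

print("log10(N)  per-token  exact-match")
for s, p, em in zip(scales, per_token, exact_match):
    print(f"{s:>8}  {p:>9.2f}  {em:>11.4f}")
```

Under these assumptions the per-token column climbs in equal steps while the exact-match column stays near zero for the smaller models and then rises steeply, which is the pattern the metric-artifact argument points to. It does not settle whether real emergent abilities are artifacts.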
A 1B parameter model cannot do 3-digit addition. A 100B model can. Why?
LLMs are a type of foundation model: a large model trained on broad data that can be adapted to many downstream tasks. The term (coined in 2021 by researchers at Stanford's Center for Research on Foundation Models, part of Stanford HAI) emphasizes that one pretrained model serves as the foundation for countless applications, from chatbots to code completion to scientific discovery.
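Transfer learning exploits the factorization $P_{\text{task}}(x) = P_{\text{pretrain}}(x) \cdot \frac{P_{\text{task}}(x)}{P_{\text{pretrain}}(x)}$. Fine-tuning adjusts the ratio term with far fewer examples than learning from scratch. LoRA approximates the weight update as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$, $A \in \mathbb{R}^{r \times k}$, with rank $r \ll \min(d, k)$, reducing trainable parameters from $dk$ to $r(d + k)$.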
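The parameter saving from a low-rank update can be made concrete with a small sketch. It uses plain Python lists instead of a tensor library; the layer sizes, the rank, and the helper names are hypothetical, and the zero initialization of one factor follows the usual LoRA convention so that the update starts at zero.

```python
import random

def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][t] * b[t][j] for t in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]

def lora_param_counts(d, k, r):
    """Trainable parameters: full update (d*k) vs. low-rank update r*(d+k)."""
    return d * k, r * (d + k)

# Tiny hypothetical layer: W is d x k, the update is B (d x r) times A (r x k).
d, k, r = 6, 4, 2
rng = random.Random(0)
W = [[rng.gauss(0, 1) for _ in range(k)] for _ in range(d)]     # frozen weights
B = [[0.0] * r for _ in range(d)]                               # starts at zero
A = [[rng.gauss(0, 0.01) for _ in range(k)] for _ in range(r)]  # small random init

delta_W = matmul(B, A)  # d x k update with rank at most r
W_adapted = [[W[i][j] + delta_W[i][j] for j in range(k)] for i in range(d)]

# Parameter counts for a hypothetical 4096 x 4096 projection at rank 8.
full, lora = lora_param_counts(4096, 4096, 8)
print(f"full update: {full:,} trainable parameters")
print(f"rank-8 LoRA: {lora:,} trainable parameters ({full // lora}x fewer)")
```

Because one factor starts at zero, the adapted layer initially behaves exactly like the pretrained one. Only the two small factors are trained, and the pretrained weights stay frozen.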
Why is it more efficient to fine-tune a foundation model than train from scratch for each task?
Explain why next-token prediction, a seemingly simple objective, leads to models that can perform complex tasks like writing code or solving math problems.