Parameter-Efficient Fine-Tuning

Full fine-tuning of a 7B-parameter VLM with Adam in mixed precision requires roughly 16 bytes of GPU memory per parameter: the parameters themselves (bf16, 14GB), their gradients (bf16, 14GB), and the optimizer states (fp32 master weights, momentum, and variance at 28GB each, 84GB in total). The total is 112GB before activations are counted, which exceeds the capacity of a single 80GB GPU and forces a multi-GPU setup. For a 70B model, the numbers are 10x larger. This memory barrier makes full fine-tuning impractical for most research labs and for production teams that need to adapt VLMs to dozens of tasks.
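The arithmetic behind the 112GB figure can be reproduced with a short sketch. The function name `full_finetune_memory_gb` and the dictionary keys are illustrative choices, not part of any library. The byte counts assume bf16 weights and gradients plus fp32 Adam states, and activation memory is deliberately left out.

```python
def full_finetune_memory_gb(n_params: float) -> dict:
    """Per-component memory in decimal GB for mixed-precision Adam training.

    Assumes bf16 weights and gradients (2 bytes each) and three fp32
    optimizer tensors (4 bytes each). Activations are not included.
    """
    bytes_per_param = {
        "weights_bf16": 2,
        "gradients_bf16": 2,
        "master_weights_fp32": 4,
        "adam_momentum_fp32": 4,
        "adam_variance_fp32": 4,
    }
    breakdown = {name: n_params * b / 1e9 for name, b in bytes_per_param.items()}
    breakdown["total"] = sum(breakdown.values())
    return breakdown


budget = full_finetune_memory_gb(7e9)
for name, gb in budget.items():
    print(f"{name:>20}: {gb:7.1f} GB")
```

Calling the same function with `70e9` scales every line by 10, which is the 70B case mentioned above.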

Parameter-Efficient Fine-Tuning (PEFT) methods address this by training only a small number of additional parameters while keeping the pretrained weights frozen. The key insight is that the adaptation from a general model to a task-specific one can be captured in a surprisingly low-dimensional subspace. LoRA exploits this by decomposing weight updates into low-rank matrices, reducing trainable parameters by 100-1000x while typically preserving 95%+ of full fine-tuning performance.

The key PEFT methods form a family of related approaches:

LoRA (Low-Rank Adaptation): Adds trainable low-rank matrices $B \cdot A$ alongside frozen weights. The most widely used PEFT method, effective and simple.
QLoRA: Combines LoRA with 4-bit quantization of the base model, enabling 65B model fine-tuning on a single 48GB GPU.
Adapter Tuning: Inserts small bottleneck layers inside each transformer block. Pioneered PEFT but largely superseded by LoRA.
Prefix/Prompt Tuning: Prepends learned "soft prompts" to the input (prompt tuning) or to the attention keys and values at each layer (prefix tuning), modifying model behavior without changing any pretrained weights.
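To see why 4-bit quantization matters for the QLoRA entry above, here is a rough sketch of how much memory the frozen base weights of a 65B model occupy at different precisions. The helper `weight_memory_gb` is a hypothetical name. The estimate ignores quantization constants, the LoRA parameters and their optimizer states, and activations, so it is a lower bound rather than a full budget.

```python
def weight_memory_gb(n_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in decimal GB."""
    return n_params * bits_per_param / 8 / 1e9


n_params = 65e9
base_16bit_gb = weight_memory_gb(n_params, 16)  # frozen weights kept in bf16
base_4bit_gb = weight_memory_gb(n_params, 4)    # frozen weights quantized to 4 bits

print(f"16-bit base weights: {base_16bit_gb:.1f} GB")
print(f" 4-bit base weights: {base_4bit_gb:.1f} GB")
print(f"fits in 48 GB at 4-bit (weights only): {base_4bit_gb < 48}")
```

Under these assumptions the 4-bit weights leave some headroom on a 48GB card for the adapters and activations, while the 16-bit weights alone would not fit.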

For VLMs specifically, PEFT raises a unique question: which components should be adapted? The vision encoder, the projection layer, and the LLM backbone each contribute differently to task performance. Applying LoRA to the wrong components wastes parameters; applying it to the right ones can match full fine-tuning at 0.1% of the parameter cost.

This chapter covers each PEFT method in depth: the mathematical foundations (why low-rank works, what information is captured in the rank-$r$ subspace), practical implementation (which layers, what rank, what learning rate), and the production story (merging adapters for zero-overhead inference, serving multiple tasks from a single base model, and combining adapters via model arithmetic).

The formalism for LoRA begins with a simple hypothesis: the weight update matrix $\Delta W$ learned during fine-tuning tends to have low intrinsic rank. If $\Delta W \in \mathbb{R}^{d \times d}$ but $\text{rank}(\Delta W) \approx r \ll d$, then we can represent it as $\Delta W = BA$ where $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times d}$. This reduces the parameter count from $d^2$ to $2dr$, a savings factor of $d/(2r)$. For a typical $d = 4096$ and $r = 16$, this is a 128x reduction.
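The factorization and the parameter count above can be checked numerically. The following is a minimal NumPy sketch, not a training implementation: the small dimensions `d_small` and `r_small` and the random initialization are assumptions made for illustration. Practical LoRA implementations usually also apply a scaling factor to $BA$ and initialize $B$ to zero, both of which are omitted here. The sketch also shows that adding $BA$ to the frozen weight gives the same outputs as running the adapter as a separate path, which is what makes merging possible.

```python
import numpy as np

# Parameter count for the dimensions quoted in the text.
d, r = 4096, 16
full_params = d * d          # d^2 entries in a dense update
lora_params = 2 * d * r      # entries in B (d x r) plus A (r x d)
print(full_params, lora_params, full_params // lora_params)

# Small random example (sizes chosen only to keep the demo fast).
rng = np.random.default_rng(0)
d_small, r_small = 64, 4
W = rng.standard_normal((d_small, d_small))   # frozen pretrained weight
B = rng.standard_normal((d_small, r_small))   # trainable factor
A = rng.standard_normal((r_small, d_small))   # trainable factor
x = rng.standard_normal((8, d_small))         # batch of 8 input vectors

delta_W = B @ A                               # rank is at most r_small
print("rank of BA:", np.linalg.matrix_rank(delta_W))

# Adapter kept separate: frozen path plus low-rank path.
y_adapter = x @ W.T + (x @ A.T) @ B.T

# Adapter merged into the weight: a single dense matmul at inference.
W_merged = W + delta_W
y_merged = x @ W_merged.T

print("outputs match:", np.allclose(y_adapter, y_merged))
```

Keeping the adapter separate costs two extra small matmuls per layer but lets several adapters share one base weight. Merging removes that overhead at the price of storing one dense weight per task.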

Chapter 8: Parameter-Efficient Fine-Tuning

Chapter Overview

Chapter Roadmap

Why PEFT

LoRA

Adapter Tuning

Prefix & Prompt Tuning

QLoRA

VLM Fine-Tuning Practice

Merging & Serving
