Instruction Tuning & the LLaVA Paradigm

Chapter 6: Instruction Tuning & the LLaVA Paradigm

Explore the breakthrough insight that transformed vision-language models from fixed-task systems into general-purpose visual assistants. From LLaVA's visual instruction tuning methodology through GPT-4 powered data generation, two-stage training, and the evolution to LLaVA-1.5/NeXT, master the paradigm that defined modern VLM development.

Before LLaVA, vision-language models could answer fixed VQA questions ('What color is the car?') but could not engage in open-ended visual reasoning or follow complex instructions about images. The key insight of Visual Instruction Tuning (Liu et al., April 2023) was deceptively simple: generate instruction-following data about images using GPT-4, then fine-tune a VLM on this data. The result was a model that could have free-form conversations about images.

LLaVA (Large Language-and-Vision Assistant) established a paradigm that nearly every subsequent open-source VLM has followed:

Connect a pretrained vision encoder to a pretrained LLM via a lightweight projection layer
Stage 1: Feature alignment -- Train only the projection layer on image-caption pairs, teaching the LLM to 'understand' visual tokens
Stage 2: Instruction tuning -- Fine-tune the full model on diverse visual instruction data

This two-stage recipe is powerful because it leverages the enormous investment already made in training vision encoders (CLIP) and language models (LLaMA/Vicuna) separately. The projection layer is the 'rosetta stone' that translates between visual and linguistic representations.

The subsequent evolution to LLaVA-1.5 (October 2023) and LLaVA-NeXT (January 2024) showed that simple architectural changes (MLP projector, higher resolution) and better data mixing could dramatically improve performance, establishing this paradigm as the dominant approach for building open-source VLMs.

This chapter covers the full LLaVA paradigm:

Visual instruction tuning: The insight that changed VLM development
Data generation via GPT-4: Creating diverse instruction data from image metadata
Two-stage training: Why separating alignment from instruction tuning works
LLaVA-1.5 & LLaVA-NeXT: Architectural and data improvements
Conversation formats: How chat templates affect model behavior
SFT data strategies: Mixing academic and synthetic data for optimal results
Evaluation: Measuring instruction-following ability in VLMs

Chapter 6: Instruction Tuning & the LLaVA Paradigm

Chapter Overview

Chapter Roadmap

Visual Instruction Tuning

Data Generation

Two-Stage Training

LLaVA-1.5 & NeXT

Conversation Formats

SFT Data Strategies

Evaluation

Sign up to unlock this chapter