Explore the insight that transformed vision-language models from fixed-task systems into general-purpose visual assistants. From LLaVA's visual instruction tuning methodology through GPT-4-powered data generation, two-stage training, and the evolution to LLaVA-1.5 and LLaVA-NeXT, master the paradigm that defined modern VLM development.
Before LLaVA, vision-language models were largely limited to fixed VQA questions ('What color is the car?') and struggled to engage in open-ended visual reasoning or follow complex instructions about images. The key insight of Visual Instruction Tuning (Liu et al., April 2023) was deceptively simple: use language-only GPT-4, prompted with text descriptions of images such as captions and bounding boxes, to generate instruction-following data, then fine-tune a VLM on that data. The result was a model that could hold free-form conversations about images.
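The data-generation idea can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes the image is represented only as text (captions plus labeled bounding boxes), and the function names and task wording (`build_text_context`, `build_generation_prompt`, `TASKS`) are hypothetical. No model is called; the sketch only builds the prompt a language-only model would receive.

```python
# Hypothetical task instructions; the wording is illustrative, not the original prompts.
TASKS = {
    "conversation": "Write a multi-turn Q&A between a user and an assistant about the image.",
    "detailed_description": "Write a detailed description of the image.",
    "complex_reasoning": "Write a question that requires reasoning about the image, then answer it.",
}


def build_text_context(captions, boxes):
    """Render an image as plain text so a language-only model can reason about it."""
    lines = ["Captions:"]
    lines += [f"- {caption}" for caption in captions]
    lines.append("Objects (label: x1, y1, x2, y2, normalized to [0, 1]):")
    for label, (x1, y1, x2, y2) in boxes:
        lines.append(f"- {label}: {x1:.2f}, {y1:.2f}, {x2:.2f}, {y2:.2f}")
    return "\n".join(lines)


def build_generation_prompt(captions, boxes, task):
    """Combine the text-only image context with one task instruction."""
    return build_text_context(captions, boxes) + "\n\nTask: " + TASKS[task]


# Toy example with made-up annotations.
prompt = build_generation_prompt(
    captions=["A person riding a bicycle down a city street."],
    boxes=[("person", (0.10, 0.20, 0.50, 0.90)), ("bicycle", (0.15, 0.50, 0.55, 0.95))],
    task="conversation",
)
print(prompt)
```

The string produced here would be sent to the text-only model, and its response stored as a training example paired with the original image.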
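LLaVA (Large Language-and-Vision Assistant) established a paradigm that nearly every subsequent open-source VLM has followed: connect a pretrained vision encoder to a pretrained language model through a projection layer, then train in two stages.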
This two-stage recipe is powerful because it reuses the enormous investment already made in training vision encoders (CLIP) and language models (LLaMA/Vicuna) separately. The projection layer is the 'Rosetta Stone' that translates visual features into the language model's embedding space.
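As a rough sketch of the two-stage recipe, the snippet below records which components receive gradient updates in each stage: the projector alone during feature alignment, then the projector and language model together during instruction tuning, with the vision encoder kept frozen throughout. The class and function names (`StageConfig`, `trainable_modules`) are illustrative assumptions, not part of any LLaVA codebase.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageConfig:
    """Which components are updated during one training stage."""
    name: str
    train_vision_encoder: bool
    train_projector: bool
    train_language_model: bool


# Stage 1: align visual features to the language embedding space.
STAGE_1 = StageConfig("feature_alignment", False, True, False)
# Stage 2: teach instruction following on visual instruction data.
STAGE_2 = StageConfig("visual_instruction_tuning", False, True, True)


def trainable_modules(stage):
    """Return the names of the components that are unfrozen in this stage."""
    flags = {
        "vision_encoder": stage.train_vision_encoder,
        "projector": stage.train_projector,
        "language_model": stage.train_language_model,
    }
    return [name for name, is_trainable in flags.items() if is_trainable]


for stage in (STAGE_1, STAGE_2):
    print(stage.name, "->", trainable_modules(stage))
```

Keeping the expensive pretrained components frozen in Stage 1 means only the small projector has to learn the mapping, which is why that stage is comparatively cheap.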
The subsequent evolution to LLaVA-1.5 (October 2023) and LLaVA-NeXT (January 2024) showed that simple architectural changes (MLP projector, higher resolution) and better data mixing could dramatically improve performance, establishing this paradigm as the dominant approach for building open-source VLMs.
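To make the projector change concrete, here is a dependency-free sketch contrasting a single linear projection with a two-layer MLP using a GELU activation. The dimensions are toy values chosen for readability (real vision and language model widths are in the thousands), the weights are random, and all names are hypothetical.

```python
import math
import random


def linear(x, weight, bias):
    """Apply y = W x + b to one feature vector (weight is a list of rows)."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weight, bias)]


def gelu(v):
    """Exact GELU activation using the Gaussian error function."""
    return 0.5 * v * (1.0 + math.erf(v / math.sqrt(2.0)))


def init_layer(out_dim, in_dim, rng):
    """Small random weights and zero biases for one layer."""
    weight = [[rng.gauss(0.0, 0.02) for _ in range(in_dim)] for _ in range(out_dim)]
    return weight, [0.0] * out_dim


def linear_projector(patches, layer):
    """Single linear map from vision features to language-model embeddings."""
    weight, bias = layer
    return [linear(p, weight, bias) for p in patches]


def mlp_projector(patches, layer1, layer2):
    """Two-layer MLP: linear, GELU, linear."""
    (w1, b1), (w2, b2) = layer1, layer2
    return [linear([gelu(h) for h in linear(p, w1, b1)], w2, b2) for p in patches]


rng = random.Random(0)
VISION_DIM, LLM_DIM, NUM_PATCHES = 4, 6, 3  # toy sizes, assumptions for illustration
patches = [[rng.gauss(0.0, 1.0) for _ in range(VISION_DIM)] for _ in range(NUM_PATCHES)]

tokens_linear = linear_projector(patches, init_layer(LLM_DIM, VISION_DIM, rng))
tokens_mlp = mlp_projector(
    patches, init_layer(LLM_DIM, VISION_DIM, rng), init_layer(LLM_DIM, LLM_DIM, rng)
)
print(len(tokens_linear), len(tokens_linear[0]))
print(len(tokens_mlp), len(tokens_mlp[0]))
```

Both projectors turn each image patch feature into one vector of the language model's embedding width; the MLP simply adds a nonlinearity between two linear maps.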
This chapter covers the full LLaVA paradigm:
The paradigm shift from fixed-task VQA to open-ended visual assistants — why LLaVA changed how we build VLMs.
Data and training — both essential
Using GPT-4 as a data engine to bootstrap 158K visual instruction examples from image captions.
Stage 1 aligns vision to language, Stage 2 teaches instruction following — why this ordering matters.
MLP projectors, high-resolution inputs, and better data mixing — simple changes that doubled performance.
How to format inputs and curate training data
Chat templates, system prompts, and multi-turn context — the interface between user and model.
Mixing academic VQA, synthetic instructions, and ShareGPT conversations — data composition determines behavior.
GPT-4 as judge, benchmark suites, and hallucination detection — measuring what instruction-tuned VLMs can do.