Interactive Learning from Research Papers

Advanced Concepts

Deep dive into influential research papers that shaped modern computer vision. Learn key concepts through interactive visualizations and animations.

23 Papers

Interactive Demos

Research Insights

Research Papers

Efficient Vision Encoders

EUPE

Efficient Universal Perception Encoder

Meta's two-stage knowledge-distillation recipe (scale up to a 1.9B proxy, then scale down) yields edge-deployable encoders that match same-size domain experts on classification, dense prediction, and VLM tasks, all in 7–60 ms on an iPhone CPU.

Scale Up Then Scale Down, Multi-Teacher Aggregation, Multi-Resolution Finetuning, +1 more

Parameter-Efficient Fine-Tuning

TinyLoRA

Learning to Reason in 13 Parameters

Achieving 91% GSM8K accuracy with only 13 trainable parameters (26 bytes) through extreme parameter sharing and RL training.

Sub-Rank Adaptation, Rank Below One, RL vs SFT Efficiency, +2 more

Large Language Models

Nemotron 3 Super

Hybrid Mamba-MoE for Agentic Reasoning

NVIDIA's 120B MoE hybrid Mamba-Attention model with LatentMoE, Multi-Token Prediction, and NVFP4 training — 2.2x faster than GPT-OSS-120B.

Pipeline Summary, LatentMoE Architecture, Hybrid Mamba-Attention, +4 more

Large Language Models

Attention Residuals

Learned Depth-Wise Attention for Residual Connections

Replacing fixed residual accumulation with softmax attention over depth, enabling content-dependent information routing across layers.

Pipeline Summary, PreNorm Dilution, Full AttnRes, +3 more

Quantization & Compression

TurboQuant

Near-Optimal Vector Quantization with Zero Overhead

Training-free, data-oblivious vector quantization achieving 6x KV cache compression and 8x attention speedup via polar coordinate transforms and 1-bit QJL error correction.

Quantization Overhead, PolarQuant, QJL Error Correction, +2 more

LLM Reasoning & Evaluation

The Illusion of Thinking

Strengths and Limitations of Reasoning Models via Problem Complexity

Apple/NeurIPS 2025: controllable puzzle environments reveal three complexity regimes and a counter-intuitive thinking-token collapse in Large Reasoning Models.

Pipeline Summary, Benchmark Contamination, Controllable Puzzles, +4 more

Object Detection

YOLOv10

Real-Time End-to-End Object Detection

Eliminating NMS with consistent dual assignments and optimizing efficiency-accuracy tradeoffs through holistic design.

Dual Assignments, NMS-Free Inference, PSA Module, +4 more

Multimodal

LLaVA

Visual Instruction Tuning

Connecting a CLIP vision encoder to a large language model via a simple projection layer, trained in two stages with GPT-4-generated instruction data.

Visual Tokenization, Vision-Language Projection, Two-Stage Training, +4 more

Self-Supervised (PRO)

DINOv3

7B Parameters, Universal Visual Features

Self-supervised learning at scale with Gram Anchoring. 7B parameters trained on 1.7B images achieving SOTA on 60+ benchmarks.

Self-Distillation, Gram Anchoring, 7B Scale, +3 more

Vision-Language (PRO)

SigLIP 2

Multilingual Vision-Language Encoders

Unified training recipe combining sigmoid loss, decoder objectives, self-distillation, and masked prediction for improved multimodal understanding.

Sigmoid Loss, Decoder Objectives, Self-Distillation, +3 more

Segmentation (PRO)

Segment Anything (SAM)

Promptable Segmentation for Any Image

Foundation model for image segmentation that can segment any object using points, boxes, or masks as prompts.

Promptable Segmentation, Model Architecture, Zero-Shot Transfer, +3 more

3D Reconstruction (PRO)

3D Gaussian Splatting

Real-Time Radiance Field Rendering

Replaces neural radiance fields with explicit 3D Gaussians and tile-based rasterization, achieving real-time (>100 FPS) rendering with state-of-the-art visual quality.

3D Gaussian Primitives, Differentiable Splatting, Spherical Harmonics, +3 more

Self-Supervised (PRO)

Masked Autoencoders

Self-Supervised Visual Pre-training

Pre-training vision transformers by masking 75% of image patches and learning to reconstruct them.

Asymmetric Encoder-Decoder, High Masking Ratio, Random Masking, +3 more

Vision-Language (PRO)

CLIP

Contrastive Language-Image Pre-training

Learning visual representations from natural language supervision using contrastive pre-training on 400M image-text pairs.

Contrastive Pre-training, Dual Encoder, Zero-Shot Transfer, +3 more

Generative (PRO)

Diffusion Models

Denoising Diffusion Probabilistic Models

Learning to generate images by reversing a gradual noising process, achieving state-of-the-art image synthesis.

Forward Process, Reverse Denoising, Noise Schedule, +4 more

Architecture (PRO)

Vision Transformer (ViT)

An Image is Worth 16x16 Words

Applying Transformers directly to image patches, achieving state-of-the-art results on image classification.

Patch Embeddings, Position Embeddings, Class Token, +3 more

Object Detection (PRO)

DETR

End-to-End Object Detection with Transformers

Treating object detection as a set prediction problem using Transformers and bipartite matching loss.

Object Queries, Bipartite Matching, Encoder-Decoder, +4 more

Optimization (PRO)

AdamW vs Adam

Decoupled Weight Decay Regularization

Why AdamW decouples weight decay from gradient updates, and when to use Adam vs AdamW in modern deep learning.

Momentum, Adaptive Learning Rates, Weight Decay vs L2, +2 more

Architecture (PRO)

Attention Is All You Need

The Transformer Architecture

Introducing the Transformer architecture with self-attention mechanisms that revolutionized NLP and later computer vision.

Self-Attention, Multi-Head Attention, Positional Encoding, +4 more

Instance Segmentation (PRO)

Mask R-CNN

Instance Segmentation Framework

Extending Faster R-CNN with a parallel mask branch for pixel-precise instance segmentation.

Two-Stage Pipeline, RoIAlign, Mask Branch, +3 more

Training (PRO)

Normalization Techniques

BatchNorm, LayerNorm, GroupNorm & InstanceNorm

Visual guide to normalization types: which dimensions each operates over and when to use which.

Internal Covariate Shift, Batch Normalization, Layer Normalization, +2 more

Architecture (PRO)

ResNet

Deep Residual Learning

Skip connections enabling training of very deep networks, solving the degradation problem in neural networks.

Skip Connections, Gradient Flow, Bottleneck, +4 more

Segmentation (PRO)

U-Net

Convolutional Networks for Biomedical Image Segmentation

Encoder-decoder architecture with skip connections for precise localization in image segmentation tasks.

Encoder-Decoder, Skip Connections, Transposed Convolution, +4 more

More papers coming soon, including GANs.
