Deep dive into influential research papers that shaped modern computer vision. Learn key concepts through interactive visualizations and animations.
Efficient Universal Perception Encoder
Meta's two-stage knowledge-distillation recipe (scale up to a 1.9B proxy, then scale down) yields edge-deployable encoders that match same-size domain experts on classification, dense prediction, and VLM tasks, all in 7–60 ms on iPhone CPU.
Learning to Reason in 13 Parameters
Achieving 91% GSM8K accuracy with only 13 trainable parameters (26 bytes) through extreme parameter sharing and RL training.
Hybrid Mamba-MoE for Agentic Reasoning
NVIDIA's 120B MoE hybrid Mamba-Attention model with LatentMoE, Multi-Token Prediction, and NVFP4 training — 2.2x faster than GPT-OSS-120B.
Learned Depth-Wise Attention for Residual Connections
Replacing fixed residual accumulation with softmax attention over depth, enabling content-dependent information routing across layers.
Near-Optimal Vector Quantization with Zero Overhead
Training-free, data-oblivious vector quantization achieving 6x KV cache compression and 8x attention speedup via polar coordinate transforms and 1-bit QJL error correction.
Strengths and Limitations of Reasoning Models via Problem Complexity
Apple/NeurIPS 2025: controllable puzzle environments reveal three complexity regimes and a counter-intuitive thinking-token collapse in Large Reasoning Models.
Real-Time End-to-End Object Detection
Eliminating NMS with consistent dual assignments and optimizing efficiency-accuracy tradeoffs through holistic design.
Visual Instruction Tuning
Connecting a CLIP vision encoder to a large language model via a simple projection layer, trained in two stages with GPT-4-generated instruction data.
7B Parameters, Universal Visual Features
Self-supervised learning at scale with Gram Anchoring: a 7B-parameter model trained on 1.7B images, achieving SOTA on 60+ benchmarks.
Multilingual Vision-Language Encoders
Unified training recipe combining sigmoid loss, decoder objectives, self-distillation, and masked prediction for improved multimodal understanding.
Promptable Segmentation for Any Image
Foundation model for image segmentation that can segment any object using points, boxes, or masks as prompts.
Real-Time Radiance Field Rendering
Replaces neural radiance fields with explicit 3D Gaussians and tile-based rasterization, achieving real-time (>100 FPS) rendering with state-of-the-art visual quality.
Self-Supervised Visual Pre-training
Pre-training vision transformers by masking 75% of image patches and learning to reconstruct them.
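As a rough illustration of the 75% masking step, the sketch below picks which patch indices stay visible to the encoder and which are hidden for reconstruction. The function name `random_patch_mask`, the fixed seed, and the 224x224 image with 16x16 patches are assumptions chosen for the example, not details taken from this page.

```python
import random

def random_patch_mask(num_patches, mask_ratio=0.75, seed=0):
    """Return (visible, masked) patch index lists for one image.

    Hypothetical helper: a uniform random shuffle decides which
    patches are hidden; the rest are passed to the encoder.
    """
    rng = random.Random(seed)
    indices = list(range(num_patches))
    rng.shuffle(indices)
    num_masked = int(num_patches * mask_ratio)
    masked = sorted(indices[:num_masked])
    visible = sorted(indices[num_masked:])
    return visible, masked

# Assumed setup: a 224x224 image cut into 16x16 patches gives 14 * 14 = 196 patches.
visible, masked = random_patch_mask(196)
print(len(visible), len(masked))  # 49 147
```

Only the visible quarter of the patches is encoded, which is where most of the pre-training compute saving would come from.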
Contrastive Language-Image Pre-training
Learning visual representations from natural language supervision using contrastive pre-training on 400M image-text pairs.
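The contrastive objective can be sketched as a symmetric cross-entropy over an image-text similarity matrix, where entry (i, j) scores image i against caption j and the matched pairs sit on the diagonal. This is a minimal pure-Python sketch: `clip_loss` and `cross_entropy_rows` are hypothetical names, and the fixed temperature of 0.07 stands in for what is a learned parameter in practice.

```python
import math

def cross_entropy_rows(logits, targets):
    """Mean cross-entropy where each row is one softmax classification."""
    total = 0.0
    for row, target in zip(logits, targets):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[target]
    return total / len(logits)

def clip_loss(similarity, temperature=0.07):
    """Symmetric contrastive loss over an N x N similarity matrix."""
    n = len(similarity)
    logits = [[s / temperature for s in row] for row in similarity]
    logits_t = [list(col) for col in zip(*logits)]  # text-to-image direction
    targets = list(range(n))  # the i-th image matches the i-th caption
    image_to_text = cross_entropy_rows(logits, targets)
    text_to_image = cross_entropy_rows(logits_t, targets)
    return 0.5 * (image_to_text + text_to_image)

matched = [[1.0, 0.0], [0.0, 1.0]]
mismatched = [[0.0, 1.0], [1.0, 0.0]]
print(clip_loss(matched) < clip_loss(mismatched))  # True
```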
Denoising Diffusion Probabilistic Models
Learning to generate images by reversing a gradual noising process, achieving state-of-the-art image synthesis.
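The gradual noising process has a closed form: a sample at step t is the clean input scaled by the square root of the cumulative product of (1 - beta), plus Gaussian noise scaled by the square root of the remainder. The sketch below assumes the linear beta schedule from 1e-4 to 0.02 over 1000 steps; `linear_alpha_bars` and `q_sample` are hypothetical names, and a flat list of floats stands in for an image.

```python
import math
import random

def linear_alpha_bars(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative products of (1 - beta_t) for a linear beta schedule."""
    alpha_bars, running = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        running *= 1.0 - beta
        alpha_bars.append(running)
    return alpha_bars

def q_sample(x0, t, alpha_bars, rng):
    """Draw a noised sample x_t directly from the clean input x0."""
    a = alpha_bars[t]
    return [math.sqrt(a) * x + math.sqrt(1.0 - a) * rng.gauss(0.0, 1.0) for x in x0]

alpha_bars = linear_alpha_bars()
rng = random.Random(0)
slightly_noisy = q_sample([1.0, -1.0], 10, alpha_bars, rng)
almost_pure_noise = q_sample([1.0, -1.0], 999, alpha_bars, rng)
```

By the final step almost none of the original signal is left, and the model is trained to undo this process one step at a time.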
An Image is Worth 16x16 Words
Applying Transformers directly to image patches, achieving state-of-the-art results on image classification.
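To make "image patches as words" concrete, this sketch cuts a small single-channel image into non-overlapping square patches and flattens each into one token vector. `patchify` is a hypothetical helper; a real model would also handle colour channels, apply a learned linear projection, and add position embeddings.

```python
def patchify(image, patch):
    """Split an H x W image (list of rows) into flattened patch tokens."""
    height, width = len(image), len(image[0])
    assert height % patch == 0 and width % patch == 0, "image must tile evenly"
    tokens = []
    for top in range(0, height, patch):
        for left in range(0, width, patch):
            tokens.append([
                image[r][c]
                for r in range(top, top + patch)
                for c in range(left, left + patch)
            ])
    return tokens

# Toy 4x4 image with pixel values 0..15, cut into 2x2 patches.
image = [[4 * r + c for c in range(4)] for r in range(4)]
tokens = patchify(image, 2)
print(len(tokens), tokens[0])  # 4 [0, 1, 4, 5]
```

With 16x16 patches, a 224x224 input becomes a sequence of 196 tokens.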
End-to-End Object Detection with Transformers
Treating object detection as a set prediction problem using Transformers and bipartite matching loss.
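The bipartite matching step assigns each prediction to exactly one ground-truth object so that the total matching cost is as low as possible. The sketch below brute-forces every permutation, which is only practical for tiny inputs and is swapped in here for readability in place of the Hungarian algorithm. `match_predictions` and the cost values are made up for illustration.

```python
from itertools import permutations

def match_predictions(cost):
    """Return assignment[i] = ground-truth index matched to prediction i.

    cost[i][j] is the cost of pairing prediction i with ground truth j.
    Assumes a square matrix; brute force, so keep it very small.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best)

# Prediction 0 is closest to object 1, prediction 1 to object 0.
cost = [[0.9, 0.1],
        [0.2, 0.8]]
print(match_predictions(cost))  # [1, 0]
```

Because each object is claimed by a single prediction, duplicates are penalized during training and do not have to be filtered out afterwards.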
Decoupled Weight Decay Regularization
Why AdamW decouples weight decay from gradient updates, and when to use Adam vs AdamW in modern deep learning.
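The difference between the two optimizers can be seen in a single scalar update. In this sketch, `adam_step` is a hypothetical helper: with `decoupled=False` the decay term is added to the gradient (Adam with L2 regularization) and so gets rescaled by the adaptive denominator, while with `decoupled=True` it is applied directly to the weight (the AdamW style). Scaling the decoupled term by the learning rate is an assumption that follows common library behaviour.

```python
import math

def adam_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-8, weight_decay=0.0, decoupled=True):
    """One Adam update on a scalar weight; returns (new_w, m, v)."""
    if not decoupled:
        grad = grad + weight_decay * w  # L2 penalty folded into the gradient
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** t)  # bias correction
    v_hat = v / (1 - beta2 ** t)
    new_w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        new_w -= lr * weight_decay * w  # decay bypasses the adaptive scaling
    return new_w, m, v

# Zero task gradient, so only the decay term moves the weight.
w_adamw, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1, weight_decay=0.1)
w_adam_l2, _, _ = adam_step(1.0, 0.0, 0.0, 0.0, t=1, lr=0.1,
                            weight_decay=0.1, decoupled=False)
print(round(w_adamw, 4), round(w_adam_l2, 4))  # 0.99 0.9
```

In the coupled case the step size ends up close to the learning rate regardless of the decay coefficient, which is the interaction that decoupling avoids.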
The Transformer Architecture
Introducing the Transformer architecture with self-attention mechanisms that revolutionized NLP and later computer vision.
Instance Segmentation Framework
Extending Faster R-CNN with a parallel mask branch for pixel-precise instance segmentation.
BatchNorm, LayerNorm, GroupNorm & InstanceNorm
Visual guide to normalization types: which dimensions each operates over and when to use which.
Deep Residual Learning
Skip connections enabling training of very deep networks, solving the degradation problem in neural networks.
Convolutional Networks for Biomedical Image Segmentation
Encoder-decoder architecture with skip connections for precise localization in image segmentation tasks.
More papers coming soon: GANs and more.