3D Reconstruction · 2023

3D Gaussian Splatting

Real-Time Radiance Field Rendering

Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, George Drettakis

Paper Overview

3D Gaussian Splatting (Kerbl et al., SIGGRAPH 2023) represents a scene as a collection of anisotropic 3D Gaussians. It enables real-time novel view synthesis at quality competitive with the best Neural Radiance Field (NeRF) methods while rendering at more than 100 FPS at 1080p resolution on a single high-end GPU (the paper reports its timings on an NVIDIA RTX A6000).

The fundamental insight is replacing neural networks with an explicit, point-based scene representation. Each scene is modeled by 1 to 5 million 3D Gaussians, where every Gaussian carries learnable attributes: a position (mean) $\boldsymbol{\mu} \in \mathbb{R}^3$, a full 3D covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{3 \times 3}$ (parameterized via a rotation quaternion $\mathbf{q} \in \mathbb{R}^4$ and a scale vector $\mathbf{s} \in \mathbb{R}^3$), an opacity $\alpha \in [0, 1]$, and 48 spherical harmonic coefficients encoding view-dependent color (16 SH basis functions per RGB channel at degree $l = 3$). Each Gaussian therefore stores 59 scalar parameters (3 + 4 + 3 + 1 + 48). As raw float32 values a scene of a few million Gaussians takes several hundred megabytes, which compression and quantization can bring down to roughly 50-200 MB. A custom tile-based CUDA rasterizer projects and composites these Gaussians in real time without any neural network evaluation at render time.

Key advances:

Explicit 3D Gaussians: No neural network at render time — pure rasterization of ~59 parameters per Gaussian, with 1-5M Gaussians per scene
Differentiable splatting: Project 3D covariance to 2D via $\boldsymbol{\Sigma}' = \mathbf{J}\mathbf{W}\boldsymbol{\Sigma}\mathbf{W}^T\mathbf{J}^T$ , enabling gradient-based optimization from multi-view images
Spherical harmonics: Degree-3 SH (48 coefficients) captures view-dependent appearance including specular highlights without any MLP
Adaptive density control: Clone under-reconstructed regions, split over-large Gaussians, prune near-transparent ones — growing from ~50K SfM points to millions
Tile-based rasterization: 16x16 pixel tiles processed in parallel on the GPU with radix sort and early termination, enabling both fast forward rendering and efficient backward gradient computation
Real-time rendering: 134 FPS average on Mip-NeRF 360 scenes at 1080p, with training completing in ~30 minutes on a single GPU
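
The covariance projection named in the differentiable-splatting item can be sketched for a single Gaussian. This is a minimal illustration under stated assumptions, not the paper's CUDA rasterizer: it assumes a pinhole camera with focal lengths `fx` and `fy` (hypothetical names), a world-to-camera rotation `W`, and a Gaussian mean already expressed in camera coordinates.

```python
def matmul(a, b):
    """Multiply two matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(row) for row in zip(*a)]

def project_covariance(cov3d, W, mean_cam, fx, fy):
    """Return the 2x2 screen-space covariance J W Sigma W^T J^T."""
    x, y, z = mean_cam
    # Jacobian of the pinhole projection (u, v) = (fx*x/z, fy*y/z),
    # evaluated at the Gaussian's mean (a local affine approximation).
    J = [[fx / z, 0.0, -fx * x / z**2],
         [0.0, fy / z, -fy * y / z**2]]
    T = matmul(J, W)  # 2x3
    return matmul(matmul(T, cov3d), transpose(T))

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
# Spherical Gaussian with standard deviation 0.2, placed 2 units in front of the camera.
cov3d = [[0.04, 0.0, 0.0], [0.0, 0.04, 0.0], [0.0, 0.0, 0.04]]
cov2d = project_covariance(cov3d, identity, (0.0, 0.0, 2.0), fx=1000.0, fy=1000.0)
print(cov2d)
```

With these numbers the projected footprint has a standard deviation of about 100 pixels (1000 × 0.2 / 2), which is the perspective scaling one would expect for a blob on the optical axis.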

3D Gaussian Splatting matches or approaches state-of-the-art visual quality on standard benchmarks. On real scenes the paper reports 27.21 dB PSNR on Mip-NeRF 360 (versus 27.69 dB for the Mip-NeRF 360 method), 23.14 dB on Tanks & Temples, and 29.41 dB on Deep Blending; on the synthetic NeRF (Blender) scenes it reports 33.32 dB (versus 33.09 dB for Mip-NeRF). It does so while rendering roughly 100-1000x faster than NeRF methods. It bridges the gap between quality and interactivity for the first time, enabling applications in VR, gaming, digital twins, and real-time 3D content creation.

Chapter Roadmap

3D Gaussian Primitives

Anisotropic Gaussians parameterized by mean, rotation, scale, opacity, and SH — a differentiable scene representation.

renders via

Spherical Harmonics

Per-Gaussian SH coefficients (degree 3, 48 scalars) encode smooth view-dependent color without MLPs.

Differentiable Splatting

3D→2D projection with Jacobian covariance + alpha compositing gives closed-form volume rendering.

accelerated and refined by

Tile-Based Rasterization

Sorting Gaussians per 16×16 tile and terminating early once a pixel is opaque avoids the naive O(N·P) cost of testing every Gaussian against every pixel.

Adaptive Density Control

Gradient-triggered clone/split and opacity pruning let the primitive count self-tune to scene complexity.

trained end-to-end by

Optimization Pipeline

L1 + D-SSIM loss with Adam and density control converges 100× faster than NeRF at matching quality.

3D Gaussian Splatting represents scenes as millions of anisotropic 3D Gaussians, each defined by a position $\boldsymbol{\mu} \in \mathbb{R}^3$ , a covariance $\boldsymbol{\Sigma} \in \mathbb{R}^{3 \times 3}$ , opacity $\alpha \in [0,1]$ , and 48 spherical harmonic coefficients for view-dependent color — an explicit, differentiable alternative to neural radiance fields that requires no neural network at render time.

The Problem

The NeRF bottleneck:

Neural Radiance Fields (NeRF, Mildenhall et al. 2020) achieve remarkable novel view synthesis by encoding scenes inside an MLP network (typically 8 layers of 256 hidden units). However, rendering requires ray marching: for each pixel, sample 64-256 points along the ray and evaluate the MLP at each point. For a single 1080p frame (1920x1080 = 2,073,600 pixels), this means 130 million to 530 million MLP forward passes. Even with hardware-accelerated hash grids (Instant-NGP), rendering takes 50-200ms per frame — far from real-time interactive rates.
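
The evaluation counts quoted above follow directly from the frame size and the per-ray sample count. A quick check, using only the numbers in the paragraph:

```python
WIDTH, HEIGHT = 1920, 1080
pixels = WIDTH * HEIGHT

# One MLP forward pass per sample, for the low and high end of the sample range.
queries = {n: pixels * n for n in (64, 256)}
for n, q in queries.items():
    print(f"{n:>3} samples/ray -> {q:,} MLP evaluations per frame")
```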

Why implicit representations are fundamentally slow:

Each pixel requires 64-256 MLP evaluations (stratified + importance sampling along the ray), where each evaluation involves a forward pass through an 8-layer, 256-wide MLP (~590K parameters)
The neural network IS the scene — there is no geometry to rasterize, no way to skip computation for occluded regions
Ray marching is inherently serial per-pixel: you must evaluate sample by sample along each ray to compute transmittance and accumulated color $C(\mathbf{r}) = \int_0^\infty T(t) \sigma(\mathbf{r}(t)) \mathbf{c}(\mathbf{r}(t), \mathbf{d}) \, dt$
Occupancy grids and hierarchical sampling help, but the bottleneck remains the massive number of MLP queries
GPU rasterization pipelines (designed for triangles and texture mapping at billions of primitives/second) sit idle because the scene has no explicit geometry
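
The serial dependence along a ray can be made concrete with the standard discretization of the rendering integral. The sketch below is illustrative only: `sigmas` and `colors` are hypothetical stand-ins for the per-sample outputs that a NeRF would obtain from one MLP query each, and `deltas` are the distances between consecutive samples.

```python
import math

def composite_ray(sigmas, colors, deltas):
    """Quadrature of C(r) = integral of T(t) * sigma * c dt along one ray."""
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        # The next sample's weight depends on every sample before it.
        transmittance *= 1.0 - alpha
    return pixel, transmittance

# Empty space, then a dense green surface that hides the blue sample behind it.
sigmas = [0.0, 50.0, 50.0]
colors = [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
pixel, remaining = composite_ray(sigmas, colors, deltas=[1.0, 1.0, 1.0])
print(pixel, remaining)
```

The third sample contributes almost nothing because the transmittance is already near zero, yet a plain ray marcher still pays for its MLP query unless it adds explicit early termination.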

The representation gap:

Traditional meshes and point clouds render at thousands of FPS on modern GPUs (hardware rasterization), but are extremely difficult to optimize from photographs — mesh topology changes are non-differentiable, and point clouds lack surface connectivity
NeRFs optimize beautifully from images via differentiable volume rendering, but are too slow for real-time use: vanilla NeRF renders at ~0.03 FPS, Mip-NeRF 360 at ~0.06 FPS, even Instant-NGP at only ~5-15 FPS
We need a representation that is both explicit (leveraging GPU rasterization for >30 FPS rendering) and differentiable (trainable end-to-end from multi-view images with gradient descent)

Previous point-based attempts (Yifan et al. 2019, Wiles et al. 2020, Kopanas et al. 2022) used simple spheres, oriented discs, or neural point features, but suffered from three critical limitations: (1) discrete primitives with hard boundaries created visible seams between points, (2) isotropic shapes could not conform to thin surfaces or elongated structures, and (3) the rendering pipeline was either non-differentiable or required per-point neural network evaluation, negating the speed advantage of explicit representations.

The Solution

3D Gaussians solve both problems simultaneously: they are explicit primitives that render without neural networks (pure mathematical evaluation), yet their smooth, continuous nature makes them ideal for differentiable optimization from images.

Per-Gaussian parameters (59 scalars total):

Each Gaussian $g$ is defined by four groups of learnable attributes (five if rotation and scale are counted separately):

1. Position (mean) $\boldsymbol{\mu} \in \mathbb{R}^3$ (3 parameters). The 3D center of the Gaussian in world coordinates, initialized from the SfM point cloud (COLMAP output). During optimization, positions shift to minimize photometric reconstruction error, with a learning rate that starts at $1.6 \times 10^{-4}$ and decays exponentially to $1.6 \times 10^{-6}$ over 30K iterations.

2. Covariance matrix $\boldsymbol{\Sigma} \in \mathbb{R}^{3 \times 3}$ (7 parameters: 4 quaternion + 3 scale). Defines the shape and orientation of the Gaussian ellipsoid. Optimizing $\boldsymbol{\Sigma}$ directly is problematic: a $3 \times 3$ symmetric matrix has 6 degrees of freedom, but unconstrained gradient updates can produce matrices that are not positive semi-definite and are therefore not valid covariances. The paper instead factors it as $\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T$, where $\mathbf{R} \in SO(3)$ is a rotation matrix (stored as a quaternion $\mathbf{q} \in \mathbb{R}^4$ that is normalized to $\|\mathbf{q}\| = 1$ before use) and $\mathbf{S} = \text{diag}(s_x, s_y, s_z)$ is a diagonal scaling matrix ($\mathbf{s} \in \mathbb{R}^3$, stored as log-scale for numerical stability). This factorization guarantees that $\boldsymbol{\Sigma}$ is positive semi-definite regardless of the optimizer's updates: any nonzero quaternion and any scale values produce a valid covariance. The quaternion encodes the ellipsoid's orientation, while the scale encodes its extent along each principal axis. A flat Gaussian modeling a wall surface might have $\mathbf{s} = (0.1, 2.0, 2.0)$: thin in one direction, extended in the other two.
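
A minimal sketch of this factorization, assuming the quaternion is ordered (w, x, y, z) and the scales are stored as logarithms; the function names are illustrative and not taken from the reference implementation:

```python
import math

def quat_to_rotmat(q):
    """Rotation matrix from a quaternion (w, x, y, z); q is normalized first."""
    n = math.sqrt(sum(c * c for c in q))
    w, x, y, z = (c / n for c in q)
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]

def build_covariance(q, log_scale):
    """Sigma = R S S^T R^T with S = diag(exp(log_scale))."""
    R = quat_to_rotmat(q)
    s2 = [math.exp(v) ** 2 for v in log_scale]  # diagonal of S S^T
    # Sigma_ij = sum_k R_ik * s_k^2 * R_jk
    return [[sum(R[i][k] * s2[k] * R[j][k] for k in range(3))
             for j in range(3)] for i in range(3)]

# Wall-like Gaussian from the text (thin along its first local axis),
# rotated 90 degrees about z so that the thin axis ends up along world y.
half = math.radians(90) / 2
q = (math.cos(half), 0.0, 0.0, math.sin(half))
cov = build_covariance(q, [math.log(0.1), math.log(2.0), math.log(2.0)])
for row in cov:
    print([round(v, 4) for v in row])
```

Because the quaternion is normalized inside `quat_to_rotmat` and the scales pass through `exp`, any optimizer update to `q` (other than the zero vector) or to `log_scale` still yields a symmetric, valid covariance.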

3. Opacity $\alpha \in [0, 1]$ (1 parameter). How opaque the Gaussian is. Stored internally as a logit $a \in \mathbb{R}$ and passed through a sigmoid: $\alpha = \sigma(a) = 1/(1 + e^{-a})$. This reparameterization keeps $\alpha$ strictly between 0 and 1 during unconstrained optimization. The logit is initialized at $\sigma^{-1}(0.1) \approx -2.2$, so Gaussians start mostly transparent and become opaque only where needed.
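
The reparameterization is small enough to check by hand. The helper names below are illustrative:

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def logit(p):
    """Inverse of the sigmoid for p strictly between 0 and 1."""
    return math.log(p / (1.0 - p))

a_init = logit(0.1)  # the stored, unconstrained parameter
print(f"initial logit: {a_init:.3f}, opacity: {sigmoid(a_init):.3f}")

# However far the optimizer pushes the logit, the opacity stays inside (0, 1).
for a in (-30.0, 0.0, 30.0):
    print(a, sigmoid(a))
```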

4. Color (spherical harmonic coefficients) $\mathbf{k}_{lm} \in \mathbb{R}^{48}$ (48 parameters). View-dependent color encoded as SH coefficients up to degree $l = 3$, giving $(3+1)^2 = 16$ basis functions per color channel and $16 \times 3 = 48$ total coefficients. The degree-0 (DC) component captures the base diffuse color, while degrees 1-3 capture progressively higher-frequency view-dependent effects (directional shading, specular highlights). The DC term is initialized from the SfM point color; higher-order terms start at zero.
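
As a rough illustration of how such coefficients turn into a color, the sketch below counts coefficients per degree and evaluates only the degree-0 and degree-1 terms. The sign convention of the degree-1 basis and the coefficient layout (one RGB triple per basis function) are assumptions made for the example; implementations differ on both.

```python
import math

def num_sh_coeffs(degree, channels=3):
    """(degree + 1)^2 basis functions per channel."""
    return (degree + 1) ** 2 * channels

C0 = 0.5 * math.sqrt(1.0 / math.pi)    # constant degree-0 basis value
C1 = math.sqrt(3.0 / (4.0 * math.pi))  # magnitude of the degree-1 basis

def eval_sh_deg1(coeffs, direction):
    """coeffs: 4 RGB triples (DC + three degree-1 terms); direction: unit vector."""
    x, y, z = direction
    basis = [C0, -C1 * y, C1 * z, -C1 * x]  # sign convention is an assumption
    return [sum(b * c[ch] for b, c in zip(basis, coeffs)) for ch in range(3)]

print([num_sh_coeffs(d) for d in range(4)])  # coefficient count for degrees 0..3

# With the higher-order terms at zero (their initial value), color is view-independent.
coeffs = [[1.0, 2.0, 3.0], [0.0] * 3, [0.0] * 3, [0.0] * 3]
front = eval_sh_deg1(coeffs, (0.0, 0.0, 1.0))
side = eval_sh_deg1(coeffs, (1.0, 0.0, 0.0))
print(front, side)
```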

Why Gaussians are the ideal primitive:

Smooth falloff: The Gaussian function $G(\mathbf{x}) = \exp(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T\boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}))$ decays smoothly to zero — neighboring Gaussians naturally blend with no hard edges or seam artifacts
Anisotropic: The full 3x3 covariance allows Gaussians to be thin discs (walls), elongated rods (edges), or spherical blobs — conforming to arbitrary geometry
Closed-form projection: Under a local affine approximation of the perspective projection (as in EWA splatting), a 3D Gaussian projects to a 2D Gaussian, enabling efficient screen-space evaluation without numerical integration
Differentiable everywhere: All parameters have smooth, well-defined gradients — the Gaussian function is infinitely differentiable
Compact: 59 scalar parameters per Gaussian. For a typical scene with 2M Gaussians: $2\text{M} \times 59 \times 4$ bytes $= 472$ MB ($\approx 450$ MiB) as raw float32, which can be reduced to ~50-200 MB with compression and quantization
No neural network: Rendering evaluates the analytic Gaussian function and performs alpha compositing — pure math, no learned weights at render time, enabling >100 FPS on consumer GPUs

Key Points

Each Gaussian stores 59 learnable scalars: position $\boldsymbol{\mu} \in \mathbb{R}^3$ (3), rotation quaternion $\mathbf{q} \in \mathbb{R}^4$ (4), scale $\mathbf{s} \in \mathbb{R}^3$ (3), opacity logit (1), and SH coefficients (48)

Covariance decomposed as $\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T$ — quaternion rotation + diagonal scale guarantees positive semi-definiteness without constrained optimization

Initialized from Structure-from-Motion sparse point clouds (COLMAP): position from 3D points, scale from nearest-neighbor distances, DC color from point color, higher SH terms and rotation set to zero/identity

No neural network at render time — rendering evaluates the analytic Gaussian function and alpha-composites, achieving >100 FPS vs NeRF's ~0.03 FPS

Typical scenes use 1-5 million Gaussians (59 parameters each), which is several hundred MB as raw float32 and roughly 50-200 MB after compression, compared to ~5-50 MB of MLP weights for NeRF but with 100-1000x faster rendering

Opacity stored as logit (unconstrained $\mathbb{R}$ ) passed through sigmoid to ensure $\alpha \in [0,1]$ ; scale stored as log-scale for numerical stability during gradient descent
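
The initialization described in the key points can be sketched as follows. Several details are assumptions made for illustration: the scale uses the mean distance to the `k = 3` nearest neighbors, neighbors are found by brute force, and the DC color is stored as raw RGB rather than being converted into the SH basis.

```python
import math

def init_log_scales(points, k=3):
    """Isotropic log-scale per point from the mean distance to its k nearest neighbors."""
    log_scales = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        mean_nn = max(sum(dists[:k]) / k, 1e-7)  # guard against duplicate points
        log_scales.append([math.log(mean_nn)] * 3)
    return log_scales

def init_gaussian(point, rgb, log_scale, sh_degree=3):
    n_basis = (sh_degree + 1) ** 2
    sh = [list(rgb)] + [[0.0, 0.0, 0.0] for _ in range(n_basis - 1)]
    return {
        "mean": list(point),
        "quat": [1.0, 0.0, 0.0, 0.0],          # identity rotation (w, x, y, z)
        "log_scale": list(log_scale),
        "opacity_logit": math.log(0.1 / 0.9),  # inverse sigmoid of 0.1
        "sh": sh,                              # DC from point color, rest zero
    }

points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]
scales = init_log_scales(points)
g = init_gaussian(points[0], (0.5, 0.5, 0.5), scales[0])
n_params = (len(g["mean"]) + len(g["quat"]) + len(g["log_scale"]) + 1
            + sum(len(c) for c in g["sh"]))
print(n_params)
```

Counting the entries of one initialized Gaussian reproduces the 59-scalar budget, which is a useful sanity check when changing the SH degree.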

Mathematical Formulation

3D Gaussian Function

$G(\mathbf{x}) = \exp\left(-\frac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})\right)$

Each Gaussian is an ellipsoidal density centered at the mean $\boldsymbol{\mu} \in \mathbb{R}^3$, with shape and orientation determined by the $3 \times 3$ covariance matrix $\boldsymbol{\Sigma}$. The function evaluates to 1 at the center ($\mathbf{x} = \boldsymbol{\mu}$) and decays in all directions, with the rate of decay along each principal axis set by the eigenvalues of $\boldsymbol{\Sigma}$ (the squared scales $s_x^2, s_y^2, s_z^2$). The quadratic form $(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{x}-\boldsymbol{\mu})$ is the squared Mahalanobis distance: its square root measures how many 'standard deviations' $\mathbf{x}$ lies from the center.
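
A small sketch of evaluating $G(\mathbf{x})$ for one Gaussian. Rather than inverting $\boldsymbol{\Sigma}$, it uses the rotation and scales directly, since $\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T$ implies $\boldsymbol{\Sigma}^{-1} = \mathbf{R}\mathbf{S}^{-2}\mathbf{R}^T$; the function name and argument layout are illustrative.

```python
import math

def gaussian_value(x, mean, R, scales):
    """Return (G(x), squared Mahalanobis distance) for Sigma = R S S^T R^T."""
    d = [xi - mi for xi, mi in zip(x, mean)]
    # Rotate the offset into the Gaussian's local frame: R^T d
    local = [sum(R[k][i] * d[k] for k in range(3)) for i in range(3)]
    maha_sq = sum((l / s) ** 2 for l, s in zip(local, scales))
    return math.exp(-0.5 * maha_sq), maha_sq

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
mean = (0.0, 0.0, 0.0)
scales = (0.1, 2.0, 2.0)  # the wall-like shape: thin along x

at_center, _ = gaussian_value(mean, mean, identity, scales)
one_sigma, m2 = gaussian_value((0.1, 0.0, 0.0), mean, identity, scales)
print(at_center, one_sigma, m2)
```

Moving 0.1 units along the thin axis is already one standard deviation, while the same step along a wide axis would barely change the value.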

Covariance Decomposition

$\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T$

The covariance is factored into a rotation $\mathbf{R}$ (from a quaternion $\mathbf{q} \in \mathbb{R}^4$, normalized to unit length) and a diagonal scaling $\mathbf{S} = \text{diag}(s_x, s_y, s_z)$. Since $\mathbf{S}\mathbf{S}^T$ is always positive semi-definite and $\mathbf{R}$ is orthogonal, the product $\mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T$ is a valid positive semi-definite covariance matrix for any nonzero $\mathbf{q}$ and any $\mathbf{s}$. This eliminates the need for constrained optimization. The cost is 7 parameters (4 quaternion + 3 scale) instead of the 6 of a raw symmetric matrix, in exchange for guaranteed validity.
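
The positive semi-definiteness claim can be verified in one line. For any vector $\mathbf{v} \in \mathbb{R}^3$:

$\mathbf{v}^T \boldsymbol{\Sigma} \mathbf{v} = \mathbf{v}^T \mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T \mathbf{v} = (\mathbf{S}^T\mathbf{R}^T\mathbf{v})^T (\mathbf{S}^T\mathbf{R}^T\mathbf{v}) = \|\mathbf{S}^T\mathbf{R}^T\mathbf{v}\|^2 \ge 0$

If the scales are stored as logarithms, each $s_i = e^{(\cdot)} > 0$, so $\mathbf{S}$ is invertible and the inequality is strict for $\mathbf{v} \neq \mathbf{0}$. In that case $\boldsymbol{\Sigma}$ is positive definite and the inverse $\boldsymbol{\Sigma}^{-1}$ used in $G(\mathbf{x})$ always exists.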

Per-Gaussian Parameter Budget

$|\theta_g| = \underbrace{3}_{\boldsymbol{\mu}} + \underbrace{4}_{\mathbf{q}} + \underbrace{3}_{\mathbf{s}} + \underbrace{1}_{\alpha} + \underbrace{48}_{\text{SH}} = 59 \text{ parameters}$

For a scene with $N = 2$ million Gaussians, the total parameter count is $2\text{M} \times 59 = 118$ million learnable scalars. At 4 bytes each (float32), this is 472 MB (about 450 MiB) uncompressed. In practice, scenes range from 50-200 MB with compression and mixed precision.
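
The same budget can be tabulated for the 1-5 million range mentioned earlier. This is plain arithmetic on the parameter count above and assumes uncompressed float32 storage:

```python
PARAMS_PER_GAUSSIAN = 3 + 4 + 3 + 1 + 48  # mean, quaternion, scale, opacity, SH
BYTES_PER_FLOAT32 = 4

def scene_bytes(num_gaussians):
    return num_gaussians * PARAMS_PER_GAUSSIAN * BYTES_PER_FLOAT32

for n in (1_000_000, 2_000_000, 5_000_000):
    b = scene_bytes(n)
    print(f"{n:>9,} Gaussians: {b / 1e6:6.0f} MB ({b / 2**20:6.0f} MiB)")
```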

Mathematical Intuition

Each primitive is a 3D anisotropic Gaussian $G(\mathbf{x}) = \exp(-\tfrac{1}{2}(\mathbf{x}-\boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1}(\mathbf{x}-\boldsymbol{\mu}))$ with $\boldsymbol{\Sigma} = \mathbf{R}\mathbf{S}\mathbf{S}^T\mathbf{R}^T$. Decomposing into a rotation $\mathbf{R}$ (quaternion, 4 parameters) and a scale $\mathbf{S}$ (3 parameters) keeps $\boldsymbol{\Sigma}$ positive semi-definite during gradient descent, so the shape can be optimized with ordinary unconstrained updates and no projection back onto the set of valid covariances.
