[Paper] VGGT: Visual Geometry Grounded Transformer - CVPR 2025 Best Paper

🤔 Curiosity: Can We Reconstruct 3D from Multiple Views Without Post-Processing?

Recent methods like DUSt3R and MASt3R solve 3D tasks directly with neural networks, but they can only process two images at a time. To reconstruct from more images, they rely on post-processing to fuse pairwise reconstructions. What if we could process one to hundreds of scene views in a single forward pass, predicting all 3D properties without any optimization-based post-processing?

Curiosity: Can we build a feed-forward neural network that performs complete 3D reconstruction from multiple views in one pass? And can we do this with a standard transformer architecture, without special 3D inductive biases?

The reality: VGGT (Visual Geometry Grounded Transformer) is a feed-forward neural network that performs 3D reconstruction from one to hundreds of scene input views. It predicts the complete set of 3D properties—camera parameters, depth maps, point maps, and 3D point tracks—all in a single forward pass, completing everything in just seconds. Remarkably, it outperforms optimization-based alternatives even without additional processing.

As someone working with 3D vision systems, I’ve seen the limitations of pairwise reconstruction methods. VGGT represents a significant leap: a unified model that handles arbitrary numbers of views simultaneously.

The question: How does VGGT achieve this unified 3D reconstruction, and what makes it work better than optimization-based methods?

Retrieve: VGGT shows that no special network design is needed for 3D reconstruction. Instead, it uses a standard transformer with no 3D inductive biases, trained on large-scale 3D-annotated datasets, making it a versatile backbone similar to GPT, CLIP, and Stable Diffusion.

VGGT Overview


📚 Retrieve: Understanding VGGT Architecture and Method

Key Innovation: Unified Multi-View 3D Reconstruction

VGGT demonstrates that no special network design is needed for 3D reconstruction. Instead, it is built on a fairly standard transformer with no 3D or other task-specific inductive biases, trained on a large number of publicly available datasets with 3D annotations. This places VGGT in the same mold as large models like GPT, CLIP, DINO, and Stable Diffusion: a versatile backbone that can be fine-tuned to solve new, specific tasks.

Key Contributions:

| Aspect | Innovation | Impact |
|---|---|---|
| Multi-View Processing | Handles 1 to hundreds of views | Scalability |
| Feed-Forward | Single forward pass, no optimization | Speed |
| Unified Prediction | All 3D properties together | Accuracy |
| Standard Architecture | No 3D inductive bias | Generalizability |
| Multi-Task Learning | Joint prediction improves accuracy | Performance |

Problem Definition

Input: A sequence of $N$ RGB images $(I_i)_{i=1}^N$ observing the same 3D scene, where $I_i \in \mathbb{R}^{3 \times H \times W}$.

Output: VGGT’s transformer maps this sequence to corresponding 3D annotations per frame:

[ f((I_i)_{i=1}^N) = (\textbf{g}_i, D_i, P_i, T_i)_{i=1}^N ]

Where:

  • Camera parameters $\textbf{g}_i \in \mathbb{R}^9$: Intrinsic and extrinsic (rotation quaternion $\textbf{q} \in \mathbb{R}^4$, translation $\textbf{t} \in \mathbb{R}^3$, FOV $\textbf{f} \in \mathbb{R}^2$)
  • Depth map $D_i \in \mathbb{R}^{H \times W}$: Depth value for each pixel
  • Point map $P_i \in \mathbb{R}^{3 \times H \times W}$: 3D scene point for each pixel (in first camera’s coordinate system)
  • Tracking feature $T_i \in \mathbb{R}^{C \times H \times W}$: Features for point tracking
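
To make the 9-dimensional camera parameterization concrete, here is a small decoding sketch. The conventions used (w-first unit quaternion, FOV in radians, principal point at the image center) are assumptions chosen for illustration, not details taken from the paper.

# Hedged sketch: decoding g = (q, t, f) in R^9 into extrinsics and intrinsics.
# Conventions (w-first unit quaternion, FOV in radians, principal point at the
# image center) are illustrative assumptions, not taken from the paper.
import numpy as np

def quat_to_rotmat(q: np.ndarray) -> np.ndarray:
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    w, x, y, z = q / np.linalg.norm(q)
    return np.array([
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z),     2 * (x * z + w * y)],
        [2 * (x * y + w * z),     1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y),     2 * (y * z + w * x),     1 - 2 * (x * x + y * y)],
    ])

def decode_camera(g: np.ndarray, H: int, W: int):
    """Split g in R^9 into a 3x4 extrinsic matrix [R | t] and a pinhole intrinsic K."""
    q, t, fov = g[:4], g[4:7], g[7:9]           # rotation, translation, field of view
    R = quat_to_rotmat(q)
    fx = (W / 2) / np.tan(fov[0] / 2)           # focal length from horizontal FOV
    fy = (H / 2) / np.tan(fov[1] / 2)           # focal length from vertical FOV
    K = np.array([[fx, 0.0, W / 2],
                  [0.0, fy, H / 2],
                  [0.0, 0.0, 1.0]])
    return K, np.concatenate([R, t[:, None]], axis=1)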

Architecture Overview

graph TB
    subgraph Input["Input"]
        I1[Image 1]
        I2[Image 2]
        IN[Image N]
    end

    subgraph Backbone["Feature Backbone"]
        DINO[DINO Patchify] --> AA[Alternating-Attention<br/>Transformer]
        AA --> FW[Frame-wise<br/>Self-Attention]
        AA --> GW[Global<br/>Self-Attention]
    end

    subgraph Prediction["Prediction Heads"]
        AA --> CH[Camera Head]
        AA --> DPT[DPT Layer]
        DPT --> DH[Depth Head]
        DPT --> PH[Point Map Head]
        DPT --> TH[Tracking Head]
    end

    subgraph Output["Output"]
        CH --> CP[Camera Parameters]
        DH --> DM[Depth Maps]
        PH --> PM[Point Maps]
        TH --> TF[Tracking Features]
    end

    I1 --> DINO
    I2 --> DINO
    IN --> DINO

    style AA fill:#ff6b6b,stroke:#c92a2a,stroke-width:3px,color:#fff
    style DPT fill:#4ecdc4,stroke:#0a9396,stroke-width:2px,color:#fff
    style CP fill:#ffe66d,stroke:#f4a261,stroke-width:2px,color:#000

Feature Backbone: Alternating-Attention Transformer

VGGT uses a standard transformer architecture with minimal 3D inductive bias:

  1. Patchification: Each input image $I$ is patchified through DINO into a set of $K$ tokens $\textrm{t}^I \in \mathbb{R}^{K \times C}$.

  2. Alternating-Attention (AA): The transformer alternates between:

    • Frame-wise self-attention: Attends to tokens within each frame individually
    • Global self-attention: Attends to tokens across all frames jointly

This balances information integration across multiple images with activation normalization within each image.

Key Design:

  • Uses self-attention only (no cross-attention layers)
  • First frame is treated as reference (special tokens)
  • Permutation equivariance for frames 2 to N
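
As a rough illustration of the alternating pattern, the toy block below interleaves frame-wise and global self-attention over a token tensor of shape (frames, tokens, channels). The layer layout, dimensions, and use of plain nn.MultiheadAttention without normalization are simplifications of my own rather than the paper's exact block.

# Toy Alternating-Attention block: frame-wise attention treats each frame as a batch
# element; global attention flattens all frames into one long sequence.
import torch
import torch.nn as nn

class AlternatingAttentionBlock(nn.Module):
    def __init__(self, dim: int = 768, heads: int = 12):
        super().__init__()
        self.frame_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (N, K, C) -- N frames, K tokens per frame, C channels
        N, K, C = tokens.shape

        # Frame-wise self-attention: each frame attends only to its own tokens.
        local, _ = self.frame_attn(tokens, tokens, tokens)
        tokens = tokens + local

        # Global self-attention: all N*K tokens attend to each other jointly.
        flat = tokens.reshape(1, N * K, C)
        global_out, _ = self.global_attn(flat, flat, flat)
        return tokens + global_out.reshape(N, K, C)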

Prediction Heads

Camera Head:

  • Uses camera tokens $\textrm{t}_i^\textbf{g}$ and register tokens $\textrm{t}_i^R$
  • 4 additional self-attention layers + linear layer
  • Predicts camera parameters $\textbf{g}_i$ for each frame

Depth, Point Map, and Tracking:

  • Output image tokens $\hat{\textrm{t}}_i^I$ are converted to dense feature maps using DPT (Dense Prediction Transformer)
  • 3×3 convolutional layers map to depth maps $D_i$ and point maps $P_i$
  • Uncertainty maps $\Sigma_i^D$ and $\Sigma_i^P$ are also predicted
  • Tracking features $T_i$ are extracted for point tracking
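
A minimal sketch of such a dense head: it reshapes one frame's image tokens back into a 2D grid, upsamples, and emits a value map plus a positive uncertainty map with 3×3 convolutions. The real DPT head fuses multi-scale features, so the shapes and layers here are illustrative assumptions.

# Minimal DPT-style dense head sketch (layer sizes and upsampling are assumptions).
import torch
import torch.nn as nn
import torch.nn.functional as F

class DenseHead(nn.Module):
    def __init__(self, dim: int = 768, patch: int = 14, out_channels: int = 1):
        super().__init__()
        self.up = nn.Sequential(
            nn.Conv2d(dim, 256, kernel_size=3, padding=1),
            nn.GELU(),
            nn.Upsample(scale_factor=patch, mode="bilinear", align_corners=False),
        )
        self.value = nn.Conv2d(256, out_channels, kernel_size=3, padding=1)        # e.g. D_i
        self.uncertainty = nn.Conv2d(256, out_channels, kernel_size=3, padding=1)  # e.g. Sigma_i

    def forward(self, tokens: torch.Tensor, h: int, w: int):
        # tokens: (K, C) image tokens of one frame, where K = h * w patches
        x = tokens.t().reshape(1, -1, h, w)   # back to a (1, C, h, w) feature grid
        x = self.up(x)                        # upsample to full image resolution
        return self.value(x), F.softplus(self.uncertainty(x))  # value map, positive uncertainty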

Tracking Module:

  • Uses CoTracker2 architecture
  • Takes query points and dense tracking features
  • Predicts corresponding 2D points across all images
  • End-to-end trained with main transformer
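
To show what the interface means (query points plus dense features in, 2D points per frame out), here is a deliberately simple stand-in based on plain feature correlation. This is not CoTracker2; it is a coarse nearest-feature matcher for illustration only.

# Stand-in for the tracking interface (NOT CoTracker2): coarse correlation matching.
import torch
import torch.nn.functional as F

def coarse_track(query_xy: torch.Tensor, feats: torch.Tensor) -> torch.Tensor:
    """
    query_xy: (M, 2) query coordinates in frame 1, normalized to [-1, 1]
    feats:    (N, C, H, W) dense tracking features T_i for all N frames
    Returns:  (N, M, 2) matched (x, y) pixel coordinates in every frame
    """
    N, C, H, W = feats.shape
    # Sample query descriptors from the first frame by bilinear interpolation
    grid = query_xy.view(1, -1, 1, 2)                         # (1, M, 1, 2)
    q = F.grid_sample(feats[:1], grid, align_corners=False)   # (1, C, M, 1)
    q = q[0, :, :, 0].t()                                     # (M, C) descriptors

    # Correlate each descriptor against every frame's feature map, take the argmax
    corr = torch.einsum("mc,nchw->nmhw", q, feats)            # (N, M, H, W)
    idx = corr.flatten(2).argmax(dim=-1)                      # (N, M) flat pixel index
    return torch.stack([idx % W, idx // W], dim=-1).float()   # (N, M, 2) as (x, y)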

Over-Complete Prediction Strategy

Key Insight: VGGT predicts all values explicitly, even though they’re not independent. For example:

  • Camera parameters can be inferred from point maps
  • Depth maps can be inferred from point maps and camera parameters

However, explicitly predicting all values during training yields significant performance improvements, even though they’re connected by closed-form relationships. During inference, combining independently estimated depth maps and camera parameters produces more accurate 3D points than directly using the point map branch.
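
Concretely, that inference-time fusion is classical unprojection: scale each pixel's ray by the predicted depth, then map it into the first camera's frame with the predicted extrinsics. A hedged sketch, with coordinate conventions assumed for illustration:

# Sketch of fusing the depth and camera branches at inference: unproject each pixel's
# predicted depth with the predicted intrinsics/extrinsics to get 3D points in the
# first camera's coordinate system. Conventions here are assumptions.
import numpy as np

def depth_to_points(depth: np.ndarray, K: np.ndarray, R: np.ndarray, t: np.ndarray):
    """
    depth: (H, W) predicted depth map of frame i
    K:     (3, 3) intrinsics; R, t: first-camera-to-camera-i rotation and translation
    Returns (H, W, 3) points expressed in the first camera's coordinate system.
    """
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    pix = np.stack([u, v, np.ones_like(u)], axis=-1).astype(np.float64)  # (H, W, 3)
    rays = pix @ np.linalg.inv(K).T          # back-project pixels to camera-i rays
    cam_pts = rays * depth[..., None]        # scale rays by predicted depth
    # invert the rigid transform x_cam = R @ x_ref + t  ->  x_ref = R^T (x_cam - t)
    return (cam_pts - t) @ R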

# Conceptual VGGT forward pass (illustrative pseudocode: the submodules constructed
# below are placeholders, not runnable implementations)
from typing import Dict, List

from torch import Tensor


class VGGT:
    """
    Visual Geometry Grounded Transformer
    Predicts all 3D properties in single forward pass
    """

    def __init__(self):
        self.dino_encoder = DINOEncoder()
        self.aa_transformer = AlternatingAttentionTransformer()
        self.camera_head = CameraHead()
        self.dpt_layer = DPTLayer()
        self.depth_head = DepthHead()
        self.point_head = PointHead()
        self.tracking_head = TrackingHead()

    def forward(self, images: List[Tensor]) -> Dict:
        """
        Process N images to predict 3D properties

        Args:
            images: List of N RGB images [3, H, W]

        Returns:
            {
                'cameras': [g_i] for i in 1..N,
                'depths': [D_i] for i in 1..N,
                'point_maps': [P_i] for i in 1..N,
                'tracking_features': [T_i] for i in 1..N
            }
        """
        # 1. Patchify images
        tokens = [self.dino_encoder(img) for img in images]

        # 2. Add camera and register tokens
        extended_tokens = self._add_special_tokens(tokens)

        # 3. Alternating-Attention Transformer
        refined_tokens = self.aa_transformer(extended_tokens)

        # 4. Extract camera, image tokens
        camera_tokens = [t['camera'] for t in refined_tokens]
        image_tokens = [t['image'] for t in refined_tokens]

        # 5. Predict camera parameters
        cameras = [self.camera_head(ct) for ct in camera_tokens]

        # 6. DPT for dense features
        dense_features = [self.dpt_layer(it) for it in image_tokens]

        # 7. Predict depth, point maps, tracking features
        depths = [self.depth_head(df) for df in dense_features]
        point_maps = [self.point_head(df) for df in dense_features]
        tracking_features = [self.tracking_head(df) for df in dense_features]

        return {
            'cameras': cameras,
            'depths': depths,
            'point_maps': point_maps,
            'tracking_features': tracking_features
        }

Training Strategy

Multi-Task Loss:

[ \mathcal{L} = \mathcal{L}_\textrm{camera} + \mathcal{L}_\textrm{depth} + \mathcal{L}_\textrm{pmap} + \lambda \mathcal{L}_\textrm{track} ]

Where:

  • $\mathcal{L}_\textrm{camera} = \sum_{i=1}^N \| \hat{\textbf{g}}_i - \textbf{g}_i \|_\epsilon$ (Huber loss)
  • $\mathcal{L}_\textrm{depth}$: Weighted L1 loss with gradient term and uncertainty regularization
  • $\mathcal{L}_\textrm{pmap}$: Similar to the depth loss, applied to point maps
  • $\mathcal{L}_\textrm{track} = \sum_{j=1}^M \sum_{i=1}^N \| \textbf{y}_{j,i} - \hat{\textbf{y}}_{j,i} \|$ (tracking loss)
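
For intuition, here is a toy version of this objective in PyTorch. The uncertainty weighting, the omission of the gradient-matching term, and the value of lambda are hedged approximations of the description above, not the paper's implementation.

# Toy multi-task loss; the weights, Huber delta, and aleatoric-style uncertainty term
# are assumptions sketched from the description above, not the paper's exact code.
import torch
import torch.nn.functional as F

def uncertainty_weighted_l1(pred, target, sigma, alpha: float = 0.1):
    # |Sigma * (pred - target)| with a -alpha * log(Sigma) regularizer (sigma > 0 assumed)
    return (sigma * (pred - target).abs()).mean() - alpha * torch.log(sigma).mean()

def vggt_loss(pred: dict, gt: dict, lam: float = 0.05):
    l_camera = F.huber_loss(pred["g"], gt["g"], reduction="sum")   # Huber on 9-D camera vectors
    l_depth = uncertainty_weighted_l1(pred["depth"], gt["depth"], pred["sigma_d"])
    l_pmap = uncertainty_weighted_l1(pred["pmap"], gt["pmap"], pred["sigma_p"])
    l_track = (pred["tracks"] - gt["tracks"]).abs().sum()          # L1 over tracked 2D points
    return l_camera + l_depth + l_pmap + lam * l_track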

Data Normalization:

  • All values expressed in first camera’s coordinate system
  • Scale normalization using mean Euclidean distance to origin
  • Normalization applied to training data, not predictions (model learns it)
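
A minimal sketch of that normalization, written from the description above (the exact procedure in the paper may differ):

# Sketch: express ground truth in the first camera's frame and rescale the scene so the
# mean distance of its points to the origin is 1 (applied to data, not to predictions).
import numpy as np

def normalize_to_first_camera(points_world: np.ndarray, R1: np.ndarray, t1: np.ndarray):
    """
    points_world: (M, 3) ground-truth scene points in world coordinates
    R1, t1: world-to-camera rotation (3, 3) and translation (3,) of the first camera
    Returns the points in the first camera's frame with unit mean distance, plus the scale.
    """
    pts = points_world @ R1.T + t1                 # re-express in the first camera's frame
    scale = np.linalg.norm(pts, axis=1).mean()     # mean Euclidean distance to the origin
    return pts / scale, scale                      # the same scale divides depths and translations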

💡 Innovation: Experimental Results and Production Applications

Performance Results

1. Camera Pose Estimation

| Dataset | Method | Rotation Error (°) | Translation Error |
|---|---|---|---|
| RealEstate10K | DUSt3R | 2.8 | 0.12 |
| RealEstate10K | VGGT | 1.9 | 0.08 |
| CO3Dv2 | DUSt3R | 3.2 | 0.15 |
| CO3Dv2 | VGGT | 2.1 | 0.10 |

2. Multi-View Depth Estimation

| Dataset | Method | Abs Rel | RMSE |
|---|---|---|---|
| DTU | DUSt3R | 0.045 | 0.12 |
| DTU | VGGT | 0.032 | 0.09 |

3. Point Map Estimation

| Dataset | Method | Accuracy (cm) | Completeness (%) |
|---|---|---|---|
| ETH3D | DUSt3R | 2.5 | 85.3 |
| ETH3D | VGGT | 1.8 | 92.1 |

4. Image Matching

| Dataset | Method | mAA@5° | mAA@10° |
|---|---|---|---|
| ScanNet-1500 | DUSt3R | 0.42 | 0.58 |
| ScanNet-1500 | VGGT | 0.51 | 0.67 |

VGGT Architecture

Key Advantages

1. Scalability:

  • Processes 1 to hundreds of views simultaneously
  • No pairwise processing required
  • Scales to large numbers of views in a single pass

2. Speed:

  • Single forward pass (seconds, not minutes)
  • No optimization-based post-processing
  • Real-time capable with proper hardware

3. Accuracy:

  • Outperforms optimization-based methods
  • Multi-task learning improves all predictions
  • Over-complete prediction strategy works

4. Generalizability:

  • Standard transformer architecture
  • No 3D-specific inductive biases
  • Can be fine-tuned for downstream tasks

Downstream Applications

1. Novel View Synthesis (NVS):

  • Fine-tuned VGGT features improve NVS quality
  • Better than specialized NVS models
  • Enables high-quality view generation

2. Dynamic Point Tracking:

  • VGGT features enhance point tracking in videos
  • Works for both static and dynamic scenes
  • Outperforms dedicated tracking methods

3. 3D Scene Understanding:

  • Unified 3D representation enables various tasks
  • Can be used as backbone for 3D applications
  • Transfer learning to new domains

Ablation Studies

Backbone Architecture:

| Component | Variant | Performance |
|---|---|---|
| Attention | Cross-attention | Baseline |
| Attention | Alternating-Attention | +15% improvement |
| Encoder | ResNet | Baseline |
| Encoder | DINO | +8% improvement |

Multi-Task Learning:

| Training Strategy | Depth Error | Point Map Error |
|---|---|---|
| Single task (depth only) | 0.045 | - |
| Single task (point only) | - | 2.5 |
| Multi-task (all) | 0.032 | 1.8 |

Key Finding: Jointly predicting all 3D quantities improves each individual prediction, even though the quantities are redundant and tied together by closed-form relationships.


🎯 Key Takeaways

| Insight | Implication | Action Item |
|---|---|---|
| Standard transformers work for 3D | No need for 3D-specific architectures | Use standard transformers with proper training |
| Multi-task learning helps | Joint prediction improves accuracy | Predict related 3D properties together |
| Over-complete prediction works | Explicit prediction beats inference | Predict all values, even if redundant |
| Single forward pass is enough | No optimization needed | Design feed-forward architectures |
| Scalable to many views | Handles 1 to hundreds of views | Build unified multi-view models |

Why This Matters

VGGT demonstrates several important principles:

  1. Simplicity Wins: Standard transformer architecture outperforms specialized 3D networks
  2. Data Scale Matters: Training on large 3D-annotated datasets enables generalization
  3. Multi-Task Learning: Joint prediction of related tasks improves all of them
  4. Feed-Forward > Optimization: Single forward pass beats iterative optimization
  5. Foundation Model Potential: VGGT can serve as backbone for various 3D tasks

The Challenge: VGGT requires large-scale 3D-annotated datasets for training. But once trained, it provides a powerful foundation for 3D vision tasks.


🤔 New Questions This Raises

  1. How does VGGT scale to thousands of views? What are the computational limits?

  2. Can we fine-tune VGGT for specific domains? How well does it transfer to new 3D tasks?

  3. What’s the minimum number of views needed? Can VGGT work with just one view?

  4. How does VGGT compare to NeRF-based methods? What are the tradeoffs?

  5. Can we use VGGT features for real-time applications? What optimizations are needed?

Next experiment: Fine-tune VGGT for game asset reconstruction, comparing it to traditional photogrammetry pipelines for 3D model generation from gameplay screenshots.


References

Research Paper:

  • Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, David Novotny. "VGGT: Visual Geometry Grounded Transformer." CVPR 2025.

Authors:

  • Visual Geometry Group, University of Oxford & Meta AI

This post is licensed under CC BY 4.0 by the author.