#4 Generation Models Know Space: Unleashing Implicit 3D Priors for Scene Understanding

Score: 14.2 | Matched keywords: diffusion, large language models, multimodal, reasoning, token

Detailed Summary (EN)

Problem definition

Recent advancements in video generation models [3, 33, 37, 44, 66] have reshaped our expectations of visual systems, moving beyond high-fidelity generation to acting as interactive world models [38, 65, 70].
To generate a plausible video, the model inherently aligns appearance with 3D geometry: occlusion requires persistent object identity, camera motion reveals depth-dependent apparent motion, and interactions must follow consistent dynamics.
These constraints encourage latent representations that encode geometry-consistent structure and motion, yielding a strong learned 3D prior without explicit 3D supervision [39, 60].
This raises a compelling research question: if video generators already possess an implicit understanding of space and physics, can these implicit physical priors be repurposed to improve downstream 3D visual understanding?

Core idea & method

[Figure: paradigm comparison. Panels: (a) Explicit 3D Dependency; (b) Extra Geometric Supervision; (c) Generative-Prior Enhanced Paradigm (Ours). Labeled components: Large Language Model, Text Encoder, Point Encoder, Visual Encoder, Generative Enc., Extra 3D Teacher/Reconstruction Module, Adaptive Gated Fusion. Example query: "What's placed in a row next to the kitchen table?"]
Unlike methods relying on (a) explicit 3D inputs or (b) complex geometric supervision, (c) our VEGA-3D extracts implicit priors from video generation models.
By repurposing them as Latent World Simulators, we achieve (d) superior performance without external 3D dependencies.
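The digest names Adaptive Gated Fusion but does not give its equations, so the following is only a minimal sketch of one common gating form, assuming a per-channel gate g = sigmoid(W [s; z] + b) and a fused token g * s + (1 - g) * z. The function name `gated_fusion`, the parameters `gate_w` and `gate_b`, and the assumption that semantic and generative tokens are already aligned one-to-one with equal width are all hypothetical and may differ from VEGA-3D's actual design.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_fusion(semantic, generative, gate_w, gate_b):
    """Blend two aligned token sequences with a learned per-channel gate.

    semantic, generative: lists of token vectors with the same shape.
    gate_w: dim rows of length 2 * dim; gate_b: dim biases.
    Hypothetical form (an assumption, not taken from the paper):
        g = sigmoid(W [s; z] + b),  fused = g * s + (1 - g) * z
    """
    fused = []
    for s, z in zip(semantic, generative):
        joint = s + z  # concatenate both feature vectors for this token
        gate = [
            sigmoid(sum(w * v for w, v in zip(row, joint)) + b)
            for row, b in zip(gate_w, gate_b)
        ]
        # Each output channel is a convex combination of the two inputs.
        fused.append([g * si + (1.0 - g) * zi for g, si, zi in zip(gate, s, z)])
    return fused


if __name__ == "__main__":
    dim = 2
    semantic = [[1.0, 3.0], [0.5, -0.5]]     # placeholder semantic tokens
    generative = [[3.0, 1.0], [1.5, 0.5]]    # placeholder generative tokens
    zero_w = [[0.0] * (2 * dim) for _ in range(dim)]
    zero_b = [0.0] * dim
    # An all-zero gate gives sigmoid(0) = 0.5, i.e. a plain average.
    print(gated_fusion(semantic, generative, zero_w, zero_b))
```

Because the gate lies in (0, 1), every fused channel stays between the semantic and generative values, so under this assumed form the generative features can only shift the semantic token toward themselves, never overwrite it beyond their own value.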

Experimental setup & results

[…] to establish a relative performance improvement, and then averaging them into a single scalar.
As shown in Fig. 3, plotting the Correspondence Score against NOS reveals a distinct positive correlation, confirming that multi-view consistency is a strong predictor of 3D performance.
Furthermore, the results highlight a significant architectural divergence.
Models based on UNet architectures (e.g., SVD [4], …) [X. Wu et al., p. 8]
We attribute this to the local inductive bias of convolutions and the insufficient scale of data, which limit the receptive field and hinder long-range geometric alignment.
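The scoring procedure in this section (per-metric relative improvement, averaged into one scalar, then compared against the Correspondence Score) can be sketched as follows. The exact normalization is cut off in the text, so the formula (model - baseline) / baseline is an assumption, as are the function names; the numbers in the demo are synthetic placeholders and are not results from the paper.

```python
import math


def normalized_overall_score(model_metrics, baseline_metrics):
    """Average the per-metric relative improvements over a baseline.

    Assumed normalization: (model - baseline) / baseline for each metric.
    """
    ratios = [
        (model_metrics[name] - baseline_metrics[name]) / baseline_metrics[name]
        for name in baseline_metrics
    ]
    return sum(ratios) / len(ratios)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Synthetic placeholder values, chosen only to exercise the functions.
    baseline = {"task_a": 50.0, "task_b": 40.0}
    candidates = [
        {"task_a": 51.0, "task_b": 40.5},
        {"task_a": 53.0, "task_b": 42.0},
        {"task_a": 56.0, "task_b": 44.0},
    ]
    correspondence = [0.30, 0.45, 0.70]  # hypothetical Correspondence Scores
    nos = [normalized_overall_score(c, baseline) for c in candidates]
    print("NOS per model:", nos)
    print("Pearson r:", pearson(correspondence, nos))
```

A coefficient close to 1 on real measurements would correspond to the "distinct positive correlation" described above; Pearson correlation is used here only as one reasonable way to quantify that trend.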

Limitations & risks

We introduce VEGA-3D, a plug-and-play framework that repurposes modern video generation models as Latent World Simulators to mitigate the spatial blindness of MLLMs.
By activating these priors via noise injection and aligning them with semantic tokens through Adaptive Gated Fusion, VEGA-3D injects dense geometric anchors into MLLMs, consistently improving scene understanding, spatial reasoning, and manipulation without extra 3D supervision.
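The noise-injection step mentioned above can be illustrated with a standard DDPM-style forward perturbation, x_t = sqrt(alpha_bar) * x_0 + sqrt(1 - alpha_bar) * eps. This is a sketch under stated assumptions: the digest does not say which noise schedule or noise level VEGA-3D uses, and the name `inject_noise` and the parameter `alpha_bar` are hypothetical. Reading features from the generator's intermediate layers after this step is not shown.

```python
import math
import random


def inject_noise(latent, alpha_bar, rng):
    """Perturb a clean latent vector with Gaussian noise.

    Returns sqrt(alpha_bar) * latent + sqrt(1 - alpha_bar) * eps with
    eps ~ N(0, 1), the variance-preserving forward step used by DDPM-style
    diffusion models (an assumed choice, not confirmed by the digest).
    """
    if not 0.0 <= alpha_bar <= 1.0:
        raise ValueError("alpha_bar must lie in [0, 1]")
    signal_scale = math.sqrt(alpha_bar)
    noise_scale = math.sqrt(1.0 - alpha_bar)
    return [signal_scale * v + noise_scale * rng.gauss(0.0, 1.0) for v in latent]


if __name__ == "__main__":
    rng = random.Random(0)
    clean_latent = [0.5, -1.0, 2.0]  # placeholder latent values
    # alpha_bar near 1 keeps most of the signal; near 0 is almost pure noise.
    for alpha_bar in (0.99, 0.5, 0.01):
        print(alpha_bar, inject_noise(clean_latent, alpha_bar, rng))
```

The noise level is the main design knob in this sketch: a lightly noised input stays close to the observed scene, while a heavily noised one forces the generator to rely more on its learned prior.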
Limitations and Future Work.
Incorporating a video diffusion backbone increases inference cost (Fig. …).

Read-like-fullpaper digest

This paper asks whether the implicit 3D priors that video generation models acquire without explicit 3D supervision can be repurposed to improve downstream 3D visual understanding. The core method, VEGA-3D, is a plug-and-play framework that treats a video generation model as a Latent World Simulator: it activates the model's priors via noise injection and aligns them with semantic tokens through Adaptive Gated Fusion. Key empirical findings include a distinct positive correlation between the Correspondence Score and NOS, which indicates that multi-view consistency is a strong predictor of 3D performance, and a significant divergence between architectures. The stated limitation is that the video diffusion backbone increases inference cost.

Detailed Summary (KO, translated)

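Problem definition

Recent advances in video generation models [3, 33, 37, 44, 66] have reshaped our expectations of visual systems, which now go beyond high-fidelity generation to act as interactive world models [38, 65, 70].
To generate a plausible video, a model inherently aligns appearance with 3D geometry: occlusion requires persistent object identity, camera motion reveals depth-dependent apparent motion, and interactions must follow consistent dynamics.
These constraints encourage latent representations that encode geometrically consistent structure and motion, yielding a strong learned 3D prior without explicit 3D supervision [39, 60].
This raises a compelling research question: if video generators already hold an implicit understanding of space and physics, can these implicit physical priors be reused to improve downstream 3D visual understanding?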

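Core idea & method

[Figure: paradigm comparison. Panels: (a) Explicit 3D Dependency; (b) Extra Geometric Supervision; (c) Generative-Prior Enhanced Paradigm (Ours). Labeled components: Large Language Model, Text Encoder, Point Encoder, Visual Encoder, Generative Enc., Extra 3D Teacher/Reconstruction Module, Adaptive Gated Fusion. Example query: "What's placed in a row next to the kitchen table?"]
Unlike methods that rely on (a) explicit 3D inputs or (b) complex geometric supervision, (c) VEGA-3D extracts implicit priors from video generation models.
By repurposing them as Latent World Simulators, we achieve (d) superior performance without external 3D dependencies.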

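Experimental setup & results

A relative performance improvement is established for each result, and these are then averaged into a single scalar.
In Fig. 3, plotting the Correspondence Score against NOS shows a distinct positive correlation, confirming that multi-view consistency is a strong predictor of 3D performance.
The results also highlight a significant architectural divergence.
Models based on UNet architectures (e.g., SVD [4], …)
We attribute this to the local inductive bias of convolutions and the insufficient scale of data, which limit the receptive field and hinder long-range geometric alignment.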

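Limitations & risks

We introduce VEGA-3D, a plug-and-play framework that repurposes modern video generation models as Latent World Simulators to mitigate the spatial blindness of MLLMs.
By activating these priors via noise injection and aligning them with semantic tokens through Adaptive Gated Fusion, VEGA-3D injects dense geometric anchors into MLLMs, consistently improving scene understanding, spatial reasoning, and manipulation without extra 3D supervision.
Limitations and Future Work.
Incorporating a video diffusion backbone increases inference cost (Fig. 1).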

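Read-like-fullpaper digest

The paper starts from the observation that video generation models now act as interactive world models and asks whether their implicit spatial and physical priors can improve downstream 3D visual understanding. Its method, VEGA-3D, repurposes a video generation model as a Latent World Simulator, activating its priors via noise injection and aligning them with semantic tokens through Adaptive Gated Fusion. Empirically, relative performance improvements are averaged into a single scalar (NOS), which correlates positively with the Correspondence Score, and the results diverge significantly across architectures. The stated limitation is the added inference cost of the video diffusion backbone.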