#9 3DCity-LLM: Empowering Multi-modality Large Language Models for 3D City-scale Perception and Understanding

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles 3D city-scale perception and understanding. Designing multi-modality large language models (MLLMs) at this scale requires not only recognizing individual objects but also modeling their interactions, functional roles, and contextual significance within the broader urban system. Because LLMs generate diverse, long-form responses for open-ended urban tasks, traditional text-similarity metrics (e.g., BLEU, ROUGE, METEOR) fail to capture semantic equivalence, especially for complex open-ended questions. As a result, answers that are logically coherent and factually accurate may be penalized simply because their wording, sentence structure, or narrative style differs from the ground truth.

The core proposal is 3DCity-LLM, trained on a new large-scale dataset. To facilitate large-scale training, the authors introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. Extensive experiments on two benchmarks are reported to show that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising direction for advancing spatial reasoning and urban intelligence. The authors also apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluation of all methods. City-scale tasks of this kind highlight the need for a unified framework that can simultaneously perform 3D object perception, relationship calculation, and holistic scene understanding.

The empirical case is built around the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. The paper also illustrates the evaluation problem with a pair of answers that are both correct, yet would receive different scores from traditional metrics.

The central reported finding is that two answers can both be correct, yet traditional metrics would assign them different scores.

The paper also makes clear that a dependency on static generation templates (e.g., localization, measurement, functionality, and logical reasoning) results in syntactic homogeneity, which restricts linguistic diversity and the capacity for open-ended urban understanding. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: Two answers can both be correct, yet traditional metrics would assign them different scores.
Important caution: A dependency on static generation templates (e.g., localization, measurement, functionality, and logical reasoning) results in syntactic homogeneity, which restricts linguistic diversity and the capacity for open-ended urban understanding.

Problem definition

Designing multi-modality large language models (MLLMs) at this scale requires not only recognizing individual objects but also modeling their interactions, functional roles, and contextual significance within the broader urban system.
As LLMs generate diverse, long-form responses for open-ended urban tasks, traditional text-similarity metrics (e.g., BLEU, ROUGE, METEOR) fail to capture semantic equivalence, especially for complex open-ended questions.
As a result, answers that are logically coherent and factually accurate may be penalized simply because they adopt different wording, sentence structure, or narrative style from the ground truth.
Unlike indoor benchmarks that involve a limited number of objects, a city scene usually contains thousands of entities with heterogeneous attributes and intricate spatial relationships.
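
The metric problem described above can be illustrated with a minimal sketch. It assumes a simplified unigram-overlap F1 as a stand-in for n-gram metrics such as BLEU or ROUGE; it is not the scoring used in the paper, and the function name and example sentences are hypothetical.

```python
from collections import Counter


def unigram_f1(candidate: str, reference: str) -> float:
    """Token-overlap F1, a simplified stand-in for n-gram text-similarity metrics."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    # Multiset intersection counts each shared token at most as often as it
    # appears in both strings.
    overlap = sum((cand & ref).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(cand.values())
    recall = overlap / sum(ref.values())
    return 2 * precision * recall / (precision + recall)


# Hypothetical sentences, not taken from the paper.
reference = "the hospital is north of the park"
answer_a = "the hospital is north of the park"
answer_b = "the medical center lies to the north side of the green space"

score_a = unigram_f1(answer_a, reference)  # identical wording
score_b = unigram_f1(answer_b, reference)  # same meaning, different wording
print(f"answer_a: {score_a:.2f}, answer_b: {score_b:.2f}")
```

Both answers describe the same spatial fact, but the paraphrase scores well below the verbatim answer because only surface tokens are compared.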

Core idea & method

To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning.
Extensive experiments on two benchmarks demonstrate that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence.
Furthermore, we apply a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment to ensure faithful and comprehensive evaluations for all methods.
Such tasks highlight the need for a unified framework that can simultaneously perform 3D object perception, relationship calculation, and holistic scene understanding.
Prior studies (…, 2024) have shown that language-centric architectures can be adapted for cross-modality understanding.
Designing multi-modality large language models (MLLMs) at this scale requires not only recognizing individual objects but also modeling their interactions, functional roles, and contextual significance within the broader urban system.
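
The multi-dimensional evaluation protocol mentioned above pairs text-similarity metrics with an LLM-based semantic assessment. The following is a rough sketch of that idea only, under stated assumptions: `token_overlap` is a crude lexical score, `stub_judge` is a hypothetical placeholder for a call to an LLM judge, and none of these names or the report layout come from the paper.

```python
from typing import Callable, Dict


def token_overlap(candidate: str, reference: str) -> float:
    """Fraction of reference tokens that also appear in the candidate."""
    ref_tokens = reference.lower().split()
    cand_tokens = set(candidate.lower().split())
    if not ref_tokens:
        return 0.0
    return sum(t in cand_tokens for t in ref_tokens) / len(ref_tokens)


def evaluate(
    candidate: str,
    reference: str,
    judge: Callable[[str, str], float],
) -> Dict[str, float]:
    """Report lexical and semantic scores side by side instead of merging them."""
    return {
        "lexical": token_overlap(candidate, reference),
        "semantic": judge(candidate, reference),
    }


def stub_judge(candidate: str, reference: str) -> float:
    # Hypothetical placeholder: a real judge would prompt an LLM to rate
    # semantic agreement. Here it always returns full agreement.
    return 1.0


report = evaluate(
    "the clinic sits beside the river",
    "the hospital is next to the river",
    stub_judge,
)
print(report)
```

Keeping the two dimensions separate makes it visible when an answer is penalized on wording alone while the semantic score remains high.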

Actual findings

Both answers are correct, yet traditional metrics would assign them different scores.

How the conclusion was reached

Step 1 - Proposed approach: To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning.
Step 2 - Evaluation setup or comparison basis: To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning.
Step 3 - Main reported evidence: Both answers are correct, yet traditional metrics would assign them different scores.
Step 5 - Claim boundary / limitation: However, its dependency on static generation templates (e.g., localization, measurement, functionality, and logical reasoning) results in syntactic homogeneity, which restricts linguistic diversity and the capacity for open-ended urban understanding.

Experimental setup & results

Both answers are correct, yet traditional metrics would assign them different scores.

Limitations & risks

However, its dependency on static generation templates (e.g., localization, measurement, functionality, and logical reasoning) results in syntactic homogeneity, which restricts linguistic diversity and the capacity for open-ended urban understanding.
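
To make the homogeneity concern concrete, here is a small sketch of template-based question generation. The template wording, object names, and helper functions are all hypothetical; only the category names come from the text above.

```python
# Hypothetical templates keyed by category names from the text above;
# the question wording is invented for illustration.
TEMPLATES = {
    "localization": "Where is the {obj} located?",
    "measurement": "How tall is the {obj}?",
}
OBJECTS = ["school", "bridge", "tower"]


def generate(category: str, objects: list) -> list:
    """Fill one static template with each object name."""
    return [TEMPLATES[category].format(obj=o) for o in objects]


def skeleton(question: str, obj: str) -> str:
    """Mask the object name to expose the underlying sentence structure."""
    return question.replace(obj, "<OBJ>")


questions = generate("localization", OBJECTS)
skeletons = {skeleton(q, o) for q, o in zip(questions, OBJECTS)}
print(questions)
print(skeletons)
```

However many objects are supplied, every generated question collapses to the same skeleton, which is the kind of syntactic homogeneity the limitation describes.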

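Detailed Summary (KO)

Read-like-fullpaper digest

This paper holds that designing multi-modality large language models (MLLMs) at this scale requires not only recognizing individual objects but also modeling their interactions, functional roles, and contextual significance within the broader urban system. Because LLMs generate diverse, long-form responses for open-ended urban tasks, traditional text-similarity metrics (e.g., BLEU, ROUGE, METEOR) fail to capture semantic equivalence, especially for complex open-ended questions. As a result, answers that are logically coherent and factually accurate may be penalized simply because their wording, sentence structure, or narrative style differs from the ground truth. The core proposal, intended to facilitate large-scale training, is the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning. Extensive experiments on two benchmarks are reported to show that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence. In addition, a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment is applied to ensure faithful and comprehensive evaluation of all methods. Such tasks highlight the need for a unified framework that can simultaneously perform 3D object perception, relationship calculation, and holistic scene understanding. The empirical case is built around the same 3DCity-LLM-1.2M dataset of approximately 1.2 million high-quality samples across seven representative task categories. Both answers are correct, yet traditional metrics assign them different scores. The central reported finding is that both answers are correct, yet traditional metrics assign them different scores. However, the paper makes clear that a dependency on static generation templates (e.g., localization, measurement, functionality, and logical reasoning) results in syntactic homogeneity, which restricts linguistic diversity and the capacity for open-ended urban understanding. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should be read in light of the evaluation setup and stated limitations.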

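Final takeaway

Main takeaway: Both answers are correct, yet traditional metrics assign them different scores.
Important caution: However, a dependency on static generation templates (e.g., localization, measurement, functionality, and logical reasoning) results in syntactic homogeneity, which restricts linguistic diversity and the capacity for open-ended urban understanding.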

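Problem definition

Designing multi-modality large language models (MLLMs) at this scale requires not only recognizing individual objects but also modeling their interactions, functional roles, and contextual significance within the broader urban system.
Because LLMs generate diverse, long-form responses for open-ended urban tasks, traditional text-similarity metrics (e.g., BLEU, ROUGE, METEOR) fail to capture semantic equivalence, especially for complex open-ended questions.
As a result, answers that are logically coherent and factually accurate may be penalized simply because their wording, sentence structure, or narrative style differs from the ground truth.
Unlike indoor benchmarks that involve a limited number of objects, a city scene usually contains thousands of entities with heterogeneous attributes and intricate spatial relationships.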

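Core idea & method

To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning.
Extensive experiments on two benchmarks are reported to show that 3DCity-LLM significantly outperforms existing state-of-the-art methods, offering a promising and meaningful direction for advancing spatial reasoning and urban intelligence.
In addition, a multi-dimensional protocol based on text-similarity metrics and LLM-based semantic assessment is applied to ensure faithful and comprehensive evaluation of all methods.
Such tasks highlight the need for a unified framework that can simultaneously perform 3D object perception, relationship calculation, and holistic scene understanding.
Prior studies (…, 2024) have shown that language-centric architectures can be adapted for cross-modality understanding.
Designing multi-modality large language models (MLLMs) at this scale requires not only recognizing individual objects but also modeling their interactions, functional roles, and contextual significance within the broader urban system.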

Actual findings

Both answers are correct, yet traditional metrics assign them different scores.

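How the conclusion was reached

Step 1 - Proposed approach: To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning.
Step 2 - Evaluation setup or comparison basis: To facilitate large-scale training, we introduce the 3DCity-LLM-1.2M dataset, which comprises approximately 1.2 million high-quality samples across seven representative task categories, ranging from fine-grained object analysis to multi-faceted scene planning.
Step 3 - Main reported evidence: Both answers are correct, yet traditional metrics assign them different scores.
Step 5 - Claim boundary / limitation: However, a dependency on static generation templates (e.g., localization, measurement, functionality, and logical reasoning) results in syntactic homogeneity, which restricts linguistic diversity and the capacity for open-ended urban understanding.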

Experimental setup & results

Both answers are correct, yet traditional metrics assign them different scores.

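Limitations & risks

However, a dependency on static generation templates (e.g., localization, measurement, functionality, and logical reasoning) results in syntactic homogeneity, which restricts linguistic diversity and the capacity for open-ended urban understanding.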