#1 Can Large Multimodal Models Inspect Buildings? A Hierarchical Benchmark for Structural Pathology Reasoning

Score: 20.6 | Matched keywords: ai, ai agents, benchmark, foundation models, multimodal, reasoning

Detailed Summary (EN)

Problem definition

The maintenance of aging urban building infrastructure is a critical challenge for public safety and smart city resilience [41, 48].
Traditionally, this domain has been dominated by specialized discriminative models, such as YOLO, R-CNN, and VGG [21, 40].
While these models excel at passive perception (i.e., localizing defects at the bounding-box level), they fundamentally lack the cognitive capability for active diagnosis.
Specifically, conventional discriminative approaches fail to capture high-order semantic relationships, thereby limiting current systems to rudimentary detection rather than actionable engineering diagnosis.

Core idea & method

To bridge this gap, we introduce a human-in-the-loop semi-automated annotation framework, leveraging expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology.
Building on this foundation, we present DefectBench, the first multi-dimensional benchmark designed to interrogate LMMs beyond basic semantic recognition.
DefectBench evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation.
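The unification step described above, mapping dataset-specific labels into one hierarchical ontology, can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the child defect types, the dataset names, and the label maps are hypothetical placeholders rather than the DefectBench taxonomy, and routing unmapped labels to a reviewer is only one plausible reading of "expert-proposal verification".

```python
from typing import Optional, Tuple

# Hypothetical two-level ontology: parent category -> child defect types.
# Parent names appear in the digest; child names are invented placeholders.
ONTOLOGY = {
    "Crack": ["hairline_crack", "structural_crack"],
    "Surface Stain": ["water_stain", "rust_stain"],
}

# Hypothetical per-dataset label maps: (dataset, raw label) -> child type.
LABEL_MAP = {
    ("dataset_a", "crack_thin"): "hairline_crack",
    ("dataset_b", "fissure"): "structural_crack",
    ("dataset_b", "rust"): "rust_stain",
}

# Reverse index so each child type resolves to its parent category.
CHILD_TO_PARENT = {
    child: parent for parent, children in ONTOLOGY.items() for child in children
}


def unify(dataset: str, raw_label: str) -> Optional[Tuple[str, str]]:
    """Return (parent, child) for a raw label, or None if it is unmapped.

    A None result stands for "send to a human expert for review".
    """
    child = LABEL_MAP.get((dataset, raw_label))
    if child is None:
        return None
    return CHILD_TO_PARENT[child], child


if __name__ == "__main__":
    print(unify("dataset_b", "fissure"))   # mapped automatically
    print(unify("dataset_a", "spalling"))  # unmapped, needs expert review
```

Keeping the ontology and the per-dataset maps as separate tables means a new source dataset only needs a new label map, while the hierarchy itself stays fixed.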

Experimental setup & results

DefectBench, built on 12 source datasets unified through a human-in-the-loop semi-automated annotation framework, evaluates 18 state-of-the-art (SOTA) LMMs across three escalating cognitive dimensions: Semantic Perception, Spatial Localization, and Generative Geometry Segmentation.
Extensive experiments reveal that while current LMMs demonstrate exceptional topological awareness and semantic understanding (effectively diagnosing "what" and "how"), they exhibit significant deficiencies in metric localization precision ("where").
Crucially, however, we validate the viability of zero-shot generative segmentation, showing that general-purpose foundation models can rival specialized supervised networks without domain-specific training.
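To make "metric localization precision" concrete, the sketch below computes intersection-over-union (IoU) between a predicted and a ground-truth bounding box. This is an assumption: the digest does not say which localization metric DefectBench uses, and IoU is shown only as a common choice for this kind of evaluation. The function name and box format are illustrative.

```python
from typing import Sequence


def box_iou(a: Sequence[float], b: Sequence[float]) -> float:
    """IoU of two axis-aligned boxes given as (x_min, y_min, x_max, y_max)."""
    # Corners of the overlapping rectangle, if any.
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])

    # Clamp at zero so disjoint boxes yield no intersection.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 10, 10)
    ground_truth = (5, 0, 15, 10)
    # Overlap is 5 x 10 = 50, union is 100 + 100 - 50 = 150.
    print(box_iou(predicted, ground_truth))
```

A model can name the right defect and still score poorly here: in the example the prediction covers half of the true region, yet the IoU is only one third.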

Limitations & risks

A key limitation lies in the models’ current capacity for precise spatial quantification and logical enumeration.
The models also exhibit polarized performance across defect categories.
Crack and External Fixings are the most polarized categories: Claude achieves near-zero MAE (0.0588 and 0.0444, respectively), whereas lightweight models such as Qwen2.5-VL-7B show high Relative Error (RE).
Conversely, the Surface Stain category poses a universal challenge, yielding consistently high MAE across all tested models.
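The counting metrics quoted above can be sketched as follows. The MAE formula is standard; the relative-error definition (absolute error divided by the true count, averaged over images with a non-zero true count) is an assumption, since the digest does not give the paper's exact formula. The counts in the example are made up.

```python
from typing import Sequence


def mean_absolute_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average absolute difference between predicted and true defect counts."""
    if len(pred) != len(true) or not true:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(pred, true)) / len(true)


def mean_relative_error(pred: Sequence[float], true: Sequence[float]) -> float:
    """Average of |pred - true| / true, skipping images whose true count is 0."""
    if len(pred) != len(true):
        raise ValueError("pred and true must have equal length")
    terms = [abs(p - t) / t for p, t in zip(pred, true) if t != 0]
    return sum(terms) / len(terms) if terms else 0.0


if __name__ == "__main__":
    predicted_counts = [2, 3, 1, 4]  # hypothetical model outputs per image
    true_counts = [2, 4, 1, 2]       # hypothetical ground-truth counts
    print(mean_absolute_error(predicted_counts, true_counts))
    print(mean_relative_error(predicted_counts, true_counts))
```

The two metrics can disagree: missing one defect out of four adds 1 to the absolute error but only 0.25 to the relative error, while overcounting by two on an image with two defects adds 1.0 to the relative error.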

Read-like-fullpaper digest

This paper addresses the maintenance of aging urban building infrastructure, a critical challenge for public safety and smart city resilience [41, 48]. The core method is a human-in-the-loop semi-automated annotation framework that leverages expert-proposal verification to unify 12 fragmented datasets into a standardized, hierarchical ontology, which underpins the DefectBench benchmark. Key empirical findings are that current LMMs show strong topological awareness and semantic understanding but significant deficiencies in metric localization precision, and that zero-shot generative segmentation with general-purpose foundation models can rival specialized supervised networks.
