#9 Colon-Bench: An Agentic Workflow for Scalable Dense Lesion Annotation in Full-Procedure Colonoscopy Videos

Score: 13.2 | Matched keywords: ai, benchmark, large language models, multimodal

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles dense lesion annotation of full-procedure colonoscopy videos. To address the inherent visual challenges of colonoscopy videos, recent methods have proposed specialized architectures optimized for pixel-level precision and subtle lesion identification [27,17]. Because colonoscopy procedures are inherently long, efficient spatiotemporal analysis is critical; approaches using state-space models [24] and token merging [26] have been introduced to capture long-range dependencies while reducing computational redundancy. Manual dense annotation of colonoscopy videos is labor-intensive and inconsistent, which motivates the authors' scalable, affordable pipeline for dense colonoscopy video annotation, intended to facilitate AI research, evaluation, and applications in colonoscopy.

The core proposal is a multi-stage agentic pipeline, applied to 60 video sequences from the REAL-COLON dataset [3] to construct dense colonoscopy annotations. The success of the Gemini-3 vision model in medical domains [4] prompted the use of Gemini-3 as one tool in the annotation pipeline. To enable efficient review at scale, the pipeline pre-rendered short clips with spatial overlays into an interactive web interface. The agentic workflow (Section 2.1 of the paper) produces automatic annotations with quality control.

The empirical case is built around the gap the work addresses: the need for a comprehensive, densely annotated, long-sequence dataset to ground the spatiotemporal analysis of real colonoscopies. The authors also analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs.

The central reported finding is surprisingly high localization performance in medical domains compared to SAM-3. In addition, the "colon-skill" prompting strategy, derived from an analysis of common MLLM VQA errors, improves zero-shot MLLM performance by up to 9.7% across most MLLMs.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: A multi-stage agentic pipeline with a final human review offers a scalable, affordable way to densely annotate full-procedure colonoscopy videos, applied here to 60 sequences from the REAL-COLON dataset [3].
Most important supporting result: Surprisingly high localization performance in medical domains compared to SAM-3 is reported.

Problem definition

To address the inherent visual challenges of colonoscopy videos, recent methods have proposed specialized architectures optimized for pixel-level precision and subtle lesion identification [27,17].
Furthermore, because colonoscopy procedures are inherently long, efficient spatiotemporal analysis is critical; approaches utilizing state-space models [24] and token merging [26] have been introduced to capture long-range dependencies while reducing computational redundancy.
Moreover, manual dense annotation of colonoscopy videos is labor-intensive and inconsistent, which motivates our work on a scalable, affordable pipeline for dense colonoscopy video annotation, facilitating AI research, evaluation, and applications in colonoscopy.
These architectural and synthetic workarounds underscore the critical gap in the field that our work addresses: the need for a comprehensive, densely annotated, long-sequence dataset to ground the spatiotemporal analysis of real colonoscopies.

Core idea & method

The dense colonoscopy annotations were constructed by applying a multi-stage agentic pipeline to 60 video sequences from the REAL-COLON dataset [3].
The success of the Gemini-3 vision model in medical domains [4] prompted the use of Gemini-3 as one tool in the annotation pipeline.
To enable efficient review at scale, the pipeline pre-rendered short clips with spatial overlays into an interactive web interface.
The agentic workflow (Section 2.1 of the paper) produces automatic annotations with quality control.
Initial video verification followed by EdgeTAM [28] tracking (efficient SAM-based tracking and segmentation) and AI confirmation simultaneously established the spatial annotations (yielding over 314k initial bounding boxes) and radically pruned weak temporal boundaries.
Four successive stages progressively filtered this set to isolate high-quality detections: a verification filtering agent, bounding-box tracking, a cued AI confirmation agent (which uses an overlay of the box on the lesion as a cue), and a final human review.
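The progressive filtering described above can be sketched as a simple funnel. This is a minimal illustration, not the authors' implementation: the `Candidate` fields, the thresholds (0.5 and 5 frames), and the stage predicates are all hypothetical stand-ins for the verification agent, the tracker, the cued confirmation agent, and the human reviewer.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Candidate:
    """One candidate lesion detection. Every field is a hypothetical stand-in."""
    clip_id: str
    verify_score: float    # output of the verification filtering agent
    tracked_frames: int    # how many frames the tracker held the box
    cued_confirmed: bool   # verdict of the cued AI confirmation agent
    human_approved: bool   # decision from the final human review


# A stage is a name plus a predicate that decides whether a candidate survives.
Stage = tuple[str, Callable[[Candidate], bool]]


def run_funnel(candidates: Iterable[Candidate], stages: list[Stage]):
    """Apply each stage in order and record how many candidates survive it."""
    survivors = list(candidates)
    counts = [("input", len(survivors))]
    for name, keep in stages:
        survivors = [c for c in survivors if keep(c)]
        counts.append((name, len(survivors)))
    return survivors, counts


# Assumed thresholds, chosen only to make the example concrete.
STAGES: list[Stage] = [
    ("verification", lambda c: c.verify_score >= 0.5),
    ("tracking", lambda c: c.tracked_frames >= 5),
    ("cued_confirmation", lambda c: c.cued_confirmed),
    ("human_review", lambda c: c.human_approved),
]

candidates = [
    Candidate("a", 0.90, 40, True, True),
    Candidate("b", 0.30, 50, True, True),    # dropped at verification
    Candidate("c", 0.80, 2, True, True),     # dropped at tracking
    Candidate("d", 0.70, 12, False, True),   # dropped at cued confirmation
    Candidate("e", 0.95, 30, True, False),   # dropped at human review
]

kept, counts = run_funnel(candidates, STAGES)
for name, n in counts:
    print(f"{name}: {n}")
```

Ordering the cheap automatic checks before the human review means the reviewer only sees candidates that every earlier stage accepted, which is the property that makes this kind of pipeline scale.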

Actual findings

Surprisingly high localization performance in medical domains compared to SAM-3 is reported.
The "colon-skill" prompting strategy improves zero-shot MLLM performance by up to 9.7% across most MLLMs.
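The digest does not state how localization performance is scored. Purely as an assumption for illustration, one common way to score a predicted bounding box against a reference box is intersection-over-union (IoU). The sketch below uses a hypothetical helper `box_iou` with boxes given as `(x1, y1, x2, y2)` corner coordinates; it is not taken from the paper.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2).

    Assumed convention: x1 <= x2 and y1 <= y2. Not the paper's metric code.
    """
    ax1, ay1, ax2, ay2 = a
    bx1, by1, bx2, by2 = b

    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


# Overlap is 5 x 5 = 25; union is 100 + 100 - 25 = 175.
print(box_iou((0, 0, 10, 10), (5, 5, 15, 15)))
```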

How the conclusion was reached

Step 1 - Proposed approach: Dense colonoscopy annotations were constructed by applying a multi-stage agentic pipeline to 60 video sequences from the REAL-COLON dataset [3].
Step 2 - Evaluation setup or comparison basis: MLLMs are evaluated zero-shot, including on VQA, with SAM-3 as the comparison point for localization.
Step 3 - Main reported evidence: Surprisingly high localization performance in medical domains compared to SAM-3 is reported.
Step 4 - Additional supporting or qualifying result: A "colon-skill" prompting strategy, derived from an analysis of common MLLM VQA errors, improves zero-shot MLLM performance by up to 9.7% across most MLLMs.

Experimental setup & results

The evaluation is motivated by the need for a comprehensive, densely annotated, long-sequence dataset to ground the spatiotemporal analysis of real colonoscopies.
The authors analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs.
Surprisingly high localization performance in medical domains compared to SAM-3 is also reported.

Limitations & risks

Detailed Summary (translated from KO)

Read-like-fullpaper digest

This paper tackles dense lesion annotation of full-procedure colonoscopy videos. To address the inherent visual challenges of colonoscopy videos, recent methods have proposed specialized architectures optimized for pixel-level precision and subtle lesion identification [27,17]. Because colonoscopy procedures are inherently long, efficient spatiotemporal analysis is critical; approaches using state-space models [24] and token merging [26] have been introduced to capture long-range dependencies while reducing computational redundancy. Manual dense annotation of colonoscopy videos is labor-intensive and inconsistent, which motivates the authors' scalable, affordable pipeline for dense colonoscopy video annotation, intended to facilitate AI research, evaluation, and applications in colonoscopy.

The core proposal is a multi-stage agentic pipeline, applied to 60 video sequences from the REAL-COLON dataset [3] to construct dense colonoscopy annotations. The success of the Gemini-3 vision model in medical domains [4] prompted the use of Gemini-3 as one tool in the annotation pipeline. To enable efficient review at scale, the pipeline pre-rendered short clips with spatial overlays into an interactive web interface. The agentic workflow (Section 2.1 of the paper) produces automatic annotations with quality control.

The empirical case is built around the gap the work addresses: the need for a comprehensive, densely annotated, long-sequence dataset to ground the spatiotemporal analysis of real colonoscopies. The authors also analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs.

The central reported finding is surprisingly high localization performance in medical domains compared to SAM-3.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

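Final takeaway

Main takeaway: A multi-stage agentic pipeline with a final human review offers a scalable, affordable way to densely annotate full-procedure colonoscopy videos, applied here to 60 sequences from the REAL-COLON dataset [3].
Most important supporting result: Surprisingly high localization performance in medical domains compared to SAM-3 is reported.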

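Problem definition

To address the inherent visual challenges of colonoscopy videos, recent methods have proposed specialized architectures optimized for pixel-level precision and subtle lesion identification [27,17].
Furthermore, because colonoscopy procedures are inherently long, efficient spatiotemporal analysis is critical; approaches using state-space models [24] and token merging [26] have been introduced to capture long-range dependencies while reducing computational redundancy.
Moreover, manual dense annotation of colonoscopy videos is labor-intensive and inconsistent, which motivates the authors' work on a scalable, affordable pipeline for dense colonoscopy video annotation, facilitating AI research, evaluation, and applications in colonoscopy.
These architectural and synthetic workarounds underscore the critical gap in the field that the work addresses: the need for a comprehensive, densely annotated, long-sequence dataset to ground the spatiotemporal analysis of real colonoscopies.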

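Core idea & method

The dense colonoscopy annotations were constructed by applying a multi-stage agentic pipeline to 60 video sequences from the REAL-COLON dataset [3].
The success of the Gemini-3 vision model in medical domains [4] prompted the use of Gemini-3 as one tool in the annotation pipeline.
To enable efficient review at scale, the pipeline pre-rendered short clips with spatial overlays into an interactive web interface.
The agentic workflow (Section 2.1 of the paper) produces automatic annotations with quality control.
Initial video verification followed by EdgeTAM [28] tracking (efficient SAM-based tracking and segmentation) and AI confirmation simultaneously established the spatial annotations (yielding over 314k initial bounding boxes) and radically pruned weak temporal boundaries.
Four successive stages progressively filtered this set to isolate high-quality detections: a verification filtering agent, bounding-box tracking, a cued AI confirmation agent (which uses an overlay of the box on the lesion as a cue), and a final human review.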

Actual findings

Surprisingly high localization performance in medical domains compared to SAM-3 is reported.
The "colon-skill" prompting strategy improves zero-shot MLLM performance by up to 9.7% across most MLLMs.

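How the conclusion was reached

Step 1 - Proposed approach: Dense colonoscopy annotations were constructed by applying a multi-stage agentic pipeline to 60 video sequences from the REAL-COLON dataset [3].
Step 2 - Evaluation setup or comparison basis: MLLMs are evaluated zero-shot, including on VQA, with SAM-3 as the comparison point for localization.
Step 3 - Main reported evidence: Surprisingly high localization performance in medical domains compared to SAM-3 is reported.
Step 4 - Additional supporting or qualifying result: A "colon-skill" prompting strategy, derived from an analysis of common MLLM VQA errors, improves zero-shot MLLM performance by up to 9.7% across most MLLMs.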

Experimental setup & results

The evaluation is motivated by the need for a comprehensive, densely annotated, long-sequence dataset to ground the spatiotemporal analysis of real colonoscopies.
The authors analyze common VQA errors from MLLMs to introduce a novel "colon-skill" prompting strategy, improving zero-shot MLLM performance by up to 9.7% across most MLLMs.
Surprisingly high localization performance in medical domains compared to SAM-3 is also reported.