#4 Chain-of-Adaptation: Surgical Vision-Language Adaptation with Reinforcement Learning

Score: 16.4 | Matched keywords: alignment, fine-tuning, multimodal, reasoning

Detailed Summary (EN)

Problem definition

Recent advances in vision–language models (VLMs) [3, 19, 35] have driven substantial progress in visual understanding and multimodal reasoning, enabling models to interpret complex scenes and generate coherent natural-language responses.
In surgical applications, however, their performance remains limited because of a significant distribution shift from the general domain to the surgical domain.
This shift has three components: (1) imaging-modality and semantic shifts, (2) terminology shifts (e.g., anatomical structures, procedures, instruments), and (3) task-requirement shifts (domain-expert-level reasoning).

Core idea & method

[…]’s pretrained multimodal priors, leading to reduced generalization.
To address this, we propose Chain-of-Adaptation (CoA), an adaptation framework designed to integrate domain knowledge while maintaining the model’s inherent reasoning and perceptual capabilities.
CoA uses reinforcement learning to introduce a structured reasoning format that enhances domain alignment without sacrificing general multimodal competence.
Experiments on standard surgical benchmarks, under both in-distribution and out-of-distribution settings, demonstrate that CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning.
Furthermore, ablation studies confirm that CoA effectively preserves the model’s core visual–language abilities, providing a reliable pathway for domain specialization in VLMs.
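
The summary above does not give the exact output template or reward used by CoA, so the following is only a minimal sketch of how a structured reasoning format could be rewarded during reinforcement learning. The `<think>`/`<answer>` tags, the function names, and the `format_weight` value are all assumptions made for illustration and are not taken from the paper.

```python
import re

# Hypothetical template: reasoning inside <think>, final label inside <answer>.
# The actual CoA format is not specified in this digest.
TEMPLATE = re.compile(r"<think>(.+?)</think>\s*<answer>(.+?)</answer>", re.DOTALL)


def format_reward(response: str) -> float:
    """Return 1.0 if the whole response follows the assumed template, else 0.0."""
    return 1.0 if TEMPLATE.fullmatch(response.strip()) else 0.0


def answer_reward(response: str, reference: str) -> float:
    """Return 1.0 if the extracted answer equals the reference (case-insensitive)."""
    match = TEMPLATE.fullmatch(response.strip())
    if match is None:
        return 0.0
    predicted = match.group(2).strip().lower()
    return 1.0 if predicted == reference.strip().lower() else 0.0


def total_reward(response: str, reference: str, format_weight: float = 0.5) -> float:
    """Combine both signals; the weighting is an illustrative choice."""
    return format_weight * format_reward(response) + answer_reward(response, reference)


if __name__ == "__main__":
    sample = "<think>The tool has two opposing jaws.</think><answer>Grasper</answer>"
    print(total_reward(sample, "grasper"))
```

In a setup like this, a response that keeps the format but names the wrong concept still earns the format term, which is one way a policy could be nudged toward structured, domain-aligned answers without being trained to imitate fixed target strings.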

Experimental setup & results

Models still struggle to generate meaningful responses at scale: their outputs are either short phrases or single words that lack semantic richness, as in Surgical-VQLA [4], or they lose context, as in some LLM-enhanced models [17, 41].
To make matters worse, supervised fine-tuning (SFT) often generalizes poorly [9, 44]: it frequently overfits to narrow instruction templates and may induce catastrophic forgetting [15, 21] or even model collapse [27], degrading the model’s base reasoning and language abilities.
Fortunately, current VLMs such as QwenVL [3] already exhibit strong visual grounding in practice and can recognize positional, geometric, and temporal attributes of surgical scenes.
Given these existing abilities, the core challenge in adapting VLMs to surgical domains lies not in visual perception but in learning to map these observations to clinically meaningful surgical concepts and terminology.
This insight raises a key question: Can we adapt pretrained VLMs to express visual content in clinically accurate ways while leveraging their existing multimodal competence, rather than overwriting it?
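
The failure modes named above (poor generalization and loss of base abilities) can be made measurable with a few simple quantities. The sketch below is an assumption-laden illustration, not the paper's evaluation protocol: the function names, the exact-match criterion, and the idea of comparing a general benchmark before and after adaptation are all hypothetical choices.

```python
from typing import Sequence


def accuracy(predictions: Sequence[str], references: Sequence[str]) -> float:
    """Case-insensitive exact-match accuracy over paired predictions and labels."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot compute accuracy on an empty set")
    correct = sum(
        p.strip().lower() == r.strip().lower()
        for p, r in zip(predictions, references)
    )
    return correct / len(references)


def generalization_gap(id_accuracy: float, ood_accuracy: float) -> float:
    """In-distribution minus out-of-distribution accuracy; smaller is better."""
    return id_accuracy - ood_accuracy


def retention(base_general_accuracy: float, adapted_general_accuracy: float) -> float:
    """Share of the base model's general-benchmark accuracy kept after adaptation."""
    if base_general_accuracy <= 0:
        raise ValueError("base accuracy must be positive")
    return adapted_general_accuracy / base_general_accuracy
```

A retention value well below 1.0 on a general benchmark would be one signal of the catastrophic forgetting attributed to SFT, while a large generalization gap would point to overfitting on narrow instruction templates.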

Limitations & risks

[…] of annotations: publicly available surgical datasets provide mostly low-level annotations (e.g., categorical labels such as surgical phase, instrument type, or coarse action), lack descriptive or reasoning-rich annotations, and show limited linguistic diversity.

Read-like-fullpaper digest

This paper addresses the limited performance of vision–language models (VLMs) [3, 19, 35] in surgical applications, which stems from the distribution shift between general and surgical domains. The core method is Chain-of-Adaptation (CoA), a reinforcement-learning-based adaptation framework that integrates domain knowledge through a structured reasoning format while maintaining the model’s inherent reasoning and perceptual capabilities. Key empirical findings are that, on standard surgical benchmarks under both in-distribution and out-of-distribution settings, CoA achieves higher accuracy, stronger generalization, and more stable behavior than supervised fine-tuning, and that ablation studies confirm it preserves the model’s core visual–language abilities.
