#9 A Comparative Study in Surgical AI: Datasets, Foundation Models, and Barriers to Med-AGI

Score: 17.6 | Matched keywords: ai, artificial intelligence, foundation models, multimodal

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles surgical tool detection, a task at which non-expert humans excel: annotators in the study learned to label these tools with near-perfect accuracy after minimal training. [2025] further demonstrate that “generalist” radiology capability depends on large-scale in-domain pretraining and radiology-specific instruction tuning, suggesting that progress toward Med-AGI may be bottlenecked by domain data coverage as much as by parameter count. [2024] present Med-Gemini, a family of models achieving 91.1% on MedQA and large gains over GPT-4V on multimodal benchmarks, as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties.

The core argument is that some obstacles cannot simply be “scaled away” with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. Scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. The authors explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026.

The empirical case is framed against [2024], who present Med-Gemini, a family of models achieving 91.1% on MedQA and large gains over GPT-4V on multimodal benchmarks, as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties. The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%. While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.

The central reported finding is that the model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%. While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift. The fine-tuned open-weight model and YOLOv12-m outperform all zero-shot VLM methods, including zero-shot methods using proprietary frontier VLMs.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.
Most important supporting result: While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.

Problem definition

Non-expert humans excel at this task: annotators in our study learned to label these tools with near-perfect accuracy after minimal training.
[2025] further demonstrate that “generalist” radiology capability depends on large-scale in-domain pretraining and radiology-specific instruction tuning, suggesting progress toward Med-AGI may be bottlenecked by domain data coverage as much as by parameter count.
[2024] present Med-Gemini, a family of models achieving 91.1% on MedQA and large gains over GPT-4V on multimodal benchmarks, as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties.
The definition of AGI remains debated, but, in order to function in the operative setting, locating and classifying surgical instruments is the earliest (necessary, not sufficient) relevant task.

Core idea & method

Some obstacles cannot simply be “scaled away” with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors.
Scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year.
In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026.
On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources.
We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery.
Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics.

Actual findings

The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.
While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.
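
The two figures above can be made concrete with a minimal sketch. It assumes that "exact match" means a frame counts as correct only when the predicted set of tools equals the labeled set; the digest does not spell out the paper's definition, so this is an illustration and not the authors' evaluation code. The function names and tool names are hypothetical.

```python
from typing import Iterable, Sequence


def exact_match_accuracy(
    predictions: Sequence[Iterable[str]],
    references: Sequence[Iterable[str]],
) -> float:
    """Fraction of frames whose predicted tool set equals the labeled set.

    Assumed definition: order and duplicates are ignored, and a frame
    with one missing or one extra tool counts as a miss.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not predictions:
        raise ValueError("need at least one example")
    hits = sum(set(p) == set(r) for p, r in zip(predictions, references))
    return hits / len(predictions)


def generalization_gap(train_acc: float, val_acc: float) -> float:
    """Difference between training and validation accuracy, in the same units."""
    return train_acc - val_acc


# Hypothetical tool names and labels, for illustration only.
preds = [{"suction", "drill"}, {"suction"}, set(), {"curette"}]
refs = [{"drill", "suction"}, {"suction", "curette"}, set(), {"curette"}]

# Frames 1, 3 and 4 match exactly; frame 2 misses one tool.
print(exact_match_accuracy(preds, refs))  # 0.75

# Validation accuracy is reported as below 40%, so using 40.0 here
# gives a lower bound on the gap in percentage points.
print(generalization_gap(98.6, 40.0))
```

Under this reading, a gap of more than roughly 58 percentage points between training and validation accuracy is what the digest attributes to distribution shift.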

How the conclusion was reached

Step 1 — Proposed approach: Moreover, some obstacles cannot be simply “scaled away” with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors.
Step 2 — Evaluation setup or comparison basis: [2024] present Med-Gemini, a family of models achieving 91.1% on MedQA and large gains over GPT-4V on multimodal benchmarks, as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties.
Step 3 — Main reported evidence: The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.
Step 4 — Additional supporting or qualifying result: While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.

Experimental setup & results

The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.
[2024] present Med-Gemini, a family of models achieving 91.1% on MedQA and large gains over GPT-4V on multimodal benchmarks, as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties.
While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.
The fine-tuned open-weight model and YOLOv12-m outperform all zero-shot VLM methods, including zero-shot methods using proprietary frontier VLMs.
Such benchmark results have fueled speculation about the feasibility of a “Medical Artificial General Intelligence” (Med-AGI) through scaling.
[2024] find that state-of-the-art LLMs perform significantly worse than physicians across pathologies, often failing to follow instructions.

Limitations & risks

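Detailed Summary (KO)

Read-like-fullpaper digest

This paper tackles surgical tool detection, a task at which non-expert humans excel: annotators in the study learned to label these tools with near-perfect accuracy after minimal training. [2025] further demonstrate that “generalist” radiology capability depends on large-scale in-domain pretraining and radiology-specific instruction tuning, suggesting that progress toward Med-AGI may be bottlenecked by domain data coverage as much as by parameter count. [2024] present Med-Gemini, a family of models achieving 91.1% on MedQA and large gains over GPT-4V on multimodal benchmarks, as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties.

The core argument is that some obstacles cannot simply be “scaled away” with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors. Scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year. On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources. The paper explores this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026.

The empirical case is framed against [2024], who present Med-Gemini as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties. The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%. While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.

The central reported finding is that the model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%. The fine-tuned open-weight model and YOLOv12-m outperform all zero-shot VLM methods, including zero-shot methods using proprietary frontier VLMs.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.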

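Final takeaway

Main takeaway: The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.
Most important supporting result: While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.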

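Problem definition

Non-expert humans excel at this task: annotators in our study learned to label these tools with near-perfect accuracy after minimal training.
[2025] further demonstrate that “generalist” radiology capability depends on large-scale in-domain pretraining and radiology-specific instruction tuning, suggesting progress toward Med-AGI may be bottlenecked by domain data coverage as much as by parameter count.
[2024] present Med-Gemini, a family of models achieving 91.1% on MedQA and large gains over GPT-4V on multimodal benchmarks, as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties.
The definition of AGI remains debated, but, in order to function in the operative setting, locating and classifying surgical instruments is the earliest (necessary, not sufficient) relevant task.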

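Core idea & method

Some obstacles cannot simply be “scaled away” with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors.
Scaling architecture size and training data is attractive, especially since millions of hours of surgical video data are generated per year.
In this paper, we explore this question through a case study of surgical tool detection using state-of-the-art AI methods available in 2026.
On the other hand, preparing surgical data for AI training requires significantly higher levels of professional expertise, and training on that data requires expensive computational resources.
We demonstrate that even with multi-billion parameter models and extensive training, current Vision Language Models fall short in the seemingly simple task of tool detection in neurosurgery.
Additionally, we show scaling experiments indicating that increasing model size and training time only leads to diminishing improvements in relevant performance metrics.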

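Actual findings

The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.
While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.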

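How the conclusion was reached

Step 1 - Proposed approach: Some obstacles cannot simply be “scaled away” with additional compute and persist across diverse model architectures, raising the question of whether data and label availability are the only limiting factors.
Step 2 - Evaluation setup or comparison basis: [2024] present Med-Gemini, a family of models achieving 91.1% on MedQA and large gains over GPT-4V on multimodal benchmarks, as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties.
Step 3 - Main reported evidence: The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.
Step 4 - Additional supporting or qualifying result: While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.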

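Experimental setup & results

The model achieves 47.63% exact match accuracy, surpassing the validation set baseline of 13.41%.
[2024] present Med-Gemini, a family of models achieving 91.1% on MedQA and large gains over GPT-4V on multimodal benchmarks, as evidence that large multimodal foundation models can deliver strong generalist capabilities across medical specialties.
While training accuracy reaches 98.6%, validation accuracy remains below 40%, showing that scaling alone cannot overcome distribution shift.
The fine-tuned open-weight model and YOLOv12-m outperform all zero-shot VLM methods, including zero-shot methods using proprietary frontier VLMs.
Such benchmark results have fueled speculation about the feasibility of a “Medical Artificial General Intelligence” (Med-AGI) through scaling.
[2024] find that state-of-the-art LLMs perform significantly worse than physicians across pathologies, often failing to follow instructions.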