#9 VideoSeek: Long-Horizon Video Agent with Tool-Guided Seeking

Detailed Summary (EN)

Problem definition

Video-language understanding [4, 22, 34, 63] requires perceiving and reasoning over video streams and natural-language instructions to interpret user intent.
Its practical impact spans a wide range of applications, including multimodal assistants [42, 59], autonomous driving [20, 52], and vision-guided robotics [48, 74].
Recent advancements in large language models (LLMs) [1, 3, 40, 53] and large multimodal models (LMMs) [23, 31, 55] have successfully pushed the limits of video-language understanding, encouraging a surge of Video-LMMs [27, 30, 47, 50, 65, 72, 75] that achieve promising performance on standard video-language tasks [6, 21, 28, 44, 49, 63].
However, these methods mostly follow a single-pass paradigm, which is often insufficient for more challenging settings, such as long-form video understanding [5, 12, 56, 62] and complex video reasoning [9, 12].

Core idea & method

By actively seeking answer-critical evidence instead of exhaustively parsing the full video, VideoSeek aims to use far fewer frames while maintaining, or even improving, its video understanding capability.
VideoSeek operates in a think–act–observe loop with a well-designed toolkit for collecting multi-granular video observations.
This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning.
Experiments on four challenging video understanding and reasoning benchmarks demonstrate that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs.
Notably, VideoSeek improves on its base model, GPT-5, by 10.2 absolute points on LVBench while using 93% fewer frames.
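The think–act–observe loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tool names `overview` and `zoom`, the one-caption-per-second toy video, and the hard-coded `toy_policy` that stands in for the LLM's reasoning step (it ignores the query text entirely). This summary does not describe the paper's actual toolkit or prompts, so this is not the authors' implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Observation:
    tool: str
    captions: List[Tuple[int, str]]  # (timestamp in seconds, toy caption)

    @property
    def frames_used(self) -> int:
        return len(self.captions)


@dataclass
class Action:
    tool: Optional[str]              # None means "stop and answer"
    start: int = 0
    end: int = 0
    answer: Optional[str] = None


# Toy "video": one caption per second, standing in for real frames.
VIDEO = {t: "empty hallway" for t in range(60)}
VIDEO[40] = "person walks toward a door"
VIDEO[42] = "a red door opens"


def sample(tool: str, start: int, end: int, n: int) -> Observation:
    """Uniformly sample at most n timestamps from [start, end)."""
    step = max(1, (end - start) // n)
    times = list(range(start, end, step))[:n]
    return Observation(tool, [(t, VIDEO[t]) for t in times])


# Hypothetical two-tool kit: a coarse pass and a dense pass.
TOOLS: Dict[str, Callable[[int, int], Observation]] = {
    "overview": lambda s, e: sample("overview", s, e, 6),
    "zoom": lambda s, e: sample("zoom", s, e, 10),
}


def toy_policy(query: str, history: List[Observation]) -> Action:
    """Hard-coded stand-in for the LLM's reasoning over past observations."""
    if not history:
        return Action("overview", 0, len(VIDEO))
    latest = history[-1].captions
    for _, caption in latest:
        if "door opens" in caption:          # evidence found: answer now
            return Action(None, answer=caption)
    for t, caption in latest:
        if "door" in caption:                # hint found: look closer
            return Action("zoom", t, min(t + 10, len(VIDEO)))
    return Action(None)                      # nothing left to follow up on


def run_agent(query, policy, tools, max_steps=8):
    """Alternate think (policy), act (tool call), observe (append result)."""
    history: List[Observation] = []
    for _ in range(max_steps):
        action = policy(query, history)                               # think
        if action.tool is None:
            return action.answer, history
        history.append(tools[action.tool](action.start, action.end))  # act + observe
    return None, history


answer, trace = run_agent("What colour is the door?", toy_policy, TOOLS)
print(answer, sum(o.frames_used for o in trace))  # a red door opens 16
```

In this toy run the agent inspects 16 of the 60 available frames: 6 from the coarse pass and 10 from the zoomed pass. The point is only that a query-aware second step can replace dense sampling of the whole video. The numbers carry no relation to the paper's reported results.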

Experimental setup & results

Early video agentic approaches rely on manually designed workflows.
VideoAgent [57] pioneers using an LLM as a central agent that iteratively inspects key video frames and then retrieves query-relevant frames via CLIP [43].
Subsequent works [58, 67] refine this idea by performing a coarse-to-fine, tree-structured search over video segments to identify informative frames.
Beyond pure search, later studies [32, 33, 41] construct a comprehensive video database for query-relevant information retrieval.
Instead of relying on predefined workflows, recent studies [51, 71] develop autonomous and adaptive agentic paradigms with tool use for diverse and real-world scenarios.
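As a rough illustration of the coarse-to-fine, tree-structured search mentioned above, the sketch below greedily descends into whichever segment has the representative (middle) frame most similar to the query. It is a generic example under stated assumptions, not the algorithm of the cited works: the function name `coarse_to_fine`, the `branch` parameter, and the 2-D toy vectors (standing in for CLIP-style frame and query embeddings) are all hypothetical.

```python
import math
from typing import Sequence, Tuple


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def coarse_to_fine(
    frames: Sequence[Sequence[float]],
    query: Sequence[float],
    branch: int = 2,
) -> Tuple[int, int]:
    """Return (index of the selected frame, number of frames scored)."""
    if not frames or branch < 2:
        raise ValueError("need at least one frame and branch >= 2")
    lo, hi = 0, len(frames)
    scored = 0
    while hi - lo > 1:
        # Split [lo, hi) into up to `branch` child segments.
        size = math.ceil((hi - lo) / branch)
        best_score, best_span = -2.0, (lo, hi)
        for start in range(lo, hi, size):
            end = min(start + size, hi)
            mid = (start + end) // 2          # representative frame of the child
            score = cosine(frames[mid], query)
            scored += 1
            if score > best_score:
                best_score, best_span = score, (start, end)
        lo, hi = best_span                    # descend into the best child
    return lo, scored


# Toy embeddings: similarity to the query peaks at frame 11 and decays with distance.
toy_frames = [(1.0, float(abs(i - 11))) for i in range(16)]
toy_query = (1.0, 0.0)
index, n_scored = coarse_to_fine(toy_frames, toy_query)
print(index, n_scored)  # 11 8
```

Here 8 frames are scored instead of all 16. The greedy descent succeeds only because the toy similarity varies smoothly around the target. On real video a single representative frame can be uninformative and the search can descend into the wrong segment, which is one reason to prefer adaptive, tool-driven agents over fixed workflows.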

Limitations & risks

We present VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video.
Through a lightweight multi-granular toolkit and a think–act–observe loop, VideoSeek adaptively navigates to informative video segments by reasoning over accumulated observations.
Experiments on four challenging benchmarks spanning both long-form video understanding and complex video reasoning show that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs.
Further analysis highlights the importance of video logic flow, strong reasoning capability, and the complementary design of the toolkit.

Read-like-fullpaper digest

This paper addresses video-language understanding [4, 22, 34, 63], which requires perceiving and reasoning over video streams and natural-language instructions to interpret user intent. The core method is a think–act–observe loop that seeks answer-critical evidence, so that far fewer frames are used while video understanding capability is maintained or even improved. Unlike early video agentic approaches, which rely on manually designed workflows, VideoSeek navigates adaptively with tool use. The key empirical finding is a 10.2-point absolute improvement on LVBench over the base model, GPT-5, with 93% fewer frames.

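Detailed Summary (KO)

Problem definition

Video-language understanding [4, 22, 34, 63] requires perception of and reasoning over video streams and natural-language instructions in order to interpret user intent.
Its practical impact spans a wide range of applications, including multimodal assistants [42, 59], autonomous driving [20, 52], and vision-guided robotics [48, 74].
Recent advances in large language models (LLMs) [1, 3, 40, 53] and large multimodal models (LMMs) [23, 31, 55] have pushed the limits of video-language understanding, leading to a surge of Video-LMMs [27, 30, 47, 50, 65, 72, 75] that achieve promising performance on standard video-language tasks [6, 21, 28, 44, 49, 63].
However, these methods mostly follow a single-pass paradigm, which is often insufficient for more demanding settings such as long-form video understanding [5, 12, 56, 62] and complex video reasoning [9, 12].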

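Core idea & method

The aim is to use far fewer frames while maintaining, or even improving, video understanding capability.
VideoSeek operates in a think–act–observe loop, using a well-designed toolkit to collect multi-granular video observations.
This design enables query-aware exploration over accumulated observations and supports practical video understanding and reasoning.
Experiments on four challenging video understanding and reasoning benchmarks show that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs.
In particular, VideoSeek improves on its base model, GPT-5, by 10.2 absolute points on LVBench while using 93% fewer frames.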

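Experimental setup & results

Early video agentic approaches rely on manually designed workflows.
VideoAgent [57] pioneered the use of an LLM as a central agent that iteratively inspects key video frames and then retrieves query-relevant frames via CLIP [43].
Subsequent works [58, 67] refine this idea by performing a coarse-to-fine, tree-structured search over video segments to identify informative frames.
Going beyond pure search, later studies [32, 33, 41] build a comprehensive video database for query-relevant information retrieval.
Instead of relying on predefined workflows, recent studies [51, 71] develop autonomous and adaptive agentic paradigms with tool use for diverse real-world scenarios.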

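Limitations & risks

The paper presents VideoSeek, a long-horizon video agent that leverages video logic flow to actively seek answer-critical evidence instead of exhaustively parsing the full video.
Through a lightweight multi-granular toolkit and a think–act–observe loop, VideoSeek reasons over accumulated observations and adaptively navigates to informative video segments.
Experiments on four challenging benchmarks spanning long-form video understanding and complex video reasoning show that VideoSeek achieves strong accuracy while using far fewer frames than prior video agents and standalone LMMs.
Further analysis highlights the importance of video logic flow, strong reasoning capability, and the complementary design of the toolkit.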

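Read-like-fullpaper digest

This paper addresses video-language understanding [4, 22, 34, 63], which requires perception of and reasoning over video streams and natural-language instructions in order to interpret user intent. The core method is to use far fewer frames while maintaining or improving video understanding capability. As background, early video agentic approaches rely on manually designed workflows.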