#10 CoVR-R: Reason-Aware Composed Video Retrieval

Detailed Summary (EN)

Problem definition

In composed video retrieval (CoVR), a system receives a reference video and a short modification text and must return a target video that reflects the requested change.
In practice, many modifications imply additional, unspoken after-effects: causal and temporal consequences such as object state transitions (“ingredients become browned”), motion and shot-scale changes (“close-up implies tighter framing and shorter duration”) [19], or scene dynamics (“frying introduces smoke and faster hand motions”).
Treating these after-effects as mere keywords underestimates the gap between what is said (the edit) and what must happen (its visual consequences).

Core idea & method

We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning.
To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching.
Our automatic and human analyses confirm higher step-consistency and effect-factuality in our retrieved
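
The two stages described above (infer after-effects, then align the reasoned query to candidates) can be illustrated with a toy sketch. This is not the paper's implementation: the summary does not say how alignment is scored, so cosine similarity over embeddings is an assumption, as is representing the reference video by a text description. `infer_effects` and `embed` are hypothetical placeholders for a large multimodal model and an encoder; the toy versions below exist only to make the example runnable.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def retrieve(
    reference_desc: str,
    edit: str,
    infer_effects: Callable[[str, str], List[str]],  # hypothetical LMM call
    embed: Callable[[str], Vector],                  # hypothetical encoder
    candidates: Dict[str, Vector],
    top_k: int = 5,
) -> List[Tuple[str, float]]:
    """Rank candidate videos against an edit expanded with inferred after-effects."""
    # Stage (i): infer consequences implied by the edit.
    effects = infer_effects(reference_desc, edit)
    # Stage (ii): build the reasoned query and score candidates, no finetuning.
    reasoned_query = " ".join([reference_desc, edit, *effects])
    query_vec = embed(reasoned_query)
    scored = [(vid, cosine(query_vec, vec)) for vid, vec in candidates.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# --- Toy stand-ins (assumptions for illustration only) ---
VOCAB = ("raw", "browned", "smoke", "close-up", "wide")


def toy_embed(text: str) -> List[float]:
    """Bag-of-words counts over a tiny fixed vocabulary."""
    words = text.lower().split()
    return [float(words.count(w)) for w in VOCAB]


def toy_infer_effects(reference_desc: str, edit: str) -> List[str]:
    """Pretend LMM: frying implies browning and smoke."""
    return ["browned ingredients", "smoke rising"] if "fry" in edit else []


CANDIDATES = {
    "raw_wide": toy_embed("wide raw"),
    "fried_wide": toy_embed("wide browned smoke"),
    "fried_close": toy_embed("close-up browned smoke"),
}

if __name__ == "__main__":
    print(retrieve("wide shot of raw vegetables", "fry them",
                   toy_infer_effects, toy_embed, CANDIDATES, top_k=3))
```

In this toy setup, dropping the inferred effects (an `infer_effects` that returns an empty list) makes the query match the reference-like distractor `raw_wide` best, which mirrors the gap between the stated edit and its visual consequences.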

Experimental setup & results

Our findings suggest that general-purpose LMM reasoning is an effective driver for CoVR, reducing the need for task-specific supervision and opening a path toward more explainable video search.

Limitations & risks

Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects, the implicit consequences (e.g., motion, state transitions, viewpoint/duration cues) that emerge from the edit.
We argue that successful CoVR requires reasoning about these after-effects.
We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning.
To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching.
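
One way to picture a CoVR-Reason entry is as a record that bundles the triplet with its reasoning trace and distractors. The sketch below is an assumption about shape only: the summary does not give the benchmark's actual schema, so every class and field name here is hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ReasoningStep:
    """One step of a structured trace: a cause and the after-effect it implies."""
    cause: str
    effect: str


@dataclass
class CoVRReasonExample:
    """Hypothetical record for one (reference, edit, target) triplet."""
    reference_id: str
    edit_text: str
    target_id: str
    reasoning_trace: List[ReasoningStep] = field(default_factory=list)
    distractor_ids: List[str] = field(default_factory=list)

    def candidate_pool(self) -> List[str]:
        """Target plus distractors: the set a retriever must rank."""
        return [self.target_id, *self.distractor_ids]

    def is_well_formed(self) -> bool:
        """Basic sanity checks: a trace exists and the target is not a distractor."""
        return bool(self.reasoning_trace) and self.target_id not in self.distractor_ids


example = CoVRReasonExample(
    reference_id="vid_ref",
    edit_text="fry the ingredients",
    target_id="vid_fried",
    reasoning_trace=[ReasoningStep("frying", "ingredients become browned")],
    # A distractor that matches the edit's keywords but not its after-effects.
    distractor_ids=["vid_raw_in_pan"],
)
```

Keeping distractors alongside the target makes it straightforward to check whether a retriever predicted the after-effects or merely matched keywords.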

Read-like-fullpaper digest

This paper addresses composed video retrieval (CoVR), in which a system receives a reference video and a short modification text and must return a target video that reflects the requested change. The core method is a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. The key finding is that general-purpose LMM reasoning is an effective driver for CoVR, reducing the need for task-specific supervision and opening a path toward more explainable video search.

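Detailed Summary (translated from KO)

Problem definition

In composed video retrieval (CoVR), a system receives a reference video and a short modification text and must return a target video that reflects the requested change.
In practice, many modifications imply additional, unspoken after-effects: causal and temporal consequences such as object state transitions ("ingredients become browned"), motion and shot-scale changes ("close-up implies tighter framing and shorter duration") [19], or scene dynamics ("frying introduces smoke and faster hand motions").
Treating these after-effects as mere keywords underestimates the gap between what is said (the edit) and what must happen (its visual consequences).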

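Core idea & method

The approach leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning.
To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching.
Our automatic and human analyses confirm higher step-consistency and effect-factuality in the retrieved data.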

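Experimental setup & results

Our findings suggest that general-purpose LMM reasoning is an effective driver for CoVR, reducing the need for task-specific supervision and opening a path toward more explainable video search.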

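Limitations & risks

Prior work assumes the modification text fully specifies the visual changes, overlooking after-effects, the implicit consequences (e.g., motion, state transitions, viewpoint/duration cues) that emerge from the edit.
We argue that successful CoVR requires reasoning about these after-effects.
We introduce a reasoning-first, zero-shot approach that leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning.
To evaluate reasoning in CoVR, we also propose CoVR-Reason, a benchmark that pairs each (reference, edit, target) triplet with structured internal reasoning traces and challenging distractors that require predicting after-effects rather than keyword matching.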

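Read-like-fullpaper digest

This paper addresses composed video retrieval (CoVR), in which a system receives a reference video and a short modification text and must return a target video that reflects the requested change. The core method leverages large multimodal models to (i) infer causal and temporal consequences implied by the edit, and (ii) align the resulting reasoned queries to candidate videos without task-specific finetuning. The key finding is that general-purpose LMM reasoning is an effective driver for CoVR, reducing the need for task-specific supervision and opening a path toward more explainable video search.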