#9 PerceptionComp: A Video Benchmark for Complex Perception-Centric Reasoning

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles deep video understanding. The authors argue that, for deep video understanding, reasoning should not mean only longer language-side thinking; it should also mean composing multiple perception skills and repeatedly revisiting the video to gather visual information across different dimensions. A two-panel figure illustrates the point: panel (a) shows an example from PerceptionComp, where a model must perform complex, perception-centric reasoning over various types of subconditions to arrive at the final answer, and panel (b) shows results from a human study measuring question-answering time, which indicate that PerceptionComp is more challenging for humans than previous perception and reasoning video benchmarks, largely because of its emphasis on perception-centric reasoning.

The core proposal is PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning. It is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning. State-of-the-art MLLMs perform substantially worse on PerceptionComp than on existing benchmarks: the best model in the authors' evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%. Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed.

The empirical case is built around the model evaluation: state-of-the-art MLLMs perform substantially worse on PerceptionComp than on existing benchmarks. The best model in the authors' evaluation, Gemini-3-Flash, reaches only 45.96% accuracy, and open-source MLLMs remain below 40%. The human study measuring question-answering time points the same way, showing that PerceptionComp is more challenging for humans than previous perception and reasoning video benchmarks, largely because of its emphasis on perception-centric reasoning.

The central reported finding is that the human study confirms the intended difficulty: PerceptionComp requires substantially longer response times than prior benchmarks, and under a single-view setting (no rewatching) human accuracy drops to near chance (18.97%), while experts can reach 100% accuracy with unrestricted rewatching and sufficient time.

Overall, the paper is most convincing where the benchmark's claimed difficulty is directly supported by the reported human and model comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: The human study confirms the intended difficulty: PerceptionComp requires substantially longer response times than prior benchmarks, and under a single-view setting (no rewatching) human accuracy drops to near chance (18.97%), while experts can reach 100% accuracy with unrestricted rewatching and sufficient time.
Most important supporting result: State-of-the-art MLLMs perform notably worse: the best model in the authors' evaluation (Gemini-3-Flash) reaches only 45.96% accuracy, and open-source MLLMs remain below 40%.

Problem definition

Videos capture human activities and the physical world, and multimodal intelligence, from robots to AI glasses, must achieve deep video understanding to be broadly useful.
The authors argue that, for deep video understanding, reasoning should not mean only longer language-side thinking; it should also mean composing multiple perception skills and repeatedly revisiting the video to gather visual information across different dimensions.
Figure panel (a): an example from PerceptionComp, where models are required to perform complex, perception-centric reasoning with various types of subconditions to arrive at the final answer.
Figure panel (b): results from a human study measuring question-answering time, showing that PerceptionComp is more challenging for humans than previous perception and reasoning video benchmarks, largely because of its emphasis on perception-centric reasoning.

Core idea & method

The paper introduces PerceptionComp, a manually annotated benchmark for complex, long-horizon, perception-centric video reasoning.
PerceptionComp is designed so that no single moment is sufficient: answering each question requires multiple temporally separated pieces of visual evidence and compositional constraints under conjunctive and sequential logic, spanning perceptual subtasks such as objects, attributes, relations, locations, actions, and events, and requiring skills including semantic recognition, visual correspondence, temporal reasoning, and spatial reasoning.
State-of-the-art MLLMs perform substantially worse on PerceptionComp than on existing benchmarks: the best model in the authors' evaluation, Gemini-3-Flash, reaches only 45.96% accuracy in the five-choice setting, while open-source models remain below 40%.
Human studies show that PerceptionComp requires substantial test-time thinking and repeated perception steps: participants take much longer than on prior benchmarks, and accuracy drops to near chance (18.97%) when rewatching is disallowed.
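To make the "no single moment is sufficient" design concrete, here is a minimal sketch of how a question with conjunctive and sequential subconditions could be modeled. It is purely illustrative: the `Step` class, the `run_chain` function, the segment timestamps, and the toy observations are all hypothetical and do not come from the paper or its annotation format.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Segment = Tuple[float, float]   # (start_s, end_s) of a video span (hypothetical)
Observation = Dict[str, str]    # what was perceived in that span
Facts = Dict[str, str]          # facts resolved by earlier steps


@dataclass
class Step:
    skill: str                  # e.g. "semantic recognition"
    segment: Segment            # where the needed evidence lives
    # Uses this segment's observation plus earlier facts; returns new
    # facts, or None when the subcondition cannot be satisfied.
    resolve: Callable[[Observation, Facts], Optional[Facts]]


def run_chain(steps: List[Step],
              notes: Dict[Segment, Observation]) -> Optional[Facts]:
    """Sequential and conjunctive: every step must succeed, in order."""
    facts: Facts = {}
    for step in steps:
        obs = notes.get(step.segment)
        if obs is None:         # that part of the video was never inspected
            return None
        new_facts = step.resolve(obs, facts)
        if new_facts is None:   # subcondition failed
            return None
        facts.update(new_facts)
    return facts


def find_holder(obs: Observation, facts: Facts) -> Optional[Facts]:
    who = obs.get("holds_red_cup")
    return {"person": who} if who else None


def find_destination(obs: Observation, facts: Facts) -> Optional[Facts]:
    place = obs.get(facts["person"])   # depends on the earlier step's result
    return {"location": place} if place else None


steps = [
    Step("semantic recognition", (0.0, 10.0), find_holder),
    Step("visual correspondence", (40.0, 55.0), find_destination),
]
full_notes = {
    (0.0, 10.0): {"holds_red_cup": "woman in green"},
    (40.0, 55.0): {"woman in green": "kitchen"},
}
single_moment = {(0.0, 10.0): {"holds_red_cup": "woman in green"}}

print(run_chain(steps, full_notes))     # both segments inspected
print(run_chain(steps, single_moment))  # one segment only: chain cannot finish
```

With both segments available the chain resolves the person and then the location; with only the first segment it returns `None`, which mirrors the idea that evidence from a single moment cannot settle the answer.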

Actual findings

The human study confirms the intended difficulty: PerceptionComp requires substantially longer response times than prior benchmarks, and under a single-view setting (no rewatching) human accuracy drops to near chance (18.97%), while experts can reach 100% accuracy with unrestricted rewatching and sufficient time.
State-of-the-art MLLMs perform notably worse: the best model in the authors' evaluation (Gemini-3-Flash) reaches only 45.96% accuracy, and open-source MLLMs remain below 40%.
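The phrase "near chance" is easier to read next to the actual baseline. The sketch below works through the arithmetic, assuming uniform guessing over five options and assuming the human single-view condition uses the same five-choice format as the model evaluation (the digest states the five-choice setting only for models). The helper name `margin_over_chance` is illustrative.

```python
NUM_CHOICES = 5                   # five-choice setting reported for the models
chance = 100.0 / NUM_CHOICES      # expected accuracy of uniform guessing, in percent

# Accuracies quoted in the summary above.
reported = {
    "Gemini-3-Flash (best model)": 45.96,
    "Humans, single view (no rewatching)": 18.97,
    "Experts, unrestricted rewatching": 100.0,
}


def margin_over_chance(accuracy_pct: float) -> float:
    """Percentage points above (positive) or below (negative) chance."""
    return round(accuracy_pct - chance, 2)


for name, acc in reported.items():
    print(f"{name}: {acc:.2f}% ({margin_over_chance(acc):+.2f} points vs. chance)")
```

Under these assumptions chance is 20%, so the single-view human score sits about one point below it, while the best model is roughly 26 points above it and still far from the expert ceiling.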

How the conclusion was reached

Step 1 - Proposed approach: The authors build PerceptionComp, a manually annotated benchmark in which answering each question requires multiple temporally separated pieces of visual evidence combined under conjunctive and sequential logic.
Step 2 - Evaluation setup or comparison basis: MLLMs are evaluated in a five-choice setting and their accuracy is set against performance on existing benchmarks; a human study measures question-answering time and accuracy, including a single-view setting with no rewatching.
Step 3 - Main reported evidence: The human study confirms the intended difficulty: PerceptionComp requires substantially longer response times than prior benchmarks, and under the single-view setting human accuracy drops to near chance (18.97%), while experts can reach 100% accuracy with unrestricted rewatching and sufficient time.
Step 4 - Additional supporting or qualifying result: State-of-the-art MLLMs perform notably worse: the best model in the authors' evaluation (Gemini-3-Flash) reaches only 45.96% accuracy, and open-source MLLMs remain below 40%.

Experimental setup & results

The human study confirms the intended difficulty: PerceptionComp requires substantially longer response times than prior benchmarks, and under a single-view setting (no rewatching) human accuracy drops to near chance (18.97%), while experts can reach 100% accuracy with unrestricted rewatching and sufficient time.
State-of-the-art MLLMs perform notably worse: the best model in the authors' evaluation (Gemini-3-Flash) reaches only 45.96% accuracy in the five-choice setting, and open-source MLLMs remain below 40%.
The human study measuring question-answering time shows that PerceptionComp is more challenging for humans than previous perception and reasoning video benchmarks, largely because of its emphasis on perception-centric reasoning.
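For readers who want to see how an accuracy figure in a five-choice setting is typically computed, the following is a minimal sketch. It is not the authors' evaluation code: the option letters A to E, the response format (for example "Answer: C"), and the function names `extract_choice` and `accuracy` are assumptions made for illustration.

```python
import re
from typing import List, Optional


def extract_choice(response: str) -> Optional[str]:
    """Return the first standalone option letter A-E, or None if absent.

    Assumes short responses such as "B", "(E)" or "Answer: C"; free-form
    prose would need a stricter parser.
    """
    match = re.search(r"\b([A-E])\b", response.strip())
    return match.group(1) if match else None


def accuracy(responses: List[str], gold: List[str]) -> float:
    """Percentage of responses whose extracted letter matches the gold letter."""
    if len(responses) != len(gold):
        raise ValueError("responses and gold must have the same length")
    if not gold:
        raise ValueError("need at least one question")
    correct = sum(extract_choice(r) == g for r, g in zip(responses, gold))
    return 100.0 * correct / len(gold)


# Toy data, not taken from the benchmark.
responses = ["B", "Answer: C", "(E)", "I am not sure"]
gold = ["B", "C", "A", "D"]
print(f"{accuracy(responses, gold):.2f}%")
```

A response with no recognizable option letter counts as wrong here, which is one common convention; other protocols may handle refusals differently.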

Limitations & risks
