#5 ETA-VLA: Efficient Token Adaptation via Temporal Fusion and Intra-LLM Sparsification for Vision-Language-Action Models

Score: 20.0 | Matched keywords: benchmark, large language models, llm, reasoning, token

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles the following problem: despite their potential, deploying VLA models in real-world driving environments faces a critical challenge, namely the efficiency-accuracy trade-off in processing high-dimensional spatiotemporal data. For large vision-language models (VLMs), SparseVLM [25] and MADTP [2] leverage cross-modal alignment or textual instructions to guide visual token pruning, yet they either operate outside the LLM or treat temporal frames as independent images, without explicitly modeling multi-view, multi-frame temporal dependencies within the LLM. The authors' key insight is that efficiency must be addressed at both the temporal and spatial levels: historical observations contain significant redundancy, and even after temporal compression, the resulting spatial representation remains computationally intensive for the LLM.

Standard VLA architectures typically flatten visual tokens from multi-view history and concatenate them with text prompts. To alleviate the resulting bottleneck, the core proposal is ETA-VLA, an Efficient Token Adaptation framework for VLA models. Notably, the method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark.

The empirical case is built around two reported results. The method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark. Instantiated on NAVSIM v2 [6], ETA-VLA also achieves an EPDMS of 85.0 on Navtest while saving 32% FLOPs, which the authors present as high-fidelity driving performance compared with strong baselines. Among the related methods discussed, HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs. However, it relies on the standard causal attention matrix, which is influenced by positional encoding and may not fully capture pure semantic relevance, thus failing to achieve the human-like, context-aware attention allocation required for safe driving.

The central reported finding is that ETA-VLA, instantiated on the NAVSIM v2 benchmark [6], achieves an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating high-fidelity driving performance compared with strong baselines. Among the related methods discussed, SparseVLM [25] designs a training-free framework that uses text tokens as “raters” to score visual tokens via self-attention matrices.

The paper concludes by presenting ETA-VLA as a novel Vision-Language-Action (VLA) model designed to address the computational challenges of multi-frame, multi-view input in autonomous driving. The authors state that this work paves the way for deploying high-reasoning VLA models on resource-constrained automotive hardware. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: We instantiate ETA-VLA on the NAVSIM v2 benchmark [6] and achieve an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating its high-fidelity driving performance compared with strong baselines.
Most important supporting result: HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs.
Important caution: We introduced ETA-VLA, a novel Vision-Language-Action (VLA) model designed to address the computational challenges of multi-frame, multi-view input in autonomous driving.

Problem definition

Despite their potential, deploying VLA models in real-world driving environments faces a critical challenge: the efficiency-accuracy trade-off in processing high-dimensional spatiotemporal data.
For large vision-language models (VLMs), SparseVLM [25] and MADTP [2] leverage cross-modal alignment or textual instructions to guide visual token pruning, yet they either operate outside the LLM or treat temporal frames as independent images, without explicitly modeling multi-view, multi-frame temporal dependencies within the LLM.
Our key insight is that efficiency must be addressed at both the temporal and spatial levels: historical observations contain significant redundancy, and even after temporal compression, the resulting spatial representation remains computationally intensive for the LLM.
Consequently, the total number of tokens, and thus the computational complexity, scales multiplicatively with both the number of frames and the number of views, making the “token bloat” problem even more severe and rendering deployment infeasible on vehicle-embedded hardware.
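
As a rough illustration of the multiplicative scaling described above, the following sketch counts the visual tokens produced when every view of every history frame is flattened into one sequence. The frame count, camera count, and tokens-per-image values are hypothetical and are not taken from the paper; the function names are likewise made up for this example.

```python
def visual_token_count(num_frames: int, num_views: int, tokens_per_image: int) -> int:
    """Tokens produced when every view of every frame is flattened into one sequence."""
    return num_frames * num_views * tokens_per_image


def attention_pair_count(seq_len: int) -> int:
    """Query-key pairs in full self-attention; grows quadratically with length."""
    return seq_len * seq_len


# Hypothetical configuration: 4 history frames, 6 cameras, 256 tokens per image.
single_image = visual_token_count(1, 1, 256)
full_history = visual_token_count(4, 6, 256)

print(single_image, full_history)  # 256 6144
# The sequence is 24x longer, so full self-attention touches 24 * 24 = 576x more pairs.
print(attention_pair_count(full_history) // attention_pair_count(single_image))  # 576
```

Because the token count is a product of frames and views, doubling either one doubles the sequence length, while the attention cost over that sequence grows by roughly a factor of four.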

Core idea & method

Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark.
To alleviate this bottleneck, we propose ETA-VLA, an Efficient Token Adaptation framework for VLA models.
Despite their potential, deploying VLA models in real-world driving environments faces a critical challenge: the efficiency-accuracy trade-off in processing high-dimensional spatiotemporal data.
Standard VLA architectures typically flatten visual tokens from multi-view history and concatenate them with text prompts.
Prior approaches to mitigate this issue generally fall into two categories.
Consequently, the total number of tokens, and thus the computational complexity, scales multiplicatively with both the number of frames and the number of views, making the “token bloat” problem even more severe and rendering deployment infeasible on vehicle-embedded hardware.
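
The 85% pruning ratio quoted in this section can be pictured with a generic top-k selection over per-token importance scores. This is only a minimal sketch under stated assumptions: it is not the paper's sparsification procedure, the function name `prune_tokens` is hypothetical, and the scores are placeholder values rather than anything computed inside an LLM.

```python
def prune_tokens(tokens, scores, prune_ratio):
    """Keep the highest-scoring tokens and preserve their original order."""
    if len(tokens) != len(scores):
        raise ValueError("tokens and scores must have the same length")
    # Always keep at least one token, even for very aggressive ratios.
    num_keep = max(1, round(len(tokens) * (1.0 - prune_ratio)))
    # Indices of the num_keep largest scores.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:num_keep]
    # Re-sort by position so the surviving tokens stay in sequence order.
    return [tokens[i] for i in sorted(top)]


# 20 placeholder tokens with distinct placeholder scores (a permutation of 0..19).
tokens = [f"tok{i}" for i in range(20)]
scores = [(i * 7) % 20 for i in range(20)]

kept = prune_tokens(tokens, scores, prune_ratio=0.85)
print(kept)  # ['tok11', 'tok14', 'tok17']
```

With 20 tokens and an 85% pruning ratio, 3 tokens survive. How the scores are obtained is the substantive design question, and this sketch deliberately leaves it open.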

Actual findings

We instantiate ETA-VLA on the NAVSIM v2 benchmark [6] and achieve an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating its high-fidelity driving performance compared with strong baselines.
HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs.

How the conclusion was reached

Step 1 - Proposed approach: Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark.
Step 2 - Evaluation setup or comparison basis: Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark.
Step 3 - Main reported evidence: We instantiate ETA-VLA on the NAVSIM v2 benchmark [6] and achieve an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating its high-fidelity driving performance compared with strong baselines.
Step 4 - Additional supporting or qualifying result: HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs.
Step 5 - Claim boundary / limitation: We introduced ETA-VLA, a novel Vision-Language-Action (VLA) model designed to address the computational challenges of multi-frame, multi-view input in autonomous driving.

Experimental setup & results

We instantiate ETA-VLA on the NAVSIM v2 benchmark [6] and achieve an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating its high-fidelity driving performance compared with strong baselines.
HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs.
However, it relies on the standard causal attention matrix, which is influenced by positional encoding and may not fully capture pure semantic relevance, thus failing to achieve the human-like, context-aware attention allocation required for safe driving.
SparseVLM [25] designs a training-free framework using text tokens as “raters” to score visual tokens via self-attention matrices.
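
The idea of text tokens acting as “raters” can be sketched as follows: each text query attends over the visual keys, and the attention weights are averaged across text tokens to give one score per visual token. This is a simplified illustration of the general idea, not SparseVLM's actual algorithm; the function names and the toy two-dimensional vectors are assumptions made for this example.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def rate_visual_tokens(text_queries, visual_keys):
    """Average, over text tokens, of each text token's attention to every visual token."""
    scale = 1.0 / math.sqrt(len(visual_keys[0]))
    scores = [0.0] * len(visual_keys)
    for query in text_queries:
        # Scaled dot-product logits between this text query and all visual keys.
        logits = [scale * sum(q * k for q, k in zip(query, key)) for key in visual_keys]
        for j, weight in enumerate(softmax(logits)):
            scores[j] += weight / len(text_queries)
    return scores


# Toy vectors (hypothetical): two text queries, three visual keys.
text_queries = [[1.0, 0.0], [0.8, 0.2]]
visual_keys = [[2.0, 0.0], [0.0, 2.0], [-1.0, 0.0]]

scores = rate_visual_tokens(text_queries, visual_keys)
best = scores.index(max(scores))
print(best)  # 0: the visual token most aligned with both text queries
```

Scores produced this way could feed a top-k selection step. Because they come from attention weights, they inherit whatever biases the attention carries, which is the kind of concern the digest raises about positional encoding.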

Limitations & risks

We introduced ETA-VLA, a novel Vision-Language-Action (VLA) model designed to address the computational challenges of multi-frame, multi-view input in autonomous driving.
This work paves the way for deploying high-reasoning VLA models on resource-constrained automotive hardware.

Detailed Summary (KO, translated to English)

Read-like-fullpaper digest

This paper tackles the following problem: despite their potential, deploying VLA models in real-world driving environments faces a critical challenge, namely balancing efficiency and accuracy when processing high-dimensional spatiotemporal data. For large vision-language models (VLMs), SparseVLM [25] and MADTP [2] leverage cross-modal alignment or textual instructions to guide visual token pruning, yet they either operate outside the LLM or treat temporal frames as independent images, without explicitly modeling multi-view, multi-frame temporal dependencies within the LLM. The authors' key insight is that efficiency must be addressed at both the temporal and spatial levels: historical observations contain significant redundancy, and even after temporal compression, the resulting spatial representation remains computationally intensive for the LLM. Standard VLA architectures typically flatten visual tokens from multi-view history and concatenate them with text prompts. To alleviate the resulting bottleneck, the core proposal is ETA-VLA, an Efficient Token Adaptation framework for VLA models. Notably, the method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark. Instantiated on NAVSIM v2 [6], ETA-VLA also achieves an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating high-fidelity driving performance compared with strong baselines. Among the related methods discussed, HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs. However, it relies on the standard causal attention matrix, which is influenced by positional encoding and may not fully capture pure semantic relevance, and so may fail to achieve the human-like, context-aware attention allocation required for safe driving. SparseVLM [25] designs a training-free framework that uses text tokens as “raters” to score visual tokens via self-attention matrices. The paper concludes by presenting ETA-VLA as a novel Vision-Language-Action (VLA) model designed to address the computational challenges of multi-frame, multi-view input in autonomous driving. The authors state that this work paves the way for deploying high-reasoning VLA models on resource-constrained automotive hardware. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should be read in light of the evaluation setup and stated limitations.

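Final takeaway

Main takeaway: We instantiate ETA-VLA on the NAVSIM v2 benchmark [6] and achieve an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating its high-fidelity driving performance compared with strong baselines.
Most important supporting result: HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs.
Important caution: We introduced ETA-VLA, a novel Vision-Language-Action (VLA) model designed to address the computational challenges of multi-frame, multi-view input in autonomous driving.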

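Problem definition

Despite their potential, deploying VLA models in real-world driving environments faces a critical challenge: the efficiency-accuracy trade-off in processing high-dimensional spatiotemporal data.
For large vision-language models (VLMs), SparseVLM [25] and MADTP [2] leverage cross-modal alignment or textual instructions to guide visual token pruning, yet they either operate outside the LLM or treat temporal frames as independent images, without explicitly modeling multi-view, multi-frame temporal dependencies within the LLM.
Our key insight is that efficiency must be addressed at both the temporal and spatial levels: historical observations contain significant redundancy, and even after temporal compression, the resulting spatial representation remains computationally intensive for the LLM.
Consequently, the total number of tokens, and thus the computational complexity, scales multiplicatively with both the number of frames and the number of views, making the “token bloat” problem even more severe and rendering deployment infeasible on vehicle-embedded hardware.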

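Core idea & method

Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark.
To alleviate this bottleneck, we propose ETA-VLA, an Efficient Token Adaptation framework for VLA models.
Despite their potential, deploying VLA models in real-world driving environments faces a critical challenge: the efficiency-accuracy trade-off in processing high-dimensional spatiotemporal data.
Standard VLA architectures typically flatten visual tokens from multi-view history and concatenate them with text prompts.
Prior approaches to mitigate this issue generally fall into two categories.
Consequently, the total number of tokens, and thus the computational complexity, scales multiplicatively with both the number of frames and the number of views, making the “token bloat” problem even more severe and rendering deployment infeasible on vehicle-embedded hardware.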

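Actual findings

We instantiate ETA-VLA on the NAVSIM v2 benchmark [6] and achieve an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating its high-fidelity driving performance compared with strong baselines.
HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs.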

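How the conclusion was reached

Step 1 - Proposed approach: Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark.
Step 2 - Evaluation setup or comparison basis: Notably, our method prunes 85% of visual tokens and reduces inference FLOPs by 61% while still retaining 94% of the original accuracy on the NAVSIM v2 benchmark.
Step 3 - Main reported evidence: We instantiate ETA-VLA on the NAVSIM v2 benchmark [6] and achieve an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating its high-fidelity driving performance compared with strong baselines.
Step 4 - Additional supporting or qualifying result: HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs.
Step 5 - Claim boundary / limitation: We introduced ETA-VLA, a novel Vision-Language-Action (VLA) model designed to address the computational challenges of multi-frame, multi-view input in autonomous driving.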

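Experimental setup & results

We instantiate ETA-VLA on the NAVSIM v2 benchmark [6] and achieve an EPDMS of 85.0 on Navtest while saving 32% FLOPs, demonstrating its high-fidelity driving performance compared with strong baselines.
HoloV [27] operates on the ViT and rethinks retention holistically; by adaptively distributing pruning budgets across spatial crops, it ensures tokens capture global context rather than isolated features, achieving superior efficiency-accuracy trade-offs.
However, it relies on the standard causal attention matrix, which is influenced by positional encoding and may not fully capture pure semantic relevance, and so may fail to achieve the human-like, context-aware attention allocation required for safe driving.
SparseVLM [25] designs a training-free framework using text tokens as “raters” to score visual tokens via self-attention matrices.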

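Limitations & risks

We introduced ETA-VLA, a novel Vision-Language-Action (VLA) model designed to address the computational challenges of multi-frame, multi-view input in autonomous driving.
This work paves the way for deploying high-reasoning VLA models on resource-constrained automotive hardware.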