Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension

Score: 35.1 | Matched keywords: agent, benchmark, llm, multi-agent, multimodal, reasoning

Abstract Snapshot

Compressed abstract

Main idea

Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient.

Method signal

We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize.

Contribution signal

We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality.
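
The routing idea can be illustrated with a minimal sketch. This is not the MMA2A implementation: the `Part` and `AgentCard` classes are simplified, hypothetical stand-ins for the A2A message parts and Agent Card capability declarations, and `to_text` is an assumed placeholder for whatever transcription or captioning step a text-bottleneck pipeline would apply. A part is forwarded unchanged when the receiving agent declares its modality, and is collapsed to text otherwise.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Set


@dataclass(frozen=True)
class Part:
    """Hypothetical, simplified message part (not the A2A wire format)."""
    mime_type: str  # e.g. "text/plain", "image/png", "audio/wav"
    payload: bytes


@dataclass
class AgentCard:
    """Hypothetical, simplified capability declaration for one agent."""
    name: str
    # Declared input modalities; wildcards such as "image/*" are allowed here.
    input_modes: Set[str] = field(default_factory=lambda: {"text/plain"})


def accepts(card: AgentCard, mime_type: str) -> bool:
    """True if the card declares the exact MIME type or its family wildcard."""
    family = mime_type.split("/", 1)[0] + "/*"
    return mime_type in card.input_modes or family in card.input_modes


def route_part(part: Part, card: AgentCard,
               to_text: Callable[[Part], str]) -> Part:
    """Forward the part natively if supported; otherwise fall back to text."""
    if accepts(card, part.mime_type):
        return part  # modality-native path: the signal is preserved
    # Text-bottleneck path: the part is reduced to a textual description.
    return Part("text/plain", to_text(part).encode("utf-8"))


def route_message(parts: List[Part], card: AgentCard,
                  to_text: Callable[[Part], str]) -> List[Part]:
    """Apply the per-part routing decision to a whole message."""
    return [route_part(p, card, to_text) for p in parts]
```

In this sketch the routing decision is made per part rather than per message, so a mixed message can keep its image in native form while an unsupported audio part is still converted to text.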

Original Abstract

Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on the difference: [8, 32] pp; McNemar's exact p = 0.006). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a 1.8× latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.
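
The paired statistics quoted above can be sketched as follows. The abstract reports only the marginals (26 of 50 tasks correct with native routing, 16 of 50 with the text bottleneck), not the per-task outcomes, so the 15/11/1/23 split used in the usage note below is an assumption chosen to match those marginals and is not data from the paper. The function names are illustrative.

```python
import random
from math import comb
from typing import List, Tuple


def mcnemar_exact_p(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: tasks solved only by system A; c: tasks solved only by system B.
    Under the null hypothesis the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(outcomes: List[Tuple[int, int]],
                        n_resamples: int = 10_000,
                        alpha: float = 0.05,
                        seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap CI on the accuracy difference (native - baseline).

    outcomes: one (baseline_correct, native_correct) 0/1 pair per task.
    Tasks are resampled with replacement so the pairing is preserved.
    """
    rng = random.Random(seed)
    n = len(outcomes)
    diffs = []
    for _ in range(n_resamples):
        sample = [outcomes[rng.randrange(n)] for _ in range(n)]
        diffs.append(sum(native - base for base, native in sample) / n)
    diffs.sort()
    lo = diffs[int((alpha / 2) * n_resamples)]
    hi = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# ASSUMED per-task split (not reported in the abstract): 15 tasks solved by
# both systems, 11 only by native routing, 1 only by the baseline, 23 by neither.
hypothetical = [(1, 1)] * 15 + [(0, 1)] * 11 + [(1, 0)] * 1 + [(0, 0)] * 23
p_value = mcnemar_exact_p(b=11, c=1)
ci_low, ci_high = paired_bootstrap_ci(hypothetical)
```

Under this assumed split the exact test gives p = 26/4096, roughly 0.006, which is in line with the reported value. Other splits with the same marginals give different p-values, so this agreement does not recover the paper's actual contingency table.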