
#9 Modality-Native Routing in Agent-to-Agent Networks: A Multimodal A2A Protocol Extension

Score: 35.1 | Matched keywords: agent, benchmark, llm, multi-agent, multimodal, reasoning

Categories: cs.AI, cs.MA, cs.SE

Abstract Snapshot

Compressed abstract

Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves.

Main idea

Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient.

Method signal

We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%).

Contribution signal

This ablation establishes a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality.
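The routing decision described here can be illustrated with a minimal sketch. It is not the authors' implementation: the `AgentCard` and `Part` classes, the `input_modes` field, and the `route_part` and `to_text` names are hypothetical simplifications introduced only for this example. The idea shown is that a part is forwarded in its native modality when the receiving agent declares support for it, and is otherwise converted to text, which is the text-bottleneck path.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AgentCard:
    """Hypothetical, simplified stand-in for an A2A Agent Card."""
    name: str
    input_modes: frozenset[str]  # MIME types the agent declares it accepts


@dataclass(frozen=True)
class Part:
    """One message part: a MIME type plus its raw payload."""
    mime_type: str
    payload: bytes


def accepts(card: AgentCard, mime_type: str) -> bool:
    """True if the card declares the exact type or a wildcard such as image/*."""
    family = mime_type.split("/", 1)[0] + "/*"
    return mime_type in card.input_modes or family in card.input_modes


def route_part(part: Part, card: AgentCard,
               to_text: Callable[[Part], str]) -> Part:
    """Forward the part natively when supported, else fall back to text."""
    if accepts(card, part.mime_type):
        return part  # native routing: payload is passed through untouched
    # Text bottleneck: the non-text signal is reduced to a text description.
    return Part("text/plain", to_text(part).encode("utf-8"))


# Usage with placeholder agents and a placeholder image payload.
vision_agent = AgentCard("vision", frozenset({"text/plain", "image/*"}))
text_agent = AgentCard("text-only", frozenset({"text/plain"}))
photo = Part("image/png", b"\x89PNG...")


def describe(part: Part) -> str:
    """Placeholder converter; a real system might call a captioning model."""
    return f"[{part.mime_type}, {len(part.payload)} bytes]"


native = route_part(photo, vision_agent, describe)
fallback = route_part(photo, text_agent, describe)
```

In this sketch the routing layer only decides what information reaches the downstream agent. Whether that agent can use the image is a separate question, which is the two-layer requirement the summary refers to.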

Original Abstract

Preserving multimodal signals across agent boundaries is necessary for accurate cross-modal reasoning, but it is not sufficient. We show that modality-native routing in Agent-to-Agent (A2A) networks improves task accuracy by 20 percentage points over text-bottleneck baselines, but only when the downstream reasoning agent can exploit the richer context that native routing preserves. An ablation replacing LLM-backed reasoning with keyword matching eliminates the accuracy gap entirely (36% vs. 36%), establishing a two-layer requirement: protocol-level routing must be paired with capable agent-level reasoning for the benefit to materialize. We present MMA2A, an architecture layer atop A2A that inspects Agent Card capability declarations to route voice, image, and text parts in their native modality. On CrossModal-CS, a controlled 50-task benchmark with the same LLM backend, same tasks, and only the routing path varying, MMA2A achieves 52% task completion accuracy versus 32% for the text-bottleneck baseline (95% bootstrap CI on the accuracy difference: [8, 32] pp; McNemar's exact p = 0.006). Gains concentrate on vision-dependent tasks: product defect reports improve by +38.5 pp and visual troubleshooting by +16.7 pp. This accuracy gain comes at a 1.8× latency cost from native multimodal processing. These results suggest that routing is a first-order design variable in multi-agent systems, as it determines the information available for downstream reasoning.
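For readers unfamiliar with the paired test cited above, the following is a minimal sketch of a two-sided exact McNemar test, which uses only the discordant pairs (tasks solved under one routing path but not the other). The abstract does not report those counts. The values 11 and 1 below are hypothetical, chosen only so that their difference equals the 10-task net gain implied by 52% versus 32% on 50 tasks (26 versus 16 tasks). The function name `mcnemar_exact` is likewise an assumption of this sketch.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts.

    b: tasks solved only by the first system
    c: tasks solved only by the second system
    Under the null hypothesis each discordant pair is a fair coin flip,
    so the smaller count follows Binomial(b + c, 0.5).
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical discordant counts (not reported in the abstract).
p_value = mcnemar_exact(11, 1)
print(f"p = {p_value:.4f}")
```

Tasks that both systems solve, or both fail, do not enter the calculation, which is why the test suits a design where the same 50 tasks are run under both routing paths.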