arXiv daily keyword digest · 2026-05-15

#1 GAMBIT: A Three-Mode Benchmark for Adversarial Robustness in Multi-Agent LLM Collectives

Score: 31.4

Matched keywords: agent, ai, benchmark, llm, multi-agent, reasoning

Categories: cs.CL, cs.AI, cs.LG, cs.MA

Compressed abstract: In multi-agent systems (MAS), a single deceptive agent can nullify all gains of an agentic AI collective and evade deployed defenses. However, existing adversarial studies on MAS target only shallow tasks and do not consider adaptive adversaries, which evolve their strategies to evade the very detectors trained to catch them.

Open summary page · arXiv · PDF

#2 Making OpenAPI Documentation Agent-Ready: Detecting Documentation and REST Smells with a Multi-Agent LLM System

Score: 29.0

Matched keywords: agent, ai, ai agents, llm, multi-agent

Categories: cs.SE

Compressed abstract: The growing adoption of AI agents and the Model Context Protocol (MCP) has motivated organizations to expose existing REST APIs as agent-consumable tools. In our industrial context, this initiative targeted an ecosystem of 16 production APIs comprising approximately 600 endpoints.

Open summary page · arXiv · PDF

#3 Case-Based Calibration of Adaptive Reasoning and Execution for LLM Tool Use

Score: 27.4

Matched keywords: large language models, llm, reasoning, tool use

Categories: cs.AI, cs.CL

Compressed abstract: Tool use extends large language models beyond parametric knowledge, but reliable execution requires balancing appropriate reasoning depth with strict structural validity. We approach this problem from a case-based perspective to present CAST, a case-driven framework that treats historical execution trajectories as structured cases.

Open summary page · arXiv · PDF

#4 Cattle Trade: A Multi-Agent Benchmark for LLM Bluffing, Bidding, and Bargaining

Score: 36.7

Matched keywords: agent, benchmark, large language models, llm, multi-agent, reasoning

Categories: cs.AI

Compressed abstract: We introduce Cattle Trade, a multi-agent benchmark for evaluating large language models (LLMs) as agents in strategic reasoning under imperfect information, adversarial interaction, and resource constraints. The benchmark combines auctions, hidden-offer trade challenges (TCs), bargaining, bluffing, opponent modeling, and resource allocation within a single long-horizon game lasting 50-60 turns.

Open summary page · arXiv · PDF

#5 Invisible Orchestrators Suppress Protective Behavior and Dissociate Power-Holders: Safety Risks in Multi-Agent LLM Systems

Score: 27.7

Matched keywords: agent, ai, alignment, llm, multi-agent

Categories: cs.AI, cs.CY, cs.MA

Compressed abstract: Multi-agent orchestration, in which a hidden coordinator manages specialized worker agents, is becoming the default architecture for enterprise AI deployment, yet the safety implications of orchestrator invisibility have never been empirically tested. We conducted a preregistered 3 x 2 experiment (365 runs, 5 agents per run) crossing three organizational structures (visible leader, invisible orchestrator, flat) w…

Open summary page · arXiv · PDF

#6 IFPV: An Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verification

Score: 34.2

Matched keywords: agent, agent framework, large language model, llm, multi-agent

Categories: cs.MA, cs.AI

Compressed abstract: Operational plan generation and verification are critical for modern complex and rapidly changing battlefield environments, yet traditional generation and verification methods still respectively face the challenges of generation infeasibility and verification insufficiency. To alleviate these limitations, we propose an Integrated Multi-Agent Framework for Generative Operational Planning and High-Fidelity Plan Verifi…

Open summary page · arXiv · PDF

#7 SpeakerLLM: A Speaker-Specialized Audio-LLM for Speaker Understanding and Verification Reasoning

Score: 25.7

Matched keywords: ai, large language models, llm, reasoning

Categories: cs.SD, cs.AI, cs.LG, cs.MM, eess.AS

Compressed abstract: As audio-first agents become increasingly common in physical AI, conversational robots, and screenless wearables, audio large language models (audio-LLMs) must integrate speaker-specific understanding to support user authorization, personalization, and context-aware interaction. This requires modeling who is speaking, how the voice sounds, and how recording conditions affect speaker cues.

Open summary page · arXiv · PDF

#8 GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

Score: 32.2

Matched keywords: agent, benchmark, large language model, llm, reasoning

Categories: cs.CL

Compressed abstract: Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users in…

Open summary page · arXiv · PDF

#9 Coding Agent Is Good As World Simulator

Score: 28.3

Matched keywords: agent, code agent, code generation, coding agent, prompt

Categories: cs.AI

Compressed abstract: World models have emerged as a powerful paradigm for building interactive simulation environments, with recent video-based approaches demonstrating impressive progress in generating visually plausible dynamics. However, because these models typically infer dynamics from video and represent them in latent states, they do not explicitly enforce physical constraints.

Open summary page · arXiv · PDF

#10 Model-Adaptive Tool Necessity Reveals the Knowing-Doing Gap in LLM Tool Use

Score: 26.6

Matched keywords: large language models, llm, token, tool use

Categories: cs.AI

Compressed abstract: Large language models (LLMs) increasingly act as autonomous agents that must decide when to answer directly vs. when to invoke external tools.

Open summary page · arXiv · PDF

2026-05-15 · arXiv Daily Keyword Digest (Top 10 of 792)
