2026-05-05 · arXiv Daily Keyword Digest (Top 10 of 1000)

Generated: 2026-05-06T08:02:19.699691+09:00

Target date (KST): 2026-05-05
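
Because the target date is a KST calendar day while arXiv timestamps are typically handled in UTC, the day boundary has to be converted before filtering. The following is a minimal sketch of that conversion, not the digest's actual code; the function name `kst_day_window_utc` is hypothetical, and KST is assumed to be a fixed UTC+09:00 offset, matching the `+09:00` in the generated timestamp.

```python
from datetime import date, datetime, time, timedelta, timezone

# Assumption: KST is a fixed UTC+09:00 offset with no daylight saving time.
KST = timezone(timedelta(hours=9), name="KST")


def kst_day_window_utc(target: date) -> tuple[datetime, datetime]:
    """Return the half-open UTC interval [start, end) covering one KST day."""
    start_kst = datetime.combine(target, time.min, tzinfo=KST)
    end_kst = start_kst + timedelta(days=1)
    return start_kst.astimezone(timezone.utc), end_kst.astimezone(timezone.utc)


start, end = kst_day_window_utc(date(2026, 5, 5))
print(start.isoformat(), end.isoformat())
# 2026-05-04T15:00:00+00:00 2026-05-05T15:00:00+00:00
```

A paper would then count as published on the target date when its submission timestamp satisfies `start <= ts < end`.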

Selection: picked 10 from 1000 papers published on the target date

Source: https://export.arxiv.org/api/query (`cat:cs.*`, sorted by submittedDate desc)
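
As a rough illustration of how such a request could be assembled, the sketch below builds paged query URLs for the endpoint above using the arXiv API's `search_query`, `sortBy`, `sortOrder`, `start`, and `max_results` parameters. The page size of 100 and the helper name `build_query_url` are assumptions for illustration; the digest does not state how it pages through results.

```python
from urllib.parse import urlencode

BASE_URL = "https://export.arxiv.org/api/query"


def build_query_url(start: int = 0, max_results: int = 100) -> str:
    """Build one page of the `cat:cs.*` query, newest submissions first."""
    params = {
        "search_query": "cat:cs.*",
        "sortBy": "submittedDate",
        "sortOrder": "descending",
        "start": start,              # zero-based offset into the result set
        "max_results": max_results,  # assumed page size, not from the digest
    }
    return f"{BASE_URL}?{urlencode(params)}"


# Ten pages of 100 cover the 1000 most recent cs.* submissions.
urls = [build_query_url(start=offset) for offset in range(0, 1000, 100)]
print(urls[0])
```

The response is an Atom feed, so each URL would still need to be fetched and parsed before any scoring can happen.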

Selection logic: keyword-weight score + subject boost
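
One plausible reading of "keyword-weight score + subject boost" is sketched below: sum a weight for every keyword found in the title and abstract, add a bonus per matching category, and keep the highest-scoring papers. All weights, boosts, and names (`KEYWORD_WEIGHTS`, `SUBJECT_BOOST`, `Paper`, `score`, `top_n`) are hypothetical, since the digest does not publish its actual values, and this sketch will not reproduce the scores listed here.

```python
from dataclasses import dataclass

# Hypothetical weights and boosts; the digest's real values are not published.
KEYWORD_WEIGHTS = {
    "multi-agent": 5.0,
    "agent": 3.0,
    "llm": 3.0,
    "benchmark": 2.0,
    "reasoning": 2.0,
}
SUBJECT_BOOST = {"cs.AI": 2.0, "cs.MA": 1.5, "cs.CL": 1.0}


@dataclass
class Paper:
    title: str
    abstract: str
    categories: list[str]


def matched_keywords(paper: Paper) -> list[str]:
    """Case-insensitive substring match, so 'multi-agent' also matches 'agent'."""
    text = f"{paper.title} {paper.abstract}".lower()
    return sorted(kw for kw in KEYWORD_WEIGHTS if kw in text)


def score(paper: Paper) -> float:
    """Keyword-weight sum plus a per-category subject boost."""
    keyword_score = sum(KEYWORD_WEIGHTS[kw] for kw in matched_keywords(paper))
    boost = sum(SUBJECT_BOOST.get(cat, 0.0) for cat in paper.categories)
    return keyword_score + boost


def top_n(papers: list[Paper], n: int = 10) -> list[Paper]:
    """Return the n highest-scoring papers, best first."""
    return sorted(papers, key=score, reverse=True)[:n]


papers = [
    Paper("Multi-Agent LLM Debate", "A benchmark for reasoning.", ["cs.AI"]),
    Paper("Sorting networks", "A note on comparators.", ["cs.DS"]),
]
print(score(papers[0]), matched_keywords(papers[0]))
# 17.0 ['agent', 'benchmark', 'llm', 'multi-agent', 'reasoning']
```

Substring matching is used here because the entries below list overlapping keywords such as "agent" and "multi-agent" together; a real implementation might instead tokenize the text to avoid spurious matches on short keywords.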

#1 12 Angry AI Agents: Evaluating Multi-Agent LLM Decision-Making Through Cinematic Jury Deliberation

Score: 56.2

Matched keywords: agent, agent framework, ai, ai agents, alignment, benchmark, large language models, llm, multi-agent, prompt, rlhf

Categories: cs.AI

Compressed abstract: What if the twelve jurors of Sidney Lumet's 12 Angry Men (1957) were not men, but large language models? Would the one juror who disagrees still be able to change everyone's mind?

#2 Reinforcement Learning for LLM-based Multi-Agent Systems through Orchestration Traces

Score: 31.4

Matched keywords: agent, large language model, llm, multi-agent, token, tool use

Categories: cs.CL

Compressed abstract: As large language model (LLM) agents evolve from isolated tool users into coordinated teams, reinforcement learning (RL) must optimize not only individual actions but also how work is spawned, delegated, communicated, aggregated, and stopped. This paper studies RL for LLM-based multi-agent systems through orchestration traces: temporal interaction graphs whose events include sub-agent spawning, delegation, communica…

#3 When Stress Becomes Signal: Detecting Antifragility-Compatible Regimes in Multi-Agent LLM Systems

Score: 31.7

Matched keywords: agent, benchmark, llm, multi-agent, reasoning

Categories: cs.MA, cs.AI, cs.CE

Compressed abstract: Multi-agent LLM systems are increasingly used to solve complex tasks through decomposition, debate, specialization, and ensemble reasoning. However, these systems are usually evaluated in terms of robustness: whether performance is preserved under perturbation.

#4 ClarifySTL: An Interactive LLM Agent Framework for STL Transformation through Requirements Clarification

Score: 33.0

Matched keywords: agent, agent framework, benchmark, large language models, llm

Categories: cs.SE, cs.FL

Compressed abstract: Signal Temporal Logic (STL) is a formal language for specifying real-time behaviors of cyber-physical systems (CPS). Automatically transforming natural language requirements into STL specifications has received growing attention.

#5 When Embedding-Based Defenses Fail: Rethinking Safety in LLM-Based Multi-Agent Systems

Score: 32.2

Matched keywords: agent, large language model, llm, multi-agent, token

Categories: cs.CR, cs.LG, cs.MA

Compressed abstract: Large language model (LLM)-powered multi-agent systems (MAS) enable agents to communicate and share information, achieving strong performance on complex tasks. However, this communication also creates an attack surface where malicious agents can propagate misinformation and manipulate group decisions, undermining MAS safety.

#6 The Reasoning Trap: An Information-Theoretic Bound on Closed-System Multi-Step LLM Reasoning

Score: 34.2

Matched keywords: agent, llm, multi-agent, reasoning

Categories: cs.CL

Compressed abstract: When copies of the same language model are prompted to debate, they produce diverse phrasings of one perspective rather than diverse perspectives. Multi-agent debate (MAD), and more broadly closed-system reasoning where agents iteratively transform each other's outputs, tends to preserve answer accuracy while degrading the reasoning behind those answers.

#7 QASecClaw: A Multi-Agent LLM Approach for False Positive Reduction in Static Application Security Testing

Score: 26.5

Matched keywords: agent, benchmark, large language model, llm, multi-agent

Categories: cs.CR, cs.SE

Compressed abstract: Static Application Security Testing tools help developers find security vulnerabilities before release, but they often produce many false positives. This increases manual review effort, reduces developer trust, and may cause real vulnerabilities to be ignored among noisy reports.

#8 A Compound AI Agent for Conversational Grant Discovery

Score: 27.7

Matched keywords: agent, ai, ai agent, llm, reasoning

Categories: cs.AI

Compressed abstract: Research funding discovery remains fundamentally fragmented: researchers navigate disparate agency portals (e.g., in the United States, NSF, NIH, DARPA, Grants.gov, and many others) with heterogeneous interfaces, search capabilities, and data schemas. We present a compound AI system that unifies this landscape through two tightly coupled components: (1) an aggregation layer that autonomously collects, normalizes, an…

#9 NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

Score: 17.2

Matched keywords: agent, benchmark, llm

Categories: cs.AI

Compressed abstract: Outcome-only evaluation under-specifies whether an evaluated agent profile preserves the commitments required to solve a multi-turn task coherently. NeuroState-Bench is a human-calibrated benchmark that operationalizes commitment integrity through benchmark-defined side-query probes rather than inferred hidden activations.

#10 Multi-Agent Reasoning Improves Compute Efficiency: Pareto-Optimal Test-Time Scaling

Score: 23.2

Matched keywords: agent, multi-agent, reasoning

Categories: cs.AI

Compressed abstract: Advances in inference methods have enabled language models to improve their predictions without additional training. These methods often prioritize raw performance over cost-effective compute usage.
