arXiv daily keyword digest · 2026-04-27

#1 Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

Score: 29.7

Matched keywords: agent, ai, llm, multi-agent

Categories: cs.LG, cs.AI

Compressed abstract: Emerging AI systems in behavioral health and psychiatry use multi-step or multi-agent LLM pipelines for tasks like assessing self-harm risk and screening for depression. However, common evaluation approaches, like LLM-as-a-judge, do not indicate when a decision is reliable or how errors may accumulate across multiple LLM judgements, limiting their suitability for safety-critical settings.

#2 Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems

Score: 28.5

Matched keywords: agent, benchmark, llm, multi-agent, reasoning

Categories: cs.MA

Compressed abstract: Failure attribution, i.e., identifying the responsible agent and decisive step of a failure, is particularly challenging in LLM-based multi-agent systems (MAS) due to their natural-language reasoning, nondeterministic outputs, and intricate interaction dynamics. A reliable benchmark is therefore essential to guide and evaluate attribution techniques.

#3 AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Score: 25.2

Matched keywords: agent, ai, ai agent, benchmark

Categories: cs.AI, cs.IR, cs.MA

Compressed abstract: The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone.

#4 Aligning Dense Retrievers with LLM Utility via Distillation

Score: 18.2

Matched keywords: benchmark, llm, rag, token

Categories: cs.IR, cs.AI, cs.LG

Compressed abstract: Dense vector retrieval is the practical backbone of Retrieval-Augmented Generation (RAG), but similarity search can suffer from precision limitations. Conversely, utility-based approaches leveraging LLM re-ranking often achieve superior performance but are computationally prohibitive and prone to noise inherent in perplexity estimation.

#5 Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

Score: 16.8

Matched keywords: ai, large language models, llm, prompt, reasoning

Categories: cs.LG, cs.AI

Compressed abstract: Large language models (LLMs) are increasingly utilized in clinical reasoning and risk assessment. However, their interpretive reliability in critical and indeterminate domains such as psychiatry remains unclear.

#6 Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

Score: 23.2

Matched keywords: large language models, llm, reasoning

Categories: cs.AI

Compressed abstract: Recent advancements in large language models have led to significant improvements across various tasks, including mathematical reasoning, which is used to assess models' intelligence in logical reasoning and problem-solving. Models are evaluated on mathematical reasoning benchmarks by verifying the correctness of the final answer against a ground truth answer.

#7 Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation

Score: 28.0

Matched keywords: agent, alignment, large language model, llm, multi-agent

Categories: cs.CR

Compressed abstract: The offensive security landscape is highly fragmented: enterprise platforms avoid memory-corruption vulnerabilities due to Denial of Service (DoS) risks, Automatic Exploit Generation (AEG) systems suffer from semantic blindness, and Large Language Model (LLM) agents face safety alignment filters and "Live Fire" execution hazards. We introduce Automation-Exploit, a fully autonomous Multi-Agent System (MAS) framework…

#8 Multi-Agent Consensus as a Cognitive Bias Trigger in Human-AI Interaction

Score: 26.0

Matched keywords: agent, ai, diffusion, llm, multi-agent

Categories: cs.HC

Compressed abstract: As multi-agent AI systems become more common, users increasingly encounter not a single AI voice but a collective one. This shift introduces social dynamics, such as consensus, dissent, and gradual convergence, that can trigger cognitive biases and distort human judgment.

#9 DM^3-Nav: Decentralized Multi-Agent Multimodal Multi-Object Semantic Navigation

Score: 19.4

Matched keywords: agent, multi-agent, multimodal

Categories: cs.MA, cs.RO

Compressed abstract: We present DM^3-Nav, a fully decentralized multi-agent semantic navigation system supporting multimodal open-vocabulary goal specification and multi-object missions. In our setting, decentralization implies operation without a central coordinator, global map aggregation, or shared global state at runtime.

#10 FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting

Score: 15.6

Matched keywords: benchmark, foundation models, machine learning

Categories: cs.LG, cs.AI, cs.CE

Compressed abstract: Driven by the transition towards a climate-neutral energy system, accurate energy time series forecasting is critical for planning and operation. Yet, it remains largely a dataset-specific task, requiring comprehensive training data, limiting scalability, and resulting in high model development and maintenance effort.

2026-04-27 · arXiv Daily Keyword Digest (Top 10 of 423)

#1 Reliable Self-Harm Risk Screening via Adaptive Multi-Agent LLM Systems

#2 Seeing the Whole Elephant: A Benchmark for Failure Attribution in LLM-based Multi-Agent Systems

#3 AgentSearchBench: A Benchmark for AI Agent Search in the Wild

#4 Aligning Dense Retrievers with LLM Utility via Distillation

#5 Reliability Auditing for Downstream LLM tasks in Psychiatry: LLM-Generated Hospitalization Risk Scores

#6 Rethinking Math Reasoning Evaluation: A Robust LLM-as-a-Judge Framework Beyond Symbolic Rigidity

#7 Automation-Exploit: A Multi-Agent LLM Framework for Adaptive Offensive Security with Digital Twin-Based Risk-Mitigated Exploitation

#8 Multi-Agent Consensus as a Cognitive Bias Trigger in Human-AI Interaction

#9 DM^3-Nav: Decentralized Multi-Agent Multimodal Multi-Object Semantic Navigation

#10 FETS Benchmark: Foundation Models Outperform Dataset-specific Machine Learning in Energy Time Series Forecasting