arXiv daily keyword digest · 2026-06-05

#1 Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals

Score: 24.2

Matched keywords: agent, harness, llm

Categories: cs.CR, cs.AI

Compressed abstract: As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client).

#2 CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

Score: 33.7

Matched keywords: agent, large language models, llm, multi-agent, reasoning, tool use

Categories: cs.CL

Compressed abstract: Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task u…

#3 When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories

Score: 21.4

Matched keywords: agent, llm, tool use

Categories: cs.CL, cs.AI, cs.HC, cs.LG

Compressed abstract: Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions.

#4 Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving

Score: 34.2

Matched keywords: agent, benchmark, large language models, llm, multi-agent, reasoning

Categories: cs.AI, cs.LG

Compressed abstract: Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they remain susceptible to hallucinations, intermediate reasoning mistakes, and unreliable results on complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning.

#5 MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA

Score: 30.8

Matched keywords: agent, agent framework, multimodal, reasoning

Categories: cs.CL, cs.AI

Compressed abstract: Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning.

#6 Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning

Score: 15.0

Matched keywords: diffusion, fine-tuning, llm, prompt

Categories: cs.CV, cs.AI

Compressed abstract: T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates images in a children's hand-drawing style from short Korean diary entries.

#7 Personal AI Agent for Camera Roll VQA

Score: 36.7

Matched keywords: agent, ai, ai agent, ai agents, reasoning

Categories: cs.CV, cs.AI

Compressed abstract: We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before").

#8 ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces

Score: 15.4

Matched keywords: llm, reasoning

Categories: cs.CL, cs.AI

Compressed abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs).

#9 Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows

Score: 29.2

Matched keywords: agent, benchmark, llm, multi-agent, reasoning

Categories: cs.AI

Compressed abstract: Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol.

#10 ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

Score: 28.2

Matched keywords: agent, benchmark, coding agent, llm

Categories: cs.SE, cs.AI

Compressed abstract: The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose LLM-as-a-Developer, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it thro…

2026-06-05 · arXiv Daily Keyword Digest (Top 10 of 798)