#1 Will the Agent Recuse Itself? Measuring LLM-Agent Compliance with In-Band Access-Deny Signals
Score: 24.2
Matched keywords: agent, harness, llm
Categories: cs.CR, cs.AI
Compressed abstract: As autonomous LLM agents increasingly hold real credentials and operate infrastructure without a human in the loop, operators have no standard way to tell an agent that a resource is off-limits. Access controls either let the agent in (it has valid credentials) or hard-fail it (indistinguishable from any other client).
#2 CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments
Score: 33.7
Matched keywords: agent, large language models, llm, multi-agent, reasoning, tool use
Categories: cs.CL
Compressed abstract: Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent studies suggest that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task u…
#3 When Evidence is Sparse: Weakly Supervised Early Failure Alerting in Dialogs and LLM-Agent Trajectories
Score: 21.4
Matched keywords: agent, llm, tool use
Categories: cs.CL, cs.AI, cs.HC, cs.LG
Compressed abstract: Early failure alerting requires deciding, while a dialog or agent trajectory is still unfolding, whether to flag it as likely to fail. This is challenging because supervision is typically available only as a trajectory-level success/failure label while alerts must be raised from partial interactions.
#4 Critic-Guided Heterogeneous Multi-Agent Reasoning for Reliable Mathematical Problem Solving
Score: 34.2
Matched keywords: agent, benchmark, large language models, llm, multi-agent, reasoning
Categories: cs.AI, cs.LG
Compressed abstract: Recent Large Language Models (LLMs) have shown impressive reasoning abilities, but they remain susceptible to hallucinations, intermediate reasoning mistakes, and unreliable results on complex mathematical reasoning problems. In this study, we introduce a critic-based heterogeneous multi-agent approach to improve the dependability of mathematical reasoning.
#5 MARDoc: A Memory-Aware Refinement Agent Framework for Multimodal Long Document QA
Score: 30.8
Matched keywords: agent, agent framework, multimodal, reasoning
Categories: cs.CL, cs.AI
Compressed abstract: Iterative retrieval-reasoning agents have recently shown promise for multimodal long-document question answering. However, most existing systems maintain a single growing context that mixes retrieval traces, observations, and intermediate reasoning.
#6 Emotion-Aware Image Generation from Korean Diary Text via LLM-based Prompt Translation and LoRA Fine-Tuning
Score: 15.0
Matched keywords: diffusion, fine-tuning, llm, prompt
Categories: cs.CV, cs.AI
Compressed abstract: T2I models cannot effectively capture sentiment from various types of text, including diaries, as they primarily focus on visual object-related patterns rather than contextual emotional understanding. This paper proposes an emotion-aware text-to-image pipeline that generates children's hand drawing style images from short Korean diary entries.
#7 Personal AI Agent for Camera Roll VQA
Score: 36.7
Matched keywords: agent, ai, ai agent, ai agents, reasoning
Categories: cs.CV, cs.AI
Compressed abstract: We study the personal camera roll visual question answering setting. In this setting, a conversational AI assistant can access a user's personal camera roll and retrieve relevant photos to answer queries, ranging from simple factual questions (e.g., "Name of the food I tried yesterday?") to more open-ended ones (e.g., "Recommend some dishes I have never eaten before").
#8 ReasoningFlow: Discourse Structures for Understanding LLM Reasoning Traces
Score: 15.4
Matched keywords: llm, reasoning
Categories: cs.CL, cs.AI
Compressed abstract: Large reasoning models (LRMs) produce reasoning traces with non-linear structures, such as backtracking and self-correction, that complicate the evaluation and monitoring of the reasoning process. We introduce ReasoningFlow, a framework that captures the discourse structures of LRM reasoning traces into fine-grained directed acyclic graphs (DAGs).
#9 Do More Agents Help? Controlled and Protocol-Aligned Evaluation of LLM Agent Workflows
Score: 29.2
Matched keywords: agent, benchmark, llm, multi-agent, reasoning
Categories: cs.AI
Compressed abstract: Does adding more agents help an LLM workflow once compared systems share the same benchmark loader, tool access, answer contract, usage accounting, and trajectory logging? We introduce BenchAgent, an evaluation framework that places single-agent, fixed multi-agent (MAS), and evolving MAS workflows under one normalized execution and logging protocol.
#10 ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer
Score: 28.2
Matched keywords: agent, benchmark, coding agent, llm
Categories: cs.SE, cs.AI
Compressed abstract: The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose LLM-as-a-Developer, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it thro…