2026-06-09 · arXiv Daily Keyword Digest (Top 10 of 1000)

Generated: 2026-06-10T08:02:25.101804+09:00

Target date (KST): 2026-06-09

Selection: picked 10 from 1000 papers published on the target date

Source: https://export.arxiv.org/api/query (`cat:cs.*`, sorted by submittedDate desc)
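
As a rough illustration of the source query above, the sketch below builds one page of the request URL. The `search_query`, `sortBy`, `sortOrder`, `start`, and `max_results` parameters are part of the public arXiv API. The function name `build_query_url` and the page size of 100 are assumptions made for this example, not details taken from the digest's own pipeline.

```python
from urllib.parse import urlencode

ARXIV_API = "https://export.arxiv.org/api/query"


def build_query_url(start: int = 0, max_results: int = 100) -> str:
    """Return one page of the cs.* listing, newest submissions first.

    The page size is an assumption; the digest does not state how it pages.
    """
    params = {
        "search_query": "cat:cs.*",   # category filter quoted in the digest
        "sortBy": "submittedDate",    # sort key quoted in the digest
        "sortOrder": "descending",
        "start": start,               # zero-based offset into the result set
        "max_results": max_results,
    }
    return f"{ARXIV_API}?{urlencode(params)}"


url = build_query_url(start=0, max_results=100)
```

Fetching 1000 papers this way would mean stepping `start` through the result set in pages; the response is an Atom feed that still has to be parsed and filtered to the target date.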

Selection logic: keyword-weight score + subject boost
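
One plausible reading of "keyword-weight score + subject boost" is sketched below. Every weight, boost value, and name here (`KEYWORD_WEIGHTS`, `SUBJECT_BOOST`, `Paper`, `score`, `top_n`) is hypothetical, since the digest does not publish its actual configuration, so this will not reproduce the scores listed in the entries.

```python
from dataclasses import dataclass

# Hypothetical values for illustration only; the real weights are not published.
KEYWORD_WEIGHTS = {"agent": 5.0, "llm": 4.0, "multi-agent": 6.0, "benchmark": 2.0}
SUBJECT_BOOST = {"cs.AI": 1.5, "cs.MA": 2.0}


@dataclass
class Paper:
    title: str
    abstract: str
    categories: list[str]


def matched_keywords(paper: Paper) -> list[str]:
    """Keywords found by case-insensitive substring match in title + abstract."""
    text = f"{paper.title} {paper.abstract}".lower()
    return sorted(k for k in KEYWORD_WEIGHTS if k in text)


def score(paper: Paper) -> float:
    """Sum of matched keyword weights plus a boost for each listed category."""
    keyword_part = sum(KEYWORD_WEIGHTS[k] for k in matched_keywords(paper))
    boost_part = sum(SUBJECT_BOOST.get(c, 0.0) for c in paper.categories)
    return keyword_part + boost_part


def top_n(papers: list[Paper], n: int = 10) -> list[Paper]:
    """Return the n highest-scoring papers, best first."""
    return sorted(papers, key=score, reverse=True)[:n]
```

With substring matching, a title containing "multi-agent" also matches "agent", which is consistent with entries in this digest that list both keywords. Whether the real pipeline matches this way is an assumption.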

#1 Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

Score: 22.2

Matched keywords: agent, benchmark, llm

Categories: cs.LG, cs.AI

Compressed abstract: When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the…

#2 Bayesian-Agent: Posterior-Guided Skill Evolution for LLM Agent Harnesses

Score: 31.8

Matched keywords: agent, harness, llm, prompt

Categories: cs.CL

Compressed abstract: LLM agents increasingly rely on external inference conditions: prompts, tools, memory, SOPs, skills, and harness feedback. These assets can improve task execution without changing model weights, but they are often revised by heuristic reflection or by reusing observed successes and failures as if counts alone were reliable belief.

#3 Toward Human-Centered Multi-Agent Systems: Integrating Cognition, Culture, Values, and Cooperation in AI Agents

Score: 51.0

Matched keywords: agent, ai, ai agents, alignment, large language model, llm, multi-agent, tool use

Categories: cs.MA

Compressed abstract: The emergence of large language model (LLM)-based agents and multi-agent systems has enabled a shift from narrow task automation to more autonomous decision-making. Despite progress in language generation, planning, tool use, and coordination, most agents still treat intelligence as prediction, optimization, and task completion.

#4 DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

Score: 40.3

Matched keywords: agent, fine-tuning, large language model, llm, multi-agent, prompt, reasoning

Categories: cs.LG

Compressed abstract: Multi-agent large language model (LLM) systems often fail to reliably outperform a single strong model equipped with best-of-N sampling. We argue that a core source of this instability is ill-posed equilibrium selection: current systems specify what information agents share, but not which coordination convention should be selected.

#5 The Token Not Taken: Sampling, State, and the Variability of AI Agent Outputs

Score: 23.8

Matched keywords: agent, ai, ai agent, foundation model, token

Categories: cs.AI, cs.CY, econ.GN

Compressed abstract: Agentic AI systems can behave differently across runs: the same request may produce a different plan, a different tool call, a different code edit, or a different final answer. Such variability arises from several layers that are often conflated.

#6 Distilling LLM Reasoning into an Interpretable Policy Tree for Human-AI Collaboration

Score: 38.7

Matched keywords: agent, ai, large language models, llm, multi-agent, reasoning

Categories: cs.AI, cs.HC

Compressed abstract: Constructing efficient and reliable policies to assist humans is indispensable for human-AI collaboration. Existing methods mainly follow two lines of work.

#7 AGENTSERVESIM: A Hardware-aware Simulator for Multi-Turn LLM Agent Serving

Score: 21.4

Matched keywords: agent, llm

Categories: cs.CL, cs.AI

Compressed abstract: Multi-turn LLM agents interleave model calls with external tool invocations, shifting serving from stateless request processing to stateful program execution. Serving these workloads requires scheduling, KV-cache management, and routing policies that use program-level context, including turn dependencies, tool-induced gaps, and reusable KV state.

#8 AlloSpatial: Agentic Harness Framework for Spatial Reasoning in Foundation Models

Score: 28.8

Matched keywords: foundation models, harness, multimodal, reasoning, tool use

Categories: cs.AI

Compressed abstract: Multimodal Foundation Models (MFMs) have made substantial progress, yet remain fragile in spatial reasoning over the physical world. A key bottleneck lies in their inability to transform local egocentric observations into a global allocentric spatial representation.

#9 Context-Fractured Decomposition Attacks on Tool-Using LLM Agents: Exploiting Artifact Provenance Gaps

Score: 28.2

Matched keywords: agent, llm, tool-using

Categories: cs.CR, cs.AI

Compressed abstract: Tool-using LLM agents interact with the world through actions that persist state in artifacts (e.g., workspace files or logs). Consequently, jailbreak defenses must reason about cross-step composition rather than isolated text.

#10 REFLECT: Intervention-Supported Error Attribution for Silent Failures in LLM Agent Traces

Score: 20.7

Matched keywords: agent, large language model, llm, reasoning

Categories: cs.AI

Compressed abstract: Large language model (LLM) agents now solve complex tasks through long plan-and-execution traces, yet the ability to locate errors in a completed trace still lags far behind, especially in the silent failure regime. Existing approaches predict suspect steps via classifiers or LLM judges, or recover correct answers via retry, but none feed the intervention outcome back to refine the attribution itself.
