arXiv daily keyword digest · 2026-04-29

#1 Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation

Score: 38.4

Matched keywords: agent, agent framework, benchmark, llm, multi-agent, reasoning

Categories: cs.GR, cs.AI, cs.CL

Compressed abstract: Cutscenes are carefully choreographed cinematic sequences embedded in video games and interactive media, serving as the primary vehicle for narrative delivery, character development, and emotional engagement. Producing cutscenes is inherently complex: it demands seamless coordination across screenwriting, cinematography, character animation, voice acting, and technical direction, often requiring days to weeks of col…

#2 Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses

Score: 31.4

Matched keywords: agent, benchmark, harness, harness engineering, token

Categories: cs.CL, cs.SE

Compressed abstract: Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round's outcomes.

#3 LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation

Score: 20.4

Matched keywords: alignment, benchmark, large language model, llm

Categories: cs.CL, cs.AI, cs.DL, cs.IR

Compressed abstract: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K wor…

#4 A Comparative Evaluation of AI Agent Security Guardrails

Score: 23.2

Matched keywords: agent, ai, ai agent

Categories: cs.CR, cs.AI

Compressed abstract: This report presents a comparative evaluation of DKnownAI Guard in AI agent security scenarios, benchmarked against three competing products: AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard. Using human annotation as the ground truth, we assess each guardrail's ability to detect two categories of risks: threats to the agent itself (e.g., instruction override, indirect injection, tool abuse) and reques…

#5 BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks

Score: 24.9

Matched keywords: agent, ai, benchmark, llm

Categories: cs.CL, cs.AI, cs.SE

Compressed abstract: As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated audi…

#6 MARD: A Multi-Agent Framework for Robust Android Malware Detection

Score: 47.2

Matched keywords: agent, agent framework, fine-tuning, large language models, llm, machine learning, multi-agent, reasoning, token

Categories: cs.CR, cs.SE

Compressed abstract: With the rapid evolution of Android applications, traditional machine learning-based detection models suffer from concept drift. Additionally, they are constrained by shallow features, lacking deep semantic understanding and interpretability of decisions.

#7 Analyzing LLM Reasoning to Uncover Mental Health Stigma

Score: 22.4

Matched keywords: benchmark, large language models, llm, reasoning

Categories: cs.CL, cs.AI

Compressed abstract: While large language models (LLMs) are increasingly being explored for mental health applications, recent studies reveal that they can exhibit stigma toward individuals with psychological conditions. Existing evaluations of this stigma primarily rely on multiple-choice questions (MCQs), which fail to capture the biases embedded within the models' underlying logic.

#8 Recursive Multi-Agent Systems

Score: 31.1

Matched keywords: agent, agent framework, code generation, multi-agent, reasoning, token

Categories: cs.AI, cs.CL, cs.LG

Compressed abstract: Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion?

#9 Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest

Score: 20.4

Matched keywords: agent, ai, multi-agent

Categories: cs.AI, cs.CL

Compressed abstract: Language Model (LM)-based agents remain largely untested in mixed-motive settings where agents must leverage short-term cooperation for long-term competitive goals (e.g., multi-party politics). We introduce Cooperate to Compete (C2C), a multi-agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective.

#10 Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

Score: 20.4

Matched keywords: large language models, llm, reasoning

Categories: cs.AI, cs.CL, cs.LG

Compressed abstract: While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing performance. However, specific to reasoning LLMs, prior work has shown that structured pruning (methods which remove entire set of layer bl…

2026-04-29 · arXiv Daily Keyword Digest (Top 10 of 550)
