#1 Cutscene Agent: An LLM Agent Framework for Automated 3D Cutscene Generation
Score: 38.4
Matched keywords: agent, agent framework, benchmark, llm, multi-agent, reasoning
Categories: cs.GR, cs.AI, cs.CL
Compressed abstract: Cutscenes are carefully choreographed cinematic sequences embedded in video games and interactive media, serving as the primary vehicle for narrative delivery, character development, and emotional engagement. Producing cutscenes is inherently complex: it demands seamless coordination across screenwriting, cinematography, character animation, voice acting, and technical direction, often requiring days to weeks of col…
#2 Agentic Harness Engineering: Observability-Driven Automatic Evolution of Coding-Agent Harnesses
Score: 31.4
Matched keywords: agent, benchmark, harness, harness engineering, token
Categories: cs.CL, cs.SE
Compressed abstract: Harnesses have become a central determinant of coding-agent performance, shaping how models interact with repositories, tools, and execution environments. Yet automating harness engineering is hard: a heterogeneous action space, sparse and noisy evaluation signal, multi-million-token trajectories, and edits whose effect is hard to attribute to the next round's outcomes.
#3 LLM-ReSum: A Framework for LLM Reflective Summarization through Self-Evaluation
Score: 20.4
Matched keywords: alignment, benchmark, large language model, llm
Categories: cs.CL, cs.AI, cs.DL, cs.IR
Compressed abstract: Reliable evaluation of large language model (LLM)-generated summaries remains an open challenge, particularly across heterogeneous domains and document lengths. We conduct a comprehensive meta-evaluation of 14 automatic summarization metrics and LLM-based evaluators across seven datasets spanning five domains, covering documents from short news articles to long scientific, governmental, and legal texts (2K-27K wor…
#4 A Comparative Evaluation of AI Agent Security Guardrails
Score: 23.2
Matched keywords: agent, ai, ai agent
Categories: cs.CR, cs.AI
Compressed abstract: This report presents a comparative evaluation of DKnownAI Guard in AI agent security scenarios, benchmarked against three competing products: AWS Bedrock Guardrails, Azure Content Safety, and Lakera Guard. Using human annotation as the ground truth, we assess each guardrail's ability to detect two categories of risks: threats to the agent itself (e.g., instruction override, indirect injection, tool abuse) and reques…
#5 BenchGuard: Who Guards the Benchmarks? Automated Auditing of LLM Agent Benchmarks
Score: 24.9
Matched keywords: agent, ai, benchmark, llm
Categories: cs.CL, cs.AI, cs.SE
Compressed abstract: As benchmarks grow in complexity, many apparent agent failures are not failures of the agent at all - they are failures of the benchmark itself: broken specifications, implicit assumptions, and rigid evaluation scripts that penalize valid alternative approaches. We propose employing frontier LLMs as systematic auditors of evaluation infrastructure, and realize this vision through BenchGuard, the first automated audi…
#6 MARD: A Multi-Agent Framework for Robust Android Malware Detection
Score: 47.2
Matched keywords: agent, agent framework, fine-tuning, large language models, llm, machine learning, multi-agent, reasoning, token
Categories: cs.CR, cs.SE
Compressed abstract: With the rapid evolution of Android applications, traditional machine learning-based detection models suffer from concept drift. Additionally, they are constrained by shallow features, lacking deep semantic understanding and interpretability of decisions.
#7 Analyzing LLM Reasoning to Uncover Mental Health Stigma
Score: 22.4
Matched keywords: benchmark, large language models, llm, reasoning
Categories: cs.CL, cs.AI
Compressed abstract: While large language models (LLMs) are increasingly being explored for mental health applications, recent studies reveal that they can exhibit stigma toward individuals with psychological conditions. Existing evaluations of this stigma primarily rely on multiple-choice questions (MCQs), which fail to capture the biases embedded within the models' underlying logic.
#8 Recursive Multi-Agent Systems
Score: 31.1
Matched keywords: agent, agent framework, code generation, multi-agent, reasoning, token
Categories: cs.AI, cs.CL, cs.LG
Compressed abstract: Recursive or looped language models have recently emerged as a new scaling axis by iteratively refining the same model computation over latent states to deepen reasoning. We extend this scaling principle from a single model to multi-agent systems, and ask: Can agent collaboration itself be scaled through recursion?
#9 Cooperate to Compete: Strategic Coordination in Multi-Agent Conquest
Score: 20.4
Matched keywords: agent, ai, multi-agent
Categories: cs.AI, cs.CL
Compressed abstract: Language Model (LM)-based agents remain largely untested in mixed-motive settings where agents must leverage short-term cooperation for long-term competitive goals (e.g., multi-party politics). We introduce Cooperate to Compete (C2C), a multi-agent environment where players can engage in private negotiations while competing to be the first to achieve their secret objective.
#10 Doing More With Less: Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling
Score: 20.4
Matched keywords: large language models, llm, reasoning
Categories: cs.AI, cs.CL, cs.LG
Compressed abstract: While current Large Language Models (LLMs) exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), their massive parameter counts and high inference costs have motivated the development of pruning methods that can reduce model size without sacrificing performance. However, specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer bl…