#1 ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
Score: 31.9
Matched keywords: agent, llm, multi-agent, reasoning
Categories: cs.AI, cs.CL
Compressed abstract: We present ActuBench, a multi-agent LLM pipeline for the automated generation and evaluation of advanced actuarial assessment items aligned with the International Actuarial Association (IAA) Education Syllabus. The pipeline separates four LLM roles by adapter: one agent drafts items, one constructs distractors, a third independently verifies both stages and drives bounded one-shot repair loops, and a cost-optimized…
#2 Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows
Score: 39.7
Matched keywords: agent, ai, benchmark, large language models, llm, multi-agent, reasoning
Categories: cs.CL
Compressed abstract: Multi-agent systems built from teams of large language models (LLMs) are increasingly deployed for collaborative scientific reasoning and problem-solving. These systems require agents to coordinate under shared constraints, such as GPUs or credit balances, where cooperative behavior matters.
#3 EvoAgent: An Evolvable Agent Framework with Skill Learning and Multi-Agent Delegation
Score: 35.7
Matched keywords: agent, agent framework, large language model, llm, multi-agent
Categories: cs.AI
Compressed abstract: This paper proposes EvoAgent, an evolvable large language model (LLM) agent framework that integrates structured skill learning with a hierarchical sub-agent delegation mechanism. EvoAgent models skills as multi-file structured capability units equipped with triggering mechanisms and evolutionary metadata, and enables continuous skill generation and optimization through a user-feedback-driven closed-loop process.
#4 Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
Score: 33.2
Matched keywords: agent, large language model, llm, machine learning, multi-agent
Categories: cs.AI
Compressed abstract: Automated feature generation extracts informative features from raw tabular data without manual intervention and is crucial for accurate, generalizable machine learning. Traditional methods rely on predefined operator libraries and cannot leverage task semantics, limiting their ability to produce diverse, high-value features for complex tasks.
#5 Taint-Style Vulnerability Detection and Confirmation for Node.js Packages Using LLM Agent Reasoning
Score: 27.7
Matched keywords: agent, large language models, llm, reasoning
Categories: cs.CR, cs.AI, cs.SE
Compressed abstract: The rapidly evolving Node.js ecosystem currently includes millions of packages and is a critical part of modern software supply chains, making vulnerability detection of Node.js packages increasingly important. However, traditional program analysis struggles in this setting because of dynamic JavaScript features and the large number of package dependencies.
#6 TriEx: A Game-based Tri-View Framework for Explaining Internal Reasoning in Multi-Agent LLMs
Score: 30.9
Matched keywords: agent, large language model, llm, multi-agent, reasoning
Categories: cs.CL, cs.AI
Compressed abstract: Explainability for Large Language Model (LLM) agents is especially challenging in interactive, partially observable settings, where decisions depend on evolving beliefs and other agents. We present TriEx, a tri-view explainability framework that instruments sequential decision making with aligned artifacts: (i) structured first-person self-reasoning bound to an action, (ii) explicit second-person belief states about…
#7 Mol-Debate: Multi-Agent Debate Improves Structural Reasoning in Molecular Design
Score: 25.0
Matched keywords: agent, ai, fine-tuning, multi-agent, rag, reasoning
Categories: cs.AI, cs.LG
Compressed abstract: Text-guided molecular design is a key capability for AI-driven drug discovery, yet it remains challenging to map sequential natural-language instructions to non-linear molecular structures under strict chemical constraints. Most existing approaches, including RAG, CoT prompting, and fine-tuning or RL, emphasize a small set of ad-hoc reasoning perspectives implemented in a largely one-shot generation pipeline.
#8 Auditing and Controlling AI Agent Actions in Spreadsheets
Score: 36.7
Matched keywords: agent, ai, ai agent, ai agents, reasoning
Categories: cs.HC, cs.AI, cs.CE
Compressed abstract: Advances in AI agent capabilities have outpaced users' ability to meaningfully oversee their execution. AI agents can perform sophisticated, multi-step knowledge work autonomously from start to finish, yet this process remains effectively inaccessible during execution, often buried within large volumes of intermediate reasoning and outputs: by the time users receive the output, all underlying decisions have already…
#9 ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models
Score: 16.4
Matched keywords: benchmark, large language models, reasoning
Categories: cs.AI, cs.CL, cs.LG
Compressed abstract: We present ThermoQA, a benchmark of 293 open-ended engineering thermodynamics problems in three tiers: property lookups (110 Q), component analysis (101 Q), and full cycle analysis (82 Q). Ground truth is computed programmatically from CoolProp 7.2.0, covering water, R-134a, and variable-cp air.
#10 Separable Pathways for Causal Reasoning: How Architectural Scaffolding Enables Hypothesis-Space Restructuring in LLM Agents
Score: 28.2
Matched keywords: ai, ai agents, llm, reasoning
Categories: cs.AI, cs.LG
Compressed abstract: Causal discovery through experimentation and intervention is fundamental to robust problem solving. It requires not just updating beliefs within a fixed framework but revising the hypothesis space itself, a capacity current AI agents lack when evidence demands representations they have not previously constructed.