#1 Can We Trust a Black-box LLM? LLM Untrustworthy Boundary Detection via Bias-Diffusion and Multi-Agent Reinforcement Learning
Score: 26.9
Matched keywords: agent, diffusion, large language models, llm, multi-agent
Categories: cs.AI, cs.CL
Compressed abstract: Large Language Models (LLMs) have shown a high capability in answering questions on a diverse range of topics. However, these models sometimes produce biased, ideologized or incorrect responses, limiting their applications if there is no clear understanding of which topics their answers can be trusted.
#2 PaperOrchestra: A Multi-Agent Framework for Automated AI Research Paper Writing
Score: 31.2
Matched keywords: agent, agent framework, ai, benchmark, multi-agent
Categories: cs.AI, cs.LG, cs.MA
Compressed abstract: Synthesizing unstructured research materials into manuscripts is an essential yet under-explored challenge in AI-driven scientific discovery. Existing autonomous writers are rigidly coupled to specific experimental pipelines, and produce superficial literature reviews.
#3 Your LLM Agent Can Leak Your Data: Data Exfiltration via Backdoored Tool Use
Score: 28.2
Matched keywords: agent, large language model, llm, tool use
Categories: cs.CR, cs.AI
Compressed abstract: Tool-use large language model (LLM) agents are increasingly deployed to support sensitive workflows, relying on tool calls for retrieval, external API access, and session memory management. While prior research has examined various threats, the risk of systematic data exfiltration by backdoored agents remains underexplored.
#4 π^2: Structure-Originated Reasoning Data Improves Long-Context Reasoning Ability of Large Language Models
Score: 22.4
Matched keywords: fine-tuning, large language models, reasoning
Categories: cs.CL, cs.AI, cs.LG
Compressed abstract: We study a pipeline that curates reasoning data from initial structured data for improving long-context reasoning in large language models (LLMs). Our approach, π^2, constructs high-quality reasoning data through rigorous QA curation: 1) extracting and expanding tables from Wikipedia, 2) from the collected tables and relevant context, generating realistic and multi-hop analytical reasoning questions whose answers are…
#5 MMORF: A Multi-agent Framework for Designing Multi-objective Retrosynthesis Planning Systems
Score: 22.9
Matched keywords: agent, agent framework, benchmark, multi-agent
Categories: cs.AI, cs.CL
Compressed abstract: Multi-objective retrosynthesis planning is a critical chemistry task requiring dynamic balancing of quality, safety, and cost objectives. Language model-based multi-agent systems (MAS) offer a promising approach for this task: leveraging interactions of specialized agents to incorporate multiple objectives into retrosynthesis planning.
#6 FLARE: Agentic Coverage-Guided Fuzzing for LLM-Based Multi-Agent Systems
Score: 26.0
Matched keywords: agent, llm, multi-agent
Categories: cs.SE
Compressed abstract: Multi-Agent LLM Systems (MAS) have been adopted to automate complex human workflows by breaking down tasks into subtasks. However, due to the non-deterministic behavior of LLM agents and the intricate interactions between agents, MAS applications frequently encounter failures, including infinite loops and failed tool invocations.
#7 A Multi-Agent Approach to Validate and Refine LLM-Generated Personalized Math Problems
Score: 34.5
Matched keywords: agent, agent framework, large language models, llm, multi-agent
Categories: cs.CY
Compressed abstract: Students benefit from math problems contextualized to their interests. Large language models (LLMs) offer promise for efficient personalization at scale.
#8 Compiled AI: Deterministic Code Generation for LLM-Based Workflow Automation
Score: 31.4
Matched keywords: ai, code generation, large language models, llm, prompt, token
Categories: cs.SE, cs.AI
Compressed abstract: We study compiled AI, a paradigm in which large language models generate executable code artifacts during a compilation phase, after which workflows execute deterministically without further model invocation. This paradigm has antecedents in prior work on declarative pipeline optimization (DSPy) and hybrid neural-symbolic planning (LLM+P); our contribution is a systems-oriented study of its application to high-stake…
#9 Reasoning Through Chess: How Reasoning Evolves from Data Through Fine-Tuning and Reinforcement Learning
Score: 17.2
Matched keywords: fine-tuning, reasoning
Categories: cs.LG, cs.AI
Compressed abstract: How can you get a language model to reason in a task it natively struggles with? We study how reasoning evolves in a language model -- from supervised fine-tuning (SFT) to reinforcement learning (RL) -- by analyzing how a set of theoretically-inspired datasets impacts language model performance in chess.
#10 LLM Reasoning as Trajectories: Step-Specific Representation Geometry and Correctness Signals
Score: 25.4
Matched keywords: large language models, llm, reasoning
Categories: cs.CL, cs.AI, cs.LG
Compressed abstract: This work characterizes large language models' chain-of-thought generation as a structured trajectory through representation space. We show that mathematical reasoning traverses functionally ordered, step-specific subspaces that become increasingly separable with layer depth.