#1 It's Not the Capability: Harness Sensitivity Is Non-Monotone Across LLM Agent Tiers
Score: 37.4
Matched keywords: agent, benchmark, harness, llm, reasoning
Categories: cs.AI, cs.CL
Compressed abstract: A prevalent assumption in LLM agent deployment holds that more structured harnesses universally improve reliability, and that higher-capability models need proportionally less structural guidance -- together implying a monotone inverse relationship between model capability tier and optimal harness complexity. We test this hypothesis through a controlled 432-run experiment crossing six models across four capability t…
#2 UnityMAS-O: A General RL Optimization Framework for LLM-Based Multi-Agent Systems
Score: 34.7
Matched keywords: agent, code generation, llm, multi-agent, retrieval-augmented
Categories: cs.AI, cs.CL, cs.MA
Compressed abstract: LLM-based multi-agent systems decompose complex tasks into interacting roles, but most remain manually orchestrated by prompts, tools, and control rules, while agents are rarely optimized through a unified reinforcement learning interface. Existing RL post-training frameworks mainly target single-policy optimization and lack abstractions for user-defined multi-agent workflows, structured interaction, role-specific c…
#3 AutoDFT: A Closed-Loop Multi-Agent Framework for Autonomous DFT Calculations
Score: 39.7
Matched keywords: agent, agent framework, benchmark, llm, multi-agent, reasoning
Categories: cond-mat.mtrl-sci, cs.AI, cs.CE
Compressed abstract: Density functional theory (DFT) serves as the basis for computational discovery in materials science and chemistry, yet each calculation demands extensive human effort: adjusting algorithms when convergence stalls, revising plans when unexpected physics emerges, and inserting steps as intermediate results reshape the problem. Existing LLM-based agents automate only the initial planning stage, producing a full execut…
#4 ReMoE: Boosting Expert Reuse through Router Fine-Tuning in Memory-Constrained MoE LLM Inference
Score: 18.6
Matched keywords: fine-tuning, llm, token
Categories: cs.LG, cs.AI, cs.DC
Compressed abstract: Fine-grained Mixture-of-Experts (MoE) models sparsely activate only a subset of experts per token, reducing activated computation while maintaining high model capacity. However, in memory-constrained inference scenarios, only a small set of experts can be cached.
#5 BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning
Score: 20.6
Matched keywords: large language models, llm, prompt, reasoning
Categories: cs.LG, stat.ML
Compressed abstract: Reinforcement learning with verifiable rewards has become a standard recipe for improving the reasoning abilities of large language models. Existing algorithms face a tradeoff between computational efficiency and sample efficiency in value estimation and policy learning.
#6 FinHarness: An Inline Lifecycle Safety Harness for Finance LLM Agents
Score: 30.0
Matched keywords: agent, harness, llm, prompt
Categories: cs.CL
Compressed abstract: Finance LLM agents must simultaneously block prompt-induced unauthorized actions and approve legitimate multi-step business workflows. However, boundary filters often miss irreversible mid-trajectory tool calls, while post-hoc LLM judges perform auditing only after termination -- too late for intervention and at a computational cost that scales linearly with trace length.
#7 TADDLE: A Tool-Augmented Agent for Detecting Deficient LLM-Generated Peer Reviews
Score: 22.2
Matched keywords: agent, benchmark, llm
Categories: cs.AI
Compressed abstract: LLM-generated peer reviews are increasingly common at major venues, yet their deficiencies are hard to detect because they are uniformly fluent and well-structured. Existing work either classifies authorship without judging quality, or scores quality with features designed for human-written reviews; no prior system detects deficiencies in LLM-generated reviews at the level of individual defect types.
#8 Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks
Score: 19.0
Matched keywords: fine-tuning, large language models, llm
Categories: cs.LG, cs.CR
Compressed abstract: Recent defenses for safeguarding open-weight large language models (LLMs) are intended to prevent adversarial usage. Underlying these defenses is an assumption that new harmful behavior is learned through fine-tuning rather than elicited by jailbreaking the model.
#9 Reasoning, Code, or Both? How Large Language Models Handle Variations in Math Questions
Score: 18.4
Matched keywords: large language models, reasoning
Categories: cs.AI, cs.CL, cs.LG
Compressed abstract: Large Language Models (LLMs) achieve impressive accuracy on mathematical reasoning benchmarks, yet their performance drops when problems are modified with simple changes like different names or numbers. Code execution methods, which let models generate and run Python code instead of reasoning in natural language, have been proposed as a solution, but their effect on reasoning robustness (the ability to maintain accu…
#10 Alignment Tuning for Large Language Models: A Data-Centric Lens on Alignment Data Pipelines
Score: 12.2
Matched keywords: alignment, large language models, prompt
Categories: cs.CL, cs.AI
Compressed abstract: Much of the alignment tuning literature is organized around optimization objectives, while the construction of alignment data is often treated implicitly. In this survey, we adopt a data-centric perspective and reframe alignment tuning as a pipeline design problem.