#1 Lean4 Agent: Formal Modeling and Verification for Agent Workflow and Trajectory
Score: 20.2
Matched keywords: agent, agent workflow, artificial intelligence, large language models
Categories: cs.AI, cs.LG, cs.LO, cs.SE
Compressed abstract: Equipping Large Language Models (LLMs) to execute reliable multi-step workflows has become a central challenge in artificial intelligence. Despite recent advances in LLMs' agentic capabilities, most agent systems still lack formal methods for specifying, verifying, and debugging their workflow and execution trajectories.
#2 ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning
Score: 26.4
Matched keywords: benchmark, large language model, llm, reasoning
Categories: cs.CL, cs.AI, cs.LG
Compressed abstract: Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs.
#3 DuMate-DeepResearch: An Auditable Multi-Agent System with Recursive Search and Rubric-Grounded Reasoning
Score: 23.2
Matched keywords: agent, multi-agent, reasoning
Categories: cs.AI
Compressed abstract: Deep Research (DR) has emerged as a new agentic paradigm to tackle complex, open-ended research tasks, demanding systems that can iteratively frame problems, acquire evidence, verify sources, and synthesize long-form reports. In practice, however, current DR systems are constrained by four interrelated limitations: long-horizon planning over an underspecified scope, the bottleneck of decomposing and scheduling such…
#4 A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning
Score: 19.2
Matched keywords: large language models, llm, reasoning
Categories: cs.LG, cs.AI
Compressed abstract: The emergence of "Aha moments" in large language models, particularly DeepSeek-R1-0120, has raised the question of whether these systems genuinely reason or merely imitate the appearance of reasoning. We conduct a comprehensive empirical comparison between model and human reasoning across all 30 problems from AIME 2025, exhaustively annotating 10,247 reasoning steps into five functional categories: Analysis, Inferen…
#5 MADRAG: Multi-Agent Debate with Retrieval-Augmented Generation for Training-Free Analytic Essay Scoring
Score: 40.6
Matched keywords: agent, llm, multi-agent, prompt, reasoning, retrieval-augmented
Categories: cs.MA, cs.CL
Compressed abstract: We present MADRAG, a training-free framework for analytic essay scoring that combines multi-agent reasoning with retrieval-augmented grounding. Unlike standard LLM-as-judge approaches, which are prone to bias and unstable scoring, MADRAG decomposes evaluation into an interactive process: an Advocate identifies strengths, a Skeptic critiques weaknesses, and a Judge aggregates their arguments into a final score.
#6 Hierarchical Certified Semantic Commitment for Byzantine-Resilient LLM-Agent Collaboration
Score: 21.2
Matched keywords: agent, benchmark, llm
Categories: cs.MA, cs.AI, cs.DC
Compressed abstract: Byzantine collaboration among large-language-model agents requires a finality-control primitive: given delivered stochastic, structured natural-language proposals, the protocol must decide whether the round supports a commit, what kind of commit, or a typed safe abort. Naive aggregation hides this choice behind a single verdict; classical Byzantine fault tolerance hides it behind byte-identity that LLM proposals do…
#7 From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning
Score: 19.2
Matched keywords: llm, reasoning
Categories: cs.CL
Compressed abstract: Reasoning prefixes shape the future trajectory of LLM problem solving, yet existing process reward models usually evaluate them through local step correctness. We argue that correctness is a useful but indirect proxy for the effect we ultimately care about: whether a prefix increases the probability of successful completion.
#8 CAF-Gen: A Multi-Agent System for Enriching Argumentation Structures
Score: 30.9
Matched keywords: agent, agent framework, alignment, multi-agent, reasoning
Categories: cs.CL, cs.AI
Compressed abstract: Formalizing complex reasoning from natural text is one of the central challenges in computational linguistics. It requires systems to understand not just keywords but also the context and complex reasoning embedded in a text.
#9 RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning
Score: 21.2
Matched keywords: fine-tuning, large language models, reasoning
Categories: cs.LG, cs.CL
Compressed abstract: Supervised fine-tuning (SFT) is a prevailing method for adapting large language models to reasoning tasks by imitating offline expert demonstrations, often treating a single expert trajectory as the target behavior. However, reasoning is not simple path imitation: rigidly following one demonstrated solution may overfit to surface forms and suppress the model's own reasoning distribution.
#10 The Fine-Tuning Trap: Evaluating Negative Transfer and the Role of PEFT in Sub-1B Mathematical Reasoning
Score: 21.0
Matched keywords: benchmark, fine-tuning, in-context learning, reasoning
Categories: cs.LG, cs.AI
Compressed abstract: Deploying Small Language Models (SLMs) on edge devices requires efficient fine-tuning strategies that adapt models to new tasks without degrading their general capabilities. In this study, we benchmark five sub-1B models (135M-1B) on mathematical reasoning tasks and uncover a critical vulnerability: Full Fine-Tuning (Full FT) actively harms performance in models under 300M parameters, often dropping accuracy bel…