#1 Harness as an Asset: Enforcing Determinism via the Convergent AI Agent Framework (CAAF)
Score: 55.0
Matched keywords: agent, agent framework, ai, ai agent, benchmark, harness, large language models, multi-agent, prompt
Categories: cs.AI, cs.LG
Compressed abstract: Large Language Models (LLMs) produce a controllability gap in safety-critical engineering: even low rates of undetected constraint violations render a system undeployable. Current orchestration paradigms suffer from sycophantic compliance, context attention decay [Liu et al., 2024], and stochastic oscillation during self-correction [Huang et al., 2024].
#2 HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
Score: 37.2
Matched keywords: agent, agent framework, benchmark, code generation, large language models, multi-agent, repository-level
Categories: cs.CL
Compressed abstract: Recent advances in large language models have highlighted their potential to automate computational research, particularly reproducing experimental results. However, existing approaches still use fixed sequential agent pipelines with weak global coordination, which limits their robustness and overall performance.
#3 Graph-of-Agents: A Graph-based Framework for Multi-Agent LLM Collaboration
Score: 27.2
Matched keywords: agent, llm, multi-agent
Categories: cs.AI
Compressed abstract: With an ever-growing zoo of LLMs and benchmarks, the need to orchestrate multiple models for improved task performance has never been more pressing. While frameworks like Mixture-of-Agents (MoA) attempt to coordinate LLMs, they often fall short in terms of (1) selecting relevant agents, (2) facilitating effective intra-agent communication, and (3) integrating responses efficiently.
#4 AeroRAG: Structured Multimodal Retrieval-Augmented LLM for Fine-Grained Aerial Visual Reasoning
Score: 33.2
Matched keywords: benchmark, large language model, large language models, llm, multimodal, reasoning, retrieval-augmented, token
Categories: cs.CV
Compressed abstract: Despite recent progress in multimodal large language models (MLLMs), reliable visual question answering in aerial scenes remains challenging. In such scenes, task-critical evidence is often carried by small objects, explicit quantities, coarse locations, and inter-object relations, whereas conventional dense visual-token representations are not well aligned with these structured semantics.
#5 Do LLM-derived graph priors improve multi-agent coordination?
Score: 33.5
Matched keywords: agent, ai, benchmark, large language models, llm, multi-agent
Categories: cs.LG
Compressed abstract: Multi-agent reinforcement learning (MARL) is crucial for AI systems that operate collaboratively in distributed and adversarial settings, particularly in multi-domain operations (MDO). A central challenge in cooperative MARL is determining how agents should coordinate: existing approaches must either hand-specify graph topology, rely on proximity-based heuristics, or learn structure entirely from environment interac…
#6 MASS-RAG: Multi-Agent Synthesis Retrieval-Augmented Generation
Score: 34.9
Matched keywords: agent, large language models, multi-agent, rag, reasoning, retrieval-augmented
Categories: cs.CL
Compressed abstract: Large language models (LLMs) are widely used in retrieval-augmented generation (RAG) to incorporate external knowledge at inference time. However, when retrieved contexts are noisy, incomplete, or heterogeneous, a single generation process often struggles to reconcile evidence effectively.
#7 Beyond Meta-Reasoning: Metacognitive Consolidation for Self-Improving LLM Reasoning
Score: 23.2
Matched keywords: large language models, llm, reasoning
Categories: cs.AI
Compressed abstract: Large language models (LLMs) have demonstrated strong reasoning capabilities, and as existing approaches for enhancing LLM reasoning continue to mature, increasing attention has shifted toward meta-reasoning as a promising direction for further improvement. However, most existing meta-reasoning methods remain episodic: they focus on executing complex meta-reasoning routines within individual instances, but ignore th…
#8 LLM-AUG: Robust Wireless Data Augmentation with In-Context Learning in Large Language Models
Score: 33.6
Matched keywords: deep learning, diffusion, in-context learning, large language models, llm, machine learning
Categories: cs.LG
Compressed abstract: Data scarcity remains a fundamental bottleneck in applying deep learning to wireless communication problems, particularly in scenarios where collecting labeled Radio Frequency (RF) data is expensive, time-consuming, or operationally constrained. This paper proposes LLM-AUG, a data augmentation framework that leverages in-context learning in large language models (LLMs) to generate synthetic training samples directly…
#9 Evaluating Multi-Hop Reasoning in RAG Systems: A Comparison of LLM-Based Retriever Evaluation Strategies
Score: 34.4
Matched keywords: large language models, llm, rag, reasoning, retrieval-augmented
Categories: cs.IR, cs.AI
Compressed abstract: Retrieval-augmented generation (RAG) enhances large language models (LLMs) with external knowledge to answer questions more accurately. However, research on evaluating RAG systems, particularly the retriever component, remains limited, as most existing work focuses on single-context retrieval rather than multi-hop queries, where individual contexts may appear irrelevant in isolation but are essential when combined.
#10 Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification
Score: 16.4
Matched keywords: llm, reasoning
Categories: cs.CL, cs.AI, cs.LG, cs.PL
Compressed abstract: We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum.