#1 COAgents: Multi-Agent Framework to Learn and Navigate Routing Problems Search Space
Score: 25.2
Matched keywords: agent, agent framework, multi-agent
Categories: cs.AI
Compressed abstract: Although Vehicle Routing Problems (VRP) are essential to many real-world systems, they remain computationally intractable at scale due to their combinatorial complexity. Traditional heuristics rely on handcrafted rules for local improvements and occasional jumps to escape local minima, but often struggle to generalize across diverse instances.
#2 Parallel LLM Reasoning for Bias-Resilient, Robust Conceptual Abstraction
Score: 20.4
Matched keywords: large language models, llm, reasoning
Categories: cs.CL, cs.AI, cs.LG
Compressed abstract: Large language models (LLMs) have been increasingly used to analyze text. However, they are often limited in contextual reasoning when analyzing long documents.
#3 Heartbeat-Bound Hierarchical Credentials: Cryptographic Revocation for AI Agent Swarms
Score: 30.0
Matched keywords: agent, ai, ai agent, ai agents, llm, prompt
Categories: cs.CR, cs.AI, cs.MA
Compressed abstract: Autonomous AI agents that spawn sub-agent swarms create a safety gap: existing credential revocation mechanisms (OAuth 2.0 introspection, OCSP, and W3C Status Lists) require network connectivity to a central authority, leaving "zombie agents" executing privileged operations for minutes to hours after operator shutdown. We present Heartbeat-Bound Hierarchical Credentials (HBHC), a cryptographic protocol that binds…
#4 What Twelve LLM Agent Benchmark Papers Disclose About Themselves: A Pilot Audit and an Open Scoring Schema
Score: 30.0
Matched keywords: agent, benchmark, harness, llm
Categories: cs.LG
Compressed abstract: We read twelve well-known LLM agent benchmark papers and recorded, dimension by dimension, what each paper actually says about how its evaluation was run. The motivation came from a familiar frustration: two papers will report results on the same benchmark with the same model name and disagree, and you cannot tell whether the cause is the scaffold, the sampling settings, the subset, or the evaluator version.
#5 Reasoning-Trace Collapse: Evaluating the Loss of Explicit Reasoning During Fine-Tuning
Score: 16.0
Matched keywords: fine-tuning, reasoning
Categories: cs.LG
Compressed abstract: Explicit reasoning models are trained to produce intermediate reasoning traces before final answers, but downstream fine-tuning is often performed on ordinary instruction-response data that contains no such traces. We show that this mismatch can induce reasoning-trace collapse: a fine-tuned model continues to produce plausible final answers while losing the structurally valid explicit reasoning traces that made it a…
#6 Multi-agent Collaboration with State Management
Score: 18.4
Matched keywords: agent, multi-agent
Categories: cs.MA, cs.AI, cs.CL, cs.LG, cs.SE
Compressed abstract: Recent advances in multi-agent systems have shown great potential for solving complex tasks. However, when multiple agents edit a shared codebase concurrently, their changes can silently conflict and inconsistent views lead to integration failures.
#7 Terminal-World: Scaling Terminal-Agent Environments via Agent Skills
Score: 14.4
Matched keywords: agent, large language models
Categories: cs.CL, cs.AI
Compressed abstract: Terminal agents extend Large Language Models with the ability to execute tasks directly in command-line environments, but their progress is bottlenecked by the scarcity of high-quality training data. Existing approaches bootstrap from partial sources such as human-defined seeds or GitHub repositories to instantiate one component and then complete the rest, producing tasks confined to narrow seed distributions, envir…
#8 Tracing the ongoing emergence of human-like reasoning in Large Language Models
Score: 22.4
Matched keywords: large language models, llm, reasoning
Categories: cs.CL, cs.AI
Compressed abstract: Humans effortlessly go beyond literal meanings: "If you mow the lawn, I will give you fifty dollars" is typically understood as implying that the speaker will pay only if the lawn is mowed, whereas "If you are hungry, there is pizza in the oven" implies that pizza is available regardless of the hearer's hunger. Large Language Models (LLMs) show human-like performance on many tasks, yet it remains unclear whether they…
#9 Multi-Agent Reinforcement Learning for Safe Autonomous Driving Under Pedestrian Behavioral Uncertainty
Score: 17.2
Matched keywords: agent, multi-agent
Categories: cs.LG, cs.AI, cs.HC, cs.RO
Compressed abstract: Simulation-based testing of self-driving cars (SDCs) typically relies on scripted or simplified pedestrian models that do not capture the heterogeneity and uncertainty of real human crossing behavior. This limits the realism of safety assessments, especially in scenarios involving jaywalking, which is governed by latent personality traits that the vehicle cannot observe.
#10 AgentAtlas: Beyond Outcome Leaderboards for LLM Agents
Score: 17.0
Matched keywords: agent, benchmark, large language model, llm, prompt
Categories: cs.AI, cs.CL, cs.LG, cs.SE
Compressed abstract: Large language model agents now act on codebases, browsers, operating systems, calendars, files, and tool ecosystems, but the benchmarks used to evaluate them are fragmented: each emphasizes a different unit of measurement (final task success, tool-call validity, repeated-pass consistency, trajectory safety, or attack robustness). A line of 2024-2025 work has converged on the diagnosis that a single accuracy column…