#1 MARCH: Multi-Agent Reinforced Self-Check for LLM Hallucination

Score: 27.8 | Matched keywords: agent, alignment, large language models, llm, rag, retrieval-augmented

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles a limitation of standard outcome-level RL rewards: they lack the granularity required to supervise fine-grained factual consistency, leading to a fundamental misalignment between the scalar reward and the nuanced evidentiary requirements of RAG. This loss of granularity limits the ability to verify individual claims and provides insufficient signal to reinforce the meticulous evidentiary grounding required for data-intensive tasks. Consequently, unsupported intermediate claims can persist, as models are rewarded for correct final outcomes despite potentially hallucinated reasoning trajectories (Li & Ng, 2025).

The core proposal is Multi-Agent Reinforced self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry. The motivation is that hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. Existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, but they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. MARCH instead uses an information-asymmetric pipeline of three roles (Solver, Proposer, and Checker), all played by the policy model, and optimizes it via Multi-Agent Reinforcement Learning (MARL) to achieve robust factual alignment.

The empirical case is built around two results. First, an 8B-parameter LLM equipped with MARCH achieves performance parity with leading closed-source models on multiple hallucination benchmarks while delivering significant gains on general RAG-QA tasks. Second, MARCH substantially reduces hallucinations relative to its base model. The paper contrasts this with conventional verification under confirmation bias, where verifiers tend to prioritize validating internal coherence over objective grounding against source evidence and so endorse erroneous claims.

The central reported finding is that MARCH achieves a substantial reduction in hallucinations compared to its base model, without requiring additional human annotations or external fact-checking tools, which the authors take as evidence for their asymmetric collaborative reinforcement paradigm. They present these results as establishing a scalable and verifiable pathway toward trustworthy, agentic self-improvement for large language models.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: Experimentally, MARCH achieves a substantial reduction in hallucinations compared to its base model, without requiring additional human annotations or external fact-checking tools, thereby demonstrating the efficacy of the asymmetric collaborative reinforcement paradigm.

Problem definition

Standard outcome-level RL rewards lack the granularity required to supervise fine-grained factual consistency, leading to a fundamental misalignment between the scalar reward and the nuanced evidentiary requirements of RAG.
This loss of granularity limits the ability to verify individual claims, providing insufficient signals to reinforce the meticulous evidentiary grounding required for data-intensive tasks.
Consequently, unsupported intermediate claims can persist, as models are rewarded for correct final outcomes despite potentially hallucinated reasoning trajectories (Li & Ng, 2025).
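
The gap between an outcome-level reward and claim-level supervision can be illustrated with a small sketch. Everything here is hypothetical and chosen for illustration: the `Claim` type, the `supported` flag, both reward functions, and the toy trajectory are assumptions, not the paper's reward design.

```python
from dataclasses import dataclass


@dataclass
class Claim:
    text: str
    supported: bool  # hypothetical flag: is the claim backed by retrieved evidence?


def outcome_reward(final_answer: str, gold_answer: str) -> float:
    """Scalar outcome-level reward: inspects only the final answer."""
    same = final_answer.strip().lower() == gold_answer.strip().lower()
    return 1.0 if same else 0.0


def claim_level_reward(claims: list[Claim]) -> float:
    """Fraction of claims that are supported (an illustrative stand-in)."""
    if not claims:
        return 0.0
    return sum(c.supported for c in claims) / len(claims)


# Toy trajectory: the final answer is right, but one intermediate claim is not grounded.
trajectory = [
    Claim("Intermediate claim A (not found in the evidence)", supported=False),
    Claim("Intermediate claim B (found in the evidence)", supported=True),
]

r_outcome = outcome_reward("Yes", "yes")
r_claims = claim_level_reward(trajectory)
print(r_outcome, r_claims)
```

In this toy case the outcome reward is maximal even though half of the intermediate claims are unsupported, which is the kind of mismatch the problem definition describes.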

Core idea & method

Hallucination remains a critical bottleneck for large language models (LLMs), undermining their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems.
Existing hallucination detection methods employ LLM-as-a-judge to verify LLM outputs against retrieved evidence, but they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation.
To address this, the authors introduce Multi-Agent Reinforced self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry.
MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker.
The policy model plays all three roles in this information-asymmetric pipeline.
The pipeline is trained with Multi-Agent Reinforcement Learning (MARL), so that the agents co-evolve and optimize factual adherence.
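
A minimal sketch of how a Solver, Proposer, and Checker pipeline with one shared policy might be wired is shown below. The summary does not spell out what each role does, so the division of labor here is an assumption: the Solver answers from the documents, the Proposer splits the answer into claims, and the Checker judges each claim against the documents without seeing the Solver's answer. The prompt tags, function names, and `toy_policy` stub are hypothetical, and the MARL training step is omitted.

```python
from typing import Callable

# One shared policy plays every role: prompt in, text out.
Policy = Callable[[str], str]


def solver(policy: Policy, question: str, docs: str) -> str:
    """Assumed role: answer the question from the retrieved documents."""
    return policy(f"[SOLVER]\nDocs: {docs}\nQuestion: {question}")


def proposer(policy: Policy, answer: str) -> list[str]:
    """Assumed role: split the answer into individually checkable claims."""
    raw = policy(f"[PROPOSER]\nSplit into claims:\n{answer}")
    return [line.strip() for line in raw.splitlines() if line.strip()]


def checker(policy: Policy, claim: str, docs: str) -> bool:
    """Assumed role: judge one claim against the documents only.

    The Solver's answer is deliberately not passed in; this models the
    information asymmetry.
    """
    verdict = policy(f"[CHECKER]\nDocs: {docs}\nClaim: {claim}")
    return verdict.strip().lower().startswith("supported")


def run_pipeline(policy: Policy, question: str, docs: str):
    answer = solver(policy, question, docs)
    claims = proposer(policy, answer)
    verdicts = [checker(policy, c, docs) for c in claims]
    return answer, list(zip(claims, verdicts))


def toy_policy(prompt: str) -> str:
    """Stub standing in for an LLM so the sketch runs end to end."""
    if prompt.startswith("[SOLVER]"):
        return "The sky is blue. The moon is made of cheese."
    if prompt.startswith("[PROPOSER]"):
        return "The sky is blue.\nThe moon is made of cheese."
    # Checker stub: a claim counts as supported only if it appears in the docs.
    evidence, claim = prompt.split("\nClaim: ", 1)
    return "supported" if claim in evidence else "unsupported"


answer, results = run_pipeline(toy_policy, "What color is the sky?", "The sky is blue.")
print(results)
```

Per-claim verdicts like these could feed a finer-grained reward than a single outcome score, though how MARCH actually computes its rewards is not described in this summary.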

Actual findings

Experimentally, MARCH achieves a substantial reduction in hallucinations compared to its base model, without requiring additional human annotations or external fact-checking tools, thereby demonstrating the efficacy of the asymmetric collaborative reinforcement paradigm.

How the conclusion was reached

Step 1 - Proposed approach: The authors introduce Multi-Agent Reinforced self-Check for Hallucination (MARCH), a framework that enforces rigorous factual alignment by leveraging deliberate information asymmetry.
Step 2 - Evaluation setup or comparison basis: An 8B-parameter LLM equipped with MARCH is compared against its base model and leading closed-source models on multiple hallucination benchmarks and on general RAG-QA tasks.
Step 3 - Main reported evidence: MARCH achieves a substantial reduction in hallucinations compared to its base model, reaches parity with leading closed-source models on the hallucination benchmarks, and does so without additional human annotations or external fact-checking tools.

Experimental setup & results

Notably, an 8B-parameter LLM equipped with MARCH achieves performance parity with leading closed-source models on multiple hallucination benchmarks, while simultaneously delivering significant gains in general RAG-QA tasks.
Experimentally, MARCH achieves a substantial reduction in hallucinations compared to its base model, without requiring additional human annotations or external fact-checking tools, thereby demonstrating the efficacy of the asymmetric collaborative reinforcement paradigm.
By contrast, verifiers affected by confirmation bias tend to prioritize validating internal coherence over objective grounding against source evidence, leading to the unwarranted endorsement of erroneous claims.
These results establish a scalable and verifiable pathway toward trustworthy, agentic self-improvement for large language models.

Limitations & risks

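Detailed Summary (translated from Korean)

Read-like-fullpaper digest

The paper argues that standard outcome-level RL rewards lack the granularity needed to supervise fine-grained factual consistency, which creates a fundamental mismatch between the scalar reward and the nuanced evidentiary requirements of RAG. This loss of granularity limits the ability to verify individual claims, so the signal is too weak to reinforce the careful evidentiary grounding that data-intensive tasks require. As a result, unsupported intermediate claims can persist, because the model is rewarded for a correct final outcome even when the reasoning trajectory may be hallucinated (Li & Ng, 2025).

The core proposal is MARCH (Multi-Agent Reinforced Self-Check for Hallucination), a framework that enforces rigorous factual alignment by exploiting deliberate information asymmetry. Hallucination remains a critical bottleneck for large language models (LLMs) and undermines their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems. Existing hallucination detection methods use an LLM as a judge to verify LLM outputs against retrieved evidence, but they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation. In MARCH, the policy model plays all three roles of an information-asymmetric pipeline that is optimized via Multi-Agent Reinforcement Learning (MARL) to achieve robust factual alignment.

The empirical case centers on an 8B-parameter LLM equipped with MARCH, which achieves performance parity with leading closed-source models on multiple hallucination benchmarks while delivering significant gains on general RAG-QA tasks. MARCH also achieves a substantial reduction in hallucinations compared to its base model, without requiring additional human annotations or external fact-checking tools. By contrast, verifiers affected by confirmation bias tend to prioritize internal coherence over objective grounding against source evidence, and so wrongly endorse erroneous claims.

The central reported finding is this reduction in hallucinations relative to the base model, which the authors take as evidence for the asymmetric collaborative reinforcement paradigm. They present the results as establishing a scalable and verifiable pathway toward trustworthy, agentic self-improvement for large language models.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should be read in light of the evaluation setup and stated limitations.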

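Final takeaway

Main takeaway: Experimentally, MARCH achieves a substantial reduction in hallucinations compared to its base model, without requiring additional human annotations or external fact-checking tools, demonstrating the efficacy of the asymmetric collaborative reinforcement paradigm.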

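Problem definition

Standard outcome-level RL rewards lack the granularity needed to supervise fine-grained factual consistency, which creates a fundamental mismatch between the scalar reward and the nuanced evidentiary requirements of RAG.
This loss of granularity limits the ability to verify individual claims, so the signal is too weak to reinforce the careful evidentiary grounding that data-intensive tasks require.
As a result, unsupported intermediate claims can persist, because the model is rewarded for a correct final outcome even when the reasoning trajectory may be hallucinated (Li & Ng, 2025).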

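Core idea & method

Hallucination remains a critical bottleneck for large language models (LLMs) and undermines their reliability in real-world applications, especially in Retrieval-Augmented Generation (RAG) systems.
Existing hallucination detection methods use an LLM as a judge to verify LLM outputs against retrieved evidence, but they suffer from inherent confirmation bias, where the verifier inadvertently reproduces the errors of the original generation.
To address this, the authors introduce MARCH (Multi-Agent Reinforced Self-Check for Hallucination), a framework that enforces rigorous factual alignment by exploiting deliberate information asymmetry.
MARCH orchestrates a collaborative pipeline of three specialized agents: a Solver, a Proposer, and a Checker.
The policy model plays all three roles in this information-asymmetric pipeline.
The pipeline is trained with Multi-Agent Reinforcement Learning (MARL), so that the agents co-evolve and optimize factual adherence.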

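Actual findings

Experimentally, MARCH achieves a substantial reduction in hallucinations compared to its base model, without requiring additional human annotations or external fact-checking tools, demonstrating the efficacy of the asymmetric collaborative reinforcement paradigm.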

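How the conclusion was reached

Step 1 - Proposed approach: The authors introduce MARCH (Multi-Agent Reinforced Self-Check for Hallucination), a framework that enforces rigorous factual alignment by exploiting deliberate information asymmetry.
Step 2 - Evaluation setup or comparison basis: An 8B-parameter LLM equipped with MARCH is compared against its base model and leading closed-source models on multiple hallucination benchmarks and on general RAG-QA tasks.
Step 3 - Main reported evidence: MARCH achieves a substantial reduction in hallucinations compared to its base model, without requiring additional human annotations or external fact-checking tools.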

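Experimental setup & results

An 8B-parameter LLM equipped with MARCH achieves performance parity with leading closed-source models on multiple hallucination benchmarks while delivering significant gains on general RAG-QA tasks.
Experimentally, MARCH achieves a substantial reduction in hallucinations compared to its base model, without requiring additional human annotations or external fact-checking tools, demonstrating the efficacy of the asymmetric collaborative reinforcement paradigm.
By contrast, verifiers affected by confirmation bias tend to prioritize internal coherence over objective grounding against source evidence, and so wrongly endorse erroneous claims.
These results establish a scalable and verifiable pathway toward trustworthy, agentic self-improvement for large language models.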