#5 AuthorityBench: Benchmarking LLM Authority Perception for Reliable Retrieval-Augmented Generation

Score: 24.6 | Matched keywords: benchmark, large language models, llm, rag, retrieval-augmented

Detailed Summary (EN)

Read-like-fullpaper digest

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) with external knowledge but remains vulnerable to low-authority sources that can propagate misinformation. To address this, the paper introduces AuthorityBench, a comprehensive benchmark for evaluating LLM authority perception comprising three datasets: DomainAuth (10K web domains with PageRank-based authority), EntityAuth (22K entities with popularity-based authority), and RAGAuth (120 queries with documents of varying authority for downstream evaluation). Downstream experiments on RAG show that authority-guided filtering largely improves answer accuracy, which supports the practical importance of authority perception for reliable knowledge retrieval.

The core proposal is AuthorityBench, a benchmark for evaluating LLMs’ authority perception that contains three new datasets: DomainAuth, annotated with PageRank-based authority scores over 10K web domains; EntityAuth, covering 22K entities across three domains with popularity-based authority; and RAGAuth, 120 queries with documents of varying authority. The motivation is practical: in real retrieval processes for RAG systems the sources often comprise diverse corpora, where a source with low authority can lead the system to generate misinformation (Schlichtkrull, 2024), especially for queries in critical areas like health and politics. If the RAG system cannot discern the difference in authority between, for example, a medical institution and a lifestyle blog, and instead favors the latter simply because its prose is more fluent or persuasive, it can produce misleading or even harmful responses. For RQ1 the paper considers two types of authority: (1) source authority, a preference that reflects the perceived reliability and reputation of the information source; and (2) entity authority, where assertions attributed to recognized experts, institutions, or officeholders are generally considered more trustworthy than identical statements from unknown individuals.

The empirical case is built around how well LLM judgments agree with ground-truth authority. Results show that ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness. The experiments also expose model priors: for instance, LLMs inherently associate higher authority with web domains ending in “.gov”. The code and benchmark can be found at https://github.com/Trustworthy-Information-Access/AuthorityBench.

The central reported finding is that authority perception matters for RAG quality. When faced with conflicting information from sources of varying authority (e.g., a high-authority medical institution like Mayo Clinic vs. lower-authority lifestyle blogs), an LLM must correctly discern which source to trust to provide a reliable answer, and in the downstream RAG experiments authority-guided filtering largely improves answer accuracy.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

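Main takeaway: Authority-guided filtering largely improves RAG answer accuracy, so an LLM's ability to perceive authority has practical value for reliable knowledge retrieval. The code and benchmark can be found at https://github.com/Trustworthy-Information-Access/AuthorityBench.
Most important supporting result: ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness.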

Problem definition

Retrieval-Augmented Generation (RAG) enhances Large Language Models (LLMs) with external knowledge but remains vulnerable to low-authority sources that can propagate misinformation.
To address this, the paper introduces AuthorityBench, a comprehensive benchmark for evaluating LLM authority perception comprising three datasets: DomainAuth (10K web domains with PageRank-based authority), EntityAuth (22K entities with popularity-based authority), and RAGAuth (120 queries with documents of varying authority for downstream evaluation).
Downstream experiments on RAG show that authority-guided filtering largely improves answer accuracy, which supports the practical importance of authority perception for reliable knowledge retrieval.
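
The authority-guided filtering mentioned in this section can be illustrated with a minimal sketch. It is not the paper's implementation: the `RetrievedDoc` type, the `filter_by_authority` helper, the 0.5 threshold, the fallback rule, and the example scores are all assumptions made up for illustration. The idea is only that retrieved documents carry an authority score and that low-scoring ones are dropped before the generator sees them.

```python
from dataclasses import dataclass


@dataclass
class RetrievedDoc:
    source: str       # e.g. a web domain (hypothetical field)
    text: str
    authority: float  # assumed to lie in [0, 1], e.g. produced by an LLM judge


def filter_by_authority(docs, min_authority=0.5, keep_at_least=1):
    """Keep documents whose authority meets the threshold.

    Falls back to the top `keep_at_least` documents by authority so the
    generator is never handed an empty context.
    """
    ranked = sorted(docs, key=lambda d: d.authority, reverse=True)
    kept = [d for d in ranked if d.authority >= min_authority]
    if len(kept) < keep_at_least:
        kept = ranked[:keep_at_least]
    return kept


# Illustrative, made-up scores; they are not taken from the benchmark.
docs = [
    RetrievedDoc("example-lifestyle-blog.com", "A fluent but unsourced claim ...", 0.21),
    RetrievedDoc("mayoclinic.org", "Clinical guidance ...", 0.92),
]
context = filter_by_authority(docs)
print([d.source for d in context])
```

The fallback to the highest-authority document is a design choice of this sketch: it avoids an empty context when every retrieved document scores below the threshold.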

Core idea & method

RAG is used to mitigate outdated knowledge and hallucinations of Large Language Models (LLMs) by enabling models to incorporate real-time, domain-specific information from external knowledge bases.
In practical retrieval processes for RAG systems, the sources often comprise diverse corpora, where a source with low authority can lead the system to generate misinformation (Schlichtkrull, 2024), especially for queries in critical areas like health and politics.
If the RAG system cannot discern the difference in authority between, for example, a medical institution and a lifestyle blog, and instead favors the latter simply because its prose is more fluent or persuasive, it can produce misleading or even harmful responses.
The paper therefore introduces AuthorityBench, a comprehensive benchmark to evaluate LLMs’ authority perception, containing three new datasets: DomainAuth, annotated with PageRank-based authority scores over 10K web domains; EntityAuth, covering 22K entities across three domains with popularity-based authority; and RAGAuth, 120 queries with documents of varying authority for downstream evaluation.
For RQ1, the paper considers two types of authority:
(1) Source authority: This preference reflects the perceived reliability and reputation of the information source.
(2) Entity authority: Assertions attributed to recognized experts, institutions, or officeholders are generally considered more trustworthy than identical statements from unknown individuals.
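
DomainAuth's labels are described as PageRank-based. As background, the sketch below runs textbook power-iteration PageRank on a toy link graph. It is a generic illustration, not the paper's annotation pipeline: the `pagerank` function, the damping factor of 0.85, the iteration count, and the three example domains are assumptions.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank over a dict {node: [outgoing nodes]}."""
    nodes = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node receives the uniform "teleport" share first.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                share = damping * rank[node] / len(targets)
                for t in targets:
                    new_rank[t] += share
            else:
                # Dangling node (no outgoing links): spread its rank uniformly.
                share = damping * rank[node] / n
                for t in nodes:
                    new_rank[t] += share
        rank = new_rank
    return rank


# Toy graph with hypothetical domains: both other sites link to the agency.
toy_links = {
    "agency.example.gov": [],
    "news.example.com": ["agency.example.gov"],
    "blog.example.net": ["agency.example.gov", "news.example.com"],
}
scores = pagerank(toy_links)
for domain, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{domain}: {score:.3f}")
```

In this toy graph the most linked-to domain ends up with the highest score, which is the intuition behind using link structure as an authority signal.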

Actual findings

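ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness.
Authority-guided filtering largely improves answer accuracy in the downstream RAG experiments.
LLMs inherently associate higher authority with web domains ending in “.gov”.
The code and benchmark can be found at https://github.com/Trustworthy-Information-Access/AuthorityBench.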

How the conclusion was reached

Step 1 - Proposed approach: The paper introduces AuthorityBench, a comprehensive benchmark to evaluate LLMs’ authority perception, containing three new datasets: DomainAuth, annotated with PageRank-based authority scores over 10K web domains; EntityAuth, covering 22K entities across three domains with popularity-based authority; and RAGAuth, 120 queries with documents of varying authority.
Step 2 - Evaluation setup or comparison basis: LLM authority judgments are compared against ground-truth authority, and RAGAuth is used for downstream RAG evaluation.
Step 3 - Main reported evidence: ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness.
Step 4 - Additional supporting or qualifying result: Authority-guided filtering largely improves answer accuracy in downstream RAG experiments, and LLMs inherently associate higher authority with web domains ending in “.gov”.

Experimental setup & results

Results show that ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness.
When faced with conflicting information from sources of varying authority (e.g., a high-authority medical institution like Mayo Clinic vs. lower-authority lifestyle blogs), an LLM must correctly discern which source to trust to provide a reliable answer.
This underscores the crucial role of authority perception in ensuring the quality and reliability of RAG systems.
For instance, in the paper's experiments LLMs inherently associate higher authority with web domains ending in “.gov”.
The code and benchmark can be found at https://github.com/Trustworthy-Information-Access/AuthorityBench.
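
The results in this section are stated as correlation with ground-truth authority, but this digest does not name the statistic. As one plausible reading, the sketch below computes Spearman rank correlation in plain Python. The choice of Spearman, the helper names, and the example numbers are assumptions for illustration, not values or code from the paper.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)


def spearman(x, y):
    """Spearman correlation = Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up example: ground-truth authority vs. scores from an LLM judge.
ground_truth = [0.9, 0.7, 0.4, 0.1]
judged = [8, 9, 5, 2]
print(round(spearman(ground_truth, judged), 3))
```

A rank-based statistic is a natural fit when the judge's scores and the ground-truth scores are on different scales, since only the ordering is compared.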

Limitations & risks

Detailed Summary (KO, translated)

Read-like-fullpaper digest

RAG (Retrieval-Augmented Generation) enhances LLMs (Large Language Models) with external knowledge but remains vulnerable to low-authority sources that can spread misinformation. To address this, the paper introduces AuthorityBench, a comprehensive benchmark for evaluating LLM authority perception made up of three datasets: DomainAuth (10,000 web domains with PageRank-based authority), EntityAuth (22,000 entities with popularity-based authority), and RAGAuth (120 queries with documents of varying authority for downstream evaluation). Downstream experiments on RAG show that authority-based filtering largely improves answer accuracy, supporting the practical importance of authority perception for reliable knowledge retrieval.

The core proposal is the benchmark itself. In the real retrieval processes of RAG systems, sources often consist of diverse corpora, and a low-authority source can lead the system to generate misinformation (Schlichtkrull, 2024), especially for queries in critical areas such as health and politics. If RAG fails to discern the difference in authority and prefers the less authoritative source simply because its prose is more fluent or persuasive, it can produce misleading or even harmful responses. One of the authority types considered is entity authority: assertions by recognized experts, institutions, or officeholders are generally regarded as more trustworthy than identical statements from unknown individuals.

On the empirical side, the experiments show, for example, that LLMs inherently associate higher authority with web domains ending in ".gov". According to the results, ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness. The code and benchmark can be found at https://github.com/Trustworthy-Information-Access/AuthorityBench. When faced with conflicting information from sources of varying authority (e.g., an authoritative medical institution such as Mayo Clinic vs. low-authority lifestyle blogs), an LLM must correctly identify which source to trust in order to give a reliable answer.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should be read in light of the evaluation setup and stated limitations.

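Final takeaway

Main takeaway: Authority-based filtering largely improves RAG answer accuracy, so LLM authority perception has practical value for reliable knowledge retrieval. The code and benchmark can be found at https://github.com/Trustworthy-Information-Access/AuthorityBench.
Most important supporting result: ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness.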

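Problem definition

RAG (Retrieval-Augmented Generation) enhances LLMs (Large Language Models) with external knowledge but remains vulnerable to low-authority sources that can spread misinformation.
To address this, the paper introduces AuthorityBench, a comprehensive benchmark for evaluating LLM authority perception made up of three datasets: DomainAuth (10,000 web domains with PageRank-based authority), EntityAuth (22,000 entities with popularity-based authority), and RAGAuth (120 queries with documents of varying authority for downstream evaluation).
Downstream experiments on RAG show that authority-based filtering largely improves answer accuracy, supporting the practical importance of authority perception for reliable knowledge retrieval.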

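Core idea & method

RAG mitigates outdated knowledge and hallucinations in LLMs (Large Language Models) by letting models incorporate real-time, domain-specific information from external knowledge bases.
In the real retrieval processes of RAG systems, sources often consist of diverse corpora, and a low-authority source can lead the system to generate misinformation (Schlichtkrull, 2024), especially for queries in critical areas such as health and politics.
If RAG fails to discern the difference in authority and prefers the less authoritative source simply because its prose is more fluent or persuasive, it can produce misleading or even harmful responses.
The paper therefore introduces AuthorityBench, a comprehensive benchmark for evaluating LLMs' authority perception that contains three new datasets: DomainAuth, annotated with PageRank-based authority scores over 10,000 web domains; EntityAuth, covering 22,000 entities across three domains with popularity-based authority; and RAGAuth, 120 queries with documents of varying authority.
For RQ1, two types of authority are considered:
(1) Source authority: This preference reflects the perceived reliability and reputation of the information source.
(2) Entity authority: Assertions by recognized experts, institutions, or officeholders are generally regarded as more trustworthy than identical statements from unknown individuals.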

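Actual findings

ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness.
Authority-based filtering largely improves answer accuracy in the downstream RAG experiments.
The code and benchmark can be found at https://github.com/Trustworthy-Information-Access/AuthorityBench.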

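How the conclusion was reached

Step 1 - Proposed approach: The paper introduces AuthorityBench, a comprehensive benchmark for evaluating LLMs' authority perception that contains three new datasets: DomainAuth, annotated with PageRank-based authority scores over 10,000 web domains; EntityAuth, covering 22,000 entities across three domains with popularity-based authority; and RAGAuth, 120 queries with documents of varying authority.
Step 2 - Evaluation setup or comparison basis: LLM authority judgments are compared against ground-truth authority, and RAGAuth is used for downstream RAG evaluation.
Step 3 - Main reported evidence: ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness.
Step 4 - Additional supporting or qualifying result: Authority-based filtering largely improves answer accuracy in downstream RAG experiments, and LLMs inherently associate higher authority with web domains ending in ".gov".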

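Experimental setup & results

According to the results, ListJudge and PairJudge with PointScore output achieve the strongest correlation with ground-truth authority, while ListJudge offers optimal cost-effectiveness.
When faced with conflicting information from sources of varying authority (e.g., an authoritative medical institution such as Mayo Clinic vs. low-authority lifestyle blogs), an LLM must correctly identify which source to trust in order to give a reliable answer.
This highlights the important role of authority perception in ensuring the quality and reliability of RAG systems.
For example, in the paper's experiments LLMs inherently associate higher authority with web domains ending in ".gov".
The code and benchmark can be found at https://github.com/Trustworthy-Information-Access/AuthorityBench.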