#10 Who Benefits from RAG? The Role of Exposure, Utility and Attribution Bias

Score: 20.4 | Matched keywords: large language models, llm, rag, retrieval-augmented

Detailed Summary (EN)

Full-paper-style digest

This paper studies query group fairness in Retrieval-Augmented Generation (RAG). RAG can improve the accuracy of responses generated by a Large Language Model (LLM) by supplementing the LLM with relevant documents retrieved from an external corpus. Query group fairness concerns whether a RAG system is systematically more accurate for queries that relate to particular groups within a fairness category, or whether the inclusion of the retriever component results in greater accuracy improvements for particular query groups. The authors argue that three key factors are likely to influence query group fairness in RAG systems: group exposure, i.e., the prevalence of documents from each group in the retrieved set, determined by the retriever; group utility, i.e., the extent to which documents from each group improve answer accuracy, reflecting retriever-generator interactions; and group attribution, i.e., the degree to which a generator consumes documents from each group when producing responses.

The starting point is a gap in the literature: while many studies have examined the effectiveness of RAG in terms of accuracy, little attention has been paid to beyond-accuracy aspects such as fairness [?]. Query group fairness in particular is an underexplored aspect of fairness in RAG systems.

The empirical case is built on the authors' own datasets. They first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy, or the accuracy improvement, of queries from that group. Accuracy improvement is defined as the difference in effectiveness between a RAG system and the LLM alone, and it could be smaller for a query from the unpopular group.

The central reported finding is that the query group fairness issue does exist in RAG systems: accuracy, and the accuracy improvement over the LLM alone, can differ across query groups, and the improvement could be smaller for a query from the unpopular group. The authors relate these differences to the exposure, utility, and attribution scores of each group.

The paper also reports that, across fairness categories, Pop emerges as the most challenging category for EA, consistently exhibiting high levels of unfairness in both settings and across both tasks for all topics. Overall, the paper is most convincing where its claims are directly supported by the reported comparisons; the scope of those claims should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: Using their own datasets, the authors first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy or accuracy improvement of queries from that group.
Most important supporting result: The accuracy improvement, defined as the difference in effectiveness between a RAG system and the LLM alone, could be smaller for a query from the unpopular group.
Important caution: Across fairness categories, Pop emerges as the most challenging category for EA, consistently exhibiting high levels of unfairness in both settings and across both tasks for all topics.

Problem definition

The paper argues that three key factors are likely to influence query group fairness in RAG systems: group exposure, i.e., the prevalence of documents from each group in the retrieved set, determined by the retriever; group utility, i.e., the extent to which documents from each group improve answer accuracy, reflecting retriever-generator interactions; and group attribution, i.e., the degree to which a generator consumes documents from each group when producing responses.
Query group fairness concerns whether a RAG system is systematically more accurate for queries that relate to particular groups within a fairness category, or when the inclusion of the retriever component in RAG results in greater accuracy improvements for particular query groups.
Retrieval-Augmented Generation (RAG) can improve the accuracy of generated responses from a Large Language Model (LLM) by supplementing the LLM with relevant documents that are retrieved from an external corpus.
Moreover, the accuracy improvements, defined as the difference in the effectiveness of a RAG system compared to the LLM alone, could be smaller for a query from the unpopular group.
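
As a rough illustration of the group exposure definition above, the following minimal sketch computes, for a single query, the share of retrieved documents that belong to each group. The input format (a list of group labels, one per retrieved document) and the function name `group_exposure` are assumptions made for this example; the paper's exact exposure score may be defined differently.

```python
from collections import Counter


def group_exposure(retrieved_groups):
    """Share of retrieved documents per group for one query.

    retrieved_groups: list of group labels, one per retrieved document
    (hypothetical input format, not taken from the paper).
    """
    if not retrieved_groups:
        return {}
    counts = Counter(retrieved_groups)
    total = len(retrieved_groups)
    return {group: n / total for group, n in counts.items()}


# Toy top-5 retrieval result for one query (illustrative labels only).
exposure = group_exposure(["urban", "urban", "rural", "urban", "urban"])
print(exposure)
```

In this toy case four of the five retrieved documents come from one group, which is the kind of imbalance the exposure factor is meant to capture.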

Core idea & method

While many studies have examined the effectiveness of RAG in terms of accuracy, little attention has been paid in the literature to beyond-accuracy aspects, such as fairness [?].
Query group fairness concerns whether a RAG system is systematically more accurate for queries that relate to particular groups within a fairness category, or whether the inclusion of the retriever component in RAG results in greater accuracy improvements for particular query groups.
RAG itself works by supplementing the LLM with relevant documents that are retrieved from an external corpus.
An underexplored aspect of fairness in RAG systems is query group fairness.

Actual findings

Using their own datasets, the authors first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy or accuracy improvement of queries from that group.
The accuracy improvement, defined as the difference in effectiveness between a RAG system and the LLM alone, could be smaller for a query from the unpopular group.
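
The accuracy improvement described above can be sketched as a per-group difference between accuracy with RAG and accuracy with the LLM alone. The record format, the group names, and the toy correctness flags below are all hypothetical and are not data from the paper; the sketch also assumes accuracy is a simple fraction of correct answers, which the paper may measure differently.

```python
from collections import defaultdict


def group_accuracy_improvement(records):
    """Per-group accuracy for LLM-only and RAG, plus their difference.

    records: iterable of (group, correct_llm_only, correct_with_rag)
    tuples, one per query (hypothetical format).
    """
    totals = defaultdict(lambda: [0, 0, 0])  # n queries, LLM correct, RAG correct
    for group, llm_ok, rag_ok in records:
        entry = totals[group]
        entry[0] += 1
        entry[1] += int(llm_ok)
        entry[2] += int(rag_ok)

    result = {}
    for group, (n, llm, rag) in totals.items():
        acc_llm = llm / n
        acc_rag = rag / n
        result[group] = {
            "llm": acc_llm,
            "rag": acc_rag,
            "improvement": acc_rag - acc_llm,
        }
    return result


# Toy records (illustrative only, not results from the paper).
records = [
    ("popular", True, True),
    ("popular", False, True),
    ("popular", True, True),
    ("popular", False, True),
    ("unpopular", False, False),
    ("unpopular", False, True),
    ("unpopular", True, True),
    ("unpopular", False, False),
]
stats = group_accuracy_improvement(records)
# Gap in improvement between the two groups; a non-zero gap is the
# kind of disparity that query group fairness is concerned with.
gap = stats["popular"]["improvement"] - stats["unpopular"]["improvement"]
print(stats, gap)
```

A non-zero `gap` on real evaluation data would indicate that retrieval helps one query group more than the other, which is the second form of unfairness named in the problem definition.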

How the conclusion was reached

Step 1 - Proposed approach: Many studies have examined the effectiveness of RAG in terms of accuracy, but little attention has been paid in the literature to beyond-accuracy aspects such as fairness [?]; the paper addresses query group fairness.
Step 2 - Evaluation setup or comparison basis: Accuracy improvement is defined as the difference in effectiveness between a RAG system and the LLM alone.
Step 3 - Main reported evidence: Using their own datasets, the authors first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy or accuracy improvement of queries from that group.
Step 4 - Additional supporting or qualifying result: The accuracy improvement over the LLM alone could be smaller for a query from the unpopular group.
Step 5 - Claim boundary / limitation: Across fairness categories, Pop emerges as the most challenging category for EA, consistently exhibiting high levels of unfairness in both settings and across both tasks for all topics.

Experimental setup & results

Using their own datasets, the authors first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy or accuracy improvement of queries from that group.
The accuracy improvement, defined as the difference in effectiveness between a RAG system and the LLM alone, could be smaller for a query from the unpopular group.
The three factors examined are group exposure, i.e., the prevalence of documents from each group in the retrieved set, determined by the retriever; group utility, i.e., the extent to which documents from each group improve answer accuracy, reflecting retriever-generator interactions; and group attribution, i.e., the degree to which a generator consumes documents from each group when producing responses.
In other words, since fewer demographic studies are about rural counties, compared to urban areas, the RAG system can be affected by either of the following two biases: (i) the generator may prioritise studies on larger cities when consuming the retrieved documents, or (ii) the rural counties may be unfairly underexposed to the generator due to fewer relevant documents in the retrieval results.
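
One way to tell the two biases above apart is to compare, per group, the share of retrieved documents (exposure) with the share of the generator's attribution mass (attribution). The sketch below assumes each retrieved document already carries a non-negative attribution weight; how such weights are obtained is not specified here, and the function name and data are hypothetical.

```python
def exposure_and_attribution(docs):
    """Per-group (exposure share, attribution share) for one query.

    docs: list of (group, attribution_weight) pairs for the retrieved set.
    The attribution weights are an assumed input, e.g. produced by some
    attribution method that is outside the scope of this sketch.
    """
    n_docs = len(docs)
    total_weight = sum(weight for _, weight in docs)
    shares = {}
    for group in {g for g, _ in docs}:
        exposure = sum(1 for g, _ in docs if g == group) / n_docs
        if total_weight > 0:
            attribution = sum(w for g, w in docs if g == group) / total_weight
        else:
            attribution = 0.0
        shares[group] = (exposure, attribution)
    return shares


# Toy retrieved set (illustrative only).
docs = [("urban", 0.5), ("urban", 0.25), ("urban", 0.125), ("rural", 0.125)]
shares = exposure_and_attribution(docs)
print(shares)
```

In this toy case the rural group holds a quarter of the retrieved documents but only an eighth of the attribution mass, so it would be both underexposed by the retriever and under-consumed by the generator relative to its exposure.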
[1] studied group-level fairness in answer attribution (i.e., attributing responses to the sources of information in the retrieved set) and showed that explicitly mentioning a document's author (human vs.

Limitations & risks

Across fairness categories, Pop emerges as the most challenging category for EA, consistently exhibiting high levels of unfairness in both settings and across both tasks for all topics.

Detailed Summary (KO)

Full-paper-style digest

The paper argues that three key factors are likely to influence query group fairness in RAG systems: group exposure, i.e., the prevalence of documents from each group in the retrieved set, determined by the retriever; group utility, i.e., the extent to which documents from each group improve answer accuracy, reflecting retriever-generator interactions; and group attribution, i.e., the degree to which a generator consumes documents from each group when producing responses. Query group fairness concerns whether a RAG system is systematically more accurate for queries that relate to particular groups within a fairness category, or whether the inclusion of the retriever component in RAG results in greater accuracy improvements for particular query groups. Retrieval-Augmented Generation (RAG) can improve the accuracy of responses generated by a Large Language Model (LLM) by supplementing the LLM with relevant documents retrieved from an external corpus. The starting point is that, while many studies have examined the effectiveness of RAG in terms of accuracy, little attention has been paid in the literature to beyond-accuracy aspects such as fairness [?], and query group fairness is an underexplored aspect of fairness in RAG systems. Using their own datasets, the authors first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy or accuracy improvement of queries from that group. The accuracy improvement, defined as the difference in effectiveness between a RAG system and the LLM alone, could be smaller for a query from the unpopular group. The paper also reports that, across fairness categories, Pop emerges as the most challenging category for EA, consistently exhibiting high levels of unfairness in both settings and across both tasks for all topics. Overall, the paper is most convincing where its claims are directly supported by the reported comparisons, but the scope of those claims should be read in light of the evaluation setup and stated limitations.

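Final takeaway

Main takeaway: Using their own datasets, the authors first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy or accuracy improvement of queries from that group.
Most important supporting result: The accuracy improvement, defined as the difference in effectiveness between a RAG system and the LLM alone, could be smaller for a query from the unpopular group.
Important caution: Across fairness categories, Pop emerges as the most challenging category for EA, consistently exhibiting high levels of unfairness in both settings and across both tasks for all topics.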

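Problem definition

The paper argues that three key factors are likely to influence query group fairness in RAG systems: group exposure, i.e., the prevalence of documents from each group in the retrieved set, determined by the retriever; group utility, i.e., the extent to which documents from each group improve answer accuracy, reflecting retriever-generator interactions; and group attribution, i.e., the degree to which a generator consumes documents from each group when producing responses.
Query group fairness concerns whether a RAG system is systematically more accurate for queries that relate to particular groups within a fairness category, or whether the inclusion of the retriever component in RAG results in greater accuracy improvements for particular query groups.
Retrieval-Augmented Generation (RAG) can improve the accuracy of responses generated by a Large Language Model (LLM) by supplementing the LLM with relevant documents retrieved from an external corpus.
Moreover, the accuracy improvement, defined as the difference in effectiveness between a RAG system and the LLM alone, could be smaller for a query from the unpopular group.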

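Core idea & method

While many studies have examined the effectiveness of RAG in terms of accuracy, little attention has been paid in the literature to beyond-accuracy aspects such as fairness [?].
Query group fairness concerns whether a RAG system is systematically more accurate for queries that relate to particular groups within a fairness category, or whether the inclusion of the retriever component in RAG results in greater accuracy improvements for particular query groups.
RAG itself works by supplementing the LLM with relevant documents that are retrieved from an external corpus.
An underexplored aspect of fairness in RAG systems is query group fairness.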

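Actual findings

Using their own datasets, the authors first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy or accuracy improvement of queries from that group.
The accuracy improvement, defined as the difference in effectiveness between a RAG system and the LLM alone, could be smaller for a query from the unpopular group.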

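How the conclusion was reached

Step 1 - Proposed approach: Many studies have examined the effectiveness of RAG in terms of accuracy, but little attention has been paid in the literature to beyond-accuracy aspects such as fairness [?]; the paper addresses query group fairness.
Step 2 - Evaluation setup or comparison basis: Accuracy improvement is defined as the difference in effectiveness between a RAG system and the LLM alone.
Step 3 - Main reported evidence: Using their own datasets, the authors first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy or accuracy improvement of queries from that group.
Step 4 - Additional supporting or qualifying result: The accuracy improvement over the LLM alone could be smaller for a query from the unpopular group.
Step 5 - Claim boundary / limitation: Across fairness categories, Pop emerges as the most challenging category for EA, consistently exhibiting high levels of unfairness in both settings and across both tasks for all topics.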

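Experimental setup & results

Using their own datasets, the authors first show that the query group fairness issue exists in RAG systems, then examine how the utility, exposure, and attribution scores of a group within a fairness category affect the accuracy or accuracy improvement of queries from that group.
The accuracy improvement, defined as the difference in effectiveness between a RAG system and the LLM alone, could be smaller for a query from the unpopular group.
The three factors examined are group exposure, i.e., the prevalence of documents from each group in the retrieved set, determined by the retriever; group utility, i.e., the extent to which documents from each group improve answer accuracy, reflecting retriever-generator interactions; and group attribution, i.e., the degree to which a generator consumes documents from each group when producing responses.
In other words, since fewer demographic studies are about rural counties than about urban areas, the RAG system can be affected by either of the following two biases: (i) the generator may prioritise studies on larger cities when consuming the retrieved documents, or (ii) the rural counties may be unfairly underexposed to the generator due to fewer relevant documents in the retrieval results.
[1] studied group-level fairness in answer attribution (i.e., attributing responses to the sources of information in the retrieved set) and showed that explicitly mentioning a document's author (human vs.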

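Limitations & risks

Across fairness categories, Pop emerges as the most challenging category for EA, consistently exhibiting high levels of unfairness in both settings and across both tasks for all topics.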