#6 Demographic Fairness in Multimodal LLMs: A Benchmark of Gender and Ethnicity Bias in Face Verification

Score: 14.0 | Matched keywords: benchmark, large language models, multimodal, reasoning

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles the demographic fairness of multimodal large language models (MLLMs) used for face verification. Conventional embedding-based face recognition has reached high accuracy on standard benchmarks [7, 17, 26, 38] and is now used in applications ranging from border control and law enforcement to mobile device authentication. Fairness in this context means that the system should produce similar error rates across demographic groups at a given operating point; if one group has a substantially higher false match rate (FMR) or false non-match rate (FNMR) than another, the system is said to be biased against that group. Unlike embedding-based systems, which require specialised training on face identity labels, MLLMs approach the task through visual question answering, relying on the general visual and reasoning abilities acquired during pretraining.

The core proposal is a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. Verification accuracy is measured with the Equal Error Rate (EER) and the True Match Rate (TMR) at multiple operating points per demographic group, and demographic disparity is quantified with four FMR-based fairness metrics. MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities, yet their demographic fairness remains largely unexplored.

The empirical case is built around the result that FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks. The authors also note that the most accurate models are not necessarily the fairest, and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.

The central reported finding is that the bias patterns observed for MLLMs differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model. For context, face recognition systems have been shown to perform unevenly across demographic groups defined by attributes such as ethnicity and gender, and studies have repeatedly found that certain groups tend to have higher error rates than others.

Overall, the paper is most convincing where its claims are directly supported by the reported comparisons, but their scope should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks, and the bias patterns observed differ from those commonly reported for traditional face recognition.
Most important supporting result: We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.

Problem definition

Embedding-based face recognition has reached high levels of accuracy on standard benchmarks [7, 17, 26, 38], and it is now used in applications ranging from border control and law enforcement to mobile device authentication.
Fairness in this context means that the system should produce similar error rates across demographic groups at an operating point; if one group has a substantially higher false match rate or false non-match rate than another, the system is said to be biased against that group.
Unlike embedding-based systems, which require specialised training on face identity labels, MLLMs approach the task through visual question answering by relying on the general visual and reasoning abilities acquired during pretraining.
Face recognition systems have been shown to perform unevenly across demographic groups defined by attributes such as ethnicity and gender, and studies have repeatedly found that certain groups tend to have higher error rates than others.
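
The fairness definition above compares error rates across groups at one operating point. As a minimal sketch of that comparison, the snippet below computes the false match rate (FMR) and false non-match rate (FNMR) per group at a single shared threshold. The function name, the toy scores, and the group labels "A" and "B" are assumptions made for illustration; they are not taken from the paper.

```python
from collections import defaultdict

def group_error_rates(scores, labels, groups, threshold):
    """Return {group: (FMR, FNMR)} at one shared decision threshold.

    scores: similarity scores, higher means "more likely the same person"
    labels: 1 for a genuine (same-identity) pair, 0 for an impostor pair
    groups: demographic group label attached to each pair
    """
    counts = defaultdict(lambda: {"fm": 0, "imp": 0, "fnm": 0, "gen": 0})
    for score, label, group in zip(scores, labels, groups):
        c = counts[group]
        accepted = score >= threshold
        if label == 1:
            c["gen"] += 1
            c["fnm"] += int(not accepted)  # genuine pair rejected
        else:
            c["imp"] += 1
            c["fm"] += int(accepted)       # impostor pair accepted
    rates = {}
    for group, c in counts.items():
        fmr = c["fm"] / c["imp"] if c["imp"] else float("nan")
        fnmr = c["fnm"] / c["gen"] if c["gen"] else float("nan")
        rates[group] = (fmr, fnmr)
    return rates

# Toy data, invented for illustration only.
scores = [0.9, 0.4, 0.7, 0.2, 0.8, 0.3, 0.4, 0.1]
labels = [1, 1, 0, 0, 1, 1, 0, 0]
groups = ["A", "A", "A", "A", "B", "B", "B", "B"]
rates = group_error_rates(scores, labels, groups, threshold=0.5)
print(rates)  # {'A': (0.5, 0.5), 'B': (0.0, 0.5)}
```

In this toy case both groups have the same FNMR, but group A has a higher FMR than group B at the shared threshold, which is the kind of gap the definition would flag as bias against group A.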

Core idea & method

In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups.
We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics.
MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities.
However, the demographic fairness of these models remains largely unexplored.
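
The two accuracy measures named above can be sketched in a few lines. This is a simplified illustration, assuming similarity scores where higher means "same person" and using every distinct score as a candidate threshold; the function names and toy data are hypothetical, and the paper's exact evaluation code is not shown in this digest.

```python
def roc_points(scores, labels):
    """Return (threshold, FMR, TMR) for every distinct score used as a threshold."""
    n_gen = sum(1 for l in labels if l == 1)
    n_imp = len(labels) - n_gen
    points = []
    for t in sorted(set(scores)):
        tm = sum(1 for s, l in zip(scores, labels) if l == 1 and s >= t)
        fm = sum(1 for s, l in zip(scores, labels) if l == 0 and s >= t)
        points.append((t, fm / n_imp, tm / n_gen))
    return points

def equal_error_rate(scores, labels):
    """Approximate EER: average of FMR and FNMR where the two are closest."""
    best = min(roc_points(scores, labels), key=lambda p: abs(p[1] - (1 - p[2])))
    return (best[1] + (1 - best[2])) / 2

def tmr_at_fmr(scores, labels, target_fmr):
    """Highest TMR among thresholds whose FMR does not exceed target_fmr."""
    ok = [tmr for _, fmr, tmr in roc_points(scores, labels) if fmr <= target_fmr]
    return max(ok) if ok else 0.0

# Toy data, invented for illustration: four genuine pairs, four impostor pairs.
scores = [0.9, 0.8, 0.6, 0.3, 0.7, 0.4, 0.2, 0.1]
labels = [1, 1, 1, 1, 0, 0, 0, 0]
print(equal_error_rate(scores, labels))   # 0.25
print(tmr_at_fmr(scores, labels, 0.0))    # 0.5
print(tmr_at_fmr(scores, labels, 0.25))   # 0.75
```

To obtain per-group numbers, the same functions would be applied to the subset of pairs belonging to each demographic group.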

Actual findings

FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks, and the bias patterns observed differ from those commonly reported for traditional face recognition.
We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
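
The caveat about weak models appearing fair can be made concrete with a small worked example. The digest does not name the four FMR-based metrics used in the paper, so the max/min FMR ratio below is an illustrative stand-in, and all group names and numbers are hypothetical.

```python
def fmr_max_min_ratio(fmr_by_group):
    """Ratio of the worst to the best per-group FMR; 1.0 means no measured gap."""
    worst = max(fmr_by_group.values())
    best = min(fmr_by_group.values())
    if worst == best:
        return 1.0
    return float("inf") if best == 0 else worst / best

# Hypothetical per-group FMRs, invented for illustration only.
accurate_model = {"group_1": 0.01, "group_2": 0.03, "group_3": 0.02, "group_4": 0.01}
weak_model = {"group_1": 0.40, "group_2": 0.40, "group_3": 0.40, "group_4": 0.40}

print(fmr_max_min_ratio(accurate_model))  # greater than 1: a measurable gap
print(fmr_max_min_ratio(weak_model))      # 1.0: "fair", yet wrong 40% of the time
```

A ratio-style metric alone therefore cannot separate a model that is fair from one that is uniformly poor, which is why disparity is best read alongside the absolute error rates.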

How the conclusion was reached

Step 1 (proposed approach): In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups.
Step 2 (evaluation setup or comparison basis): Verification accuracy is measured with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and demographic disparity is quantified with four FMR-based fairness metrics.
Step 3 (main reported evidence): The results show that FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks.
Step 4 (additional supporting or qualifying result): We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.

Experimental setup & results

Embedding-based face recognition has reached high levels of accuracy on standard benchmarks [7, 17, 26, 38], and it is now used in applications ranging from border control and law enforcement to mobile device authentication.
We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
The results show that FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks.
Face recognition systems have been shown to perform unevenly across demographic groups defined by attributes such as ethnicity and gender, and studies have repeatedly found that certain groups tend to have higher error rates than others.
The general visual and reasoning ability of MLLMs opens up the possibility of using them for face verification: given two face images, the model can be prompted to judge whether they belong to the same person, and its response can be converted into a similarity score.
The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model.
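
The digest does not say how a model response is converted into a similarity score. One common option, shown here purely as an assumption, is to read the logits the model assigns to a "yes" and a "no" answer and normalise them with a two-way softmax. The prompt text, the function name, and the example logits are all hypothetical, and no real MLLM API is called.

```python
import math

# Hypothetical prompt; the paper's actual wording is not given in this digest.
PROMPT = "Do these two images show the same person? Answer yes or no."

def yes_probability(yes_logit, no_logit):
    """Two-way softmax over the 'yes' and 'no' answer logits.

    Returns a value in [0, 1] that can serve as a similarity score:
    values near 1 mean the model leans towards "same person".
    """
    m = max(yes_logit, no_logit)  # subtract the max for numerical stability
    e_yes = math.exp(yes_logit - m)
    e_no = math.exp(no_logit - m)
    return e_yes / (e_yes + e_no)

# Made-up logits for one image pair.
score = yes_probability(yes_logit=2.0, no_logit=-1.0)
print(round(score, 3))
```

A continuous score of this kind is what makes threshold-based measures such as EER and TMR at a fixed FMR computable; a bare yes/no answer would give only a single operating point.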

Limitations & risks

Detailed Summary (translated from Korean)

Read-like-fullpaper digest

This paper addresses the demographic fairness of multimodal large language models (MLLMs) used for face verification. Embedding-based face recognition has reached high levels of accuracy on standard benchmarks [7, 17, 26, 38] and is now used in applications ranging from border control and law enforcement to mobile device authentication. Fairness in this context means that the system should produce similar error rates across demographic groups at an operating point; if one group has a substantially higher false match rate or false non-match rate than another, the system is said to be biased against that group. Unlike embedding-based systems, which require specialised training on face identity labels, MLLMs approach the task through visual question answering by relying on the general visual and reasoning abilities acquired during pretraining. The core proposal is a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups. Verification accuracy is measured with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and demographic disparity is quantified with four FMR-based fairness metrics. MLLMs perform this task through visual prompting and rely on general visual and reasoning abilities, but their demographic fairness remains largely unexplored. The empirical case is built around the result that FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks. The most accurate models are not necessarily the fairest, and models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups. Face recognition systems have been shown to perform unevenly across demographic groups defined by attributes such as ethnicity and gender, and studies have repeatedly found that certain groups tend to have higher error rates than others. Overall, the paper is most convincing where its claims are directly supported by the reported comparisons, but their scope should still be read in light of the evaluation setup and stated limitations.

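Final takeaway

Main takeaway: FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks, and the bias patterns observed differ from those commonly reported for traditional face recognition.
Most important supporting result: We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.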

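Problem definition

Embedding-based face recognition has reached high levels of accuracy on standard benchmarks [7, 17, 26, 38], and it is now used in applications ranging from border control and law enforcement to mobile device authentication.
Fairness in this context means that the system should produce similar error rates across demographic groups at an operating point; if one group has a substantially higher false match rate or false non-match rate than another, the system is said to be biased against that group.
Unlike embedding-based systems, which require specialised training on face identity labels, MLLMs approach the task through visual question answering by relying on the general visual and reasoning abilities acquired during pretraining.
Face recognition systems have been shown to perform unevenly across demographic groups defined by attributes such as ethnicity and gender, and studies have repeatedly found that certain groups tend to have higher error rates than others.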

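Core idea & method

In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups.
We measure verification accuracy with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and we quantify demographic disparity with four FMR-based fairness metrics.
MLLMs approach this task through visual prompting and rely on general visual and reasoning abilities.
However, the demographic fairness of these models remains largely unexplored.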

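Actual findings

FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks, and the bias patterns observed differ from those commonly reported for traditional face recognition.
We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.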

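How the conclusion was reached

Step 1 (proposed approach): In this paper, we present a benchmarking study that evaluates nine open-source MLLMs from six model families, ranging from 2B to 8B parameters, on the IJB-C and RFW face verification protocols across four ethnicity groups and two gender groups.
Step 2 (evaluation setup or comparison basis): Verification accuracy is measured with the Equal Error Rate and True Match Rate at multiple operating points per demographic group, and demographic disparity is quantified with four FMR-based fairness metrics.
Step 3 (main reported evidence): The results show that FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks.
Step 4 (additional supporting or qualifying result): We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.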

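Experimental setup & results

Embedding-based face recognition has reached high levels of accuracy on standard benchmarks [7, 17, 26, 38], and it is now used in applications ranging from border control and law enforcement to mobile device authentication.
We also note that the most accurate models are not necessarily the fairest and that models with poor overall accuracy can appear fair simply because they produce uniformly high error rates across all demographic groups.
The results show that FaceLLM-8B, the only face-specialised model in the study, substantially outperforms general-purpose MLLMs on both benchmarks.
Face recognition systems have been shown to perform unevenly across demographic groups defined by attributes such as ethnicity and gender, and studies have repeatedly found that certain groups tend to have higher error rates than others.
The general visual and reasoning ability of MLLMs opens up the possibility of using them for face verification: given two face images, the model can be prompted to judge whether they belong to the same person, and its response can be converted into a similarity score.
The bias patterns we observe differ from those commonly reported for traditional face recognition, with different groups being most affected depending on the benchmark and the model.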