#9 Evaluating the Reliability and Fidelity of Automated Judgment Systems of Large Language Models

Score: 25.3 | Matched keywords: large language model, large language models, llm, machine learning, prompt

Detailed Summary (EN)

Problem definition

The widespread deployment of LLMs as chatbots [36], as virtual assistants in phones and cars [6], and as controllers of cyber-physical systems (CPSs) [48], e.g., in robotics [33], makes them a powerful and ubiquitous tool.
This raises the need to assure their reliability, particularly given the growing interest of malicious actors in exploiting vulnerabilities introduced by the use of LLMs.
Because LLMs are non-deterministic, formal verification is infeasible, and so are fixed testing schemes, whose static rule sets cannot handle free-form text outputs [53].

Core idea & method

An LLM as judge evaluates the quality of victim machine learning (ML) models, specifically LLMs, by analyzing their outputs.
An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis.
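The pairing of one model with one judge prompt can be sketched as follows. This is a minimal illustration, not the paper's implementation: the prompt wording, the `LLMJudge` class, the `flags` method, and the keyword-based `stub_model` (which stands in for a real LLM call) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical judge prompt; the paper's actual prompts are not reproduced here.
JUDGE_PROMPT = (
    "You are a strict reviewer. Criteria: the text must not contain insults.\n"
    "Text to review:\n{victim_output}\n"
    "Answer with exactly one word: VIOLATION or OK."
)


@dataclass(frozen=True)
class LLMJudge:
    """One model plus one judge prompt that carries the analysis criteria."""

    model: Callable[[str], str]  # maps a prompt string to the model's reply
    prompt_template: str

    def flags(self, victim_output: str) -> bool:
        """Return True if the judge reports a violation in the victim output."""
        prompt = self.prompt_template.format(victim_output=victim_output)
        reply = self.model(prompt)
        return reply.strip().upper().startswith("VIOLATION")


def stub_model(prompt: str) -> str:
    """Toy stand-in for an LLM: inspects only the reviewed text for a keyword."""
    reviewed = prompt.split("Text to review:\n", 1)[1].rsplit("\nAnswer", 1)[0]
    return "VIOLATION" if "idiot" in reviewed.lower() else "OK"


judge = LLMJudge(model=stub_model, prompt_template=JUDGE_PROMPT)
print(judge.flags("You are an idiot."))
print(judge.flags("Have a nice day."))
```

Keeping the model and the prompt as separate fields makes it straightforward to swap either one while holding the other fixed.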
The resulting automation scales up the complex analysis of the victim models' free-form text outputs by delivering faster and more consistent judgments than human reviewers.
Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases.

Experimental setup & results

As a comparatively new technique, LLMs as judges have not yet been thoroughly investigated for their reliability and their agreement with human judgment.
Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs.
We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors.
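One way to picture such an evaluation is a grid over judge models and judge prompts, with each combination scored by its agreement with human labels. The sketch below is illustrative only: the choice of Cohen's kappa as the agreement measure, the function names, the toy `run_judge` stand-in, and the four-sample dataset are assumptions, not details taken from the paper.

```python
from itertools import product
from typing import Callable, Dict, List, Sequence, Tuple


def cohen_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Chance-corrected agreement between two binary label sequences."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    p_a, p_b = sum(a) / n, sum(b) / n
    expected = p_a * p_b + (1 - p_a) * (1 - p_b)
    if expected == 1.0:  # both raters always give the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


def evaluate_grid(
    models: List[str],
    prompts: List[str],
    dataset: List[Tuple[str, bool]],
    run_judge: Callable[[str, str, str], bool],
) -> Dict[Tuple[str, str], float]:
    """Score every (model, prompt) pair by its agreement with human labels."""
    texts = [text for text, _ in dataset]
    human = [label for _, label in dataset]
    scores = {}
    for model, prompt in product(models, prompts):
        verdicts = [run_judge(model, prompt, text) for text in texts]
        scores[(model, prompt)] = cohen_kappa(verdicts, human)
    return scores


def run_judge(model: str, prompt: str, text: str) -> bool:
    """Toy stand-in for a real judge call; 'lenient' never flags anything."""
    return model == "strict" and "idiot" in text.lower()


# Hypothetical labeled samples: (victim output, human says it is undesired).
dataset = [("you idiot", True), ("hello", False),
           ("what an idiot", True), ("thanks", False)]
scores = evaluate_grid(["strict", "lenient"], ["prompt_a"], dataset, run_judge)
print(scores)
```

A chance-corrected measure is used here because a judge that never flags anything can still reach high raw accuracy on a dataset with few violations.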

Limitations & risks

[3] employ a peer-review concept to find the best of multiple results by different judges.
[34] used the idea of a second-level judge to improve an initial judgment by the same LLM.
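The second-level judge idea can be sketched as a two-pass call to the same model: a first pass produces a verdict, and a second pass reviews that verdict and may overturn it. This is a rough illustration under stated assumptions, not the method of [34] or of the paper: the prompt texts, the `two_level_judgment` function, and the keyword-based `stub_model` are all hypothetical.

```python
from typing import Callable

# Hypothetical prompts for the first and second judgment pass.
FIRST_PROMPT = "Does the text contain insults? Answer VIOLATION or OK.\nText:\n{text}"
SECOND_PROMPT = (
    "A first review of the text below returned: {verdict}.\n"
    "Re-check it against the same criteria and answer CONFIRM or REVISE.\n"
    "Text:\n{text}"
)


def two_level_judgment(model: Callable[[str], str], text: str) -> bool:
    """Return the final verdict (True = violation) after a second-level review."""
    first_reply = model(FIRST_PROMPT.format(text=text))
    first = first_reply.strip().upper().startswith("VIOLATION")
    verdict = "VIOLATION" if first else "OK"
    second_reply = model(SECOND_PROMPT.format(verdict=verdict, text=text))
    # The second pass either confirms the first verdict or flips it.
    if second_reply.strip().upper().startswith("REVISE"):
        return not first
    return first


def stub_model(prompt: str) -> str:
    """Toy stand-in for an LLM: naive on the first pass, more careful on the second."""
    text = prompt.split("Text:\n", 1)[1].lower()
    if prompt.startswith("A first review"):
        first_said_violation = "returned: VIOLATION" in prompt
        actually_bad = "idiot" in text and "not an idiot" not in text
        return "CONFIRM" if first_said_violation == actually_bad else "REVISE"
    return "VIOLATION" if "idiot" in text else "OK"


print(two_level_judgment(stub_model, "You are not an idiot."))
```

In this toy setup the first pass is fooled by the negated phrase, and the second pass corrects it, which is the kind of improvement a second-level judge is meant to provide.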
While we cannot include a comparative evaluation for all ...

Table 1: Overview of evaluation datasets. Dataset type is either a victim's undesired (u) output or its correctness (c).
Columns: Dataset | Size | Type | Judge Task ("Detecting content that ...")

Read-like-fullpaper digest

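This paper addresses the need to assure the reliability of LLMs, which are widely deployed as chatbots [36], as virtual assistants in phones and cars [6], and as controllers of cyber-physical systems (CPSs) [48]. The core method is the LLM as judge, which evaluates the quality of victim machine learning (ML) models, specifically LLMs, by analyzing their outputs. The motivation is that automated judgments of the victim models' free-form text outputs are faster and more consistent than those of human reviewers; to test this, the paper evaluates 37 differently sized conversational LLMs in combination with 5 judge prompts, a second-level judge, and 5 models fine-tuned for the task.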

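Detailed Summary (KO)

Problem definition

LLMs are widely deployed as chatbots [36], as virtual assistants in phones and cars [6], and as controllers of cyber-physical systems (CPSs) [48], e.g., in robotics [33], which makes them a powerful and ubiquitous tool.
This raises the need to assure their reliability, particularly given the growing interest of malicious actors in exploiting vulnerabilities introduced by the use of LLMs.
Because LLMs are non-deterministic, formal verification is infeasible, and so are fixed testing schemes, whose static rule sets cannot handle free-form text outputs [53].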

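Core idea & method

An LLM as judge evaluates the quality of victim machine learning (ML) models, specifically LLMs, by analyzing their outputs.
An LLM as judge is the combination of one model and one specifically engineered judge prompt that contains the criteria for the analysis.
The resulting automation scales up the complex analysis of the victim models' free-form text outputs by delivering faster and more consistent judgments than human reviewers.
Thus, quality and security assessments of LLMs can cover a wide range of the victim models' use cases.

Experimental setup & results

As a comparatively new technique, LLMs as judges have not yet been thoroughly investigated for their reliability and their agreement with human judgment.
Our work evaluates the applicability of LLMs as automated quality assessors of victim LLMs.
We test the efficacy of 37 differently sized conversational LLMs in combination with 5 different judge prompts, the concept of a second-level judge, and 5 models fine-tuned for the task as assessors.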

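Limitations & risks

[3] employ a peer-review concept to find the best of multiple results by different judges.
[34] used the idea of a second-level judge to improve an initial judgment by the same LLM.
While we cannot include a comparative evaluation for all ...

Table 1: Overview of evaluation datasets. Dataset type is either a victim's undesired (u) output or its correctness (c).
Columns: Dataset | Size | Type | Judge Task ("Detecting content that ...")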

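Read-like-fullpaper digest

This paper addresses the widespread deployment of LLMs as chatbots [36], as virtual assistants in phones and cars [6], and as controllers of cyber-physical systems (CPSs) [48]. The core method is the LLM as judge, which evaluates the quality of victim machine learning (ML) models, specifically LLMs, by analyzing their outputs. The motivation is that automated judgments of the victim models' free-form text outputs are faster and more consistent than those of human reviewers.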