#3 Measuring What Matters -- or What's Convenient?: Robustness of LLM-Based Scoring Systems to Construct-Irrelevant Factors

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles the robustness of LLM-based scoring systems to construct-irrelevant factors. Methods for automatically evaluating written work have evolved from handcrafted features (e.g., type-token ratios, part-of-speech tagging) and simple models [10] to more complex approaches, including neural networks and transformer-based models [10, 15]. With the rise of LLM-based scoring solutions, there is renewed focus on how robust these systems are under adversarial conditions, especially given widespread public awareness of the limitations of the underlying technology (e.g., “hallucinations” [3]), which was not the case for earlier modes of automated scoring. Chief among the construct-irrelevant factors is text length: early studies of simple models showed that repeating sentences or entire paragraphs within an essay could artificially inflate predicted scores [14], and more recent studies show that the issue persists even in transformer-based scoring systems [15].

The core proposal is an automated scoring system for the assessment that combines a dual-architecture LLM-as-a-Judge feature-extraction component with clear-box regression algorithms, similar to systems described in the authors' cited references. In line with the constructs the instrument assesses, the system was designed to reward higher-level features related to personal and professional skills rather than language proficiency or content mastery. For example, a high score on a collaboration item required responses that demonstrate collaborative behaviors, such as suggesting approaches to managing differences and resolving conflict. Through simulations with varying sample sizes, the authors found that a set of at least 500 responses would let them reliably calculate paired Cohen’s d effect sizes at the desired precision (95% confidence interval width < 0.2), and they designed the sampling strategy around this heuristic.

The empirical case is built around a stratified sample. To ensure that all assessment items and responses of varying quality were represented, the authors selected responses from their larger dataset by stratifying on item and predicted score: predicted scores were placed into 10 equal-width bins across the 1–5 scoring range, and up to two responses were drawn at random from each bin for each item. The study uses this sample to investigate how a set of construct-irrelevant factors influences the scores produced by one such LLM-based system.

The central evidence summarized here is the motivating result from prior work: text length is the chief construct-irrelevant factor, with score inflation from repeated sentences or paragraphs shown first for simple models [14] and later for transformer-based scoring systems [15].

Overall, the paper is most convincing where its claims are directly supported by the reported comparisons. The scope of those claims should be read in light of the evaluation setup (one scoring system, and a stratified sample of fewer than 600 responses across 30 items) and the stated limitations.

Final takeaway

Main takeaway: text length is the chief construct-irrelevant threat to automated scoring. Repeating sentences or entire paragraphs artificially inflated the scores of early, simple models [14], and the issue persists even in transformer-based scoring systems [15].

Problem definition

Automatic evaluation of open-response text, including short-answer responses and essays, is one of the earliest and most widely explored applications of natural language processing and artificial intelligence (AI) in education.
Over time, methods for automatically evaluating written work have evolved from handcrafted features (e.g., type-token ratios, part-of-speech tagging) and simple models [10] to more complex approaches, including neural networks and transformer-based models [10, 15].
With the rise of LLM-based scoring solutions, there is renewed focus on the robustness of these systems under adversarial conditions, especially given widespread public awareness of the limitations of the underlying technology (e.g., “hallucinations” [3]), which was not the case for earlier modes of automated scoring.
Chief among the construct-irrelevant factors that threaten this robustness is text length: early studies of simple models showed that repeating sentences or entire paragraphs within an essay could artificially inflate predicted scores [14], and more recent studies show that the issue persists even in transformer-based scoring systems [15].

Core idea & method

Automated scoring system: the authors developed a scoring system for the assessment that combines a dual-architecture LLM-as-a-Judge feature-extraction component with clear-box regression algorithms, similar to systems described in their cited references.
In line with the constructs the instrument assesses, the system was designed to reward higher-level features related to personal and professional skills rather than language proficiency or content mastery.
For example, a high score on a collaboration item required responses that demonstrate collaborative behaviors, such as suggesting approaches to managing differences and resolving conflict.
Through simulations with varying sample sizes, the authors found that a set of at least 500 responses would let them reliably calculate paired Cohen’s d effect sizes at the desired precision (95% confidence interval width < 0.2), and they designed the sampling strategy around this heuristic.
To ensure that all assessment items and responses of varying quality were represented in the sample, they selected responses from the larger dataset by stratifying on item and predicted score: predicted scores were placed into 10 equal-width bins across the 1–5 scoring range, and up to two responses were drawn at random from each bin for each item.
The scoring system did not produce scores in every bin for every item (e.g., the model may not have predicted any scores in the (1.4, 1.8] bin for a given item), so the sample contained fewer than the 600 responses that would be expected had all 10 bins been attainable for all 30 items.
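The precision target above (95% confidence interval width < 0.2 for a paired Cohen's d) can be checked with a small simulation. The sketch below is illustrative only and is not the authors' procedure: it assumes normally distributed score differences, a hypothetical true effect of 0.3, and it defines the paired effect size as the mean difference divided by the standard deviation of the differences (one of several conventions). It approximates the interval width by the central 95% range of the estimates across simulated samples.

```python
import math
import random

def paired_cohens_d(diffs):
    """Paired effect size: mean of the differences over their sample SD."""
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((x - mean) ** 2 for x in diffs) / (n - 1)
    return mean / math.sqrt(var)

def simulated_ci_width(n, true_d=0.3, reps=2000, seed=0):
    """Central 95% range of d estimates over `reps` simulated samples of size n.

    Assumption: differences are Normal(true_d, 1), so the true effect is true_d.
    """
    rng = random.Random(seed)
    estimates = sorted(
        paired_cohens_d([rng.gauss(true_d, 1.0) for _ in range(n)])
        for _ in range(reps)
    )
    lower = estimates[round(0.025 * reps)]
    upper = estimates[round(0.975 * reps) - 1]
    return upper - lower

# Compare a small and a large sample against the 0.2 width target.
widths = {n: simulated_ci_width(n) for n in (100, 500)}
for n, width in widths.items():
    print(f"n={n}: approximate 95% interval width = {width:.3f}")
```

Under these assumptions the width at n = 500 falls below 0.2 while the width at n = 100 does not, which is consistent with the 500-response heuristic. Larger true effects widen the interval slightly, so the threshold depends on the effect sizes one expects.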

Actual findings

The evidence recorded here comes from prior work: text length is the chief construct-irrelevant factor. Early studies of simple models showed that repeating sentences or entire paragraphs within an essay could artificially inflate predicted scores [14], and more recent studies show that the issue persists even in transformer-based scoring systems [15].

How the conclusion was reached

Step 1, proposed approach: the authors developed an automated scoring system for the assessment that combines a dual-architecture LLM-as-a-Judge feature-extraction component with clear-box regression algorithms, similar to systems described in their cited references.
Step 2, evaluation setup or comparison basis: to ensure that all assessment items and responses of varying quality were represented, responses were selected from the larger dataset by stratifying on item and predicted score. Predicted scores were placed into 10 equal-width bins across the 1–5 scoring range, and up to two responses were drawn at random from each bin for each item.
Step 3, main reported evidence: text length is the chief construct-irrelevant factor. Repeating sentences or entire paragraphs inflated the scores of early, simple models [14], and the issue persists even in transformer-based scoring systems [15].
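The stratified sampling in Step 2 can be sketched as follows. This is a minimal illustration, not the authors' code: the tuple layout `(item_id, response_id, predicted_score)`, the function names, and the seed are assumptions. Bins are right-closed with width 0.4, which matches the (1.4, 1.8] example bin; a score of exactly 1.0 is placed in the first bin.

```python
import math
import random
from collections import defaultdict

def score_bin(score, low=1.0, high=5.0, n_bins=10):
    """Index (0-based) of the right-closed, equal-width bin containing `score`."""
    width = (high - low) / n_bins
    idx = math.ceil((score - low) / width) - 1
    return min(max(idx, 0), n_bins - 1)  # clamp the endpoints into range

def stratified_sample(responses, per_bin=2, seed=0):
    """Draw up to `per_bin` responses at random from each (item, score bin) stratum.

    responses: iterable of (item_id, response_id, predicted_score) tuples.
    """
    rng = random.Random(seed)
    strata = defaultdict(list)
    for item_id, response_id, score in responses:
        strata[(item_id, score_bin(score))].append(response_id)
    sample = []
    for key in sorted(strata):
        pool = strata[key]
        sample.extend(rng.sample(pool, min(per_bin, len(pool))))
    return sample

# Hypothetical data: item "A" has three responses in the lowest bin and one in the highest.
responses = [
    ("A", "r1", 1.1), ("A", "r2", 1.2), ("A", "r3", 1.3), ("A", "r4", 4.9),
    ("B", "r5", 3.1),
]
print(stratified_sample(responses))
```

With 30 items, 10 bins, and two responses per bin, the upper bound is 600 responses. Strata that contain no predicted scores contribute nothing, which is why a sample drawn this way can come in below that bound.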

Experimental setup & results

Text length is the chief construct-irrelevant factor identified in prior work: repeating sentences or entire paragraphs within an essay artificially inflated the scores of early, simple models [14], and the issue persists even in transformer-based scoring systems [15].
This study contributes to the understanding of how construct-irrelevant factors affect LLM-based scoring systems by investigating how a set of such factors influences the scores produced by one such system.
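The repetition attack described in prior work [14] is easy to reproduce as a perturbation. The sketch below is a hypothetical illustration and is not the study's protocol: `repeat_sentences`, `mean_score_shift`, and the deliberately length-sensitive `toy_length_scorer` are assumed names, and a real evaluation would pass in the scoring system under test in place of the toy scorer.

```python
import re

def repeat_sentences(text, times=2):
    """Repeat every sentence `times` times in place, adding length but no new content."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return " ".join(s for s in sentences for _ in range(times))

def mean_score_shift(score_fn, texts, perturb):
    """Average change in score when each text is replaced by its perturbed version."""
    shifts = [score_fn(perturb(t)) - score_fn(t) for t in texts]
    return sum(shifts) / len(shifts)

def toy_length_scorer(text):
    """Hypothetical scorer on a 1-5 scale that rewards word count alone."""
    return min(5.0, 1.0 + len(text.split()) / 10)

responses = ["I agree. We should talk!"]
print(repeat_sentences(responses[0]))
shift = mean_score_shift(toy_length_scorer, responses, repeat_sentences)
print(f"mean score shift after repetition: {shift:+.2f}")
```

A robust scorer should show a shift near zero under this perturbation, since the repeated text adds no construct-relevant content. Dividing the mean shift by the standard deviation of the per-response shifts gives a paired effect size.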

Limitations & risks
