#2 ELT-Bench-Verified: Benchmark Quality Issues Underestimate AI Agent Capabilities

Score: 29.0 | Matched keywords: agent, ai, ai agent, ai agents, benchmark, large language models, llm

Detailed Summary (EN)

Read-like-fullpaper digest

This paper tackles how AI agents are evaluated on building Extract-Load-Transform (ELT) pipelines. Modern organizations rely heavily on these pipelines, workflows that extract data from heterogeneous sources, load it, and transform it. Constructing them, however, remains a highly manual, labor-intensive process requiring expertise across diverse data sources [2, 17], cloud warehouses like Snowflake [7], and transformation frameworks like dbt [8]. The initial baseline results were stark: SWE-Agent with Claude Sonnet 3.5 [4] achieved only a 37% success rate on data extraction and loading and a mere 1% on data transformation. The work is licensed under the Creative Commons BY-NC-ND 4.0 International License.

The core proposal is twofold. First, the authors re-evaluate ELT-Bench with the same agent framework while upgrading only the underlying large language model. Second, they develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' κ = 0.85) to systematically audit benchmark quality.

The empirical case is built around an error taxonomy: tasks are classified by error source (agent-attributable, benchmark-attributable, or mixed) and further stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground-truth column removal. Re-evaluating ELT-Bench with the same agent framework but upgrading only the underlying large language model reveals that the extraction and loading stage is largely solved, while transformation performance improves dramatically.

The central reported finding is that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: Our results demonstrate that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.
Most important supporting result: Tasks are classified by error source—agent-attributable, benchmark-attributable, or mixed—and further stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground truth column removal.

Problem definition

Modern organizations rely heavily on Extract-Load-Transform (ELT) pipelines, workflows that extract data from heterogeneous sources, load it, and transform it.
However, constructing these pipelines remains a highly manual, labor-intensive process requiring expertise across diverse data sources [2, 17], cloud warehouses like Snowflake [7], and transformation frameworks like dbt [8].
The initial baseline results were stark: SWE-Agent with Claude Sonnet 3.5 [4] achieved only a 37% success rate on data extraction and loading and a mere 1% on data transformation.
Motivated by recent discoveries of pervasive annotation errors in text-to-SQL benchmarks [9], we conduct a systematic data quality audit of ELT-Bench.

Core idea & method

We re-evaluate ELT-Bench with the same agent framework, upgrading only the underlying large language model; this reveals that the extraction and loading stage is largely solved, while transformation performance improves dramatically.
We also develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' κ = 0.85) to systematically audit benchmark quality.
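
The agreement statistic quoted above, Fleiss' κ, can be computed from raw annotator labels. The following is a minimal sketch of the standard formula, not the authors' code: the function name `fleiss_kappa`, the label strings, and the sample ratings are illustrative assumptions, and every item is assumed to be rated by the same number of annotators.

```python
from collections import Counter


def fleiss_kappa(ratings, categories):
    """Fleiss' kappa for a list of per-item label lists.

    ratings[i] holds the labels that the annotators gave to item i.
    Every item must carry the same number of labels (at least two).
    """
    n_items = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(r) != n_raters for r in ratings):
        raise ValueError("each item needs the same number (>= 2) of ratings")

    # counts[i][c] = how many annotators assigned category c to item i
    counts = [Counter(r) for r in ratings]

    # Observed agreement: mean over items of the pairwise agreement P_i
    p_items = [
        (sum(c[cat] ** 2 for cat in categories) - n_raters)
        / (n_raters * (n_raters - 1))
        for c in counts
    ]
    p_bar = sum(p_items) / n_items

    # Chance agreement: sum of squared overall category proportions
    p_cats = [
        sum(c[cat] for c in counts) / (n_items * n_raters) for cat in categories
    ]
    p_e = sum(p * p for p in p_cats)

    if p_e == 1.0:
        return 1.0  # every label is identical, so agreement is trivially perfect
    return (p_bar - p_e) / (1.0 - p_e)


# Illustrative (made-up) labels from three annotators on four tasks
LABELS = ["agent", "benchmark", "mixed"]
sample = [
    ["agent", "agent", "agent"],
    ["benchmark", "benchmark", "mixed"],
    ["mixed", "mixed", "mixed"],
    ["benchmark", "benchmark", "benchmark"],
]
print(round(fleiss_kappa(sample, LABELS), 3))
```

Values near 1 indicate agreement well above chance, which is how the reported 0.85 should be read.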

Actual findings

Our results demonstrate that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.
Tasks are classified by error source—agent-attributable, benchmark-attributable, or mixed—and further stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground truth column removal.
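
The two-level classification described above (error source, then mitigability) maps naturally onto a small record type. This is a hypothetical sketch for illustration only: the class names, field names, and task identifiers are assumptions and do not come from the paper or from ELT-Bench.

```python
from collections import Counter
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorSource(Enum):
    AGENT = "agent-attributable"
    BENCHMARK = "benchmark-attributable"
    MIXED = "mixed"


class Mitigation(Enum):
    EVALUATION_REFINEMENT = "evaluation refinement"
    COLUMN_REMOVAL = "ground-truth column removal"


@dataclass(frozen=True)
class TaskAudit:
    task_id: str                              # hypothetical identifier
    source: ErrorSource
    mitigation: Optional[Mitigation] = None   # None when no benchmark fix applies


def stratify(audits):
    """Count audited tasks per (error source, mitigation) cell."""
    return Counter((a.source, a.mitigation) for a in audits)


# Made-up audit records, purely for illustration
audits = [
    TaskAudit("task_a", ErrorSource.AGENT),
    TaskAudit("task_b", ErrorSource.BENCHMARK, Mitigation.EVALUATION_REFINEMENT),
    TaskAudit("task_c", ErrorSource.BENCHMARK, Mitigation.COLUMN_REMOVAL),
    TaskAudit("task_d", ErrorSource.MIXED, Mitigation.EVALUATION_REFINEMENT),
]
for (source, mitigation), n in stratify(audits).items():
    label = mitigation.value if mitigation else "n/a"
    print(f"{source.value:24s} {label:28s} {n}")
```

Keeping mitigation optional reflects the assumption that purely agent-attributable failures need no benchmark change.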

How the conclusion was reached

Step 1 (proposed approach): We develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' κ = 0.85) to systematically audit benchmark quality.
Step 2 (evaluation setup or comparison basis): Tasks are classified by error source (agent-attributable, benchmark-attributable, or mixed) and further stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground-truth column removal.
Step 3 (main reported evidence): Our results demonstrate that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.
Step 4 (additional supporting or qualifying result): Re-evaluating ELT-Bench with the same agent framework but upgrading only the underlying large language model reveals that the extraction and loading stage is largely solved, while transformation performance improves dramatically.

Experimental setup & results

Our results demonstrate that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.
Tasks are classified by error source—agent-attributable, benchmark-attributable, or mixed—and further stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground truth column removal.
Re-evaluating ELT-Bench with the same agent framework but upgrading only the underlying large language model reveals that the extraction and loading stage is largely solved, while transformation performance improves dramatically.
We develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' κ = 0.85) to systematically audit benchmark quality.
More broadly, our findings echo recent observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting that benchmark quality issues are a systemic problem across data engineering evaluation.
Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and substantial ground-truth revision.
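
To make the idea of ground-truth column removal concrete, here is a minimal sketch of a table comparison that ignores excluded columns. The digest does not describe the benchmark's actual evaluation logic, so everything here is an assumption: the function name `tables_match`, the order-insensitive row comparison, the float rounding, and the sample rows are all illustrative.

```python
from collections import Counter


def _normalize(value, ndigits):
    # Round floats so tiny numeric noise does not count as a mismatch
    if isinstance(value, float):
        return round(value, ndigits)
    return value


def tables_match(predicted, expected, columns, excluded=(), ndigits=6):
    """Compare two tables (lists of dict rows) as multisets of rows.

    Columns listed in `excluded` are dropped from both sides before
    comparing, which mimics removing an unreliable ground-truth column.
    Cell values must be hashable.
    """
    skip = set(excluded)
    kept = [c for c in columns if c not in skip]

    def project(rows):
        return Counter(
            tuple(_normalize(row.get(c), ndigits) for c in kept) for row in rows
        )

    return project(predicted) == project(expected)


# Made-up rows: the "region" ground truth is assumed to be unreliable
COLUMNS = ["id", "total", "region"]
expected = [
    {"id": 1, "total": 10.0, "region": "EU"},
    {"id": 2, "total": 5.5, "region": "US"},
]
predicted = [
    {"id": 2, "total": 5.5, "region": None},
    {"id": 1, "total": 10.0, "region": None},
]
print(tables_match(predicted, expected, COLUMNS))
print(tables_match(predicted, expected, COLUMNS, excluded=["region"]))
```

With the disputed column kept, the comparison fails on a field the agent could not reasonably reproduce; with it excluded, the same output passes on the remaining columns.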

Limitations & risks

Detailed Summary (KO)

Read-like-fullpaper digest

Modern organizations rely heavily on Extract-Load-Transform (ELT) pipelines, workflows that extract data from heterogeneous sources. Constructing these pipelines, however, remains a highly manual, labor-intensive process requiring expertise across diverse data sources [2, 17], cloud warehouses like Snowflake [7], and transformation frameworks like dbt [8]. The initial baseline results were stark: SWE-Agent with Claude Sonnet 3.5 [4] achieved only a 37% success rate on data extraction and loading and a mere 1% on data transformation.

The core proposal is an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' κ = 0.85) to systematically audit benchmark quality. In addition, re-evaluating ELT-Bench with the same agent framework while upgrading only the underlying large language model shows that the extraction and loading stage is largely solved and that transformation performance improves dramatically.

The empirical case rests on classifying tasks by error source (agent-attributable, benchmark-attributable, or mixed) and stratifying them by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground-truth column removal.

The central reported finding is that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.

Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, but the scope of the claim should still be read in light of the evaluation setup and stated limitations.

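Final takeaway

Main takeaway: Our results demonstrate that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.
Most important supporting result: Tasks are classified by error source (agent-attributable, benchmark-attributable, or mixed) and stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground-truth column removal.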

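Problem definition

Modern organizations rely heavily on Extract-Load-Transform (ELT) pipelines, workflows that extract data from heterogeneous sources.
However, constructing these pipelines remains a highly manual, labor-intensive process requiring expertise across diverse data sources [2, 17], cloud warehouses like Snowflake [7], and transformation frameworks like dbt [8].
The initial baseline results were stark: SWE-Agent with Claude Sonnet 3.5 [4] achieved only a 37% success rate on data extraction and loading and a mere 1% on data transformation.
Motivated by recent discoveries of pervasive annotation errors in text-to-SQL benchmarks, we conduct a systematic data quality audit of ELT-Bench.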

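Core idea & method

We re-evaluate ELT-Bench with the same agent framework, upgrading only the underlying large language model; this reveals that the extraction and loading stage is largely solved, while transformation performance improves dramatically.
We also develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' κ = 0.85) to systematically audit benchmark quality.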

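Actual findings

Our results demonstrate that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.
Tasks are classified by error source (agent-attributable, benchmark-attributable, or mixed) and stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground-truth column removal.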

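How the conclusion was reached

Step 1 (proposed approach): We develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' κ = 0.85) to systematically audit benchmark quality.
Step 2 (evaluation setup or comparison basis): Tasks are classified by error source (agent-attributable, benchmark-attributable, or mixed) and stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground-truth column removal.
Step 3 (main reported evidence): Our results demonstrate that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.
Step 4 (additional supporting or qualifying result): Re-evaluating ELT-Bench with the same agent framework but upgrading only the underlying large language model reveals that the extraction and loading stage is largely solved, while transformation performance improves dramatically.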

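Experimental setup & results

Our results demonstrate that both rapid model improvement and benchmark quality issues contributed to a substantial underestimation of agent capabilities in the original evaluation.
Tasks are classified by error source (agent-attributable, benchmark-attributable, or mixed) and stratified by mitigability, distinguishing errors addressable through evaluation refinements from those requiring ground-truth column removal.
Re-evaluating ELT-Bench with the same agent framework but upgrading only the underlying large language model reveals that the extraction and loading stage is largely solved, while transformation performance improves dramatically.
We develop an Auditor-Corrector methodology that combines scalable LLM-driven root-cause analysis with rigorous human validation (inter-annotator agreement Fleiss' κ = 0.85) to systematically audit benchmark quality.
More broadly, our findings echo recent observations of pervasive annotation errors in text-to-SQL benchmarks, suggesting that benchmark quality issues are a systemic problem across data engineering evaluation.
Based on these findings, we construct ELT-Bench-Verified, a revised benchmark with refined evaluation logic and substantial ground-truth revision.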