#5 Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation

Detailed Summary (EN)

Problem definition

[1] report that DeepSeek-R1 acknowledges sycophantic hints in its chain-of-thought only 39% of the time.
A regex-and-LLM pipeline, applied to the same model on comparable questions, yields 94.8%.
These evaluations differ in prompt design, model version, and classification methodology, so direct numerical comparison is imprecise; even so, the magnitude of the spread illustrates how sensitive faithfulness estimates are to evaluation choices.
Chain-of-thought prompting [2] has become the dominant paradigm for eliciting reasoning from large language models, and a growing literature treats measured faithfulness rates as objective properties of models [3, 4, 1].

Core idea & method

Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters.
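The difference between a regex-only detector and a two-stage pipeline can be sketched as follows. This is a minimal illustration, not the paper's implementation: the pattern list, the function names, and the stub judge are hypothetical, and the way the two stages are combined (accept regex hits, escalate everything else to an LLM judge) is an assumption that this summary does not confirm.

```python
import re
from typing import Callable

# Hypothetical patterns; the paper's actual regexes are not given in this summary.
HINT_PATTERNS = [
    re.compile(
        r"\b(?:user|professor|hint|grader)\b[^.\n]*\b(?:suggest|said|says|indicat)\w*",
        re.IGNORECASE,
    ),
    re.compile(r"\baccording to the (?:hint|user|grader)\b", re.IGNORECASE),
]


def regex_detector(trace: str) -> bool:
    """Lexical-mention check: True if any pattern matches the reasoning trace."""
    return any(p.search(trace) for p in HINT_PATTERNS)


def two_stage_pipeline(trace: str, llm_judge: Callable[[str], bool]) -> bool:
    """Assumed composition: accept regex hits, send the remaining traces to a judge."""
    if regex_detector(trace):
        return True
    return llm_judge(trace)


def stub_judge(trace: str) -> bool:
    """Stand-in for a model call so the sketch runs offline (hypothetical rule)."""
    return "was told" in trace.lower()


explicit = "The user suggested the answer is B, so I will check B first."
implicit = "I was told the answer is C."
unrelated = "Computing 2 + 2 gives 4."

regex_labels = [regex_detector(t) for t in (explicit, implicit, unrelated)]
pipeline_labels = [two_stage_pipeline(t, stub_judge) for t in (explicit, implicit, unrelated)]
```

Under this assumed composition the pipeline can only add positives to the regex stage, so its faithfulness rate can never fall below the regex-only rate on the same traces.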

Experimental setup & results

On identical data, the three classifiers (regex-only detector, regex-plus-LLM pipeline, and Claude Sonnet 4 judge) produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively.
The disagreements are systematic rather than random. Inter-classifier agreement, measured by Cohen’s κ, ranges from 0.06 (“slight”) for sycophancy hints to 0.42 (“moderate”) for grader hints. The asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction.
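For reference, Cohen's κ compares observed agreement p_o with the agreement p_e expected by chance from each classifier's marginal rates: κ = (p_o − p_e) / (1 − p_e). The sketch below computes κ and the two directional disagreement counts for a pair of binary label sequences. The ten labels are toy data chosen for illustration, not the paper's traces, and the function names are hypothetical.

```python
from typing import Sequence, Tuple


def cohens_kappa(a: Sequence[bool], b: Sequence[bool]) -> float:
    """Cohen's kappa for two binary raters labeling the same items."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("label sequences must be non-empty and of equal length")
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n       # observed agreement
    pa, pb = sum(a) / n, sum(b) / n                   # each rater's positive rate
    p_e = pa * pb + (1 - pa) * (1 - pb)               # chance agreement
    if p_e == 1.0:
        return float("nan")                           # undefined: both raters constant
    return (p_o - p_e) / (1 - p_e)


def disagreement_counts(a: Sequence[bool], b: Sequence[bool]) -> Tuple[int, int]:
    """Return (a faithful / b unfaithful, a unfaithful / b faithful)."""
    a_only = sum(x and not y for x, y in zip(a, b))
    b_only = sum(y and not x for x, y in zip(a, b))
    return a_only, b_only


# Toy labels (True = classified as faithful); illustrative only.
pipeline = [True] * 8 + [False] * 2
judge = [True] * 4 + [False] * 6

kappa = cohens_kappa(pipeline, judge)
asymmetry = disagreement_counts(pipeline, judge)
```

In the toy data every disagreement runs in one direction, the same kind of one-sided pattern as the 883-versus-2 split reported for sycophancy hints.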
Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd.
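A rank reversal of this kind can be checked mechanically by ranking models under each classifier and comparing positions. The sketch below is illustrative only: the model names and per-model rates are made up, and the helper names are hypothetical.

```python
from typing import Dict


def rank_models(rates: Dict[str, float]) -> Dict[str, int]:
    """Rank 1 = highest faithfulness rate; ties are broken by model name."""
    ordered = sorted(rates, key=lambda m: (-rates[m], m))
    return {model: position + 1 for position, model in enumerate(ordered)}


def rank_shifts(rates_a: Dict[str, float], rates_b: Dict[str, float]) -> Dict[str, int]:
    """Rank under classifier B minus rank under classifier A, per model."""
    if rates_a.keys() != rates_b.keys():
        raise ValueError("both classifiers must score the same set of models")
    ranks_a, ranks_b = rank_models(rates_a), rank_models(rates_b)
    return {model: ranks_b[model] - ranks_a[model] for model in ranks_a}


# Hypothetical per-model faithfulness rates under two classifiers.
pipeline_rates = {"model_a": 0.95, "model_b": 0.90, "model_c": 0.80}
judge_rates = {"model_a": 0.60, "model_b": 0.75, "model_c": 0.70}

shifts = rank_shifts(pipeline_rates, judge_rates)
```

A positive shift means the model drops in the ranking when moving from the first classifier to the second; here model_a falls from 1st to 3rd.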
The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior.

Limitations & risks

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation. Richard J.

Read-like-fullpaper digest

This paper addresses how sensitive chain-of-thought faithfulness estimates are to the choice of classifier. It is motivated by the gap between the 39% hint-acknowledgment rate that [1] report for DeepSeek-R1 on sycophantic hints and the 94.8% that a regex-and-LLM pipeline yields for the same model on comparable questions. The core method applies three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. Key empirical findings: on identical data the classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively; Cohen’s κ ranges from 0.06 for sycophancy hints to 0.42 for grader hints; and classifier choice can reverse model rankings.

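Detailed Summary (KO)

Problem definition

[1] report that DeepSeek-R1 acknowledges sycophantic hints in its chain-of-thought only 39% of the time.
A regex-and-LLM pipeline, applied to the same model on comparable questions, yields 94.8%.
These evaluations differ in prompt design, model version, and classification methodology, so direct numerical comparison is imprecise; even so, the magnitude of the spread illustrates how sensitive faithfulness estimates are to evaluation choices.
Chain-of-thought prompting [2] has become the dominant paradigm for eliciting reasoning from large language models, and a growing literature treats measured faithfulness rates as objective properties of models [3, 4, 1].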

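Core idea & method

Three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) are applied to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters.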

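Experimental setup & results

On identical data, the three classifiers (regex-only detector, regex-plus-LLM pipeline, and Claude Sonnet 4 judge) produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively.
The disagreements are systematic rather than random. Inter-classifier agreement, measured by Cohen's κ, ranges from 0.06 ("slight") for sycophancy hints to 0.42 ("moderate") for grader hints. The asymmetry is pronounced: for sycophancy, 883 cases are classified as faithful by the pipeline but unfaithful by the Sonnet judge, while only 2 go the other direction.
Classifier choice can also reverse model rankings: Qwen3.5-27B ranks 1st under the pipeline but 7th under the Sonnet judge; OLMo-3.1-32B moves in the opposite direction, from 9th to 3rd.
The root cause is that different classifiers operationalize related faithfulness constructs at different levels of stringency (lexical mention versus epistemic dependence), and these constructs yield divergent measurements on the same behavior.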

Limitations & risks

Measuring Faithfulness Depends on How You Measure: Classifier Sensitivity in LLM Chain-of-Thought Evaluation. Richard J.

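Read-like-fullpaper digest

This paper addresses how sensitive chain-of-thought faithfulness estimates are to the choice of classifier, starting from the report in [1] that DeepSeek-R1 acknowledges sycophantic hints in its chain-of-thought only 39% of the time. The core method applies three classifiers (a regex-only detector, a two-stage regex-plus-LLM pipeline, and an independent Claude Sonnet 4 judge) to 10,276 influenced reasoning traces from 12 open-weight models spanning 9 families and 7B to 1T parameters. Key empirical findings: on identical data the classifiers produce overall faithfulness rates of 74.4%, 82.6%, and 69.7%, respectively, and their disagreements are systematic rather than random.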