#2 Reasoning Gets Harder for LLMs Inside A Dialogue

Score: 20.0 | Matched keywords: benchmark, large language models, llm, reasoning

Detailed Summary (EN)

Problem definition

Large Language Models (LLMs) have recently demonstrated promising results on a number of complex reasoning benchmarks (DeepSeek-AI, 2025; Yang et al., 2025).
However, these benchmarks usually evaluate LLMs on well-defined isolated tasks (Hendrycks et al., 2021; Rein et al., 2023; Chollet et al., 2025), a setup which often differs from their practical real-world use.
Trains: [{"id": "TR5972", "departure":...}] We are given: [...] The latest train that arrives in London Liverpool Street before sunset departs at 5:59pm.
Answer: 17:59 I need a train departing Cambridge and arriving at London Liverpool Street today.

Core idea & method

Trains: [{"id": "TR5972", "departure":...}] We are given: [...] The latest train that arrives in London Liverpool Street before sunset departs at 5:59pm.
Answer: 17:59 I need a train departing Cambridge and arriving at London Liverpool Street today.
You are a helpful assistant specialized in providing travel guidance for Cambridge.
[...] [{"function": {"name": "search_trains",...}}] (a) Reasoning in isolation (b) Reasoning within task-oriented dialogue [{"train_id": "TR5972", "departure":...}] I found four trains departing today starting from 5:10pm.
What time is the last departure that will get me to London Liverpool Street before sunset?

Experimental setup & results

highlight the need to evaluate LLM reasoning in realistic interactive scenarios.1 1 Introduction Large Language Models (LLMs) have recently demonstrated promising results on a number of complex reasoning benchmarks (DeepSeek-AI, 2025; Yang et al., 2025).
However, these benchmarks usually evaluate LLMs on well-defined isolated tasks (Hendrycks et al., 2021; Rein et al., 2023; Chollet et al., 2025), a setup which often differs from their practical real-world use.
Trains: [{"id": "TR5972", "departure":...}] We are given: [...] The latest train that arrives in London Liverpool Street before sunset departs at 5:59pm.
Answer: 17:59 I need a train departing Cambridge and arriving at London Liverpool Street today.
You are a helpful assistant specialized in providing travel guidance for Cambridge.

Limitations & risks

Our ablation experiments reveal multiple factors that contribute to the performance gap between the standalone and dialogue settings.
Multi-turn interaction emerges as the most prominent of these, which aligns with findings from previous work on multi-turn evaluation (Bai et al., 2024; Deshpande et al., 2025; Kwan et al., 2024; Wang and Zhao, 2024).
Although role conditioning is not observed consistently across all LLMs, it still affects half of them in our experiments, suggesting that persona bias may be generally problematic for reasoning within TOD.
Different types of persona bias in LLMs have also been identified in previous studies (Gupta et al., 2024; Yeo et al., 2025).

Read-like-fullpaper digest

This paper addresses Large Language Models (LLMs) have recently demonstrated promising results on a number of complex reasoning benchmarks (DeepSeek-AI, 2025; Yang et al., 2025). The core method is Trains: [{"id": "TR5972", "departure":...}] We are given: [...] The latest train that arrives in London Liverpool Street before sunset departs at 5:59pm. Key empirical findings include highlight the need to evaluate LLM reasoning in realistic interactive scenarios.1 1 Introduction Large Language Models (LLMs) have recently demonstrated promising results on a number of complex reasoning benchmarks (DeepSeek-AI, 2025; Yang et al., 2025).

상세 요약 (KO)

문제 정의

LLM(대규모 언어 모델)은 최근 여러 복잡한 추론 벤치마크에서 유망한 결과를 보여주었습니다(DeepSeek-AI, 2025; Yang et al., 2025).
그러나 이러한 벤치마크는 일반적으로 잘 정의된 격리된 작업(Hendrycks et al., 2021; Rein et al., 2023; Chollet et al., 2025)에 대한 LLM을 평가하며, 이는 실제 실제 사용과 종종 다른 설정입니다.
열차: [{"id": "TR5972", "departure":...}] 주어진 정보는 다음과 같습니다. [...] 일몰 전 런던 리버풀 스트리트에 도착하는 가장 늦은 열차는 오후 5시 59분에 출발합니다.
답변: 17:59 오늘 케임브리지에서 출발하여 런던 리버풀 스트리트에 도착하는 기차가 필요합니다.

핵심 아이디어/방법

열차: [{"id": "TR5972", "departure":...}] 주어진 정보는 다음과 같습니다. [...] 일몰 전 런던 리버풀 스트리트에 도착하는 가장 늦은 열차는 오후 5시 59분에 출발합니다.
답변: 17:59 오늘 케임브리지에서 출발하여 런던 리버풀 스트리트에 도착하는 기차가 필요합니다.
당신은 케임브리지 여행 안내를 전문으로 하는 도움이 되는 조수입니다.
[...] [{"function": {"name": "search_trains",...}}] (a) 격리된 추론 (b) 작업 중심 대화 내에서의 추론 [{"train_id": "TR5972", "departure":...}] 오늘 오후 5시 10분부터 출발하는 열차 4대를 발견했습니다.
일몰 전 런던 리버풀 스트리트에 도착할 수 있는 마지막 출발 시간은 언제인가요?

실험 설정/결과

현실적인 대화형 시나리오에서 LLM 추론을 평가해야 한다는 점을 강조합니다.1 1 서문 LLM(대규모 언어 모델)은 최근 여러 복잡한 추론 벤치마크에서 유망한 결과를 보여주었습니다(DeepSeek-AI, 2025; Yang et al., 2025).
그러나 이러한 벤치마크는 일반적으로 잘 정의된 격리된 작업(Hendrycks et al., 2021; Rein et al., 2023; Chollet et al., 2025)에 대한 LLM을 평가하며, 이는 실제 실제 사용과 종종 다른 설정입니다.
열차: [{"id": "TR5972", "departure":...}] 주어진 정보는 다음과 같습니다. [...] 일몰 전 런던 리버풀 스트리트에 도착하는 가장 늦은 열차는 오후 5시 59분에 출발합니다.
답변: 17:59 오늘 케임브리지에서 출발하여 런던 리버풀 스트리트에 도착하는 기차가 필요합니다.
당신은 케임브리지 여행 안내를 전문으로 하는 도움이 되는 조수입니다.

한계/리스크

우리의 절제 실험은 독립 실행형 설정과 대화 설정 간의 성능 격차에 기여하는 여러 요인을 보여줍니다.
다중 회전 상호작용은 이들 중 가장 눈에 띄는 것으로 나타나며, 이는 다중 회전 평가에 대한 이전 연구 결과와 일치합니다(Bai et al., 2024; Deshpande et al., 2025; Kwan et al., 2024; Wang and Zhao, 2024).
역할 조건화는 모든 LLM에서 일관되게 관찰되지는 않지만 실험에서는 여전히 절반에 영향을 미치며, 이는 페르소나 편견이 일반적으로 TOD 내 추론에 문제가 될 수 있음을 시사합니다.
LLM의 다양한 유형의 페르소나 편향도 이전 연구에서 확인되었습니다(Gupta et al., 2024; Yeo et al., 2025).

전체 논문 읽은 느낌 요약

이 문서에서는 최근 여러 복잡한 추론 벤치마크에서 유망한 결과를 입증한 LLM(대규모 언어 모델)을 다룹니다(DeepSeek-AI, 2025; Yang et al., 2025). 핵심 방법은 Trains: [{"id": "TR5972", "departure":...}] 다음과 같습니다. [...] 일몰 전에 런던 리버풀 스트리트에 도착하는 가장 늦은 열차는 오후 5시 59분에 출발합니다. 주요 경험적 연구 결과에는 현실적인 대화형 시나리오에서 LLM 추론을 평가해야 한다는 점을 강조하는 내용이 포함됩니다.1 1 소개 대규모 언어 모델(LLM)은 최근 여러 복잡한 추론 벤치마크에서 유망한 결과를 보여주었습니다(DeepSeek-AI, 2025; Yang et al., 2025).