#2 FinTradeBench: A Financial Reasoning Benchmark for LLMs

Score: 21.6 | Matched keywords: alignment, benchmark, large language models, llm, reasoning, retrieval-augmented

Detailed Summary (EN)

Problem definition

Real-world financial analysis requires reasoning on two complementary information sources: company fundamentals and trading signals.
Figure 1: Performance comparison of proprietary LLMs on a trading signal-focused question. There was no pullback in Nvidia’s stock in July 2025, and it was not a lucrative buying opportunity; only Claude correctly identified the pullback component.
Company fundamentals are accounting-based metrics derived from company balance sheets or Securities and Exchange Commission (SEC) filings, such as profitability, leverage, and valuation ratios, that capture a company’s underlying financial health (Fama and French, 1992; Harvey et al., 2016).
In contrast, trading signals, computed from historical price and volume data, capture market dynamics and investor sentiment, including momentum, volatility, and trend reversals (Brock et al., 1992; Jegadeesh and Titman, 1993; Lo et al., 2000; Andersen et al., 2003; Park and Irwin, 2007; Choi, 2021).
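To make the distinction between the two information sources concrete, the following is a minimal sketch, not the paper's code. The function names, the choice of ratios (return on equity, debt-to-equity, price-to-earnings), and the toy numbers are all assumptions made for illustration.

```python
import math
from statistics import stdev


def fundamentals(net_income, equity, total_debt, price, eps):
    """Accounting-based ratios (hypothetical selection, one per family)."""
    return {
        "roe": net_income / equity,             # profitability
        "debt_to_equity": total_debt / equity,  # leverage
        "pe": price / eps,                      # valuation
    }


def momentum(prices, lookback):
    """Simple return over the last `lookback` periods of a price series."""
    return prices[-1] / prices[-1 - lookback] - 1.0


def realized_volatility(prices):
    """Sample standard deviation of period-over-period log returns."""
    log_returns = [math.log(b / a) for a, b in zip(prices, prices[1:])]
    return stdev(log_returns)


# Toy inputs, invented for illustration only.
prices = [100.0, 102.0, 101.0, 105.0, 110.0]
print(fundamentals(net_income=50.0, equity=250.0, total_debt=125.0,
                   price=110.0, eps=5.5))
print(momentum(prices, lookback=4))
print(realized_volatility(prices))
```

The fundamentals come from a single snapshot of filing data, while the signals need an ordered price history, which is the time-series dimension the benchmark probes.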

Core idea & method

The benchmark is built with a pipeline that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human–LLM judge alignment.
We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and observe a clear performance gap.
Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning.
These findings highlight fundamental challenges in numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.

Experimental setup & results

Using a calibration-then-scaling pipeline, we combine 150 expert-authored seed questions (50 per category), each with golden key indicators, and scale them across firms and time periods to yield 1,400 total benchmark questions.
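The scaling step can be pictured as template expansion. The sketch below is an assumption-laden illustration: the `{firm}` and `{period}` placeholders, the dictionary fields, and the use of a full cross product are hypothetical, and the summary does not say how the authors actually sample firms and periods to reach 1,400 questions.

```python
from itertools import product


def scale_seed_questions(seeds, firms, periods):
    """Instantiate each seed template for every (firm, period) pair."""
    scaled = []
    for seed, firm, period in product(seeds, firms, periods):
        scaled.append({
            "category": seed["category"],
            "question": seed["template"].format(firm=firm, period=period),
            # Golden key indicators travel with every instantiated question.
            "key_indicators": list(seed["key_indicators"]),
            "firm": firm,
            "period": period,
        })
    return scaled


# Hypothetical seeds, tickers, and periods for illustration only.
seeds = [
    {"category": "F",
     "template": "How did {firm}'s leverage change during {period}?",
     "key_indicators": ["debt_to_equity"]},
    {"category": "T",
     "template": "Did {firm} show a momentum pullback in {period}?",
     "key_indicators": ["momentum", "volatility"]},
]
firms = ["NVDA", "AAPL"]
periods = ["2025-Q2", "2025-Q3"]

scaled = scale_seed_questions(seeds, firms, periods)
print(len(scaled))
print(scaled[0]["question"])
```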
We benchmark 14 LLMs in zero-shot prompting and retrieval-augmented settings and observe a clear performance gap in financial reasoning.
Retrieval substantially improves performance on fundamentals-focused questions (37% higher accuracy) and hybrid reasoning questions (55% higher accuracy), but offers limited or negative gains for trading-signal questions derived from time-series data; see Table 2.
This suggests that while current LLMs can effectively leverage textual financial information, they struggle to interpret quantitative market dynamics.
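A per-category comparison of this kind can be computed as sketched below. This is an illustrative sketch only: the record fields `correct_no_rag` and `correct_rag`, the helper names, and the toy records are assumptions, and the toy values are invented rather than taken from the paper.

```python
from collections import defaultdict


def accuracy_by_category(records, key):
    """Fraction of correct answers per question category."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for record in records:
        totals[record["category"]] += 1
        correct[record["category"]] += int(record[key])
    return {cat: correct[cat] / totals[cat] for cat in totals}


def retrieval_gain(records):
    """Accuracy with retrieval minus accuracy without it, per category."""
    base = accuracy_by_category(records, "correct_no_rag")
    rag = accuracy_by_category(records, "correct_rag")
    return {cat: rag[cat] - base[cat] for cat in base}


def make_records(category, no_rag, rag):
    """Build toy question-level records from two 0/1 correctness lists."""
    return [{"category": category, "correct_no_rag": a, "correct_rag": b}
            for a, b in zip(no_rag, rag)]


# Invented toy data: F = fundamentals, T = trading signals.
records = (make_records("F", [1, 0, 0, 0], [1, 1, 1, 0])
           + make_records("T", [1, 1, 0, 0], [1, 0, 0, 0]))
print(retrieval_gain(records))
```

A positive value means retrieval helped that category, and a negative value means it hurt.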

Limitations & risks

Table 2 compares the RAG-based and No-RAG configurations of the 14 evaluated LLMs on FinTradeBench.
Paired t-tests on question-level correctness scores assess the statistical reliability of RAG-induced changes.
Table 3 (and Figure 5 in §H.3) complement this with global generative quality metrics, revealing how RAG reshapes model reasoning behavior beyond raw accuracy.
Our analysis surfaces the following findings: (1) RAG strongly benefits fundamental reasoning (F) and degrades trading signal (T) reasoning.
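For readers unfamiliar with the paired t-test mentioned above, here is a minimal standard-library sketch of the test statistic on question-level correctness scores. The function name and the toy scores are assumptions, and the paper's own statistical code is not shown in this summary. Obtaining a p-value additionally requires the t-distribution, for example via `scipy.stats.ttest_rel`.

```python
import math
from statistics import mean, stdev


def paired_t_statistic(scores_rag, scores_no_rag):
    """Return (t, degrees_of_freedom) for paired per-question scores."""
    if len(scores_rag) != len(scores_no_rag):
        raise ValueError("paired samples must have equal length")
    diffs = [a - b for a, b in zip(scores_rag, scores_no_rag)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired observations")
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(n))
    return t, n - 1


# Invented toy scores: 1 = correct, 0 = incorrect, same questions in both lists.
scores_rag = [1, 1, 1, 0, 1, 1]
scores_no_rag = [0, 1, 0, 0, 1, 0]
t_stat, dof = paired_t_statistic(scores_rag, scores_no_rag)
print(t_stat, dof)
```

Pairing matters because each question is answered under both settings, so the test operates on per-question differences rather than on two independent samples.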

Read-like-fullpaper digest

This paper addresses the observation that real-world financial analysis requires reasoning on two complementary information sources: company fundamentals and trading signals. The core method is a pipeline that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human–LLM judge alignment. Using a calibration-then-scaling pipeline, the authors combine 150 expert-authored seed questions (50 per category), each with golden key indicators, and scale them across firms and time periods to yield 1,400 total benchmark questions. Key empirical findings include that retrieval substantially improves performance on fundamentals-focused and hybrid reasoning questions but offers limited or negative gains for trading-signal questions.

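Detailed Summary (translated from KO)

Problem definition

Real-world financial analysis requires reasoning on two complementary information sources: company fundamentals and trading signals.
Figure 1: Performance comparison of proprietary LLMs on a trading signal-focused question. There was no pullback in Nvidia’s stock in July 2025, and it was not a lucrative buying opportunity; only Claude correctly identified the pullback component.
Company fundamentals are accounting-based metrics derived from company balance sheets or Securities and Exchange Commission (SEC) filings, such as profitability, leverage, and valuation ratios, that capture a company’s underlying financial health (Fama and French, 1992; Harvey et al., 2016).
In contrast, trading signals, computed from historical price and volume data, capture market dynamics and investor sentiment, including momentum, volatility, and trend reversals (Brock et al., 1992; Jegadeesh and Titman, 1993; Lo et al., 2000; Andersen et al., 2003; Park and Irwin, 2007; Choi, 2021).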

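Core idea & method

The benchmark is built with a pipeline that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human–LLM judge alignment.
We evaluate 14 LLMs under zero-shot prompting and retrieval-augmented settings and observe a clear performance gap.
Retrieval substantially improves reasoning over textual fundamentals, but provides limited benefit for trading-signal reasoning.
These findings highlight fundamental challenges in numerical and time-series reasoning for current LLMs and motivate future research in financial intelligence.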

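Experimental setup & results

Using a calibration-then-scaling pipeline, we combine 150 expert-authored seed questions (50 per category), each with golden key indicators, and scale them across firms and time periods to yield 1,400 total benchmark questions.
We benchmark 14 LLMs in zero-shot prompting and retrieval-augmented settings and observe a clear performance gap in financial reasoning.
Retrieval substantially improves performance on fundamentals-focused questions (37% higher accuracy) and hybrid reasoning questions (55% higher accuracy), but offers limited or negative gains for trading-signal questions derived from time-series data; see Table 2.
This suggests that while current LLMs can effectively leverage textual financial information, they struggle to interpret quantitative market dynamics.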

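Limitations & risks

Table 2 compares the RAG-based and No-RAG configurations of the 14 evaluated LLMs on FinTradeBench.
Paired t-tests on question-level correctness scores assess the statistical reliability of RAG-induced changes.
Table 3 (and Figure 5 in §H.3) complement this with global generative quality metrics, revealing how RAG reshapes model reasoning behavior beyond raw accuracy.
Our analysis surfaces the following findings: (1) RAG strongly benefits fundamental reasoning (F) and degrades trading signal (T) reasoning.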

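Read-like-fullpaper digest

This paper addresses the observation that real-world financial analysis requires reasoning on two complementary information sources: company fundamentals and trading signals. The core method is a pipeline that combines expert seed questions, multi-model response generation, intra-model self-filtering, numerical auditing, and human–LLM judge alignment. Using a calibration-then-scaling pipeline, the authors combine 150 expert-authored seed questions (50 per category), each with golden key indicators, and scale them across firms and time periods to yield 1,400 total benchmark questions.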