#5 GTO Wizard Benchmark

Score: 23.4 | Matched keywords: agent, ai, benchmark, large language models, llm, reasoning

Detailed Summary (EN)

Read-like-fullpaper digest

This paper starts from the observation that games have long been an integral part of artificial intelligence because they provide difficult but easily verifiable benchmarks for the skills expected of strong agents: reasoning, strategic planning, and sequential decision-making. Rather than building an isolated, highly optimized Heads-Up agent, the authors developed a general agent able to play a wide variety of two-player and multi-player scenarios, including varying stack sizes, cash game formats (rake, antes, etc.), and tournament configurations. The system evaluates agents using AIVAT [5], a provably unbiased variance reduction technique for assessing performance in imperfect information games, which reaches the same statistical significance with ten times less data.

The core proposal is a benchmark for evaluating algorithms in Heads-Up No-Limit Texas Hold’em (HUNL).

The empirical case is built around a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others. Although this initial release supports only the Heads-Up No-Limit Texas Hold’em format, the benchmark evaluates against a general poker agent capable of playing a variety of formats. Variance is a fundamental challenge in poker evaluation; the authors address it by integrating AIVAT [5], a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. (The extracted text carries the preprint stamp "[cs.AI] 24 Mar 2026".)

The paper motivates an automated benchmark by noting that human-versus-agent matches have long been a standard means of evaluation ([8, 3]), but they are costly and challenging to organize, which makes them feasible only for research initiatives with significant resources.

The paper also makes clear that the gap between superhuman play and current reasoning models remains significant, and that poker remains a challenging benchmark for multi-agent reasoning under partial observability. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, and the scope of its claims should be read in light of the evaluation setup and stated limitations.

Final takeaway

Main takeaway: Although this initial release supports only the Heads-Up No-Limit Texas Hold’em format, the benchmark evaluates against a general poker agent capable of playing a variety of formats.
Most important supporting result: The authors conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.
Important caution: The gap between superhuman play and current reasoning models remains significant.

Problem definition

Games have long been an integral part of the field of artificial intelligence because they provide difficult but easily verifiable benchmarks for the skills we would expect from strong agents: reasoning, strategic planning, and sequential decision-making.
Rather than building an isolated, highly optimized Heads-Up agent, we developed a general agent able to play a wide variety of two-player and multi-player scenarios, including varying stack sizes, cash game formats (rake, antes, etc.), and tournament configurations.
Our system evaluates agents using AIVAT [5], a provably unbiased variance reduction technique for assessing performance in imperfect information games, which allows agents to achieve the same statistical significance with ten times less data.
Due to this foundation, we have high ambitions for the future of the platform, including continually improving our agents, expanding to other game variants such as Pot-Limit Omaha, and introducing support for more than two players.
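
The "ten times less data" claim for AIVAT can be read through the standard confidence-interval relation between variance and sample size. The derivation below is a generic statistical sketch, not taken from the paper; the symbols z (confidence level), sigma (per-hand standard deviation), epsilon (interval half-width), and k (variance reduction factor) are introduced here for illustration only.

```latex
% Half-width of a z-level confidence interval after n hands,
% with per-hand standard deviation \sigma:
\[
  \varepsilon = \frac{z\,\sigma}{\sqrt{n}}
  \quad\Longrightarrow\quad
  n = \frac{z^{2}\sigma^{2}}{\varepsilon^{2}}
\]
% If an unbiased variance reduction technique lowers the per-hand
% variance from \sigma^{2} to \sigma^{2}/k, the same \varepsilon needs
\[
  n' = \frac{z^{2}\,(\sigma^{2}/k)}{\varepsilon^{2}} = \frac{n}{k}
\]
% With k = 10 this is "ten times fewer hands", which corresponds to a
% per-hand standard deviation smaller by a factor of \sqrt{10} \approx 3.16.
```

Under this reading, a tenfold saving in hands corresponds to a tenfold reduction in per-hand variance, provided the estimator stays unbiased.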

Core idea & method

A benchmark for evaluating algorithms in Heads-Up No-Limit Texas Hold’em (HUNL).

Actual findings

Importantly, though this initial release only supports the Heads-Up No-Limit Texas Hold’em format, our benchmark evaluates against a general poker agent capable of playing a variety of formats.
We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.

How the conclusion was reached

Step 1 - Proposed approach: A benchmark for evaluating algorithms in Heads-Up No-Limit Texas Hold’em (HUNL).
Step 2 - Evaluation setup or comparison basis: We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.
Step 3 - Main reported evidence: Importantly, though this initial release only supports the Heads-Up No-Limit Texas Hold’em format, our benchmark evaluates against a general poker agent capable of playing a variety of formats.
Step 4 - Additional supporting or qualifying result: Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT [5], a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation.
Step 5 - Claim boundary / limitation: The gap between superhuman play and current reasoning models remains significant.
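
To make the zero-shot setup in Step 2 concrete, here is a minimal sketch of how a single HUNL decision point might be turned into a prompt with no worked examples and how a model's reply might be parsed back into an action. Everything in it is an assumption made for illustration: the digest does not describe the benchmark's prompt format or API, and `DecisionPoint`, `build_prompt`, `parse_action`, and the `ACTION:` reply convention are hypothetical names.

```python
import re
from dataclasses import dataclass


@dataclass
class DecisionPoint:
    """Hypothetical snapshot of one decision in a heads-up hand (chips as ints)."""
    hole_cards: str
    board: str
    pot: int
    to_call: int
    stack: int
    legal_actions: tuple


def build_prompt(dp: DecisionPoint) -> str:
    """Zero-shot prompt: state and instructions only, no example hands."""
    return (
        "You are playing Heads-Up No-Limit Texas Hold'em.\n"
        f"Hole cards: {dp.hole_cards}\n"
        f"Board: {dp.board or 'none (preflop)'}\n"
        f"Pot: {dp.pot}  To call: {dp.to_call}  Your stack: {dp.stack}\n"
        f"Legal actions: {', '.join(dp.legal_actions)}\n"
        "Finish with one line of the form 'ACTION: <action> [amount]'."
    )


# Assumed reply convention; the amount is only meaningful for raises.
ACTION_RE = re.compile(r"ACTION:\s*(fold|check|call|raise)(?:\s+(\d+))?", re.IGNORECASE)


def parse_action(reply: str, dp: DecisionPoint) -> tuple:
    """Map free-form model text to (action, amount), falling back safely."""
    fallback = ("check", 0) if "check" in dp.legal_actions else ("fold", 0)
    matches = ACTION_RE.findall(reply)
    if not matches:
        return fallback
    action, amount = matches[-1]  # the last ACTION line wins
    action = action.lower()
    if action not in dp.legal_actions:
        return fallback
    if action == "raise":
        if not amount:
            return fallback
        return ("raise", min(int(amount), dp.stack))  # cap at an all-in
    if action == "call":
        return ("call", min(dp.to_call, dp.stack))
    return (action, 0)


dp = DecisionPoint("As Kd", "Qh Jh 2c", pot=100, to_call=50, stack=1000,
                   legal_actions=("fold", "call", "raise"))
print(build_prompt(dp))
print(parse_action("The draw is strong here. ACTION: raise 300", dp))
```

The fallback to check or fold is one possible design choice for malformed or illegal replies; a real harness might instead re-prompt or record the error separately.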

Experimental setup & results

Importantly, though this initial release only supports the Heads-Up No-Limit Texas Hold’em format, our benchmark evaluates against a general poker agent capable of playing a variety of formats.
We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.
Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT [5], a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation.
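
The idea behind unbiased variance reduction can be illustrated with a toy simulation. The sketch below is not AIVAT itself: it uses a plain control-variate correction on a made-up "hand" whose payoff is a small true edge plus a zero-mean luck term that the evaluator is assumed to observe. The function names, the edge of 0.05, and the noise scales (chosen so the variance ratio lands near ten) are all assumptions for illustration.

```python
import random
import statistics


def play_toy_hand(rng, true_edge=0.05):
    """Return (payoff, luck) for one toy hand.

    luck is a chance term with known mean zero (standing in for card luck);
    residual is the noise that the correction cannot explain.
    """
    luck = rng.gauss(0.0, 3.0)       # variance 9
    residual = rng.gauss(0.0, 1.0)   # variance 1
    return true_edge + luck + residual, luck


def evaluate(n_hands, seed=0):
    """Collect naive payoffs and luck-corrected payoffs over n_hands."""
    rng = random.Random(seed)
    raw, corrected = [], []
    for _ in range(n_hands):
        payoff, luck = play_toy_hand(rng)
        raw.append(payoff)
        # Subtracting a zero-mean term leaves the expected value unchanged
        # but removes the variance that term contributed.
        corrected.append(payoff - luck)
    return raw, corrected


raw, corrected = evaluate(20_000)
variance_ratio = statistics.variance(raw) / statistics.variance(corrected)
print(f"naive:     mean {statistics.mean(raw):+.3f}, variance {statistics.variance(raw):.2f}")
print(f"corrected: mean {statistics.mean(corrected):+.3f}, variance {statistics.variance(corrected):.2f}")
print(f"variance ratio (hands saved for equal precision): {variance_ratio:.1f}x")
```

Both estimators target the same true edge, but the corrected one has roughly one tenth of the variance by construction, so it would need roughly one tenth of the hands to reach the same confidence interval width.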
Human-versus-agent matches have long been a standard means for evaluation ([8, 3]), but they are costly and challenging to organize—meaning they are feasible only for research initiatives with significant resources.
This benchmark provides researchers with a precise and quantifiable setting to evaluate advances in planning and reasoning in multi-agent systems with partial observability.
Despite the growing interest in applying generalist AI agents and LLMs to games, a standardized platform for benchmarking their performance in poker has been lacking.

Limitations & risks

The gap between superhuman play and current reasoning models remains significant.
Ultimately, poker remains a challenging benchmark for multi-agent reasoning under partial observability.

Detailed Summary (KO, translated to English)

Read-like-fullpaper digest

This paper starts from the observation that games have long been an integral part of artificial intelligence because they provide difficult but easily verifiable benchmarks for the skills expected of strong agents: reasoning, strategic planning, and sequential decision-making. Rather than building an isolated, highly optimized Heads-Up agent, the authors developed a general agent able to play a wide variety of two-player and multi-player scenarios, including varying stack sizes, cash game formats (rake, antes, etc.), and tournament configurations. The system evaluates agents using AIVAT [5], a provably unbiased variance reduction technique for assessing performance in imperfect information games, which reaches the same statistical significance with ten times less data.

The core proposal is a benchmark for evaluating algorithms in Heads-Up No-Limit Texas Hold’em (HUNL).

The empirical case is built around a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others. Although this initial release supports only the Heads-Up No-Limit Texas Hold’em format, the benchmark evaluates against a general poker agent capable of playing a variety of formats. Variance is a fundamental challenge in poker evaluation; the authors address it by integrating AIVAT [5], which achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation. Human-versus-agent matches have long been a standard means of evaluation ([8, 3]), but they are costly and challenging to organize, which makes them feasible only for research initiatives with significant resources.

The paper also makes clear that the gap between superhuman play and current reasoning models remains significant, and that poker remains a challenging benchmark for multi-agent reasoning under partial observability. Overall, the paper is most convincing where its proposed method is directly supported by the reported comparisons, and the scope of its claims should be read in light of the evaluation setup and stated limitations.

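Final takeaway

Main takeaway: Although this initial release supports only the Heads-Up No-Limit Texas Hold’em format, the benchmark evaluates against a general poker agent capable of playing a variety of formats.
Most important supporting result: The authors conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.
Important caution: The gap between superhuman play and current reasoning models remains significant.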

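Problem definition

Games have long been an integral part of the field of artificial intelligence because they provide difficult but easily verifiable benchmarks for the skills we would expect from strong agents: reasoning, strategic planning, and sequential decision-making.
Rather than building an isolated, highly optimized Heads-Up agent, we developed a general agent able to play a wide variety of two-player and multi-player scenarios, including varying stack sizes, cash game formats (rake, antes, etc.), and tournament configurations.
Our system evaluates agents using AIVAT [5], a provably unbiased variance reduction technique for assessing performance in imperfect information games, which allows agents to achieve the same statistical significance with ten times less data.
Due to this foundation, we have high ambitions for the future of the platform, including continually improving our agents, expanding to other game variants such as Pot-Limit Omaha, and introducing support for more than two players.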

Core idea & method

A benchmark for evaluating algorithms in Heads-Up No-Limit Texas Hold’em (HUNL).

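Actual findings

Importantly, though this initial release only supports the Heads-Up No-Limit Texas Hold’em format, our benchmark evaluates against a general poker agent capable of playing a variety of formats.
We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.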

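How the conclusion was reached

Step 1 - Proposed approach: A benchmark for evaluating algorithms in Heads-Up No-Limit Texas Hold’em (HUNL).
Step 2 - Evaluation setup or comparison basis: We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.
Step 3 - Main reported evidence: Importantly, though this initial release only supports the Heads-Up No-Limit Texas Hold’em format, our benchmark evaluates against a general poker agent capable of playing a variety of formats.
Step 4 - Additional supporting or qualifying result: Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT [5], a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation.
Step 5 - Claim boundary / limitation: The gap between superhuman play and current reasoning models remains significant.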

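Experimental setup & results

Importantly, though this initial release only supports the Heads-Up No-Limit Texas Hold’em format, our benchmark evaluates against a general poker agent capable of playing a variety of formats.
We conduct a comprehensive benchmarking study of state-of-the-art large language models under zero-shot conditions, including GPT-5.4, Claude Opus 4.6, Gemini 3.1 Pro, Grok 4, and others.
Variance is a fundamental challenge in poker evaluation; we address this by integrating AIVAT [5], a provably unbiased variance reduction technique that achieves equivalent statistical significance with ten times fewer hands than naive Monte Carlo evaluation.
Human-versus-agent matches have long been a standard means for evaluation ([8, 3]), but they are costly and challenging to organize, meaning they are feasible only for research initiatives with significant resources.
This benchmark provides researchers with a precise and quantifiable setting to evaluate advances in planning and reasoning in multi-agent systems with partial observability.
Despite the growing interest in applying generalist AI agents and LLMs to games, a standardized platform for benchmarking their performance in poker has been lacking.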

Limitations & risks

The gap between superhuman play and current reasoning models remains significant.
Ultimately, poker remains a challenging benchmark for multi-agent reasoning under partial observability.