PokerBench Qwen3-14B + LoRA (Mixed 50/50)

Reproduction of Zhuang et al., AAAI 2025 — "PokerBench: Training Large Language Models to become Professional Poker Players", with two enhancements over the paper:

  1. Newer base model: Qwen3-14B (released April 2025) instead of the paper's Llama-3-8B
  2. Production-aware fine-tuning: a 50/50 mix of paper-format (RZ412/PokerBench) and a richer production prompt format that adds equity vs opponent range, SPR, opponent range descriptors, and blockers

The LoRA adapter is trained in two stages:

  • Stage 1: full LoRA on 60k pre-flop + 500k post-flop paper-format samples, 1 epoch (4375 steps) → reaches EM 90.07% / AA 90.55% on the 11k test set, +11.8 EM over the paper's Llama-3-8B baseline (78.26%)
  • Stage 2 (this checkpoint): continue training Stage 1's adapter on 10k paper + 10k production-PE-format records (50/50 mix) for 250 steps at LR 5e-5 → adapts to the rich production prompt format without forgetting paper format

Headline numbers

| Metric | Stage 1 only (paper-only LoRA) | This checkpoint (mixed 50/50) | Pure-PE adapter (control) |
|---|---|---|---|
| Paper 11k EM | 90.07% | 89.86% | 69.24% ⬇ |
| Paper 11k AA | 90.55% | 90.32% | 69.24% |
| Paper parse failures | 0 / 11,000 | 0 / 11,000 | 2,514 / 11,000 |
| Production-PE 200 AA | 61.5% | 84.0% | 83.5% |
| Production-PE preflop AA | 58.0% | 92.0% | 90.0% |
| Production-PE postflop AA | 65.0% | 76.0% | 77.0% |

The mixed checkpoint matches or beats both alternatives on the metrics that matter for deployment. Overall Production-PE AA edges out pure-PE training (84.0% vs 83.5%; preflop is 2 points higher, postflop 1 point lower). The paper benchmark is almost fully retained (-0.21 EM versus Stage 1, plausibly within noise on 11k samples). Parse failures stay at zero. The intuition is regularization: paper data anchors the action distribution and GTO behavior, while the 10k production-format samples teach format-specific cues without overfitting.
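As a rough sanity check on whether gaps of this size are distinguishable from sampling noise, the sketch below compares two accuracies using binomial standard errors. It treats the two evaluations as independent samples, which is a simplifying assumption: both checkpoints are scored on the same test items, so a paired test such as McNemar's would be more appropriate if per-item predictions are available. The function names are illustrative and not part of the released code.

```python
import math

def binomial_se(p: float, n: int) -> float:
    """Standard error of an accuracy p measured on n independent items."""
    return math.sqrt(p * (1.0 - p) / n)

def gap_in_se_units(p_a: float, p_b: float, n: int) -> float:
    """Difference p_a - p_b divided by its standard error.

    Assumes both accuracies come from independent samples of size n
    (an approximation; the real evaluations share the same items).
    """
    se_diff = math.sqrt(binomial_se(p_a, n) ** 2 + binomial_se(p_b, n) ** 2)
    return (p_a - p_b) / se_diff

# Paper 11k EM: Stage 1 only vs. mixed checkpoint (values from the table).
paper_gap = gap_in_se_units(0.9007, 0.8986, 11_000)

# Production-PE 200 AA: mixed checkpoint vs. pure-PE adapter.
prod_gap = gap_in_se_units(0.840, 0.835, 200)

print(f"paper EM gap: {paper_gap:.2f} SE, production AA gap: {prod_gap:.2f} SE")
```

Both gaps come out well under one standard error under this approximation, so neither the small paper-benchmark drop nor the small Production-PE lead should be read as a firm ranking.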

Usage

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3-14B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "ianlee1996/pokerbench-qwen3-14b-lora-mixed")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")

# Paper-format prompt (PokerBench dataset style)
instruction = """

You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.

Here is a game summary:

The small blind is 0.5 chips and the big blind is 1 chips. Everyone started with 100 chips.
The player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.
In this hand, your position is BTN, and your holding is [Ace of Heart and King of Heart].
Before the flop, there has been no action yet. Assume that all other players that is not mentioned folded.

Now it is your turn to make a move.
To remind you, the current pot size is 1.5 chips, and your holding is [Ace of Heart and King of Heart].

Decide on an action based on the strength of your hand on this board, your position, and actions before you. Do not explain your answer.
Your optimal action is:"""

system_prompt = (
    "You are a specialist in playing 6-handed No Limit Texas Holdem. "
    "Output ONLY the optimal action with no explanation. "
    "Valid formats: 'fold', 'check', 'call', 'bet N', 'raise N', 'all-in'."
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": instruction},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16, temperature=0.1, top_p=0.95, do_sample=True)
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
# expected: "raise 2.5" or similar

The same adapter also handles richer production prompts that include equity vs opponent range, SPR, opponent range labels, and blockers.

Training recipe

Stage 1 (full LoRA on paper format)

  • Data: full 60k preflop + 500k postflop train splits of RZ412/PokerBench (1 epoch ≈ 4375 steps)
  • LoRA: r=32, alpha=64, dropout=0.05, target=all-linear
  • Optimizer: paged_adamw_8bit, LR 2e-4, cosine schedule, warmup 0.03
  • Batch: effective 128 (4 per device × 32 grad accumulation)
  • Loss: TRL assistant_only_loss=True
  • Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB)
  • Time: ~22.5 hours
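
The step count and wall-clock figures above can be cross-checked with a few lines of arithmetic. This is only a sketch: the helper name `steps_per_epoch` is made up for illustration, and it assumes a single device and that a final partial batch (if any) still counts as a step.

```python
import math

def steps_per_epoch(num_samples: int, per_device_batch: int,
                    grad_accum: int, num_devices: int = 1) -> int:
    """Optimizer steps needed for one pass over num_samples."""
    effective_batch = per_device_batch * grad_accum * num_devices
    return math.ceil(num_samples / effective_batch)

# Stage 1: 60k preflop + 500k postflop samples, 4 per device x 32 accumulation.
stage1_samples = 60_000 + 500_000
stage1_steps = steps_per_epoch(stage1_samples, per_device_batch=4, grad_accum=32)

# Average time per optimizer step implied by the ~22.5 hour run.
seconds_per_step = 22.5 * 3600 / stage1_steps

print(stage1_steps, round(seconds_per_step, 1))
```

With an effective batch of 128, the 560k samples divide evenly into 4375 steps, which matches the figure quoted in the recipe.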

Stage 2 (this checkpoint, mixed 50/50)

  • Data: 10k paper-format records from PokerBench (5k preflop + 5k postflop) + 10k production-PE-format records (a mix of Chinese and English prompts). Production-PE records are generated by feeding PokerBench rows through a TypeScript prompt builder that computes equity vs opponent range (Monte Carlo over a buildOppRange-derived range), SPR, range labels, blockers, and draws, then lays them out in the production deployment template.
  • Same LoRA shape, but continued from Stage 1 adapter (PEFT is_trainable=True)
  • Optimizer: same as Stage 1 except LR lowered to 5e-5 (we're nudging an already-trained adapter, not training from scratch)
  • 250 steps at batch 128
  • Time: ~1.6 hours
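
One way to assemble a 50/50 mix like this is to sample each source to a fixed size and shuffle. The sketch below is not the repository's actual data pipeline; `build_mixed_set`, `passes_over_data`, the record representation, and the seed are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")

def build_mixed_set(paper_pre: Sequence[T], paper_post: Sequence[T],
                    production: Sequence[T], n_per_paper_split: int = 5_000,
                    n_production: int = 10_000, seed: int = 0) -> List[T]:
    """Sample equal amounts of paper-format and production-format records.

    Paper-format records are split evenly between preflop and postflop,
    then everything is shuffled so batches interleave both formats.
    """
    rng = random.Random(seed)
    mixed = (
        rng.sample(list(paper_pre), n_per_paper_split)
        + rng.sample(list(paper_post), n_per_paper_split)
        + rng.sample(list(production), n_production)
    )
    rng.shuffle(mixed)
    return mixed

def passes_over_data(steps: int, effective_batch: int, num_records: int) -> float:
    """How many times the training run cycles through the dataset."""
    return steps * effective_batch / num_records

# 250 steps at batch 128 over 20k records.
print(passes_over_data(250, 128, 20_000))
```

By this arithmetic, 250 steps at batch 128 consume 32,000 sequences, or about 1.6 passes over the 20k mixed records.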

Evaluation details

Paper benchmark

Standard PokerBench 11k test set (1k preflop + 10k postflop), near-greedy sampling (temperature=0.1, top_p=0.95, max_tokens=16). EM = exact match on action and size; AA = action category match (bet collapsed to raise, per the paper). This checkpoint produced 0 parse failures across all 11,000 samples.
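
The scoring rules can be sketched as follows. This is an illustrative reimplementation, not the eval harness from the repository: the regular expression, the function names, and the choices to require the exact verb for EM and to count unparseable outputs as wrong for both metrics are assumptions.

```python
import re
from typing import Dict, List, Optional, Tuple

# Accepts the output formats listed in the system prompt:
# 'fold', 'check', 'call', 'bet N', 'raise N', 'all-in'.
_ACTION_RE = re.compile(r"^(fold|check|call|all-in|bet|raise)(?:\s+(\d+(?:\.\d+)?))?$")

Action = Tuple[str, Optional[float]]

def parse_action(text: str) -> Optional[Action]:
    """Return (verb, size) or None when the text is not a valid action."""
    m = _ACTION_RE.match(text.strip().lower())
    if m is None:
        return None
    verb, size = m.group(1), m.group(2)
    if verb in ("bet", "raise"):
        return (verb, float(size)) if size is not None else None
    return (verb, None) if size is None else None

def category(verb: str) -> str:
    """Collapse 'bet' into 'raise' for action-accuracy scoring."""
    return "raise" if verb == "bet" else verb

def score(preds: List[str], golds: List[str]) -> Dict[str, float]:
    """Compute EM, AA, and the number of unparseable predictions."""
    em = aa = failures = 0
    for pred, gold in zip(preds, golds):
        gold_action = parse_action(gold)
        if gold_action is None:
            raise ValueError(f"unparseable gold label: {gold!r}")
        pred_action = parse_action(pred)
        if pred_action is None:
            failures += 1  # counted as wrong for both EM and AA
            continue
        aa += category(pred_action[0]) == category(gold_action[0])
        em += pred_action == gold_action
    n = len(golds)
    return {"em": em / n, "aa": aa / n, "parse_failures": failures}

result = score(
    ["raise 2.5", "bet 3", "call", "I think fold"],
    ["raise 2.5", "raise 3", "fold", "fold"],
)
print(result)
```

In the toy call, only the first prediction is an exact match, the second matches on category alone, the third is wrong, and the fourth fails to parse.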

Production-PE eval

200 samples (100 preflop + 100 postflop) drawn from the test split of PokerBench, then re-rendered with the production prompt format that adds: equity-vs-opponent-range (Monte Carlo over a buildOppRange-derived range), SPR, opponent range labels, blocker notes, draws (flush/oesd/gutshot + outs), available actions, and bet-option pct shortcuts. The eval tests AA on this richer prompt format that the model would see in a real deployment. Important: these 200 samples are taken from the test split (postflop_10k_test_set and preflop_1k_test_set) and were never seen during training — Stage 2 trained only on the train split.
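
Two of the simpler production-prompt features, SPR and pot-percentage bet shortcuts, can be illustrated in a few lines. The actual prompt builder is written in TypeScript and is not reproduced here; the Python below is a sketch under assumptions: the function names, the use of the effective (smaller) stack for SPR, and the particular percentages are all hypothetical.

```python
from typing import Dict, Sequence

def stack_to_pot_ratio(hero_stack: float, villain_stack: float, pot: float) -> float:
    """SPR using the effective stack, i.e. the smaller of the two stacks."""
    if pot <= 0:
        raise ValueError("pot must be positive")
    return min(hero_stack, villain_stack) / pot

def bet_options(pot: float, pcts: Sequence[int] = (33, 50, 75, 100)) -> Dict[int, float]:
    """Map each pot-percentage shortcut to a bet size in chips."""
    return {pct: round(pot * pct / 100, 2) for pct in pcts}

# Example: 6.5-chip pot, hero has 97.5 chips behind, villain has 120.
spr = stack_to_pot_ratio(hero_stack=97.5, villain_stack=120.0, pot=6.5)
options = bet_options(pot=6.5)
print(spr, options[50], options[100])
```

Equity versus an opponent range needs a hand evaluator and a range model, so it is left out of this sketch.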

Other artifacts

  • GitHub (full reproduction code, training scripts, eval harness, design spec, plan): https://github.com/IanLiYi1996/PokerBench
  • Stage 1-only adapter: same recipe minus Stage 2 — shipped via direct S3 to early adopters; reach out if you need it for paper-only deployments

Citation

If you use this adapter, cite the original paper:

@inproceedings{zhuang2025pokerbench,
  title={PokerBench: Training Large Language Models to become Professional Poker Players},
  author={Zhuang, Richard and Gupta, Akshat and Yang, Richard and Rahane, Aniket and Li, Zhengyu and Anumanchipalli, Gopala},
  booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
  year={2025},
  url={https://arxiv.org/abs/2501.08328}
}

License

This LoRA adapter is released under Apache-2.0, matching the Qwen/Qwen3-14B base model license. The PokerBench dataset is also Apache-2.0.

Framework versions

  • PEFT 0.19.1
  • transformers (Qwen3 chat template)
  • trl 1.5.1 (assistant_only_loss=True for masked SFT)