PokerBench Qwen3-14B + LoRA (Mixed 50/50)
Reproduction of Zhuang et al., AAAI 2025 — "PokerBench: Training Large Language Models to become Professional Poker Players", with two enhancements over the paper:
- Newer base model: Qwen3-14B (May 2025) instead of the paper's Llama-3-8B
- Production-aware fine-tuning: a 50/50 mix of paper-format (RZ412/PokerBench) and a richer production prompt format that adds equity vs opponent range, SPR, opponent range descriptors, and blockers
The LoRA adapter is trained in two stages:
- Stage 1: full LoRA on 60k pre-flop + 500k post-flop paper-format samples, 1 epoch (4375 steps) → reaches EM 90.07% / AA 90.55% on the 11k test set, +11.8 EM over the paper's Llama-3-8B baseline (78.26%)
- Stage 2 (this checkpoint): continue training Stage 1's adapter on 10k paper + 10k production-PE-format records (50/50 mix) for 250 steps at LR 5e-5 → adapts to the rich production prompt format without forgetting paper format
Headline numbers
| Metric | Stage 1 only (paper-only LoRA) | This checkpoint (mixed 50/50) | Pure-PE adapter (control) |
|---|---|---|---|
| Paper 11k EM | 90.07% | 89.86% ✅ | 69.24% ⬇ |
| Paper 11k AA | 90.55% | 90.32% | 69.24% |
| Paper parse failures | 0 / 11000 | 0 / 11000 | 2,514 / 11,000 |
| Production-PE 200 AA | 61.5% | 84.0% ⭐ | 83.5% |
| Production-PE preflop AA | 58.0% | 92.0% | 90.0% |
| Production-PE postflop AA | 65.0% | 76.0% | 77.0% |
The mixed checkpoint is the best overall trade-off. On the production-PE deployment metric it matches or beats pure-PE training overall (84.0% vs 83.5% AA) and on preflop (92.0% vs 90.0%), trailing by one point on postflop (76.0% vs 77.0%). The paper benchmark is essentially preserved (-0.21 EM relative to Stage 1, plausibly within noise on 11k samples), and parse failures stay at zero. The intuition is regularization: paper data anchors the action distribution and GTO behavior, while the 10k production-format samples teach format-specific cues without overfitting.
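One rough way to gauge whether a 0.21-point EM gap is within noise is to compare it with the binomial standard error of a single EM estimate on 11k samples. The sketch below assumes test items are independent trials and ignores that both models are scored on the same items, so it is a back-of-the-envelope check rather than a proper paired test.

```python
import math

n = 11_000          # size of the paper test set
em_stage1 = 0.9007  # Stage 1 EM, from the table above
em_mixed = 0.8986   # mixed-checkpoint EM, from the table above

# Standard error of one proportion, expressed in percentage points.
# Assumption: each test item is an independent Bernoulli trial.
se_pp = 100 * math.sqrt(em_stage1 * (1 - em_stage1) / n)

# Observed gap between the two checkpoints, in percentage points.
delta_pp = 100 * (em_mixed - em_stage1)

print(f"delta = {delta_pp:.2f} pp, single-model standard error = {se_pp:.2f} pp")
```

The gap is smaller than one standard error (roughly 0.3 points), which is consistent with reading it as noise.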
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
base = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3-14B",
torch_dtype=torch.bfloat16,
device_map="auto",
)
model = PeftModel.from_pretrained(base, "ianlee1996/pokerbench-qwen3-14b-lora-mixed")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
# Paper-format prompt (PokerBench dataset style)
instruction = """
You are a specialist in playing 6-handed No Limit Texas Holdem. The following will be a game scenario and you need to make the optimal decision.
Here is a game summary:
The small blind is 0.5 chips and the big blind is 1 chips. Everyone started with 100 chips.
The player positions involved in this game are UTG, HJ, CO, BTN, SB, BB.
In this hand, your position is BTN, and your holding is [Ace of Heart and King of Heart].
Before the flop, there has been no action yet. Assume that all other players that is not mentioned folded.
Now it is your turn to make a move.
To remind you, the current pot size is 1.5 chips, and your holding is [Ace of Heart and King of Heart].
Decide on an action based on the strength of your hand on this board, your position, and actions before you. Do not explain your answer.
Your optimal action is:"""
system_prompt = (
"You are a specialist in playing 6-handed No Limit Texas Holdem. "
"Output ONLY the optimal action with no explanation. "
"Valid formats: 'fold', 'check', 'call', 'bet N', 'raise N', 'all-in'."
)
messages = [
{"role": "system", "content": system_prompt},
{"role": "user", "content": instruction},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=16, temperature=0.1, top_p=0.95, do_sample=True)
print(tokenizer.decode(out[0, inputs.input_ids.shape[1]:], skip_special_tokens=True))
# expected: "raise 2.5" or similar
The same adapter also handles richer production prompts that include equity vs opponent range, SPR, opponent range labels, and blockers.
Training recipe
Stage 1 (full LoRA on paper format)
- Data: full 60k preflop + 500k postflop train splits of RZ412/PokerBench (1 epoch ≈ 4375 steps)
- LoRA: r=32, alpha=64, dropout=0.05, target=all-linear
- Optimizer: paged_adamw_8bit, LR 2e-4, cosine schedule, warmup 0.03
- Batch: effective 128 (4 per device × 32 grad accumulation)
- Loss: TRL assistant_only_loss=True
- Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB)
- Time: ~22.5 hours
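The Stage 1 step count follows directly from the dataset size and the effective batch size. A quick arithmetic check, with variable names chosen here for illustration (they are not taken from the training scripts):

```python
# Sanity check of the Stage 1 schedule using the numbers listed above.
preflop_samples = 60_000
postflop_samples = 500_000
per_device_batch = 4
grad_accumulation = 32
num_devices = 1  # single GPU, per the hardware line

# Effective batch = per-device batch x accumulation steps x devices.
effective_batch = per_device_batch * grad_accumulation * num_devices

# One epoch visits every training sample once.
steps_per_epoch = (preflop_samples + postflop_samples) // effective_batch

print(effective_batch)  # 128
print(steps_per_epoch)  # 4375
```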
Stage 2 (this checkpoint, mixed 50/50)
- Data: 10k paper-format from PokerBench (5k pre + 5k post) + 10k production-PE-format records (zh/en mixed). Production-PE records are generated by feeding PokerBench rows through a TypeScript prompt builder that computes equity vs opponent range (Monte Carlo over a buildOppRange-derived range), SPR, range labels, blockers, draws, and lays them out in the production deployment template.
- Same LoRA shape, but continued from the Stage 1 adapter (PEFT is_trainable=True)
- Optimizer: same as Stage 1 except LR lowered to 5e-5 (we're nudging an already-trained adapter, not training from scratch)
- 250 steps at batch 128
- Time: ~1.6 hours
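A 50/50 mix like the one described above can be assembled by sampling an equal number of records from each source and shuffling them together. This is a minimal sketch under assumptions: `build_mixed_dataset`, the `source` tag, and the record fields are hypothetical, and the card does not say how the actual sampling or shuffling was done (for example, the 5k pre + 5k post stratification of the paper records is not reproduced here).

```python
import random

def build_mixed_dataset(paper_records, production_records, n_each, seed=0):
    """Sample n_each records from each source and shuffle them together."""
    rng = random.Random(seed)
    paper = rng.sample(paper_records, n_each)
    production = rng.sample(production_records, n_each)
    # Tag each record with its origin so the mix ratio can be verified later.
    mixed = [{"source": "paper", **r} for r in paper]
    mixed += [{"source": "production", **r} for r in production]
    rng.shuffle(mixed)
    return mixed

# Toy stand-ins for the real records (hypothetical field names).
paper_pool = [{"prompt": f"paper-{i}", "answer": "fold"} for i in range(50)]
prod_pool = [{"prompt": f"prod-{i}", "answer": "call"} for i in range(50)]
mixed = build_mixed_dataset(paper_pool, prod_pool, n_each=10)

# Stage 2 budget from the list above: 250 steps at batch 128 over 20k records.
samples_seen = 250 * 128
epochs = samples_seen / 20_000
print(samples_seen, epochs)  # 32000 1.6
```

With these numbers, Stage 2 makes about 1.6 passes over the 20k mixed records.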
Evaluation details
Paper benchmark
Standard PokerBench 11k test set (1k preflop + 10k postflop), near-greedy decoding via low-temperature sampling (temperature=0.1, top_p=0.95, max_tokens=16). EM = exact action+size match; AA = action category match (bet collapsed to raise per the paper). 0 parse failures across all 11,000 samples.
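The two metrics can be sketched as follows. This is an illustrative scorer, not the repository's eval harness: `parse_action` and `score` are hypothetical names, the accepted formats are the ones listed in the system prompt in the Usage section, sizes are compared as floats, and unparseable predictions are assumed to count as wrong for both metrics.

```python
import re

# Accepted outputs: fold, check, call, all-in, "bet N", "raise N".
ACTION_RE = re.compile(r"^(fold|check|call|all-in|(?:bet|raise)\s+\d+(?:\.\d+)?)$")

def parse_action(text):
    """Return (verb, size) or None when the text is not a valid action."""
    cleaned = text.strip().lower()
    if not ACTION_RE.match(cleaned):
        return None
    parts = cleaned.split()
    size = float(parts[1]) if len(parts) == 2 else None
    return parts[0], size

def category(verb):
    """Collapse bet into raise for action accuracy."""
    return "raise" if verb == "bet" else verb

def score(predictions, references):
    """Compute EM, AA, and parse failures; references are assumed valid."""
    em = aa = failures = 0
    for pred, ref in zip(predictions, references):
        p, r = parse_action(pred), parse_action(ref)
        if p is None:
            failures += 1  # counted as wrong for both EM and AA
            continue
        em += p == r
        aa += category(p[0]) == category(r[0])
    n = len(references)
    return {"em": em / n, "aa": aa / n, "parse_failures": failures}

result = score(
    ["raise 2.5", "bet 10", "call", "I would fold"],
    ["raise 2.5", "raise 10", "fold", "fold"],
)
print(result)  # {'em': 0.25, 'aa': 0.5, 'parse_failures': 1}
```

In the toy example, "bet 10" against "raise 10" misses EM but counts for AA, and the free-text answer is a parse failure.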
Production-PE eval
200 samples (100 preflop + 100 postflop) drawn from the PokerBench test split (postflop_10k_test_set and preflop_1k_test_set), then re-rendered in the production prompt format, which adds: equity vs opponent range (Monte Carlo over a buildOppRange-derived range), SPR, opponent range labels, blocker notes, draws (flush/OESD/gutshot + outs), available actions, and bet-option percentage shortcuts. The eval measures AA on the richer prompt format that the model would see in a real deployment. These samples were never seen during training: Stage 2 trained only on the train split.
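Two of the simpler production features can be illustrated in a few lines. The production prompt builder is written in TypeScript and is not shown in this card, so the Python below is only a sketch under assumptions: the function names are hypothetical, SPR is taken as the usual effective stack divided by pot, and the percentage shortcuts (33/50/75/100% of pot) are example values rather than the deployed ones.

```python
def stack_to_pot_ratio(effective_stack, pot):
    """SPR: chips left behind relative to the current pot."""
    if pot <= 0:
        raise ValueError("pot must be positive")
    return effective_stack / pot

def bet_options(pot, percents=(33, 50, 75, 100)):
    """Map pot-percentage shortcuts to chip amounts (assumed percentages)."""
    return {f"{pct}%": round(pot * pct / 100, 2) for pct in percents}

spr = stack_to_pot_ratio(effective_stack=95.0, pot=10.0)
options = bet_options(pot=10.0)

print(spr)      # 9.5
print(options)  # {'33%': 3.3, '50%': 5.0, '75%': 7.5, '100%': 10.0}
```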
Other artifacts
- GitHub (full reproduction code, training scripts, eval harness, design spec, plan): https://github.com/IanLiYi1996/PokerBench
- Stage 1-only adapter: same recipe minus Stage 2 — shipped via direct S3 to early adopters; reach out if you need it for paper-only deployments
Citation
If you use this adapter, cite the original paper:
@inproceedings{zhuang2025pokerbench,
title={PokerBench: Training Large Language Models to become Professional Poker Players},
author={Zhuang, Richard and Gupta, Akshat and Yang, Richard and Rahane, Aniket and Li, Zhengyu and Anumanchipalli, Gopala},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
year={2025},
url={https://arxiv.org/abs/2501.08328}
}
License
This LoRA adapter is released under Apache-2.0, matching the Qwen/Qwen3-14B base model license. The PokerBench dataset is also Apache-2.0.
Framework versions
- PEFT 0.19.1
- transformers (Qwen3 chat template)
- trl 1.5.1 (assistant_only_loss=True for masked SFT)