Poker & Blackjack Vision AI — Gemma 4 26B-A4B LoRA
Fine-tuned Gemma 4 26B-A4B (vision MoE: 26B total params, 4B active per token) for poker and blackjack decision-making.
What This Model Does
Given a poker or blackjack game state, the model outputs the optimal action (fold, call, raise, hit, stand, etc.) as JSON.
This is a vision model — it can process card images directly, enabling camera-to-decision pipelines (e.g., smart glasses → see cards → optimal play).
Training Details
- Base model:
unsloth/gemma-4-26B-A4B-it(MoE with vision encoder) - Method: LoRA (r=16, alpha=32) on q/k/v/o/gate/up/down projections
- Data: 12,848 examples (3,072 poker + 9,776 blackjack)
- Poker: Winning decisions from TAGBot, EquityBot, ExploitBot
- Blackjack: Counter (I18) optimal strategy
- Training: 3 epochs, 2,289 steps on NVIDIA A6000 48GB
- Final metrics: Loss 0.109, Token accuracy 95.95%
- Cost: ~$3.73 on RunPod
Poker Prompt Format
You are a specialist in playing 6-handed No Limit Texas Holdem. Do not explain your answer.
Game summary:
- Small blind: 5 chips, Big blind: 10 chips
- Your position: BTN, Your holding: As Ks
- Board: 7c 4d 2h
- Pot: 75, To call: 0, Your stack: 970
- Equity: 73%, Pot odds: 0%
- Legal actions: check, raise, all_in
Respond with ONLY valid JSON: {"action": "fold|check|call|raise|all_in", "amount": 0}
Blackjack Prompt Format
You are a blackjack expert. Decide the best action.
Your cards: Ace, 6 (total: 17 soft)
Dealer showing: 9
Available actions: hit, stand, double
Respond with ONLY the action word.
Usage with llama.cpp (recommended for Mac)
# Merge LoRA → GGUF Q3_K_M (~11GB, fits 16GB Mac)
# Then serve:
llama-server --model gemma4-poker-26b-q3_k_m.gguf --port 8080 --n-gpu-layers 999 --ctx-size 2048 --jinja
Important: Disable thinking mode for fast responses:
{"chat_template_kwargs": {"enable_thinking": false}}
Arena Results (E4B version, 1000 hands)
| Metric | Value |
|---|---|
| BB/100 | -0.1 (breakeven) |
| VPIP | 80.5% |
| Style | LAG (loose-aggressive) |
| vs CallingStation | +9.3 bb/100 |
Note: Model plays too many hands (VPIP too high). GRPO reinforcement learning is the planned fix.
Part of the Flywheel
This model is part of an iterative training loop:
- Bots play → generate winning decisions → SFT training (this model)
- Model plays in arena → find weaknesses → GRPO with reward functions
- Retrain → better model → repeat
Links
- E4B version (smaller, 7.5B)
- Built with the poker/blackjack arena platform
Model tree for waltgrace/poker-gemma4-26b-a4b-lora
Base model
google/gemma-4-26B-A4B Finetuned
google/gemma-4-26B-A4B-it Finetuned
unsloth/gemma-4-26B-A4B-it