Poker & Blackjack Vision AI — Gemma 4 26B-A4B LoRA

Fine-tuned Gemma 4 26B-A4B (vision MoE: 26B total params, 4B active per token) for poker and blackjack decision-making.

What This Model Does

Given a poker or blackjack game state, the model outputs the optimal action (fold, call, raise, hit, stand, etc.) as JSON.

This is a vision model — it can process card images directly, enabling camera-to-decision pipelines (e.g., smart glasses → see cards → optimal play).

Training Details

Base model: unsloth/gemma-4-26B-A4B-it (MoE with vision encoder)
Method: LoRA (r=16, alpha=32) on q/k/v/o/gate/up/down projections
Data: 12,848 examples (3,072 poker + 9,776 blackjack)
- Poker: Winning decisions from TAGBot, EquityBot, ExploitBot
- Blackjack: Counter (I18) optimal strategy
Training: 3 epochs, 2,289 steps on NVIDIA A6000 48GB
Final metrics: Loss 0.109, Token accuracy 95.95%
Cost: ~$3.73 on RunPod

Poker Prompt Format

You are a specialist in playing 6-handed No Limit Texas Holdem. Do not explain your answer.

Game summary:
- Small blind: 5 chips, Big blind: 10 chips
- Your position: BTN, Your holding: As Ks
- Board: 7c 4d 2h
- Pot: 75, To call: 0, Your stack: 970
- Equity: 73%, Pot odds: 0%
- Legal actions: check, raise, all_in

Respond with ONLY valid JSON: {"action": "fold|check|call|raise|all_in", "amount": 0}

Blackjack Prompt Format

You are a blackjack expert. Decide the best action.
Your cards: Ace, 6 (total: 17 soft)
Dealer showing: 9
Available actions: hit, stand, double
Respond with ONLY the action word.

Usage with llama.cpp (recommended for Mac)

# Merge LoRA → GGUF Q3_K_M (~11GB, fits 16GB Mac)
# Then serve:
llama-server --model gemma4-poker-26b-q3_k_m.gguf --port 8080 --n-gpu-layers 999 --ctx-size 2048 --jinja

Important: Disable thinking mode for fast responses:

{"chat_template_kwargs": {"enable_thinking": false}}

Arena Results (E4B version, 1000 hands)

Metric	Value
BB/100	-0.1 (breakeven)
VPIP	80.5%
Style	LAG (loose-aggressive)
vs CallingStation	+9.3 bb/100

Note: Model plays too many hands (VPIP too high). GRPO reinforcement learning is the planned fix.

Part of the Flywheel

This model is part of an iterative training loop:

Bots play → generate winning decisions → SFT training (this model)
Model plays in arena → find weaknesses → GRPO with reward functions
Retrain → better model → repeat

Model tree for waltgrace/poker-gemma4-26b-a4b-lora

Base model

google/gemma-4-26B-A4B

Finetuned

google/gemma-4-26B-A4B-it

Finetuned

unsloth/gemma-4-26B-A4B-it

Adapter

(4)

this model

waltgrace
/

poker-gemma4-26b-a4b-lora