---
language:
- en
tags:
- reinforcement-learning
- wordle
- game-ai
- grpo
- qwen2
- text-generation
- fine-tuned
license: apache-2.0
base_model: Qwen/Qwen2-0.5B-Instruct
---

# 🎯 Wordle AI — Fine-tuned with GRPO

> A language model trained to play Wordle using Group Relative Policy Optimization (GRPO) reinforcement learning.

[![Model](https://img.shields.io/badge/Model-Qwen2--0.5B-blue)](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](https://opensource.org/licenses/Apache-2.0)
[![Framework](https://img.shields.io/badge/Framework-Transformers-orange)](https://huggingface.co/docs/transformers)
[![RL Algorithm](https://img.shields.io/badge/RL-GRPO-purple)](https://huggingface.co/docs/trl)

---

## 📖 Overview

This model is a fine-tuned version of **Qwen2-0.5B-Instruct** trained to play the popular word game **Wordle** using reinforcement learning. Rather than learning from human examples through supervised training, it learned from reward signals alone, with the GRPO algorithm updating its policy game by game.

The reward design encourages strategies such as:
- Opening with words built from common letters, such as **CRANE** or **SLATE**
- Keeping green letters in their confirmed positions in subsequent guesses
- Moving yellow letters to a different position
- Avoiding previously guessed words
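
As a rough illustration of the strategies above, the sketch below checks whether a candidate guess respects earlier feedback. The function name `follows_strategies` and the `(guess, feedback)` history format are hypothetical and not part of the training code; gray letters are deliberately ignored to keep the example short.

```python
def follows_strategies(candidate, history):
    """Return True if `candidate` respects the feedback collected so far.

    `history` is a list of (guess, feedback) pairs, where feedback is a
    5-character string over G/Y/X (hypothetical format for this sketch).
    """
    candidate = candidate.lower()
    for guess, feedback in history:
        guess = guess.lower()
        if candidate == guess:
            return False  # do not repeat a previous guess
        for i, (letter, mark) in enumerate(zip(guess, feedback)):
            if mark == "G" and candidate[i] != letter:
                return False  # green letters must stay in place
            if mark == "Y" and (letter not in candidate or candidate[i] == letter):
                return False  # yellow letters must be reused in a new slot
    return True


history = [("crane", "XYGXX")]
print(follows_strategies("sharp", history))  # keeps 'a' in place, moves 'r'
print(follows_strategies("brain", history))  # leaves 'r' in the same slot
```

A filter like this could be used to analyze how often the model's guesses are consistent with the feedback it has already seen.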

---

## 🏗️ Model Details

| Property | Value |
|---|---|
| **Base Model** | Qwen/Qwen2-0.5B-Instruct |
| **Model Size** | 0.5B parameters |
| **Tensor Type** | F16 |
| **Training Algorithm** | GRPO (Group Relative Policy Optimization) |
| **Training Games** | 20 |
| **Hardware** | NVIDIA T4 GPU |
| **Framework** | Hugging Face Transformers + TRL |
| **Environment** | OpenEnv + TextArena Wordle |

---

## 🎮 What is Wordle?

Wordle is a word guessing game where:
- A secret **5-letter word** is chosen
- You have **6 attempts** to guess it
- After each guess you get color-coded feedback:
  - 🟢 **G (Green)** — correct letter, correct position
  - 🟡 **Y (Yellow)** — correct letter, wrong position
  - ⬛ **X (Gray)** — letter not in the word
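
The feedback rules above can be sketched in a few lines of Python. This is a minimal illustration, not the TextArena implementation: the function name `score_guess` is hypothetical, and the handling of repeated letters (each letter in the secret can be matched at most once) follows the usual Wordle convention, which is an assumption here.

```python
from collections import Counter


def score_guess(guess, secret):
    """Return a G/Y/X feedback string for `guess` against `secret`."""
    guess, secret = guess.lower(), secret.lower()
    feedback = ["X"] * len(guess)

    # Secret letters that were not matched exactly stay available for yellows.
    remaining = Counter(s for g, s in zip(guess, secret) if g != s)

    # First pass: mark exact matches as green.
    for i, (g, s) in enumerate(zip(guess, secret)):
        if g == s:
            feedback[i] = "G"

    # Second pass: mark misplaced letters as yellow, consuming each
    # available secret letter at most once.
    for i, g in enumerate(guess):
        if feedback[i] != "G" and remaining[g] > 0:
            feedback[i] = "Y"
            remaining[g] -= 1

    return "".join(feedback)


print(score_guess("crane", "sharp"))
```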

---

## 🏆 Reward System

The model was trained using 5 reward signals:

| Signal | Reward | Description |
|---|---|---|
| Win the game | +1.0 | All 5 letters correct (GGGGG) |
| Green letters | +0.3 | Correct letter in correct position |
| Yellow letters | +0.1 | Correct letter in wrong position |
| New guess | +0.3 | Not repeating a previous guess |
| Valid length | +0.2 | Guess is exactly 5 letters |
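
One way the five signals could be combined into a single scalar is sketched below. The weights come from the table; everything else is an assumption made for illustration, in particular the function name `wordle_reward`, summing the signals, and scaling the green and yellow rewards by the share of letters with that color. The actual training code may combine them differently.

```python
def wordle_reward(guess, feedback, previous_guesses):
    """Combine the five reward signals into one number (illustrative only)."""
    guess = guess.lower()
    reward = 0.0

    if feedback == "GGGGG":
        reward += 1.0  # win the game

    # ASSUMPTION: letter rewards are scaled by the fraction of matching letters.
    reward += 0.3 * feedback.count("G") / 5  # green letters
    reward += 0.1 * feedback.count("Y") / 5  # yellow letters

    if guess not in previous_guesses:
        reward += 0.3  # new guess

    if len(guess) == 5 and guess.isalpha():
        reward += 0.2  # guess is exactly 5 letters

    return reward


print(wordle_reward("crane", "XYGXX", previous_guesses=["slate"]))
```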

---

## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "shaikabdulfahad/wordle-qwen2-mini",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "shaikabdulfahad/wordle-qwen2-mini"
)

# System prompt
system_prompt = """You are an expert Wordle solver.
Guess a 5-letter English word each turn.
Feedback: G=correct position, Y=wrong position, X=not in word.
Only respond with your guess in square brackets. Example: [crane]"""

# Ask for a guess
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user",   "content": "Start! What is your first guess?"},
]

text   = tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        temperature=0.7,
        do_sample=True,
    )

response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print("Model guesses:", response)
```
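
Because the system prompt asks for the guess in square brackets, the raw response usually needs a small parsing step before it can be sent to a Wordle environment. The helper below is a minimal sketch; `extract_guess` is a hypothetical name, and it assumes the model follows the `[crane]` format from the prompt.

```python
import re


def extract_guess(response):
    """Return the first bracketed 5-letter word, lowercased, or None."""
    match = re.search(r"\[([A-Za-z]{5})\]", response)
    return match.group(1).lower() if match else None


print(extract_guess("My guess is [SLATE]."))
print(extract_guess("no brackets here"))
```

Returning `None` for malformed output lets the caller decide whether to retry generation or count the turn as invalid.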

---

## 🔁 Training Pipeline

```
1. Connect to live Wordle environment (TextArena)
       ↓
2. Generate guess using current model
       ↓
3. Send guess to Wordle — get feedback (G/Y/X)
       ↓
4. Calculate reward from 5 signals
       ↓
5. Update model using GRPO
       ↓
6. Repeat for 20 games
```
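
Step 5 relies on the core idea of GRPO: several completions are sampled for the same prompt, and each one is scored relative to the rest of its group instead of against a learned value function. The sketch below shows only that normalization step in plain Python. The function name `group_relative_advantages` and the `eps` value are assumptions for illustration, and it uses the population standard deviation; library implementations such as TRL may differ in these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one group of sampled completions."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps avoids division by zero when every completion gets the same reward.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Rewards for four guesses sampled from the same game state.
advantages = group_relative_advantages([1.8, 0.58, 0.28, 0.58])
print(advantages)
```

Completions with above-average reward get a positive advantage and are made more likely by the update; below-average ones are made less likely.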

---

## 📦 Built With

| Tool | Purpose |
|---|---|
| [OpenEnv](https://github.com/meta-pytorch/OpenEnv) | RL environment framework |
| [TextArena](https://huggingface.co/spaces/burtenshaw/textarena) | Live Wordle environment |
| [Hugging Face Transformers](https://huggingface.co/docs/transformers) | Model loading and inference |
| [TRL](https://huggingface.co/docs/trl) | Reinforcement learning for LLMs |
| [Google Colab](https://colab.research.google.com) | Training hardware (T4 GPU) |

---

## ⚠️ Limitations

- Trained for only 20 games; more training would likely improve performance
- Uses a 0.5B parameter model; larger models would likely learn better strategies
- Training on a T4 GPU limits batch size and training speed
- The model still occasionally repeats guesses despite the +0.3 reward for new guesses

---

## 🔮 Future Improvements

- Train for 1000+ games on an A100 GPU
- Use a larger model (Qwen2-7B or Qwen3-1.7B)
- Add an explicit penalty for repeated guesses, on top of the existing new-guess reward
- Implement multi-turn conversation memory
- Train on more word games (Quordle, Wordle variants)

---

## 👤 Author

**Shaik Abdul Fahad**
- 🤗 Hugging Face: [shaikabdulfahad](https://huggingface.co/shaikabdulfahad)
- 📦 Spaces: [Word Game](https://huggingface.co/spaces/shaikabdulfahad/word-game-env) | [Echo Env](https://huggingface.co/spaces/shaikabdulfahad/reverse-echo-env)

Built as part of the **OpenEnv Course** — learning to build and deploy RL environments for LLM training.

---

## 📄 License

This model is released under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0).