---
language:
- en
tags:
- reinforcement-learning
- wordle
- game-ai
- grpo
- qwen2
- text-generation
- fine-tuned
license: apache-2.0
base_model: Qwen/Qwen2-0.5B-Instruct
---

# 🎯 Wordle AI: Fine-tuned with GRPO

> A language model trained to play Wordle using Group Relative Policy Optimization (GRPO) reinforcement learning.

[![Model](https://img.shields.io/badge/Model-Qwen2--0.5B-blue)](https://huggingface.co/Qwen/Qwen2-0.5B-Instruct)
[![License](https://img.shields.io/badge/License-Apache%202.0-green)](https://opensource.org/licenses/Apache-2.0)
[![Framework](https://img.shields.io/badge/Framework-Transformers-orange)](https://huggingface.co/docs/transformers)
[![RL Algorithm](https://img.shields.io/badge/RL-GRPO-purple)](https://huggingface.co/docs/trl)

---

## 📖 Overview

This model is a fine-tuned version of **Qwen2-0.5B-Instruct** trained to play the popular word game **Wordle** using reinforcement learning. Rather than learning from human examples through supervised fine-tuning, the model learned purely from reward signals, improving its strategy game by game through the GRPO algorithm.

The training encourages strategies such as:

- Opening with strong starting words like **CRANE** or **SLATE**
- Keeping green letters in place in subsequent guesses
- Moving yellow letters to new positions
- Never repeating previously guessed words

---

## 🏗️ Model Details

| Property | Value |
|---|---|
| **Base Model** | Qwen/Qwen2-0.5B-Instruct |
| **Model Size** | 0.5B parameters |
| **Tensor Type** | F16 |
| **Training Algorithm** | GRPO (Group Relative Policy Optimization) |
| **Training Games** | 20 |
| **Hardware** | NVIDIA T4 GPU |
| **Framework** | Hugging Face Transformers + TRL |
| **Environment** | OpenEnv + TextArena Wordle |

---

## 🎮 What is Wordle?
Wordle is a word-guessing game where:

- A secret **5-letter word** is chosen
- You have **6 attempts** to guess it
- After each guess you get color-coded feedback:
  - 🟢 **G (Green)**: correct letter, correct position
  - 🟡 **Y (Yellow)**: correct letter, wrong position
  - ⬛ **X (Gray)**: letter not in the word

---

## 🏆 Reward System

The model was trained using 5 reward signals:

| Signal | Reward | Description |
|---|---|---|
| Win the game | +1.0 | All 5 letters correct (GGGGG) |
| Green letters | +0.3 | Correct letter in correct position |
| Yellow letters | +0.1 | Correct letter in wrong position |
| New guess | +0.3 | Not repeating a previous guess |
| Valid word | +0.2 | Guess is exactly 5 letters |

---

## 🚀 Quick Start

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    "shaikabdulfahad/wordle-qwen2-mini",
    torch_dtype=torch.float16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(
    "shaikabdulfahad/wordle-qwen2-mini"
)

# System prompt
system_prompt = """You are an expert Wordle solver.
Guess a 5-letter English word each turn.
Feedback: G=correct position, Y=wrong position, X=not in word.
Only respond with your guess in square brackets. Example: [crane]"""

# Ask for a guess
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": "Start! What is your first guess?"},
]

# Render the chat messages into the prompt format the model expects
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=20,
        temperature=0.7,
        do_sample=True,
    )

# Decode only the newly generated tokens, skipping the prompt
response = tokenizer.decode(
    outputs[0][inputs["input_ids"].shape[1]:],
    skip_special_tokens=True
)
print("Model guesses:", response)
```

---

## 🔁 Training Pipeline

```
1. Connect to live Wordle environment (TextArena)
        ↓
2. Generate guess using current model
        ↓
3. Send guess to Wordle, get feedback (G/Y/X)
        ↓
4. Calculate reward from 5 signals
        ↓
5. Update model using GRPO
        ↓
6. Repeat for 20 games
```

---

## 📦 Built With

| Tool | Purpose |
|---|---|
| [OpenEnv](https://github.com/meta-pytorch/OpenEnv) | RL environment framework |
| [TextArena](https://huggingface.co/spaces/burtenshaw/textarena) | Live Wordle environment |
| [Hugging Face Transformers](https://huggingface.co/docs/transformers) | Model loading and inference |
| [TRL](https://huggingface.co/docs/trl) | Reinforcement learning for LLMs |
| [Google Colab](https://colab.research.google.com) | Training hardware (T4 GPU) |

---

## ⚠️ Limitations

- Trained for only 20 games; more training would likely improve performance significantly
- Uses a 0.5B parameter model; larger models would likely learn better strategies
- Training on a T4 GPU limits batch size and training speed
- Model still occasionally repeats guesses despite the +0.3 reward for new guesses

---

## 🔮 Future Improvements

- Train for 1000+ games on an A100 GPU
- Use a larger model (Qwen2-7B or Qwen3-1.7B)
- Strengthen the incentive against repeated guesses
- Implement multi-turn conversation memory
- Train on more word games (Quordle, Wordle variants)

---

## 👤 Author

**Shaik Abdul Fahad**

- 🤗 Hugging Face: [shaikabdulfahad](https://huggingface.co/shaikabdulfahad)
- 📦 Spaces: [Word Game](https://huggingface.co/spaces/shaikabdulfahad/word-game-env) | [Echo Env](https://huggingface.co/spaces/shaikabdulfahad/reverse-echo-env)

Built as part of the **OpenEnv Course** on building and deploying RL environments for LLM training.

---

## 📄 License

This model is released under the [Apache 2.0 License](https://opensource.org/licenses/Apache-2.0).
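## 🧮 Reward Sketch

The five signals listed under Reward System could be combined roughly as follows. This is a minimal sketch, not the actual training code: `score_guess` and the constant names are hypothetical, and it assumes the green and yellow rewards are paid per letter, which this card does not specify.

```python
from typing import List

# Reward weights copied from the Reward System table.
WIN_REWARD = 1.0
GREEN_REWARD = 0.3
YELLOW_REWARD = 0.1
NEW_GUESS_REWARD = 0.3
VALID_REWARD = 0.2


def score_guess(guess: str, feedback: str, previous: List[str]) -> float:
    """Hypothetical helper: combine the five signals for one guess.

    Assumption: green and yellow rewards apply once per matching letter.
    `feedback` is a 5-character string of G/Y/X, `previous` holds earlier guesses.
    """
    guess = guess.strip().lower()
    reward = 0.0
    if len(guess) == 5 and guess.isalpha():
        reward += VALID_REWARD      # guess is exactly 5 letters
    if guess not in previous:
        reward += NEW_GUESS_REWARD  # not a repeat of an earlier guess
    reward += GREEN_REWARD * feedback.count("G")
    reward += YELLOW_REWARD * feedback.count("Y")
    if feedback == "GGGGG":
        reward += WIN_REWARD        # all five letters correct
    return round(reward, 2)
```

Under these assumptions, `score_guess("slate", "XYXXG", ["crane"])` returns 0.9: 0.2 for a valid guess, 0.3 for a new guess, 0.3 for one green letter, and 0.1 for one yellow letter.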