---
license: other
license_name: lfm1.0
license_link: https://www.liquid.ai/legal/lfm-license
base_model:
- anakin87/LFM2-2.6B-ttt-sft
library_name: peft
tags:
- rl
- cispo
- tictactoe
- rlvr
- lora
pipeline_tag: text-generation
language:
- en
---

# LFM2-2.6B-ttt-rl

LoRA adapter (rank 8) from the first round of CISPO training for Tic Tac Toe, applied on top of [anakin87/LFM2-2.6B-ttt-sft](https://huggingface.co/anakin87/LFM2-2.6B-ttt-sft).

This adapter must be loaded on top of the SFT base model. The merged version is available as [anakin87/LFM2-2.6B-ttt-rl-merged](https://huggingface.co/anakin87/LFM2-2.6B-ttt-rl-merged).
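
Loading the adapter on top of the SFT base model might look like the sketch below. It assumes recent `transformers` and `peft` releases with LFM2 support, and it takes the adapter repository id (`anakin87/LFM2-2.6B-ttt-rl`) from this card's title. The helper name `load_model_with_adapter` is hypothetical.

```python
BASE_MODEL_ID = "anakin87/LFM2-2.6B-ttt-sft"
# Assumption: the adapter repo id matches this card's title.
ADAPTER_ID = "anakin87/LFM2-2.6B-ttt-rl"


def load_model_with_adapter(base_id=BASE_MODEL_ID, adapter_id=ADAPTER_ID):
    """Load the SFT base model, then attach the LoRA adapter on top of it."""
    # Imported lazily so the heavy libraries are only needed when loading.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(base_id)
    base_model = AutoModelForCausalLM.from_pretrained(base_id)
    # Wrap the base model with the LoRA weights from the adapter repo.
    model = PeftModel.from_pretrained(base_model, adapter_id)
    return model, tokenizer


# Usage (downloads the weights on first call):
#   model, tokenizer = load_model_with_adapter()
#   model = model.merge_and_unload()  # optional: fold LoRA into the base weights
```

If the adapter does not need to stay separate, the merged checkpoint linked above avoids the extra `peft` dependency.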

This is an intermediate checkpoint from 🎓 **[LLM RL Environments Lil Course](https://github.com/anakin87/llm-rl-environments-lil-course)**, a hands-on course on building RL environments for Language Models, where models learn from rewards, not examples. It walks through the full process of turning a small open model into a specialist that outperforms a large proprietary one on a specific task (Tic Tac Toe). The final model is [anakin87/LFM2-2.6B-mr-tictactoe](https://huggingface.co/anakin87/LFM2-2.6B-mr-tictactoe).

🤗🕹️ **[Play against the final model](https://huggingface.co/spaces/anakin87/LFM2-2.6B-mr-tictactoe)**

## Training

- **Algorithm:** CISPO via [Verifiers](https://github.com/PrimeIntellect-ai/verifiers) RLTrainer
- **Environment:** [anakin87/tictactoe](https://app.primeintellect.ai/dashboard/environments/anakin87/tictactoe)
- **Opponents:** random move probability between 20% and 70%
- **Hyperparameters:** 600 steps, batch size 256, learning rate 5e-5, LoRA rank 8
- **Hardware:** 2x NVIDIA RTX Pro 6000 96GB (~8 hours)
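
To illustrate what an opponent with a "random move probability" could look like, here is a minimal sketch. It is not the environment's actual code. It assumes that the opponent otherwise plays the minimax-optimal move, which this card does not state. The board encoding (a list of 9 cells holding `"X"`, `"O"`, or `" "`) and all function names are hypothetical.

```python
import random

WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winner(board):
    """Return "X" or "O" if a line is complete, else None."""
    for a, b, c in WIN_LINES:
        if board[a] != " " and board[a] == board[b] == board[c]:
            return board[a]
    return None


def legal_moves(board):
    return [i for i, cell in enumerate(board) if cell == " "]


def minimax(board, player, me):
    """Return (score, move) from `me`'s view: +1 win, 0 draw, -1 loss."""
    w = winner(board)
    if w is not None:
        return (1 if w == me else -1), None
    moves = legal_moves(board)
    if not moves:
        return 0, None
    other = "O" if player == "X" else "X"
    best_score, best_move = None, None
    for m in moves:
        board[m] = player
        score, _ = minimax(board, other, me)
        board[m] = " "  # undo the trial move
        # `me` maximizes the score, the other side minimizes it.
        if best_score is None or (score > best_score if player == me
                                  else score < best_score):
            best_score, best_move = score, m
    return best_score, best_move


def opponent_move(board, player, random_move_prob, rng=random):
    """Play a uniformly random legal move with the given probability,
    otherwise the minimax move (assumption, see lead-in)."""
    if rng.random() < random_move_prob:
        return rng.choice(legal_moves(board))
    return minimax(list(board), player, player)[1]
```

Under this reading, a probability of 0 gives an optimal opponent and a probability of 1 gives a fully random one, so the 20-70% range sits between the two evaluation settings reported below.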

## Evaluation (merged)

100 games per setting.

| **Model vs random opponent** | **% Wins** | **% Draws** | **% Losses** | **% Follows format** | **% Games with invalid moves** |
|------------------------------|------------|-------------|--------------|----------------------|--------------------------------|
| LiquidAI/LFM2-2.6B | 40 | 11 | 49 | 27.8 | 40 |
| anakin87/LFM2-2.6B-ttt-sft | 74 | 13 | 13 | 99.8 | 11 |
| **anakin87/LFM2-2.6B-ttt-rl** | **86** | **12** | **2** | **100** | **1** |

| **Model vs optimal opponent** | **% Wins** | **% Draws** | **% Losses** | **% Follows format** | **% Games with invalid moves** |
|-------------------------------|------------|-------------|--------------|----------------------|--------------------------------|
| LiquidAI/LFM2-2.6B | 0 | 11 | 89 | 24.7 | 43 |
| anakin87/LFM2-2.6B-ttt-sft | 0 | 52 | 48 | 99 | 14 |
| **anakin87/LFM2-2.6B-ttt-rl** | **0** | **85** | **15** | **100** | **1** |

The model is a competent player, but it still falls into fork traps against the optimal opponent, where it loses 15% of games.
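
A fork is a move that creates two winning threats at once, so the other side can block only one of them. The sketch below shows one way to detect such moves. It is illustrative only and is not part of the evaluation code. The board encoding (a list of 9 cells holding `"X"`, `"O"`, or `" "`) and the function names are hypothetical.

```python
WIN_LINES = [(0, 1, 2), (3, 4, 5), (6, 7, 8),
             (0, 3, 6), (1, 4, 7), (2, 5, 8),
             (0, 4, 8), (2, 4, 6)]


def winning_moves(board, player):
    """Empty cells where `player` would complete three in a row."""
    wins = []
    for cell in range(9):
        if board[cell] != " ":
            continue
        for line in WIN_LINES:
            # The other two cells of the line must already belong to `player`.
            if cell in line and all(board[i] == player
                                    for i in line if i != cell):
                wins.append(cell)
                break
    return wins


def fork_moves(board, player):
    """Empty cells that leave `player` with two or more winning threats."""
    forks = []
    for cell in range(9):
        if board[cell] != " ":
            continue
        after = list(board)
        after[cell] = player
        if len(winning_moves(after, player)) >= 2:
            forks.append(cell)
    return forks


# X holds two opposite corners, O holds the center and a third corner.
board = ["X", " ", "O",
         " ", "O", " ",
         " ", " ", "X"]
print(fork_moves(board, "X"))  # cells where X would create a fork
```

Avoiding a fork trap means checking the opponent's fork moves one turn earlier, before they become available.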