---
title: OrthoRL — Orthodontic Treatment Planning Environment
emoji: 🦷
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - dental
  - orthodontics
  - reinforcement-learning
  - se3
  - medical-ai
---

# OrthoRL — Orthodontic Treatment Planning RL Environment

> First RL environment for orthodontic aligner staging. 24-step sequential
> SE(3) planning under delayed bone biomechanics, grounded in **200 real
> Tsinghua patients with vertex-segmented landmarks** + 1,063 clinical
> profiles. Built for the OpenEnv AI India 2026 hackathon (Theme 3.1
> World-Modeling for Professional Tasks).

[![Live Space](https://img.shields.io/badge/HF%20Space-live-success)](https://huggingface.co/spaces/sri-manikanta/orthorl)
[![Tests](https://img.shields.io/badge/tests-227%20passing-brightgreen)](#tests)
[![License](https://img.shields.io/badge/license-MIT-blue)](LICENSE)

---

## Why this exists

Every year **12 million patients** start clear-aligner treatment. Each
plan is **24 sequential decisions** under biomechanical constraints: get
one wrong and you risk root resorption or treatment failure. Prior
work either picks the *final* alignment (CLIK-Diffusion 2025) or makes
coarse extraction choices (Li & Wang 2025); **nobody plans the 24
intermediate stages**. This environment lets an LLM agent do exactly
that.

### Economic context: the refinement trap

The clear-aligner market was **$8.29 B in 2025** and is projected to
reach **$56.81 B by 2033** ([Grand View Research][gvr]). Yet only
**6.0% of patients** finish on the original plan. The rest need
refinement scans because the plan failed, averaging **2.5 refinements
per patient**, and **1 in 6 switches from aligners to braces entirely**
because the digital plan never tracked. OrthoRL targets the mechanism
behind this: planners draw straight-line SLERP paths between tooth
poses, but bone remodelling is delayed and biological, so teeth lose
tracking and plans require mid-course corrections.
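
For readers unfamiliar with the straight-line baseline, here is a minimal sketch of a SLERP plan for a single tooth. It assumes the `[qw, qx, qy, qz, tx, ty, tz]` pose layout used by the environment and 24 evenly spaced stages. The names `slerp` and `slerp_plan` are illustrative, not the repository's API.

```python
import numpy as np


def slerp(q0: np.ndarray, q1: np.ndarray, t: float) -> np.ndarray:
    """Spherical linear interpolation between two quaternions [qw, qx, qy, qz]."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to take the shorter arc.
        q1, dot = -q1, -dot
    if dot > 0.9995:
        # Nearly identical rotations: normalized linear interpolation is stable.
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1.0 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)


def slerp_plan(start: np.ndarray, target: np.ndarray, n_stages: int = 24) -> np.ndarray:
    """Return an (n_stages, 7) array; row k is the planned pose after stage k + 1."""
    plan = np.empty((n_stages, 7))
    for k in range(n_stages):
        t = (k + 1) / n_stages
        plan[k, :4] = slerp(start[:4], target[:4], t)           # rotation
        plan[k, 4:] = (1.0 - t) * start[4:] + t * target[4:]    # translation
    return plan


# Illustrative poses (not patient data): rotate 30 degrees about z, move 1 mm along x.
half_angle = np.deg2rad(30.0) / 2.0
start_pose = np.array([1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0])
target_pose = np.array([np.cos(half_angle), 0.0, 0.0, np.sin(half_angle), 1.0, 0.0, 0.0])
plan = slerp_plan(start_pose, target_pose)
print(plan.shape)
```

Each stage in this plan asks for the same increment of motion, which is exactly the assumption that delayed remodelling breaks.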

OpenAI's **GDPval** ([Patwardhan et al. 2025][gdpval]) measured
frontier models against human experts on **44 occupations across the
top-9 GDP-contributing US sectors** and reported a 40–49% deliverable
win rate, with models working "approximately 100× faster, at a fraction
of the cost."
**Dentistry / orthodontics is not among the 44 occupations.** OrthoRL
is the training environment GDPval skipped: same head-to-head
methodology (trained agent vs SLERP baseline on identical held-out
patients), applied to a domain where reducing refinements saves ~20%
of aligners per case and AI automation is documented at ~80× planning
speedup.

[gvr]: https://www.grandviewresearch.com/industry-analysis/clear-aligners-market
[gdpval]: https://arxiv.org/abs/2510.04374

---

## What it is

| | |
|---|---|
| **Episode** | 24-step sequential commit-and-feedback loop |
| **State** | 28 teeth × `[qw, qx, qy, qz, tx, ty, tz]` (SE(3)) in canonical `dental_v1` frame |
| **Action** | per-stage tooth-fraction plan, parsed by `parse_completion_to_poses` |
| **Tools** | `inspect_tooth`, `simulate_step`, `check_collisions`, `commit_stage`, `rollback_stage`, `diagnose_angle_class`, `measure_crowding`, `measure_overbite` |
| **Five rewards** | `terminal · occlusion · strategy · format · anchorage` (all algorithmic — no LLM-as-judge) |
| **Reward range** | `[-2, +8]` with hard-fail overrides (collision −1.0, PDL stress −0.5) |
| **Datasets** | 1,063 Tsinghua profiles · 200 Tsinghua landmark patients · 17 Open-Full-Jaw · 200 Bits2Bites |

### Key innovations

1. **Pharmacokinetic force decay** — translational deltas are convolved with a 5-tap kernel `[0.10, 0.30, 0.40, 0.15, 0.05]` modelling bone-remodelling delay (Proffit Ch. 8). The agent has to plan ~2 stages ahead of where it wants the tooth to land. Verified directional drop on SLERP across 5 seeds.
2. **Empirical anchorage prior** mined from 195 real treatments (n=5,089 tooth-class observations). Real molar median = **0.89 mm**, incisor = **2.42 mm** — the agent gets a bounded reward signal that mirrors the population pattern. *Not hand-tuned.*
3. **Three-tier held-out eval** — 250 Tsinghua test + 17 OFJ + 40 Bits2Bites IDs frozen in `server/eval_split.py`; `EvalRegistry.assert_training_legal()` raises on any leakage in train mode (test pinned).
4. **Mesh-based collision detection** on real per-tooth vertex segmentation; ellipsoid fallback for synthetic cases.
5. **5-stage training pipeline** — Format SFT → Tool-use SFT → Behavioural-cloning SFT → GRPO (5 reward funcs) → Rejection-sampling FT.
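
The force-decay idea in item 1 can be illustrated with a short sketch. It assumes the kernel is applied as a causal convolution along the stage axis, so that a delta commanded at stage `s` is realized over stages `s` to `s + 4`. How the environment actually applies the kernel may differ. `realized_deltas` is a hypothetical helper, not a function from `server/force_decay`.

```python
import numpy as np

# 5-tap kernel quoted above; the weights sum to 1.0.
KERNEL = np.array([0.10, 0.30, 0.40, 0.15, 0.05])


def realized_deltas(planned: np.ndarray) -> np.ndarray:
    """Spread each commanded delta over the following stages.

    planned has shape (n_stages, 3): per-stage translational deltas in mm.
    Movement that would land after the last stage is dropped (assumption).
    """
    n_stages = planned.shape[0]
    realized = np.zeros_like(planned, dtype=float)
    for lag, weight in enumerate(KERNEL):
        if lag >= n_stages:
            break
        # A delta commanded at stage s contributes weight * delta at stage s + lag.
        realized[lag:] += weight * planned[: n_stages - lag]
    return realized


# Illustrative plan: 0.1 mm per stage along x for 24 stages (2.4 mm commanded).
planned = np.zeros((24, 3))
planned[:, 0] = 0.1
realized = realized_deltas(planned)

print("commanded total:", planned[:, 0].sum())
print("realized total: ", realized[:, 0].sum())
```

Under these assumptions the uniform plan realizes about 2.225 mm of the 2.4 mm commanded. The shortfall equals the per-stage delta times the kernel's mean lag of 1.75 stages, which is why the agent has to plan roughly two stages ahead.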

---

## Live demo

```bash
curl https://sri-manikanta-orthorl.hf.space/health
# {"status":"healthy"}

curl -X POST https://sri-manikanta-orthorl.hf.space/reset_stepwise \
  -H 'Content-Type: application/json' \
  -d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}'
# Real Patient 0001 (Tsinghua) → 28×7 poses in dental_v1 frame
```
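
The same reset call can be issued from Python with the standard library. This is a minimal sketch: the endpoint and payload fields are copied from the `curl` example, while `build_reset_request` and `send` are illustrative helper names. The response schema is not documented here, so the parsed JSON is returned as-is.

```python
import json
import urllib.request

BASE_URL = "https://sri-manikanta-orthorl.hf.space"


def build_reset_request(task_id: str, mode: str = "eval", tier: int = 1,
                        eval_idx: int = 0, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) the POST request for /reset_stepwise."""
    payload = {"task_id": task_id, "mode": mode, "tier": tier, "eval_idx": eval_idx}
    return urllib.request.Request(
        base_url + "/reset_stepwise",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request, timeout: float = 30.0):
    """Send the request and return the decoded JSON body, whatever its shape."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `send(build_reset_request("tsinghua/0001"))` mirrors the second `curl` command. Keeping request construction separate from sending makes the payload easy to inspect without network access.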

---

## Local quick-start

```bash
git clone https://github.com/mehular0ra/orthorl
cd orthorl
uv sync
uv run python -m server.app                  # FastAPI on :7860
curl http://localhost:7860/health
make test                                    # 227 tests pass in ~60 s
```

---

## Training

The full SFT → GRPO → RFT chain runs on a single A10G or T4. The
adapter is pushed to `sri-manikanta/orthorl-grpo` at the end.

```bash
# end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40)
bash scripts/run_full_pipeline.sh

# or just one stage (e.g. re-run GRPO without re-running SFT):
STAGES="3" bash scripts/run_full_pipeline.sh
```

Colab driver: [`notebooks/colab_a10g.py`](notebooks/colab_a10g.py).

### Logging

Every run captures the following by default, with no flag combinations to remember:

| File | Frequency | Content |
|---|---|---|
| `logs/run_<TS>.log` | streaming (tee) | every stdout line of every stage |
| `logs/grpo_samples.jsonl` | every 10 steps | step / loss / lr / per-reward-fn means / completion length |
| `logs/grpo_completions.jsonl` | every 10 steps | full sample completion text + slim breakdown (closes the W&B-only-scalars gap) |
| `checkpoints/grpo/trainer_state.json` | every TRL step | TRL log_history with all reward fns |
| `research/results.tsv` | once at training end | autolog row per `multiautoresearch` discipline |
| `results/reward_curves.png` | once at training end | matplotlib plot of every `rewards/*/mean` |
| W&B (if `WANDB_API_KEY` is in env) | every step | live cloud dashboard (auto-on) |

After training, generate the slide kit:

```bash
uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3
uv run python eval.py --policy checkpoints/rft  --tier 1 --seeds 3
uv run python scripts/build_demo_plots.py        # 5 PNGs for the deck
uv run python scripts/build_ablation_matrix.py    # spec 1.16 table
```
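
To inspect reward curves without the plotting scripts, the `trainer_state.json` listed in the logging table can be read directly. The sketch below assumes the file has the usual `log_history` list and that reward means appear under keys matching `rewards/*/mean`, as the table states. The helper names are illustrative.

```python
import json
from collections import defaultdict
from pathlib import Path


def reward_series(state: dict) -> dict:
    """Map each reward name to a list of (step, mean) pairs from log_history."""
    prefix, suffix = "rewards/", "/mean"
    series = defaultdict(list)
    for entry in state.get("log_history", []):
        step = entry.get("step")
        for key, value in entry.items():
            if key.startswith(prefix) and key.endswith(suffix):
                name = key[len(prefix):-len(suffix)]
                series[name].append((step, float(value)))
    return dict(series)


def load_reward_series(path: str = "checkpoints/grpo/trainer_state.json") -> dict:
    """Read a trainer state file from disk and extract its reward series."""
    return reward_series(json.loads(Path(path).read_text()))
```

Entries without reward keys, such as evaluation-only log rows, are skipped, so the function works on partially written logs from an interrupted run.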

---

## Results

> **Live numbers populated post-training.** The SLERP baseline below is locked
> in; trained-checkpoint rows fill in from `eval_summary.csv` after the
> A10G run (use `scripts/build_ablation_matrix.py` to regenerate).

| Policy | Tier 1 (Tsinghua test, N=250) | Tier 2 (OFJ, N=17) | Tier 3 (Bits2Bites, N=40) |
|---|---|---|---|
| **SLERP baseline** | **0.7231 ±0.036** | — | — |
| Stage 0 (format SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
| Stage 2 (BC SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
| Stage 3 (GRPO) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
| Stage 4 (RFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |

`results/reward_curves.png`, `results/eval_three_tier.png`,
`results/case_type_reward.png`, `results/anchorage_summary.png`,
`results/safeguards.png`, `results/ablation_matrix.png` are the
slide-ready figures the demo deck embeds.

---

## Repository layout

```
server/                      env modules (clinical_profiles, coord_frame,
                             eval_split, expert_stager, force_decay,
                             landmark_loader, mesh_collision,
                             movement_priors, reward_scaler, ...)
scripts/                     SFT/GRPO/RFT/eval pipeline drivers
                             + cache_oracle, build_demo_plots, build_ablation_matrix
specs/                       22 numbered spec files + VALIDATION_TRACKER.md
tests/                       227 passing pytest cases
data/                        SFT JSONL datasets (committed; deterministic)
datasets/                    landmark/case data (gitignored bulk; case_database.json committed)
checkpoints/                 SFT/GRPO/RFT adapters (gitignored)
results/                     eval CSV + slide-ready PNGs
research/                    pitch_script.md, results.tsv, code_review_*.md
notebooks/colab_a10g.py      one-cell launcher for the full pipeline
train_grpo.py                main GRPO entrypoint
eval.py                      held-out evaluation CLI
prepare.py / server/grader.py  IMMUTABLE benchmark (read-only)
```

## Tests

```bash
make test           # 227 passing in ~60 s
make fast-check     # high-value subset (<5 s) used by the pre-commit hook
make install-precommit  # repo-local hook so the next regression doesn't ship
```

## Specs

22 numbered specs in `specs/`. Tier 1 (must-haves): 1.1–1.16, all
shipped except 1.16 (the ablation matrix table — auto-fills post-eval).
Tier 2 (separators): 2.1–2.6, all shipped. Tier 3+ are reach goals.

## References

- Wang et al. (2024). *Nature Scientific Data* 11:1277. DOI: 10.1038/s41597-024-04138-7
- Andrews LF (1972). "The six keys to normal occlusion." *Am J Orthod* 62(3):296–309
- Proffit WR. *Contemporary Orthodontics* 6e Ch. 8 — bone remodelling timeline
- Cattaneo PM et al. (2005). "Moment-to-force ratio." *Am J Orthod Dentofacial Orthop*
- Shao et al. (2024). DeepSeekMath / GRPO. arXiv:2402.03300
- Yuan et al. (2023). RFT. arXiv:2308.01825
- Patwardhan et al. (2025). *GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks.* arXiv:2510.04374 — economic-importance frame
- Grand View Research. *Clear Aligners Market Report* (2025) — market size, refinement-rate context

## Team

- **Mehul Arora** — Orthogonal Research and Education Lab
- **Vivek Mathur** — M.S. by Research, IIIT Hyderabad
- **Prof. Bradly Alicea** — UIUC / Orthogonal Research Lab

## License

MIT. Patient data inherits the original Tsinghua / OFJ / Bits2Bites
licenses.