| title: OrthoRL — Orthodontic Treatment Planning Environment | |
| emoji: 🦷 | |
| colorFrom: blue | |
| colorTo: green | |
| sdk: docker | |
| app_port: 7860 | |
| tags: | |
| - openenv | |
| - dental | |
| - orthodontics | |
| - reinforcement-learning | |
| - se3 | |
| - medical-ai | |
| # OrthoRL — Orthodontic Treatment Planning RL Environment | |
| > First RL environment for orthodontic aligner staging. 24-step sequential | |
| > SE(3) planning under delayed bone biomechanics, grounded in **200 real | |
| > Tsinghua patients with vertex-segmented landmarks** + 1,063 clinical | |
| > profiles. Built for the OpenEnv AI India 2026 hackathon (Theme 3.1 | |
| > World-Modeling for Professional Tasks). | |
| [Hugging Face Space](https://huggingface.co/spaces/sri-manikanta/orthorl) | |
| [Tests](#tests) | |
| [License](LICENSE) | |
| --- | |
| ## Why this exists | |
| Every year **12 million patients** start clear-aligner treatment. Each | |
| plan is **24 sequential decisions** under biomechanical constraint — get | |
| one wrong and you trigger root resorption or treatment failure. The | |
| published RL work either picks the *final* alignment (CLIK-Diffusion | |
| 2025) or makes coarse extraction choices (Li & Wang 2025); **nobody | |
| plans the 24 intermediate stages**. This environment lets an LLM agent | |
| do exactly that. | |
| ### Economic context: the refinement trap | |
| The clear-aligner market was **$8.29 B in 2025** and is projected to | |
| reach **$56.81 B by 2033** ([Grand View Research][gvr]). Yet only | |
| **6.0% of patients** finish on the original plan; the rest need | |
| refinement scans because the plan failed, averaging **2.5 refinements | |
| per patient**, and **1 in 6 switches from aligners to braces entirely** | |
| because the digital plan never tracked. The mechanism is the one | |
| OrthoRL targets: planners draw straight-line SLERP paths between tooth | |
| poses, but bone remodelling is delayed and biological — so teeth lose | |
| tracking and plans require mid-course corrections. | |
| OpenAI's **GDPval** ([Patwardhan et al. 2025][gdpval]) measured | |
| frontier models against human experts on **44 occupations across the | |
| top-9 GDP-contributing US sectors** and reported a 40–49% deliverable | |
| win rate, with deliverables produced "approximately 100× faster, at a fraction of the cost." | |
| **Dentistry / orthodontics is not among the 44 occupations.** OrthoRL | |
| is the training environment GDPval skipped: same head-to-head | |
| methodology (trained agent vs SLERP baseline on identical held-out | |
| patients), applied to a domain where reducing refinements saves ~20% | |
| of aligners per case and AI automation is documented at ~80× planning | |
| speedup. | |
| [gvr]: https://www.grandviewresearch.com/industry-analysis/clear-aligners-market | |
| [gdpval]: https://arxiv.org/abs/2510.04374 | |
| --- | |
| ## What it is | |
| | | | | |
| |---|---| | |
| | **Episode** | 24-step sequential commit-and-feedback loop | | |
| | **State** | 28 teeth × `[qw, qx, qy, qz, tx, ty, tz]` (SE(3)) in canonical `dental_v1` frame | | |
| | **Action** | per-stage tooth-fraction plan, parsed by `parse_completion_to_poses` | | |
| | **Tools** | `inspect_tooth`, `simulate_step`, `check_collisions`, `commit_stage`, `rollback_stage`, `diagnose_angle_class`, `measure_crowding`, `measure_overbite` | | |
| | **Five rewards** | `terminal · occlusion · strategy · format · anchorage` (all algorithmic — no LLM-as-judge) | | |
| | **Reward range** | `[-2, +8]` with hard-fail overrides (collision −1.0, PDL stress −0.5) | | |
| | **Datasets** | 1,063 Tsinghua profiles · 200 Tsinghua landmark patients · 17 Open-Full-Jaw · 200 Bits2Bites | | |
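To make the state layout and the SLERP baseline concrete, here is a minimal stdlib-only sketch. It assumes unit quaternions in `[qw, qx, qy, qz]` order, linear interpolation of the translation part, and evenly spaced stages. The names `slerp` and `slerp_plan` are illustrative and are not taken from the repository.

```python
import math

N_TEETH, N_STAGES = 28, 24  # sizes from the table above


def slerp(q0, q1, t):
    """Spherical interpolation between unit quaternions [qw, qx, qy, qz]."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:  # q and -q encode the same rotation; take the shorter arc
        q1, dot = [-c for c in q1], -dot
    if dot > 0.9995:  # nearly parallel: normalised linear interpolation
        q = [a + t * (b - a) for a, b in zip(q0, q1)]
    else:
        theta = math.acos(dot)
        w0 = math.sin((1 - t) * theta) / math.sin(theta)
        w1 = math.sin(t * theta) / math.sin(theta)
        q = [w0 * a + w1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]


def slerp_plan(start, goal, n_stages=N_STAGES):
    """Return n_stages lists of poses; stage k sits at fraction k / n_stages."""
    plan = []
    for k in range(1, n_stages + 1):
        t = k / n_stages
        stage = []
        for p0, p1 in zip(start, goal):
            rot = slerp(p0[:4], p1[:4], t)
            trans = [(1 - t) * a + t * b for a, b in zip(p0[4:], p1[4:])]
            stage.append(rot + trans)
        plan.append(stage)
    return plan


# Toy case (not patient data): identity rotations, 2.4 units of travel along x.
start = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] for _ in range(N_TEETH)]
goal = [[1.0, 0.0, 0.0, 0.0, 2.4, 0.0, 0.0] for _ in range(N_TEETH)]
plan = slerp_plan(start, goal)
```

A plan built this way ignores any delay between commanded and realised movement, which is why it serves as the baseline rather than the target behaviour.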
| ### Key innovations | |
| 1. **Pharmacokinetic force decay** — translational deltas are convolved with a 5-tap kernel `[0.10, 0.30, 0.40, 0.15, 0.05]` modelling bone-remodelling delay (Proffit Ch. 8). The agent has to plan ~2 stages ahead of where it wants the tooth to land. Verified directional drop on SLERP across 5 seeds. | |
| 2. **Empirical anchorage prior** mined from 195 real treatments (n=5,089 tooth-class observations). Real molar median = **0.89 mm**, incisor = **2.42 mm** — the agent gets a bounded reward signal that mirrors the population pattern. *Not hand-tuned.* | |
| 3. **Three-tier held-out eval** — 250 Tsinghua test + 17 OFJ + 40 Bits2Bites IDs frozen in `server/eval_split.py`; `EvalRegistry.assert_training_legal()` raises on any leakage in train mode (test pinned). | |
| 4. **Mesh-based collision detection** on real per-tooth vertex segmentation; ellipsoid fallback for synthetic cases. | |
| 5. **5-stage training pipeline** — Format SFT → Tool-use SFT → Behavioural-cloning SFT → GRPO (5 reward funcs) → Rejection-sampling FT. | |
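The force-decay idea in item 1 can be sketched as a causal convolution. The kernel taps are the ones quoted above; everything else is an assumption for illustration: the function name `realised_deltas`, the one-dimensional deltas, and the choice to drop any motion that would land after the final stage. The repository's `force_decay` module may handle these details differently.

```python
KERNEL = [0.10, 0.30, 0.40, 0.15, 0.05]  # taps quoted in item 1; they sum to 1.0


def realised_deltas(planned, kernel=KERNEL):
    """Spread each commanded delta over the following stages.

    A delta commanded at stage k contributes kernel[lag] * delta to stage
    k + lag. Contributions past the last stage are dropped (an assumption).
    """
    out = [0.0] * len(planned)
    for k, delta in enumerate(planned):
        for lag, weight in enumerate(kernel):
            if k + lag < len(out):
                out[k + lag] += weight * delta
    return out


# One 1.0-unit push commanded at stage 0 of a 24-stage plan.
planned = [1.0] + [0.0] * 23
realised = realised_deltas(planned)
peak_stage = max(range(len(realised)), key=realised.__getitem__)
```

In this toy run the largest share of the movement arrives two stages after the command, which matches the "plan ~2 stages ahead" remark. Under the truncation assumption, a push commanded in the last few stages is only partly expressed before the episode ends.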
| --- | |
| ## Live demo | |
| ```bash | |
| curl https://sri-manikanta-orthorl.hf.space/health | |
| # {"status":"healthy"} | |
| curl -X POST https://sri-manikanta-orthorl.hf.space/reset_stepwise \ | |
| -H 'Content-Type: application/json' \ | |
| -d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}' | |
| # Real Patient 0001 (Tsinghua) → 28×7 poses in dental_v1 frame | |
| ``` | |
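The same reset call can be issued from Python. This is a minimal sketch using only the standard library; the endpoint and payload fields are copied from the `curl` example above, while the helper names `build_reset_request` and `send` are hypothetical. The response schema is not documented here, so the sketch returns the decoded JSON without interpreting it.

```python
import json
import urllib.request

BASE_URL = "https://sri-manikanta-orthorl.hf.space"


def build_reset_request(task_id, tier, eval_idx, mode="eval", base_url=BASE_URL):
    """Build (but do not send) the POST request for /reset_stepwise."""
    payload = {"task_id": task_id, "mode": mode, "tier": tier, "eval_idx": eval_idx}
    return urllib.request.Request(
        f"{base_url}/reset_stepwise",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=30):
    """Send a request and decode the JSON body of the response."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))


req = build_reset_request("tsinghua/0001", tier=1, eval_idx=0)
# observation = send(req)  # uncomment to call the live Space
```

Building the request separately from sending it keeps the payload easy to inspect and to test offline.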
| --- | |
| ## Local quick-start | |
| ```bash | |
| git clone https://github.com/mehular0ra/orthorl | |
| cd orthorl | |
| uv sync | |
| uv run python -m server.app # FastAPI on :7860 | |
| curl http://localhost:7860/health | |
| make test # 227 tests pass in ~60 s | |
| ``` | |
| --- | |
| ## Training | |
| The full SFT → GRPO → RFT chain runs on a single A10G or T4. The adapter | |
| is pushed to `sri-manikanta/orthorl-grpo` at the end. | |
| ```bash | |
| # end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40) | |
| bash scripts/run_full_pipeline.sh | |
| # or just one stage (e.g. re-run GRPO without re-running SFT): | |
| STAGES="3" bash scripts/run_full_pipeline.sh | |
| ``` | |
| Colab driver: [`notebooks/colab_a10g.py`](notebooks/colab_a10g.py). | |
| ### Logging | |
| Every run captures the following by default, with no flag combinations to remember: | |
| | File | Frequency | Content | | |
| |---|---|---| | |
| | `logs/run_<TS>.log` | streaming (tee) | every stdout line of every stage | | |
| | `logs/grpo_samples.jsonl` | every 10 steps | step / loss / lr / per-reward-fn means / completion length | | |
| | `logs/grpo_completions.jsonl` | every 10 steps | full sample completion text + slim breakdown (closes the W&B-only-scalars gap) | | |
| | `checkpoints/grpo/trainer_state.json` | every TRL step | TRL log_history with all reward fns | | |
| | `research/results.tsv` | once at training end | autolog row per `multiautoresearch` discipline | | |
| | `results/reward_curves.png` | once at training end | matplotlib plot of every `rewards/*/mean` | | |
| | W&B (if `WANDB_API_KEY` is in env) | every step | live cloud dashboard (auto-on) | | |
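For a quick look at a JSONL log without W&B, a small summariser is enough. The sketch below averages every numeric field it finds and assumes nothing about key names; the sample records use made-up keys, and the real fields in `logs/grpo_samples.jsonl` may differ.

```python
import json
from collections import defaultdict


def summarise_jsonl(lines):
    """Average every numeric field across JSONL records.

    No key names are assumed; whatever numeric fields appear are averaged.
    """
    totals, counts = defaultdict(float), defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue
        for key, value in json.loads(line).items():
            if isinstance(value, (int, float)) and not isinstance(value, bool):
                totals[key] += value
                counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}


# Hypothetical records for illustration only.
sample = [
    '{"step": 10, "loss": 0.5, "rewards/format/mean": 0.25}',
    '{"step": 20, "loss": 0.25, "rewards/format/mean": 0.75}',
]
summary = summarise_jsonl(sample)
```

With a real run, the same function accepts an open file handle, for example `summarise_jsonl(open("logs/grpo_samples.jsonl"))`.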
| After training, generate the slide kit: | |
| ```bash | |
| uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3 | |
| uv run python eval.py --policy checkpoints/rft --tier 1 --seeds 3 | |
| uv run python scripts/build_demo_plots.py # 5 PNGs for the deck | |
| uv run python scripts/build_ablation_matrix.py # spec 1.16 table | |
| ``` | |
| --- | |
| ## Results | |
| > **Live numbers populated post-training.** The SLERP baseline below is locked | |
| > in; trained-checkpoint rows fill in from `eval_summary.csv` after the | |
| > A10G run (use `scripts/build_ablation_matrix.py` to regenerate). | |
| | Policy | Tier 1 (Tsinghua test, N=250) | Tier 2 (OFJ, N=17) | Tier 3 (Bits2Bites, N=40) | | |
| |---|---|---|---| | |
| | **SLERP baseline** | **0.7231 ±0.036** | — | — | | |
| | Stage 0 (format SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` | | |
| | Stage 2 (BC SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` | | |
| | Stage 3 (GRPO) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` | | |
| | Stage 4 (RFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` | | |
| `results/reward_curves.png`, `results/eval_three_tier.png`, | |
| `results/case_type_reward.png`, `results/anchorage_summary.png`, | |
| `results/safeguards.png`, and `results/ablation_matrix.png` are the | |
| slide-ready figures that the demo deck embeds. | |
| --- | |
| ## Repository layout | |
| ``` | |
| server/ env modules (clinical_profiles, coord_frame, | |
| eval_split, expert_stager, force_decay, | |
| landmark_loader, mesh_collision, | |
| movement_priors, reward_scaler, ...) | |
| scripts/ SFT/GRPO/RFT/eval pipeline drivers | |
| + cache_oracle, build_demo_plots, build_ablation_matrix | |
| specs/ 22 numbered spec files + VALIDATION_TRACKER.md | |
| tests/ 227 passing pytest cases | |
| data/ SFT JSONL datasets (committed; deterministic) | |
| datasets/ landmark/case data (gitignored bulk; case_database.json committed) | |
| checkpoints/ SFT/GRPO/RFT adapters (gitignored) | |
| results/ eval CSV + slide-ready PNGs | |
| research/ pitch_script.md, results.tsv, code_review_*.md | |
| notebooks/colab_a10g.py one-cell launcher for the full pipeline | |
| train_grpo.py main GRPO entrypoint | |
| eval.py held-out evaluation CLI | |
| prepare.py / server/grader.py IMMUTABLE benchmark (read-only) | |
| ``` | |
| ## Tests | |
| ``` | |
| make test # 227 passing in ~60 s | |
| make fast-check # high-value subset (<5 s) used by the pre-commit hook | |
| make install-precommit # repo-local hook so the next regression doesn't ship | |
| ``` | |
| ## Specs | |
| 22 numbered specs in `specs/`. Tier 1 (must-haves): 1.1–1.16, all | |
| shipped except 1.16 (the ablation matrix table — auto-fills post-eval). | |
| Tier 2 (separators): 2.1–2.6, all shipped. Tier 3+ are reach goals. | |
| ## References | |
| - Wang et al. (2024). *Scientific Data* 11:1277. DOI: 10.1038/s41597-024-04138-7 | |
| - Andrews LF (1972). "The six keys to normal occlusion." *Am J Orthod* 62(3):296–309 | |
| - Proffit WR. *Contemporary Orthodontics* 6e Ch. 8 — bone remodelling timeline | |
| - Cattaneo PM et al. (2005). "Moment-to-force ratio." *Am J Orthod Dentofacial Orthop* | |
| - Shao et al. (2024). DeepSeekMath / GRPO. arXiv:2402.03300 | |
| - Yuan et al. (2023). RFT. arXiv:2308.01825 | |
| - Patwardhan et al. (2025). *GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks.* arXiv:2510.04374 — economic-importance frame | |
| - Grand View Research. *Clear Aligners Market Report* (2025) — market size, refinement-rate context | |
| ## Team | |
| - **Mehul Arora** — Orthogonal Research and Education Lab | |
| - **Vivek Mathur** — M.S. by Research, IIIT Hyderabad | |
| - **Prof. Bradly Alicea** — UIUC / Orthogonal Research Lab | |
| ## License | |
| MIT. Patient data inherits the original Tsinghua / OFJ / Bits2Bites | |
| licenses. | |