---
title: OrthoRL - Orthodontic Treatment Planning Environment
emoji: 🦷
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - dental
  - orthodontics
  - reinforcement-learning
  - se3
  - medical-ai
---

# OrthoRL - Orthodontic Treatment Planning RL Environment

> First RL environment for orthodontic aligner staging. 24-step sequential
> SE(3) planning under delayed bone biomechanics, grounded in **200 real
> Tsinghua patients with vertex-segmented landmarks** + 1,063 clinical
> profiles. Built for the OpenEnv AI India 2026 hackathon (Theme 3.1
> World-Modeling for Professional Tasks).

[![Live Space](https://img.shields.io/badge/HF%20Space-live-success)](https://huggingface.co/spaces/sri-manikanta/orthorl)
[![Tests](https://img.shields.io/badge/tests-227%20passing-brightgreen)](#tests)
[![License](https://img.shields.io/badge/license-MIT-blue)](LICENSE)

---

## Why this exists

Every year **12 million patients** start clear-aligner treatment. Each plan is
**24 sequential decisions** under biomechanical constraint: get one wrong and
you trigger root resorption or treatment failure. The published RL work either
picks the *final* alignment (CLIK-Diffusion 2025) or makes coarse extraction
choices (Li & Wang 2025); **nobody plans the 24 intermediate stages**. This
environment lets an LLM agent do exactly that.

### Economic context: the refinement trap

The clear-aligner market was **$8.29 B in 2025** and is projected to reach
**$56.81 B by 2033** ([Grand View Research][gvr]). Yet only **6.0% of
patients** finish on the original plan; the rest need refinement scans because
the plan failed, averaging **2.5 refinements per patient**, and **1 in 6
switches from aligners to braces entirely** because the digital plan never
tracked. The mechanism is the one OrthoRL targets: planners draw straight-line
SLERP paths between tooth poses, but bone remodelling is delayed and
biological, so teeth lose tracking and plans require mid-course corrections.

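
To make the straight-line baseline concrete, here is a minimal sketch of SLERP staging for a single tooth. It is an illustration under assumptions, not the repository's implementation: it assumes the `[qw, qx, qy, qz, tx, ty, tz]` pose layout this README uses, evenly spaced interpolation fractions, and linear interpolation of the translation. The names `slerp` and `slerp_stage_plan` are hypothetical.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly identical rotations: linear blend avoids dividing by ~0.
        q = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in q))
    return tuple(c / norm for c in q)


def slerp_stage_plan(start, goal, n_stages=24):
    """Return n_stages poses, evenly spaced from start to goal (hypothetical helper).

    Each pose is (qw, qx, qy, qz, tx, ty, tz); the last pose equals the goal.
    """
    plan = []
    for k in range(1, n_stages + 1):
        t = k / n_stages
        quat = slerp(tuple(start[:4]), tuple(goal[:4]), t)
        trans = tuple(s + t * (g - s) for s, g in zip(start[4:], goal[4:]))
        plan.append(quat + trans)
    return plan


# Example: rotate a tooth 90 degrees about z while translating 2.4 mm along x.
start = (1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
goal = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4), 2.4, 0.0, 0.0)
plan = slerp_stage_plan(start, goal)
```

In this sketch every stage advances by the same fraction of the total movement, with no allowance for a delayed bone response.
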

OpenAI's **GDPval** ([Patwardhan et al. 2025][gdpval]) measured frontier models
against human experts on **44 occupations across the top-9 GDP-contributing US
sectors** and reported 40–49% deliverable win-rate "approximately 100× faster,
at a fraction of the cost." **Dentistry / orthodontics is not among the 44
occupations.** OrthoRL is the training environment GDPval skipped: same
head-to-head methodology (trained agent vs SLERP baseline on identical held-out
patients), applied to a domain where reducing refinements saves ~20% of
aligners per case and AI automation is documented at ~80× planning speedup.

[gvr]: https://www.grandviewresearch.com/industry-analysis/clear-aligners-market
[gdpval]: https://arxiv.org/abs/2510.04374

---

## What it is

| | |
|---|---|
| **Episode** | 24-step sequential commit-and-feedback loop |
| **State** | 28 teeth × `[qw, qx, qy, qz, tx, ty, tz]` (SE(3)) in canonical `dental_v1` frame |
| **Action** | per-stage tooth-fraction plan, parsed by `parse_completion_to_poses` |
| **Tools** | `inspect_tooth`, `simulate_step`, `check_collisions`, `commit_stage`, `rollback_stage`, `diagnose_angle_class`, `measure_crowding`, `measure_overbite` |
| **Five rewards** | `terminal · occlusion · strategy · format · anchorage` (all algorithmic, no LLM-as-judge) |
| **Reward range** | `[-2, +8]` with hard-fail overrides (collision −1.0, PDL stress −0.5) |
| **Datasets** | 1,063 Tsinghua profiles · 200 Tsinghua landmark patients · 17 Open-Full-Jaw · 200 Bits2Bites |

### Key innovations

1. **Pharmacokinetic force decay**: translational deltas are convolved with a
   5-tap kernel `[0.10, 0.30, 0.40, 0.15, 0.05]` modelling bone-remodelling
   delay (Proffit Ch. 8). The agent has to plan ~2 stages ahead of where it
   wants the tooth to land. Verified directional drop on SLERP across 5 seeds.
2. **Empirical anchorage prior** mined from 195 real treatments (n=5,089
   tooth-class observations).
   Real molar median = **0.89 mm**, incisor = **2.42 mm**, so the agent gets a
   bounded reward signal that mirrors the population pattern. *Not hand-tuned.*
3. **Three-tier held-out eval**: 250 Tsinghua test + 17 OFJ + 40 Bits2Bites
   IDs frozen in `server/eval_split.py`; `EvalRegistry.assert_training_legal()`
   raises on any leakage in train mode (test pinned).
4. **Mesh-based collision detection** on real per-tooth vertex segmentation;
   ellipsoid fallback for synthetic cases.
5. **5-stage training pipeline**: Format SFT → Tool-use SFT →
   Behavioural-cloning SFT → GRPO (5 reward funcs) → Rejection-sampling FT.

---

## Live demo

```bash
curl https://sri-manikanta-orthorl.hf.space/health
# {"status":"healthy"}

curl -X POST https://sri-manikanta-orthorl.hf.space/reset_stepwise \
  -H 'Content-Type: application/json' \
  -d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}'
# Real Patient 0001 (Tsinghua) → 28×7 poses in dental_v1 frame
```

---

## Local quick-start

```bash
git clone https://github.com/mehular0ra/orthorl
cd orthorl
uv sync
uv run python -m server.app    # FastAPI on :7860
curl http://localhost:7860/health
make test                      # 227 tests pass in ~60 s
```

---

## Training

The full SFT → GRPO → RFT chain runs on a single A10G or T4. The adapter is
pushed to `sri-manikanta/orthorl-grpo` at the end.

```bash
# end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40)
bash scripts/run_full_pipeline.sh

# or just one stage (e.g. re-run GRPO without re-running SFT):
STAGES="3" bash scripts/run_full_pipeline.sh
```

Colab driver: [`notebooks/colab_a10g.py`](notebooks/colab_a10g.py).

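
The force-decay kernel listed under Key innovations can be illustrated with a short causal convolution. This is a sketch under assumptions: the tap values come from this README, but the lag indexing (tap `k` applies to the delta commanded `k` stages earlier), the zero history before stage 0, and the function name `apply_force_decay` are hypothetical. The code in `server/force_decay.py` may handle edges or units differently.

```python
# Tap values as quoted in this README; they sum to 1.0.
KERNEL = [0.10, 0.30, 0.40, 0.15, 0.05]


def apply_force_decay(planned_deltas, kernel=KERNEL):
    """Causally convolve per-stage translational deltas with the decay kernel.

    Assumption: the movement realised at stage t is the weighted sum of the
    deltas commanded at stages t, t-1, ..., t-4, with nothing commanded
    before stage 0.
    """
    realised = []
    for t in range(len(planned_deltas)):
        total = 0.0
        for lag, weight in enumerate(kernel):
            if t - lag >= 0:
                total += weight * planned_deltas[t - lag]
        realised.append(total)
    return realised


# A single 1.0 mm delta commanded at stage 0 is spread over five stages.
impulse = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0]
response = apply_force_decay(impulse)
peak_stage = response.index(max(response))
```

Under this indexing a single commanded delta delivers its largest share of movement two stages later, which is consistent with the advice to plan about 2 stages ahead.
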

### Logging

Every run captures the following automatically, with no flag combinations to
remember:

| File | Frequency | Content |
|---|---|---|
| `logs/run_.log` | streaming (tee) | every stdout line of every stage |
| `logs/grpo_samples.jsonl` | every 10 steps | step / loss / lr / per-reward-fn means / completion length |
| `logs/grpo_completions.jsonl` | every 10 steps | full sample completion text + slim breakdown (closes the W&B-only-scalars gap) |
| `checkpoints/grpo/trainer_state.json` | every TRL step | TRL log_history with all reward fns |
| `research/results.tsv` | once at training end | autolog row per `multiautoresearch` discipline |
| `results/reward_curves.png` | once at training end | matplotlib plot of every `rewards/*/mean` |
| W&B (if `WANDB_API_KEY` is in env) | every step | live cloud dashboard (auto-on) |

After training, generate the slide kit:

```bash
uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3
uv run python eval.py --policy checkpoints/rft  --tier 1 --seeds 3
uv run python scripts/build_demo_plots.py        # 5 PNGs for the deck
uv run python scripts/build_ablation_matrix.py   # spec 1.16 table
```

---

## Results

> **Live numbers populated post-training.** The SLERP baseline below is locked
> in; trained-checkpoint rows fill in from `eval_summary.csv` after the
> A10G run (use `scripts/build_ablation_matrix.py` to regenerate).


| Policy | Tier 1 (Tsinghua test, N=250) | Tier 2 (OFJ, N=17) | Tier 3 (Bits2Bites, N=40) |
|---|---|---|---|
| **SLERP baseline** | **0.7231 ±0.036** | - | - |
| Stage 0 (format SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
| Stage 2 (BC SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
| Stage 3 (GRPO) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
| Stage 4 (RFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |

`results/reward_curves.png`, `results/eval_three_tier.png`,
`results/case_type_reward.png`, `results/anchorage_summary.png`,
`results/safeguards.png`, `results/ablation_matrix.png` are the slide-ready
figures the demo deck embeds.

---

## Repository layout

```
server/                  env modules (clinical_profiles, coord_frame, eval_split,
                         expert_stager, force_decay, landmark_loader, mesh_collision,
                         movement_priors, reward_scaler, ...)
scripts/                 SFT/GRPO/RFT/eval pipeline drivers + cache_oracle,
                         build_demo_plots, build_ablation_matrix
specs/                   16 numbered spec files + VALIDATION_TRACKER.md
tests/                   227 passing pytest cases
data/                    SFT JSONL datasets (committed; deterministic)
datasets/                landmark/case data (gitignored bulk; case_database.json committed)
checkpoints/             SFT/GRPO/RFT adapters (gitignored)
results/                 eval CSV + slide-ready PNGs
research/                pitch_script.md, results.tsv, code_review_*.md
notebooks/colab_a10g.py  one-cell launcher for the full pipeline
train_grpo.py            main GRPO entrypoint
eval.py                  held-out evaluation CLI
prepare.py / server/grader.py   IMMUTABLE benchmark (read-only)
```

## Tests

```
make test               # 227 passing in ~60 s
make fast-check         # high-value subset (<5 s) used by the pre-commit hook
make install-precommit  # repo-local hook so the next regression doesn't ship
```

## Specs

22 numbered specs in `specs/`. Tier 1 (must-haves): 1.1–1.16, all shipped
except 1.16 (the ablation matrix table, which auto-fills post-eval).
Tier 2 (separators): 2.1–2.6, all shipped. Tier 3+ are reach goals.

## References

- Wang et al. (2024). *Nature Scientific Data* 11:1277. DOI: 10.1038/s41597-024-04138-7
- Andrews LF (1972). "The six keys to normal occlusion." *Am J Orthod* 62(3):296–309
- Proffit WR. *Contemporary Orthodontics* 6e Ch. 8 (bone remodelling timeline)
- Cattaneo PM et al. (2005). "Moment-to-force ratio." *Am J Orthod Dentofacial Orthop*
- Shao et al. (2024). DeepSeekMath / GRPO. arXiv:2402.03300
- Yuan et al. (2023). RFT. arXiv:2308.01825
- Patwardhan et al. (2025). *GDPval: Evaluating AI Model Performance on
  Real-World Economically Valuable Tasks.* arXiv:2510.04374 (economic-importance frame)
- Grand View Research. *Clear Aligners Market Report* (2025). Market size and
  refinement-rate context

## Team

- **Mehul Arora**, Orthogonal Research and Education Lab
- **Vivek Mathur**, M.S. by Research, IIIT Hyderabad
- **Prof. Bradly Alicea**, UIUC / Orthogonal Research Lab

## License

MIT. Patient data inherits the original Tsinghua / OFJ / Bits2Bites licenses.