title: OrthoRL — Orthodontic Treatment Planning Environment
emoji: 🦷
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
- openenv
- dental
- orthodontics
- reinforcement-learning
- se3
- medical-ai
OrthoRL — Orthodontic Treatment Planning RL Environment
First RL environment for orthodontic aligner staging. 24-step sequential SE(3) planning under delayed bone biomechanics, grounded in 200 real Tsinghua patients with vertex-segmented landmarks + 1,063 clinical profiles. Built for the OpenEnv AI India 2026 hackathon (Theme 3.1 World-Modeling for Professional Tasks).
Why this exists
Every year 12 million patients start clear-aligner treatment. Each plan is 24 sequential decisions under biomechanical constraint — get one wrong and you trigger root resorption or treatment failure. The published RL work either picks the final alignment (CLIK-Diffusion 2025) or makes coarse extraction choices (Li & Wang 2025); nobody plans the 24 intermediate stages. This environment lets an LLM agent do exactly that.
Economic context: the refinement trap
The clear-aligner market was $8.29 B in 2025 and is projected to reach $56.81 B by 2033 (Grand View Research). Yet only 6.0% of patients finish on the original plan. The rest need refinement scans because the plan failed, averaging 2.5 refinements per patient, and 1 in 6 switches from aligners to braces entirely because the digital plan never tracked. The mechanism is the one OrthoRL targets: planners draw straight-line SLERP paths between tooth poses, but bone remodelling is delayed and biological, so teeth lose tracking and plans require mid-course corrections.
OpenAI's GDPval (Patwardhan et al. 2025) measured frontier models against human experts on 44 occupations across the top-9 GDP-contributing US sectors and reported 40–49% deliverable win-rate "approximately 100× faster, at a fraction of the cost." Dentistry / orthodontics is not among the 44 occupations. OrthoRL is the training environment GDPval skipped: same head-to-head methodology (trained agent vs SLERP baseline on identical held-out patients), applied to a domain where reducing refinements saves ~20% of aligners per case and AI automation is documented at ~80× planning speedup.
What it is
| Aspect | Details |
|---|---|
| Episode | 24-step sequential commit-and-feedback loop |
| State | 28 teeth × [qw, qx, qy, qz, tx, ty, tz] (SE(3)) in canonical dental_v1 frame |
| Action | per-stage tooth-fraction plan, parsed by parse_completion_to_poses |
| Tools | inspect_tooth, simulate_step, check_collisions, commit_stage, rollback_stage, diagnose_angle_class, measure_crowding, measure_overbite |
| Five rewards | terminal · occlusion · strategy · format · anchorage (all algorithmic; no LLM-as-judge) |
| Reward range | [-2, +8] with hard-fail overrides (collision −1.0, PDL stress −0.5) |
| Datasets | 1,063 Tsinghua profiles · 200 Tsinghua landmark patients · 17 Open-Full-Jaw · 200 Bits2Bites |
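To make the state layout and the SLERP baseline concrete, here is a minimal sketch that interpolates 28 poses of the form [qw, qx, qy, qz, tx, ty, tz] from a start arrangement to a goal arrangement over 24 stages. The function names and the uniform k/24 schedule are assumptions for illustration, not the repository's own baseline code, and the input quaternions are assumed to be unit length.

```python
import math

def slerp(q0, q1, t):
    """Spherical linear interpolation between two unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly parallel: linear interpolation avoids dividing by a tiny sine.
        out = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        out = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in out))
    return tuple(c / norm for c in out)

def interpolate_pose(start, goal, t):
    """Interpolate one 7-value pose: SLERP on rotation, straight line on translation."""
    rotation = slerp(tuple(start[:4]), tuple(goal[:4]), t)
    translation = tuple(a + t * (b - a) for a, b in zip(start[4:], goal[4:]))
    return rotation + translation

def slerp_plan(start_poses, goal_poses, n_stages=24):
    """Return n_stages stages; stage k (1-based) sits at fraction k / n_stages."""
    return [
        [interpolate_pose(s, g, k / n_stages) for s, g in zip(start_poses, goal_poses)]
        for k in range(1, n_stages + 1)
    ]

# Hypothetical case: one tooth rotates 90 degrees about z and shifts 2.4 mm in x.
identity = (1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
half = math.sqrt(0.5)
goal_tooth = (half, 0.0, 0.0, half, 2.4, 0.0, 0.0)
plan = slerp_plan([identity] * 28, [goal_tooth] + [identity] * 27)
print(len(plan), len(plan[0]), len(plan[0][0]))
```

A plan like this ignores any delay between commanded and realised movement, which is the gap the environment is built to expose.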
Key innovations
- Pharmacokinetic force decay: translational deltas are convolved with a 5-tap kernel [0.10, 0.30, 0.40, 0.15, 0.05] modelling bone-remodelling delay (Proffit Ch. 8). The agent has to plan ~2 stages ahead of where it wants the tooth to land. Verified directional drop on SLERP across 5 seeds.
- Empirical anchorage prior: mined from 195 real treatments (n=5,089 tooth-class observations). Real molar median = 0.89 mm, incisor = 2.42 mm, so the agent gets a bounded reward signal that mirrors the population pattern. Not hand-tuned.
- Three-tier held-out eval: 250 Tsinghua test + 17 OFJ + 40 Bits2Bites IDs frozen in server/eval_split.py; EvalRegistry.assert_training_legal() raises on any leakage in train mode (test pinned).
- Mesh-based collision detection on real per-tooth vertex segmentation; ellipsoid fallback for synthetic cases.
- 5-stage training pipeline: Format SFT → Tool-use SFT → Behavioural-cloning SFT → GRPO (5 reward funcs) → Rejection-sampling FT.
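As a rough illustration of the force-decay item above, the sketch below applies the quoted 5-tap kernel as a causal convolution over one tooth's per-stage translational deltas along a single axis. The function names, the causal alignment, and the truncation at the start of the episode are assumptions made for illustration; the environment's own force_decay module may differ.

```python
# Kernel quoted in the list above; the weights sum to 1.0.
KERNEL = [0.10, 0.30, 0.40, 0.15, 0.05]

def realised_deltas(commanded, kernel=KERNEL):
    """Causal convolution: realised[t] = sum_k kernel[k] * commanded[t - k]."""
    out = []
    for t in range(len(commanded)):
        acc = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:  # stages before the episode start contribute nothing
                acc += weight * commanded[t - k]
        out.append(acc)
    return out

def cumulative(deltas):
    """Running sum of per-stage deltas, i.e. the position after each stage."""
    total, path = 0.0, []
    for delta in deltas:
        total += delta
        path.append(total)
    return path

# Hypothetical plan: command 0.1 mm per stage for 24 stages (2.4 mm in total).
commanded = [0.1] * 24
realised = realised_deltas(commanded)
print(f"commanded total: {sum(commanded):.3f} mm")
print(f"realised total:  {cumulative(realised)[-1]:.3f} mm")
```

Under these assumptions a single commanded movement has its largest effect two stages later, and a constant-rate plan ends short of its commanded total because the tail of the kernel falls past the final stage. That is consistent with the plan-about-2-stages-ahead behaviour described above.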
Live demo
curl https://sri-manikanta-orthorl.hf.space/health
# {"status":"healthy"}
curl -X POST https://sri-manikanta-orthorl.hf.space/reset_stepwise \
-H 'Content-Type: application/json' \
-d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}'
# Real Patient 0001 (Tsinghua) → 28×7 poses in dental_v1 frame
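The same reset call can be made from Python with only the standard library. This is a sketch: the helper names build_reset_request and post_json are hypothetical, the payload mirrors the curl example above, and the shape of the JSON reply is not specified here, so the code makes no assumption about its fields.

```python
import json
import urllib.request

BASE_URL = "https://sri-manikanta-orthorl.hf.space"

def build_reset_request(task_id, mode="eval", tier=1, eval_idx=0, base_url=BASE_URL):
    """Build the POST request for /reset_stepwise with the fields from the curl example."""
    payload = {"task_id": task_id, "mode": mode, "tier": tier, "eval_idx": eval_idx}
    return urllib.request.Request(
        f"{base_url}/reset_stepwise",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def post_json(request, timeout=30.0):
    """Send a prepared request and decode the JSON body of the reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))

request = build_reset_request("tsinghua/0001")
print(request.get_method(), request.full_url)
```

Calling post_json(request) against a running server (the hosted Space or a local instance on port 7860 via base_url="http://localhost:7860") sends the request; a sleeping Space may need time to wake before it responds.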
Local quick-start
git clone https://github.com/mehular0ra/orthorl
cd orthorl
uv sync
uv run python -m server.app # FastAPI on :7860
curl http://localhost:7860/health
make test # 227 tests pass in ~60 s
Training
The full SFT → GRPO → RFT chain runs on a single A10G or T4. The adapter
is pushed to sri-manikanta/orthorl-grpo at the end.
# end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40)
bash scripts/run_full_pipeline.sh
# or just one stage (e.g. re-run GRPO without re-running SFT):
STAGES="3" bash scripts/run_full_pipeline.sh
Colab driver: notebooks/colab_a10g.py.
Logging
Every run captures the following automatically, with no flag combinations to remember:
| File | Frequency | Content |
|---|---|---|
| logs/run_<TS>.log | streaming (tee) | every stdout line of every stage |
| logs/grpo_samples.jsonl | every 10 steps | step / loss / lr / per-reward-fn means / completion length |
| logs/grpo_completions.jsonl | every 10 steps | full sample completion text + slim breakdown (closes the W&B-only-scalars gap) |
| checkpoints/grpo/trainer_state.json | every TRL step | TRL log_history with all reward fns |
| research/results.tsv | once at training end | autolog row per multiautoresearch discipline |
| results/reward_curves.png | once at training end | matplotlib plot of every rewards/*/mean |
| W&B (if WANDB_API_KEY is in env) | every step | live cloud dashboard (auto-on) |
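For a quick look at the JSONL logs without opening W&B, a small reader like the one below can average the per-reward means. It is a sketch under one loud assumption: that each record stores its reward means under keys shaped like rewards/<name>/mean, the pattern quoted for the reward-curve plot. The actual field names in logs/grpo_samples.jsonl are not documented here, and the function name reward_means is hypothetical.

```python
import json
from collections import defaultdict

def reward_means(lines):
    """Average every numeric `rewards/<name>/mean` field across JSONL records."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        for key, value in record.items():
            is_reward_mean = key.startswith("rewards/") and key.endswith("/mean")
            if is_reward_mean and isinstance(value, (int, float)):
                sums[key] += value
                counts[key] += 1
    return {key: sums[key] / counts[key] for key in sums}

# Hypothetical records; real logs may use different key names.
example = [
    '{"step": 10, "loss": 0.42, "rewards/format/mean": 0.5}',
    '{"step": 20, "loss": 0.37, "rewards/format/mean": 1.0}',
]
print(reward_means(example))
```

To run it on a real log, pass an open file handle, for example reward_means(open("logs/grpo_samples.jsonl")), and adjust the key filter if the records use other names.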
After training, generate the slide kit:
uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3
uv run python eval.py --policy checkpoints/rft --tier 1 --seeds 3
uv run python scripts/build_demo_plots.py # 5 PNGs for the deck
uv run python scripts/build_ablation_matrix.py # spec 1.16 table
Results
Live numbers are populated post-training. The SLERP baseline below is locked in; trained-checkpoint rows fill in from
eval_summary.csv after the A10G run (use scripts/build_ablation_matrix.py to regenerate).
| Policy | Tier 1 (Tsinghua test, N=250) | Tier 2 (OFJ, N=17) | Tier 3 (Bits2Bites, N=40) |
|---|---|---|---|
| SLERP baseline | 0.7231 ±0.036 | — | — |
| Stage 0 (format SFT) | {{ post-training }} | {{ post-training }} | {{ post-training }} |
| Stage 2 (BC SFT) | {{ post-training }} | {{ post-training }} | {{ post-training }} |
| Stage 3 (GRPO) | {{ post-training }} | {{ post-training }} | {{ post-training }} |
| Stage 4 (RFT) | {{ post-training }} | {{ post-training }} | {{ post-training }} |
results/reward_curves.png, results/eval_three_tier.png,
results/case_type_reward.png, results/anchorage_summary.png,
results/safeguards.png, and results/ablation_matrix.png are the
slide-ready figures the demo deck embeds.
Repository layout
server/ env modules (clinical_profiles, coord_frame,
eval_split, expert_stager, force_decay,
landmark_loader, mesh_collision,
movement_priors, reward_scaler, ...)
scripts/ SFT/GRPO/RFT/eval pipeline drivers
+ cache_oracle, build_demo_plots, build_ablation_matrix
specs/ 22 numbered spec files + VALIDATION_TRACKER.md
tests/ 227 passing pytest cases
data/ SFT JSONL datasets (committed; deterministic)
datasets/ landmark/case data (gitignored bulk; case_database.json committed)
checkpoints/ SFT/GRPO/RFT adapters (gitignored)
results/ eval CSV + slide-ready PNGs
research/ pitch_script.md, results.tsv, code_review_*.md
notebooks/colab_a10g.py one-cell launcher for the full pipeline
train_grpo.py main GRPO entrypoint
eval.py held-out evaluation CLI
prepare.py / server/grader.py IMMUTABLE benchmark (read-only)
Tests
make test # 227 passing in ~60 s
make fast-check # high-value subset (<5 s) used by the pre-commit hook
make install-precommit # repo-local hook so the next regression doesn't ship
Specs
22 numbered specs in specs/. Tier 1 (must-haves): 1.1–1.16, all
shipped except 1.16 (the ablation matrix table — auto-fills post-eval).
Tier 2 (separators): 2.1–2.6, all shipped. Tier 3+ are reach goals.
References
- Wang et al. (2024). Nature Scientific Data 11:1277. DOI: 10.1038/s41597-024-04138-7
- Andrews LF (1972). "The six keys to normal occlusion." Am J Orthod 62(3):296–309
- Proffit WR. Contemporary Orthodontics 6e Ch. 8 — bone remodelling timeline
- Cattaneo PM et al. (2005). "Moment-to-force ratio." Am J Orthod Dentofacial Orthop
- Shao et al. (2024). DeepSeekMath / GRPO. arXiv:2402.03300
- Yuan et al. (2023). RFT. arXiv:2308.01825
- Patwardhan et al. (2025). GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv:2510.04374 — economic-importance frame
- Grand View Research. Clear Aligners Market Report (2025) — market size, refinement-rate context
Team
- Mehul Arora — Orthogonal Research and Education Lab
- Vivek Mathur — M.S. by Research, IIIT Hyderabad
- Prof. Bradly Alicea — UIUC / Orthogonal Research Lab
License
MIT. Patient data inherits the original Tsinghua / OFJ / Bits2Bites licenses.