metadata
title: OrthoRL  Orthodontic Treatment Planning Environment
emoji: 🦷
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - dental
  - orthodontics
  - reinforcement-learning
  - se3
  - medical-ai

OrthoRL — Orthodontic Treatment Planning RL Environment

First RL environment for orthodontic aligner staging. 24-step sequential SE(3) planning under delayed bone biomechanics, grounded in 200 real Tsinghua patients with vertex-segmented landmarks + 1,063 clinical profiles. Built for the OpenEnv AI India 2026 hackathon (Theme 3.1 World-Modeling for Professional Tasks).



Why this exists

Every year 12 million patients start clear-aligner treatment. Each plan is 24 sequential decisions under biomechanical constraint — get one wrong and you trigger root resorption or treatment failure. The published RL work either picks the final alignment (CLIK-Diffusion 2025) or makes coarse extraction choices (Li & Wang 2025); nobody plans the 24 intermediate stages. This environment lets an LLM agent do exactly that.

Economic context: the refinement trap

The clear-aligner market was $8.29 B in 2025 and is projected to reach $56.81 B by 2033 (Grand View Research). Yet only 6.0% of patients finish on the original plan; the rest need refinement scans because the plan failed, averaging 2.5 refinements per patient, and 1 in 6 switches from aligners to braces entirely because the digital plan never tracked. The mechanism is the one OrthoRL targets: planners draw straight-line SLERP paths between tooth poses, but bone remodelling is delayed and biological — so teeth lose tracking and plans require mid-course corrections.

OpenAI's GDPval (Patwardhan et al. 2025) measured frontier models against human experts on 44 occupations across the top-9 GDP-contributing US sectors and reported 40–49% deliverable win-rate "approximately 100× faster, at a fraction of the cost." Dentistry / orthodontics is not among the 44 occupations. OrthoRL is the training environment GDPval skipped: same head-to-head methodology (trained agent vs SLERP baseline on identical held-out patients), applied to a domain where reducing refinements saves ~20% of aligners per case and AI automation is documented at ~80× planning speedup.


What it is

| | |
|---|---|
| Episode | 24-step sequential commit-and-feedback loop |
| State | 28 teeth × [qw, qx, qy, qz, tx, ty, tz] (SE(3)) in canonical dental_v1 frame |
| Action | per-stage tooth-fraction plan, parsed by parse_completion_to_poses |
| Tools | inspect_tooth, simulate_step, check_collisions, commit_stage, rollback_stage, diagnose_angle_class, measure_crowding, measure_overbite |
| Five rewards | terminal · occlusion · strategy · format · anchorage (all algorithmic, no LLM-as-judge) |
| Reward range | [-2, +8] with hard-fail overrides (collision −1.0, PDL stress −0.5) |
| Datasets | 1,063 Tsinghua profiles · 200 Tsinghua landmark patients · 17 Open-Full-Jaw · 200 Bits2Bites |
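To make the state layout and the SLERP baseline concrete, here is a minimal sketch of a straight-line plan over 24 stages. It assumes poses are stored as a (28, 7) NumPy array in the column order given above; `slerp` and `slerp_plan` are illustrative names, not the repository's functions, and the even fraction schedule (k+1)/24 is an assumption.

```python
import numpy as np

N_TEETH, N_STAGES = 28, 24  # values taken from the table above


def slerp(q0, q1, t):
    """Spherical linear interpolation between quaternions in (w, x, y, z) order."""
    q0 = q0 / np.linalg.norm(q0)
    q1 = q1 / np.linalg.norm(q1)
    dot = float(np.dot(q0, q1))
    if dot < 0.0:  # q and -q encode the same rotation; take the shorter arc
        q1, dot = -q1, -dot
    if dot > 0.9995:  # nearly parallel: normalised linear interpolation is stable
        q = q0 + t * (q1 - q0)
        return q / np.linalg.norm(q)
    theta = np.arccos(dot)
    return (np.sin((1 - t) * theta) * q0 + np.sin(t * theta) * q1) / np.sin(theta)


def slerp_plan(start, goal, n_stages=N_STAGES):
    """Return an (n_stages, n_teeth, 7) plan; stage k sits at fraction (k+1)/n_stages."""
    plan = np.empty((n_stages,) + start.shape)
    for k in range(n_stages):
        t = (k + 1) / n_stages
        for i in range(start.shape[0]):
            plan[k, i, :4] = slerp(start[i, :4], goal[i, :4], t)       # rotation
            plan[k, i, 4:] = (1 - t) * start[i, 4:] + t * goal[i, 4:]  # translation
    return plan
```

A plan built this way ignores collisions and remodelling delay entirely, which is why it serves as a baseline rather than a target.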

Key innovations

  1. Pharmacokinetic force decay: translational deltas are convolved with a 5-tap kernel [0.10, 0.30, 0.40, 0.15, 0.05] that models bone-remodelling delay (Proffit Ch. 8), so the agent has to plan ~2 stages ahead of where it wants the tooth to land. The expected directional drop in the SLERP baseline's score was verified across 5 seeds.
  2. Empirical anchorage prior mined from 195 real treatments (n=5,089 tooth-class observations). The real molar median is 0.89 mm and the incisor median is 2.42 mm, so the agent gets a bounded reward signal that mirrors the population pattern rather than a hand-tuned one.
  3. Three-tier held-out eval: 250 Tsinghua test + 17 OFJ + 40 Bits2Bites IDs are frozen in server/eval_split.py; EvalRegistry.assert_training_legal() raises on any leakage in train mode (test pinned).
  4. Mesh-based collision detection on real per-tooth vertex segmentation, with an ellipsoid fallback for synthetic cases.
  5. 5-stage training pipeline: Format SFT → Tool-use SFT → Behavioural-cloning SFT → GRPO (5 reward funcs) → Rejection-sampling FT.
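The force-decay idea in item 1 can be illustrated with a short sketch. The kernel values come from the text; the causal alignment (tap j is the share of a command expressed j stages later), the truncation at the end of the episode, and the name `realised_deltas` are assumptions made for illustration and may differ from server/force_decay.

```python
import numpy as np

KERNEL = np.array([0.10, 0.30, 0.40, 0.15, 0.05])  # from item 1; sums to 1.0


def realised_deltas(commanded):
    """Causally convolve per-stage translational deltas (n_stages, 3) with KERNEL.

    Assumption: anything that would land after the last stage is dropped.
    """
    commanded = np.asarray(commanded, dtype=float)
    n_stages = commanded.shape[0]
    out = np.empty_like(commanded)
    for axis in range(commanded.shape[1]):
        out[:, axis] = np.convolve(commanded[:, axis], KERNEL)[:n_stages]
    return out


# Impulse response: command a single 1 mm move along x at stage 0.
cmd = np.zeros((24, 3))
cmd[0, 0] = 1.0
resp = realised_deltas(cmd)
peak_stage = int(resp[:, 0].argmax())  # stage at which most of the move appears
```

Under these assumptions the largest share of a commanded move shows up two stages after the command, which matches the "plan ~2 stages ahead" behaviour described above.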

Live demo

curl https://sri-manikanta-orthorl.hf.space/health
# {"status":"healthy"}

curl -X POST https://sri-manikanta-orthorl.hf.space/reset_stepwise \
  -H 'Content-Type: application/json' \
  -d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}'
# Real Patient 0001 (Tsinghua) → 28×7 poses in dental_v1 frame
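The same reset call can be made from Python. This is a minimal standard-library sketch: the URL and payload fields are copied from the curl example, while `build_reset_request` and `post_json` are hypothetical helper names, and the response schema is not documented here, so the result is treated as opaque JSON.

```python
import json
import urllib.request

BASE_URL = "https://sri-manikanta-orthorl.hf.space"


def build_reset_request(task_id="tsinghua/0001", mode="eval", tier=1,
                        eval_idx=0, base_url=BASE_URL):
    """Build (but do not send) the POST /reset_stepwise request from the curl example."""
    payload = {"task_id": task_id, "mode": mode, "tier": tier, "eval_idx": eval_idx}
    return urllib.request.Request(
        f"{base_url}/reset_stepwise",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def post_json(request, timeout=30):
    """Send the request and decode the JSON body; raises on HTTP or network errors."""
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read().decode("utf-8"))
```

Calling `post_json(build_reset_request())` sends the request to the live Space; pass `base_url="http://localhost:7860"` to target a local server instead.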

Local quick-start

git clone https://github.com/mehular0ra/orthorl
cd orthorl
uv sync
uv run python -m server.app                  # FastAPI on :7860
curl http://localhost:7860/health
make test                                    # 227 tests pass in ~60 s

Training

The full SFT → GRPO → RFT chain runs on a single A10G or T4. The adapter is pushed to sri-manikanta/orthorl-grpo at the end.

# end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40)
bash scripts/run_full_pipeline.sh

# or just one stage (e.g. re-run GRPO without re-running SFT):
STAGES="3" bash scripts/run_full_pipeline.sh

Colab driver: notebooks/colab_a10g.py.

Logging

Every run captures the following by default, with no flag combinations to remember:

| File | Frequency | Content |
|---|---|---|
| `logs/run_<TS>.log` | streaming (tee) | every stdout line of every stage |
| `logs/grpo_samples.jsonl` | every 10 steps | step / loss / lr / per-reward-fn means / completion length |
| `logs/grpo_completions.jsonl` | every 10 steps | full sample completion text + slim breakdown (closes the W&B-only-scalars gap) |
| `checkpoints/grpo/trainer_state.json` | every TRL step | TRL log_history with all reward fns |
| `research/results.tsv` | once at training end | autolog row per multiautoresearch discipline |
| `results/reward_curves.png` | once at training end | matplotlib plot of every `rewards/*/mean` |
| W&B (if WANDB_API_KEY is in env) | every step | live cloud dashboard (auto-on) |
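For ad-hoc inspection without W&B, the per-reward curves can be pulled out of the trainer state. The sketch below assumes trainer_state.json follows the usual Hugging Face layout (a top-level `log_history` list of dicts, each with a `step`) and that reward metrics are logged under keys matching `rewards/*/mean`, as the table indicates; `reward_series` and `load_reward_series` are hypothetical helper names.

```python
import fnmatch
import json
from collections import defaultdict
from pathlib import Path


def reward_series(state, pattern="rewards/*/mean"):
    """Collect {metric_name: [(step, value), ...]} from a trainer-state dict."""
    series = defaultdict(list)
    for entry in state.get("log_history", []):
        for key, value in entry.items():
            if fnmatch.fnmatch(key, pattern):
                series[key].append((entry.get("step"), value))
    return dict(series)


def load_reward_series(path="checkpoints/grpo/trainer_state.json"):
    """Read the trainer state from disk and extract the reward curves."""
    return reward_series(json.loads(Path(path).read_text()))
```

Each returned list can be passed straight to a plotting call or compared across runs.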

After training, generate the slide kit:

uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3
uv run python eval.py --policy checkpoints/rft  --tier 1 --seeds 3
uv run python scripts/build_demo_plots.py        # 5 PNGs for the deck
uv run python scripts/build_ablation_matrix.py    # spec 1.16 table

Results

Live numbers are populated post-training. The SLERP baseline below is locked in; trained-checkpoint rows are filled in from eval_summary.csv after the A10G run (use scripts/build_ablation_matrix.py to regenerate).

| Policy | Tier 1 (Tsinghua test, N=250) | Tier 2 (OFJ, N=17) | Tier 3 (Bits2Bites, N=40) |
|---|---|---|---|
| SLERP baseline | 0.7231 ± 0.036 | | |
| Stage 0 (format SFT) | {{ post-training }} | {{ post-training }} | {{ post-training }} |
| Stage 2 (BC SFT) | {{ post-training }} | {{ post-training }} | {{ post-training }} |
| Stage 3 (GRPO) | {{ post-training }} | {{ post-training }} | {{ post-training }} |
| Stage 4 (RFT) | {{ post-training }} | {{ post-training }} | {{ post-training }} |

results/reward_curves.png, results/eval_three_tier.png, results/case_type_reward.png, results/anchorage_summary.png, results/safeguards.png, and results/ablation_matrix.png are the slide-ready figures the demo deck embeds.


Repository layout

server/                      env modules (clinical_profiles, coord_frame,
                             eval_split, expert_stager, force_decay,
                             landmark_loader, mesh_collision,
                             movement_priors, reward_scaler, ...)
scripts/                     SFT/GRPO/RFT/eval pipeline drivers
                             + cache_oracle, build_demo_plots, build_ablation_matrix
specs/                       22 numbered spec files + VALIDATION_TRACKER.md
tests/                       227 passing pytest cases
data/                        SFT JSONL datasets (committed; deterministic)
datasets/                    landmark/case data (gitignored bulk; case_database.json committed)
checkpoints/                 SFT/GRPO/RFT adapters (gitignored)
results/                     eval CSV + slide-ready PNGs
research/                    pitch_script.md, results.tsv, code_review_*.md
notebooks/colab_a10g.py      one-cell launcher for the full pipeline
train_grpo.py                main GRPO entrypoint
eval.py                      held-out evaluation CLI
prepare.py / server/grader.py  IMMUTABLE benchmark (read-only)

Tests

make test           # 227 passing in ~60 s
make fast-check     # high-value subset (<5 s) used by the pre-commit hook
make install-precommit  # repo-local hook so the next regression doesn't ship

Specs

22 numbered specs in specs/. Tier 1 (must-haves): 1.1–1.16, all shipped except 1.16 (the ablation matrix table — auto-fills post-eval). Tier 2 (separators): 2.1–2.6, all shipped. Tier 3+ are reach goals.

References

  • Wang et al. (2024). Nature Scientific Data 11:1277. DOI: 10.1038/s41597-024-04138-7
  • Andrews LF (1972). "The six keys to normal occlusion." Am J Orthod 62(3):296–309
  • Proffit WR. Contemporary Orthodontics 6e Ch. 8 — bone remodelling timeline
  • Cattaneo PM et al. (2005). "Moment-to-force ratio." Am J Orthod Dentofacial Orthop
  • Shao et al. (2024). DeepSeekMath / GRPO. arXiv:2402.03300
  • Yuan et al. (2023). RFT. arXiv:2308.01825
  • Patwardhan et al. (2025). GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv:2510.04374 — economic-importance frame
  • Grand View Research. Clear Aligners Market Report (2025) — market size, refinement-rate context

Team

  • Mehul Arora — Orthogonal Research and Education Lab
  • Vivek Mathur — M.S. by Research, IIIT Hyderabad
  • Prof. Bradly Alicea — UIUC / Orthogonal Research Lab

License

MIT. Patient data inherits the original Tsinghua / OFJ / Bits2Bites licenses.