title: OrthoRL — Orthodontic Treatment Planning Environment
emoji: 🦷
colorFrom: blue
colorTo: green
sdk: docker
app_port: 7860
tags:
  - openenv
  - dental
  - orthodontics
  - reinforcement-learning
  - se3
  - medical-ai

OrthoRL — Orthodontic Treatment Planning RL Environment

First RL environment for orthodontic aligner staging. 24-step sequential SE(3) planning under delayed bone biomechanics, grounded in 200 real Tsinghua patients with vertex-segmented landmarks + 1,063 clinical profiles. Built for the OpenEnv AI India 2026 hackathon (Theme 3.1 World-Modeling for Professional Tasks).

Why this exists

Every year 12 million patients start clear-aligner treatment. Each plan is 24 sequential decisions under biomechanical constraint — get one wrong and you trigger root resorption or treatment failure. The published RL work either picks the final alignment (CLIK-Diffusion 2025) or makes coarse extraction choices (Li & Wang 2025); nobody plans the 24 intermediate stages. This environment lets an LLM agent do exactly that.

Economic context: the refinement trap

The clear-aligner market was $8.29 B in 2025 and is projected to reach $56.81 B by 2033 (Grand View Research). Yet only 6.0% of patients finish on the original plan; the rest need refinement scans because the plan failed, averaging 2.5 refinements per patient, and 1 in 6 switches from aligners to braces entirely because the digital plan never tracked. The mechanism is the one OrthoRL targets: planners draw straight-line SLERP paths between tooth poses, but bone remodelling is delayed and biological — so teeth lose tracking and plans require mid-course corrections.

OpenAI's GDPval (Patwardhan et al. 2025) measured frontier models against human experts on 44 occupations across the top-9 GDP-contributing US sectors and reported a 40–49% deliverable win rate, "approximately 100× faster, at a fraction of the cost." Dentistry / orthodontics is not among the 44 occupations. OrthoRL is the training environment GDPval skipped: it uses the same head-to-head methodology (trained agent vs SLERP baseline on identical held-out patients), applied to a domain where reducing refinements saves ~20% of aligners per case and AI automation is documented at ~80× planning speedup.

What it is


| Aspect | Details |
| --- | --- |
| Episode | 24-step sequential commit-and-feedback loop |
| State | 28 teeth × `[qw, qx, qy, qz, tx, ty, tz]` (SE(3)) in canonical `dental_v1` frame |
| Action | per-stage tooth-fraction plan, parsed by `parse_completion_to_poses` |
| Tools | `inspect_tooth`, `simulate_step`, `check_collisions`, `commit_stage`, `rollback_stage`, `diagnose_angle_class`, `measure_crowding`, `measure_overbite` |
| Five rewards | `terminal · occlusion · strategy · format · anchorage` (all algorithmic; no LLM-as-judge) |
| Reward range | `[-2, +8]` with hard-fail overrides (collision −1.0, PDL stress −0.5) |
| Datasets | 1,063 Tsinghua profiles · 200 Tsinghua landmark patients · 17 Open-Full-Jaw · 200 Bits2Bites |
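
To make the state layout concrete, here is a minimal sketch of a straight-line SLERP plan over a 28 × 7 pose array in the `[qw, qx, qy, qz, tx, ty, tz]` order listed above. It is an illustration only: the helper names `slerp` and `slerp_plan`, the toy start and goal poses, and the choice of stage fraction `k / 24` are assumptions, not code from the repository.

```python
import math

N_TEETH, N_STAGES = 28, 24

def slerp(q0, q1, t):
    """Interpolate between two unit quaternions (w, x, y, z) at fraction t."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:                      # flip one end to take the shorter arc
        q1 = [-c for c in q1]
        dot = -dot
    if dot > 0.9995:                   # nearly parallel: plain lerp is stable
        q = [a + t * (b - a) for a, b in zip(q0, q1)]
    else:
        theta = math.acos(dot)
        s0 = math.sin((1 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        q = [s0 * a + s1 * b for a, b in zip(q0, q1)]
    norm = math.sqrt(sum(c * c for c in q))
    return [c / norm for c in q]

def slerp_plan(start, goal, n_stages=N_STAGES):
    """Return n_stages lists of 7-element poses; stage k sits k/n_stages of the way."""
    plan = []
    for k in range(1, n_stages + 1):
        t = k / n_stages
        stage = []
        for p0, p1 in zip(start, goal):
            quat = slerp(p0[:4], p1[:4], t)
            trans = [(1 - t) * a + t * b for a, b in zip(p0[4:], p1[4:])]
            stage.append(quat + trans)
        plan.append(stage)
    return plan

# Toy case: identity start, every tooth moves 2.4 mm along x,
# and tooth 0 also rotates 90 degrees about z.
start = [[1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0] for _ in range(N_TEETH)]
goal = [[1.0, 0.0, 0.0, 0.0, 2.4, 0.0, 0.0] for _ in range(N_TEETH)]
goal[0][:4] = [math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4)]
plan = slerp_plan(start, goal)
```

`plan[k]` holds the 28 poses targeted at stage `k + 1`, and the last stage equals the goal. A plan like this ignores any delay between commanded and realized movement, which is the gap the environment is built to expose.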

Key innovations

- Pharmacokinetic force decay: translational deltas are convolved with a 5-tap kernel [0.10, 0.30, 0.40, 0.15, 0.05] that models bone-remodelling delay (Proffit Ch. 8). The kernel peaks at its third tap, so the agent has to plan ~2 stages ahead of where it wants the tooth to land. With decay enabled, the SLERP baseline's reward drops in the expected direction, verified across 5 seeds.
- Empirical anchorage prior: mined from 195 real treatments (n=5,089 tooth-class observations). Real molar median movement = 0.89 mm, incisor = 2.42 mm, so the agent gets a bounded reward signal that mirrors the population pattern. Not hand-tuned.
- Three-tier held-out eval: 250 Tsinghua test + 17 OFJ + 40 Bits2Bites IDs frozen in server/eval_split.py; EvalRegistry.assert_training_legal() raises on any leakage in train mode (test pinned).
- Mesh-based collision detection on real per-tooth vertex segmentation; ellipsoid fallback for synthetic cases.
- 5-stage training pipeline: Format SFT → Tool-use SFT → Behavioural-cloning SFT → GRPO (5 reward funcs) → Rejection-sampling FT.
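
The force-decay item can be sketched in a few lines. The kernel values come from the list above; everything else is an assumption made for illustration: the causal alignment (tap `j` weights the command issued `j` stages earlier), the function name `realized_deltas`, and the toy command sequences. The actual implementation in `server/force_decay` may differ.

```python
KERNEL = [0.10, 0.30, 0.40, 0.15, 0.05]  # taps quoted above; they sum to 1.0

def realized_deltas(commanded, kernel=KERNEL):
    """Causal convolution: movement at stage k mixes commands from stages k, k-1, ..., k-4."""
    out = []
    for k in range(len(commanded)):
        out.append(sum(w * commanded[k - j]
                       for j, w in enumerate(kernel) if k - j >= 0))
    return out

# Impulse: a single 1.0 mm command at stage 0 and nothing afterwards.
impulse = [1.0] + [0.0] * 7
response = realized_deltas(impulse)
peak_stage = max(range(len(response)), key=response.__getitem__)  # stage 2

# Constant 0.1 mm per stage over 24 stages (a straight-line plan).
commanded = [0.1] * 24
realized = realized_deltas(commanded)
shortfall = sum(commanded) - sum(realized)  # about 0.175 mm still pending at the end
```

Under these assumptions the largest share of a command lands two stages after it is issued, which is why a plan that ignores the delay ends the episode short of its target.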

Live demo

curl https://sri-manikanta-orthorl.hf.space/health
# {"status":"healthy"}

curl -X POST https://sri-manikanta-orthorl.hf.space/reset_stepwise \
  -H 'Content-Type: application/json' \
  -d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}'
# Real Patient 0001 (Tsinghua) → 28×7 poses in dental_v1 frame
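
The same reset call can be made from Python. The sketch below uses only the standard library and mirrors the URL and JSON body of the curl command above; the helper names `build_reset_request` and `reset_stepwise` are illustrative, and the response schema is not documented here, so the body is returned as decoded JSON without further parsing.

```python
import json
import urllib.request

BASE_URL = "https://sri-manikanta-orthorl.hf.space"

def build_reset_request(task_id="tsinghua/0001", mode="eval", tier=1,
                        eval_idx=0, base_url=BASE_URL):
    """Build the POST request for /reset_stepwise without sending it."""
    payload = {"task_id": task_id, "mode": mode, "tier": tier, "eval_idx": eval_idx}
    return urllib.request.Request(
        f"{base_url}/reset_stepwise",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def reset_stepwise(**kwargs):
    """Send the request and return the decoded JSON response."""
    with urllib.request.urlopen(build_reset_request(**kwargs), timeout=30) as resp:
        return json.loads(resp.read().decode("utf-8"))

req = build_reset_request()  # nothing is sent until reset_stepwise() is called
```

Calling `reset_stepwise(base_url="http://localhost:7860")` would target a local server instead of the hosted Space.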

Local quick-start

git clone https://github.com/mehular0ra/orthorl
cd orthorl
uv sync
uv run python -m server.app                  # FastAPI on :7860
curl http://localhost:7860/health
make test                                    # 227 tests pass in ~60 s

Training

The full SFT → GRPO → RFT chain runs on a single A10G or T4. The adapter is pushed to sri-manikanta/orthorl-grpo at the end.

# end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40)
bash scripts/run_full_pipeline.sh

# or just one stage (e.g. re-run GRPO without re-running SFT):
STAGES="3" bash scripts/run_full_pipeline.sh

Colab driver: notebooks/colab_a10g.py.

Logging

Every run captures the following by default, with no flag combinations to remember:

| File | Frequency | Content |
| --- | --- | --- |
| `logs/run_<TS>.log` | streaming (tee) | every stdout line of every stage |
| `logs/grpo_samples.jsonl` | every 10 steps | step / loss / lr / per-reward-fn means / completion length |
| `logs/grpo_completions.jsonl` | every 10 steps | full sample completion text + slim breakdown (closes the W&B-only-scalars gap) |
| `checkpoints/grpo/trainer_state.json` | every TRL step | TRL log_history with all reward fns |
| `research/results.tsv` | once at training end | autolog row per `multiautoresearch` discipline |
| `results/reward_curves.png` | once at training end | matplotlib plot of every `rewards/*/mean` |
| W&B (if `WANDB_API_KEY` is in env) | every step | live cloud dashboard (auto-on) |
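
For a quick look at the JSONL logs without opening a dashboard, a small reader along these lines may help. It is a sketch under a loud assumption: that each row of `logs/grpo_samples.jsonl` is a JSON object whose per-reward means use keys of the form `rewards/<name>/mean`, matching the pattern named for the plot. The real key names are not documented here, and `mean_rewards` is a hypothetical helper.

```python
import json
from collections import defaultdict

def mean_rewards(lines):
    """Average every numeric field whose key starts with 'rewards/' across all rows."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for line in lines:
        line = line.strip()
        if not line:            # tolerate blank lines in the log
            continue
        row = json.loads(line)
        for key, value in row.items():
            if key.startswith("rewards/") and isinstance(value, (int, float)):
                totals[key] += value
                counts[key] += 1
    return {key: totals[key] / counts[key] for key in totals}

# Usage (assumed key layout):
# with open("logs/grpo_samples.jsonl", encoding="utf-8") as fh:
#     print(mean_rewards(fh))
```

The function takes any iterable of lines, so it works on an open file handle or on a list of strings in a test.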

After training, generate the slide kit:

uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3
uv run python eval.py --policy checkpoints/rft  --tier 1 --seeds 3
uv run python scripts/build_demo_plots.py        # 5 PNGs for the deck
uv run python scripts/build_ablation_matrix.py    # spec 1.16 table

Results

Live numbers are populated after training. The SLERP baseline below is locked in; trained-checkpoint rows are filled in from eval_summary.csv after the A10G run (use scripts/build_ablation_matrix.py to regenerate).

| Policy | Tier 1 (Tsinghua test, N=250) | Tier 2 (OFJ, N=17) | Tier 3 (Bits2Bites, N=40) |
| --- | --- | --- | --- |
| SLERP baseline | 0.7231 ± 0.036 | - | - |
| Stage 0 (format SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
| Stage 2 (BC SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
| Stage 3 (GRPO) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
| Stage 4 (RFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
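
The tier columns only mean something if the held-out IDs never reach training. A leakage guard of the kind the README attributes to `EvalRegistry.assert_training_legal()` could look roughly like the sketch below. It is not the repository's code: `HELD_OUT_IDS`, `LeakageError`, `assert_no_leakage`, and every ID string are placeholders invented for illustration, and the real frozen lists live in `server/eval_split.py`.

```python
# Placeholder IDs only; the real frozen lists are in server/eval_split.py.
HELD_OUT_IDS = {
    "tier1_tsinghua": frozenset({"tsinghua/0001", "tsinghua/0002"}),
    "tier2_ofj": frozenset({"ofj/0001"}),
    "tier3_bits2bites": frozenset({"bits2bites/0001"}),
}

class LeakageError(RuntimeError):
    """Raised when a held-out ID shows up in a training set."""

def assert_no_leakage(training_ids, mode="train"):
    """Raise if any held-out ID is present while in train mode; no-op otherwise."""
    if mode != "train":
        return
    held_out = frozenset().union(*HELD_OUT_IDS.values())
    leaked = sorted(held_out.intersection(training_ids))
    if leaked:
        raise LeakageError(f"held-out IDs in training data: {leaked}")
```

Calling `assert_no_leakage(batch_ids)` before each training run turns silent contamination into an immediate failure.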

The slide-ready figures that the demo deck embeds are results/reward_curves.png, results/eval_three_tier.png, results/case_type_reward.png, results/anchorage_summary.png, results/safeguards.png, and results/ablation_matrix.png.

Repository layout

server/                      env modules (clinical_profiles, coord_frame,
                             eval_split, expert_stager, force_decay,
                             landmark_loader, mesh_collision,
                             movement_priors, reward_scaler, ...)
scripts/                     SFT/GRPO/RFT/eval pipeline drivers
                             + cache_oracle, build_demo_plots, build_ablation_matrix
specs/                       22 numbered spec files + VALIDATION_TRACKER.md
tests/                       227 passing pytest cases
data/                        SFT JSONL datasets (committed; deterministic)
datasets/                    landmark/case data (gitignored bulk; case_database.json committed)
checkpoints/                 SFT/GRPO/RFT adapters (gitignored)
results/                     eval CSV + slide-ready PNGs
research/                    pitch_script.md, results.tsv, code_review_*.md
notebooks/colab_a10g.py      one-cell launcher for the full pipeline
train_grpo.py                main GRPO entrypoint
eval.py                      held-out evaluation CLI
prepare.py / server/grader.py  IMMUTABLE benchmark (read-only)

Tests

make test           # 227 passing in ~60 s
make fast-check     # high-value subset (<5 s) used by the pre-commit hook
make install-precommit  # repo-local hook so the next regression doesn't ship

Specs

22 numbered specs in specs/. Tier 1 (must-haves): 1.1–1.16, all shipped except 1.16 (the ablation matrix table — auto-fills post-eval). Tier 2 (separators): 2.1–2.6, all shipped. Tier 3+ are reach goals.

References

Wang et al. (2024). Nature Scientific Data 11:1277. DOI: 10.1038/s41597-024-04138-7
Andrews LF (1972). "The six keys to normal occlusion." Am J Orthod 62(3):296–309
Proffit WR. Contemporary Orthodontics 6e Ch. 8 — bone remodelling timeline
Cattaneo PM et al. (2005). "Moment-to-force ratio." Am J Orthod Dentofacial Orthop
Shao et al. (2024). DeepSeekMath / GRPO. arXiv:2402.03300
Yuan et al. (2023). RFT. arXiv:2308.01825
Patwardhan et al. (2025). GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv:2510.04374 — economic-importance frame
Grand View Research. Clear Aligners Market Report (2025) — market size, refinement-rate context

Team

Mehul Arora — Orthogonal Research and Education Lab
Vivek Mathur — M.S. by Research, IIIT Hyderabad
Prof. Bradly Alicea — UIUC / Orthogonal Research Lab

License

MIT. Patient data inherits the original Tsinghua / OFJ / Bits2Bites licenses.