	---
	title: OrthoRL — Orthodontic Treatment Planning Environment
	emoji: 🦷
	colorFrom: blue
	colorTo: green
	sdk: docker
	app_port: 7860
	tags:
	- openenv
	- dental
	- orthodontics
	- reinforcement-learning
	- se3
	- medical-ai
	---

	# OrthoRL — Orthodontic Treatment Planning RL Environment

	> First RL environment for orthodontic aligner staging. 24-step sequential
	> SE(3) planning under delayed bone biomechanics, grounded in **200 real
	> Tsinghua patients with vertex-segmented landmarks** + 1,063 clinical
	> profiles. Built for the OpenEnv AI India 2026 hackathon (Theme 3.1
	> World-Modeling for Professional Tasks).

	[![Live Space](https://img.shields.io/badge/HF%20Space-live-success)](https://huggingface.co/spaces/sri-manikanta/orthorl)
	[![Tests](https://img.shields.io/badge/tests-227%20passing-brightgreen)](#tests)
	[![License](https://img.shields.io/badge/license-MIT-blue)](LICENSE)

	---

	## Why this exists

	Every year 12 million patients start clear-aligner treatment. Each
	plan is 24 sequential decisions under biomechanical constraint — get
	one wrong and you trigger root resorption or treatment failure. The
	published RL work either picks the final alignment (CLIK-Diffusion
	2025) or makes coarse extraction choices (Li & Wang 2025); **nobody
	plans the 24 intermediate stages**. This environment lets an LLM agent
	do exactly that.

	### Economic context: the refinement trap

	The clear-aligner market was $8.29 B in 2025 and is projected to
	reach $56.81 B by 2033 ([Grand View Research][gvr]). Yet only
	6.0% of patients finish on the original plan. The rest need
	refinement scans because the plan failed, averaging **2.5 refinements
	per patient, and 1 in 6 switch from aligners to braces entirely**
	because the digital plan never tracked. OrthoRL targets the
	underlying mechanism: planners draw straight-line SLERP paths between
	tooth poses, but bone remodelling is delayed and biological, so teeth
	lose tracking and plans require mid-course corrections.

	OpenAI's GDPval ([Patwardhan et al. 2025][gdpval]) measured
	frontier models against human experts on **44 occupations across the
	top-9 GDP-contributing US sectors** and reported a 40–49% deliverable
	win rate, with models working "approximately 100× faster, at a
	fraction of the cost." Dentistry and orthodontics are not among the
	44 occupations. OrthoRL is the training environment GDPval skipped:
	it uses the same head-to-head methodology (trained agent vs SLERP
	baseline on identical held-out patients), applied to a domain where
	reducing refinements saves ~20% of aligners per case and AI
	automation is documented at ~80× planning speedup.

	[gvr]: https://www.grandviewresearch.com/industry-analysis/clear-aligners-market
	[gdpval]: https://arxiv.org/abs/2510.04374

	---

	## What it is

	| Aspect | Details |
	|---|---|
	| Episode | 24-step sequential commit-and-feedback loop |
	| State | 28 teeth × `[qw, qx, qy, qz, tx, ty, tz]` (SE(3)) in canonical `dental_v1` frame |
	| Action | per-stage tooth-fraction plan, parsed by `parse_completion_to_poses` |
	| Tools | `inspect_tooth`, `simulate_step`, `check_collisions`, `commit_stage`, `rollback_stage`, `diagnose_angle_class`, `measure_crowding`, `measure_overbite` |
	| Five rewards | `terminal · occlusion · strategy · format · anchorage` (all algorithmic; no LLM-as-judge) |
	| Reward range | `[-2, +8]` with hard-fail overrides (collision −1.0, PDL stress −0.5) |
	| Datasets | 1,063 Tsinghua profiles · 200 Tsinghua landmark patients · 17 Open-Full-Jaw · 200 Bits2Bites |
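
To make the State and Episode rows concrete, here is a minimal sketch of a straight-line SLERP plan over the `[qw, qx, qy, qz, tx, ty, tz]` pose layout, one tooth at a time. The function names (`slerp`, `interpolate_pose`, `slerp_plan`) are illustrative and not taken from this repo, and interpolating the translation linearly alongside the rotation is an assumption about what a "straight-line" plan means here.

```python
import math


def slerp(q0, q1, t):
    """Spherical linear interpolation between unit quaternions (w, x, y, z)."""
    dot = sum(a * b for a, b in zip(q0, q1))
    if dot < 0.0:
        # q and -q encode the same rotation; flip to follow the shorter arc.
        q1 = tuple(-c for c in q1)
        dot = -dot
    if dot > 0.9995:
        # Nearly identical rotations: linear blend avoids dividing by ~0.
        out = tuple(a + t * (b - a) for a, b in zip(q0, q1))
    else:
        theta = math.acos(dot)
        s0 = math.sin((1.0 - t) * theta) / math.sin(theta)
        s1 = math.sin(t * theta) / math.sin(theta)
        out = tuple(s0 * a + s1 * b for a, b in zip(q0, q1))
    norm = math.sqrt(sum(c * c for c in out))
    return tuple(c / norm for c in out)


def interpolate_pose(start, goal, t):
    """Blend two 7-element poses: SLERP on rotation, lerp on translation."""
    rotation = slerp(tuple(start[:4]), tuple(goal[:4]), t)
    translation = tuple(a + t * (b - a) for a, b in zip(start[4:], goal[4:]))
    return rotation + translation


def slerp_plan(start, goal, n_stages=24):
    """One pose per stage, evenly spaced from start (exclusive) to goal."""
    return [interpolate_pose(start, goal, k / n_stages)
            for k in range(1, n_stages + 1)]


# Hypothetical tooth: rotate 90 degrees about z while moving 2.4 mm along x.
start = (1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0)
goal = (math.cos(math.pi / 4), 0.0, 0.0, math.sin(math.pi / 4), 2.4, 0.0, 0.0)
plan = slerp_plan(start, goal)
```

A full plan would apply the same interpolation to each of the 28 teeth. The path ignores any biological lag, which is the gap the environment is meant to expose.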

	### Key innovations

	1. Pharmacokinetic force decay: translational deltas are convolved with a 5-tap kernel `[0.10, 0.30, 0.40, 0.15, 0.05]` that models bone-remodelling delay (Proffit Ch. 8). Because the kernel peaks at its third tap, the agent has to plan ~2 stages ahead of where it wants the tooth to land. With decay enabled, the SLERP baseline's score drops in the expected direction, verified across 5 seeds.
	2. Empirical anchorage prior mined from 195 real treatments (n=5,089 tooth-class observations). Median movement in real treatments is 0.89 mm for molars and 2.42 mm for incisors, so the agent gets a bounded reward signal that mirrors the population pattern rather than a hand-tuned one.
	3. Three-tier held-out eval — 250 Tsinghua test + 17 OFJ + 40 Bits2Bites IDs frozen in `server/eval_split.py`; `EvalRegistry.assert_training_legal()` raises on any leakage in train mode (test pinned).
	4. Mesh-based collision detection on real per-tooth vertex segmentation; ellipsoid fallback for synthetic cases.
	5. 5-stage training pipeline — Format SFT → Tool-use SFT → Behavioural-cloning SFT → GRPO (5 reward funcs) → Rejection-sampling FT.
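
The force-decay idea in item 1 can be sketched as a causal convolution. This is a minimal illustration, not the code in `server/force_decay`: the kernel values come from the list above, while the alignment (tap `k` weights the command issued `k` stages earlier) and the name `realized_deltas` are assumptions.

```python
KERNEL = [0.10, 0.30, 0.40, 0.15, 0.05]  # from the README; sums to 1.0


def realized_deltas(commanded, kernel=KERNEL):
    """Spread each commanded per-stage delta over the following stages.

    Assumption: tap k of the kernel weights the command issued k stages ago,
    so a single command is fully expressed only after len(kernel) stages.
    """
    realized = []
    for t in range(len(commanded)):
        total = 0.0
        for k, weight in enumerate(kernel):
            if t - k >= 0:
                total += weight * commanded[t - k]
        realized.append(total)
    return realized


# A single 1.0 mm command at stage 0, then nothing.
response = realized_deltas([1.0, 0.0, 0.0, 0.0, 0.0, 0.0])
print([round(x, 2) for x in response])  # [0.1, 0.3, 0.4, 0.15, 0.05, 0.0]
```

Under this alignment the largest share of a command arrives two stages after it is issued, which matches the "plan ~2 stages ahead" description.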

	---

	## Live demo

	```bash
	curl https://sri-manikanta-orthorl.hf.space/health
	# {"status":"healthy"}

	curl -X POST https://sri-manikanta-orthorl.hf.space/reset_stepwise \
	-H 'Content-Type: application/json' \
	-d '{"task_id":"tsinghua/0001","mode":"eval","tier":1,"eval_idx":0}'
	# Real Patient 0001 (Tsinghua) → 28×7 poses in dental_v1 frame
	```
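
The same reset call can be made from Python with only the standard library. This is a sketch that mirrors the `curl` command above; the helper names are illustrative, and the response is returned as parsed JSON without assuming any particular schema, since the README does not document one.

```python
import json
import urllib.request

BASE_URL = "https://sri-manikanta-orthorl.hf.space"


def build_reset_request(task_id="tsinghua/0001", mode="eval", tier=1,
                        eval_idx=0, base_url=BASE_URL):
    """Build the POST request for /reset_stepwise without sending it."""
    payload = {"task_id": task_id, "mode": mode,
               "tier": tier, "eval_idx": eval_idx}
    return urllib.request.Request(
        f"{base_url}/reset_stepwise",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def reset_stepwise(**kwargs):
    """Send the request and return whatever JSON the server replies with."""
    request = build_reset_request(**kwargs)
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))
```

Calling `reset_stepwise()` with no arguments reproduces the `curl` example. Pass `base_url="http://localhost:7860"` to target a local server instead.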

	---

	## Local quick-start

	```bash
	git clone https://github.com/mehular0ra/orthorl
	cd orthorl
	uv sync
	uv run python -m server.app # FastAPI on :7860
	curl http://localhost:7860/health
	make test # 227 tests pass in ~60 s
	```

	---

	## Training

	The full SFT → GRPO → RFT chain runs on a single A10G or T4. The adapter
	is pushed to `sri-manikanta/orthorl-grpo` at the end.

	```bash
	# end-to-end on a GPU host (~3.5 h on A10G ≈ $3.50; ~8.5 h on T4 ≈ $3.40)
	bash scripts/run_full_pipeline.sh

	# or just one stage (e.g. re-run GRPO without re-running SFT):
	STAGES="3" bash scripts/run_full_pipeline.sh
	```

	Colab driver: [`notebooks/colab_a10g.py`](notebooks/colab_a10g.py).

	### Logging

	Every run captures the following by default, with no flag combinations to remember:

	| File | Frequency | Content |
	|---|---|---|
	| `logs/run_<TS>.log` | streaming (tee) | every stdout line of every stage |
	| `logs/grpo_samples.jsonl` | every 10 steps | step / loss / lr / per-reward-fn means / completion length |
	| `logs/grpo_completions.jsonl` | every 10 steps | full sample completion text + slim breakdown (closes the W&B-only-scalars gap) |
	| `checkpoints/grpo/trainer_state.json` | every TRL step | TRL log_history with all reward fns |
	| `research/results.tsv` | once at training end | autolog row per `multiautoresearch` discipline |
	| `results/reward_curves.png` | once at training end | matplotlib plot of every `rewards/*/mean` |
	| W&B (if `WANDB_API_KEY` is in env) | every step | live cloud dashboard (auto-on) |
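
The table says `trainer_state.json` holds the TRL `log_history` and that the reward plot is drawn from every `rewards/*/mean` key. A minimal sketch of pulling those series out follows. It assumes the usual Transformers layout, in which `log_history` is a list of per-step dicts carrying a `step` field. `reward_series` and `load_reward_series` are illustrative helpers rather than functions from this repo.

```python
import fnmatch
import json
from collections import defaultdict


def reward_series(trainer_state, pattern="rewards/*/mean"):
    """Group (step, value) pairs by every log key matching the pattern."""
    series = defaultdict(list)
    for entry in trainer_state.get("log_history", []):
        step = entry.get("step")
        for key, value in entry.items():
            if fnmatch.fnmatchcase(key, pattern):
                series[key].append((step, value))
    return dict(series)


def load_reward_series(path="checkpoints/grpo/trainer_state.json"):
    """Read the checkpoint's trainer state from disk and extract the series."""
    with open(path, encoding="utf-8") as handle:
        return reward_series(json.load(handle))
```

Each returned list can be passed straight to a plotting call, one line per reward function.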

	After training, generate the slide kit:

	```bash
	uv run python eval.py --policy checkpoints/grpo --tier 1 --seeds 3
	uv run python eval.py --policy checkpoints/rft --tier 1 --seeds 3
	uv run python scripts/build_demo_plots.py # 5 PNGs for the deck
	uv run python scripts/build_ablation_matrix.py # spec 1.16 table
	```

	---

	## Results

	> Trained-checkpoint numbers are filled in after training. The SLERP
	> baseline below is locked in; the remaining rows are populated from
	> `eval_summary.csv` after the A10G run (run
	> `scripts/build_ablation_matrix.py` to regenerate).

	| Policy | Tier 1 (Tsinghua test, N=250) | Tier 2 (OFJ, N=17) | Tier 3 (Bits2Bites, N=40) |
	|---|---|---|---|
	| SLERP baseline | 0.7231 ±0.036 | — | — |
	| Stage 0 (format SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
	| Stage 2 (BC SFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
	| Stage 3 (GRPO) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |
	| Stage 4 (RFT) | `{{ post-training }}` | `{{ post-training }}` | `{{ post-training }}` |

	`results/reward_curves.png`, `results/eval_three_tier.png`,
	`results/case_type_reward.png`, `results/anchorage_summary.png`,
	`results/safeguards.png`, and `results/ablation_matrix.png` are the
	slide-ready figures the demo deck embeds.

	---

	## Repository layout

	```
	server/                         env modules (clinical_profiles, coord_frame,
	                                eval_split, expert_stager, force_decay,
	                                landmark_loader, mesh_collision,
	                                movement_priors, reward_scaler, ...)
	scripts/                        SFT/GRPO/RFT/eval pipeline drivers
	                                + cache_oracle, build_demo_plots, build_ablation_matrix
	specs/                          22 numbered spec files + VALIDATION_TRACKER.md
	tests/                          227 passing pytest cases
	data/                           SFT JSONL datasets (committed; deterministic)
	datasets/                       landmark/case data (gitignored bulk; case_database.json committed)
	checkpoints/                    SFT/GRPO/RFT adapters (gitignored)
	results/                        eval CSV + slide-ready PNGs
	research/                       pitch_script.md, results.tsv, code_review_*.md
	notebooks/colab_a10g.py         one-cell launcher for the full pipeline
	train_grpo.py                   main GRPO entrypoint
	eval.py                         held-out evaluation CLI
	prepare.py / server/grader.py   IMMUTABLE benchmark (read-only)
	```

	## Tests

	```
	make test # 227 passing in ~60 s
	make fast-check # high-value subset (<5 s) used by the pre-commit hook
	make install-precommit # repo-local hook so the next regression doesn't ship
	```

	## Specs

	22 numbered specs in `specs/`. Tier 1 (must-haves): 1.1–1.16, all
	shipped except 1.16 (the ablation matrix table — auto-fills post-eval).
	Tier 2 (separators): 2.1–2.6, all shipped. Tier 3+ are reach goals.

	## References

	- Wang et al. (2024). Nature Scientific Data 11:1277. DOI: 10.1038/s41597-024-04138-7
	- Andrews LF (1972). "The six keys to normal occlusion." Am J Orthod 62(3):296–309
	- Proffit WR. Contemporary Orthodontics 6e Ch. 8 — bone remodelling timeline
	- Cattaneo PM et al. (2005). "Moment-to-force ratio." Am J Orthod Dentofacial Orthop
	- Shao et al. (2024). DeepSeekMath / GRPO. arXiv:2402.03300
	- Yuan et al. (2023). RFT. arXiv:2308.01825
	- Patwardhan et al. (2025). GDPval: Evaluating AI Model Performance on Real-World Economically Valuable Tasks. arXiv:2510.04374 — economic-importance frame
	- Grand View Research. Clear Aligners Market Report (2025) — market size, refinement-rate context

	## Team

	- Mehul Arora — Orthogonal Research and Education Lab
	- Vivek Mathur — M.S. by Research, IIIT Hyderabad
	- Prof. Bradly Alicea — UIUC / Orthogonal Research Lab

	## License

	MIT. Patient data inherits the original Tsinghua / OFJ / Bits2Bites
	licenses.