CALVIN ABC→D VLA Checkpoints (starVLA)

Vision-Language-Action models trained on the CALVIN ABC→D benchmark (train on environments A/B/C, evaluate on held-out environment D) using the starVLA framework.

Each run directory contains `checkpoints/steps_<N>_pytorch_model.pt` and, where the checkpoint has been evaluated, the corresponding `eval_steps<N>_*/calvin_eval/results.json`.
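
The layout above can be indexed with a short script. The following is a minimal sketch, not part of starVLA: the function names `index_run` and `index_run_dir` are hypothetical, and it assumes `<N>` is written as a plain integer (for example `25000`) in both the checkpoint and eval directory names. It does not open `results.json`, since the schema of that file is not described here.

```python
import re
from pathlib import Path

# Assumption: <N> is a plain integer step count in both patterns.
CKPT_RE = re.compile(r"checkpoints/steps_(\d+)_pytorch_model\.pt$")
EVAL_RE = re.compile(r"(?:^|/)eval_steps(\d+)_[^/]*/calvin_eval/results\.json$")


def index_run(rel_paths):
    """Group file paths (relative to a run directory) by training step.

    Returns {step: {"checkpoint": path or None, "results": [paths]}},
    sorted by step. Paths matching neither pattern are ignored.
    """
    index = {}
    for raw in rel_paths:
        path = raw.replace("\\", "/")  # normalise Windows separators
        ckpt = CKPT_RE.search(path)
        if ckpt:
            entry = index.setdefault(int(ckpt.group(1)), {"checkpoint": None, "results": []})
            entry["checkpoint"] = path
            continue
        ev = EVAL_RE.search(path)
        if ev:
            entry = index.setdefault(int(ev.group(1)), {"checkpoint": None, "results": []})
            entry["results"].append(path)
    return dict(sorted(index.items()))


def index_run_dir(run_dir):
    """Walk a run directory on disk and index its checkpoints and eval results."""
    root = Path(run_dir)
    rel = [f.relative_to(root).as_posix() for f in root.rglob("*") if f.is_file()]
    return index_run(rel)
```

Calling `index_run_dir("qwen3vl_2b_pi_v3_1519")` on a downloaded run folder would then show which steps have a checkpoint, and which of those have an evaluation result next to them.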

Runs

Folder	Backbone	Init	Framework	Steps (save interval)	Best avg_seq_len
`qwen3vl_2b_pi_v3_1519`	Qwen3-VL-2B-Instruct	vanilla	PI_v3 (flow-matching, chunk-10, aug, b256)	200K (25K)	3.336 @ 25K
`internvl35_1b_pt0221_pi_v3_1577`	InternVL3.5-1B	embodied-PT 0221	PI_v3	100K→TIMEOUT@75K (12.5K)	1.076 @ 75K
`qwen3vl_2b_pt0221_pi_v3_1578`	Qwen3-VL-2B-Instruct	embodied-PT 0221	PI_v3	100K (12.5K)	3.064 @ 100K
`internvl35_1b_vanilla_pi_v3_1579`	InternVL3.5-1B	vanilla	PI_v3	100K (12.5K)	2.052 @ 50K
`internvl35_1b_pt0221_oft_1844`	InternVL3.5-1B	embodied-PT 0221	OFT (DiT-B, chunk-10, b64)	80K (10K)	not yet evaluated
`qwen3vl_2b_pt0221_oft_1845`	Qwen3-VL-2B-Instruct	embodied-PT 0221	OFT	80K (10K)	not yet evaluated
`internvl35_1b_vanilla_oft_1846`	InternVL3.5-1B	vanilla	OFT	80K (10K)	not yet evaluated
`internvl35_1b_pt0210_oft_1847`	InternVL3.5-1B	embodied-PT 0210	OFT	CANCELLED @ 10K (partial)	not yet evaluated

Key findings (PI_v3 runs)

Best evaluated model so far: Qwen3-VL-2B vanilla, PI_v3, step 25K → 3.336 avg_seq_len.
Embodied pretraining (PT0221) hurts InternVL3.5-1B on CALVIN: vanilla beats PT0221 at every matched checkpoint (mean Δ ≈ +0.55 avg_seq_len).
Long training overfits: run 1519 (Qwen3-VL-2B vanilla) peaks at 25K (3.336), then decays to 2.84 by 200K, a drop of about 0.50 (roughly 15%).

Eval protocol

Evaluation uses the CALVIN long-horizon protocol: 1000 chained sequences of 5 subtasks each. avg_seq_len ∈ [0, 5] is the mean number of consecutive subtasks completed per sequence.
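
To make the metric concrete, here is a minimal sketch of how avg_seq_len can be computed from per-sequence outcomes. The function name and the input format (one list of booleans per sequence, in subtask order) are assumptions for illustration; they are not the format of `results.json` or the evaluation code used for these runs.

```python
def summarize_chains(outcomes, chain_len=5):
    """Compute avg_seq_len and per-position chain success rates.

    outcomes: one list of `chain_len` booleans per evaluated sequence,
    in subtask order (hypothetical input format).
    Returns (avg_seq_len, rates), where rates[k-1] is the fraction of
    sequences that completed at least the first k subtasks in a row.
    """
    if not outcomes:
        raise ValueError("no sequences to evaluate")
    reached = [0] * chain_len  # reached[k-1]: sequences with >= k consecutive successes
    for seq in outcomes:
        if len(seq) != chain_len:
            raise ValueError(f"expected {chain_len} subtasks, got {len(seq)}")
        for k, ok in enumerate(seq):
            if not ok:
                break  # the chain ends at the first failure
            reached[k] += 1
    n = len(outcomes)
    rates = [count / n for count in reached]
    # The mean number of consecutive successes equals the sum of the
    # "at least k" success rates.
    return sum(reached) / n, rates
```

Because a chain ends at its first failure, a success after a failed subtask does not count, and avg_seq_len equals the sum of the five per-position chain success rates.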
