CALVIN ABC→D VLA Checkpoints (starVLA)

Vision-Language-Action models trained on the CALVIN ABC→D benchmark (train on environments A/B/C, evaluate on held-out environment D) using the starVLA framework.

Each run directory contains checkpoints/steps_<N>_pytorch_model.pt and, for checkpoints that have been evaluated, the corresponding eval_steps<N>_*/calvin_eval/results.json.
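As a rough sketch of how this layout can be walked, the helper below pairs each checkpoint with any evaluation results found for the same step. The function names `index_run` and `load_results` are hypothetical, and the code assumes that `<N>` is written as plain digits in both file and directory names. The schema of results.json is not documented here, so it is loaded as generic JSON.

```python
import json
import re
from pathlib import Path

# Assumption: <N> in the names is a plain integer such as 25000.
CKPT_RE = re.compile(r"steps_(\d+)_pytorch_model\.pt$")
EVAL_RE = re.compile(r"eval_steps(\d+)_")


def index_run(run_dir):
    """Map step -> {"checkpoint": Path or None, "results": [Path, ...]}."""
    run_dir = Path(run_dir)
    index = {}

    def entry(step):
        return index.setdefault(step, {"checkpoint": None, "results": []})

    # Checkpoints live under <run_dir>/checkpoints/.
    for ckpt in (run_dir / "checkpoints").glob("steps_*_pytorch_model.pt"):
        m = CKPT_RE.search(ckpt.name)
        if m:
            entry(int(m.group(1)))["checkpoint"] = ckpt

    # Eval outputs live under <run_dir>/eval_steps<N>_*/calvin_eval/.
    for res in run_dir.glob("eval_steps*_*/calvin_eval/results.json"):
        m = EVAL_RE.match(res.parent.parent.name)
        if m:
            entry(int(m.group(1)))["results"].append(res)

    return dict(sorted(index.items()))


def load_results(path):
    """Load one results.json as a plain Python object (schema not assumed)."""
    with open(path) as f:
        return json.load(f)
```

Calling `index_run("qwen3vl_2b_pi_v3_1519")` on a downloaded run folder would then list which steps have a checkpoint, an evaluation, or both.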

Runs

| Folder | Backbone | Init | Framework | Steps (save interval) | Best avg_seq_len |
|---|---|---|---|---|---|
| qwen3vl_2b_pi_v3_1519 | Qwen3-VL-2B-Instruct | vanilla | PI_v3 (flow-matching, chunk-10, aug, b256) | 200K (25K) | 3.336 @ 25K |
| internvl35_1b_pt0221_pi_v3_1577 | InternVL3.5-1B | embodied-PT 0221 | PI_v3 | 100K→TIMEOUT@75K (12.5K) | 1.076 @ 75K |
| qwen3vl_2b_pt0221_pi_v3_1578 | Qwen3-VL-2B-Instruct | embodied-PT 0221 | PI_v3 | 100K (12.5K) | 3.064 @ 100K |
| internvl35_1b_vanilla_pi_v3_1579 | InternVL3.5-1B | vanilla | PI_v3 | 100K (12.5K) | 2.052 @ 50K |
| internvl35_1b_pt0221_oft_1844 | InternVL3.5-1B | embodied-PT 0221 | OFT (DiT-B, chunk-10, b64) | 80K (10K) | not yet evaluated |
| qwen3vl_2b_pt0221_oft_1845 | Qwen3-VL-2B-Instruct | embodied-PT 0221 | OFT | 80K (10K) | not yet evaluated |
| internvl35_1b_vanilla_oft_1846 | InternVL3.5-1B | vanilla | OFT | 80K (10K) | not yet evaluated |
| internvl35_1b_pt0210_oft_1847 | InternVL3.5-1B | embodied-PT 0210 | OFT | CANCELLED @ 10K (partial) | not yet evaluated |

Key findings (PI_v3 runs)

  • Best model overall: Qwen3-VL-2B vanilla, PI_v3, step 25K → 3.336 avg_seq_len.
  • Embodied pretraining (PT0221) hurts InternVL3.5-1B on CALVIN: the vanilla run beats the PT0221 run at every checkpoint evaluated for both (mean Δ ≈ +0.55 avg_seq_len in favor of vanilla).
  • Long training overfits: qwen3vl_2b_pi_v3_1519 peaks at 25K (3.336), then decays to 2.84 by 200K.
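The comparisons behind these findings can be sketched as two small helpers: one picks the best checkpoint of a run, the other averages the per-step difference between two runs over the steps both were evaluated at. The names `best_step` and `mean_matched_delta` are hypothetical, and the per-checkpoint scores are assumed to be available as a step → avg_seq_len mapping. The usage lines contain only the two values for run 1519 quoted above.

```python
def best_step(scores):
    """Return (step, avg_seq_len) for the highest-scoring checkpoint.

    scores: dict mapping training step -> avg_seq_len.
    """
    step = max(scores, key=scores.get)
    return step, scores[step]


def mean_matched_delta(a, b):
    """Mean of a[step] - b[step] over the steps evaluated in both runs."""
    common = sorted(set(a) & set(b))
    if not common:
        raise ValueError("the two runs share no evaluated checkpoints")
    return sum(a[s] - b[s] for s in common) / len(common)


# Only the two values quoted for run 1519; intermediate checkpoints are omitted.
run_1519 = {25_000: 3.336, 200_000: 2.84}
print(best_step(run_1519))  # (25000, 3.336)
```

Restricting the delta to shared steps matters here because internvl35_1b_pt0221_pi_v3_1577 stopped at 75K, so later checkpoints of the vanilla run have no counterpart.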

Eval protocol

Evaluation follows the CALVIN long-horizon protocol: 1000 chained sequences of 5 subtasks each. avg_seq_len ∈ [0, 5] is the mean number of consecutive subtasks completed per sequence.
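To make the metric concrete, here is a minimal sketch that computes avg_seq_len from per-sequence counts of consecutively completed subtasks, along with the fraction of sequences reaching at least k subtasks in a row. The function name `calvin_metrics` and the input format are assumptions, not the evaluation code used for these runs, and the four-element input is a toy example rather than real results.

```python
def calvin_metrics(completed, chain_len=5):
    """Compute avg_seq_len and success rates from per-sequence counts.

    completed: list of ints, one per evaluated sequence, giving how many
               subtasks were solved in a row before the first failure.
    Returns (avg_seq_len, success_at_k), where success_at_k[k-1] is the
    fraction of sequences that completed at least k subtasks.
    """
    if not completed:
        raise ValueError("no sequences given")
    if any(not 0 <= c <= chain_len for c in completed):
        raise ValueError("counts must lie in [0, chain_len]")
    n = len(completed)
    avg_seq_len = sum(completed) / n
    success_at_k = [
        sum(c >= k for c in completed) / n for k in range(1, chain_len + 1)
    ]
    return avg_seq_len, success_at_k


# Toy input: four sequences (the real protocol uses 1000).
avg, rates = calvin_metrics([5, 3, 0, 1])
print(avg)    # 2.25
print(rates)  # [0.75, 0.5, 0.5, 0.25, 0.25]
```

The success rates sum to avg_seq_len, since a sequence that completes c subtasks contributes to exactly c of the k thresholds.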
