rskill-smolvla-metaworld

OpenRAL rSkill wrapping SmolVLA (0.45 B parameters), fine-tuned on the MetaWorld MT50 benchmark (50 manipulation tasks, Rethink Sawyer arm).

Quick start

Load the package in Python:

```python
from openral_rskill.loader import rSkill

# Load the rSkill package from its manifest (path relative to the repo root).
pkg = rSkill.from_yaml("rskills/smolvla-metaworld/rskill.yaml")
```

Run the benchmarks from a shell:

```shell
# Single demo scene (BenchmarkScene tier, paper protocol):
openral benchmark scene --config scenes/benchmark/metaworld_push.yaml \
    --rskill rskills/smolvla-metaworld

# Full headline suites (each writes eval/<suite>.json with reproduced_locally=true):
openral benchmark run --suite metaworld_mt10 --rskill rskills/smolvla-metaworld
openral benchmark run --suite metaworld_mt50 --rskill rskills/smolvla-metaworld
```

Upstream model

| Field | Value |
|---|---|
| Source repo | `lerobot/smolvla_metaworld` |
| Base model | `lerobot/smolvla_base` |
| Paper | arXiv:2506.01844, "SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics" |
| License | Apache-2.0 |
| Parameters | ~450 M |
| Benchmark | MetaWorld MT50 (50 tasks, Rethink Sawyer) |
| Training data | `lerobot/metaworld_mt50` |
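To fetch the upstream weights directly, a minimal sketch follows. It assumes the manifest's `weights_uri` form `hf://<org>/<repo>` (here `hf://lerobot/smolvla_metaworld`) simply names a Hugging Face Hub repo id; `hf_repo_id` and `download_weights` are hypothetical helpers, not OpenRAL's loader. `snapshot_download` is the `huggingface_hub` download function and is imported lazily, so the URI parsing runs without that package installed.

```python
def hf_repo_id(weights_uri: str) -> str:
    """Turn an 'hf://<org>/<repo>' URI into a Hub repo id (assumed URI form)."""
    prefix = "hf://"
    if not weights_uri.startswith(prefix):
        raise ValueError(f"not an hf:// URI: {weights_uri!r}")
    repo_id = weights_uri[len(prefix):].strip("/")
    if repo_id.count("/") != 1:
        raise ValueError(f"expected '<org>/<repo>', got {repo_id!r}")
    return repo_id


def download_weights(weights_uri: str) -> str:
    """Download the checkpoint snapshot and return its local directory."""
    # Imported here so hf_repo_id works without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=hf_repo_id(weights_uri))


if __name__ == "__main__":
    print(hf_repo_id("hf://lerobot/smolvla_metaworld"))  # lerobot/smolvla_metaworld
```

Only the parsing step is exercised above; calling `download_weights` fetches the full checkpoint from the Hub.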

The checkpoint is multi-task: a single set of weights covers the whole MetaWorld family, so the manifest gates it with the family entry `evaluated_tasks: ["metaworld"]`, which covers every `metaworld/<task>-v3` task id as well as the bare `metaworld` scene id. The 4-D proprio state, the single `observation.images.camera1` RGB input, and the normalisation statistics are verified against the lerobot checkpoint; see `docs/reference/vla_compatibility.md` §3.2.
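The family gate can be pictured as a prefix match on task ids. The sketch below is hypothetical: `task_is_covered` is not part of OpenRAL, and it assumes an entry matches either an identical id or any id nested under it as `<entry>/...`. The non-MetaWorld ids in the usage lines are made up for illustration.

```python
def task_is_covered(task_id: str, evaluated_tasks: list[str]) -> bool:
    """Return True if task_id falls under one of the evaluated_tasks entries.

    Hypothetical rule: an entry matches an identical id, or any id nested
    under it as '<entry>/...' (so a family entry like 'metaworld' covers
    'metaworld/push-v3').
    """
    return any(
        task_id == entry or task_id.startswith(entry + "/")
        for entry in evaluated_tasks
    )


if __name__ == "__main__":
    gate = ["metaworld"]
    print(task_is_covered("metaworld", gate))            # True: bare scene id
    print(task_is_covered("metaworld/push-v3", gate))    # True: task in the family
    print(task_is_covered("metaworldx/push-v3", gate))   # False: prefix alone is not enough
    print(task_is_covered("otherbench/some-task", gate)) # False: another family
```

Requiring the `/` separator keeps a family entry from accidentally admitting an unrelated id that merely shares a prefix.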

Supported robots

| Robot | Embodiment tag | Status | Notes |
|---|---|---|---|
| Rethink Sawyer (MetaWorld sim) | `sawyer` | ✓ matches | Native training embodiment. |
| Franka Panda / SO-100 | n/a | does not match | The `libero` / `so100_follower` tags are intentionally excluded; MetaWorld uses a different task distribution and camera setup. |

Sensors required

| Key | Modality | Min resolution | Notes |
|---|---|---|---|
| `observation.images.camera1` | RGB | 224 × 224 | Mapped from MetaWorld's corner camera (`corner2`, 480 × 480 native). No adapter image flip: lerobot's `MetaworldEnv` already corrects the corner camera's 180° inversion. |
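The sensor contract above can be sketched as a small adapter check. This is a hypothetical helper, not the OpenRAL adapter: it assumes frames arrive as `H×W×3` `uint8` RGB arrays already un-inverted by `MetaworldEnv`, and it assumes the 4-D proprio state is passed under the key `observation.state`, which this README does not name.

```python
import numpy as np

MIN_HW = (224, 224)  # minimum resolution from the Sensors required table


def build_observation(corner2_rgb: np.ndarray, agent_pos) -> dict:
    """Map a MetaWorld corner2 frame and proprio state onto the rSkill's inputs.

    Assumptions (hypothetical): HxWx3 RGB frame, 4-D proprio state, and the
    proprio key 'observation.state'.
    """
    if corner2_rgb.ndim != 3 or corner2_rgb.shape[2] != 3:
        raise ValueError(f"expected HxWx3 RGB, got shape {corner2_rgb.shape}")
    h, w = corner2_rgb.shape[:2]
    if h < MIN_HW[0] or w < MIN_HW[1]:
        raise ValueError(f"frame {h}x{w} is below the {MIN_HW[0]}x{MIN_HW[1]} minimum")
    state = np.asarray(agent_pos, dtype=np.float32)
    if state.shape != (4,):
        raise ValueError(f"expected 4-D proprio state, got shape {state.shape}")
    # No flip here: the 180-degree correction already happens inside MetaworldEnv.
    return {
        "observation.images.camera1": corner2_rgb,
        "observation.state": state,
    }


if __name__ == "__main__":
    frame = np.zeros((480, 480, 3), dtype=np.uint8)  # native corner2 size
    obs = build_observation(frame, [0.0, 0.0, 0.0, 0.0])
    print({k: v.shape for k, v in obs.items()})
```

Any resizing from 480 × 480 down to the model's input size is left to the policy's own preprocessing in this sketch.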

Manifest summary

| Field | Value |
|---|---|
| `name` | `OpenRAL/rskill-smolvla-metaworld` |
| `version` | `0.1.0` |
| `license` | `apache-2.0` |
| `role` | `s1` |
| `runtime` / `quantization.dtype` | `pytorch` / `bf16` |
| `weights_uri` | `hf://lerobot/smolvla_metaworld` |
| `latency_budget.per_chunk_ms` | 150 ms |
| `evaluated_tasks` | `["metaworld"]` (family gate) |
| `commercial_use_allowed` | `true` |

Full schema: `openral_core.RSkillManifest`.
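The `latency_budget.per_chunk_ms` field can be read as a per-action-chunk timing target. The sketch below is an assumption-laden illustration: `MANIFEST` is a hypothetical in-memory mirror of a few table fields (the dotted names are assumed to be nested keys), and `run_chunk_with_budget` is a hypothetical helper that only measures one prediction against the budget; it is not how OpenRAL enforces it.

```python
import time
from typing import Any, Callable

# Hypothetical mirror of a few manifest fields. The dotted names in the table
# are assumed to be nested keys; the authoritative schema is
# openral_core.RSkillManifest.
MANIFEST: dict[str, Any] = {
    "name": "OpenRAL/rskill-smolvla-metaworld",
    "runtime": "pytorch",
    "quantization": {"dtype": "bf16"},
    "latency_budget": {"per_chunk_ms": 150},
}


def run_chunk_with_budget(
    predict_chunk: Callable[[], Any], manifest: dict[str, Any] = MANIFEST
) -> tuple[Any, float, bool]:
    """Run one action-chunk prediction and check it against per_chunk_ms.

    Returns (chunk, elapsed_ms, within_budget). This only measures; it does
    not cancel or retry a slow call.
    """
    budget_ms = float(manifest["latency_budget"]["per_chunk_ms"])
    start = time.perf_counter()
    chunk = predict_chunk()
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return chunk, elapsed_ms, elapsed_ms <= budget_ms


if __name__ == "__main__":
    # Stand-in predictor returning a placeholder chunk; a real one would run the policy.
    _, ms, ok = run_chunk_with_budget(lambda: [[0.0] * 4])
    print(f"{ms:.3f} ms, within budget: {ok}")
```

Reporting overruns rather than aborting keeps the check side-effect free; what to do on an overrun is a separate runtime decision.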

Evaluation

Locally reproduced on the MetaWorld MT50 suite via `openral benchmark run --suite metaworld_mt50` (`reproduced_locally: true`; see `eval/metaworld_mt50.json`).

| Suite | Tasks | Protocol | Result |
|---|---|---|---|
| MT50 | 50 | 1 episode / seed 0 / `max_steps=200` | 16/50 solved · avg 0.30 |

The MT50 run also covers all 10 MT10 tasks. A dedicated MT10 reproduction, written to `eval/metaworld_mt10.json`, can be produced with `openral benchmark run --suite metaworld_mt10 --rskill rskills/smolvla-metaworld`. Raise `n_episodes` for a paper-equivalent number (50 goals per task).

Solved at seed 0 (`success_rate` 1.0): `assembly-v3`, `button-press-topdown-v3`, `button-press-v3`, `coffee-button-v3`, `door-close-v3`, `door-lock-v3`, `drawer-close-v3`, `faucet-close-v3`, `handle-press-side-v3`, `handle-press-v3`, `pick-place-v3`, `pick-place-wall-v3`, `plate-slide-back-side-v3`, `plate-slide-side-v3`, `push-v3`.

Single-episode, seed-0 numbers are a cheap smoke test of the headline set, not a paper claim: per-task success on the harder tasks is seed-sensitive.
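To make the seed-sensitivity caveat concrete, a Wilson score interval (a standard binomial confidence interval, swapped in here for illustration; OpenRAL is not documented to report it) shows how little one episode pins down a task's success rate compared with the 50 goals per task of the paper protocol. `wilson_interval` is a hypothetical helper.

```python
import math


def wilson_interval(successes: int, episodes: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z=1.96 is roughly 95%)."""
    if episodes <= 0 or not 0 <= successes <= episodes:
        raise ValueError("need episodes > 0 and 0 <= successes <= episodes")
    p = successes / episodes
    z2 = z * z
    denom = 1.0 + z2 / episodes
    centre = (p + z2 / (2 * episodes)) / denom
    half = (z / denom) * math.sqrt(p * (1 - p) / episodes + z2 / (4 * episodes * episodes))
    return max(0.0, centre - half), min(1.0, centre + half)


if __name__ == "__main__":
    for s, n in [(1, 1), (0, 1), (50, 50)]:
        lo, hi = wilson_interval(s, n)
        print(f"{s}/{n} successes -> [{lo:.2f}, {hi:.2f}]")
```

Run as-is, it prints [0.21, 1.00] for 1/1, [0.00, 0.79] for 0/1 and [0.93, 1.00] for 50/50: a single "solved" episode is compatible with a true success rate as low as about 0.2.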

Demo scenes

Five single-task `BenchmarkScene` entries (website-demo tier, 500-step horizon, 50 episodes) live under `scenes/benchmark/`: `metaworld_push.yaml`, `metaworld_pick_place.yaml`, `metaworld_button_press.yaml`, `metaworld_door_open.yaml`, `metaworld_drawer_open.yaml`. The full sweeps live in `benchmarks/metaworld_mt10.yaml` (10 tasks) and `benchmarks/metaworld_mt50.yaml` (50 tasks).
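To run all five demo scenes in one go, a minimal driver can reuse the exact `openral benchmark scene --config … --rskill …` form from Quick start. `scene_command` and `run_all` are hypothetical helpers; the sketch defaults to a dry run that only prints the commands, and assumes it is launched from the repo root.

```python
import shlex
import subprocess

RSKILL_DIR = "rskills/smolvla-metaworld"
DEMO_SCENES = [
    "scenes/benchmark/metaworld_push.yaml",
    "scenes/benchmark/metaworld_pick_place.yaml",
    "scenes/benchmark/metaworld_button_press.yaml",
    "scenes/benchmark/metaworld_door_open.yaml",
    "scenes/benchmark/metaworld_drawer_open.yaml",
]


def scene_command(config: str, rskill: str = RSKILL_DIR) -> list[str]:
    """Build the argv for one demo scene, mirroring the Quick start command."""
    return ["openral", "benchmark", "scene", "--config", config, "--rskill", rskill]


def run_all(dry_run: bool = True) -> None:
    """Print (dry run) or execute the command for every demo scene in turn."""
    for config in DEMO_SCENES:
        argv = scene_command(config)
        if dry_run:
            print(shlex.join(argv))
        else:
            # Requires the openral CLI on PATH; stops at the first failing scene.
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    run_all(dry_run=True)
```

Passing an argv list to `subprocess.run` avoids shell quoting issues; `shlex.join` is used only to print a copy-pasteable command.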

License

This rSkill package (rskill.yaml, README.md, eval/metaworld_mt50.json) is Apache-2.0. The wrapped weights are also Apache-2.0. Commercial use is allowed.

Paper: SmolVLA: A Vision-Language-Action Model for Affordable and Efficient Robotics (arXiv:2506.01844, published Jun 2, 2025).
