---
license: apache-2.0
tags:
  - robotics
  - lerobot
  - pi0
  - vla
  - imitation-learning
  - so101
datasets:
  - abdul004/so101_ball_in_cup_v5
pipeline_tag: robotics
---

SO-101 Ball-in-Cup Pi0.5 Policy

A fine-tuned Pi0.5 (π₀.₅) Vision-Language-Action model for the ball-in-cup task using the SO-101 robot arm.

Task Description

Goal: Pick up an orange ball from the table and place it into a pink cup.

Robot: SO-101 - 6-DOF robot arm with gripper

Cameras: Dual camera setup (overhead + wrist-mounted)

Model Architecture

Pi0.5 is a Vision-Language-Action (VLA) model from Physical Intelligence:

| Component | Description |
|---|---|
| Vision Encoder | SigLIP 400M - processes camera images |
| Language Model | Gemma 2B - scene understanding & task grounding |
| Action Expert | Flow Matching head - generates smooth action trajectories |
| Total Parameters | ~3B |

The model takes natural language instructions + camera images → outputs continuous joint actions.
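To give a feel for what a flow matching head does at inference time, here is a toy sketch in plain Python. It assumes the usual formulation in which actions are produced by integrating a velocity field from a noise sample at t=0 to an action at t=1 with a few Euler steps. The velocity field below is a hand-written stand-in that moves in a straight line toward a known target. It is not the OpenPi implementation, and `euler_integrate`, `toy_velocity_toward`, the step count, and the numbers are all illustrative assumptions.

```python
from typing import Callable, List

Vector = List[float]


def euler_integrate(velocity: Callable[[Vector, float], Vector],
                    x0: Vector, num_steps: int = 10) -> Vector:
    """Integrate dx/dt = velocity(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps  # t stays below 1, so the toy field never divides by zero
        v = velocity(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def toy_velocity_toward(target: Vector) -> Callable[[Vector, float], Vector]:
    """Velocity field of a straight-line path that reaches `target` at t=1.

    In a trained action expert this would be a neural network conditioned
    on images, language, and robot state (assumption: toy stand-in only).
    """
    def velocity(x: Vector, t: float) -> Vector:
        return [(gi - xi) / (1.0 - t) for gi, xi in zip(target, x)]
    return velocity


noise = [0.8, -1.3, 0.2]          # stand-in for a Gaussian noise sample
target_action = [0.1, 0.5, -0.4]  # hypothetical joint targets
action = euler_integrate(toy_velocity_toward(target_action), noise)
print([round(a, 6) for a in action])
```

With this particular field every Euler step moves the same fraction of the way along the straight line, so the final step lands on the target up to floating-point rounding.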

Training Details

| Parameter | Value |
|---|---|
| Base Model | Pi0.5 (Physical Intelligence) |
| Dataset | abdul004/so101_ball_in_cup_v5 |
| Episodes | 72 teleoperated demonstrations |
| Frames | 25,045 |
| Fine-tuning Steps | 5,000 |
| Hardware | A100 80GB on RunPod |
| Training Time | ~3-4 hours |
| Cost | ~$6-8 USD |
| Framework | OpenPi (JAX/Flax) |

Inference Performance

JPEG Compression Optimization

We implemented JPEG compression to reduce network transfer time for remote inference:

| Location | Raw Images | JPEG (Q80) | Speedup |
|---|---|---|---|
| EU Spot | 1448 ms | 375 ms | 3.9x |
| US On-Demand | 600 ms | 270 ms | 2.2x |

| Metric | Before | After |
|---|---|---|
| Payload Size | 1.8 MB | 71 KB |
| Control Rate (US) | 1.7 Hz | 3.7 Hz |
| Compression Ratio | - | 25x |
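The derived figures in these tables follow from the measured latencies and payload sizes. As a quick consistency check, the sketch below recomputes them. It assumes the control rate is simply the inverse of the round-trip latency (one inference per control step, with no pipelining), which is an assumption rather than something the measurements state. The function names are illustrative.

```python
def control_rate_hz(latency_ms: float) -> float:
    """Control rate if each control step waits for one full round trip."""
    return 1000.0 / latency_ms


def speedup(before_ms: float, after_ms: float) -> float:
    """How many times faster the 'after' latency is than the 'before' latency."""
    return before_ms / after_ms


# Measured values from the tables above.
us_raw_ms, us_jpeg_ms = 600.0, 270.0
eu_raw_ms, eu_jpeg_ms = 1448.0, 375.0
raw_kb, jpeg_kb = 1800.0, 71.0  # 1.8 MB taken as 1800 KB

print(f"US control rate: {control_rate_hz(us_raw_ms):.1f} Hz -> "
      f"{control_rate_hz(us_jpeg_ms):.1f} Hz")
print(f"US speedup: {speedup(us_raw_ms, us_jpeg_ms):.1f}x")
print(f"EU speedup: {speedup(eu_raw_ms, eu_jpeg_ms):.1f}x")
print(f"Compression ratio: {raw_kb / jpeg_kb:.0f}x")
```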

Architecture

[RunPod GPU Server]             [Robot Mac]
┌─────────────────┐             ┌──────────────┐
│ Pi0.5 Model     │◄── WSS ────►│ run_pi05.py  │
│ (RTX 4090)      │   JPEG      │ (Robot ctrl) │
└─────────────────┘             └──────────────┘

Demo

With JPEG Compression (~270ms latency)

Evaluation Demo - JPEG: side-by-side view, overhead camera (left) and wrist camera (right), smooth 3.7 Hz control.

Without JPEG Compression (~600ms latency)

Evaluation Demo - Raw: side-by-side view of the same task with raw image transfer, 1.7 Hz control.

Sample Evaluation

JPEG Compression (Fast)

Evaluation Composite - JPEG: 5-frame composite (Start → Approach → Grasp → Transport → Final).

Raw Images (Slow)

Evaluation Composite - Raw: same task without JPEG optimization.

Usage

Server Setup (RunPod)

# Clone OpenPi fork with JPEG support
git clone https://github.com/abdulrahman004/openpi.git
cd openpi
uv sync

# Download checkpoint
uv run huggingface-cli download abdul004/pi05_so101_checkpoint \
    --include "4999/**" \
    --local-dir checkpoints/pi05_so101

# Start server
uv run scripts/serve_policy.py --port 8000 \
    policy:checkpoint \
    --policy.config=pi05_so101 \
    --policy.dir=checkpoints/pi05_so101/4999

Client (Robot Mac)

pip install openpi-client

# Run inference with JPEG compression
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net

# Or without compression (slower)
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net --no-jpeg

Comparison with ACT Policy

Trained on the same dataset:

| Policy | Architecture | Inference | Grasp | Generalization |
|---|---|---|---|---|
| Pi0.5 | VLA (3B params) | Remote GPU | ✅ | ✅ Edge positions |
| ACT | Transformer (25M) | Local | ✅ | ⚠️ Center only |

Key advantage: Pi0.5 successfully picks up the ball from edge positions that ACT could not handle, which suggests better generalization from VLA pre-training.

Infrastructure Notes

Remote Inference Setup:

  • Server: RunPod RTX 4090 24GB (~$0.40/hr on-demand)
  • Client: Mac Mini M4 controlling SO-101 robot
  • Protocol: WebSocket with msgpack serialization
  • Optimization: JPEG compression reduces the payload from 1.8 MB to 71 KB per inference
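One way to read the raw and JPEG latency measurements together is to fit a simple linear model, latency = fixed overhead + payload / throughput, to the two data points per location. This is only a rough sketch. The linear model is an assumption, it ignores JPEG encode and decode time, and `fit_latency_model` is a hypothetical helper.

```python
def fit_latency_model(raw_ms, jpeg_ms, raw_kb, jpeg_kb):
    """Solve latency_ms = fixed_ms + payload_kb / kb_per_ms from two measurements."""
    kb_per_ms = (raw_kb - jpeg_kb) / (raw_ms - jpeg_ms)
    fixed_ms = jpeg_ms - jpeg_kb / kb_per_ms
    return fixed_ms, kb_per_ms


RAW_KB, JPEG_KB = 1800.0, 71.0  # payload sizes reported above (1.8 MB, 71 KB)

measurements = [
    ("US On-Demand", 600.0, 270.0),
    ("EU Spot", 1448.0, 375.0),
]
for name, raw_ms, jpeg_ms in measurements:
    fixed_ms, kb_per_ms = fit_latency_model(raw_ms, jpeg_ms, RAW_KB, JPEG_KB)
    # 1 KB/ms equals 1 MB/s, so kb_per_ms can be printed directly as MB/s.
    print(f"{name}: fixed ~{fixed_ms:.0f} ms, "
          f"effective throughput ~{kb_per_ms:.2f} MB/s")
```

Under this model, most of the remaining JPEG latency is fixed overhead (inference plus network round trip) rather than transfer time: about 256 ms of the 270 ms US figure. Compressing the images further would therefore likely yield only small gains.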

Known Issues:

  • RTX 4090 is borderline for memory - occasional OOM during model loading
  • US datacenters preferred (in the measurements above, about 2.4x faster than EU with raw images and 1.4x with JPEG)
  • First inference takes 30-60s (JAX JIT compilation)
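Because the first request pays the JIT compilation cost, it can help to send one throwaway request before the robot starts moving. The sketch below shows that pattern with a generic `infer` callable. `warm_up`, `fake_infer`, and the observation keys are hypothetical and are not part of OpenPi or `run_pi05.py`. The dummy observation should match the shapes and dtypes of real observations, since JAX compiles again when they change.

```python
import time
from typing import Any, Callable, Dict


def warm_up(infer: Callable[[Dict[str, Any]], Any],
            dummy_observation: Dict[str, Any],
            slow_threshold_s: float = 5.0) -> float:
    """Call the policy once before the control loop; return elapsed seconds."""
    start = time.perf_counter()
    infer(dummy_observation)  # result is discarded; only the side effect matters
    elapsed = time.perf_counter() - start
    if elapsed > slow_threshold_s:
        print(f"Warm-up took {elapsed:.1f}s (likely JIT compilation); "
              "later calls should be faster.")
    return elapsed


# Usage with a stand-in policy. A real client would pass its own inference
# call and an observation with the same shapes as live camera frames.
def fake_infer(observation: Dict[str, Any]) -> Dict[str, Any]:
    return {"actions": [[0.0] * 6]}


elapsed_s = warm_up(fake_infer, {"state": [0.0] * 6})
print(f"warm-up finished in {elapsed_s:.3f}s")
```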

Limitations

  • Requires GPU server for inference (not yet optimized for edge deployment)
  • Sensitive to lighting changes
  • 72 training episodes may limit extreme edge case handling

Citation

@misc{so101_pi05_ball_in_cup,
  author = {Abdul},
  title = {SO-101 Ball-in-Cup Pi0.5 Fine-tuning},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/abdul004/pi05_so101_checkpoint}
}

Acknowledgments