license: apache-2.0
tags:
- robotics
- lerobot
- pi0
- vla
- imitation-learning
- so101
datasets:
- abdul004/so101_ball_in_cup_v5
pipeline_tag: robotics
SO-101 Ball-in-Cup Pi0.5 Policy
A fine-tuned Pi0.5 (π₀.₅) Vision-Language-Action model for the ball-in-cup task using the SO-101 robot arm.
Task Description
Goal: Pick up an orange ball from the table and place it into a pink cup.
Robot: SO-101 - 6-DOF robot arm with gripper
Cameras: Dual camera setup (overhead + wrist-mounted)
Model Architecture
Pi0.5 is a Vision-Language-Action (VLA) model from Physical Intelligence:
| Component | Description |
|---|---|
| Vision Encoder | SigLIP 400M - processes camera images |
| Language Model | Gemma 2B - scene understanding & task grounding |
| Action Expert | Flow Matching head - generates smooth action trajectories |
| Total Parameters | ~3B |
The model takes a natural-language instruction plus camera images as input and outputs continuous joint actions.
Training Details
| Parameter | Value |
|---|---|
| Base Model | Pi0.5 (Physical Intelligence) |
| Dataset | abdul004/so101_ball_in_cup_v5 |
| Episodes | 72 teleoperated demonstrations |
| Frames | 25,045 |
| Fine-tuning Steps | 5,000 |
| Hardware | A100 80GB on RunPod |
| Training Time | ~3-4 hours |
| Cost | ~$6-8 USD |
| Framework | OpenPi (JAX/Flax) |
Inference Performance
JPEG Compression Optimization
We implemented JPEG compression to reduce network transfer time for remote inference:
| Server Location | Latency, Raw Images | Latency, JPEG (Q80) | Speedup |
|---|---|---|---|
| EU Spot | 1448 ms | 375 ms | 3.9x |
| US On-Demand | 600 ms | 270 ms | 2.2x |

| Metric | Before (Raw) | After (JPEG) |
|---|---|---|
| Payload Size | 1.8 MB | 71 KB |
| Control Rate (US) | 1.7 Hz | 3.7 Hz |
| Compression Ratio | - | 25x |
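The figures in the two tables can be cross-checked with a little arithmetic. The sketch below recomputes the speedups, control rates, and compression ratio from the reported numbers. The helper names are illustrative only, and the 640x480 RGB resolution for the two cameras is an assumption (it is not stated in this card); it is used only because it happens to be consistent with the reported 1.8 MB raw payload.

```python
def speedup(raw_ms: float, jpeg_ms: float) -> float:
    """Latency ratio between raw and JPEG transfer."""
    return raw_ms / jpeg_ms


def control_rate_hz(latency_ms: float) -> float:
    """Control rate if each control step waits for one full inference round trip."""
    return 1000.0 / latency_ms


def compression_ratio(raw_bytes: float, compressed_bytes: float) -> float:
    """How many times smaller the compressed payload is."""
    return raw_bytes / compressed_bytes


# Reported latencies (ms) from the table above.
print(round(speedup(1448, 375), 1))   # EU Spot
print(round(speedup(600, 270), 1))    # US On-Demand

# Reported US latencies converted to control rates.
print(round(control_rate_hz(600), 1))  # raw images
print(round(control_rate_hz(270), 1))  # JPEG

# Reported payload sizes: 1.8 MB raw vs 71 KB JPEG.
print(round(compression_ratio(1.8e6, 71e3)))

# ASSUMPTION: two cameras at 640x480, 3 bytes per pixel (not stated in the card).
assumed_raw_payload = 2 * 640 * 480 * 3
print(assumed_raw_payload)
```

The control rates match 1 / latency, which suggests the client runs one synchronous inference per control step, so network latency directly bounds the control rate.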
Architecture
[RunPod GPU Server]              [Robot Mac]
┌─────────────────┐              ┌──────────────┐
│ Pi0.5 Model     │◄── WSS ─────►│ run_pi05.py  │
│ (RTX 4090)      │    JPEG      │ (Robot ctrl) │
└─────────────────┘              └──────────────┘
Demo
With JPEG Compression (~270ms latency)
Side-by-side: Overhead camera (left) + Wrist camera (right) - Smooth 3.7 Hz control
Without JPEG Compression (~600ms latency)
Side-by-side: Same task but with raw image transfer - 1.7 Hz control
Sample Evaluation
JPEG Compression (Fast)
5-frame composite: Start → Approach → Grasp → Transport → Final
Raw Images (Slow)
Same task without JPEG optimization
Usage
Server Setup (RunPod)
# Clone OpenPi fork with JPEG support
git clone https://github.com/abdulrahman004/openpi.git
cd openpi
uv sync
# Download checkpoint
uv run huggingface-cli download abdul004/pi05_so101_checkpoint \
--include "4999/**" \
--local-dir checkpoints/pi05_so101
# Start server
uv run scripts/serve_policy.py --port 8000 \
policy:checkpoint \
--policy.config=pi05_so101 \
--policy.dir=checkpoints/pi05_so101/4999
Client (Robot Mac)
pip install openpi-client
# Run inference with JPEG compression
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net
# Or without compression (slower)
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net --no-jpeg
Comparison with ACT Policy
Trained on the same dataset:
| Policy | Architecture | Inference | Grasp | Generalization |
|---|---|---|---|---|
| Pi0.5 | VLA (3B params) | Remote GPU | ✅ | ✅ Edge positions |
| ACT | Transformer (25M) | Local | ✅ | ⚠️ Center only |
Key advantage: Pi0.5 successfully picks up the ball from edge positions that ACT could not handle, which suggests better generalization from VLA pre-training.
Infrastructure Notes
Remote Inference Setup:
- Server: RunPod RTX 4090 24GB (~$0.40/hr on-demand)
- Client: Mac Mini M4 controlling SO-101 robot
- Protocol: WebSocket with msgpack serialization
- Optimization: JPEG compression reduces the payload from 1.8 MB to 71 KB per inference
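The JPEG step in this setup can be illustrated with a minimal sketch. It assumes Pillow is installed and uses quality 80 to match the Q80 setting reported above. The function names (`encode_jpeg`, `decode_jpeg`, `gradient_frame`) are hypothetical and are not taken from `run_pi05.py` or the OpenPi fork, which may use a different library and wire format.

```python
import io

from PIL import Image  # Pillow; assumed to be installed


def encode_jpeg(rgb: bytes, width: int, height: int, quality: int = 80) -> bytes:
    """Compress one raw RGB frame (3 bytes per pixel) into JPEG bytes."""
    image = Image.frombytes("RGB", (width, height), rgb)
    buffer = io.BytesIO()
    image.save(buffer, format="JPEG", quality=quality)
    return buffer.getvalue()


def decode_jpeg(jpeg: bytes) -> tuple[bytes, int, int]:
    """Decompress JPEG bytes to raw RGB. JPEG is lossy, so pixels differ slightly."""
    image = Image.open(io.BytesIO(jpeg)).convert("RGB")
    width, height = image.size
    return image.tobytes(), width, height


def gradient_frame(width: int, height: int) -> bytes:
    """Synthetic stand-in for a camera frame: a smooth color gradient."""
    frame = bytearray()
    for y in range(height):
        for x in range(width):
            frame += bytes((255 * x // width, 255 * y // height, 128))
    return bytes(frame)


if __name__ == "__main__":
    raw = gradient_frame(640, 480)  # ASSUMPTION: 640x480 is illustrative only
    jpeg = encode_jpeg(raw, 640, 480)
    print(f"raw: {len(raw)} bytes, jpeg: {len(jpeg)} bytes")
```

A synthetic gradient compresses far better than a real camera frame, so the ratio printed here says nothing about the 25x figure measured on the robot.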
Known Issues:
- RTX 4090 is borderline for memory - occasional OOM during model loading
- US datacenters preferred (in the latency table above, US on-demand was about 2.4x faster than EU spot with raw images and about 1.4x faster with JPEG)
- First inference takes 30-60s (JAX JIT compilation)
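Because the first inference includes JIT compilation, one way to keep that delay out of the control loop is to send a throwaway request before the robot starts moving. The sketch below shows the idea with a fake policy; `warm_up`, `measure_control_rate`, and `FakePolicy` are hypothetical names used for illustration, and the real client may handle this differently.

```python
import time
from typing import Any, Callable


def warm_up(infer: Callable[[Any], Any], observation: Any) -> float:
    """Run one throwaway inference and return how long it took, in seconds."""
    start = time.perf_counter()
    infer(observation)
    return time.perf_counter() - start


def measure_control_rate(
    infer: Callable[[Any], Any],
    get_observation: Callable[[], Any],
    steps: int,
) -> float:
    """Run `steps` synchronous inference calls and return the achieved rate in Hz."""
    start = time.perf_counter()
    for _ in range(steps):
        infer(get_observation())
    return steps / (time.perf_counter() - start)


class FakePolicy:
    """Stand-in for a remote policy: slow first call, faster calls afterwards."""

    def __init__(self, first_call_s: float, later_call_s: float) -> None:
        self.first_call_s = first_call_s
        self.later_call_s = later_call_s
        self.calls = 0

    def __call__(self, observation: Any) -> dict:
        delay = self.first_call_s if self.calls == 0 else self.later_call_s
        self.calls += 1
        time.sleep(delay)
        return {"actions": [0.0] * 6}  # dummy action, one value per joint


if __name__ == "__main__":
    policy = FakePolicy(first_call_s=0.2, later_call_s=0.01)
    print(f"warm-up took {warm_up(policy, {}):.2f} s")
    print(f"steady-state rate: {measure_control_rate(policy, lambda: {}, 5):.1f} Hz")
```

With the real server, the warm-up observation would need the same keys and image shapes as a normal request so that the compiled function is reused.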
Limitations
- Requires GPU server for inference (not yet optimized for edge deployment)
- Sensitive to lighting changes
- 72 training episodes may limit extreme edge case handling
Citation
@misc{so101_pi05_ball_in_cup,
author = {Abdul},
title = {SO-101 Ball-in-Cup Pi0.5 Fine-tuning},
year = {2026},
publisher = {Hugging Face},
url = {https://huggingface.co/abdul004/pi05_so101_checkpoint}
}
Acknowledgments
- Physical Intelligence for Pi0.5 and OpenPi
- LeRobot by Hugging Face
- SO-101 robot design community