---
license: apache-2.0
tags:
- robotics
- lerobot
- pi0
- vla
- imitation-learning
- so101
datasets:
- abdul004/so101_ball_in_cup_v5
pipeline_tag: robotics
---

# SO-101 Ball-in-Cup Pi0.5 Policy

A fine-tuned [Pi0.5 (π₀.₅)](https://www.physicalintelligence.company/blog/pi05) Vision-Language-Action model for the ball-in-cup task using the SO-101 robot arm.

## Task Description

**Goal:** Pick up an orange ball from the table and place it into a pink cup.

**Robot:** [SO-101](https://github.com/huggingface/lerobot/blob/main/examples/10_use_so100.md) - 6-DOF robot arm with gripper

**Cameras:** Dual camera setup (overhead + wrist-mounted)

## Model Architecture

Pi0.5 is a Vision-Language-Action (VLA) model from Physical Intelligence:

| Component | Description |
|-----------|-------------|
| **Vision Encoder** | SigLIP 400M - processes camera images |
| **Language Model** | Gemma 2B - scene understanding & task grounding |
| **Action Expert** | Flow Matching head - generates smooth action trajectories |
| **Total Parameters** | ~3B |

The model takes natural language instructions + camera images → outputs continuous joint actions.

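A flow-matching head produces actions by integrating a velocity field from random noise toward an action vector. The snippet below is a minimal sketch of that sampling loop, not the OpenPi implementation: `toy_velocity`, `sample_actions`, the 10-step Euler schedule, and the 6-value target are all hypothetical stand-ins. In the real model the velocity comes from a network conditioned on images, language, and robot state.

```python
import random


def toy_velocity(x, t, target):
    """Hypothetical velocity field along the straight line from x to `target`.

    A real action expert is a neural network; this closed form is used
    only so the example runs on its own.
    """
    return [(g - xi) / (1.0 - t) for xi, g in zip(x, target)]


def sample_actions(target, num_steps=10, seed=0):
    """Euler-integrate from Gaussian noise (t=0) toward an action vector (t=1)."""
    rng = random.Random(seed)
    x = [rng.gauss(0.0, 1.0) for _ in target]  # start from pure noise
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k / num_steps
        v = toy_velocity(x, t, target)
        x = [xi + dt * vi for xi, vi in zip(x, v)]  # one Euler step
    return x


# Made-up 6-value joint command, used only as the toy field's destination.
target = [0.1, -0.4, 0.8, 0.0, 0.3, 1.0]
actions = sample_actions(target)
print([round(a, 3) for a in actions])
```

With this toy field the final Euler step lands on `target` up to floating-point error. A learned field has no such closed form, so the number of integration steps trades accuracy against inference time.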
## Training Details

| Parameter | Value |
|-----------|-------|
| **Base Model** | Pi0.5 (Physical Intelligence) |
| **Dataset** | [abdul004/so101_ball_in_cup_v5](https://huggingface.co/datasets/abdul004/so101_ball_in_cup_v5) |
| **Episodes** | 72 teleoperated demonstrations |
| **Frames** | 25,045 |
| **Fine-tuning Steps** | 5,000 |
| **Hardware** | A100 80GB on RunPod |
| **Training Time** | ~3-4 hours |
| **Cost** | ~$6-8 USD |
| **Framework** | OpenPi (JAX/Flax) |

## Inference Performance

### JPEG Compression Optimization

We implemented JPEG compression to reduce network transfer time for remote inference. Per-inference latency by server location:

| Location | Raw Images | JPEG (Q80) | Speedup |
|----------|-----------|------------|---------|
| EU Spot | 1448ms | 375ms | **3.9x** |
| **US On-Demand** | **600ms** | **270ms** | **2.2x** |

| Metric | Before | After |
|--------|--------|-------|
| **Payload Size** | 1.8 MB | 71 KB |
| **Control Rate (US)** | 1.7 Hz | 3.7 Hz |
| **Compression Ratio** | - | 25x |

### Architecture

```
[RunPod GPU Server]             [Robot Mac]
┌─────────────────┐             ┌──────────────┐
│ Pi0.5 Model     │◄── WSS ────►│ run_pi05.py  │
│ (RTX 4090)      │    JPEG     │ (Robot ctrl) │
└─────────────────┘             └──────────────┘
```

## Demo

### With JPEG Compression (~270ms latency)

![Evaluation Demo - JPEG](eval_demo_jpeg.gif)

*Side-by-side: Overhead camera (left) + Wrist camera (right) - Smooth 3.7 Hz control*

### Without JPEG Compression (~600ms latency)

![Evaluation Demo - Raw](eval_demo_raw.gif)

*Side-by-side: Same task but with raw image transfer - 1.7 Hz control*

## Sample Evaluation

### JPEG Compression (Fast)

![Evaluation Composite - JPEG](eval_composite_jpeg.png)

*5-frame composite: Start → Approach → Grasp → Transport → Final*

### Raw Images (Slow)

![Evaluation Composite - Raw](eval_composite_raw.png)

*Same task without JPEG optimization*

## Usage

### Server Setup (RunPod)

```bash
# Clone OpenPi fork with JPEG support
git clone https://github.com/abdulrahman004/openpi.git
cd openpi
uv sync

# Download checkpoint
uv run huggingface-cli download abdul004/pi05_so101_checkpoint \
    --include "4999/**" \
    --local-dir checkpoints/pi05_so101

# Start server
uv run scripts/serve_policy.py --port 8000 \
    policy:checkpoint \
    --policy.config=pi05_so101 \
    --policy.dir=checkpoints/pi05_so101/4999
```

### Client (Robot Mac)

```bash
pip install openpi-client

# Run inference with JPEG compression
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net

# Or without compression (slower)
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net --no-jpeg
```

## Comparison with ACT Policy

Both policies were trained on the same dataset:

| Policy | Architecture | Inference | Grasp | Generalization |
|--------|-------------|-----------|-------|----------------|
| **Pi0.5** | VLA (3B params) | Remote GPU | ✅ | ✅ Edge positions |
| ACT | Transformer (25M) | Local | ✅ | ⚠️ Center only |

**Key advantage:** Pi0.5 successfully picks up the ball from edge positions that ACT could not handle, which suggests better generalization from VLA pre-training.

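Because Pi0.5 runs on a remote GPU, its usable control rate is bounded by per-inference latency. The figures in the Inference Performance tables can be cross-checked with a few lines of arithmetic. This sketch assumes one blocking request per control step (so rate = 1 / latency) and decimal units (1.8 MB = 1800 KB); the helper names are illustrative and are not part of `run_pi05.py`.

```python
def control_rate_hz(latency_ms):
    """Control rate if each step waits for one full inference round trip."""
    return 1000.0 / latency_ms


def ratio(before, after):
    """How many times smaller `after` is than `before`."""
    return before / after


# Latencies (ms) and payload sizes (KB) copied from the tables in this card.
raw_us_ms, jpeg_us_ms = 600, 270
raw_eu_ms, jpeg_eu_ms = 1448, 375
raw_kb, jpeg_kb = 1800, 71  # assumption: 1.8 MB treated as 1800 KB

summary = {
    "rate_raw_hz": round(control_rate_hz(raw_us_ms), 1),
    "rate_jpeg_hz": round(control_rate_hz(jpeg_us_ms), 1),
    "speedup_eu": round(ratio(raw_eu_ms, jpeg_eu_ms), 1),
    "speedup_us": round(ratio(raw_us_ms, jpeg_us_ms), 1),
    "compression": round(ratio(raw_kb, jpeg_kb)),
}
print(summary)
```

Under those assumptions the computed values match the reported 1.7 Hz and 3.7 Hz control rates, the 3.9x and 2.2x speedups, and the 25x compression ratio.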
## Infrastructure Notes

**Remote Inference Setup:**
- Server: RunPod RTX 4090 24GB (~$0.40/hr on-demand)
- Client: Mac Mini M4 controlling SO-101 robot
- Protocol: WebSocket with msgpack serialization
- Optimization: JPEG compression reduces 1.8MB → 71KB per inference

**Known Issues:**
- RTX 4090 is borderline for memory - occasional OOM during model loading
- US datacenters preferred (per the latency table: about 2.4x lower latency than EU with raw images, about 1.4x with JPEG)
- First inference takes 30-60s (JAX JIT compilation)

## Limitations

- Requires GPU server for inference (not yet optimized for edge deployment)
- Sensitive to lighting changes
- 72 training episodes may limit extreme edge case handling

## Citation

```bibtex
@misc{so101_pi05_ball_in_cup,
  author = {Abdul},
  title = {SO-101 Ball-in-Cup Pi0.5 Fine-tuning},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/abdul004/pi05_so101_checkpoint}
}
```

## Acknowledgments

- [Physical Intelligence](https://www.physicalintelligence.company/) for Pi0.5 and OpenPi
- [LeRobot](https://github.com/huggingface/lerobot) by Hugging Face
- SO-101 robot design community