---
license: apache-2.0
tags:
  - robotics
  - lerobot
  - pi0
  - vla
  - imitation-learning
  - so101
datasets:
  - abdul004/so101_ball_in_cup_v5
pipeline_tag: robotics
---

# SO-101 Ball-in-Cup Pi0.5 Policy

A fine-tuned [Pi0.5 (π₀.₅)](https://www.physicalintelligence.company/blog/pi05) Vision-Language-Action model for the ball-in-cup task using the SO-101 robot arm.

## Task Description

**Goal:** Pick up an orange ball from the table and place it into a pink cup.

**Robot:** [SO-101](https://github.com/huggingface/lerobot/blob/main/examples/10_use_so100.md) - 6-DOF robot arm (5 arm joints plus gripper)

**Cameras:** Dual camera setup (overhead + wrist-mounted)

## Model Architecture

Pi0.5 is a Vision-Language-Action (VLA) model from Physical Intelligence:

| Component | Description |
|-----------|-------------|
| **Vision Encoder** | SigLIP 400M - processes camera images |
| **Language Model** | Gemma 2B - scene understanding & task grounding |
| **Action Expert** | Flow Matching head - generates smooth action trajectories |
| **Total Parameters** | ~3B |

The model takes a natural-language instruction and camera images as input and outputs continuous joint actions.

## Training Details

| Parameter | Value |
|-----------|-------|
| **Base Model** | Pi0.5 (Physical Intelligence) |
| **Dataset** | [abdul004/so101_ball_in_cup_v5](https://huggingface.co/datasets/abdul004/so101_ball_in_cup_v5) |
| **Episodes** | 72 teleoperated demonstrations |
| **Frames** | 25,045 |
| **Fine-tuning Steps** | 5,000 |
| **Hardware** | A100 80GB on RunPod |
| **Training Time** | ~3-4 hours |
| **Cost** | ~$6-8 USD |
| **Framework** | OpenPi (JAX/Flax) |

## Inference Performance

### JPEG Compression Optimization

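Camera images are JPEG-compressed (quality 80) before being sent to the remote server, which reduces network transfer time. Per-inference latency by server location:

| Server Location | Latency, Raw Images | Latency, JPEG (Q80) | Speedup |
|----------|-----------|------------|---------|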
| EU Spot | 1448ms | 375ms | **3.9x** |
| **US On-Demand** | **600ms** | **270ms** | **2.2x** |

| Metric | Before | After |
|--------|--------|-------|
| **Payload Size** | 1.8 MB | 71 KB |
| **Control Rate (US)** | 1.7 Hz | 3.7 Hz |
| **Compression Ratio** | - | 25x |
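
The figures in the two tables above are related by simple arithmetic, which the short Python sketch below reproduces. The function names are illustrative. The control-rate formula assumes one network round trip per control step, which matches the reported numbers. The 640x480 RGB frame size is an assumption (this card does not state the camera resolution); it is included only because two such frames come to about 1.8 MB, in line with the reported raw payload.

```python
def speedup(before_ms: float, after_ms: float) -> float:
    """Ratio of latency before and after the optimization."""
    return before_ms / after_ms


def control_rate_hz(latency_ms: float) -> float:
    """Control rate if every control step waits for one full round trip."""
    return 1000.0 / latency_ms


def compression_ratio(raw_bytes: float, compressed_bytes: float) -> float:
    """How many times smaller the compressed payload is."""
    return raw_bytes / compressed_bytes


# Latencies reported in the table above (milliseconds).
eu_speedup = speedup(1448, 375)
us_speedup = speedup(600, 270)

# Payload sizes reported in the table above (1.8 MB raw, 71 KB as JPEG).
ratio = compression_ratio(1.8e6, 71e3)

# ASSUMPTION (not stated in the card): two 640x480 RGB uint8 frames per request.
raw_payload_bytes = 2 * 640 * 480 * 3

print(f"EU speedup:        {eu_speedup:.1f}x")
print(f"US speedup:        {us_speedup:.1f}x")
print(f"US rate (raw):     {control_rate_hz(600):.1f} Hz")
print(f"US rate (JPEG):    {control_rate_hz(270):.1f} Hz")
print(f"Compression ratio: {ratio:.0f}x")
print(f"Assumed raw size:  {raw_payload_bytes / 1e6:.2f} MB")
```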

### Architecture

```
[RunPod GPU Server]              [Robot Mac]
┌─────────────────┐              ┌──────────────┐
│ Pi0.5 Model     │◄── WSS ────►│ run_pi05.py  │
│ (RTX 4090)      │   JPEG      │ (Robot ctrl) │
└─────────────────┘              └──────────────┘
```

## Demo

### With JPEG Compression (~270ms latency)

![Evaluation Demo - JPEG](eval_demo_jpeg.gif)
*Side-by-side: Overhead camera (left) + Wrist camera (right) - Smooth 3.7 Hz control*

### Without JPEG Compression (~600ms latency)

![Evaluation Demo - Raw](eval_demo_raw.gif)
*Side-by-side: Same task but with raw image transfer - 1.7 Hz control*

## Sample Evaluation

### JPEG Compression (Fast)
![Evaluation Composite - JPEG](eval_composite_jpeg.png)
*5-frame composite: Start → Approach → Grasp → Transport → Final*

### Raw Images (Slow)
![Evaluation Composite - Raw](eval_composite_raw.png)
*Same task without JPEG optimization*

## Usage

### Server Setup (RunPod)

```bash
# Clone OpenPi fork with JPEG support
git clone https://github.com/abdulrahman004/openpi.git
cd openpi
uv sync

# Download checkpoint
uv run huggingface-cli download abdul004/pi05_so101_checkpoint \
    --include "4999/**" \
    --local-dir checkpoints/pi05_so101

# Start server
uv run scripts/serve_policy.py --port 8000 \
    policy:checkpoint \
    --policy.config=pi05_so101 \
    --policy.dir=checkpoints/pi05_so101/4999
```

### Client (Robot Mac)

```bash
pip install openpi-client

# Run inference with JPEG compression
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net

# Or without compression (slower)
python run_pi05.py --server wss://YOUR-POD-8000.proxy.runpod.net --no-jpeg
```
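
`run_pi05.py` itself is not reproduced in this card. As a rough sketch of the kind of loop such a client might run, the Python below queries a policy once per step and reports the achieved control rate. Everything in it is hypothetical: `StubPolicy`, `run_episode`, the observation keys, and the callbacks stand in for the real script and do not touch the network or the robot. It also assumes, without this card confirming it, that the real policy client exposes `infer(observation)` and returns a dict with an `"actions"` entry.

```python
import time
from typing import Any, Callable, Dict, List


class StubPolicy:
    """Hypothetical stand-in for a remote policy client (no network involved)."""

    def __init__(self, latency_s: float = 0.0, action_dim: int = 6) -> None:
        self.latency_s = latency_s
        self.action_dim = action_dim

    def infer(self, observation: Dict[str, Any]) -> Dict[str, Any]:
        time.sleep(self.latency_s)  # simulate the network + GPU round trip
        # Return a one-step "chunk" of zero actions, one value per joint.
        return {"actions": [[0.0] * self.action_dim]}


def run_episode(
    policy: Any,
    get_observation: Callable[[], Dict[str, Any]],
    send_action: Callable[[List[float]], None],
    num_steps: int,
) -> float:
    """Query the policy once per step and return the achieved rate in Hz."""
    start = time.perf_counter()
    for _ in range(num_steps):
        observation = get_observation()
        action_chunk = policy.infer(observation)["actions"]
        send_action(action_chunk[0])  # execute only the first action of the chunk
    elapsed = time.perf_counter() - start
    return num_steps / elapsed


# Usage with placeholders: the observation keys and prompt are illustrative only.
sent_actions: List[List[float]] = []
rate_hz = run_episode(
    policy=StubPolicy(latency_s=0.01),
    get_observation=lambda: {"prompt": "place the ball in the cup"},
    send_action=sent_actions.append,
    num_steps=5,
)
print(f"Sent {len(sent_actions)} actions at {rate_hz:.1f} Hz")
```

Passing the policy in as an argument keeps the loop independent of the transport, so the same function could be timed against a local stub or a remote server.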

## Comparison with ACT Policy

Trained on the same dataset:

| Policy | Architecture | Inference | Grasp | Generalization |
|--------|-------------|-----------|-------|----------------|
| **Pi0.5** | VLA (~3B params) | Remote GPU | ✅ | ✅ Edge positions |
| ACT | Transformer (25M params) | Local | ✅ | ⚠️ Center only |

**Key advantage:** Pi0.5 successfully picks up the ball from edge positions that ACT could not handle, which suggests better generalization from VLA pre-training.

## Infrastructure Notes

**Remote Inference Setup:**
- Server: RunPod RTX 4090 24GB (~$0.40/hr on-demand)
- Client: Mac Mini M4 controlling SO-101 robot
- Protocol: WebSocket with msgpack serialization
- Optimization: JPEG compression reduces the payload from 1.8 MB to 71 KB per inference

**Known Issues:**
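- The RTX 4090's 24 GB of memory is borderline: model loading occasionally fails with an out-of-memory (OOM) error
- US datacenters are preferred: in the latency measurements above, the US pod was about 2.4x faster than the EU pod with raw images and about 1.4x faster with JPEG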
- First inference takes 30-60s (JAX JIT compilation)
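
Because the first request triggers JAX JIT compilation, one possible workaround is to send a throwaway request before the robot starts moving, so the 30-60 s delay does not land in the middle of an episode. The sketch below is an assumption about how that could be done, not part of `run_pi05.py`. `warm_up` and `SlowFirstCallPolicy` are hypothetical names, and the stub only imitates the slow first call with a short sleep.

```python
import time
from typing import Any, Dict


class SlowFirstCallPolicy:
    """Hypothetical stub: the first infer() call is slow, later calls are fast."""

    def __init__(self, first_call_s: float = 0.05) -> None:
        self.first_call_s = first_call_s
        self.calls = 0

    def infer(self, observation: Dict[str, Any]) -> Dict[str, Any]:
        self.calls += 1
        if self.calls == 1:
            time.sleep(self.first_call_s)  # stands in for JIT compilation
        return {"actions": [[0.0] * 6]}


def warm_up(policy: Any, dummy_observation: Dict[str, Any]) -> float:
    """Send one throwaway request and return how long it took, in seconds."""
    start = time.perf_counter()
    policy.infer(dummy_observation)  # result is discarded on purpose
    return time.perf_counter() - start


policy = SlowFirstCallPolicy()
first_s = warm_up(policy, {"prompt": "warm-up"})   # pays the one-time cost
second_s = warm_up(policy, {"prompt": "warm-up"})  # subsequent calls are fast
print(f"first call: {first_s * 1000:.0f} ms, second call: {second_s * 1000:.0f} ms")
```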

## Limitations

- Requires GPU server for inference (not yet optimized for edge deployment)
- Sensitive to lighting changes
- 72 training episodes may limit extreme edge case handling

## Citation

```bibtex
@misc{so101_pi05_ball_in_cup,
  author = {Abdul},
  title = {SO-101 Ball-in-Cup Pi0.5 Fine-tuning},
  year = {2026},
  publisher = {Hugging Face},
  url = {https://huggingface.co/abdul004/pi05_so101_checkpoint}
}
```

## Acknowledgments

- [Physical Intelligence](https://www.physicalintelligence.company/) for Pi0.5 and OpenPi
- [LeRobot](https://github.com/huggingface/lerobot) by Hugging Face
- SO-101 robot design community