---
license: apache-2.0
tags:
  - reinforcement-learning
  - robotics
  - quadruped
  - locomotion
  - isaac-lab
  - ppo
  - rsl-rl
library_name: rsl-rl
pipeline_tag: reinforcement-learning
---

# Go2+Z1 Walking Policy (V1, state-only PPO)

PPO walking policy for the **Unitree Go2 + Z1** composite robot (12 leg DOF + 6 arm DOF = 18 DOF), trained in Isaac Lab on flat ground. The policy commands only the 12 leg joints; the Z1 arm is held folded on the robot's back throughout.

## Highlights

- Backbone: rsl-rl `OnPolicyRunner` actor-critic (MLP 512-256-128, ELU)
- Task: `Isaac-Velocity-Flat-Go2Z1-v0` (forward/lateral linear-velocity commands plus small yaw-rate commands)
- 4096 parallel envs × 1500 PPO iters on a single RTX PRO 6000 Blackwell (96 GB)
- Z1 arm forced to remain in the folded "startFlat" pose during locomotion
- Verified: walks 10 m inside the real `Simple_Warehouse/warehouse.usd` (3/3 episodes)
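The training budget implied by these settings can be estimated as follows. This is a minimal sketch: the rollout length per environment per PPO iteration (`num_steps_per_env`) is not stated in this card, and the value 24 below is an assumption (a value commonly used in Isaac Lab locomotion runner configs). Substitute the value from the actual runner config.

```python
def env_steps(num_envs: int, num_steps_per_env: int, iterations: int) -> int:
    """Total simulator transitions collected over all PPO iterations."""
    return num_envs * num_steps_per_env * iterations

NUM_ENVS = 4096          # from this card
ITERATIONS = 1500        # from this card
NUM_STEPS_PER_ENV = 24   # ASSUMPTION: not stated in this card

per_iter = NUM_ENVS * NUM_STEPS_PER_ENV
total = env_steps(NUM_ENVS, NUM_STEPS_PER_ENV, ITERATIONS)
print(f"{per_iter:,} transitions per iteration, {total:,} in total")
```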

## Files

- `model_*.pt` — checkpoint dictionaries with `actor_state_dict` / `critic_state_dict`
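When several `model_*.pt` files are present, choose the latest one by its iteration number rather than by string sorting. The sketch below is minimal and assumes the rsl-rl naming scheme `model_<iteration>.pt`, which matches the `model_1499.pt` file used in the Usage section. The file list is hypothetical.

```python
import re
from pathlib import Path

# Assumed naming: model_<iteration>.pt (as written by rsl-rl's OnPolicyRunner)
_CKPT = re.compile(r"model_(\d+)\.pt")

def latest_checkpoint(names):
    """Return the name with the highest iteration number, or None if none match."""
    best, best_it = None, -1
    for name in names:
        m = _CKPT.fullmatch(Path(name).name)
        if m and int(m.group(1)) > best_it:
            best, best_it = name, int(m.group(1))
    return best

# Hypothetical listing; plain lexicographic sorting would rank model_999.pt last
files = ["model_0.pt", "model_999.pt", "model_1499.pt", "model_50.pt"]
print(latest_checkpoint(files))
```

To scan a run directory, pass something like `(p.name for p in Path(run_dir).glob("model_*.pt"))`, where `run_dir` is a placeholder for your log directory.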

## Architecture

```
Actor MLP : Linear(obs→512) ELU Linear(512→256) ELU Linear(256→128) ELU Linear(128→12)
Critic MLP: same shape, single value head
Inputs    : base lin_vel + ang_vel + projected_gravity + commands + joint_pos + joint_vel + last_action
Outputs   : 12 leg joint position deltas (Go2 hip/thigh/calf × 4)
```
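With the layer widths fixed as above, model size depends only on the observation dimension. The sketch below counts the weights and biases of the actor and critic MLPs. `OBS_DIM = 60` is a hypothetical placeholder; read the real value from `mlp.0.weight.shape[1]` in the checkpoint. The sketch also assumes the critic sees the same observations as the actor, and it does not count any extra checkpoint entries such as an action-noise std.

```python
HIDDEN = [512, 256, 128]  # hidden widths from the architecture above

def mlp_params(sizes):
    """Weights + biases of a chain of nn.Linear layers with the given widths."""
    return sum(i * o + o for i, o in zip(sizes, sizes[1:]))

def actor_critic_params(obs_dim, act_dim=12, critic_obs_dim=None):
    critic_obs_dim = obs_dim if critic_obs_dim is None else critic_obs_dim
    actor = mlp_params([obs_dim, *HIDDEN, act_dim])    # 12 leg actions
    critic = mlp_params([critic_obs_dim, *HIDDEN, 1])  # single value head
    return actor, critic

OBS_DIM = 60  # HYPOTHETICAL placeholder, not the real observation size
print(actor_critic_params(OBS_DIM))
```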

## Usage

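```python
import torch
import torch.nn as nn

# Load the checkpoint (a plain dict, so weights_only=False is needed on recent PyTorch)
device = "cuda:0" if torch.cuda.is_available() else "cpu"
state = torch.load("model_1499.pt", map_location=device, weights_only=False)
sd = state["actor_state_dict"]

# Rebuild the actor: Linear layers at mlp.0, mlp.2, mlp.4 (ELU in between), output at mlp.6.
# Read each width from the checkpoint; the hidden layers differ (512 -> 256 -> 128).
obs_dim = sd["mlp.0.weight"].shape[1]
h1 = sd["mlp.0.weight"].shape[0]       # 512
h2 = sd["mlp.2.weight"].shape[0]       # 256
h3 = sd["mlp.4.weight"].shape[0]       # 128
act_dim = sd["mlp.6.weight"].shape[0]  # 12 leg joints
actor = nn.Sequential(
    nn.Linear(obs_dim, h1), nn.ELU(),
    nn.Linear(h1, h2), nn.ELU(),
    nn.Linear(h2, h3), nn.ELU(),
    nn.Linear(h3, act_dim),
).to(device).eval()
# Strip the "mlp." prefix so keys match nn.Sequential's "0.weight", "2.weight", ...
actor.load_state_dict({k.removeprefix("mlp."): v for k, v in sd.items() if k.startswith("mlp.")})

# obs: (num_envs, obs_dim) tensor from Isaac Lab's Isaac-Velocity-Flat-Go2Z1-Play-v0 env
with torch.inference_mode():
    action = actor(obs)  # (num_envs, 12) leg joint position deltas
```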
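The actor output is a set of position deltas, not absolute joint targets. Isaac Lab's joint-position action term typically adds the scaled action to the robot's default joint positions. The sketch below assumes that convention, a hypothetical `ACTION_SCALE = 0.25`, and placeholder default angles. Take the real scale, default pose, and joint order from the task and robot configs, since this card does not state them.

```python
from typing import Sequence

ACTION_SCALE = 0.25  # ASSUMPTION: check the action term in the task config

def leg_targets(action: Sequence[float], default_pos: Sequence[float],
                scale: float = ACTION_SCALE) -> list[float]:
    """Joint position targets = default pose + scale * policy action."""
    if len(action) != len(default_pos):
        raise ValueError("action and default_pos must have the same length")
    return [d + scale * a for a, d in zip(action, default_pos)]

# PLACEHOLDER defaults for the 12 leg joints; not the real Go2 values or order
default = [0.0, 0.8, -1.5] * 4
print(leg_targets([0.0] * 12, default) == default)  # zero action holds the default pose
```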

For end-to-end inference inside Isaac Sim, see [`stage4_joint_eval/walk_in_real_warehouse.py`](https://github.com/aws300/go2_z1_warehouse/blob/main/go2_z1_warehouse/stage4_joint_eval/walk_in_real_warehouse.py).

## Training data

This is an **on-policy RL** model — no offline dataset is used. The policy is trained from scratch by interacting with the simulator. The full task definition (rewards, observations, terminations) lives in:

- Repo: <https://github.com/aws300/go2_z1_warehouse>
- Task config: `go2_z1_warehouse/stage1_walking/{flat_env_cfg.py, rough_env_cfg.py}`
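For reference, the standard PPO loss below combines the clipped surrogate of Schulman et al. (2017) with a value loss and an entropy bonus. This is the general form that rsl-rl's PPO minimizes, shown in simplified form; the clip range $\epsilon$ and the coefficients $c_V$ and $c_H$ come from the runner config in the repo and are not stated in this card.

$$
r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_\text{old}}(a_t \mid s_t)}, \qquad
\mathcal{L}(\theta, \phi) = -\,\mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat A_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat A_t\big)\right]
+ c_V\,\mathbb{E}_t\!\left[\big(V_\phi(s_t) - \hat R_t\big)^2\right]
- c_H\,\mathbb{E}_t\!\left[\mathcal{H}\big[\pi_\theta(\cdot \mid s_t)\big]\right]
$$

Here $\hat A_t$ are advantage estimates and $\hat R_t$ the returns computed from the latest on-policy rollout, which is discarded after each update.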

## Eval results

| Scenario | Episodes | Success | Mean distance traveled |
|---|---|---|---|
| Flat plane | 10 | 100 % | — |
| 4 cuboid shelves | 5 | 80 % | 11.21 m |
| Real `warehouse.usd` | 3 | 100 % | 10.00 m |
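With only 3 to 10 episodes per scenario, these success rates carry wide uncertainty. The sketch below computes 95 % Wilson score intervals from the table's counts. The 4/5 count for the shelves scenario is inferred from 80 % of 5 episodes.

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial success rate (z = 1.96 for ~95 %)."""
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)

rows = {"Flat plane": (10, 10), "4 cuboid shelves": (4, 5), "Real warehouse.usd": (3, 3)}
for name, (k, n) in rows.items():
    lo, hi = wilson_interval(k, n)
    print(f"{name}: {k}/{n} -> [{lo:.2f}, {hi:.2f}]")
```

A 3/3 result is consistent with a true success rate well below 100 %, so more episodes would be needed to tighten the warehouse estimate.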

## Citation

```bibtex
@misc{go2z1-walking-v1,
  title  = {Go2+Z1 Warehouse Walking Policy V1 (state-only PPO)},
  author = {m3},
  year   = {2026},
  url    = {https://huggingface.co/m3/go2z1-walking-rsl-rl-v1}
}
```

## Successor

- V2 (rotation-capable + heading-tracking): [m3/go2z1-walking-rsl-rl-v2](https://huggingface.co/m3/go2z1-walking-rsl-rl-v2)