---
tags:
- inverse-dynamics-model
- gameplay
- computer-vision
- fps-games
library_name: owl-idm
---

# Owl IDM - VPT-v0

Inverse Dynamics Model (IDM) that predicts keyboard and mouse inputs from gameplay video.

## Model Description

- **Input**: Sequence of RGB frames (128x128), normalized to [-1, 1]
- **Output**:
  - Button predictions (20 outputs): `W`, `A`, `S`, `D`, `LShift`, `F`, `LMB`, `RMB`, `Space`, `R`, `E`, `V`, `C`, `Ctrl`, `1`, `2`, `3`, `I`, `Tab`, `Esc`
  - Mouse movement (dx, dy in pixels)
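
Raw gameplay frames usually arrive as `uint8` images at the capture resolution, so they need resizing and rescaling before they match the input spec above. The following is a minimal sketch of that step. The helper name `preprocess_frames` and the choice of bilinear resizing are assumptions for illustration, not part of `owl-idm`, and the resize filter used during training is not stated in this card.

```python
import torch
import torch.nn.functional as F


def preprocess_frames(frames_uint8: torch.Tensor, size: int = 128) -> torch.Tensor:
    """Convert uint8 frames [T, H, W, 3] in 0..255 to float [T, 3, size, size] in [-1, 1].

    Hypothetical helper: bilinear resizing is an assumption, not the documented pipeline.
    """
    x = frames_uint8.permute(0, 3, 1, 2).float()  # [T, H, W, 3] -> [T, 3, H, W]
    x = F.interpolate(x, size=(size, size), mode="bilinear", align_corners=False)
    x = x / 127.5 - 1.0  # map 0..255 to -1..1
    return x.clamp(-1.0, 1.0)
```

Adding a leading batch dimension with `preprocess_frames(frames).unsqueeze(0)` gives the `[batch, frames, channels, height, width]` layout used in the inference example.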

## Architecture

The architecture is based on OpenAI's VPT IDM, with some general improvements.

- **Backbone**: Conv3D temporal mixer → ResNet spatial encoder → learned spatial pooling
- **Temporal model**: Transformer (d_model=1024, 12 layers)
- **Window size**: 32 frames
- **Model size**: not reported

## Training

- **Dataset**: FPS gameplay recordings
- **Preprocessing**: Frames scaled to [-1, 1]; mouse deltas log1p-scaled
- **Loss**: BCE with class-balancing pos_weight for buttons, Huber for mouse
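
To make the training objective concrete, here is a rough sketch of how log1p mouse scaling and the two loss terms could fit together. Everything in it is an assumption for illustration: the function names are hypothetical, the sign-preserving form of log1p is one common choice for signed deltas, the negatives-to-positives ratio is one common way to derive `pos_weight`, and the two terms are summed with equal weight. The actual settings live in `config.yml`.

```python
import torch
import torch.nn.functional as F


def log1p_scale(delta: torch.Tensor) -> torch.Tensor:
    """Sign-preserving log compression: sign(x) * log(1 + |x|). Assumed form."""
    return torch.sign(delta) * torch.log1p(delta.abs())


def log1p_unscale(scaled: torch.Tensor) -> torch.Tensor:
    """Inverse of log1p_scale: sign(y) * (exp(|y|) - 1)."""
    return torch.sign(scaled) * torch.expm1(scaled.abs())


def balanced_pos_weight(button_targets: torch.Tensor) -> torch.Tensor:
    """Per-button ratio of negative to positive frames (assumed balancing rule)."""
    flat = button_targets.reshape(-1, button_targets.shape[-1]).float()
    pos = flat.sum(dim=0)
    neg = flat.shape[0] - pos
    return neg / pos.clamp(min=1.0)  # avoid division by zero for never-pressed keys


def idm_loss(button_logits, button_targets, mouse_pred, mouse_target_px, pos_weight):
    """BCE-with-logits on buttons plus Huber on log1p-scaled mouse deltas.

    Equal weighting of the two terms is an assumption.
    """
    button_loss = F.binary_cross_entropy_with_logits(
        button_logits, button_targets.float(), pos_weight=pos_weight
    )
    mouse_loss = F.huber_loss(mouse_pred, log1p_scale(mouse_target_px))
    return button_loss + mouse_loss
```

The log compression keeps large flick movements from dominating the regression loss, and `log1p_unscale` maps a prediction in scaled space back to pixels.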

## Usage

### Installation

```bash
pip install git+https://github.com/overworld/owl-idm-3.git
```

### Inference

```python
from owl_idms import InferencePipeline
import torch

pipeline = InferencePipeline.from_pretrained(
    "Overworld/owl-idm-vpt-v0",
    device="cuda"
)

# video: [batch, frames, channels, height, width] in range [-1, 1]
video = torch.rand(1, 256, 3, 128, 128) * 2 - 1  # random placeholder clip in [-1, 1]

button_preds, mouse_preds = pipeline(video)
# button_preds: [1, 256, 20] bool, in the order W, A, S, D, LShift, F, LMB, RMB, Space, R, E, V, C, Ctrl, 1, 2, 3, I, Tab, Esc
# mouse_preds:  [1, 256, 2] float, (dx, dy) in pixels

# Check which buttons are pressed at frame 100
for label, pressed in zip(pipeline.button_labels, button_preds[0, 100]):
    if pressed:
        print(f"{label} pressed")
```
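
For downstream use, such as writing pseudo-labels for unlabeled video, it can help to turn the two output tensors into one record per frame. The sketch below is a hypothetical post-processing helper, not part of `owl-idm`. It hard-codes the button order listed in this card, operates on a single clip (`button_preds[0]`, `mouse_preds[0]`), and uses an arbitrary dict layout.

```python
import torch

# Button order as listed in this model card.
BUTTON_LABELS = ["W", "A", "S", "D", "LShift", "F", "LMB", "RMB", "Space", "R",
                 "E", "V", "C", "Ctrl", "1", "2", "3", "I", "Tab", "Esc"]


def decode_actions(button_preds: torch.Tensor, mouse_preds: torch.Tensor) -> list:
    """Turn [T, 20] bool buttons and [T, 2] mouse deltas into a list of per-frame dicts."""
    actions = []
    for pressed, (dx, dy) in zip(button_preds.tolist(), mouse_preds.tolist()):
        keys = [label for label, on in zip(BUTTON_LABELS, pressed) if on]
        actions.append({"keys": keys, "dx": dx, "dy": dy})
    return actions
```

Calling `decode_actions(button_preds[0].cpu(), mouse_preds[0].cpu())` on the outputs above would yield 256 records, one per input frame.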

## Model Files

- `config.yml`: Full training configuration
- `model.pt`: EMA model weights (state_dict, ready for load_state_dict)
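
Since `model.pt` is described as a plain `state_dict`, its size can be inspected without building the model. This is a small sketch under that assumption; `count_parameters` is a hypothetical helper, and the total includes any buffers stored in the checkpoint (for example normalization statistics), so it may slightly exceed the trainable parameter count.

```python
import torch


def count_parameters(state_dict) -> int:
    """Sum element counts over all tensors in a state_dict (parameters and buffers)."""
    return sum(t.numel() for t in state_dict.values() if torch.is_tensor(t))


# Usage, assuming model.pt has been downloaded to the working directory:
# state_dict = torch.load("model.pt", map_location="cpu")
# print(f"{count_parameters(state_dict) / 1e6:.1f}M values")
```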

## License

MIT License