---
tags:
- inverse-dynamics-model
- gameplay
- computer-vision
- fps-games
library_name: owl-idm
---

# Owl IDM - VPT-v0

Inverse Dynamics Model (IDM) that predicts keyboard and mouse inputs from gameplay video.

## Model Description

- **Input**: Sequence of RGB frames (128x128), normalized to [-1, 1]
- **Output**:
  - Button predictions (20 outputs): `W`, `A`, `S`, `D`, `LShift`, `F`, `LMB`, `RMB`, `Space`, `R`, `E`, `V`, `C`, `Ctrl`, `1`, `2`, `3`, `I`, `Tab`, `Esc`
  - Mouse movement (dx, dy in pixels)

## Architecture

The architecture is based on the OpenAI VPT IDM, with some general improvements.

- **Backbone**: Conv3D temporal mixer → ResNet spatial encoder → learned spatial pooling
- **Temporal model**: Transformer (d_model=1024, 12 layers)
- **Window size**: 32 frames
- **Model size**: N/A parameters

## Training

- **Dataset**: FPS gameplay recordings
- **Preprocessing**: Frames scaled to [-1, 1]; mouse deltas scaled with log1p
- **Loss**: BCE with class-balancing pos_weight for buttons, Huber for mouse

## Usage

### Installation

```bash
pip install git+https://github.com/overworld/owl-idm-3.git
```

### Inference

```python
from owl_idms import InferencePipeline
import torch

pipeline = InferencePipeline.from_pretrained(
    "Overworld/owl-idm-vpt-v0",
    device="cuda"
)

# video: [batch, frames, channels, height, width] in range [-1, 1]
# Dummy input: torch.rand samples from [0, 1), so rescale it to [-1, 1)
video = torch.rand(1, 256, 3, 128, 128) * 2 - 1

button_preds, mouse_preds = pipeline(video)
# button_preds: [1, 256, 20] bool, in the order: `W`, `A`, `S`, `D`, `LShift`, `F`, `LMB`, `RMB`, `Space`, `R`, `E`, `V`, `C`, `Ctrl`, `1`, `2`, `3`, `I`, `Tab`, `Esc`
# mouse_preds: [1, 256, 2] float, (dx, dy) in pixels

# Check which buttons are pressed at frame 100
for label, pressed in zip(pipeline.button_labels, button_preds[0, 100]):
    if pressed:
        print(f"{label} pressed")
```

## Model Files

- `config.yml`: Full training configuration
- `model.pt`: EMA model weights (state_dict, ready for load_state_dict)

## License

MIT License
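
## Preprocessing Sketch

The Training section mentions two transforms: frames scaled to [-1, 1] and log1p scaling of mouse movement. The sketch below illustrates what these could look like using only the Python standard library. It is an illustration under assumptions, not the code in `owl_idms`: the function names are hypothetical, the 8-bit input range for pixels is assumed, and the signed form `sign(x) * log1p(|x|)` is one common way to apply log1p to values that can be negative. The model card only states that log1p scaling is enabled, so check `config.yml` and the repository for the exact form.

```python
import math


def normalize_pixel(value: int) -> float:
    """Map an 8-bit channel value (0..255) to [-1, 1]. Assumed input range."""
    return value / 127.5 - 1.0


def scale_mouse(delta: float) -> float:
    """Signed log1p compression: sign(delta) * log(1 + |delta|). Assumed form."""
    return math.copysign(math.log1p(abs(delta)), delta)


def unscale_mouse(scaled: float) -> float:
    """Inverse of scale_mouse: sign(scaled) * (exp(|scaled|) - 1)."""
    return math.copysign(math.expm1(abs(scaled)), scaled)


if __name__ == "__main__":
    # Round-trip a few pixel deltas through the compressed space.
    for dx in (-120.0, -1.0, 0.0, 3.5, 250.0):
        scaled = scale_mouse(dx)
        print(f"{dx:8.1f} -> {scaled:7.4f} -> {unscale_mouse(scaled):8.1f}")
```

Compressing deltas this way keeps small aiming adjustments and large flicks on a comparable scale for a regression loss. Since the Usage section describes `mouse_preds` as (dx, dy) in pixels, the pipeline presumably applies any inverse transform itself, so this step should not be needed on its outputs.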