---
license: mit
base_model: Qwen/Qwen3.5-9B
tags:
  - choonsik
  - VLA
  - Minecraft
  - vision-language-action
  - qwen3.5
  - image-text-to-text
datasets:
  - CraftJarvis/minecraft-vla-sft
library_name: transformers
language:
  - en
pipeline_tag: image-text-to-text
---

# Choonsik — Minecraft Vision-Language-Action Model

Choonsik is a **Vision-Language-Action (VLA)** model for Minecraft, built on
[Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) and trained with the
three-stage **ActVLP** pipeline from
[JARVIS-VLA](https://arxiv.org/abs/2503.16365).

Given a Minecraft observation frame and a natural-language task instruction,
Choonsik outputs keyboard and mouse action tokens that can be executed directly
in the game. It covers 1,000+ atomic tasks, including crafting, mining,
smelting, combat, and navigation.

| Property | Value |
|---|---|
| **Base model** | [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) |
| **Training data** | [CraftJarvis/minecraft-vla-sft](https://huggingface.co/datasets/CraftJarvis/minecraft-vla-sft) (3.78M examples) |
| **Training stages** | Language → Vision-Language → Imitation Learning |
| **License** | MIT |

## Usage

```python
from choonsik.inference import ChoonsikInferenceRunner
from PIL import Image

runner = ChoonsikInferenceRunner("Infinity08/Choonsik-Qwen3.5-9B")
frame  = Image.open("minecraft_frame.png")

action = runner.predict(frame, task="craft a wooden pickaxe")
# action = {"forward": 0, "attack": 1, ..., "camera": [0.0, 0.3]}
```
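
The returned dict can be split into pressed keys and the camera pair before it is handed to a game interface. The helper below is a minimal sketch: `split_action` is a hypothetical name, not part of the `choonsik` package, and it assumes that key entries are `0`/`1` flags and that `"camera"` holds a two-element list, as in the comment above.

```python
def split_action(action):
    """Split an action dict into (pressed_keys, camera).

    Assumes key entries are 0/1 flags and "camera" is a two-element
    sequence; both are inferred from the example output, not from a spec.
    """
    camera = tuple(action.get("camera", (0.0, 0.0)))
    pressed = sorted(
        name for name, value in action.items()
        if name != "camera" and value == 1
    )
    return pressed, camera


# Illustrative dict only; a real prediction contains more keys.
example = {"forward": 0, "attack": 1, "jump": 1, "camera": [0.0, 0.3]}
pressed, camera = split_action(example)
print(pressed, camera)
```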

## Action Space

Choonsik predicts actions as discrete tokens, with camera movement discretized into **mu-law bins**:

| Token type | Count | Description |
|---|---|---|
| Keyboard | 29 | `forward`, `attack`, `use`, `jump`, hotbar 1–9, … |
| Mouse X | 21 | Horizontal camera rotation (mu-law bins) |
| Mouse Y | 21 | Vertical camera rotation (mu-law bins) |
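
To illustrate how a continuous camera movement can map to one of the 21 bins, here is a minimal sketch of mu-law discretization. Only the bin count comes from the table above. The values of `MU` and `MAX_DELTA` are assumptions chosen for illustration, and the function names are hypothetical, so this may differ from the encoding the model actually uses.

```python
import math

N_BINS = 21        # from the table above
MU = 20.0          # assumption: companding strength
MAX_DELTA = 10.0   # assumption: largest camera movement per step


def mu_law_encode(delta, mu=MU, max_delta=MAX_DELTA, n_bins=N_BINS):
    """Map a continuous camera movement to a bin index in [0, n_bins - 1]."""
    # Clamp, then normalize to [-1, 1].
    x = max(-max_delta, min(max_delta, delta)) / max_delta
    # Mu-law companding: expands small magnitudes, compresses large ones.
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Uniform quantization of the companded value.
    return int(round((y + 1.0) / 2.0 * (n_bins - 1)))


def mu_law_decode(bin_index, mu=MU, max_delta=MAX_DELTA, n_bins=N_BINS):
    """Map a bin index back to the camera movement at the bin center."""
    y = 2.0 * bin_index / (n_bins - 1) - 1.0
    # Inverse companding.
    x = math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
    return x * max_delta


for delta in (-10.0, -1.0, 0.0, 0.5, 10.0):
    k = mu_law_encode(delta)
    print(f"{delta:+6.2f} -> bin {k:2d} -> {mu_law_decode(k):+6.2f}")
```

Because the bins are uniform in the companded domain, they are narrower near zero, which gives small camera adjustments finer resolution than large sweeps.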

## Training

Three-stage ActVLP pipeline (following JARVIS-VLA):

1. **Stage 1 — Language post-training**: Minecraft world knowledge (text-only SFT)
2. **Stage 2 — Vision-language alignment**: Image captioning and VQA on gameplay frames
3. **Stage 3 — Imitation learning**: Action prediction on 3.78M trajectory examples

Training hardware: L40S (48 GB VRAM). Inference: RTX 5080 with 4-bit NF4 quantization.
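
For reference, 4-bit NF4 loading is usually configured in `transformers` through `BitsAndBytesConfig`. The sketch below assumes the checkpoint loads through `AutoModelForImageTextToText` and that `torch`, `transformers`, and `bitsandbytes` are installed. The compute dtype and double-quantization settings are assumptions, not the settings used for this model, and `ChoonsikInferenceRunner` may already handle quantization on its own.

```python
def build_nf4_config():
    """Return a 4-bit NF4 quantization config (sketch, settings assumed)."""
    # Imported lazily so this module can be imported without the heavy deps.
    import torch
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumption
        bnb_4bit_use_double_quant=True,         # assumption
    )


def load_quantized(repo_id="Infinity08/Choonsik-Qwen3.5-9B"):
    """Load the checkpoint in 4-bit; assumes the Auto class supports it."""
    from transformers import AutoModelForImageTextToText

    return AutoModelForImageTextToText.from_pretrained(
        repo_id,
        quantization_config=build_nf4_config(),
        device_map="auto",
    )
```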

## Citation

If you use Choonsik or the underlying JARVIS-VLA methodology, please cite:

```bibtex
@article{li2025jarvisvla,
  title   = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models
             to Play Visual Games with Keyboards and Mouse},
  author  = {Muyao Li and Zihao Wang and Kaichen He and others},
  journal = {arXiv preprint arXiv:2503.16365},
  year    = {2025}
}
```