---
license: mit
base_model: Qwen/Qwen3.5-9B
tags:
  - choonsik
  - VLA
  - Minecraft
  - vision-language-action
  - qwen3.5
  - image-text-to-text
datasets:
  - CraftJarvis/minecraft-vla-sft
library_name: transformers
language:
  - en
pipeline_tag: image-text-to-text
---

# Choonsik — Minecraft Vision-Language-Action Model

Choonsik is a **Vision-Language-Action (VLA)** model for Minecraft, built on
[Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) and trained with the
three-stage **ActVLP** pipeline from
[JARVIS-VLA](https://arxiv.org/abs/2503.16365).

Given a Minecraft observation frame and a natural-language task instruction,
Choonsik outputs keyboard + mouse action tokens that can be executed directly
in the game — covering 1,000+ atomic tasks (crafting, mining, smelting, combat,
navigation, etc.).

| Property | Value |
|---|---|
| **Base model** | [Qwen/Qwen3.5-9B](https://huggingface.co/Qwen/Qwen3.5-9B) |
| **Training data** | [CraftJarvis/minecraft-vla-sft](https://huggingface.co/datasets/CraftJarvis/minecraft-vla-sft) (3.78M examples) |
| **Training stages** | Language → Vision-Language → Imitation Learning |
| **License** | MIT |

## Usage

```python
from choonsik.inference import ChoonsikInferenceRunner
from PIL import Image

runner = ChoonsikInferenceRunner("Infinity08/Choonsik-Qwen3.5-9B")
frame  = Image.open("minecraft_frame.png")

action = runner.predict(frame, task="craft a wooden pickaxe")
# action = {"forward": 0, "attack": 1, ..., "camera": [0.0, 0.3]}
```
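
The returned action can be unpacked before it is sent to the game. The sketch below assumes the dict layout shown in the comment above (binary key states plus a two-element `camera` entry). The helper name `summarize_action` is hypothetical and not part of the `choonsik` package.

```python
def summarize_action(action: dict) -> tuple[list[str], list[float]]:
    """Split an action dict into the pressed keys and the camera pair.

    Assumes every non-camera entry is a 0/1 key state, as in the
    example comment in the Usage section.
    """
    camera = list(action.get("camera", [0.0, 0.0]))
    pressed = sorted(k for k, v in action.items() if k != "camera" and v == 1)
    return pressed, camera


# Illustrative action, shaped like the one shown in the Usage section.
example = {"forward": 0, "attack": 1, "camera": [0.0, 0.3]}
pressed, camera = summarize_action(example)
print(pressed, camera)  # ['attack'] [0.0, 0.3]
```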

## Action Space

Choonsik predicts actions using **mu-law discretized tokens**:

| Token type | Count | Description |
|---|---|---|
| Keyboard | 29 | `forward`, `attack`, `use`, `jump`, hotbar 1–9, … |
| Mouse X | 21 | Horizontal camera rotation (mu-law bins) |
| Mouse Y | 21 | Vertical camera rotation (mu-law bins) |
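
To illustrate how mu-law binning maps a continuous camera change onto 21 tokens, here is a minimal sketch. Only the bin count comes from the table above. The compression strength `MU` and the clipping range `MAX_DELTA` are assumed values chosen for illustration, and the function names are hypothetical, so Choonsik's actual tokenizer may differ.

```python
import math

N_BINS = 21       # from the table above: 21 tokens per camera axis
MU = 10.0         # assumed compression strength (not stated in this card)
MAX_DELTA = 10.0  # assumed maximum camera change per step (not stated in this card)


def mu_law_encode(delta: float) -> int:
    """Map a camera delta to a bin index in [0, N_BINS - 1]."""
    # Clip to the supported range and normalize to [-1, 1].
    x = max(-MAX_DELTA, min(MAX_DELTA, delta)) / MAX_DELTA
    # Mu-law companding: fine resolution near zero, coarse at the extremes.
    y = math.copysign(math.log1p(MU * abs(x)) / math.log1p(MU), x)
    # Uniformly quantize the companded value.
    return round((y + 1.0) / 2.0 * (N_BINS - 1))


def mu_law_decode(bin_index: int) -> float:
    """Map a bin index back to the camera delta at the bin center."""
    y = 2.0 * bin_index / (N_BINS - 1) - 1.0
    x = math.copysign(math.expm1(abs(y) * math.log1p(MU)) / MU, y)
    return x * MAX_DELTA


print(mu_law_encode(0.0))  # 10, the center bin (no camera movement)
print([round(mu_law_decode(i), 2) for i in range(N_BINS)])
```

Because the companding curve is steep near zero, most bins land on small camera movements, which is where fine aiming precision matters.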

## Training

Three-stage ActVLP pipeline (following JARVIS-VLA):

1. **Stage 1 — Language post-training**: Minecraft world knowledge (text-only SFT)
2. **Stage 2 — Vision-language alignment**: Image captioning and VQA on gameplay frames
3. **Stage 3 — Imitation learning**: Action prediction on 3.78M trajectory examples

Training hardware: L40S (48 GB VRAM). Inference: RTX 5080 with 4-bit NF4 quantization.
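
The card does not say how the 4-bit model is loaded, and `ChoonsikInferenceRunner` may handle quantization internally. As a rough sketch, 4-bit NF4 loading through `transformers` and `bitsandbytes` could look like the following. It assumes the checkpoint loads through `AutoModelForImageTextToText` and that `accelerate` is installed for `device_map="auto"`. The `bfloat16` compute dtype is also an assumption, and the names `NF4_SETTINGS` and `load_nf4` are hypothetical.

```python
# Quantization settings passed to transformers.BitsAndBytesConfig.
NF4_SETTINGS = {
    "load_in_4bit": True,          # store weights in 4 bits
    "bnb_4bit_quant_type": "nf4",  # NormalFloat4 quantization
}


def load_nf4(model_id: str = "Infinity08/Choonsik-Qwen3.5-9B"):
    """Load the processor and a 4-bit NF4 model.

    Requires torch, transformers, bitsandbytes, accelerate, and a CUDA GPU.
    Heavy imports are kept inside the function so that this module can be
    imported without them.
    """
    import torch
    from transformers import (
        AutoModelForImageTextToText,
        AutoProcessor,
        BitsAndBytesConfig,
    )

    quant_config = BitsAndBytesConfig(
        bnb_4bit_compute_dtype=torch.bfloat16,  # assumed compute dtype
        **NF4_SETTINGS,
    )
    processor = AutoProcessor.from_pretrained(model_id)
    model = AutoModelForImageTextToText.from_pretrained(
        model_id,
        quantization_config=quant_config,
        device_map="auto",
    )
    return processor, model
```

Calling `load_nf4()` downloads the checkpoint, so it is left uncalled here.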

## Citation

If you use Choonsik or the underlying JARVIS-VLA methodology, please cite:

```bibtex
@article{li2025jarvisvla,
  title   = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models
             to Play Visual Games with Keyboards and Mouse},
  author  = {Muyao Li and Zihao Wang and Kaichen He and others},
  journal = {arXiv preprint arXiv:2503.16365},
  year    = {2025}
}
```