---
license: mit
base_model: Qwen/Qwen3.5-9B
tags:
- choonsik
- VLA
- Minecraft
- vision-language-action
- qwen3.5
- image-text-to-text
datasets:
- CraftJarvis/minecraft-vla-sft
library_name: transformers
language:
- en
pipeline_tag: image-text-to-text
---
# Choonsik — Minecraft Vision-Language-Action Model
Choonsik is a **Vision-Language-Action (VLA)** model for Minecraft, built on
[Qwen/Qwen3.5-9B](<https://huggingface.co/Qwen/Qwen3.5-9B>) and trained with the
three-stage **ActVLP** pipeline from
[JARVIS-VLA](https://arxiv.org/abs/2503.16365).

Given a Minecraft observation frame and a natural-language task instruction,
Choonsik outputs keyboard + mouse action tokens that can be executed directly
in the game — covering 1,000+ atomic tasks (crafting, mining, smelting, combat,
navigation, etc.).
| Property | Value |
|---|---|
| **Base model** | [Qwen/Qwen3.5-9B](<https://huggingface.co/Qwen/Qwen3.5-9B>) |
| **Training data** | [CraftJarvis/minecraft-vla-sft](<https://huggingface.co/datasets/CraftJarvis/minecraft-vla-sft>) (3.78M examples) |
| **Training stages** | Language → Vision-Language → Imitation Learning |
| **License** | MIT |
## Usage
```python
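from PIL import Image

# ChoonsikInferenceRunner comes from the project's own `choonsik` package.
from choonsik.inference import ChoonsikInferenceRunner

# Load the model by its Hub repository ID.
runner = ChoonsikInferenceRunner("Infinity08/Choonsik-Qwen3.5-9B")

# Predict one action from a single observation frame and a task instruction.
frame = Image.open("minecraft_frame.png")
action = runner.predict(frame, task="craft a wooden pickaxe")
# action = {"forward": 0, "attack": 1, ..., "camera": [0.0, 0.3]}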
```
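
Because the model predicts one action per frame, playing a task means calling it in a loop. The sketch below shows one way such a loop might look. The `env` object and its `reset()` / `step()` methods are hypothetical stand-ins for whatever Minecraft simulator you use; they are not part of Choonsik, and only the per-frame prediction call comes from the snippet above.

```python
from typing import Any, Callable, Dict

Action = Dict[str, Any]


def rollout(
    predict: Callable[[Any, str], Action],
    env: Any,
    task: str,
    max_steps: int = 200,
) -> int:
    """Run one episode and return the number of steps taken.

    `env` is a hypothetical object exposing `reset() -> frame` and
    `step(action) -> (frame, done)`; adapt it to your simulator.
    """
    frame = env.reset()
    for step in range(1, max_steps + 1):
        # One forward pass of the policy per observation frame.
        action = predict(frame, task)
        frame, done = env.step(action)
        if done:
            return step
    # The step budget ran out before the environment reported completion.
    return max_steps
```

With the runner from the snippet above, `predict` could be `lambda frame, task: runner.predict(frame, task=task)`.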
## Action Space
Choonsik emits actions as discrete tokens: keyboard inputs have dedicated tokens, and camera movement on each axis is quantized into **mu-law discretized** bins:
| Token type | Count | Description |
|---|---|---|
| Keyboard | 29 | `forward`, `attack`, `use`, `jump`, hotbar 1–9, … |
| Mouse X | 21 | Horizontal camera rotation (mu-law bins) |
| Mouse Y | 21 | Vertical camera rotation (mu-law bins) |
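
To illustrate what mu-law binning does to camera movement, here is a minimal sketch of the standard mu-law companding formula applied to 21 bins per axis. The bin count comes from the table above; the values of `MU` and `MAX_DELTA` are assumptions chosen for illustration and are not stated in this model card, so the actual tokenizer may use different constants or a different bin layout.

```python
import math

NUM_BINS = 21     # from the table above: 21 bins per camera axis
MU = 20.0         # hypothetical compression strength (not stated in the card)
MAX_DELTA = 10.0  # hypothetical largest camera movement per step


def mu_law_encode(delta: float, mu: float = MU,
                  max_delta: float = MAX_DELTA,
                  num_bins: int = NUM_BINS) -> int:
    """Map a continuous camera delta to a bin index in [0, num_bins - 1]."""
    # Clamp and normalize to [-1, 1].
    x = max(-max_delta, min(max_delta, delta)) / max_delta
    # Mu-law compression: fine resolution near zero, coarse at the extremes.
    y = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Uniformly quantize the compressed value.
    return round((y + 1.0) / 2.0 * (num_bins - 1))


def mu_law_decode(index: int, mu: float = MU,
                  max_delta: float = MAX_DELTA,
                  num_bins: int = NUM_BINS) -> float:
    """Map a bin index back to the camera delta at the bin center."""
    y = 2.0 * index / (num_bins - 1) - 1.0
    # Inverse of the compression step: ((1 + mu) ** |y| - 1) / mu.
    x = math.copysign(math.expm1(abs(y) * math.log1p(mu)) / mu, y)
    return x * max_delta
```

Under these assumptions, a zero delta lands in the middle bin (index 10), and adjacent bins are much closer together near zero than near the maximum. This gives small aiming adjustments finer resolution than large sweeps.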
## Training
Three-stage ActVLP pipeline (following JARVIS-VLA):
1. **Stage 1 — Language post-training**: Minecraft world knowledge (text-only SFT)
2. **Stage 2 — Vision-language alignment**: Image captioning and VQA on gameplay frames
3. **Stage 3 — Imitation learning**: Action prediction on 3.78M trajectory examples

Training hardware: L40S (48 GB VRAM). Inference: RTX 5080 with 4-bit NF4 quantization.
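
For readers who load the weights through Transformers rather than the project's runner, the 4-bit NF4 setting mentioned above corresponds to a `BitsAndBytesConfig`. The sketch below only builds that configuration. The compute dtype is an assumption (the card does not state one), and the helper names are hypothetical.

```python
def nf4_quantization_kwargs(compute_dtype: str = "bfloat16") -> dict:
    """Keyword arguments for transformers.BitsAndBytesConfig selecting 4-bit NF4."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        # Assumption: the card does not say which compute dtype was used.
        "bnb_4bit_compute_dtype": compute_dtype,
    }


def build_quantization_config():
    """Build the config object; requires transformers and bitsandbytes."""
    # Imported lazily so the helper above works without transformers installed.
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**nf4_quantization_kwargs())
```

The resulting object would typically be passed as `quantization_config=` to a `from_pretrained` call. Which model class is appropriate for this checkpoint is not specified here.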
## Citation
If you use Choonsik or the underlying JARVIS-VLA methodology, please cite:
```bibtex
@article{li2025jarvisvla,
title = {JARVIS-VLA: Post-Training Large-Scale Vision Language Models
to Play Visual Games with Keyboards and Mouse},
author = {Muyao Li and Zihao Wang and Kaichen He and others},
journal = {arXiv preprint arXiv:2503.16365},
year = {2025}
}
```