---
license: apache-2.0
base_model: Qwen/Qwen3-4B
tags:
  - unsloth
  - text-generation
  - education
  - game-generation
language:
  - id
pipeline_tag: text-generation
---

# SR4 Deployment Guide for RunPod

This model (`ub-sr04-qwen3.5-4b-cpt2-sft-game`) is an LLM that generates game content for the Sekolah Rakyat platform. It runs as an OpenAI-compatible inference server using **Unsloth**, not vLLM (the hybrid Qwen3 architecture is not compatible with vLLM).

> The `server.py` file is included in this repo and is downloaded together with the model.

---

## Hardware Requirements

| GPU | VRAM | Mode |
|---|---|---|
| A100 40GB | 40 GB | bfloat16 (~9 GB used) |
| L4 24GB | 24 GB | bfloat16 or 4-bit (~3 GB used) |
| T4 16GB | 16 GB | 4-bit required |

**Storage:** at least 20 GB · **System RAM:** at least 16 GB
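
The VRAM figures in the table are roughly what a back-of-the-envelope estimate of the weight size suggests. The sketch below assumes about 4 billion parameters (inferred from the "4b" in the model name, not measured) and counts weights only. KV cache, activations, and CUDA overhead are not included, which would explain why the table values are somewhat higher.

```python
def weight_memory_gb(n_params: float, bits_per_param: int) -> float:
    """Approximate memory needed for the weights alone, in GB (10^9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9

# Assumption: ~4e9 parameters, taken from the "4b" in the model name.
N_PARAMS = 4e9

bf16_gb = weight_memory_gb(N_PARAMS, 16)  # 2 bytes per parameter
int4_gb = weight_memory_gb(N_PARAMS, 4)   # 0.5 bytes per parameter

print(f"bfloat16 weights: ~{bf16_gb:.1f} GB")
print(f"4-bit weights:    ~{int4_gb:.1f} GB")
```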

---

## Setup on RunPod

### 1. Create a Pod

On [runpod.io](https://runpod.io) → Deploy:
- Template: **RunPod PyTorch**
- Container Disk: 20 GB · Volume: 20 GB (mounted at `/workspace`)
- Expose the port you want to use (default: `8081`)

### 2. Download the Model and Server Script

```bash
huggingface-cli login   # requires an HF token with access to this repo

huggingface-cli download aitf-ub-2026/ub-sr04-qwen3.5-4b-cpt2-sft-game \
    --local-dir /workspace/models/sr4
```

`server.py` is downloaded to `/workspace/models/sr4/server.py`.
The model is stored on the persistent volume, so it does not need to be downloaded again after a pod restart.

### 3. Install Dependencies

```bash
pip install --upgrade pip
pip install unsloth unsloth_zoo accelerate bitsandbytes
pip install fastapi "uvicorn[standard]"
pip install git+https://github.com/huggingface/transformers.git
```

> `torch` is already included in the RunPod PyTorch template, so there is no need to reinstall it.
> `bitsandbytes` is required for `--4bit` mode.
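
Before starting the server, it can help to confirm that every dependency is visible to the Python interpreter you will use. The following is a minimal sketch using only the standard library. The list of import names is an assumption based on the `pip install` commands above; it checks that the packages can be located, not that they work with your GPU.

```python
import importlib.util

# Top-level import names for the packages installed above (assumed list).
REQUIRED = ["unsloth", "unsloth_zoo", "accelerate", "bitsandbytes",
            "fastapi", "uvicorn", "transformers", "torch"]

def missing_packages(names):
    """Return the names that cannot be found on the current Python path."""
    return [n for n in names if importlib.util.find_spec(n) is None]

if __name__ == "__main__":
    missing = missing_packages(REQUIRED)
    print("missing:", ", ".join(missing) if missing else "none")
```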

### 4. Start the Server

```bash
# bfloat16 (default)
python /workspace/models/sr4/server.py --model /workspace/models/sr4 --port 8081

# 4-bit, for GPUs with limited VRAM (L4, T4)
python /workspace/models/sr4/server.py --model /workspace/models/sr4 --port 8081 --4bit
```

The server is ready when these log lines appear:
```
[SR4] Model loaded — GPU X.X / XX.X GB
[SR4] Serving on http://0.0.0.0:8081
```
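
Instead of watching the log, a script can poll the `/health` endpoint until the server answers. This is a minimal sketch using only the standard library. It assumes the health response is a JSON object with `"status": "ok"`, as listed in the API section; the function name, timeout, and polling interval are illustrative choices, not part of `server.py`.

```python
import json
import time
import urllib.error
import urllib.request

def wait_until_healthy(base_url: str, timeout_s: float = 300.0,
                       interval_s: float = 5.0) -> bool:
    """Poll GET {base_url}/health until it reports status "ok" or time runs out."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(f"{base_url}/health", timeout=5) as resp:
                body = json.load(resp)
                if isinstance(body, dict) and body.get("status") == "ok":
                    return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not reachable yet, or the body was not valid JSON
        time.sleep(interval_s)
    return False
```

For example, `wait_until_healthy("http://localhost:8081")` returns `True` once the model has finished loading, or `False` if the server is still unreachable after five minutes.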

### Auto-start after Restart

RunPod does not restart services automatically after a pod restart. Add the server to crontab:

```bash
crontab -e
# add this line (append --4bit if you run in 4-bit mode):
@reboot sleep 30 && python /workspace/models/sr4/server.py --model /workspace/models/sr4 --port 8081
```

---

## API

### Endpoints

```
GET  /health               → {"status": "ok", "model": "..."}
GET  /v1/models            → list of available models
POST /v1/chat/completions  → generate (OpenAI-compatible)
```

### Example Request

```bash
curl -X POST http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sr4-game",
    "messages": [
      {"role": "system", "content": "...system prompt..."},
      {"role": "user",   "content": "{\"difficulty\": 1, \"atps\": [...], \"bacaan\": \"...\"}"}
    ],
    "max_tokens": 3500,
    "temperature": 0.0
  }'
```

The response field `choices[0].message.content` contains the game JSON (markdown fences and thinking blocks are already stripped).
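
The same request can be sent from Python. This is a minimal sketch using only the standard library. It takes the field names `difficulty`, `atps`, and `bacaan` from the curl example, but the full input and output schemas are not documented here, so treat them as placeholders. The function names are illustrative and are not part of `server.py`.

```python
import json
import urllib.request

def build_payload(system_prompt: str, game_input: dict, max_tokens: int = 3500) -> dict:
    """Build the request body; the user message is the input serialized as JSON."""
    return {
        "model": "sr4-game",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": json.dumps(game_input, ensure_ascii=False)},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }

def extract_game(response: dict) -> dict:
    """Parse the game JSON carried as a string in choices[0].message.content."""
    return json.loads(response["choices"][0]["message"]["content"])

def generate_game(base_url: str, system_prompt: str, game_input: dict) -> dict:
    """POST to /v1/chat/completions and return the parsed game object."""
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(system_prompt, game_input)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(req, timeout=300) as resp:
        return extract_game(json.load(resp))
```

Calling `generate_game("http://localhost:8081", system_prompt, game_input)` returns the game as a Python dict. `extract_game` raises `json.JSONDecodeError` if the model produced invalid JSON, so callers may want to catch that and retry.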

---

## Notes

- **Chat template:** ChatML (`<|im_start|>` / `<|im_end|>`)
- **Max sequence length:** 4096 tokens
- **Thinking is disabled:** the output is JSON directly, with no `<think>` block
- **The "MISSING: model.visual.\*" log** from Unsloth is normal: this is a pure text model, and no vision encoder is active during inference
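
To make the ChatML note concrete, the sketch below renders a message list into the `<|im_start|>` / `<|im_end|>` layout. It is a simplified illustration of the general ChatML format, not the model's actual chat template, which ships with the tokenizer and may add further details. In normal use the server applies the real template, so clients never need to build this string themselves.

```python
def render_chatml(messages, add_generation_prompt=True):
    """Render [{"role": ..., "content": ...}] into a simplified ChatML string."""
    parts = [
        f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages
    ]
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|im_start|>assistant\n")
    return "".join(parts)

prompt = render_chatml([
    {"role": "system", "content": "You generate game JSON."},
    {"role": "user", "content": '{"difficulty": 1}'},
])
print(prompt)
```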