SR4 Deployment Guide for RunPod

This model (ub-sr04-qwen3.5-4b-cpt2-sft-game) is an LLM that generates game content for the Sekolah Rakyat platform. It runs as an OpenAI-compatible inference server built on Unsloth rather than vLLM, because the hybrid Qwen3 architecture is not compatible with vLLM.

The server.py file is included in this repo and is downloaded together with the model.

Hardware Requirements

GPU	VRAM	Mode
A100 40GB	40 GB	bfloat16 (~9 GB used)
L4 24GB	24 GB	bfloat16 or 4-bit (~3 GB used in 4-bit)
T4 16GB	16 GB	4-bit required

Storage: at least 20 GB · CPU RAM: at least 16 GB
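The table above can be turned into a simple launch-time decision. The sketch below is not part of server.py: pick_server_flags is a hypothetical helper, and the 24 GB cutoff is an assumption read off the table (the 16 GB T4 is listed as 4-bit only, while the 24 GB L4 can run either mode). Only the --4bit flag itself comes from this guide.

```python
from typing import List

# Assumed cutoff taken from the table: below 24 GB (e.g. T4 16GB), use 4-bit.
BF16_MIN_VRAM_GB = 24.0


def pick_server_flags(vram_gb: float, prefer_4bit: bool = False) -> List[str]:
    """Return extra server.py flags for a GPU with `vram_gb` of VRAM.

    Hypothetical helper: only the --4bit flag comes from this guide.
    """
    if vram_gb <= 0:
        raise ValueError("vram_gb must be positive")
    if vram_gb < BF16_MIN_VRAM_GB or prefer_4bit:
        return ["--4bit"]
    return []  # no flag needed: bfloat16 is the server default


if __name__ == "__main__":
    for name, vram in [("A100 40GB", 40), ("L4 24GB", 24), ("T4 16GB", 16)]:
        print(name, pick_server_flags(vram) or ["(bfloat16 default)"])
```

Passing prefer_4bit=True on a larger GPU trades some output quality for lower memory use, which may matter if the pod is shared with other workloads.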

Setup on RunPod

1. Create a Pod

On runpod.io → Deploy:

Template: RunPod PyTorch
Container Disk: 20 GB · Volume: 20 GB (mounted at /workspace)
Expose the port you plan to use (default: 8081)

2. Download the Model and Server Script

huggingface-cli login   # requires an HF token with access to this repo

huggingface-cli download aitf-ub-2026/ub-sr04-qwen3.5-4b-cpt2-sft-game \
    --local-dir /workspace/models/sr4

server.py is downloaded to /workspace/models/sr4/server.py. The model is stored on the persistent volume, so it does not need to be downloaded again after a pod restart.
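If the download needs to be scripted (for example from a setup script), the same step can probably be done from Python with huggingface_hub.snapshot_download. This is a sketch, not something the repo ships: download_model is a hypothetical wrapper, and it assumes the huggingface_hub package is installed and that a token is available from a prior login or the HF_TOKEN environment variable.

```python
REPO_ID = "aitf-ub-2026/ub-sr04-qwen3.5-4b-cpt2-sft-game"
LOCAL_DIR = "/workspace/models/sr4"


def download_model(repo_id: str = REPO_ID, local_dir: str = LOCAL_DIR, token=None) -> str:
    """Download the whole repo (weights and server.py) and return the local path.

    Hypothetical wrapper around huggingface_hub.snapshot_download.
    """
    # Imported lazily so this module loads even before huggingface_hub is installed.
    from huggingface_hub import snapshot_download

    # token=None falls back to the token saved by `huggingface-cli login`
    # or to the HF_TOKEN environment variable.
    return snapshot_download(repo_id=repo_id, local_dir=local_dir, token=token)


# Usage (performs a network download, so it is not run automatically):
#   path = download_model()
#   print("Model files in", path)
```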

3. Install Dependencies

pip install --upgrade pip
pip install unsloth unsloth_zoo accelerate bitsandbytes
pip install fastapi "uvicorn[standard]"
pip install git+https://github.com/huggingface/transformers.git

torch is already included in the RunPod PyTorch template, so there is no need to reinstall it. bitsandbytes is required for --4bit mode.
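Before starting the server, it can be useful to confirm that the packages installed above are actually visible to the Python interpreter. The sketch below is a hypothetical pre-flight check, not part of server.py; the REQUIRED list is an assumption that each pip package imports under the name shown.

```python
import importlib.util
from typing import Iterable, List

# Assumed import names for the packages installed above
# ("uvicorn[standard]" imports as uvicorn).
REQUIRED = ["torch", "unsloth", "unsloth_zoo", "accelerate",
            "bitsandbytes", "fastapi", "uvicorn", "transformers"]


def missing_packages(names: Iterable[str] = REQUIRED) -> List[str]:
    """Return the top-level module names that cannot be found, without importing them."""
    return [name for name in names if importlib.util.find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("All dependencies found" if not missing else f"Missing: {', '.join(missing)}")
```

Using find_spec avoids importing heavy packages such as torch just to check that they exist.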

4. Run the Server

# bfloat16 (default)
python /workspace/models/sr4/server.py --model /workspace/models/sr4 --port 8081

# 4-bit, for GPUs with limited VRAM (L4, T4)
python /workspace/models/sr4/server.py --model /workspace/models/sr4 --port 8081 --4bit

The server is ready when these log lines appear:

[SR4] Model loaded — GPU X.X / XX.X GB
[SR4] Serving on http://0.0.0.0:8081
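Instead of watching the log, a script can wait for readiness by polling the /health endpoint described in the API section. The following is a minimal sketch: wait_for_health is a hypothetical helper, and the timeout and polling interval are arbitrary assumptions that should be adjusted to the actual model load time.

```python
import json
import time
import urllib.error
import urllib.request


def wait_for_health(url: str = "http://localhost:8081/health",
                    timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it returns JSON with status "ok", or until `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                body = json.loads(resp.read().decode("utf-8"))
            if isinstance(body, dict) and body.get("status") == "ok":
                return True
        except (urllib.error.URLError, OSError, ValueError):
            pass  # server not up yet, or the response was not valid JSON
        time.sleep(interval_s)
    return False


# Usage:
#   if not wait_for_health():
#       raise SystemExit("SR4 server did not become healthy in time")
```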

Auto-start after Restart

RunPod does not restart services automatically. Add an entry to crontab:

crontab -e
# add:
@reboot sleep 30 && python /workspace/models/sr4/server.py --model /workspace/models/sr4 --port 8081

API

Endpoints

GET  /health               → {"status": "ok", "model": "..."}
GET  /v1/models            → list of available models
POST /v1/chat/completions  → generate (OpenAI-compatible)

Example Request

curl -X POST http://localhost:8081/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "sr4-game",
    "messages": [
      {"role": "system", "content": "...system prompt..."},
      {"role": "user",   "content": "{\"difficulty\": 1, \"atps\": [...], \"bacaan\": \"...\"}"}
    ],
    "max_tokens": 3500,
    "temperature": 0.0
  }'

The response field choices[0].message.content contains the game JSON, with any markdown fence and thinking block already stripped.
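The same request can be made from Python using only the standard library. This is a sketch under stated assumptions: build_payload and generate_game are hypothetical helpers, the base URL assumes the server runs locally on port 8081, and the field names difficulty, atps, and bacaan are taken from the curl example (their exact schema is not documented here).

```python
import json
import urllib.request
from typing import Any, Dict

BASE_URL = "http://localhost:8081"  # assumes the port used in the launch command


def build_payload(system_prompt: str, game_input: Dict[str, Any],
                  max_tokens: int = 3500) -> Dict[str, Any]:
    """Build the request body; the user message is the game input serialized as JSON."""
    return {
        "model": "sr4-game",
        "messages": [
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": json.dumps(game_input, ensure_ascii=False)},
        ],
        "max_tokens": max_tokens,
        "temperature": 0.0,
    }


def generate_game(system_prompt: str, game_input: Dict[str, Any],
                  base_url: str = BASE_URL, timeout_s: float = 300.0) -> Any:
    """POST to /v1/chat/completions and parse the returned game JSON."""
    data = json.dumps(build_payload(system_prompt, game_input)).encode("utf-8")
    req = urllib.request.Request(
        f"{base_url}/v1/chat/completions", data=data,
        headers={"Content-Type": "application/json"}, method="POST",
    )
    with urllib.request.urlopen(req, timeout=timeout_s) as resp:
        body = json.loads(resp.read().decode("utf-8"))
    # content is a JSON string; json.loads raises ValueError if it is not valid JSON
    return json.loads(body["choices"][0]["message"]["content"])


# Usage (placeholder values, not real inputs):
#   game = generate_game("...system prompt...",
#                        {"difficulty": 1, "atps": ["..."], "bacaan": "..."})
```

Parsing the content with json.loads on the client side gives an early, explicit failure if the model ever returns malformed JSON.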

Notes

Chat template: ChatML (<|im_start|> / <|im_end|>)
Max seq length: 4096 tokens
Thinking is disabled, so the output is JSON directly with no <think> block
The "MISSING: model.visual.*" log from Unsloth is normal: this is a pure text model, and no vision encoder is active during inference
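To make the ChatML note concrete, the sketch below shows roughly how a message list is laid out with the <|im_start|> and <|im_end|> markers. It is a simplified illustration only: render_chatml is a hypothetical function, and the model's real chat template (applied by the tokenizer inside the server) may add further tokens, so clients should keep sending plain messages to the API rather than pre-formatting them.

```python
from typing import Dict, List


def render_chatml(messages: List[Dict[str, str]], add_generation_prompt: bool = True) -> str:
    """Render messages in plain ChatML (simplified; not the model's full chat template)."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    if add_generation_prompt:
        parts.append("<|im_start|>assistant\n")  # the model continues from here
    return "".join(parts)


if __name__ == "__main__":
    print(render_chatml([
        {"role": "system", "content": "...system prompt..."},
        {"role": "user", "content": "{\"difficulty\": 1}"},
    ]))
```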

Model size: 4B params (Safetensors)
Tensor type: BF16
Model tree: Qwen/Qwen3-4B-Base (base model) → Qwen/Qwen3-4B (finetuned) → this model (finetuned)