Instructions to use Phonsiri/Thai-Legal-Gemma-4B-CPT with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use Phonsiri/Thai-Legal-Gemma-4B-CPT with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("text-generation", model="Phonsiri/Thai-Legal-Gemma-4B-CPT")

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("Phonsiri/Thai-Legal-Gemma-4B-CPT")
model = AutoModelForMultimodalLM.from_pretrained("Phonsiri/Thai-Legal-Gemma-4B-CPT", device_map="auto")


vLLM

How to use Phonsiri/Thai-Legal-Gemma-4B-CPT with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "Phonsiri/Thai-Legal-Gemma-4B-CPT"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Phonsiri/Thai-Legal-Gemma-4B-CPT",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'
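
The same request can be made from Python. The following is a minimal sketch using only the standard library, assuming the vLLM server from the command above is listening on `localhost:8000`. The helper names `build_completion_request` and `complete` are illustrative and not part of vLLM.

```python
import json
import urllib.request

# Values copied from the curl example above; adjust them if your server differs.
BASE_URL = "http://localhost:8000"
MODEL_ID = "Phonsiri/Thai-Legal-Gemma-4B-CPT"


def build_completion_request(prompt, max_tokens=512, temperature=0.5):
    """Build (but do not send) a POST request for the /v1/completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }
    return urllib.request.Request(
        url=f"{BASE_URL}/v1/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, **kwargs):
    """Send the request and return the generated text of the first choice."""
    request = build_completion_request(prompt, **kwargs)
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

With the server running, `print(complete("Once upon a time,"))` should print the continuation. Because this checkpoint is a base model without instruction tuning, the plain completions endpoint is used here rather than a chat endpoint.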

Use Docker

docker model run hf.co/Phonsiri/Thai-Legal-Gemma-4B-CPT

SGLang

How to use Phonsiri/Thai-Legal-Gemma-4B-CPT with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "Phonsiri/Thai-Legal-Gemma-4B-CPT" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Phonsiri/Thai-Legal-Gemma-4B-CPT",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "Phonsiri/Thai-Legal-Gemma-4B-CPT" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "Phonsiri/Thai-Legal-Gemma-4B-CPT",
		"prompt": "Once upon a time,",
		"max_tokens": 512,
		"temperature": 0.5
	}'

Docker Model Runner
How to use Phonsiri/Thai-Legal-Gemma-4B-CPT with Docker Model Runner:
```
docker model run hf.co/Phonsiri/Thai-Legal-Gemma-4B-CPT
```

Thai Legal Gemma 4B — CPT Checkpoint

Thai-Legal-Gemma-4B-CPT is Gemma-4-E4B after Continued Pre-Training (CPT) on a large corpus of Thai legal text, intended to give the model a foundational understanding of the language and structure of Thai law.

Note: This is an in-progress CPT checkpoint, not a finished, ready-to-use model. It has not yet undergone instruction fine-tuning (SFT) or alignment (RLHF), so it is not suitable for immediate use in systems that require high accuracy.

Project Overview

Item	Details
Base Model	`google/gemma-4-E4B`
Training Method	Full Continued Pre-Training (CPT)
Purpose	Build a Thai-language legal foundation model
Context Length	8,192 tokens
Precision	bfloat16
Hardware	NVIDIA H200 (141 GB VRAM)
Optimizer	AdamW Fused
Status	Training (In Progress)

Training Data

The model is trained on a mix of several data types, with a mix ratio designed to emphasize Thai law:

Data Type	Share	Source
Thai law	70%	Royal Gazette (soc-ratchakitcha) 2011–2025, ThaiLaw, WangchanX-Legal-ThaiCCL-RAG
General Thai	15%	Thai Wikipedia
English	10%	C4 English (subset)
Code	3%	The Stack (Python/SQL/Markdown)
English-language law	2%	pile-of-law
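
One common way to apply a mix ratio like this is to pick the source of each training document by weighted sampling. The sketch below illustrates that idea only; it is not the project's actual data pipeline. The source labels are hypothetical, and it assumes the ratio applies per sampled document (the card does not say whether the shares are measured in documents or tokens).

```python
import random
from collections import Counter

# Mix ratio from the table above, used as sampling weights.
# The keys are hypothetical labels chosen for this illustration.
MIX_RATIO = {
    "thai_law": 0.70,
    "thai_general": 0.15,
    "english": 0.10,
    "code": 0.03,
    "english_law": 0.02,
}


def sample_sources(num_documents, mix=MIX_RATIO, seed=0):
    """Draw a source label for each document slot, proportional to the mix."""
    rng = random.Random(seed)
    names = list(mix)
    weights = [mix[name] for name in names]
    return rng.choices(names, weights=weights, k=num_documents)


draws = sample_sources(10_000)
counts = Counter(draws)
for name in MIX_RATIO:
    # Observed share of each source; it should be close to the target ratio.
    print(f"{name:13s} {counts[name] / len(draws):.3f}")
```

Over enough draws, the observed shares approach the target ratio, so about 70% of the sampled documents come from Thai legal sources.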

Thai Legal Data Sources

Royal Gazette (2011–2025): ~145,912 documents from open-law-data-thailand/soc-ratchakitcha, covering acts, ministerial regulations, rules, and general announcements
ThaiLaw: 42,755 articles from the Civil and Commercial Code, the Criminal Code, and the procedural codes
WangchanX-Legal-ThaiCCL-RAG: 8,211 legal question-answer pairs (used as corpus text, not for SFT)

Training Details

Training Objective : Next-Token Prediction (CLM)
Sequence Length : 8,192 tokens (Packed, no padding)
Batch Size : 1 (effective 256 via gradient accumulation)
Gradient Accum. : 256 steps
Optimizer : AdamW (adamw_torch_fused)
Learning Rate : 2e-5 (with warmup)
Precision : bf16
Gradient Checkpointing: True
KV Cache : Disabled (use_cache=False)
Hub Strategy : all_checkpoints (auto-save every 5 steps)
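
"Packed, no padding" usually means that tokenized documents are concatenated and cut into fixed-length blocks, so every position in a batch carries a real token. The following is a minimal sketch of that technique, not the project's actual preprocessing code. It assumes an end-of-sequence token separates documents and that the final partial block is dropped. `pack_sequences` and the `eos_token_id` default are hypothetical.

```python
from typing import Iterable, Iterator, List


def pack_sequences(
    tokenized_docs: Iterable[List[int]],
    block_size: int = 8192,
    eos_token_id: int = 1,  # hypothetical id; use tokenizer.eos_token_id in practice
) -> Iterator[List[int]]:
    """Concatenate documents (EOS-separated) and yield full blocks only."""
    buffer: List[int] = []
    for doc in tokenized_docs:
        buffer.extend(doc)
        buffer.append(eos_token_id)
        # Emit as many complete blocks as the buffer currently holds.
        while len(buffer) >= block_size:
            yield buffer[:block_size]
            buffer = buffer[block_size:]
    # Leftover tokens shorter than block_size are dropped, so no padding is needed.


# Toy example with a tiny block size so the result is easy to inspect.
docs = [[10, 11, 12], [20, 21], [30, 31, 32, 33]]
blocks = list(pack_sequences(docs, block_size=4, eos_token_id=0))
print(blocks)  # [[10, 11, 12, 0], [20, 21, 0, 30], [31, 32, 33, 0]]
```

With the settings listed above, and assuming a single device, each optimizer step covers 1 × 256 × 8,192 = 2,097,152 tokens.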

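Usage

Standard loading (for machines with sufficient VRAM, ≥ 18 GB)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "Phonsiri/Thai-Legal-Gemma-4B-CPT"

# Load the tokenizer and the bf16 weights; device_map="auto" places them on the available GPU(s).
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# This is a base (CPT) model, so give it text to continue rather than an instruction.
# Prompt: "Under the Civil and Commercial Code, a loan of money exceeding two thousand baht ..."
prompt = "ตามประมวลกฎหมายแพ่งและพาณิชย์ การกู้ยืมเงินเกินกว่าสองพันบาทขึ้นไป"
# Move the inputs to wherever device_map="auto" placed the model.
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

outputs = model.generate(
    **inputs,
    max_new_tokens=300,
    temperature=0.3,
    top_p=0.9,
    repetition_penalty=1.1,
    do_sample=True,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))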

4-bit loading (for Colab / GPUs with ≤ 16 GB of memory)

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "Phonsiri/Thai-Legal-Gemma-4B-CPT"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
    ),
    device_map="auto",
)
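
The two memory thresholds above can be sanity-checked with a rough weight-size estimate. The sketch below assumes the 8B parameter count listed on the model page and counts weights only; activations, the KV cache, and framework overhead come on top. The 4-bit figure is a lower bound, because it assumes every parameter is quantized, which may not hold in practice (for example, embeddings often stay in higher precision).

```python
def weight_memory_gib(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GiB (1 GiB = 2**30 bytes)."""
    return num_params * bits_per_param / 8 / 2**30


NUM_PARAMS = 8e9  # "8B params" as listed on the model page (rounded)

bf16_gib = weight_memory_gib(NUM_PARAMS, 16)
int4_gib = weight_memory_gib(NUM_PARAMS, 4)

print(f"bf16 weights : ~{bf16_gib:.1f} GiB")
print(f"4-bit weights: ~{int4_gib:.1f} GiB (lower bound)")
```

The bf16 weights alone come to roughly 15 GiB, which is consistent with the ≥ 18 GB recommendation once runtime overhead is included.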

Training Progress

The model is still training. Checkpoints are pushed to the Hub automatically every 5 steps.

Metric	Latest value
Training Loss	~1.55–1.59 (early stage)
Grad Norm	~200–800 (normalizing)
Learning Rate	Warming up

A loss of around 1.5 early in CPT is normal and a good sign: the model is adapting to the structure of Thai legal language.
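
To make the loss values easier to interpret, they can be converted to perplexity. This small sketch assumes the reported training loss is the mean per-token cross-entropy in nats, which is the usual convention for causal language modeling but is not stated explicitly in this card.

```python
import math


def perplexity(mean_loss: float) -> float:
    """Perplexity is the exponential of the mean per-token cross-entropy (in nats)."""
    return math.exp(mean_loss)


for loss in (1.55, 1.59):
    print(f"loss {loss:.2f} -> perplexity {perplexity(loss):.2f}")
```

Under that assumption, the reported range corresponds to a perplexity of roughly 4.7 to 4.9 on the training mix.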

Roadmap

[] Phase 1: Data Pipeline (Royal Gazette OCR + Wikipedia + legal datasets)
[] Phase 2: CPT Training (full pre-training on H200; in progress)
[ ] Phase 3: SFT (instruction-following legal Q&A)
[ ] Phase 4: GRPO/RLHF (legal reasoning, IRAC framework)
[ ] Phase 5: RAG Integration (real-time connection to a legal database)

Limitations and Warnings

This model does not provide official legal advice. Do not use it as a substitute for consulting a qualified lawyer.
At the CPT stage, the model may generate text containing legal errors (hallucination).
It has not undergone safety alignment and is not suitable for immediate production use.

License

This model uses a base model from Google Gemma, which is governed by the Gemma Terms of Use. The Royal Gazette dataset is licensed under CC BY-SA 4.0.

Acknowledgements

Open Law Data Thailand: for the high-quality Royal Gazette and Thai law datasets
Google DeepMind: for the Gemma base model
iApp Technology: for Royal Gazette OCR


Safetensors
Model size: 8B params
Tensor type: BF16
