---
license: apache-2.0
language:
- ko
- en
library_name: transformers
base_model: Qwen/Qwen3.5-27B
pipeline_tag: image-text-to-text
tags:
- ocr
- korean
- english
- document-ai
- document-understanding
- vision-language
- qwen3
- qwen3-vl
- structured-extraction
- public-document
- academic-paper
---

# Qwen3.5-27B OCR SFT v1

**Korean Public Documents & English Academic Papers · OCR · Semantic Structuring**
**한국 공공문서 · 영어 논문 · OCR · 의미 분석 & 구조화 파인튜닝 모델**

This model is **not a plain OCR model**. It is a Qwen3.5-27B vision-language checkpoint fine-tuned to **interpret the meaning of document content and re-emit it as structured output** (Markdown, hierarchical tables, key-value fields, sectioned blocks). It was trained on a **balanced Korean / English** corpus that emphasizes Korean public-sector documents and English research papers.

이 모델은 **단순 텍스트 인식 OCR이 아닙니다.** Qwen3.5-27B를 베이스로, **한국 공공문서의 의미를 해석하고 구조화된 표현**(Markdown / 계층적 표 / 키-값 / 섹션)으로 출력하도록 파인튜닝한 버전입니다. 학습 데이터는 **AIHub의 한국 공공기관 문서**와 **HuggingFace 상의 영어 논문 데이터셋**을 **한국어·영어 균형 있게** 구성했습니다.

> ⚠️ v1 — experimental checkpoint. Evaluation metrics and training recipe will be expanded in later revisions. / 실험적 체크포인트입니다.

---

## 🎯 What this model is good at / 이 모델의 강점

A generic OCR model "reads the glyphs." This model **reads, then understands, then reorganizes**.

- **Structured output** — Identifies titles, body text, tables, lists, signature blocks, stamp regions; re-emits as Markdown / HTML tables / JSON-like structure.
  **구조화 출력** — 제목·본문·표·목록·서명란·도장 영역을 식별하고 Markdown / HTML 표 / JSON-like 구조로 재구성.

- **Semantic analysis** — Not a raw string dump; fields are grouped by *what they mean* (발신기관, 문서번호, 결재선, 수신처, 시행일자, 붙임 / sender, doc-number, routing, addressee, effective date, attachments).
  **의미 분석** — 스트링 추출이 아닌 *"이 항목이 무엇을 의미하는가"* 기준으로 필드를 정리.

- **Korean public-document specificity** — Handles 공문 서식, 관인/직인 영역, 기관 특유 표기(`○○시장`, `붙임`, `수신자 참조`), hierarchical legal numbering (제1조–제2항–제3호).
  **한국 공공문서 특성 반영** — 공문 서식, 관인·직인, 기관 고유 표기, 제N조–제N항–제N호 계층.

- **English academic papers** — Abstract / section / figure-caption segmentation, citation-friendly reading order, and recovery of tables that sit next to equations.
  **영어 논문** — 초록·섹션·그림 캡션 분리, 인용을 고려한 읽기 순서, 표/수식 인접 구조 복원.

- **Complex tables** — Merged cells, multi-row headers, empty cells, mixed units, footnoted tables — reconstructed as semantic units.
  **복잡한 표 처리** — 병합 셀, 다중 헤더, 빈 칸, 단위 혼재, 주석 포함 표를 의미 단위로 복원.

- **Long context** — Up to **262K tokens**; multi-page documents can be processed in one pass.
  **장문 문맥** — 최대 **262K 토큰**으로 수십 페이지를 단일 호출에 처리.

---

## 📌 Specifications / 주요 사양

| Item | Value |
|---|---|
| **Base model** | [`Qwen/Qwen3.5-27B`](https://huggingface.co/Qwen/Qwen3.5-27B) |
| **Architecture** | `Qwen3_5ForConditionalGeneration` (Hybrid Linear + Full Attention) |
| **Parameters** | ≈ 27B · merged full weights (no adapter) |
| **Precision** | bfloat16 |
| **Context length** | 262,144 tokens |
| **Vocab size** | 248,320 |
| **Vision patch / merge** | 16 / 2 (Qwen3VLProcessor) |
| **MTP module** | 1 layer (`model-mtp.safetensors`, optional serving) |
| **Fine-tune type** | Full-parameter SFT, merged checkpoint |
| **Languages** | Korean ↔ English (balanced) |
| **License** | Apache-2.0 |

---

## 📚 Training Data / 학습 데이터

Balanced Korean / English corpus, document-centric:
한국어 / 영어 균형 구성, 문서 중심 코퍼스:

- **AIHub Korean public-sector document datasets** — 공문, 고시, 공고, 신청서, 증명서, 결재 문서, 회의록, 각종 행정 양식.
  AIHub의 한국 공공기관 문서 데이터셋들: 공문 / 고시 / 공고 / 신청서 / 증명서 / 결재 문서 / 회의록 / 행정 양식.
- **HuggingFace English academic paper datasets** — abstract, figures, tables, bibliography-style layouts.
  HuggingFace 상의 영어 논문 데이터셋: 초록, 그림, 표, 참고문헌 레이아웃.

Tasks covered during SFT: OCR ground-truth transcription, Markdown/HTML structuring, field extraction, short semantic summary.
SFT 과업: OCR 원문 전사, Markdown/HTML 구조화, 필드 추출, 짧은 의미 요약.

---

## 🚀 Quick Start / 빠른 시작

### 1) Transformers

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "Sooryeon/qwen3.5-27b-ocr-sft-v1"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id, dtype="bfloat16", device_map="auto"
)

image = Image.open("document.png").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text":
            "Convert this document to Markdown, preserving layout. "
            "Render tables as HTML <table> with rowspan/colspan. "
            "Emit metadata fields (doc number, date, sender, recipient) as a separate block."
        },
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=8192)
print(processor.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

### 2) vLLM serving

```bash
vllm serve Sooryeon/qwen3.5-27b-ocr-sft-v1 \
    --dtype bfloat16 \
    --max-model-len 32768 \
    --gpu-memory-utilization 0.90 \
    --trust-remote-code
```

This exposes an OpenAI-compatible multimodal endpoint; send images as `image_url` content parts.
OpenAI 호환 멀티모달 엔드포인트 — 이미지를 `image_url` 컨텐츠 파트로 전송하세요.
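
A minimal client sketch for the endpoint above, using only the standard library. It assumes the server listens on `http://localhost:8000/v1` (vLLM's default port) and that the image is passed inline as a base64 data URL; the helper names `build_chat_payload` and `send_chat` are illustrative and not part of any library.

```python
import base64
import json
import urllib.request

MODEL_ID = "Sooryeon/qwen3.5-27b-ocr-sft-v1"


def build_chat_payload(image_bytes, prompt, mime="image/png",
                       temperature=0.0, max_tokens=8192):
    """Build an OpenAI-style chat request with one image and one text part."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    data_url = f"data:{mime};base64,{encoded}"
    return {
        "model": MODEL_ID,
        "temperature": temperature,
        "max_tokens": max_tokens,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
    }


def send_chat(payload, base_url="http://localhost:8000/v1"):
    """POST the payload to /chat/completions and return the reply text.

    base_url is an assumption; adjust it to wherever the server runs.
    """
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Typical use would be `send_chat(build_chat_payload(open("document.png", "rb").read(), "Convert this document to Markdown."))` once the server is up.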

---

## 💡 Prompt Patterns / 권장 프롬프트 패턴

The same model produces different levels of output depending on the instruction.
지시문을 달리하면 동일 모델이 다른 층위의 출력을 냅니다.

**① Layout-faithful Markdown / 문서 구조 복원**
```
Convert this document to Markdown preserving the original layout.
- Heading hierarchy as #/##/###
- Tables as HTML <table> including rowspan/colspan
- Footnotes, attachments, signature blocks as separate sections
```
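
If downstream code needs a rectangular grid rather than HTML, the `rowspan`/`colspan` tables requested above can be expanded with the standard library. This is a sketch under the assumption that the model output contains a single, well-formed `<table>` with no nested tables; `TableCellCollector` and `expand_table` are hypothetical helper names.

```python
from html.parser import HTMLParser


class TableCellCollector(HTMLParser):
    """Collects (text, rowspan, colspan) tuples for each <tr> in the input."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._cell = None  # [text_parts, rowspan, colspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attr = dict(attrs)
            self._cell = [[], int(attr.get("rowspan") or 1),
                          int(attr.get("colspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, rowspan, colspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), rowspan, colspan))
            self._cell = None


def expand_table(html):
    """Return a dense list-of-lists grid; merged cells repeat their text."""
    collector = TableCellCollector()
    collector.feed(html)
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(collector.rows):
        c = 0
        for text, rowspan, colspan in cells:
            while (r, c) in grid:  # skip slots filled by an earlier rowspan
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    if not grid:
        return []
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)]
            for r in range(n_rows)]
```

Repeating the merged cell's text in every slot it covers keeps each row self-describing, which is convenient when loading the grid into a dataframe or CSV.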

**② Field extraction / 필드 추출 (JSON)**
```
Extract the following fields from this document:
  sender_org, recipient, doc_number, effective_date, title, author,
  approval_status, attachments
Respond as JSON only.
```
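
Even with "Respond as JSON only", a reply may arrive wrapped in a Markdown code fence, so it is worth parsing defensively. The sketch below assumes the field names from the prompt above and that missing fields should come back as `None`; `parse_fields` is a hypothetical helper, not something shipped with the model.

```python
import json
import re

EXPECTED_FIELDS = (
    "sender_org", "recipient", "doc_number", "effective_date",
    "title", "author", "approval_status", "attachments",
)

FENCE = "`" * 3  # Markdown code-fence marker, built here to avoid nesting fences
FENCED_BLOCK = re.compile(rf"^{FENCE}(?:json)?\s*(.*?)\s*{FENCE}$", re.DOTALL)


def parse_fields(raw):
    """Parse the model reply into a dict limited to the expected fields."""
    text = raw.strip()
    fenced = FENCED_BLOCK.match(text)
    if fenced:  # unwrap an optional fenced block around the JSON
        text = fenced.group(1)
    data = json.loads(text)  # raises json.JSONDecodeError on malformed output
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object at the top level")
    # Unknown keys are dropped; absent keys are reported as None.
    return {key: data.get(key) for key in EXPECTED_FIELDS}
```

Letting `json.JSONDecodeError` propagate makes malformed replies visible to the caller, who can then retry or fall back to manual review.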

**③ Semantic summary + structure / 의미 요약 + 구조화**
```
Summarize the document in 3 sentences, then list the key items as
hierarchical bullet points grouped by semantic role.
```

---

## ⚙️ Serving Tips / 서빙 팁

- **Context length** — Short-to-medium documents are in-distribution; for very long documents, page-chunked serving is more stable (tune `--max-model-len` per use case).
  학습 맥락은 중·단문 중심 — 매우 긴 문서는 페이지 단위 청킹이 안정적.
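- **Image resolution** — The image processor config sets `longest_edge = 16,777,216`. In Qwen-style image processors this value is interpreted as a total pixel budget (16,777,216 = 4096 × 4096), not a side length. The practical VRAM-friendly range is a **longest side of 1,792–2,560 px**.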
  실효 해상도는 장변 1,792–2,560px 권장.
- **Sampling** — For structured output, `temperature=0.2–0.4`; for number / table fidelity, use `temperature=0.0`.
  구조화 출력은 `temperature=0.2~0.4`, 표·숫자 정확도 우선이면 `0.0` 권장.
- **MTP** — `model-mtp.safetensors` contains Multi-Token Prediction weights. Only enable on engines that support MTP; HF transformers inference does not require it.
  MTP 가중치는 지원 엔진에서만 활성화 — 일반 transformers 추론에는 불필요.
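
The resolution tip above can be applied before inference by computing a target size that caps the longest side while preserving aspect ratio. This is a minimal sketch; the function name `fit_longest_side` and the default cap of 2,560 px (the upper end of the recommended range) are illustrative choices.

```python
def fit_longest_side(width, height, max_side=2560):
    """Return (width, height) scaled so the longest side is at most max_side.

    Images already within the limit are returned unchanged (no upscaling).
    """
    longest = max(width, height)
    if longest <= max_side:
        return width, height
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))
```

With Pillow, the result can be passed straight to `image.resize(fit_longest_side(*image.size), Image.Resampling.LANCZOS)`.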

---

## 📉 Limitations / 한계

- **v1 experimental checkpoint** — Natural images, handwriting, equations, and other domains outside the training mix show higher variance.
  v1 실험 체크포인트 — 자연 이미지·필기체·수식 등 학습 분포 밖은 편차 큼.
- **Potential hallucination** — Semantic summarization and structuring can introduce wording or values that are not in the source. Cross-check against the original for any legal or official use.
  환각 가능성 — 구조화·요약에서 표현이 보완될 수 있으므로 법적·공적 용도는 원문 교차 검증 필수.
- **PII awareness** — Public documents may contain Korean resident IDs, phone numbers, etc. Apply masking in your downstream pipeline.
  공공문서에는 주민등록번호·연락처 등이 포함될 수 있으므로 별도 마스킹 적용.
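
As a starting point for the masking step, a pattern-based pass over the model output might look like the sketch below. It is an assumption-laden example, not a complete PII solution: it only covers resident registration numbers written as 6 + 7 digits and mobile numbers with `01x` prefixes, and it keeps the birth-date part and the last four phone digits visible, which may not suit every policy. `mask_pii` is a hypothetical helper name.

```python
import re

# Resident registration number: YYMMDD, optional separator, 7 digits.
RRN = re.compile(r"(?<!\d)(\d{6})[-\s]?\d{7}(?!\d)")
# Korean mobile number: 010/011/016/017/018/019 prefix, optional separators.
PHONE = re.compile(r"(?<!\d)(01[016789])[-.\s]?\d{3,4}[-.\s]?(\d{4})(?!\d)")


def mask_pii(text):
    """Mask the sensitive digits of RRNs and mobile numbers found in text."""
    text = RRN.sub(lambda m: f"{m.group(1)}-*******", text)
    text = PHONE.sub(lambda m: f"{m.group(1)}-****-{m.group(2)}", text)
    return text
```

Regex masking misses reformatted or OCR-damaged numbers, so sensitive deployments would likely pair it with a dedicated PII detector and human review.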

---

## 📄 License

Apache-2.0 — see `LICENSE` or [apache.org/licenses/LICENSE-2.0](https://www.apache.org/licenses/LICENSE-2.0).
Base model usage also follows [Qwen/Qwen3.5-27B](https://huggingface.co/Qwen/Qwen3.5-27B) terms.

---

## 📚 Citation / 인용

```bibtex
@misc{sooryeon2026qwen35ocrsftv1,
  title  = {Qwen3.5-27B OCR SFT v1: Korean Public Documents and English
            Academic Papers with Semantic Structuring},
  author = {Sooryeon},
  year   = {2026},
  url    = {https://huggingface.co/Sooryeon/qwen3.5-27b-ocr-sft-v1}
}
```

## 🔗 Links

- Base model: [`Qwen/Qwen3.5-27B`](https://huggingface.co/Qwen/Qwen3.5-27B)
- This model: [`Sooryeon/qwen3.5-27b-ocr-sft-v1`](https://huggingface.co/Sooryeon/qwen3.5-27b-ocr-sft-v1)

---

*Fine-tuned with care for Korean public-sector and English academic document understanding.*
*한국 공공문서 및 영어 학술 문서 이해를 위해 세심하게 파인튜닝되었습니다.*