---
license: apache-2.0
language:
- ja
base_model:
- nlp-waseda/roberta-base-japanese
---

# LayoutLMv3-Japanese-preview

LayoutLMv3-Japanese-preview is a multimodal pre-trained model for Japanese Document AI, 
built on the [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) architecture. 
The text tokenizer is replaced with a Japanese tokenizer from 
[nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese), 
and the visual tokenizer adopts [BEiT v2](https://github.com/microsoft/unilm/tree/master/beit2) VQ-KD. 
The model is pre-trained on ~20M Japanese web pages from NDL WARP, plus PubLayNet and DocLayNet, for document layout analysis.


## Training Data

### Pre-training

| Dataset |  Scale |
|---|---|
| [NDL WARP](https://warp.da.ndl.go.jp/) |  20M+ pages |
| [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet) |  360K+ images |
| [DocLayNet](https://github.com/DS4SD/DocLayNet) | 80K+ images |

### Evaluation

| Dataset | Task | Language |
|---|---|---|
| [FUNSD](https://guillaumejaume.github.io/FUNSD/) | Form Understanding | English |
| [XFUND-JA](https://github.com/doc-analysis/XFUND) | Form Understanding | Japanese |
| [JDocQA](https://github.com/mizuumi/JDocQA) | Document Visual Question Answering | Japanese |

### Tokenizer and Visual Tokenizer

| Component | Source | License |
|---|---|---|
| Text Tokenizer | [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | Apache 2.0 |
| Visual Tokenizer (VQ-KD) | [BEiT v2](https://github.com/microsoft/unilm/tree/master/beit2) | MIT |
| PDF Text/BBox Extractor (pre-training) | [PyMuPDF (`fitz`)](https://pymupdf.readthedocs.io/) | AGPL v3 / commercial |

## Evaluation Results

### FUNSD (English Form Understanding)

| Model | F1 (test) |
|---|---|
| [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) | 0.9059 |
| [LayoutLMv2](https://huggingface.co/microsoft/layoutlmv2-base-uncased) | 0.8276 |
| LayoutLMv3-Japanese-preview (Ours) | 0.8284 |

### XFUND-JA (Japanese Form Understanding)

| Model | F1 (test) |
|---|---|
| [LayoutXLM](https://huggingface.co/microsoft/layoutxlm-base) | 0.7921 |
| LayoutLMv3-Japanese-preview (Ours) | 0.7436 |

### JDocQA (Japanese Document Visual Question Answering)

| Model | Task | F1 (test) | Accuracy (test) |
|---|---|---|---|
| LayoutLMv3-Japanese-preview (Ours) | Yes/No | 0.5750 | 0.7639 |
| LayoutLMv3-Japanese-preview (Ours) | Multiple Choice (4 options) | 0.8351 | 0.8353 |
| [LayoutXLM](https://huggingface.co/microsoft/layoutxlm-base) | Yes/No | 0.5403 | 0.7847 |
| [LayoutXLM](https://huggingface.co/microsoft/layoutxlm-base) | Multiple Choice (4 options) | 0.8544 | 0.8543 |

## Example Usage

The tokenizer is an `AlbertTokenizer` (SentencePiece) inherited from
[nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese),
so the standard `LayoutLMv3Processor` cannot be used as-is. Instead,
combine `AutoTokenizer` with `LayoutLMv3ImageProcessor` and align each
subword's `bbox` with the source word.

Requirements: `transformers>=4.44`, `torch`, `Pillow`. `PyMuPDF`
(imported as `fitz`) is **optional**: it is only needed to render a
PDF on the fly, not to run the provided `sample.png`. Words and
bounding boxes can be produced by any OCR engine that supports
Japanese (e.g. PaddleOCR, Tesseract with `jpn` traineddata, or
manga-ocr). Boxes must be given in pixel coordinates
`(x0, y0, x1, y1)` and then normalized to LayoutLMv3's `0–1000` range
before they are passed to the model.
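
Some OCR engines report each text region as a four-point polygon rather than an axis-aligned box. The sketch below shows one way such output could be converted to `(x0, y0, x1, y1)` and scaled to the `0–1000` range. The helper names `quad_to_box` and `normalize_box` and the sample coordinates are illustrative assumptions, not part of this repository, and clamping out-of-page values is a design choice rather than a documented requirement.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def quad_to_box(quad: Sequence[Point]) -> Tuple[float, float, float, float]:
    """Collapse a 4-point polygon into an axis-aligned (x0, y0, x1, y1) box."""
    xs = [p[0] for p in quad]
    ys = [p[1] for p in quad]
    return min(xs), min(ys), max(xs), max(ys)


def normalize_box(box: Sequence[float], width: int, height: int) -> List[int]:
    """Scale a pixel box to the 0-1000 range, clamping out-of-page values."""
    x0, y0, x1, y1 = box

    def scale(value: float, size: int) -> int:
        return max(0, min(1000, int(1000 * value / size)))

    return [scale(x0, width), scale(y0, height),
            scale(x1, width), scale(y1, height)]


# Hypothetical OCR polygon for one word on a 1000 x 500 px page.
quad = [(100.0, 50.0), (300.0, 52.0), (299.0, 100.0), (101.0, 98.0)]
print(normalize_box(quad_to_box(quad), width=1000, height=500))
# [100, 100, 300, 200]
```

Taking the min/max of the corners slightly enlarges the box for rotated text, which is usually acceptable for layout embeddings.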

A sample Japanese document image is provided as `sample.png` in this
repository for quick experimentation. The snippet below supports two
input paths: load `sample.png` directly (Option A, active by default),
or render the first page of any PDF via `fitz` (PyMuPDF). To use a PDF,
comment out Option A and uncomment Option B.

![sample document](sample.png)

```python
import io

import torch
from PIL import Image
from transformers import AutoTokenizer, LayoutLMv3ImageProcessor, LayoutLMv3Model

REPO = "llm-jp/layoutlmv3-japanese-preview"

tokenizer = AutoTokenizer.from_pretrained(REPO)
image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)
model = LayoutLMv3Model.from_pretrained(REPO).eval()

# --- Option A: use the bundled sample.png (no PyMuPDF required) ---
image = Image.open("sample.png").convert("RGB")

# --- Option B: render the first page of any PDF via fitz (PyMuPDF) ---
# import fitz
# doc = fitz.open("your_document.pdf")
# pix = doc[0].get_pixmap(dpi=200)
# image = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")

width, height = image.size

# Replace with your OCR output. Each box is (x0, y0, x1, y1) in pixel coords.
words = ["石巻市駅前北通り", "災害公営住宅", "完成資料"]
boxes = [
    (160,  28, 420, 70),
    (430,  28, 610, 70),
    (625,  28, 770, 70),
]

def normalize(box, w, h):
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / w), int(1000 * y0 / h),
        int(1000 * x1 / w), int(1000 * y1 / h),
    ]

# Tokenize each word and assign the word's bbox to all of its subwords.
input_ids = [tokenizer.cls_token_id]
bbox      = [[0, 0, 0, 0]]
for word, box in zip(words, [normalize(b, width, height) for b in boxes]):
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    input_ids += ids
    bbox      += [box] * len(ids)
input_ids.append(tokenizer.sep_token_id)
bbox.append([1000, 1000, 1000, 1000])

input_ids      = torch.tensor([input_ids])
bbox           = torch.tensor([bbox])
attention_mask = torch.ones_like(input_ids)
pixel_values   = image_processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        bbox=bbox,
        attention_mask=attention_mask,
        pixel_values=pixel_values,
    )

# (batch, num_text_tokens + num_image_patches + 1 visual [CLS], hidden_size), e.g. (1, 213, 768)
print(outputs.last_hidden_state.shape)
```
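
The printed shape can be sanity-checked by hand. The sketch below assumes the LayoutLMv3-base defaults of a 224 x 224 input image and 16 x 16 patches, which this checkpoint may or may not keep, and it assumes 16 text tokens (including `[CLS]` and `[SEP]`) purely to match the example shape above. `expected_sequence_length` is a hypothetical helper, not a `transformers` function.

```python
def expected_sequence_length(num_text_tokens: int,
                             image_size: int = 224,
                             patch_size: int = 16) -> int:
    """Text tokens + image patches + one visual [CLS] token."""
    patches_per_side = image_size // patch_size
    num_visual_tokens = patches_per_side ** 2 + 1  # +1 for the visual [CLS]
    return num_text_tokens + num_visual_tokens


# 16 text tokens with the assumed defaults: 16 + 14 * 14 + 1
print(expected_sequence_length(16))  # 213
```

In the Hugging Face implementation the text positions come first, so `outputs.last_hidden_state[:, :input_ids.shape[1]]` should select the per-token text features.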

For downstream tasks (token classification, QA, etc.), swap `LayoutLMv3Model`
for the corresponding task head class such as
`LayoutLMv3ForTokenClassification` or `LayoutLMv3ForQuestionAnswering`
and fine-tune on your labeled data.
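
For token classification, word-level labels need the same subword alignment as the bounding boxes. A minimal sketch follows, assuming the common convention of labeling only the first subword of each word and masking the rest with `-100` (the default `ignore_index` of PyTorch's cross-entropy loss). `align_words` and `toy_encode` are hypothetical helpers, and the label and special-token ids are placeholders.

```python
from typing import Callable, List, Sequence, Tuple

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def align_words(
    words: Sequence[str],
    boxes: Sequence[Sequence[int]],
    word_labels: Sequence[int],
    encode: Callable[[str], List[int]],
    cls_id: int,
    sep_id: int,
) -> Tuple[List[int], List[List[int]], List[int]]:
    """Expand word-level boxes and labels to subword level."""
    input_ids = [cls_id]
    bbox = [[0, 0, 0, 0]]
    labels = [IGNORE_INDEX]
    for word, box, label in zip(words, boxes, word_labels):
        ids = encode(word)
        if not ids:  # skip words the tokenizer maps to nothing
            continue
        input_ids += ids
        bbox += [list(box)] * len(ids)
        # Only the first subword carries the label.
        labels += [label] + [IGNORE_INDEX] * (len(ids) - 1)
    input_ids.append(sep_id)
    bbox.append([1000, 1000, 1000, 1000])  # same [SEP] box as the snippet above
    labels.append(IGNORE_INDEX)
    return input_ids, bbox, labels


def toy_encode(word: str) -> List[int]:
    """Toy encoder for illustration only: one fake id per character."""
    return [ord(ch) for ch in word]


ids, bbox, labels = align_words(
    ["完成", "資料"], [[10, 20, 30, 40], [50, 20, 70, 40]], [1, 2],
    toy_encode, cls_id=2, sep_id=3,
)
print(labels)  # [-100, 1, -100, 2, -100, -100]
```

In practice `encode` would wrap the real tokenizer, for example `lambda w: tokenizer(w, add_special_tokens=False)["input_ids"]`, with `cls_id` and `sep_id` taken from `tokenizer.cls_token_id` and `tokenizer.sep_token_id`.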

## License

This model is licensed under **Apache 2.0**.

### Pre-training Pipeline (Text & Layout Extraction)

For the NDL WARP PDF corpus, word-level text and bounding boxes were
extracted with [PyMuPDF (`fitz`)](https://pymupdf.readthedocs.io/) —
not a learned OCR model — to build `(word, bbox)` pairs used as the 1D
text + 2D layout inputs for pre-training. PyMuPDF was used strictly as
an **internal data-processing tool**: the library was not modified,
was not redistributed, and is neither embedded in nor linked to the
published model weights. Under AGPL v3, the copyleft obligations
(§5 "Conveying Modified Source Versions" and §13 "Remote Network
Interaction") attach only when the covered software itself is
*conveyed* or *served to users over a network*. Running PyMuPDF
locally to produce derived data (text strings and coordinate tuples),
and then training a separate model on that data, is a permitted use
and does not cause AGPL terms to propagate to the resulting model
weights. We therefore consider this usage to be license-compliant.
The weights distributed here remain Apache 2.0.
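
As an illustration of the kind of extraction described above, and not the actual pre-training pipeline, the sketch below turns PyMuPDF word tuples into `(word, bbox)` pairs. It relies on `page.get_text("words")`, which yields tuples starting with `(x0, y0, x1, y1, text, ...)`. The helper names, the 0-1000 scaling, and the sample tuple are assumptions made for this example.

```python
from typing import Iterable, List, Sequence, Tuple


def words_to_pairs(
    word_tuples: Iterable[Sequence],
    page_width: float,
    page_height: float,
) -> List[Tuple[str, List[int]]]:
    """Turn PyMuPDF-style word tuples into (word, 0-1000 bbox) pairs."""

    def scale(value: float, size: float) -> int:
        return max(0, min(1000, int(1000 * value / size)))

    pairs = []
    for x0, y0, x1, y1, text, *_ in word_tuples:
        if not text.strip():  # drop empty or whitespace-only entries
            continue
        box = [scale(x0, page_width), scale(y0, page_height),
               scale(x1, page_width), scale(y1, page_height)]
        pairs.append((text, box))
    return pairs


def extract_pairs(pdf_path: str, page_index: int = 0):
    """Read one PDF page with PyMuPDF (imported lazily, optional dependency)."""
    import fitz  # PyMuPDF

    doc = fitz.open(pdf_path)
    try:
        page = doc[page_index]
        return words_to_pairs(page.get_text("words"),
                              page.rect.width, page.rect.height)
    finally:
        doc.close()


# Hand-written tuple standing in for page.get_text("words") output.
fake_words = [(60.0, 80.0, 120.0, 160.0, "完成資料", 0, 0, 0)]
print(words_to_pairs(fake_words, page_width=600.0, page_height=800.0))
# [('完成資料', [100, 100, 200, 200])]
```

PyMuPDF delimits "words" by whitespace, so for Japanese text a single entry may span more than one linguistic word.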

### Training Data and Legal Notice

- Pre-training data: NDL WARP (Japanese web archive), PubLayNet ([CDLA-Permissive-1.0](https://cdla.dev/permissive-1-0/)), DocLayNet ([CDLA-Permissive-1.0](https://cdla.dev/permissive-1-0/))

The document images in PubLayNet originate from the PMC Open Access
Commercial Use Collection, which includes articles under CC0, CC BY,
CC BY-SA, and CC BY-ND licenses. We used these documents for model
training in reliance on Article 30-4 of the Japanese Copyright Law (2026).

Users outside Japan should assess the applicability of their
local copyright exceptions when using this model.