---
license: apache-2.0
language:
- ja
base_model:
- nlp-waseda/roberta-base-japanese
---

# LayoutLMv3-Japanese-preview

LayoutLMv3-Japanese-preview is a multimodal pre-trained model for Japanese Document AI, built on the [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) architecture. The text tokenizer is replaced with the Japanese tokenizer from [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese), and the visual tokenizer adopts [BEiT v2](https://github.com/microsoft/unilm/tree/master/beit2) VQ-KD. The model is pre-trained on roughly 20M Japanese web pages from NDL WARP, along with the PubLayNet and DocLayNet document layout analysis datasets.

## Training Data

### Pre-training

| Dataset | Scale |
|---|---|
| [NDL WARP](https://warp.da.ndl.go.jp/) | 20M+ pages |
| [PubLayNet](https://github.com/ibm-aur-nlp/PubLayNet) | 360K+ images |
| [DocLayNet](https://github.com/DS4SD/DocLayNet) | 80K+ images |

### Evaluation

| Dataset | Task | Language |
|---|---|---|
| [FUNSD](https://guillaumejaume.github.io/FUNSD/) | Form Understanding | English |
| [XFUND-JA](https://github.com/doc-analysis/XFUND) | Form Understanding | Japanese |
| [JDocQA](https://github.com/mizuumi/JDocQA) | Document Visual Question Answering | Japanese |

### Tokenizer and Visual Tokenizer

| Component | Source | License |
|---|---|---|
| Text Tokenizer | [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese) | Apache 2.0 |
| Visual Tokenizer (VQ-KD) | [BEiT v2](https://github.com/microsoft/unilm/tree/master/beit2) | MIT |
| PDF Text/BBox Extractor (pre-training) | [PyMuPDF (`fitz`)](https://pymupdf.readthedocs.io/) | AGPL v3 / commercial |

## Evaluation Results

### FUNSD (English Form Understanding)

| Model | F1 (test) |
|---|---|
| [LayoutLMv3](https://huggingface.co/microsoft/layoutlmv3-base) | 0.9059 |
| [LayoutLMv2](https://huggingface.co/microsoft/layoutlmv2-base-uncased) | 0.8276 |
| LayoutLMv3-Japanese-preview (Ours) | 0.8284 |

### XFUND-JA (Japanese Form Understanding)

| Model | F1 (test) |
|---|---|
| [LayoutXLM](https://huggingface.co/microsoft/layoutxlm-base) | 0.7921 |
| LayoutLMv3-Japanese-preview (Ours) | 0.7436 |

### JDocQA (Japanese Document Visual Question Answering)

| Task | F1 (test) | Accuracy (test) |
|---|---|---|
| **LayoutLMv3-Japanese-preview (Ours)** | | |
| Yes/No | 0.5750 | 0.7639 |
| Multiple Choice (4 options) | 0.8351 | 0.8353 |
| **[LayoutXLM](https://huggingface.co/microsoft/layoutxlm-base)** | | |
| Yes/No | 0.5403 | 0.7847 |
| Multiple Choice (4 options) | 0.8544 | 0.8543 |

## Example Usage

The tokenizer is an `AlbertTokenizer` (SentencePiece) inherited from [nlp-waseda/roberta-base-japanese](https://huggingface.co/nlp-waseda/roberta-base-japanese), so the standard `LayoutLMv3Processor` cannot be used as-is. Instead, combine `AutoTokenizer` with `LayoutLMv3ImageProcessor` and assign each subword the `bbox` of the word it came from.

Requirements: `transformers>=4.44`, `torch`, `Pillow`. `PyMuPDF` (imported as `fitz`) is **optional**: it is only needed if you want to render a PDF on the fly. If you only run the provided `sample.png`, you do not need PyMuPDF.

Words and bounding boxes can be produced by any OCR engine that supports Japanese (e.g. PaddleOCR, Tesseract with `jpn` traineddata, or manga-ocr). Boxes must be given in pixel coordinates `(x0, y0, x1, y1)` and then normalized to LayoutLMv3's `0–1000` range before they are passed to the model.

A sample Japanese document image is provided as `sample.png` in this repository for quick experimentation. The snippet below supports two input paths: load `sample.png` directly, or render the first page of any PDF via `fitz` (PyMuPDF). Uncomment the branch you want.

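
A side note on box units before the full snippet: for text-based PDFs, word boxes usually come out in PDF points (72 per inch) rather than pixels. The following is a minimal sketch, assuming word tuples shaped like the output of PyMuPDF's `page.get_text("words")`, that scales such boxes to the pixel grid of a page rendered at a given DPI and then maps them to the 0 to 1000 range, clamping values that fall slightly outside the page. The helper names `points_to_pixels` and `normalize_clamped`, the sample tuple, and the page size are all hypothetical and only serve as illustration.

```python
def points_to_pixels(box, dpi=200):
    """Scale an (x0, y0, x1, y1) box from PDF points (72 per inch) to pixels."""
    scale = dpi / 72.0
    return tuple(round(v * scale) for v in box)


def normalize_clamped(box, width, height):
    """Map a pixel box to the 0-1000 range, clamping out-of-page values."""
    x0, y0, x1, y1 = box

    def to_range(value, size):
        return max(0, min(1000, int(1000 * value / size)))

    return [
        to_range(x0, width),
        to_range(y0, height),
        to_range(x1, width),
        to_range(y1, height),
    ]


# Hypothetical word tuple in the layout of PyMuPDF's page.get_text("words"):
# (x0, y0, x1, y1, word, block_no, line_no, word_no), coordinates in points.
pdf_words = [(57.6, 10.08, 151.2, 25.2, "石巻市駅前北通り", 0, 0, 0)]

# Assumed pixel size of the rendered page; in practice take it from image.size.
width, height = 1653, 2339

words = [w[4] for w in pdf_words]
boxes = [points_to_pixels(w[:4], dpi=200) for w in pdf_words]
normalized = [normalize_clamped(b, width, height) for b in boxes]
print(words, boxes, normalized)
```

The resulting `words` and `boxes` have the same shape as the hand-written lists in the snippet below, so they can be swapped in for them. Clamping is a defensive choice here, since OCR or extraction boxes can overshoot the page edge by a pixel or two.
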

![sample document](sample.png)

```python
import io

import torch
from PIL import Image
from transformers import AutoTokenizer, LayoutLMv3ImageProcessor, LayoutLMv3Model

REPO = "llm-jp/layoutlmv3-japanese-preview"

tokenizer = AutoTokenizer.from_pretrained(REPO)
image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)
model = LayoutLMv3Model.from_pretrained(REPO).eval()

# --- Option A: use the bundled sample.png (no PyMuPDF required) ---
image = Image.open("sample.png").convert("RGB")

# --- Option B: render the first page of any PDF via fitz (PyMuPDF) ---
# import fitz
# doc = fitz.open("your_document.pdf")
# pix = doc[0].get_pixmap(dpi=200)
# image = Image.open(io.BytesIO(pix.tobytes("png"))).convert("RGB")

width, height = image.size

# Replace with your OCR output. Each box is (x0, y0, x1, y1) in pixel coords.
words = ["石巻市駅前北通り", "災害公営住宅", "完成資料"]
boxes = [
    (160, 28, 420, 70),
    (430, 28, 610, 70),
    (625, 28, 770, 70),
]


def normalize(box, w, h):
    """Scale a pixel box to LayoutLMv3's 0-1000 coordinate range."""
    x0, y0, x1, y1 = box
    return [
        int(1000 * x0 / w),
        int(1000 * y0 / h),
        int(1000 * x1 / w),
        int(1000 * y1 / h),
    ]


# Tokenize each word and assign the word's bbox to all of its subwords.
input_ids = [tokenizer.cls_token_id]
bbox = [[0, 0, 0, 0]]
for word, box in zip(words, [normalize(b, width, height) for b in boxes]):
    ids = tokenizer(word, add_special_tokens=False)["input_ids"]
    input_ids += ids
    bbox += [box] * len(ids)
input_ids.append(tokenizer.sep_token_id)
bbox.append([1000, 1000, 1000, 1000])

input_ids = torch.tensor([input_ids])
bbox = torch.tensor([bbox])
attention_mask = torch.ones_like(input_ids)
pixel_values = image_processor(images=image, return_tensors="pt")["pixel_values"]

with torch.no_grad():
    outputs = model(
        input_ids=input_ids,
        bbox=bbox,
        attention_mask=attention_mask,
        pixel_values=pixel_values,
    )

# Shape: (batch, num_text_tokens + num_image_patches, hidden_size), e.g. (1, 213, 768)
print(outputs.last_hidden_state.shape)
```

For downstream tasks (token classification, QA, etc.), swap `LayoutLMv3Model` for the corresponding task head class, such as `LayoutLMv3ForTokenClassification` or `LayoutLMv3ForQuestionAnswering`, and fine-tune on your labeled data.

## License

This model is licensed under **Apache 2.0**.

### Pre-training Pipeline (Text & Layout Extraction)

For the NDL WARP PDF corpus, word-level text and bounding boxes were extracted with [PyMuPDF (`fitz`)](https://pymupdf.readthedocs.io/), which is not a learned OCR model, to build the `(word, bbox)` pairs used as the 1D text + 2D layout inputs for pre-training.

PyMuPDF was used strictly as an **internal data-processing tool**: the library was not modified, was not redistributed, and is neither embedded in nor linked to the published model weights. Under AGPL v3, the copyleft obligations (§5 "Conveying Modified Source Versions" and §13 "Remote Network Interaction") attach only when the covered software itself is *conveyed* or *served to users over a network*. Running PyMuPDF locally to produce derived data (text strings and coordinate tuples), and then training a separate model on that data, is a permitted use and does not cause AGPL terms to propagate to the resulting model weights. We therefore consider this usage to be license-compliant. The weights distributed here remain Apache 2.0.

### Training Data and Legal Notice

- Pre-training data: NDL WARP (Japanese web archive), PubLayNet ([CDLA-Permissive-1.0](https://cdla.dev/permissive-1-0/)), DocLayNet ([CDLA-Permissive-1.0](https://cdla.dev/permissive-1-0/))

The document images in PubLayNet originate from the PMC Open Access Commercial Use Collection, which includes articles under CC0, CC BY, CC BY-SA, and CC BY-ND licenses. We used these documents for model training in reliance on Article 30-4 of the Japanese Copyright Law (2026). Users outside Japan should assess whether their local copyright exceptions apply when using this model.