---
license: apache-2.0
tags:
- vision
library_name: transformers
pipeline_tag: zero-shot-image-classification
datasets:
- llm-jp/WAON
base_model:
- google/siglip2-base-patch16-256
---

# waon-siglip2-base-patch16-256

| 🤗 HuggingFace  | 📄 Paper  | 🧑‍💻 Code  |
We fine-tuned [google/siglip2-base-patch16-256](https://huggingface.co/google/siglip2-base-patch16-256) on [WAON](https://huggingface.co/datasets/llm-jp/WAON), a large-scale Japanese image-text pair dataset. Our model achieves state-of-the-art performance on [WAON-Bench](https://huggingface.co/datasets/llm-jp/WAON-Bench), a Japanese cultural image classification benchmark.

## Evaluation

Our model achieves the best scores on the Japanese benchmarks (Recruit and WAON-Bench). The best score in each column is shown in bold.

| Model | Params | XM3600 | ImageNet | Recruit | WAON-Bench | Avg |
|-------|--------|--------|----------|---------|------------|-----|
| **siglip2-base-patch16-256 (fine-tuned on WAON)** | 375M | 73.75 | 49.61 | **83.14** | **94.97** | **75.37** |
| siglip2-base-patch16-256 (fine-tuned on ReLAION) | 375M | 72.39 | 47.38 | 81.65 | 92.99 | 73.60 |
| [siglip2-base-patch16-256](https://huggingface.co/google/siglip2-base-patch16-256) | 375M | 38.28 | 48.12 | 76.98 | 87.81 | 62.80 |
| [clip-japanese-base](https://huggingface.co/line-corporation/clip-japanese-base) | 196M | **78.00** | 48.90 | 81.65 | 90.05 | 74.65 |
| [siglip-base-patch16-256-mult](https://huggingface.co/google/siglip-base-patch16-256-multilingual) | 371M | 43.22 | 53.26 | 75.10 | 89.25 | 65.21 |
| [Japanese Stable CLIP ViT-L-16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) | 414M | 66.03 | **55.97** | 71.29 | 82.03 | 68.83 |
| [LAION-CLIP-ViT-H-14](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) | 1193M | 72.64 | 47.67 | 70.62 | 85.88 | 69.20 |

## How to Use

Here is a sample code snippet for zero-shot image classification:

```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

# Load the fine-tuned checkpoint and its matching processor
ckpt = "llm-jp/waon-siglip2-base-patch16-256"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

# Download a sample image and convert it to RGB
url = "https://upload.wikimedia.org/wikipedia/commons/5/58/Shiba_inu_taiki.jpg"
image = Image.open(requests.get(url, stream=True, headers={"User-Agent": "Mozilla/5.0"}).raw).convert("RGB")

candidate_labels = ["柴犬", "日本猫", "いわし"]

# IMPORTANT: pass `padding="max_length"` and `max_length=64`, since the model was trained with these settings
inputs = processor(text=candidate_labels, images=image, padding="max_length", max_length=64, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

# SigLIP-style models score each label independently with a sigmoid
logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)

for i, label in enumerate(candidate_labels):
    print(f"prob that image is '{label}': {probs[0][i]:.2%}")

# prob that image is '柴犬': 96.57%
# prob that image is '日本猫': 0.03%
# prob that image is 'いわし': 0.00%
```

For more information, please read the [SigLIP2 documentation](https://huggingface.co/docs/transformers/en/model_doc/siglip2).

## Citation

```bibtex
@misc{sugiura2025waonlargescalehighqualityjapanese,
  title={WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models},
  author={Issa Sugiura and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Yasuo Okabe and Naoaki Okazaki},
  year={2025},
  eprint={2510.22276},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2510.22276},
}
```
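
## Interpreting the Scores

The usage snippet above applies `torch.sigmoid` to the logits, so each candidate label is scored independently and the printed probabilities need not sum to 100%. The following is a minimal pure-Python sketch of that difference compared with a softmax; the logit values are hypothetical placeholders chosen for illustration, not real model output, and the helper names `sigmoid` and `softmax` are defined here only for the example.

```python
import math


def sigmoid(x: float) -> float:
    """Map a single logit to an independent probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def softmax(xs: list[float]) -> list[float]:
    """Normalize a list of logits into probabilities that sum to 1."""
    m = max(xs)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical logits for three candidate labels (NOT real model output).
candidate_labels = ["柴犬", "日本猫", "いわし"]
logits = [3.3, -8.1, -12.0]

# Sigmoid: each label is judged on its own, so the scores can sum to less
# (or more) than 1.
independent = [sigmoid(x) for x in logits]

# Softmax: scores are forced to compete and always sum to 1.
normalized = softmax(logits)

for label, p_sig, p_soft in zip(candidate_labels, independent, normalized):
    print(f"{label}: sigmoid={p_sig:.2%}, softmax={p_soft:.2%}")

print(f"sum of sigmoid scores: {sum(independent):.4f}")
print(f"sum of softmax scores: {sum(normalized):.4f}")
```

One practical consequence is that, with sigmoid scores, every candidate label can receive a low probability when none of them matches the image, whereas a softmax would still distribute a total of 100% among them.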