waon-siglip2-base-patch16-256

---
license: apache-2.0
tags:
- vision
library_name: transformers
pipeline_tag: zero-shot-image-classification
datasets:
- llm-jp/WAON
base_model:
- google/siglip2-base-patch16-256
---


<div align="center" style="line-height: 1;">
<h1>waon-siglip2-base-patch16-256</h1> 


  |
  <a href="https://huggingface.co/collections/llm-jp/waon" target="_blank">🤗 HuggingFace</a>
  &nbsp;|
  <a href="https://arxiv.org/abs/2510.22276" target="_blank">📄 Paper</a>
  &nbsp;|
  <a href="https://github.com/llm-jp/WAON" target="_blank">🧑‍💻 Code</a>
  &nbsp;|

  <br/>

</div>

We fine-tuned [google/siglip2-base-patch16-256](https://huggingface.co/google/siglip2-base-patch16-256) on [WAON](https://huggingface.co/datasets/llm-jp/WAON), a large-scale Japanese image-text pair dataset.
The resulting model achieves state-of-the-art performance on [WAON-Bench](https://huggingface.co/datasets/llm-jp/WAON-Bench), a Japanese cultural image classification benchmark, among the models compared in the evaluation below.


## Evaluation
Our model achieves the best scores on the Japanese benchmarks (Recruit and WAON-Bench) and the highest average across the four benchmarks. Bold marks the best score in each column.

| Model | Params | XM3600 | ImageNet | Recruit | WAON-Bench | Avg |
|-------|--------|---------------------------|------------------|---------|------------|-----|
| **siglip2-base-patch16-256 (fine-tuned on WAON)** | 375M | 73.75 | 49.61 | **83.14** | **94.97** | **75.37** |
| siglip2-base-patch16-256 (fine-tuned on ReLAION) | 375M | 72.39 | 47.38 | 81.65 | 92.99 | 73.60 |
| [siglip2-base-patch16-256](https://huggingface.co/google/siglip2-base-patch16-256)| 375M | 38.28 | 48.12 | 76.98 | 87.81 | 62.80 |
| [clip-japanese-base](https://huggingface.co/line-corporation/clip-japanese-base) | 196M | **78.00** | 48.90 | 81.65 | 90.05 | 74.65 |
| [siglip-base-patch16-256-multilingual](https://huggingface.co/google/siglip-base-patch16-256-multilingual) | 371M | 43.22 | 53.26 | 75.10 | 89.25 | 65.21 |
| [Japanese Stable CLIP ViT-L-16](https://huggingface.co/stabilityai/japanese-stable-clip-vit-l-16) | 414M | 66.03 | **55.97** | 71.29 | 82.03 | 68.83 |
| [LAION-CLIP-ViT-H-14](https://huggingface.co/laion/CLIP-ViT-H-14-frozen-xlm-roberta-large-laion5B-s13B-b90k) | 1193M | 72.64 | 47.67 | 70.62 | 85.88 | 69.20 |
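
The Avg column is consistent with an unweighted mean of the four benchmark scores. The following is a small sketch that recomputes it for three rows copied from the table above; the dictionary names and the `mean_score` helper are illustrative only and are not part of the WAON codebase.

```python
# Scores copied from the table above, in column order:
# (XM3600, ImageNet, Recruit, WAON-Bench).
scores = {
    "siglip2-base-patch16-256 (fine-tuned on WAON)": (73.75, 49.61, 83.14, 94.97),
    "siglip2-base-patch16-256": (38.28, 48.12, 76.98, 87.81),
    "clip-japanese-base": (78.00, 48.90, 81.65, 90.05),
}

# Avg values as reported in the table above.
reported_avg = {
    "siglip2-base-patch16-256 (fine-tuned on WAON)": 75.37,
    "siglip2-base-patch16-256": 62.80,
    "clip-japanese-base": 74.65,
}


def mean_score(values):
    """Unweighted mean of the per-benchmark scores (illustrative helper)."""
    return sum(values) / len(values)


recomputed = {name: mean_score(values) for name, values in scores.items()}
for name, avg in recomputed.items():
    print(f"{name}: {avg:.2f} (reported: {reported_avg[name]:.2f})")
```

Because the average weights all four benchmarks equally, a large gain on one benchmark (for example XM3600 after fine-tuning) can dominate the Avg column.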


## How to Use

Here is a sample code snippet for zero-shot image classification:
```python
import torch
import requests
from PIL import Image
from transformers import AutoProcessor, AutoModel

ckpt = "llm-jp/waon-siglip2-base-patch16-256"
model = AutoModel.from_pretrained(ckpt)
processor = AutoProcessor.from_pretrained(ckpt)

url = "https://upload.wikimedia.org/wikipedia/commons/5/58/Shiba_inu_taiki.jpg"
image = Image.open(requests.get(url, stream=True, headers={"User-Agent": "Mozilla/5.0"}).raw).convert("RGB")
candidate_labels = ["柴犬", "日本猫", "いわし"]

# IMPORTANT: pass `padding="max_length"` and `max_length=64`, because the model was trained with these settings
inputs = processor(text=candidate_labels, images=image, padding="max_length", max_length=64, return_tensors="pt").to(model.device)

with torch.no_grad():
    outputs = model(**inputs)

logits_per_image = outputs.logits_per_image
probs = torch.sigmoid(logits_per_image)
for i, label in enumerate(candidate_labels):
    print(f"prob that image is '{label}': {probs[0][i]:.2%}")
# prob that image is '柴犬': 96.57%
# prob that image is '日本猫': 0.03%
# prob that image is 'いわし': 0.00%
```
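
Note that the snippet above applies a sigmoid, so each label is scored independently and the probabilities need not sum to 1. If a single-label distribution over the candidates is preferred, a softmax over the same logits is one option. The sketch below illustrates the difference using only the standard library; the logits are made-up values chosen for illustration, not real model outputs, and the helper names are hypothetical.

```python
import math


def sigmoid_scores(logits):
    """Independent per-label probabilities; they need not sum to 1."""
    return [1.0 / (1.0 + math.exp(-x)) for x in logits]


def softmax_scores(logits):
    """Probabilities normalized across the candidate labels; they sum to 1."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


candidate_labels = ["柴犬", "日本猫", "いわし"]
# Hypothetical logits for the labels above (NOT real model outputs).
example_logits = [3.3, -8.0, -12.0]

sig = sigmoid_scores(example_logits)
soft = softmax_scores(example_logits)
# Both transforms are monotonic, so the top-ranked label is the same.
best = max(range(len(example_logits)), key=lambda i: example_logits[i])

for label, s, p in zip(candidate_labels, sig, soft):
    print(f"{label}: sigmoid={s:.4f}, softmax={p:.4f}")
print(f"top label: {candidate_labels[best]}")
```

With the tensors from the snippet above, the equivalent normalization would be `torch.softmax(logits_per_image, dim=-1)`. The sigmoid form is useful when none of the candidate labels may apply, since all scores can be low at once.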


For more information, please read the [SigLIP2 documentation](https://huggingface.co/docs/transformers/en/model_doc/siglip2).


## Citation
```bibtex
@misc{sugiura2025waonlargescalehighqualityjapanese,
      title={WAON: Large-Scale and High-Quality Japanese Image-Text Pair Dataset for Vision-Language Models}, 
      author={Issa Sugiura and Shuhei Kurita and Yusuke Oda and Daisuke Kawahara and Yasuo Okabe and Naoaki Okazaki},
      year={2025},
      eprint={2510.22276},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2510.22276}, 
}
```