---
license: apache-2.0
pipeline_tag: image-to-text
library_name: pytorch
language:
  - en
tags:
  - scene-text-recognition
  - STR
  - OCR
  - artistic-text
  - wordart
  - WATERec
---

# WATERec-Models: Strong Baseline for WordArt-Oriented Scene Text Recognition

**WATERec** is the strong STR baseline proposed in the paper **"Advancing WordArt-Oriented Scene
Text Recognition: Datasets and Methods" (ECCV 2026)**. It couples a **NaViT-like RoPE-ViT encoder**
that supports **arbitrary-shaped inputs** with an **autoregressive (AR) Transformer decoder**.
Because inputs are no longer forced into a fixed-size template, the model avoids a structural
bottleneck of conventional STR on highly irregular WordArt.

This repository hosts the trained model checkpoints.

- 📄 **Paper (arXiv):** https://arxiv.org/abs/2606.24484
- 💻 **Code:** https://github.com/YesianRohn/WATER
- 🧠 **Model code (OpenOCR-WATERec):** https://github.com/YesianRohn/OpenOCR-WATERec
- 📦 **Datasets (WATER-Data):** https://huggingface.co/datasets/Yesianrohn/WATER-Data

---

## Model Architecture

- **Encoder:** 6-layer Transformer with **RoPE attention**, accepting arbitrary aspect ratios.
  Inputs are rescaled (aspect-ratio preserving) so the number of `4×4` patch tokens lies in
  `[64, 256]`; tokens are projected to `d=384` and arranged in row-major order.
- **Decoder:** 2 cross-attention AR Transformer layers, predicting characters one by one under
  cross-entropy loss. Max text length 25; character set of 94 tokens (digits, letters, common
  symbols).
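
The encoder's rescaling rule comes down to a little arithmetic. The sketch below shows one way to
pick a patch grid under the stated constraints (`4×4` patches, token count in `[64, 256]`, aspect
ratio preserved). The names `token_grid` and `resized_size` and the rounding strategy are
assumptions made for illustration; the official preprocessing in OpenOCR-WATERec is authoritative
and may round differently.

```python
import math


def token_grid(height, width, patch=4, min_tokens=64, max_tokens=256):
    """Return (rows, cols) of patch tokens after an aspect-ratio-preserving resize.

    Illustrative only: the rounding and nudging rules here are assumptions.
    """
    if height <= 0 or width <= 0:
        raise ValueError("height and width must be positive")
    rows, cols = height / patch, width / patch
    n = rows * cols
    # Clamp the token budget, then scale both sides by the same factor.
    target = min(max(n, min_tokens), max_tokens)
    scale = math.sqrt(target / n)
    r = max(1, round(rows * scale))
    c = max(1, round(cols * scale))
    # Rounding can leave the product slightly out of range; nudge the longer side.
    # For extreme aspect ratios this trades a little ratio fidelity for the budget.
    while r * c > max_tokens:
        if c >= r:
            c -= 1
        else:
            r -= 1
    while r * c < min_tokens:
        if c >= r:
            c += 1
        else:
            r += 1
    return r, c


def resized_size(height, width, patch=4, **kwargs):
    """Return the (height, width) in pixels that yields the grid from token_grid."""
    r, c = token_grid(height, width, patch, **kwargs)
    return r * patch, c * patch
```

For example, a 64×256 crop holds 1024 patches, so both sides are halved to 32×128 (an 8×32 grid of
256 tokens), while a 16×16 crop is upscaled to 32×32 (an 8×8 grid of 64 tokens).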

This design preserves native aspect ratios, mitigates distortion from fixed-template resizing, and
better adapts to curved / vertical / multi-oriented artistic layouts.
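
On the decoder side, character-by-character prediction with a maximum length of 25 can be sketched
as a greedy loop. This is a simplified illustration, not the repository's decoding code: `step_fn`
is a hypothetical stand-in for the decoder (which would cross-attend to the encoder tokens), the
`EOS` id is an assumed special token, and the character ordering is a guess that merely matches the
stated count of 94.

```python
import string

# Assumption: digits + letters + punctuation gives 10 + 52 + 32 = 94 symbols.
# The real charset ordering in OpenOCR-WATERec may differ.
CHARSET = string.digits + string.ascii_letters + string.punctuation
MAX_LEN = 25
EOS = len(CHARSET)  # hypothetical end-of-sequence id, placed after the characters


def greedy_decode(step_fn, max_len=MAX_LEN):
    """Decode one string by repeatedly taking the highest-scoring next id.

    step_fn(prefix_ids) must return a list of len(CHARSET) + 1 scores
    (hypothetical interface; the last entry scores EOS).
    """
    ids = []
    for _ in range(max_len):
        scores = step_fn(ids)
        next_id = max(range(len(scores)), key=scores.__getitem__)
        if next_id == EOS:
            break
        ids.append(next_id)
    return "".join(CHARSET[i] for i in ids)


def toy_step(prefix_ids, word="Art!"):
    """Toy stand-in for the decoder: always spells `word`, then emits EOS."""
    scores = [0.0] * (len(CHARSET) + 1)
    done = len(prefix_ids) >= len(word)
    scores[EOS if done else CHARSET.index(word[len(prefix_ids)])] = 1.0
    return scores
```

Calling `greedy_decode(toy_step)` returns the toy word; with a real model, `step_fn` would run the
decoder layers on the current prefix and return the logits for the next position.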

---

## Checkpoints

Each file is a standard PyTorch `state_dict` (~112 MB). The checkpoints differ only in their
**training data**:

| File | Training data | WordArt-Bench Acc. |
|------|---------------|--------------------|
| `WATERec-R.pth` | WATER-R (real only, 3.2M) | 88.55% |
| `WATERec-S.pth` | WATER-S (synthetic only, 2M) | 80.94% |
| `WATERec-RS.pth` | WATER-R + WATER-S (3.2M real + 2M synthetic) | **90.40%** |

`WATERec-RS.pth` is the recommended checkpoint: it is the first result to exceed 90% on
WordArt-Bench and surpasses both general-purpose and OCR-specialized VLMs by a large margin.

---

## Usage

We recommend running these checkpoints with the official framework
[OpenOCR-WATERec](https://github.com/YesianRohn/OpenOCR-WATERec), which provides the matching model
configuration, preprocessing, and inference scripts.

Download the weights:

```bash
# Requires: pip install -U "huggingface_hub[cli]"
hf download Yesianrohn/WATERec-Models --local-dir ./WATERec-Models
```
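
The same download can be scripted from Python. This is a minimal sketch that assumes the three
files listed under Checkpoints sit at the repository root; `CHECKPOINTS`, `checkpoint_filename`,
and `download_checkpoint` are illustrative names, while `hf_hub_download` is the `huggingface_hub`
function for fetching a single file.

```python
REPO_ID = "Yesianrohn/WATERec-Models"

# Variant name -> file name, taken from the Checkpoints table.
# Assumption: the files are stored at the top level of the repository.
CHECKPOINTS = {
    "R": "WATERec-R.pth",
    "S": "WATERec-S.pth",
    "RS": "WATERec-RS.pth",
}


def checkpoint_filename(variant="RS"):
    """Map a variant name ("R", "S", "RS") to its checkpoint file name."""
    try:
        return CHECKPOINTS[variant.upper()]
    except KeyError:
        raise ValueError(
            f"unknown variant {variant!r}; expected one of {sorted(CHECKPOINTS)}"
        ) from None


def download_checkpoint(variant="RS", local_dir=None):
    """Download one checkpoint and return its local path."""
    # Imported lazily so the helpers above work without huggingface_hub installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=REPO_ID,
        filename=checkpoint_filename(variant),
        local_dir=local_dir,
    )
```

Fetching a single file this way avoids pulling all three checkpoints when only one is needed.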

Load a checkpoint:

```python
import torch

# weights_only=True for safer loading of pickle-based .pth files
state_dict = torch.load("WATERec-RS.pth", map_location="cpu", weights_only=True)
# Build the WATERec model from the OpenOCR-WATERec config, then:
# model.load_state_dict(state_dict)
```

> These `.pth` files contain only model weights; no config is bundled. Use the configs in the
> OpenOCR-WATERec repository to instantiate the architecture before loading the state dict.
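
Because no config is bundled, it can help to inspect a checkpoint before loading it. The helpers
below are a sketch with hypothetical names (`summarize_state_dict`, `diff_keys`); they rely only on
the `numel()` and `element_size()` methods that PyTorch tensors provide, and they assume the file
is a flat mapping from parameter names to tensors, as described above.

```python
def summarize_state_dict(state_dict):
    """Return (num_tensors, num_params, size_mb) for a name -> tensor mapping."""
    num_params = sum(t.numel() for t in state_dict.values())
    num_bytes = sum(t.numel() * t.element_size() for t in state_dict.values())
    return len(state_dict), num_params, num_bytes / 1e6


def diff_keys(model_keys, checkpoint_keys):
    """Return (missing, unexpected) keys, i.e. what strict loading would complain about.

    missing:    keys the model expects but the checkpoint lacks
    unexpected: keys in the checkpoint that the model does not define
    """
    model_keys, checkpoint_keys = set(model_keys), set(checkpoint_keys)
    return sorted(model_keys - checkpoint_keys), sorted(checkpoint_keys - model_keys)
```

With a checkpoint loaded as `state_dict`, `summarize_state_dict(state_dict)` gives a quick size
check, and `diff_keys(model.state_dict().keys(), state_dict.keys())` shows whether the instantiated
architecture matches before `model.load_state_dict(state_dict)` is called.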

---

## License

Released under the **Apache 2.0** license.

---

## Citation

If you use these models in your research, please cite our paper:

```bibtex
@inproceedings{water2026eccv,
  title     = {Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods},
  author    = {Ye, Xingsong and Du, Yongkun and Zhang, Jiaxin and Zhang, Haojie and Sun, Chong and Li, Chen and Lyu, Jing and Chen, Zhineng},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```