---
license: apache-2.0
pipeline_tag: image-to-text
library_name: pytorch
language:
- en
tags:
- scene-text-recognition
- STR
- OCR
- artistic-text
- wordart
- WATERec
---

# WATERec-Models: Strong Baseline for WordArt-Oriented Scene Text Recognition

**WATERec** is the strong STR baseline proposed in the paper **"Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods" (ECCV 2026)**. It couples a **NaViT-like RoPE-ViT encoder** that supports **arbitrary-shaped inputs** with an **autoregressive (AR) Transformer decoder**, structurally breaking the bottleneck of fixed-template STR on highly irregular WordArt.

This repository hosts the trained model checkpoints.

- 📄 **Paper (arXiv):** https://arxiv.org/abs/2606.24484
- 💻 **Code:** https://github.com/YesianRohn/WATER
- 🧠 **Model code (OpenOCR-WATERec):** https://github.com/YesianRohn/OpenOCR-WATERec
- 📦 **Datasets (WATER-Data):** https://huggingface.co/datasets/Yesianrohn/WATER-Data

---

## Model Architecture

- **Encoder:** 6-layer Transformer with **RoPE attention**, accepting arbitrary aspect ratios. Inputs are rescaled (aspect-ratio preserving) so the number of `4×4` patch tokens lies in `[64, 256]`; tokens are projected to `d=384` and arranged in row-major order.
- **Decoder:** 2 cross-attention AR Transformer layers, predicting characters one by one under cross-entropy loss. Max text length 25; character set of 94 tokens (digits, letters, common symbols).

This design preserves native aspect ratios, mitigates distortion from fixed-template resizing, and better adapts to curved / vertical / multi-oriented artistic layouts.

---

## Checkpoints

Each file is a standard PyTorch `state_dict` (~112 MB), differing only in the **training data**:

| File | Training data | WordArt-Bench Acc. |
|------|---------------|--------------------|
| `WATERec-R.pth` | WATER-R (real only, 3.2M) | 88.55% |
| `WATERec-S.pth` | WATER-S (synthetic only, 2M) | 80.94% |
| `WATERec-RS.pth` | WATER-R + WATER-S (real + 2M synthetic) | **90.40%** |

`WATERec-RS.pth` is the recommended best model: it is the first result to exceed 90% on WordArt-Bench, surpassing both general-purpose and OCR-specialized VLMs by a large margin.

---

## Usage

We recommend running these checkpoints with the official framework [OpenOCR-WATERec](https://github.com/YesianRohn/OpenOCR-WATERec), which provides the matching model configuration, preprocessing, and inference scripts.

Download the weights:

```bash
# Requires: pip install -U "huggingface_hub[cli]"
hf download Yesianrohn/WATERec-Models --local-dir ./WATERec-Models
```

Load a checkpoint:

```python
import torch

# weights_only=True for safer loading of pickle-based .pth files
state_dict = torch.load("WATERec-RS.pth", map_location="cpu", weights_only=True)

# Build the WATERec model from the OpenOCR-WATERec config, then:
# model.load_state_dict(state_dict)
```

> These `.pth` files contain only model weights; no config is bundled. Use the configs in the
> OpenOCR-WATERec repository to instantiate the architecture before loading the state dict.

---

## License

Released under the **Apache 2.0** license.

---

## Citation

If you use these models in your research, please cite our paper:

```bibtex
@inproceedings{water2026eccv,
  title     = {Advancing WordArt-Oriented Scene Text Recognition: Datasets and Methods},
  author    = {Ye, Xingsong and Du, Yongkun and Zhang, Jiaxin and Zhang, Haojie and Sun, Chong and Li, Chen and Lyu, Jing and Chen, Zhineng},
  booktitle = {European Conference on Computer Vision (ECCV)},
  year      = {2026}
}
```
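
---

## Appendix: Input Rescaling Sketch

The Model Architecture section states that inputs are rescaled, with the aspect ratio preserved, so that the number of `4×4` patch tokens lies in `[64, 256]`. The exact rule is not spelled out in this card, so the following is only a minimal sketch of one way such a target size could be computed. The function name `target_size`, the rounding strategy, and the final clamping loops are assumptions made for illustration; they are not taken from the OpenOCR-WATERec code, which remains the reference for preprocessing.

```python
import math

PATCH = 4          # patch side length, from the model card
MIN_TOKENS = 64    # lower bound on patch tokens, from the model card
MAX_TOKENS = 256   # upper bound on patch tokens, from the model card


def target_size(width, height, patch=PATCH,
                min_tokens=MIN_TOKENS, max_tokens=MAX_TOKENS):
    """Hypothetical helper: return (new_width, new_height, num_tokens).

    Both output dimensions are multiples of `patch`, and the token count
    rows * cols lies in [min_tokens, max_tokens]. The aspect ratio is kept
    as closely as rounding to the patch grid allows.
    """
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")

    # Token count the image would produce at its native resolution.
    native = (width / patch) * (height / patch)

    # A uniform scale factor changes the token count by scale ** 2.
    if native > max_tokens:
        scale = math.sqrt(max_tokens / native)
    elif native < min_tokens:
        scale = math.sqrt(min_tokens / native)
    else:
        scale = 1.0

    cols = max(1, round(width * scale / patch))
    rows = max(1, round(height * scale / patch))

    # Rounding can push the product slightly out of range; nudge the
    # longer side one patch at a time until the budget holds (assumption).
    while rows * cols > max_tokens:
        if cols >= rows:
            cols -= 1
        else:
            rows -= 1
    while rows * cols < min_tokens:
        if cols >= rows:
            cols += 1
        else:
            rows += 1

    return cols * patch, rows * patch, rows * cols


if __name__ == "__main__":
    for w, h in [(128, 32), (400, 100), (16, 16), (32, 400)]:
        print((w, h), "->", target_size(w, h))
```

Under these assumptions, a `128×32` image already yields exactly 256 tokens and is left unchanged, while a tall `32×400` image is shrunk to a narrow vertical grid rather than being squashed into a fixed horizontal template.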