---
library_name: transformers
language:
- fr
- de
license: gpl-3.0
tags:
- ocr
- bloomfilter
- unigram
- impresso
- quality-assessment
- v1.0.6
---

# Model Card for `impresso-project/ocr-quality-assessor-unigram-light`

## Overview

This model is a **lightweight OCR quality assessor** for historical French and German texts. It is a streamlined version of the original [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram), now accessible via a Hugging Face `pipeline` for convenient integration into downstream tasks.

It uses **Bloom filters** containing known word unigrams to evaluate text quality by measuring the proportion of known vs. unknown words in OCR outputs. It is part of the [Impresso Project](https://impresso-project.ch), which develops tools for media archive processing and exploration.

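
The scoring idea described above can be sketched in a few lines. This is an illustration only, not the model's actual code: a plain Python `set` stands in for the Bloom filter, the function name `known_word_ratio` and the toy lexicon are hypothetical, and it assumes every token occurrence is counted.

```python
def known_word_ratio(tokens, lexicon):
    """Return the share of tokens found in the lexicon (0.0 for empty input)."""
    if not tokens:
        return 0.0
    known = sum(1 for token in tokens if token in lexicon)
    return known / len(tokens)


# Hypothetical toy lexicon standing in for a Bloom filter of known unigrams.
lexicon = {"le", "royaume", "de", "france"}

clean = ["le", "royaume", "de", "france"]
noisy = ["le", "roy4ume", "de", "frcmce"]  # simulated OCR errors

clean_score = known_word_ratio(clean, lexicon)
noisy_score = known_word_ratio(noisy, lexicon)
print(clean_score, noisy_score)
```

A real Bloom filter answers the same membership question in far less memory than a set, at the cost of occasional false positives.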
## Model Details

### Model Description

- **Developed by:** University of Zurich (UZH), as part of the [Impresso team](https://impresso-project.ch). Impresso is an interdisciplinary research project focused on historical media analysis across languages, time, and modalities. It is funded by the Swiss National Science Foundation ([CRSII5_173719](http://p3.snf.ch/project-173719), [CRSII5_213585](https://data.snf.ch/grants/grant/213585)) and the Luxembourg National Research Fund (grant No. 17498891).
- **Model type:** Bloom filter–based scoring via a Transformers-compatible pipeline
- **Languages:** French (fr), German (de)
- **License:** GPL-3.0
- **Base resource:** [`impresso-project/OCR-quality-assessment-unigram`](https://huggingface.co/impresso-project/OCR-quality-assessment-unigram)
- **Interface:** Hugging Face `transformers` pipeline
- **Input format:** Raw text string
- **Output format:** Float score representing OCR quality

## How to Use

Make sure `torch`, `transformers`, and `pybloomfiltermmap3` (which provides the Bloom filter implementation) are installed:

```bash
pip install torch transformers pybloomfiltermmap3==0.6.0
```

```python
from transformers import pipeline

MODEL_NAME = "impresso-project/ocr-quality-assessor-unigram-light"

# trust_remote_code=True is needed because the pipeline is defined by
# custom code shipped in the model repository.
ocrqa_pipeline = pipeline(
    "ocr-qa-assessment",
    model=MODEL_NAME,
    trust_remote_code=True,
    device="cpu",
)

sentence = """En l'an 1348, au plus fort des ravages de la peste noire à travers l'Europe,
le Royaume de France se trouvait à la fois au bord du désespoir et face à une opportunité."""

score = ocrqa_pipeline(sentence)
print(score)
```

## Output Format

The pipeline returns a dictionary whose `ocr_quality_score` entry is a float between 0 and 1 giving the proportion of known tokens:

```python
{'ocr_quality_score': 0.76}
```

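
A result shaped like the example output above can be consumed as follows. This is a minimal sketch: the helper `is_acceptable`, the threshold of 0.8, and the segment names and scores are all hypothetical and should be adapted to your own material.

```python
def is_acceptable(result, threshold=0.8):
    """Return True when the reported score reaches the threshold.

    The default threshold is an arbitrary assumption, not a recommended value.
    """
    return result["ocr_quality_score"] >= threshold


# Made-up results in the same shape as the example output.
results = {
    "segment-1": {"ocr_quality_score": 0.76},
    "segment-2": {"ocr_quality_score": 0.93},
}

# Collect the segments that fall below the threshold for manual review.
flagged = [name for name, result in results.items() if not is_acceptable(result)]
print(flagged)
```

Keeping the threshold as a parameter makes it easy to tune per collection, since a suitable cut-off depends on the language, period, and print quality of the source.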
## Use Cases

- OCR pipeline evaluation and quality diagnostics
- Automated scoring of OCR segments or lines
- Quick feedback in web-based transcription and correction tools

## Dataset and Preprocessing

The Bloom filters used internally are derived from:

- Wikipedia dumps (historical and modern)
- Impresso-specific lexical resources

Text normalization includes:

- Unicode NFKC normalization
- Digit masking (each digit is replaced by `0`)
- Punctuation and symbol removal
- Lowercasing

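
The normalization steps listed above can be approximated with the standard library. This is a sketch under assumptions, not the model's own preprocessing: the function name `normalize` is hypothetical, punctuation and symbols are replaced by spaces rather than deleted, and tokens are obtained by whitespace splitting.

```python
import unicodedata


def normalize(text):
    """Apply NFKC, digit masking, punctuation/symbol removal, and lowercasing."""
    text = unicodedata.normalize("NFKC", text)
    chars = []
    for ch in text:
        category = unicodedata.category(ch)
        if category == "Nd":
            chars.append("0")  # mask every decimal digit with 0
        elif category[0] in ("P", "S"):
            chars.append(" ")  # drop punctuation and symbols (assumed: as spaces)
        else:
            chars.append(ch)
    # Lowercase, then split on whitespace (tokenization is an assumption).
    return "".join(chars).lower().split()


tokens = normalize("En l'an 1348, la Peste!")
print(tokens)  # ['en', 'l', 'an', '0000', 'la', 'peste']
```

Masking digits means that all four-digit years map to the same token `0000`, so a lexicon does not need an entry for every number.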
## Limitations

- Currently supports only **French** and **German**
- Does not provide spell correction suggestions
- False positives are possible: a Bloom filter can occasionally report an unknown word as known, which may slightly inflate the score
- Quality score is approximate and works best at the **segment** or **line** level

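
The false-positive limitation can be quantified with the standard Bloom filter estimate, (1 - e^(-kn/m))^k, for a filter of m bits and k hash functions holding n items. The sketch below only illustrates that formula. The function name and the parameter values are hypothetical, since this card does not state the sizes of the filters actually used.

```python
import math


def bloom_false_positive_rate(n_items, n_bits, n_hashes):
    """Approximate false-positive probability of a Bloom filter."""
    return (1.0 - math.exp(-n_hashes * n_items / n_bits)) ** n_hashes


# Hypothetical parameters: the same filter, lightly and heavily filled.
sparse = bloom_false_positive_rate(n_items=100, n_bits=1000, n_hashes=3)
dense = bloom_false_positive_rate(n_items=500, n_bits=1000, n_hashes=3)
print(f"{sparse:.4f} {dense:.4f}")
```

The estimate grows as more items are stored in a filter of fixed size. A Bloom filter never produces false negatives, so a word that was inserted is always reported as known.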
## Environmental Impact

- **Hardware:** Standard laptop / CPU inference
- **Training:** Reuse of existing Bloom filters; minimal additional compute
- **Estimated Emissions:** < 0.01 kg CO₂eq

## Contact

- Website: [https://impresso-project.ch](https://impresso-project.ch)

<p align="center">
<img src="https://github.com/impresso/impresso.github.io/blob/master/assets/images/3x1--Yellow-Impresso-Black-on-White--transparent.png?raw=true" width="300" alt="Impresso Logo"/>
</p>