Spaces:
Sleeping
title: Picarones
emoji: π
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
Picarones
OCR/HTR Benchmarking Platform for Heritage Documents
Picarones is an open-source platform for rigorously comparing OCR and HTR engines (Tesseract, Pero OCR, Kraken, cloud APIs...) and OCR+LLM pipelines on historical document corpora -- manuscripts, early printed books, and archives.
It provides heritage-specific metrics (diplomatic CER, ligature scores, medieval abbreviation handling), composable OCR+LLM pipelines, a fully templated interactive HTML report with a factual narrative synthesis, Friedman/Nemenyi multi-engine significance tests with a critical difference diagram, cost/speed Pareto analysis, a contextual glossary, an advanced customization panel, and multiple corpus import sources including IIIF, Gallica, HuggingFace, HTR-United, and eScriptorium.
Table of Contents
- Features
- Quick Start
- Installation
- Usage
- Supported Engines
- Normalization Profiles
- Error Taxonomy
- Project Structure
- Environment Variables
- CI/CD
- Development
- Roadmap
- Contributing
- License
Features
Heritage-Specific Metrics
- CER (Character Error Rate) in four variants: raw, NFC-normalized, caseless, and diplomatic (historical equivalences: long s = s, u = v, i = j, etc.)
- WER, MER, WIL with historical-aware tokenization (via jiwer)
- Unicode confusion matrix -- fingerprint each engine's character-level errors
- Ligature and diacritic scores -- track handling of fi, fl, ff, oe, ae, p-bar, and other medieval glyphs
- 10-class error taxonomy -- automatic classification of every error (visual confusion, abbreviation, segmentation, lacuna, over-normalization, etc.)
- Bootstrap 95% confidence intervals, Wilcoxon signed-rank tests, and the Friedman test + Nemenyi post-hoc with a Critical Difference Diagram (DemΕ‘ar 2006) for rigorous multi-engine comparison
- Intrinsic difficulty score per document, independent of engine performance
- Line-level error distribution with Gini coefficient and percentile analysis
- VLM hallucination detection -- anchor score and length ratio to flag fabricated output
- Cost / speed / carbon Pareto front (local vs cloud, per-token pricing model)
OCR+LLM Pipelines
- Composable chains:
tesseract -> gpt-4o,pero_ocr -> claude-sonnet, zero-shot VLM, etc. - Three pipeline modes: text-only post-correction, image+text post-correction, and zero-shot
- Over-normalization detection -- does the LLM silently modernize historical spellings?
- Versioned prompt library for medieval French, early modern French, medieval Latin, medieval English, and early modern English -- both correction and zero-shot variants
Corpus Import
| Source | Method |
|---|---|
| Local folder | picarones run --corpus ./corpus/ |
| IIIF manifests (Gallica, Bodleian, BL...) | picarones import iiif <manifest-url> |
| Gallica API (SRU + OCR) | GallicaClient / picarones import iiif |
| HuggingFace Datasets | picarones import hf <dataset-id> |
| HTR-United catalogue | picarones import htr-united |
| eScriptorium | EScriptoriumClient |
| ZIP upload (browser) | Web interface upload endpoint |
Supported corpus formats: plain text pairs (image + ground truth), ALTO XML, and PAGE XML.
Interactive HTML Report
- Self-contained HTML file -- works offline, no server needed (Jinja2-templated since Sprint 17)
- Factual narrative synthesis at the top of the report (Sprint 19): 12 deterministic detectors extract salient facts (global leader, significant gap, stratum collapse, VLM hallucination flag, speed winner, cost outlier, Pareto alternative, ...) and render them as short sentences -- every number is traceable to the source payload, no LLM, no hallucination risk
- Critical Difference Diagram (CDD) rendered server-side as static SVG -- no JS required
- Cost / speed / carbon Pareto chart with toggleable axes and highlighted Pareto front
- Contextual glossary: a
?icon next to every metric header opens a side panel with definition, what it measures, usage, limits, and reference (25 bilingual entries) - Advanced mode panel: visible-column picker, per-stratum filter, and opt-in personal composite score (sliders default to 0, formula always visible, explicit warning that no universal weighting exists). State is persisted in the URL.
- Sortable ranking table, radar charts, histograms (powered by Chart.js)
- Gallery view with dynamic filters and color-coded CER badges
- GitHub-style colored diff with synchronized N-way scrolling
- Triple diff view for OCR+LLM: ground truth / raw OCR / post-correction
- Unicode character view: interactive confusion matrix explorer
- Export to CSV, JSON, ALTO XML, PAGE XML, and annotated images
Longitudinal Tracking & Robustness
- Optional SQLite database to record benchmark history across runs
- CER evolution curves over time, per engine
- Automatic regression detection between consecutive runs
- Robustness analysis: measure engine resilience to noise, blur, rotation, resolution reduction, and binarization
- Critical degradation threshold identification
Web Interface
- FastAPI application with real-time Server-Sent Events (SSE) progress streaming
- Upload corpus as a ZIP file directly from the browser
- Dynamic engine and normalization profile selectors
- Browse and re-download generated HTML reports
- Bilingual French/English interface
- Deployable on HuggingFace Spaces (Docker, port 7860)
Quick Start
# Clone and install
git clone https://github.com/maribakulj/Picarones.git
cd Picarones
pip install -e .
# Install Tesseract (system binary, required for the Tesseract engine)
# Ubuntu/Debian
sudo apt install tesseract-ocr tesseract-ocr-fra tesseract-ocr-lat
# macOS
brew install tesseract
# Generate a demo report (no OCR engine needed)
picarones demo --output demo_report.html
# List available engines
picarones engines
# Run a benchmark
picarones run --corpus ./corpus/ --engines tesseract --output results.json
# Generate HTML report
picarones report --results results.json --output report.html
# Launch the web interface
picarones serve --port 8080
Installation
From Source
git clone https://github.com/maribakulj/Picarones.git
cd Picarones
pip install -e ".[dev,web]" # includes test and web dependencies
System requirements:
- Python >= 3.11
- Tesseract OCR 5 (for the Tesseract engine)
Docker
docker build -t picarones .
docker run -p 7860:7860 \
-e MISTRAL_API_KEY=... \
-e OPENAI_API_KEY=... \
picarones
The Docker image is based on Python 3.11-slim, includes Tesseract 5 with language packs
(fra, lat, eng, deu, ita, spa), and runs as a non-root user. A health check polls
/health every 30 seconds.
The HuggingFace Space uses this same Docker image.
Optional Extras
| Extra | Install command | What it adds |
|---|---|---|
dev |
pip install -e ".[dev]" |
pytest, pytest-cov, httpx, FastAPI, uvicorn, python-multipart |
web |
pip install -e ".[web]" |
FastAPI, uvicorn, python-multipart, httpx |
stats |
pip install -e ".[stats]" |
scipy (exact Wilcoxon/Friedman/Nemenyi -- otherwise pure-Python fallback) |
llm |
pip install -e ".[llm]" |
OpenAI, Anthropic, Mistral SDKs |
hf |
pip install -e ".[hf]" |
HuggingFace Datasets |
pero |
pip install -e ".[pero]" |
Pero OCR engine |
kraken |
pip install -e ".[kraken]" |
Kraken engine |
ocr-cloud |
pip install -e ".[ocr-cloud]" |
Google Vision, AWS (boto3), Azure Doc Intelligence |
all |
pip install -e ".[all]" |
web + hf + llm + dev (no ocr-cloud) |
See INSTALL.md for detailed instructions on Linux, macOS, Windows, and Docker.
Usage
CLI Commands
| Command | Description |
|---|---|
picarones run |
Run a full benchmark on a corpus |
picarones report |
Generate an HTML report from JSON results |
picarones demo |
Generate a demo report with synthetic data (no engine required) |
picarones metrics |
Calculate CER/WER between two text files |
picarones engines |
List all available OCR engines and LLM adapters |
picarones info |
Display version and system information |
picarones serve |
Launch the FastAPI web interface |
picarones history |
Query longitudinal benchmark history (SQLite) |
picarones robustness |
Run robustness analysis with degraded images |
picarones import iiif |
Import corpus from an IIIF manifest (Gallica, Bodleian, BL, BSB...). HTR-United and HuggingFace imports are exposed through the web interface (/api/htr-united/import, /api/huggingface/import). |
Examples:
# Benchmark with Tesseract, French language, PSM 6
picarones run --corpus ./manuscripts/ --engines tesseract --lang fra --psm 6 \
--output results.json --verbose
# Compare two text files
picarones metrics --reference ground_truth.txt --hypothesis ocr_output.txt
# Import 10 pages from a Gallica IIIF manifest
picarones import iiif https://gallica.bnf.fr/ark:/12148/xxx/manifest.json --pages 1-10
# HuggingFace and HTR-United imports are available via the web UI at
# http://localhost:8000/ (endpoints POST /api/huggingface/import and /api/htr-united/import)
# View benchmark history with regression detection
picarones history --engine tesseract --regression
# Robustness demo (noise, blur, rotation, resolution)
picarones robustness --corpus ./gt/ --engine tesseract --demo
# Fail CI if CER exceeds threshold
picarones run --corpus ./corpus/ --engines tesseract --fail-if-cer-above 0.15
Web Interface
picarones serve --host 0.0.0.0 --port 8080
API endpoints include:
| Endpoint | Method | Description |
|---|---|---|
/ |
GET | Main single-page application |
/api/status |
GET | Version and application status |
/api/engines |
GET | Available OCR/LLM engines |
/api/normalization/profiles |
GET | Normalization profiles (read dynamically) |
/api/benchmark/start |
POST | Start a benchmark job (returns job_id) |
/api/benchmark/{job_id}/stream |
GET | SSE real-time progress stream |
/api/benchmark/{job_id}/cancel |
POST | Cancel a running benchmark |
/api/corpus/browse |
GET | Browse server-side corpus folders |
/api/htr-united/catalogue |
GET | Browse HTR-United catalogue |
/api/huggingface/search |
GET | Search HuggingFace datasets |
/reports/{filename} |
GET | Download generated HTML reports |
Pipeline Modes
Picarones supports three modes for OCR+LLM pipelines:
| Mode | Description | Model type |
|---|---|---|
zero_shot |
LLM receives the image directly and transcribes without prior OCR | VLM (vision) |
post_correction_texte |
OCR produces raw text, then LLM corrects it | Text-only LLM |
post_correction_image_texte |
OCR produces raw text, then LLM receives both image and text for correction | VLM (vision) |
Example: ministral-3b-latest is a text-only model and should use post_correction_texte.
GPT-4o and Claude support all three modes.
Supported Engines
| Engine | Type | Execution Mode | Installation |
|---|---|---|---|
| Tesseract 5 | Local CLI | CPU (ProcessPool) | pip install pytesseract + system binary |
| Pero OCR | Local Python | CPU (ProcessPool) | pip install pero-ocr |
| Kraken | Local Python | CPU (ProcessPool) | pip install kraken |
| Mistral OCR | Cloud API | IO (ThreadPool) | MISTRAL_API_KEY env var |
| Google Vision | Cloud API | IO (ThreadPool) | GOOGLE_APPLICATION_CREDENTIALS env var |
| Azure Doc Intelligence | Cloud API | IO (ThreadPool) | AZURE_DOC_INTEL_ENDPOINT + AZURE_DOC_INTEL_KEY |
| GPT-4o (VLM) | LLM API | IO (ThreadPool) | OPENAI_API_KEY env var |
| Claude Sonnet (VLM) | LLM API | IO (ThreadPool) | ANTHROPIC_API_KEY env var |
| Mistral Large (LLM) | LLM API | IO (ThreadPool) | MISTRAL_API_KEY env var |
| Ollama (local LLM) | Local LLM | IO (ThreadPool) | ollama serve running locally |
| Custom engine | CLI or API | Configurable | YAML declaration, no code required |
Engines declare their execution_mode ("io" or "cpu"), allowing the runner to use
ThreadPoolExecutor for IO-bound engines and ProcessPoolExecutor for CPU-bound engines
simultaneously.
Normalization Profiles
Picarones ships 11 built-in normalization profiles designed for historical text comparison.
These reduce noise from expected orthographic variation so metrics reflect genuine OCR errors,
not historical spelling differences. The canonical list is defined in
picarones/core/normalization.py (NORMALIZATION_PROFILES)
and is exposed dynamically via /api/normalization/profiles.
| Profile | Period | Key equivalences |
|---|---|---|
nfc |
Any | Unicode NFC normalization only |
caseless |
Any | NFC + case folding (casefold) |
minimal |
Any | NFC + long s (ΕΏ -> s) |
medieval_french |
12th-15th c. | ΕΏ=s, u=v, i=j, y=i, Γ¦=ae, Ε=oe, κ=per, & = et |
early_modern_french |
16th-18th c. | ΕΏ=s, Γ¦=ae, Ε=oe |
medieval_latin |
12th-15th c. | ΕΏ=s, u=v, i=j, κ=per, κ=pro |
medieval_english |
12th-15th c. | ΕΏ=s, u=v, i=j, ΓΎ=th, Θ=y, κ=per, κ=pro |
early_modern_english |
16th-18th c. | ΕΏ=s, u=v, i=j, vv=w, ΓΎ=th, Γ°=th, Θ=y |
secretary_hand |
16th-17th c. | Early Modern English + secretary hand visual confusions |
sans_ponctuation |
Any | NFC + strips . , ; : ! ? ' " - β β ( ) [ ] |
sans_apostrophes |
Any | NFC + strips straight (') and typographic (β) apostrophes |
Custom profiles can be loaded from YAML files with user-defined diplomatic tables and/or
exclude_chars sets.
Error Taxonomy
Every character-level error is automatically classified into one of 10 categories:
| Class | Name | Description |
|---|---|---|
| 1 | visual_confusion |
Morphologically similar characters (rn/m, l/1, O/0, u/n) |
| 2 | diacritic_error |
Missing, incorrect, or spurious diacritical mark |
| 3 | case_error |
Case difference only (A/a) |
| 4 | ligature_error |
Ligature not resolved or incorrectly resolved |
| 5 | abbreviation_error |
Medieval abbreviation not expanded |
| 6 | hapax |
Word not found in any reference lexicon |
| 7 | segmentation_error |
Token fusion or fragmentation (words/lines) |
| 8 | oov_character |
Character outside the engine's vocabulary |
| 9 | lacuna |
Text present in ground truth but absent from OCR output |
| 10 | over_normalization |
LLM silently modernized a historical spelling |
Project Structure
picarones/
βββ __init__.py # Version (1.0.0), package metadata
βββ __main__.py # `python -m picarones`
βββ cli.py # Click CLI: run, demo, report, metrics, engines, info,
β # serve, import iiif, history, robustness
βββ fixtures.py # Realistic synthetic test data (medieval documents)
βββ i18n.py # Back-compat shim loading report/i18n/{fr,en}.json
β
βββ core/
β βββ corpus.py # Corpus loading (folder, ALTO XML, PAGE XML)
β βββ metrics.py # CER, WER, MER, WIL (via jiwer)
β βββ normalization.py # Unicode normalization, 11 diplomatic/exclusion profiles
β βββ statistics.py # Bootstrap CI, Wilcoxon, Friedman, Nemenyi, CDD SVG
β βββ runner.py # Benchmark orchestrator (ThreadPool + ProcessPool)
β βββ results.py # DocumentResult, BenchmarkResults, JSON export
β βββ confusion.py # Unicode confusion matrix
β βββ char_scores.py # Ligature and diacritic scores
β βββ taxonomy.py # 10-class error taxonomy
β βββ structure.py # Structural analysis (blocks, lines, words)
β βββ image_quality.py # Image quality metrics (contrast, noise, resolution)
β βββ difficulty.py # Intrinsic difficulty score per document
β βββ hallucination.py # VLM hallucination detection
β βββ line_metrics.py # Line-level error distribution (Gini, percentiles)
β βββ history.py # SQLite longitudinal tracking
β βββ robustness.py # Robustness analysis (noise, blur, rotation, resolution)
β βββ pricing.py # Cost model, EngineCost, Pareto front
β βββ narrative/ # Factual narrative engine (Sprint 16-19)
β βββ facts.py # Fact model, 12 FactType, DetectorRegistry
β βββ detectors.py # 12 detectors (global_leader_cer, significant_gap,
β β # stratum_winner/collapse, error_profile_outlier,
β β # llm_hallucination_flag, robustness_fragile,
β β # speed_winner, confidence_warning,
β β # statistical_tie, pareto_alternative, cost_outlier)
β βββ arbiter.py # Sort by importance, dedup, anti-contradiction
β βββ renderer.py # YAML template rendering via str.format_map
β βββ templates/{fr,en}.yaml
β
βββ data/
β βββ pricing.yaml # Indicative cost table (OCR local/cloud + LLM)
β
βββ engines/
β βββ base.py # BaseOCREngine (execution_mode: "io" | "cpu")
β βββ tesseract.py # Tesseract 5 adapter (CPU)
β βββ pero_ocr.py # Pero OCR adapter (CPU)
β βββ mistral_ocr.py # Mistral OCR API (/v1/ocr endpoint)
β βββ google_vision.py # Google Cloud Vision adapter
β βββ azure_doc_intel.py # Azure Document Intelligence adapter
β
βββ llm/
β βββ base.py # BaseLLMAdapter interface
β βββ openai_adapter.py # OpenAI / GPT-4o adapter
β βββ anthropic_adapter.py # Anthropic / Claude adapter
β βββ mistral_adapter.py # Mistral chat completions adapter
β βββ ollama_adapter.py # Ollama local LLM adapter
β
βββ pipelines/
β βββ base.py # OCRLLMPipeline orchestrator
β βββ over_normalization.py # Over-normalization detection
β
βββ prompts/ # 8 versioned prompt templates
β βββ correction_medieval_french.txt
β βββ correction_image_medieval_french.txt
β βββ correction_imprime_ancien.txt
β βββ correction_medieval_english.txt
β βββ correction_early_modern_english.txt
β βββ zero_shot_medieval_french.txt
β βββ zero_shot_imprime_ancien.txt
β βββ zero_shot_medieval_english.txt
β
βββ report/
β βββ generator.py # Orchestrates Jinja2 rendering (617 lines since Sprint 17)
β βββ diff_utils.py # Diff computation utilities
β βββ templates/ # Jinja2 partials (Sprint 17)
β β βββ base.html.j2 # assembles everything via {% include %}
β β βββ _header.html, _footer.html, _styles.css, _app.js
β β βββ _critical_difference.html, _narrative_summary.html, _side_panels.html
β β βββ view_ranking.html, view_gallery.html, view_document.html,
β β view_analyses.html, view_characters.html
β βββ i18n/ # FR/EN translations (Sprint 17 -- extracted from i18n.py)
β β βββ fr.json
β β βββ en.json
β βββ glossary/ # Contextual glossary (Sprint 21)
β β βββ fr.yaml # 25 bilingual entries (definition, measures, usage,
β β βββ en.yaml # limits, reference)
β βββ vendor/ # Vendored Chart.js
β
βββ web/
β βββ app.py # FastAPI app (SSE, ZIP upload, dynamic endpoints)
β βββ static/ # CSS assets
β
βββ importers/
βββ iiif.py # IIIF manifest importer
βββ gallica.py # Gallica API client (BnF)
βββ htr_united.py # HTR-United catalogue importer
βββ huggingface.py # HuggingFace Datasets importer
βββ escriptorium.py # eScriptorium client
docs/ # User + developer documentation (Sprint 22)
βββ case-studies/ # Two labelled case studies ("Cas d'Γ©cole")
β βββ 01-registres-paroissiaux.md
β βββ 02-edition-critique.md
βββ user/
β βββ reading-a-report.md # Anatomy, suggested reading order, advanced panel
βββ developer/
βββ index.md
βββ narrative-engine.md
βββ extending-glossary.md
βββ extending-i18n.md
tests/ # 1242 tests (1 skipped: scipy optional)
.github/workflows/
βββ ci.yml # CI: Python 3.11/3.12, Linux/macOS/Windows, ruff lint
βββ sync_to_huggingface.yml # Auto-sync to HuggingFace Space on push to main
Dockerfile # Multi-stage Docker build for HuggingFace Spaces
Environment Variables
Configure API keys depending on which engines and LLM adapters you use:
# LLM APIs
export OPENAI_API_KEY="sk-..."
export ANTHROPIC_API_KEY="sk-ant-..."
export MISTRAL_API_KEY="..."
# Cloud OCR APIs (optional)
export GOOGLE_APPLICATION_CREDENTIALS="/path/to/credentials.json"
export AWS_ACCESS_KEY_ID="..."
export AWS_SECRET_ACCESS_KEY="..."
export AWS_DEFAULT_REGION="eu-west-1"
export AZURE_DOC_INTEL_ENDPOINT="https://..."
export AZURE_DOC_INTEL_KEY="..."
For deployment on HuggingFace Spaces, set these in Settings > Variables and secrets.
CI/CD
GitHub Actions (ci.yml)
- Triggers: push to
main/develop/feature/*/sprint/*/claude/*, PRs tomain/develop, manual dispatch - Matrix: Python 3.11 + 3.12 on Linux, macOS, and Windows
- Jobs:
- Tests -- full pytest suite (1242 passing, 1 skipped when scipy is absent) with coverage uploaded to Codecov
- Demo -- end-to-end demo report generation with history and robustness
- Build -- wheel and sdist with twine validation
- Lint --
ruff check picarones/ tests/(E, W, F; ignores E501, E402). The ruff config lives inpyproject.tomlunder[tool.ruff]so CI,make lintand direct invocation all produce the same result -- blocking on F401 / E741.
HuggingFace Sync (sync_to_huggingface.yml)
- Automatically pushes
mainto the HuggingFace SpaceMa-Ri-Ba-Ku/Picarones - Requires the
HF_TOKENsecret in GitHub repository settings
Development
# Install with dev + web dependencies
pip install -e ".[dev,web]"
# Run the test suite
pytest tests/ -q --tb=short
# Run with coverage
pytest tests/ --cov=picarones --cov-report=term-missing
# Generate a demo report
picarones demo --output demo_report.html
# Launch the web UI in development mode
picarones serve --port 8080
# Full refresh (useful in Codespaces)
git pull && pip install -e ".[dev,web]" && picarones demo --output demo.html
Test suite: pytest tests/ -> 1242 passed, 1 skipped (the skip is intentional
when the optional scipy extra is not installed).
Key development conventions:
- Never use bare
except Exception: pass-- always log withlogger.warning() - Normalization profiles are read dynamically from
picarones/core/normalization.py-- never hardcode them in endpoint handlers - Engines declare their
execution_mode("io"or"cpu") so the runner can select the appropriate executor python-multipartmust remain in dependencies (FastAPI checks at import time)
Roadmap
| Sprint | Status | Deliverables |
|---|---|---|
| 1 | Done | Project structure, Tesseract, Pero OCR, CER/WER, CLI |
| 2 | Done | HTML report v1: Chart.js, colored diff, gallery |
| 3 | Done | OCR+LLM pipelines, GPT-4o, Claude, Mistral, Ollama |
| 4 | Done | Cloud OCR APIs, IIIF import, diplomatic normalization |
| 5 | Done | Advanced metrics: confusion matrix, ligatures, 9-class taxonomy |
| 6 | Done | FastAPI web interface, HTR-United, HuggingFace, bilingual UI |
| 7 | Done | HTML report v2: Wilcoxon, bootstrap, clustering, difficulty score |
| 8 | Done | eScriptorium, Gallica API, SQLite history, robustness analysis |
| 9 | Done | Documentation, packaging, Docker, CI/CD, PyInstaller, v1.0.0-Beta |
| 10 | Done | Line error distribution (Gini), VLM hallucination detection |
| 11 | Done | Internationalization FR/EN, English normalization profiles |
| 12 | Done | Browser ZIP upload, macOS file filtering, dynamic model selector |
| 13 | Done | pyproject.toml cleanup, runner parallelization, NDJSON streaming, Wilcoxon validation |
| 14 | Done | Robust engine filtering, corpus validation |
| 15 | Done | Fix empty OCR+LLM pipeline output (Mistral ContentChunk normalization, finish_reason logging) |
| 16 | Done | line_metrics + hallucination wired into runner/EngineReport; narrative engine foundations (core/narrative/ with Fact / DetectorRegistry); Pillow getdata->tobytes, silent excepts -> explicit warnings |
| 17 | Done | Report refactor: generator.py 3690 -> 617 lines via Jinja2; monolithic HTML template split into 10 files under picarones/report/templates/; i18n migrated to report/i18n/{fr,en}.json; +16 non-regression tests |
| 18 | Done | Friedman test + Nemenyi post-hoc + Critical Difference Diagram (DemΕ‘ar 2006); detect_statistical_tie enabled; SVG rendered server-side; +41 tests |
| 19 | Done | Factual narrative engine complete: 9 new detectors, arbiter (importance + anti-contradiction), YAML templates renderer, _narrative_summary.html partial, anti-hallucination traceability test; +32 tests |
| 20 | Done | Cost model + Pareto view: core/pricing.py + data/pricing.yaml, compute_pareto_front, Chart.js Pareto chart with cost/speed/carbon toggles, pareto_alternative and cost_outlier detectors; +28 tests |
| 21 | Done | Contextual glossary (25 bilingual entries) + advanced-mode side panel (visible columns, strata filters, opt-in composite score, URL state persistence); +21 tests |
| 22 | Done | Case studies (docs/case-studies/), user guide (docs/user/reading-a-report.md), three developer guides (docs/developer/); +18 tests |
Known Issues & Improvement Opportunities
This section captures the findings of the Sprint 22 audit. None of them block the current release (all 1242 tests pass, lint clean), but each represents a sensible next step.
Architecture / refactor
picarones/web/app.pyis 3072 lines (FastAPI routes, corpus upload, SSE, ZIP flattening, HTML delivery, model registry all in one module). Candidate split:app_routes.py/app_corpus.py/app_jobs.py/app_models.py.picarones/core/statistics.pyis 1127 lines mixing bootstrap CI, Wilcoxon, Friedman, Nemenyi table, Pareto front and CDD SVG. Splitting intostatistics/bootstrap.py,statistics/tests.py,statistics/pareto.py,statistics/cdd_svg.pywould shorten import graphs and ease review.picarones/cli.pyis 971 lines β each Click command could live in its own module underpicarones/cli/and be registered viacli.add_command(...).picarones/core/runner.pyis 847 lines β orchestrator is reasonable but edges past the 500-line guideline; extracting the per-document worker + the partial-NDJSON writer would reduce mental load.picarones/core/narrative/detectors.pyis 680 lines β all 12 detectors live together; one file perFactType(or per importance tier) would make additions safer.
Back-compat shim
picarones/i18n.pyis a 66-line shim that readspicarones/report/i18n/{fr,en}.json. Since Sprint 17 the JSON files are the source of truth and onlypicarones/report/generator.py:654still imports through the shim. Either promote the shim topicarones.report.i18n(renaming the import) or delete the file and import the loader directly.
Explicit engine declarations
MistralOCREngine,GoogleVisionEngineandAzureDocIntelEngineinherit the implicitexecution_mode = "io"default fromBaseOCREngine. For clarity and to protect against a future default flip, declare it explicitly (asTesseractEngineandPeroOCREnginealready do for"cpu").
Test coverage gaps
- No dedicated unit tests for
picarones/core/char_scores.py(exercised only transitively). - No unit tests for the cloud engine adapters themselves (
mistral_ocr.py,google_vision.py,azure_doc_intel.py) β they are only reached via integration fixtures. pytestinstalled as auvtool doesn't see project dependencies automatically; documentpip install -e ".[dev,web,stats]"in the pytest environment or switch to an in-repo venv to avoid "ModuleNotFoundError: No module named 'yaml'" surprises.
Documentation
CHANGELOG.mdstops at Sprint 9 (2025-03). Sprints 10-22 are described inCLAUDE.mdand this README but should be back-ported intoCHANGELOG.mdto follow Keep-a-Changelog.SPECS.mdpredates the narrative engine, Pareto view and glossary β worth a pass.- Some code comments and docstrings are still in French while user-facing strings are bilingual; harmonising module docstrings in English would make the project more contributor-friendly.
CI / packaging
sync_to_huggingface.ymlusesgit push --force hf mainunconditionally β safe today but worth documenting because a non-main branch push would silently rewrite the Space.picarones.spec(PyInstaller) is still present but not exercised in CI β either add abuild-exejob or mark the spec as community-maintained.
Security (nothing critical)
- ZIP upload flattening in
web/app.pyrejects absolute paths and..traversal but does not check for symlinks inside archives. Python'szipfiledoesn't extract symlinks, so the risk is theoretical; adding an explicit check (ZipInfo.external_attr & 0xA000) is a belt-and-braces improvement. - API keys are read from environment variables only (no hardcoded fallback) β good.
Contributing
See CONTRIBUTING.md for instructions on adding an OCR engine, an LLM adapter, or submitting a pull request.
License
Copyright 2024 Picarones contributors.