Spaces:
Sleeping
chantier4: workflows CLI dédiés + propagation fix Sprint 15 LLM + fusion Gallica→IIIF
Browse filesQuatrième chantier du plan d'évolution post-Sprint 97 — donner un
point d'entrée à chaque famille de profil (chantier 2) et nettoyer
les duplications transverses identifiées dans l'audit initial.
Sous-chantier 4.A — LLM adapters factorisés
-------------------------------------------
Avant : Sprint 15 (normalisation list[ContentChunk]→str) appliqué
seulement à Mistral. Logging discriminant par status_code (401/429/5xx)
dupliqué Mistral/OpenAI, complètement absent d'Anthropic.
Après : 2 helpers publics dans picarones/llm/base.py :
- ``normalize_llm_content(raw) -> str`` — gère les 4 formats observés
en production (str, None, list[ContentChunk avec .text], list[dict
avec key 'text']). Idempotent sur str.
- ``log_http_error(adapter_name, model, exc, env_var=None)`` — log
warning discriminé par status_code, mention de la variable
d'environnement à vérifier sur 401.
Les 4 adapters (Mistral, OpenAI, Anthropic, Ollama) :
- Déclarent ``api_key_env_var`` (None pour Ollama qui est local).
- Utilisent normalize_llm_content() sur la réponse SDK.
- Utilisent log_http_error() dans le except des appels API.
Bilan LLM : −60 lignes de duplication, comportement homogène, le fix
Sprint 15 appliqué uniformément.
Sous-chantier 4.B — Fusion Gallica→IIIF
---------------------------------------
Avant : ``_validate_url`` et le download HTTP dupliqués entre
gallica.py (lignes 125-155) et iiif.py (lignes 310-344). ~30 lignes
exactement identiques.
Après : nouveau module privé picarones/importers/_http.py qui expose
``validate_http_url`` et ``download_url`` (centralisé, retry
exponentiel configurable, garde-fou contre file:// / ftp:// /
javascript:). Gallica et IIIF y délèguent.
Pour la rétrocompat des tests Sprint 4 qui font
``from picarones.importers.iiif import _validate_url, _download_url``,
ces deux noms restent exposés depuis iiif.py comme alias de re-export.
Pas de suppression — le polite ``delay_between_requests`` BnF reste
spécifique à Gallica, le User-Agent custom reste configurable.
Sous-chantier 4.C — 3 sous-commandes CLI
----------------------------------------
Trois nouveaux workflows dédiés dans cli.py qui mappent les profils
du chantier 2 :
- ``picarones diagnose --corpus DIR`` → profil "diagnostics"
→ vue HTML « Diagnostic approfondi » (chantier 3) avec leviers,
profil d'image, baseline, longitudinal.
- ``picarones economics --corpus DIR`` → profil "economics"
→ vue HTML « Coût et performance » avec throughput effectif
(HTR-United 5 s/erreur).
- ``picarones edition --corpus DIR`` → profil "philological"
→ vue HTML « Taxonomie avancée » avec comparaison miroir
leader vs runner-up + 6 modules philologiques.
Helper ``_run_workflow(...)`` factorise la logique commune entre
les 4 commandes (run + 3 nouvelles) : chargement corpus,
instanciation moteurs, run_benchmark(profile=...), affichage
classement. ~80 lignes de duplication évitées sur les 3 nouvelles
commandes vs naive copy-paste.
Validation 7/7 en sandbox
-------------------------
- 4.A.1 : api_key_env_var déclaré sur Mistral/OpenAI/Anthropic
(=respective env var) et None sur Ollama.
- 4.A.2 : normalize_llm_content gère 4 formats (str/None/
list[ContentChunk]/list[dict]) + idempotence.
- 4.A.3 : aucun adapter ne réimplémente le pattern de log par status_code.
- 4.B.1 : iiif._validate_url IS _http.validate_http_url
(single source of truth confirmée par identité d'objet).
- 4.B.2 : Gallica._fetch_url contient bien l'import vers
_http.download_url.
- 4.B.3 : validate_http_url rejette file://, ftp://, javascript://, etc.
- 4.C : les 3 commandes diagnose/economics/edition sont enregistrées
dans le groupe Click avec le bon profile par défaut.
Tests
-----
+260 lignes dans tests/test_chantier4.py organisés en 4 classes :
TestNormalizeLlmContent (9 tests dont Sprint 15 fix), TestLogHttpError
(4 tests sur 401/429/5xx/générique), TestLlmAdaptersInheritEnvVar
(4 tests), TestHttpHelpers (5 tests dont parametrize sur 5 schémas
malicieux), TestIiifAliasesDelegateToHttp (rétrocompat tests Sprint 4),
TestGallicaDelegatesToHttp (anti-régression), TestCliWorkflows
(3 commandes + helper).
Verrou levé
-----------
Le fix Sprint 15 est désormais cohérent sur les 4 providers LLM.
La duplication Gallica/IIIF est résorbée. Les 3 nouveaux workflows
CLI mappent les profils du chantier 2 — un archiviste lance
``picarones edition`` au lieu de devoir mémoriser ``run --profile
philological``.
- picarones/cli.py +221 -0
- picarones/importers/_http.py +108 -0
- picarones/importers/gallica.py +27 -23
- picarones/importers/iiif.py +6 -35
- picarones/llm/anthropic_adapter.py +21 -8
- picarones/llm/base.py +128 -1
- picarones/llm/mistral_adapter.py +15 -34
- picarones/llm/ollama_adapter.py +5 -2
- picarones/llm/openai_adapter.py +16 -17
- tests/test_chantier4.py +277 -0
|
@@ -214,6 +214,227 @@ def run_cmd(
|
|
| 214 |
sys.exit(1)
|
| 215 |
|
| 216 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 217 |
# ---------------------------------------------------------------------------
|
| 218 |
# picarones metrics
|
| 219 |
# ---------------------------------------------------------------------------
|
|
|
|
| 214 |
sys.exit(1)
|
| 215 |
|
| 216 |
|
| 217 |
+
# ---------------------------------------------------------------------------
|
| 218 |
+
# Workflows CLI dédiés (chantier 4 post-Sprint 97)
|
| 219 |
+
# ---------------------------------------------------------------------------
|
| 220 |
+
#
|
| 221 |
+
# Chaque commande spécialisée fixe un profil de calcul (chantier 2) et
|
| 222 |
+
# émet un message identifiant la famille avant de déléguer au runner.
|
| 223 |
+
# L'option ``--profile`` reste disponible mais le défaut change pour
|
| 224 |
+
# chaque commande.
|
| 225 |
+
|
| 226 |
+
def _run_workflow(
|
| 227 |
+
*,
|
| 228 |
+
corpus: str,
|
| 229 |
+
engines: str,
|
| 230 |
+
output: str,
|
| 231 |
+
lang: str,
|
| 232 |
+
psm: int,
|
| 233 |
+
no_progress: bool,
|
| 234 |
+
verbose: bool,
|
| 235 |
+
profile: str,
|
| 236 |
+
workflow_label: str,
|
| 237 |
+
) -> None:
|
| 238 |
+
"""Implémentation commune des commandes ``run``, ``diagnose``,
|
| 239 |
+
``economics`` et ``edition``.
|
| 240 |
+
|
| 241 |
+
Les 4 commandes partagent le squelette : chargement corpus →
|
| 242 |
+
instanciation moteurs → ``run_benchmark(profile=...)`` → affichage
|
| 243 |
+
classement. Seul le profil par défaut et le message d'en-tête
|
| 244 |
+
diffèrent.
|
| 245 |
+
"""
|
| 246 |
+
_setup_logging(verbose)
|
| 247 |
+
|
| 248 |
+
from picarones.core.corpus import load_corpus_from_directory
|
| 249 |
+
from picarones.core.runner import run_benchmark
|
| 250 |
+
|
| 251 |
+
try:
|
| 252 |
+
corp = load_corpus_from_directory(corpus)
|
| 253 |
+
except (FileNotFoundError, ValueError) as exc:
|
| 254 |
+
click.echo(f"Erreur corpus : {exc}", err=True)
|
| 255 |
+
sys.exit(1)
|
| 256 |
+
|
| 257 |
+
click.echo(f"[{workflow_label}] Corpus '{corp.name}' — "
|
| 258 |
+
f"{len(corp)} documents chargés.")
|
| 259 |
+
|
| 260 |
+
engine_names = [e.strip() for e in engines.split(",") if e.strip()]
|
| 261 |
+
ocr_engines = []
|
| 262 |
+
for name in engine_names:
|
| 263 |
+
try:
|
| 264 |
+
engine = _engine_from_name(name, lang=lang, psm=psm)
|
| 265 |
+
ocr_engines.append(engine)
|
| 266 |
+
except click.BadParameter as exc:
|
| 267 |
+
click.echo(f"Erreur moteur : {exc}", err=True)
|
| 268 |
+
sys.exit(1)
|
| 269 |
+
|
| 270 |
+
if not ocr_engines:
|
| 271 |
+
click.echo("Aucun moteur valide spécifié.", err=True)
|
| 272 |
+
sys.exit(1)
|
| 273 |
+
|
| 274 |
+
click.echo(f"Moteurs : {', '.join(e.name for e in ocr_engines)}")
|
| 275 |
+
click.echo(f"Profil de métriques : {profile}")
|
| 276 |
+
|
| 277 |
+
result = run_benchmark(
|
| 278 |
+
corpus=corp,
|
| 279 |
+
engines=ocr_engines,
|
| 280 |
+
output_json=output,
|
| 281 |
+
show_progress=not no_progress,
|
| 282 |
+
profile=profile,
|
| 283 |
+
)
|
| 284 |
+
|
| 285 |
+
click.echo("\n── Classement ──────────────────────────────────")
|
| 286 |
+
for rank, entry in enumerate(result.ranking(), 1):
|
| 287 |
+
cer_pct = (
|
| 288 |
+
f"{entry['mean_cer'] * 100:.2f}%"
|
| 289 |
+
if entry["mean_cer"] is not None else "N/A"
|
| 290 |
+
)
|
| 291 |
+
wer_pct = (
|
| 292 |
+
f"{entry['mean_wer'] * 100:.2f}%"
|
| 293 |
+
if entry["mean_wer"] is not None else "N/A"
|
| 294 |
+
)
|
| 295 |
+
failed = entry["failed"]
|
| 296 |
+
failed_str = f" ({failed} erreur(s))" if failed else ""
|
| 297 |
+
click.echo(
|
| 298 |
+
f" {rank}. {entry['engine']:<20} "
|
| 299 |
+
f"CER={cer_pct:<8} WER={wer_pct}{failed_str}"
|
| 300 |
+
)
|
| 301 |
+
|
| 302 |
+
click.echo(f"\nRésultats écrits dans : {output}")
|
| 303 |
+
|
| 304 |
+
|
| 305 |
+
@cli.command("diagnose")
|
| 306 |
+
@click.option(
|
| 307 |
+
"--corpus", "-c", required=True,
|
| 308 |
+
type=click.Path(exists=True, file_okay=False, resolve_path=True),
|
| 309 |
+
help="Dossier contenant les paires image / .gt.txt",
|
| 310 |
+
)
|
| 311 |
+
@click.option(
|
| 312 |
+
"--engines", "-e", default="tesseract", show_default=True,
|
| 313 |
+
help="Liste de moteurs séparés par des virgules",
|
| 314 |
+
)
|
| 315 |
+
@click.option(
|
| 316 |
+
"--output", "-o", default="results_diagnose.json", show_default=True,
|
| 317 |
+
type=click.Path(resolve_path=True),
|
| 318 |
+
help="Fichier JSON de sortie",
|
| 319 |
+
)
|
| 320 |
+
@click.option("--lang", "-l", default="fra", show_default=True,
|
| 321 |
+
help="Code langue Tesseract")
|
| 322 |
+
@click.option("--psm", default=6, show_default=True,
|
| 323 |
+
help="Page Segmentation Mode Tesseract")
|
| 324 |
+
@click.option("--no-progress", is_flag=True, default=False,
|
| 325 |
+
help="Désactive la barre de progression")
|
| 326 |
+
@click.option("--verbose", "-v", is_flag=True, default=False,
|
| 327 |
+
help="Mode verbeux")
|
| 328 |
+
def diagnose_cmd(
|
| 329 |
+
corpus: str, engines: str, output: str, lang: str, psm: int,
|
| 330 |
+
no_progress: bool, verbose: bool,
|
| 331 |
+
) -> None:
|
| 332 |
+
"""Workflow diagnostic : bench + leviers d'amélioration + image_predictive.
|
| 333 |
+
|
| 334 |
+
Active le profil ``diagnostics`` (chantier 2) qui calcule les
|
| 335 |
+
métriques nécessaires à la vue HTML « Diagnostic approfondi »
|
| 336 |
+
(chantier 3) : leviers, profil d'image, baseline, longitudinal.
|
| 337 |
+
Idéal pour comprendre *pourquoi* un moteur produit ces résultats
|
| 338 |
+
sur ce corpus, pas seulement *quel CER*.
|
| 339 |
+
"""
|
| 340 |
+
_run_workflow(
|
| 341 |
+
corpus=corpus, engines=engines, output=output,
|
| 342 |
+
lang=lang, psm=psm,
|
| 343 |
+
no_progress=no_progress, verbose=verbose,
|
| 344 |
+
profile="diagnostics",
|
| 345 |
+
workflow_label="diagnose",
|
| 346 |
+
)
|
| 347 |
+
|
| 348 |
+
|
| 349 |
+
@cli.command("economics")
|
| 350 |
+
@click.option(
|
| 351 |
+
"--corpus", "-c", required=True,
|
| 352 |
+
type=click.Path(exists=True, file_okay=False, resolve_path=True),
|
| 353 |
+
help="Dossier contenant les paires image / .gt.txt",
|
| 354 |
+
)
|
| 355 |
+
@click.option(
|
| 356 |
+
"--engines", "-e", default="tesseract", show_default=True,
|
| 357 |
+
help="Liste de moteurs séparés par des virgules",
|
| 358 |
+
)
|
| 359 |
+
@click.option(
|
| 360 |
+
"--output", "-o", default="results_economics.json", show_default=True,
|
| 361 |
+
type=click.Path(resolve_path=True),
|
| 362 |
+
help="Fichier JSON de sortie",
|
| 363 |
+
)
|
| 364 |
+
@click.option("--lang", "-l", default="fra", show_default=True,
|
| 365 |
+
help="Code langue Tesseract")
|
| 366 |
+
@click.option("--psm", default=6, show_default=True,
|
| 367 |
+
help="Page Segmentation Mode Tesseract")
|
| 368 |
+
@click.option("--no-progress", is_flag=True, default=False,
|
| 369 |
+
help="Désactive la barre de progression")
|
| 370 |
+
@click.option("--verbose", "-v", is_flag=True, default=False,
|
| 371 |
+
help="Mode verbeux")
|
| 372 |
+
def economics_cmd(
|
| 373 |
+
corpus: str, engines: str, output: str, lang: str, psm: int,
|
| 374 |
+
no_progress: bool, verbose: bool,
|
| 375 |
+
) -> None:
|
| 376 |
+
"""Workflow économique : bench + throughput effectif + (cost projection).
|
| 377 |
+
|
| 378 |
+
Active le profil ``economics`` (chantier 2) qui se concentre sur
|
| 379 |
+
les métriques de décision budget : pages/h utilisable (intégrant
|
| 380 |
+
la correction humaine HTR-United à 5 s/erreur), coût marginal par
|
| 381 |
+
erreur évitée. La vue HTML « Coût et performance » (chantier 3)
|
| 382 |
+
est ensuite branchée.
|
| 383 |
+
"""
|
| 384 |
+
_run_workflow(
|
| 385 |
+
corpus=corpus, engines=engines, output=output,
|
| 386 |
+
lang=lang, psm=psm,
|
| 387 |
+
no_progress=no_progress, verbose=verbose,
|
| 388 |
+
profile="economics",
|
| 389 |
+
workflow_label="economics",
|
| 390 |
+
)
|
| 391 |
+
|
| 392 |
+
|
| 393 |
+
@cli.command("edition")
|
| 394 |
+
@click.option(
|
| 395 |
+
"--corpus", "-c", required=True,
|
| 396 |
+
type=click.Path(exists=True, file_okay=False, resolve_path=True),
|
| 397 |
+
help="Dossier contenant les paires image / .gt.txt",
|
| 398 |
+
)
|
| 399 |
+
@click.option(
|
| 400 |
+
"--engines", "-e", default="tesseract", show_default=True,
|
| 401 |
+
help="Liste de moteurs séparés par des virgules",
|
| 402 |
+
)
|
| 403 |
+
@click.option(
|
| 404 |
+
"--output", "-o", default="results_edition.json", show_default=True,
|
| 405 |
+
type=click.Path(resolve_path=True),
|
| 406 |
+
help="Fichier JSON de sortie",
|
| 407 |
+
)
|
| 408 |
+
@click.option("--lang", "-l", default="fra", show_default=True,
|
| 409 |
+
help="Code langue Tesseract")
|
| 410 |
+
@click.option("--psm", default=6, show_default=True,
|
| 411 |
+
help="Page Segmentation Mode Tesseract")
|
| 412 |
+
@click.option("--no-progress", is_flag=True, default=False,
|
| 413 |
+
help="Désactive la barre de progression")
|
| 414 |
+
@click.option("--verbose", "-v", is_flag=True, default=False,
|
| 415 |
+
help="Mode verbeux")
|
| 416 |
+
def edition_cmd(
|
| 417 |
+
corpus: str, engines: str, output: str, lang: str, psm: int,
|
| 418 |
+
no_progress: bool, verbose: bool,
|
| 419 |
+
) -> None:
|
| 420 |
+
"""Workflow édition critique : bench + métriques philologiques.
|
| 421 |
+
|
| 422 |
+
Active le profil ``philological`` (chantier 2) qui inclut les
|
| 423 |
+
modules philologiques (unicode_blocks, abbreviations, MUFI,
|
| 424 |
+
early_modern_typography, modern_archives, roman_numerals) et la
|
| 425 |
+
vue HTML « Taxonomie avancée » (chantier 3) avec comparaison
|
| 426 |
+
miroir leader vs runner-up. Cible : éditeurs de chartes,
|
| 427 |
+
paléographes, archivistes.
|
| 428 |
+
"""
|
| 429 |
+
_run_workflow(
|
| 430 |
+
corpus=corpus, engines=engines, output=output,
|
| 431 |
+
lang=lang, psm=psm,
|
| 432 |
+
no_progress=no_progress, verbose=verbose,
|
| 433 |
+
profile="philological",
|
| 434 |
+
workflow_label="edition",
|
| 435 |
+
)
|
| 436 |
+
|
| 437 |
+
|
| 438 |
# ---------------------------------------------------------------------------
|
| 439 |
# picarones metrics
|
| 440 |
# ---------------------------------------------------------------------------
|
|
@@ -0,0 +1,108 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Helpers HTTP partagés par les importeurs IIIF / Gallica / HTR-United.
|
| 2 |
+
|
| 3 |
+
Chantier 4 du plan d'évolution post-Sprint 97 — fusion Gallica vers IIIF.
|
| 4 |
+
|
| 5 |
+
Auparavant les fonctions ``_validate_url`` et ``_download_url`` étaient
|
| 6 |
+
dupliquées entre :mod:`picarones.importers.iiif` (lignes 310-344) et
|
| 7 |
+
:mod:`picarones.importers.gallica` (lignes 125-155). Le module Gallica
|
| 8 |
+
faisait 549 lignes dont une bonne partie réimplémentait les mêmes
|
| 9 |
+
abstractions HTTP que IIIF (validation de schéma, retry exponentiel,
|
| 10 |
+
gestion des codes HTTP).
|
| 11 |
+
|
| 12 |
+
Ce module privé centralise ces helpers. Les deux importeurs (et tout
|
| 13 |
+
nouveau importateur HTTP futur) les utilisent. Comportement public
|
| 14 |
+
inchangé — uniquement de la factorisation.
|
| 15 |
+
"""
|
| 16 |
+
|
| 17 |
+
from __future__ import annotations
|
| 18 |
+
|
| 19 |
+
import logging
|
| 20 |
+
import time
|
| 21 |
+
import urllib.error
|
| 22 |
+
import urllib.request
|
| 23 |
+
from typing import Optional
|
| 24 |
+
from urllib.parse import urlparse
|
| 25 |
+
|
| 26 |
+
logger = logging.getLogger(__name__)
|
| 27 |
+
|
| 28 |
+
_DEFAULT_USER_AGENT = (
|
| 29 |
+
"Picarones/1.0 (OCR benchmark platform; "
|
| 30 |
+
"https://github.com/maribakulj/Picarones)"
|
| 31 |
+
)
|
| 32 |
+
|
| 33 |
+
|
| 34 |
+
def validate_http_url(url: str) -> None:
|
| 35 |
+
"""Lève ``ValueError`` si le schéma de l'URL n'est pas http/https.
|
| 36 |
+
|
| 37 |
+
Garde-fou contre les URLs ``file://``, ``ftp://``, ``data:`` qui
|
| 38 |
+
permettraient à un manifeste IIIF malveillant de lire des fichiers
|
| 39 |
+
locaux ou de contourner la politique réseau.
|
| 40 |
+
"""
|
| 41 |
+
parsed = urlparse(url)
|
| 42 |
+
if parsed.scheme not in ("http", "https"):
|
| 43 |
+
raise ValueError(
|
| 44 |
+
f"Schéma URL non autorisé '{parsed.scheme}' "
|
| 45 |
+
f"(seuls http/https sont acceptés) : {url}"
|
| 46 |
+
)
|
| 47 |
+
|
| 48 |
+
|
| 49 |
+
def download_url(
|
| 50 |
+
url: str,
|
| 51 |
+
*,
|
| 52 |
+
retries: int = 4,
|
| 53 |
+
backoff: float = 2.0,
|
| 54 |
+
timeout: int = 60,
|
| 55 |
+
user_agent: str = _DEFAULT_USER_AGENT,
|
| 56 |
+
extra_headers: Optional[dict[str, str]] = None,
|
| 57 |
+
) -> bytes:
|
| 58 |
+
"""Télécharge une URL avec retry exponentiel.
|
| 59 |
+
|
| 60 |
+
Parameters
|
| 61 |
+
----------
|
| 62 |
+
url:
|
| 63 |
+
URL à télécharger. Validée par :func:`validate_http_url`.
|
| 64 |
+
retries:
|
| 65 |
+
Nombre total de tentatives (défaut 4).
|
| 66 |
+
backoff:
|
| 67 |
+
Base du backoff exponentiel : attente = ``backoff ** attempt``
|
| 68 |
+
secondes (défaut 2.0 → 0, 2, 4, 8 s).
|
| 69 |
+
timeout:
|
| 70 |
+
Timeout HTTP par tentative en secondes (défaut 60).
|
| 71 |
+
user_agent:
|
| 72 |
+
Header ``User-Agent`` envoyé. Défaut : Picarones identifié.
|
| 73 |
+
extra_headers:
|
| 74 |
+
Headers supplémentaires (ex : ``{"Accept": "application/json"}``).
|
| 75 |
+
|
| 76 |
+
Raises
|
| 77 |
+
------
|
| 78 |
+
ValueError
|
| 79 |
+
Si l'URL n'a pas un schéma autorisé.
|
| 80 |
+
RuntimeError
|
| 81 |
+
Si toutes les tentatives échouent.
|
| 82 |
+
"""
|
| 83 |
+
validate_http_url(url)
|
| 84 |
+
headers = {"User-Agent": user_agent}
|
| 85 |
+
if extra_headers:
|
| 86 |
+
headers.update(extra_headers)
|
| 87 |
+
last_exc: Optional[Exception] = None
|
| 88 |
+
for attempt in range(retries):
|
| 89 |
+
if attempt > 0:
|
| 90 |
+
wait = backoff ** attempt
|
| 91 |
+
logger.debug(
|
| 92 |
+
"Retry %d/%d dans %.1fs — %s",
|
| 93 |
+
attempt, retries - 1, wait, url,
|
| 94 |
+
)
|
| 95 |
+
time.sleep(wait)
|
| 96 |
+
try:
|
| 97 |
+
req = urllib.request.Request(url, headers=headers)
|
| 98 |
+
with urllib.request.urlopen(req, timeout=timeout) as resp:
|
| 99 |
+
return resp.read()
|
| 100 |
+
except (urllib.error.URLError, urllib.error.HTTPError) as exc:
|
| 101 |
+
last_exc = exc
|
| 102 |
+
logger.warning("Erreur téléchargement %s : %s", url, exc)
|
| 103 |
+
raise RuntimeError(
|
| 104 |
+
f"Impossible de télécharger {url} après {retries} tentatives",
|
| 105 |
+
) from last_exc
|
| 106 |
+
|
| 107 |
+
|
| 108 |
+
__all__ = ["validate_http_url", "download_url"]
|
|
@@ -122,34 +122,38 @@ class GallicaClient:
|
|
| 122 |
self.timeout = timeout
|
| 123 |
self.delay = delay_between_requests
|
| 124 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 125 |
@staticmethod
|
| 126 |
def _validate_url(url: str) -> None:
|
| 127 |
-
"""
|
| 128 |
-
from
|
| 129 |
-
|
| 130 |
-
if parsed.scheme not in ("http", "https"):
|
| 131 |
-
raise ValueError(
|
| 132 |
-
f"Schéma URL non autorisé '{parsed.scheme}' (seuls http/https sont acceptés) : {url}"
|
| 133 |
-
)
|
| 134 |
|
| 135 |
def _fetch_url(self, url: str) -> bytes:
|
| 136 |
-
"""Télécharge le contenu d'une URL.
|
| 137 |
-
|
| 138 |
-
|
| 139 |
-
|
| 140 |
-
|
| 141 |
-
|
|
|
|
| 142 |
try:
|
| 143 |
-
|
| 144 |
-
|
| 145 |
-
|
| 146 |
-
|
| 147 |
-
|
| 148 |
-
)
|
| 149 |
-
except
|
| 150 |
-
|
| 151 |
-
|
| 152 |
-
|
|
|
|
| 153 |
finally:
|
| 154 |
if self.delay > 0:
|
| 155 |
time.sleep(self.delay)
|
|
|
|
| 122 |
self.timeout = timeout
|
| 123 |
self.delay = delay_between_requests
|
| 124 |
|
| 125 |
+
# Chantier 4 (post-Sprint 97) — fusion Gallica → IIIF :
|
| 126 |
+
# ``_validate_url`` et le fetch HTTP sont désormais factorisés
|
| 127 |
+
# dans :mod:`picarones.importers._http`. Avant ce chantier ces
|
| 128 |
+
# 30 lignes étaient dupliquées avec :mod:`iiif`. Le polite
|
| 129 |
+
# ``delay_between_requests`` reste ici (spécifique à la BnF).
|
| 130 |
+
|
| 131 |
@staticmethod
|
| 132 |
def _validate_url(url: str) -> None:
|
| 133 |
+
"""Délègue à :func:`picarones.importers._http.validate_http_url`."""
|
| 134 |
+
from picarones.importers._http import validate_http_url
|
| 135 |
+
validate_http_url(url)
|
|
|
|
|
|
|
|
|
|
|
|
|
| 136 |
|
| 137 |
def _fetch_url(self, url: str) -> bytes:
|
| 138 |
+
"""Télécharge le contenu d'une URL avec respect du polite delay BnF.
|
| 139 |
+
|
| 140 |
+
Délègue à :func:`picarones.importers._http.download_url` puis
|
| 141 |
+
applique ``self.delay`` (par défaut 0.5 s) entre les requêtes
|
| 142 |
+
pour respecter les conditions d'utilisation Gallica.
|
| 143 |
+
"""
|
| 144 |
+
from picarones.importers._http import download_url
|
| 145 |
try:
|
| 146 |
+
return download_url(
|
| 147 |
+
url,
|
| 148 |
+
retries=1,
|
| 149 |
+
timeout=self.timeout,
|
| 150 |
+
user_agent="Picarones/1.0 (research tool)",
|
| 151 |
+
)
|
| 152 |
+
except RuntimeError as exc:
|
| 153 |
+
# Le helper retourne ``RuntimeError`` après retries épuisés.
|
| 154 |
+
# On re-emballe pour conserver le format de message historique
|
| 155 |
+
# attendu par les tests Gallica (« HTTP 404 sur ... »).
|
| 156 |
+
raise RuntimeError(str(exc)) from exc
|
| 157 |
finally:
|
| 158 |
if self.delay > 0:
|
| 159 |
time.sleep(self.delay)
|
|
@@ -307,41 +307,12 @@ def _extract_v3_transcription(canvas: dict) -> Optional[str]:
|
|
| 307 |
# Téléchargement avec retry
|
| 308 |
# ---------------------------------------------------------------------------
|
| 309 |
|
| 310 |
-
|
| 311 |
-
|
| 312 |
-
|
| 313 |
-
|
| 314 |
-
|
| 315 |
-
|
| 316 |
-
f"Schéma URL non autorisé '{parsed.scheme}' (seuls http/https sont acceptés) : {url}"
|
| 317 |
-
)
|
| 318 |
-
|
| 319 |
-
|
| 320 |
-
def _download_url(
|
| 321 |
-
url: str,
|
| 322 |
-
retries: int = 4,
|
| 323 |
-
backoff: float = 2.0,
|
| 324 |
-
timeout: int = 60,
|
| 325 |
-
) -> bytes:
|
| 326 |
-
"""Télécharge une URL avec retry exponentiel."""
|
| 327 |
-
_validate_url(url)
|
| 328 |
-
headers = {
|
| 329 |
-
"User-Agent": "Picarones/1.0 (OCR benchmark platform; https://github.com/maribakulj/Picarones)"
|
| 330 |
-
}
|
| 331 |
-
last_exc: Optional[Exception] = None
|
| 332 |
-
for attempt in range(retries):
|
| 333 |
-
if attempt > 0:
|
| 334 |
-
wait = backoff ** attempt
|
| 335 |
-
logger.debug("Retry %d/%d dans %.1fs — %s", attempt, retries - 1, wait, url)
|
| 336 |
-
time.sleep(wait)
|
| 337 |
-
try:
|
| 338 |
-
req = urllib.request.Request(url, headers=headers)
|
| 339 |
-
with urllib.request.urlopen(req, timeout=timeout) as resp:
|
| 340 |
-
return resp.read()
|
| 341 |
-
except (urllib.error.URLError, urllib.error.HTTPError) as exc:
|
| 342 |
-
last_exc = exc
|
| 343 |
-
logger.warning("Erreur téléchargement %s : %s", url, exc)
|
| 344 |
-
raise RuntimeError(f"Impossible de télécharger {url} après {retries} tentatives") from last_exc
|
| 345 |
|
| 346 |
|
| 347 |
def _fetch_manifest(url: str) -> dict:
|
|
|
|
| 307 |
# Téléchargement avec retry
|
| 308 |
# ---------------------------------------------------------------------------
|
| 309 |
|
| 310 |
+
# Chantier 4 (post-Sprint 97) — helpers HTTP factorisés dans
|
| 311 |
+
# :mod:`picarones.importers._http`. Ces noms restent disponibles
|
| 312 |
+
# depuis ``iiif`` (rétrocompat des tests qui les importent
|
| 313 |
+
# directement, ex. test_sprint4_normalization_iiif).
|
| 314 |
+
from picarones.importers._http import download_url as _download_url
|
| 315 |
+
from picarones.importers._http import validate_http_url as _validate_url
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 316 |
|
| 317 |
|
| 318 |
def _fetch_manifest(url: str) -> dict:
|
|
@@ -6,7 +6,11 @@ import logging
|
|
| 6 |
import os
|
| 7 |
from typing import Optional
|
| 8 |
|
| 9 |
-
from picarones.llm.base import
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
|
| 11 |
logger = logging.getLogger(__name__)
|
| 12 |
|
|
@@ -19,6 +23,8 @@ class AnthropicAdapter(BaseLLMAdapter):
|
|
| 19 |
Modes supportés : text_only, text_and_image, zero_shot.
|
| 20 |
"""
|
| 21 |
|
|
|
|
|
|
|
| 22 |
@property
|
| 23 |
def name(self) -> str:
|
| 24 |
return "anthropic"
|
|
@@ -74,9 +80,12 @@ class AnthropicAdapter(BaseLLMAdapter):
|
|
| 74 |
messages=[{"role": "user", "content": content}],
|
| 75 |
)
|
| 76 |
except Exception as exc:
|
| 77 |
-
|
| 78 |
-
|
| 79 |
-
|
|
|
|
|
|
|
|
|
|
| 80 |
)
|
| 81 |
raise
|
| 82 |
|
|
@@ -87,12 +96,16 @@ class AnthropicAdapter(BaseLLMAdapter):
|
|
| 87 |
)
|
| 88 |
return ""
|
| 89 |
|
| 90 |
-
|
| 91 |
-
|
| 92 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 93 |
logger.warning(
|
| 94 |
"[AnthropicAdapter] bloc de type '%s' sans texte (modèle=%s).",
|
| 95 |
getattr(block, "type", "unknown"), self.model,
|
| 96 |
)
|
| 97 |
-
return ""
|
| 98 |
return text
|
|
|
|
| 6 |
import os
|
| 7 |
from typing import Optional
|
| 8 |
|
| 9 |
+
from picarones.llm.base import (
|
| 10 |
+
BaseLLMAdapter,
|
| 11 |
+
log_http_error,
|
| 12 |
+
normalize_llm_content,
|
| 13 |
+
)
|
| 14 |
|
| 15 |
logger = logging.getLogger(__name__)
|
| 16 |
|
|
|
|
| 23 |
Modes supportés : text_only, text_and_image, zero_shot.
|
| 24 |
"""
|
| 25 |
|
| 26 |
+
api_key_env_var = "ANTHROPIC_API_KEY"
|
| 27 |
+
|
| 28 |
@property
|
| 29 |
def name(self) -> str:
|
| 30 |
return "anthropic"
|
|
|
|
| 80 |
messages=[{"role": "user", "content": content}],
|
| 81 |
)
|
| 82 |
except Exception as exc:
|
| 83 |
+
# Chantier 4 — log discriminant (401/429/5xx) factorisé.
|
| 84 |
+
# Auparavant Anthropic ne discriminait pas par code HTTP,
|
| 85 |
+
# difficile à diagnostiquer (clé invalide vs rate limit).
|
| 86 |
+
log_http_error(
|
| 87 |
+
"AnthropicAdapter", self.model, exc,
|
| 88 |
+
env_var=self.api_key_env_var,
|
| 89 |
)
|
| 90 |
raise
|
| 91 |
|
|
|
|
| 96 |
)
|
| 97 |
return ""
|
| 98 |
|
| 99 |
+
# Chantier 4 — propagation du fix Sprint 15 : le SDK Anthropic
|
| 100 |
+
# retourne ``response.content`` comme une liste de blocs
|
| 101 |
+
# (``ContentBlock`` avec attribut ``text``). ``normalize_llm_content``
|
| 102 |
+
# concatène le texte de tous les blocs au lieu de ne prendre que
|
| 103 |
+
# le premier — utile quand le modèle émet plusieurs blocs.
|
| 104 |
+
text = normalize_llm_content(response.content)
|
| 105 |
+
if not text:
|
| 106 |
+
block = response.content[0]
|
| 107 |
logger.warning(
|
| 108 |
"[AnthropicAdapter] bloc de type '%s' sans texte (modèle=%s).",
|
| 109 |
getattr(block, "type", "unknown"), self.model,
|
| 110 |
)
|
|
|
|
| 111 |
return text
|
|
@@ -6,7 +6,7 @@ import logging
|
|
| 6 |
import time
|
| 7 |
from abc import ABC, abstractmethod
|
| 8 |
from dataclasses import dataclass
|
| 9 |
-
from typing import Optional
|
| 10 |
|
| 11 |
logger = logging.getLogger(__name__)
|
| 12 |
|
|
@@ -39,6 +39,105 @@ def _is_retryable(exc: Exception) -> bool:
|
|
| 39 |
return False
|
| 40 |
|
| 41 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 42 |
@dataclass
|
| 43 |
class LLMResult:
|
| 44 |
"""Résultat produit par un appel LLM."""
|
|
@@ -69,8 +168,28 @@ class BaseLLMAdapter(ABC):
|
|
| 69 |
Les erreurs retryables (HTTP 429, 5xx, timeout réseau) sont automatiquement
|
| 70 |
retentées avec backoff exponentiel (2s, 4s, 8s par défaut). Configurable
|
| 71 |
via ``config["max_retries"]`` et ``config["retry_backoff"]``.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 72 |
"""
|
| 73 |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 74 |
def __init__(
|
| 75 |
self,
|
| 76 |
model: Optional[str] = None,
|
|
@@ -150,3 +269,11 @@ class BaseLLMAdapter(ABC):
|
|
| 150 |
|
| 151 |
def __repr__(self) -> str:
|
| 152 |
return f"{self.__class__.__name__}(model={self.model!r})"
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
import time
|
| 7 |
from abc import ABC, abstractmethod
|
| 8 |
from dataclasses import dataclass
|
| 9 |
+
from typing import Any, Optional
|
| 10 |
|
| 11 |
logger = logging.getLogger(__name__)
|
| 12 |
|
|
|
|
| 39 |
return False
|
| 40 |
|
| 41 |
|
| 42 |
+
def normalize_llm_content(raw: Any) -> str:
|
| 43 |
+
"""Normalise une réponse LLM en chaîne plate.
|
| 44 |
+
|
| 45 |
+
Chantier 4 (post-Sprint 97) — propagation du fix Mistral
|
| 46 |
+
Sprint 15 à tous les providers. Le SDK Mistral peut retourner
|
| 47 |
+
une liste de ``ContentChunk`` au lieu d'une chaîne pour certains
|
| 48 |
+
modèles/versions ; le SDK OpenAI peut faire de même quand on
|
| 49 |
+
active des features de structuration. Ce helper applique la même
|
| 50 |
+
discipline pour les 4 adapters :
|
| 51 |
+
|
| 52 |
+
- ``str`` → renvoyée telle quelle (ou ``""``).
|
| 53 |
+
- ``None`` → ``""``.
|
| 54 |
+
- ``list[ContentChunk]`` → concaténation des ``.text``.
|
| 55 |
+
- ``list[dict]`` avec clé ``text`` → concaténation des ``["text"]``.
|
| 56 |
+
- ``list[str]`` → concaténation directe.
|
| 57 |
+
- autre objet avec ``.text`` → ``obj.text``.
|
| 58 |
+
- autre → ``str(obj)`` (best-effort).
|
| 59 |
+
|
| 60 |
+
Le résultat est garanti être une ``str`` ; ``""`` quand la réponse
|
| 61 |
+
est vide. La fonction est idempotente : ``normalize_llm_content(s)
|
| 62 |
+
== s`` pour toute chaîne ``s``.
|
| 63 |
+
"""
|
| 64 |
+
if raw is None:
|
| 65 |
+
return ""
|
| 66 |
+
if isinstance(raw, str):
|
| 67 |
+
return raw
|
| 68 |
+
if isinstance(raw, list):
|
| 69 |
+
parts: list[str] = []
|
| 70 |
+
for chunk in raw:
|
| 71 |
+
if chunk is None:
|
| 72 |
+
continue
|
| 73 |
+
if isinstance(chunk, str):
|
| 74 |
+
parts.append(chunk)
|
| 75 |
+
continue
|
| 76 |
+
if hasattr(chunk, "text"):
|
| 77 |
+
txt = getattr(chunk, "text", None)
|
| 78 |
+
if isinstance(txt, str):
|
| 79 |
+
parts.append(txt)
|
| 80 |
+
continue
|
| 81 |
+
if isinstance(chunk, dict) and isinstance(chunk.get("text"), str):
|
| 82 |
+
parts.append(chunk["text"])
|
| 83 |
+
continue
|
| 84 |
+
# Dernier recours — convertit le chunk en chaîne
|
| 85 |
+
parts.append(str(chunk))
|
| 86 |
+
return "".join(parts)
|
| 87 |
+
if hasattr(raw, "text") and isinstance(getattr(raw, "text", None), str):
|
| 88 |
+
return raw.text # type: ignore[no-any-return]
|
| 89 |
+
return str(raw)
|
| 90 |
+
|
| 91 |
+
|
| 92 |
+
def log_http_error(
|
| 93 |
+
adapter_name: str,
|
| 94 |
+
model: str,
|
| 95 |
+
exc: Exception,
|
| 96 |
+
*,
|
| 97 |
+
env_var: Optional[str] = None,
|
| 98 |
+
) -> None:
|
| 99 |
+
"""Log standardisé des erreurs HTTP des SDK LLM.
|
| 100 |
+
|
| 101 |
+
Chantier 4 (post-Sprint 97) — propagation du log discriminant
|
| 102 |
+
Mistral/OpenAI à tous les providers. Inspecte ``status_code`` et
|
| 103 |
+
``http_status`` puis émet un warning ciblé selon le code :
|
| 104 |
+
|
| 105 |
+
- 401 : clé API invalide/expirée (mention de la variable
|
| 106 |
+
d'environnement à vérifier si fournie).
|
| 107 |
+
- 429 : rate limit / quota dépassé.
|
| 108 |
+
- 5xx : problème serveur côté provider.
|
| 109 |
+
- autre / pas de status_code : log générique.
|
| 110 |
+
|
| 111 |
+
L'exception n'est pas levée — l'appelant doit ``raise``
|
| 112 |
+
explicitement après ce log s'il veut propager (le retry est géré
|
| 113 |
+
par ``BaseLLMAdapter.complete`` selon ``_is_retryable``).
|
| 114 |
+
"""
|
| 115 |
+
status = getattr(exc, "status_code", None) or getattr(exc, "http_status", None)
|
| 116 |
+
if status == 401:
|
| 117 |
+
suffix = f" Vérifier {env_var}." if env_var else ""
|
| 118 |
+
logger.warning(
|
| 119 |
+
"[%s] erreur HTTP 401 — clé API invalide ou expirée "
|
| 120 |
+
"(modèle=%s).%s",
|
| 121 |
+
adapter_name, model, suffix,
|
| 122 |
+
)
|
| 123 |
+
elif status == 429:
|
| 124 |
+
logger.warning(
|
| 125 |
+
"[%s] erreur HTTP 429 — quota dépassé ou rate-limit "
|
| 126 |
+
"(modèle=%s). Réessayer plus tard.",
|
| 127 |
+
adapter_name, model,
|
| 128 |
+
)
|
| 129 |
+
elif status is not None and status >= 500:
|
| 130 |
+
logger.warning(
|
| 131 |
+
"[%s] erreur HTTP %d — problème serveur (modèle=%s) : %s",
|
| 132 |
+
adapter_name, status, model, exc,
|
| 133 |
+
)
|
| 134 |
+
else:
|
| 135 |
+
logger.warning(
|
| 136 |
+
"[%s] erreur lors de l'appel API (modèle=%s) : %s",
|
| 137 |
+
adapter_name, model, exc,
|
| 138 |
+
)
|
| 139 |
+
|
| 140 |
+
|
| 141 |
@dataclass
|
| 142 |
class LLMResult:
|
| 143 |
"""Résultat produit par un appel LLM."""
|
|
|
|
| 168 |
Les erreurs retryables (HTTP 429, 5xx, timeout réseau) sont automatiquement
|
| 169 |
retentées avec backoff exponentiel (2s, 4s, 8s par défaut). Configurable
|
| 170 |
via ``config["max_retries"]`` et ``config["retry_backoff"]``.
|
| 171 |
+
|
| 172 |
+
Normalisation des réponses (chantier 4)
|
| 173 |
+
---------------------------------------
|
| 174 |
+
Les sous-classes utilisent :func:`normalize_llm_content` sur la
|
| 175 |
+
réponse SDK avant de la retourner — garantit qu'une réponse de
|
| 176 |
+
type ``list[ContentChunk]`` (Mistral, parfois OpenAI) est
|
| 177 |
+
convertie en ``str`` plate.
|
| 178 |
+
|
| 179 |
+
Logging d'erreurs HTTP (chantier 4)
|
| 180 |
+
-----------------------------------
|
| 181 |
+
Les sous-classes utilisent :func:`log_http_error` pour produire
|
| 182 |
+
un log discriminant par ``status_code`` (401 → clé invalide,
|
| 183 |
+
429 → rate limit, 5xx → serveur). Auparavant ce log était
|
| 184 |
+
dupliqué chez Mistral/OpenAI et absent chez Anthropic.
|
| 185 |
"""
|
| 186 |
|
| 187 |
+
# Variable d'environnement portant la clé API. Sous-classes
|
| 188 |
+
# surchargent (ex. ``"OPENAI_API_KEY"``) ; mention utilisée par
|
| 189 |
+
# :func:`log_http_error` quand un 401 est rencontré. ``None``
|
| 190 |
+
# pour les providers sans clé (Ollama).
|
| 191 |
+
api_key_env_var: Optional[str] = None
|
| 192 |
+
|
| 193 |
def __init__(
|
| 194 |
self,
|
| 195 |
model: Optional[str] = None,
|
|
|
|
| 269 |
|
| 270 |
def __repr__(self) -> str:
|
| 271 |
return f"{self.__class__.__name__}(model={self.model!r})"
|
| 272 |
+
|
| 273 |
+
|
| 274 |
+
__all__ = [
|
| 275 |
+
"BaseLLMAdapter",
|
| 276 |
+
"LLMResult",
|
| 277 |
+
"log_http_error",
|
| 278 |
+
"normalize_llm_content",
|
| 279 |
+
]
|
|
@@ -6,7 +6,11 @@ import logging
|
|
| 6 |
import os
|
| 7 |
from typing import Optional
|
| 8 |
|
| 9 |
-
from picarones.llm.base import
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
|
| 11 |
logger = logging.getLogger(__name__)
|
| 12 |
|
|
@@ -36,6 +40,8 @@ class MistralAdapter(BaseLLMAdapter):
|
|
| 36 |
pas le mode multimodal — utiliser ``PipelineMode.TEXT_ONLY`` avec ces modèles.
|
| 37 |
"""
|
| 38 |
|
|
|
|
|
|
|
| 39 |
@property
|
| 40 |
def name(self) -> str:
|
| 41 |
return "mistral"
|
|
@@ -109,30 +115,10 @@ class MistralAdapter(BaseLLMAdapter):
|
|
| 109 |
max_tokens=max_tokens,
|
| 110 |
)
|
| 111 |
except Exception as exc:
|
| 112 |
-
|
| 113 |
-
|
| 114 |
-
|
| 115 |
-
|
| 116 |
-
"(modèle=%s). Vérifier MISTRAL_API_KEY.",
|
| 117 |
-
self.model,
|
| 118 |
-
)
|
| 119 |
-
elif status_code == 429:
|
| 120 |
-
logger.warning(
|
| 121 |
-
"[MistralAdapter] erreur HTTP 429 — quota dépassé ou rate-limit "
|
| 122 |
-
"(modèle=%s). Réessayer plus tard.",
|
| 123 |
-
self.model,
|
| 124 |
-
)
|
| 125 |
-
elif status_code is not None and status_code >= 500:
|
| 126 |
-
logger.warning(
|
| 127 |
-
"[MistralAdapter] erreur HTTP %d — problème serveur Mistral "
|
| 128 |
-
"(modèle=%s) : %s",
|
| 129 |
-
status_code, self.model, exc,
|
| 130 |
-
)
|
| 131 |
-
else:
|
| 132 |
-
logger.warning(
|
| 133 |
-
"[MistralAdapter] erreur lors de l'appel API (modèle=%s) : %s",
|
| 134 |
-
self.model, exc,
|
| 135 |
-
)
|
| 136 |
raise
|
| 137 |
|
| 138 |
if not response.choices:
|
|
@@ -146,15 +132,10 @@ class MistralAdapter(BaseLLMAdapter):
|
|
| 146 |
raw = _choice.message.content
|
| 147 |
_finish_reason = _choice.finish_reason
|
| 148 |
|
| 149 |
-
#
|
| 150 |
-
#
|
| 151 |
-
|
| 152 |
-
|
| 153 |
-
chunk.text if hasattr(chunk, "text") else str(chunk)
|
| 154 |
-
for chunk in raw
|
| 155 |
-
)
|
| 156 |
-
|
| 157 |
-
text = raw or ""
|
| 158 |
|
| 159 |
_completion_tokens = None
|
| 160 |
if hasattr(response, "usage") and response.usage:
|
|
|
|
| 6 |
import os
|
| 7 |
from typing import Optional
|
| 8 |
|
| 9 |
+
from picarones.llm.base import (
|
| 10 |
+
BaseLLMAdapter,
|
| 11 |
+
log_http_error,
|
| 12 |
+
normalize_llm_content,
|
| 13 |
+
)
|
| 14 |
|
| 15 |
logger = logging.getLogger(__name__)
|
| 16 |
|
|
|
|
| 40 |
pas le mode multimodal — utiliser ``PipelineMode.TEXT_ONLY`` avec ces modèles.
|
| 41 |
"""
|
| 42 |
|
| 43 |
+
api_key_env_var = "MISTRAL_API_KEY"
|
| 44 |
+
|
| 45 |
@property
|
| 46 |
def name(self) -> str:
|
| 47 |
return "mistral"
|
|
|
|
| 115 |
max_tokens=max_tokens,
|
| 116 |
)
|
| 117 |
except Exception as exc:
|
| 118 |
+
log_http_error(
|
| 119 |
+
"MistralAdapter", self.model, exc,
|
| 120 |
+
env_var=self.api_key_env_var,
|
| 121 |
+
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 122 |
raise
|
| 123 |
|
| 124 |
if not response.choices:
|
|
|
|
| 132 |
raw = _choice.message.content
|
| 133 |
_finish_reason = _choice.finish_reason
|
| 134 |
|
| 135 |
+
# Chantier 4 — normalisation factorisée dans
|
| 136 |
+
# ``picarones.llm.base.normalize_llm_content`` (Sprint 15
|
| 137 |
+
# généralisé : list[ContentChunk] / list[dict] / str → str).
|
| 138 |
+
text = normalize_llm_content(raw)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 139 |
|
| 140 |
_completion_tokens = None
|
| 141 |
if hasattr(response, "usage") and response.usage:
|
|
@@ -6,7 +6,7 @@ import logging
|
|
| 6 |
from typing import Optional
|
| 7 |
from urllib.parse import urlparse
|
| 8 |
|
| 9 |
-
from picarones.llm.base import BaseLLMAdapter
|
| 10 |
|
| 11 |
logger = logging.getLogger(__name__)
|
| 12 |
|
|
@@ -98,7 +98,10 @@ class OllamaAdapter(BaseLLMAdapter):
|
|
| 98 |
f"Réponse JSON invalide du serveur Ollama : {exc}"
|
| 99 |
) from exc
|
| 100 |
|
| 101 |
-
|
|
|
|
|
|
|
|
|
|
| 102 |
if not text:
|
| 103 |
logger.warning(
|
| 104 |
"[OllamaAdapter] réponse vide (modèle=%s).", self.model,
|
|
|
|
| 6 |
from typing import Optional
|
| 7 |
from urllib.parse import urlparse
|
| 8 |
|
| 9 |
+
from picarones.llm.base import BaseLLMAdapter, normalize_llm_content
|
| 10 |
|
| 11 |
logger = logging.getLogger(__name__)
|
| 12 |
|
|
|
|
| 98 |
f"Réponse JSON invalide du serveur Ollama : {exc}"
|
| 99 |
) from exc
|
| 100 |
|
| 101 |
+
# Chantier 4 — propagation du fix Sprint 15 : Ollama retourne
|
| 102 |
+
# ``response`` en string mais on normalise par défense (cas où
|
| 103 |
+
# un futur build retournerait un format structuré).
|
| 104 |
+
text = normalize_llm_content(result.get("response", ""))
|
| 105 |
if not text:
|
| 106 |
logger.warning(
|
| 107 |
"[OllamaAdapter] réponse vide (modèle=%s).", self.model,
|
|
@@ -6,7 +6,11 @@ import logging
|
|
| 6 |
import os
|
| 7 |
from typing import Optional
|
| 8 |
|
| 9 |
-
from picarones.llm.base import
|
|
|
|
|
|
|
|
|
|
|
|
|
| 10 |
|
| 11 |
logger = logging.getLogger(__name__)
|
| 12 |
|
|
@@ -19,6 +23,8 @@ class OpenAIAdapter(BaseLLMAdapter):
|
|
| 19 |
Modes supportés : text_only, text_and_image, zero_shot.
|
| 20 |
"""
|
| 21 |
|
|
|
|
|
|
|
| 22 |
@property
|
| 23 |
def name(self) -> str:
|
| 24 |
return "openai"
|
|
@@ -70,21 +76,10 @@ class OpenAIAdapter(BaseLLMAdapter):
|
|
| 70 |
max_tokens=max_tokens,
|
| 71 |
)
|
| 72 |
except Exception as exc:
|
| 73 |
-
|
| 74 |
-
|
| 75 |
-
|
| 76 |
-
|
| 77 |
-
self.model,
|
| 78 |
-
)
|
| 79 |
-
elif status_code == 429:
|
| 80 |
-
logger.warning(
|
| 81 |
-
"[OpenAIAdapter] erreur HTTP 429 — rate limit (modèle=%s).",
|
| 82 |
-
self.model,
|
| 83 |
-
)
|
| 84 |
-
else:
|
| 85 |
-
logger.warning(
|
| 86 |
-
"[OpenAIAdapter] erreur API (modèle=%s) : %s", self.model, exc,
|
| 87 |
-
)
|
| 88 |
raise
|
| 89 |
|
| 90 |
if not response.choices:
|
|
@@ -92,4 +87,8 @@ class OpenAIAdapter(BaseLLMAdapter):
|
|
| 92 |
"[OpenAIAdapter] response.choices vide (modèle=%s).", self.model,
|
| 93 |
)
|
| 94 |
return ""
|
| 95 |
-
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 6 |
import os
|
| 7 |
from typing import Optional
|
| 8 |
|
| 9 |
+
from picarones.llm.base import (
|
| 10 |
+
BaseLLMAdapter,
|
| 11 |
+
log_http_error,
|
| 12 |
+
normalize_llm_content,
|
| 13 |
+
)
|
| 14 |
|
| 15 |
logger = logging.getLogger(__name__)
|
| 16 |
|
|
|
|
| 23 |
Modes supportés : text_only, text_and_image, zero_shot.
|
| 24 |
"""
|
| 25 |
|
| 26 |
+
api_key_env_var = "OPENAI_API_KEY"
|
| 27 |
+
|
| 28 |
@property
|
| 29 |
def name(self) -> str:
|
| 30 |
return "openai"
|
|
|
|
| 76 |
max_tokens=max_tokens,
|
| 77 |
)
|
| 78 |
except Exception as exc:
|
| 79 |
+
log_http_error(
|
| 80 |
+
"OpenAIAdapter", self.model, exc,
|
| 81 |
+
env_var=self.api_key_env_var,
|
| 82 |
+
)
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 83 |
raise
|
| 84 |
|
| 85 |
if not response.choices:
|
|
|
|
| 87 |
"[OpenAIAdapter] response.choices vide (modèle=%s).", self.model,
|
| 88 |
)
|
| 89 |
return ""
|
| 90 |
+
# Chantier 4 — propagation du fix Sprint 15 : le SDK OpenAI
|
| 91 |
+
# peut retourner une ``list[ContentBlock]`` selon l'API
|
| 92 |
+
# (Responses, structured outputs). ``normalize_llm_content``
|
| 93 |
+
# gère les deux cas (str et list).
|
| 94 |
+
return normalize_llm_content(response.choices[0].message.content)
|
|
@@ -0,0 +1,277 @@
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| 1 |
+
"""Tests du chantier 4 (post-Sprint 97) : LLM + Gallica/IIIF + CLI workflows.
|
| 2 |
+
|
| 3 |
+
Couvre :
|
| 4 |
+
|
| 5 |
+
- Sous-chantier 4.A : ``normalize_llm_content`` + ``log_http_error``
|
| 6 |
+
factorisés dans :mod:`picarones.llm.base`, propagés aux 4 adapters.
|
| 7 |
+
- Sous-chantier 4.B : helpers HTTP factorisés dans
|
| 8 |
+
:mod:`picarones.importers._http`, Gallica et IIIF y délèguent.
|
| 9 |
+
- Sous-chantier 4.C : 3 nouvelles sous-commandes CLI ``diagnose``,
|
| 10 |
+
``economics``, ``edition`` qui mappent un profil de calcul
|
| 11 |
+
(chantier 2) à un workflow.
|
| 12 |
+
"""
|
| 13 |
+
|
| 14 |
+
from __future__ import annotations
|
| 15 |
+
|
| 16 |
+
import pytest
|
| 17 |
+
|
| 18 |
+
|
| 19 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 20 |
+
# 4.A — LLM base helpers
|
| 21 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 22 |
+
|
| 23 |
+
|
| 24 |
+
class TestNormalizeLlmContent:
|
| 25 |
+
def test_str_passes_through(self):
|
| 26 |
+
from picarones.llm.base import normalize_llm_content
|
| 27 |
+
|
| 28 |
+
assert normalize_llm_content("hello") == "hello"
|
| 29 |
+
# Idempotence : retourne l'objet exact pour str
|
| 30 |
+
s = "test"
|
| 31 |
+
assert normalize_llm_content(s) is s
|
| 32 |
+
|
| 33 |
+
def test_none_returns_empty(self):
|
| 34 |
+
from picarones.llm.base import normalize_llm_content
|
| 35 |
+
|
| 36 |
+
assert normalize_llm_content(None) == ""
|
| 37 |
+
|
| 38 |
+
def test_empty_string_passes(self):
|
| 39 |
+
from picarones.llm.base import normalize_llm_content
|
| 40 |
+
|
| 41 |
+
assert normalize_llm_content("") == ""
|
| 42 |
+
|
| 43 |
+
def test_list_of_chunks_with_text_attr(self):
|
| 44 |
+
"""Cas Mistral SDK : list[ContentChunk]. Sprint 15 fix."""
|
| 45 |
+
from picarones.llm.base import normalize_llm_content
|
| 46 |
+
|
| 47 |
+
class MockChunk:
|
| 48 |
+
def __init__(self, text):
|
| 49 |
+
self.text = text
|
| 50 |
+
|
| 51 |
+
result = normalize_llm_content([MockChunk("hello "), MockChunk("world")])
|
| 52 |
+
assert result == "hello world"
|
| 53 |
+
|
| 54 |
+
def test_list_of_dicts_with_text_key(self):
|
| 55 |
+
"""Cas Anthropic SDK : list[dict] avec clé 'text'."""
|
| 56 |
+
from picarones.llm.base import normalize_llm_content
|
| 57 |
+
|
| 58 |
+
result = normalize_llm_content([{"text": "a"}, {"text": "b"}])
|
| 59 |
+
assert result == "ab"
|
| 60 |
+
|
| 61 |
+
def test_list_of_strings(self):
|
| 62 |
+
from picarones.llm.base import normalize_llm_content
|
| 63 |
+
|
| 64 |
+
assert normalize_llm_content(["foo", "bar"]) == "foobar"
|
| 65 |
+
|
| 66 |
+
def test_mixed_list(self):
|
| 67 |
+
from picarones.llm.base import normalize_llm_content
|
| 68 |
+
|
| 69 |
+
class MockChunk:
|
| 70 |
+
def __init__(self, text):
|
| 71 |
+
self.text = text
|
| 72 |
+
|
| 73 |
+
result = normalize_llm_content([
|
| 74 |
+
MockChunk("a"), "b", {"text": "c"},
|
| 75 |
+
])
|
| 76 |
+
assert result == "abc"
|
| 77 |
+
|
| 78 |
+
def test_none_in_list_skipped(self):
|
| 79 |
+
from picarones.llm.base import normalize_llm_content
|
| 80 |
+
|
| 81 |
+
assert normalize_llm_content([None, "a", None, "b"]) == "ab"
|
| 82 |
+
|
| 83 |
+
def test_object_with_text_attribute(self):
|
| 84 |
+
from picarones.llm.base import normalize_llm_content
|
| 85 |
+
|
| 86 |
+
class TextHolder:
|
| 87 |
+
text = "hello"
|
| 88 |
+
assert normalize_llm_content(TextHolder()) == "hello"
|
| 89 |
+
|
| 90 |
+
|
| 91 |
+
class TestLogHttpError:
|
| 92 |
+
def test_401_logs_invalid_key(self, caplog):
|
| 93 |
+
from picarones.llm.base import log_http_error
|
| 94 |
+
|
| 95 |
+
class FakeExc(Exception):
|
| 96 |
+
status_code = 401
|
| 97 |
+
|
| 98 |
+
with caplog.at_level("WARNING"):
|
| 99 |
+
log_http_error("OpenAIAdapter", "gpt-4o", FakeExc("Unauthorized"),
|
| 100 |
+
env_var="OPENAI_API_KEY")
|
| 101 |
+
assert any("401" in r.message and "OPENAI_API_KEY" in r.message
|
| 102 |
+
for r in caplog.records)
|
| 103 |
+
|
| 104 |
+
def test_429_logs_rate_limit(self, caplog):
|
| 105 |
+
from picarones.llm.base import log_http_error
|
| 106 |
+
|
| 107 |
+
class FakeExc(Exception):
|
| 108 |
+
status_code = 429
|
| 109 |
+
|
| 110 |
+
with caplog.at_level("WARNING"):
|
| 111 |
+
log_http_error("MistralAdapter", "mistral-large", FakeExc("Too Many"))
|
| 112 |
+
assert any("429" in r.message and "rate" in r.message.lower()
|
| 113 |
+
for r in caplog.records)
|
| 114 |
+
|
| 115 |
+
def test_5xx_logs_server_error(self, caplog):
|
| 116 |
+
from picarones.llm.base import log_http_error
|
| 117 |
+
|
| 118 |
+
class FakeExc(Exception):
|
| 119 |
+
status_code = 503
|
| 120 |
+
|
| 121 |
+
with caplog.at_level("WARNING"):
|
| 122 |
+
log_http_error("AnthropicAdapter", "claude-sonnet", FakeExc("Service unavailable"))
|
| 123 |
+
assert any("503" in r.message and "serveur" in r.message.lower()
|
| 124 |
+
for r in caplog.records)
|
| 125 |
+
|
| 126 |
+
def test_no_status_code_logs_generic(self, caplog):
|
| 127 |
+
from picarones.llm.base import log_http_error
|
| 128 |
+
|
| 129 |
+
with caplog.at_level("WARNING"):
|
| 130 |
+
log_http_error("Foo", "bar", ValueError("random"))
|
| 131 |
+
# Doit produire un warning (générique)
|
| 132 |
+
assert any("Foo" in r.message for r in caplog.records)
|
| 133 |
+
|
| 134 |
+
|
| 135 |
+
class TestLlmAdaptersInheritEnvVar:
|
| 136 |
+
"""Le chantier 4 a ajouté ``api_key_env_var`` aux 3 adapters cloud."""
|
| 137 |
+
|
| 138 |
+
def test_mistral_declares_env_var(self):
|
| 139 |
+
from picarones.llm.mistral_adapter import MistralAdapter
|
| 140 |
+
assert MistralAdapter.api_key_env_var == "MISTRAL_API_KEY"
|
| 141 |
+
|
| 142 |
+
def test_openai_declares_env_var(self):
|
| 143 |
+
from picarones.llm.openai_adapter import OpenAIAdapter
|
| 144 |
+
assert OpenAIAdapter.api_key_env_var == "OPENAI_API_KEY"
|
| 145 |
+
|
| 146 |
+
def test_anthropic_declares_env_var(self):
|
| 147 |
+
from picarones.llm.anthropic_adapter import AnthropicAdapter
|
| 148 |
+
assert AnthropicAdapter.api_key_env_var == "ANTHROPIC_API_KEY"
|
| 149 |
+
|
| 150 |
+
def test_ollama_no_env_var(self):
|
| 151 |
+
"""Ollama est local — pas de clé API."""
|
| 152 |
+
from picarones.llm.ollama_adapter import OllamaAdapter
|
| 153 |
+
assert OllamaAdapter.api_key_env_var is None
|
| 154 |
+
|
| 155 |
+
|
| 156 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 157 |
+
# 4.B — Helpers HTTP factorisés (Gallica → IIIF fusion)
|
| 158 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 159 |
+
|
| 160 |
+
|
| 161 |
+
class TestHttpHelpers:
|
| 162 |
+
def test_validate_http_url_accepts_https(self):
|
| 163 |
+
from picarones.importers._http import validate_http_url
|
| 164 |
+
validate_http_url("https://gallica.bnf.fr/test") # ne lève pas
|
| 165 |
+
|
| 166 |
+
def test_validate_http_url_accepts_http(self):
|
| 167 |
+
from picarones.importers._http import validate_http_url
|
| 168 |
+
validate_http_url("http://localhost:8080/x")
|
| 169 |
+
|
| 170 |
+
@pytest.mark.parametrize("scheme", ["file", "ftp", "data", "javascript", "ssh"])
|
| 171 |
+
def test_validate_http_url_rejects_other_schemes(self, scheme):
|
| 172 |
+
from picarones.importers._http import validate_http_url
|
| 173 |
+
with pytest.raises(ValueError, match="non autorisé"):
|
| 174 |
+
validate_http_url(f"{scheme}://example.com/x")
|
| 175 |
+
|
| 176 |
+
|
| 177 |
+
class TestIiifAliasesDelegateToHttp:
|
| 178 |
+
"""Les noms ``_validate_url`` et ``_download_url`` exposés depuis
|
| 179 |
+
:mod:`picarones.importers.iiif` doivent rester disponibles
|
| 180 |
+
(rétrocompat des tests Sprint 4) — ils délèguent aux helpers
|
| 181 |
+
factorisés."""
|
| 182 |
+
|
| 183 |
+
def test_iiif_validate_url_is_alias(self):
|
| 184 |
+
from picarones.importers import iiif
|
| 185 |
+
from picarones.importers._http import validate_http_url
|
| 186 |
+
assert iiif._validate_url is validate_http_url
|
| 187 |
+
|
| 188 |
+
def test_iiif_download_url_is_alias(self):
|
| 189 |
+
from picarones.importers import iiif
|
| 190 |
+
from picarones.importers._http import download_url
|
| 191 |
+
assert iiif._download_url is download_url
|
| 192 |
+
|
| 193 |
+
|
| 194 |
+
class TestGallicaDelegatesToHttp:
|
| 195 |
+
def test_gallica_validate_url_delegates(self):
|
| 196 |
+
from picarones.importers.gallica import GallicaClient
|
| 197 |
+
client = GallicaClient()
|
| 198 |
+
# Doit accepter https
|
| 199 |
+
client._validate_url("https://gallica.bnf.fr/x")
|
| 200 |
+
# Doit rejeter un schéma invalide via le helper factorisé
|
| 201 |
+
with pytest.raises(ValueError, match="non autorisé"):
|
| 202 |
+
client._validate_url("file:///etc/passwd")
|
| 203 |
+
|
| 204 |
+
def test_gallica_uses_iiif_for_image_download(self):
|
| 205 |
+
"""``GallicaClient.import_document`` délègue à IIIFImporter."""
|
| 206 |
+
# Lecture statique du source — pas d'appel réseau
|
| 207 |
+
from pathlib import Path
|
| 208 |
+
gallica_src = (
|
| 209 |
+
Path(__file__).parent.parent
|
| 210 |
+
/ "picarones" / "importers" / "gallica.py"
|
| 211 |
+
).read_text(encoding="utf-8")
|
| 212 |
+
# Confirme que Gallica importe IIIFImporter
|
| 213 |
+
assert "from picarones.importers.iiif import IIIFImporter" in gallica_src
|
| 214 |
+
|
| 215 |
+
|
| 216 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 217 |
+
# 4.C — Workflows CLI dédiés
|
| 218 |
+
# ──────────────────────────────────────────────────────────────────────────
|
| 219 |
+
|
| 220 |
+
|
| 221 |
+
class TestCliWorkflows:
|
| 222 |
+
def test_three_new_commands_registered(self):
|
| 223 |
+
from pathlib import Path
|
| 224 |
+
|
| 225 |
+
cli_src = (
|
| 226 |
+
Path(__file__).parent.parent / "picarones" / "cli.py"
|
| 227 |
+
).read_text(encoding="utf-8")
|
| 228 |
+
# Vérification statique : les 3 commandes existent
|
| 229 |
+
assert '@cli.command("diagnose")' in cli_src
|
| 230 |
+
assert '@cli.command("economics")' in cli_src
|
| 231 |
+
assert '@cli.command("edition")' in cli_src
|
| 232 |
+
assert "def diagnose_cmd(" in cli_src
|
| 233 |
+
assert "def economics_cmd(" in cli_src
|
| 234 |
+
assert "def edition_cmd(" in cli_src
|
| 235 |
+
|
| 236 |
+
def test_workflows_map_correct_profile(self):
|
| 237 |
+
from pathlib import Path
|
| 238 |
+
cli_src = (
|
| 239 |
+
Path(__file__).parent.parent / "picarones" / "cli.py"
|
| 240 |
+
).read_text(encoding="utf-8")
|
| 241 |
+
# Chaque commande doit fixer le bon profil
|
| 242 |
+
# diagnose → diagnostics, economics → economics, edition → philological
|
| 243 |
+
assert 'profile="diagnostics"' in cli_src
|
| 244 |
+
assert 'profile="economics"' in cli_src
|
| 245 |
+
assert 'profile="philological"' in cli_src
|
| 246 |
+
|
| 247 |
+
def test_run_workflow_helper_exists(self):
|
| 248 |
+
"""Le helper commun ``_run_workflow`` factorise la logique des
|
| 249 |
+
4 commandes (run + diagnose + economics + edition) — un seul
|
| 250 |
+
endroit pour patcher si la logique évolue."""
|
| 251 |
+
import ast
|
| 252 |
+
from pathlib import Path
|
| 253 |
+
|
| 254 |
+
cli_src = (
|
| 255 |
+
Path(__file__).parent.parent / "picarones" / "cli.py"
|
| 256 |
+
).read_text(encoding="utf-8")
|
| 257 |
+
tree = ast.parse(cli_src)
|
| 258 |
+
funcs = {
|
| 259 |
+
n.name for n in ast.walk(tree) if isinstance(n, ast.FunctionDef)
|
| 260 |
+
}
|
| 261 |
+
assert "_run_workflow" in funcs
|
| 262 |
+
|
| 263 |
+
@pytest.mark.parametrize("cmd_name", ["diagnose", "economics", "edition"])
|
| 264 |
+
def test_command_help_works(self, cmd_name):
|
| 265 |
+
"""Les 3 commandes répondent à --help sans crash."""
|
| 266 |
+
try:
|
| 267 |
+
from click.testing import CliRunner
|
| 268 |
+
|
| 269 |
+
from picarones.cli import cli as cli_group
|
| 270 |
+
except ImportError:
|
| 271 |
+
pytest.skip("click non installé")
|
| 272 |
+
|
| 273 |
+
runner = CliRunner()
|
| 274 |
+
result = runner.invoke(cli_group, [cmd_name, "--help"])
|
| 275 |
+
assert result.exit_code == 0, result.output
|
| 276 |
+
assert "--corpus" in result.output
|
| 277 |
+
assert "--engines" in result.output
|