Spaces: Ma-Ri-Ba-Ku / Picarones

Claude committed on May 6

Commit 19e1a5d

1 Parent(s): 27d155d

feat(app/services,interfaces/web): Sprint A14-S48 — JobRunner + POST /api/jobs (fix audit #2)

The audit found that JobStore (S37) was only half wired up:
- POST /api/jobs was missing, so a job could not be created through the API.
- mark_orphaned_jobs_interrupted() was documented but never called at boot.
- There was no async orchestrator to drive the jobs.

This sprint closes all three gaps.

picarones/app/services/job_runner.py (new, 256 lines)
-----------------------------------------------------
Application service that bridges the web API and RunOrchestrator:

- JobRunner(job_store, orchestrator_factory, report_renderer=None).
- submit(run_spec, output_dir, job_id=None, payload=None) -> job_id:
  · creates a JobRecord (pending);
  · spawns a threading.Thread(daemon=True);
  · returns immediately.
- _run (worker thread):
  · checks the status before starting (skips if cancelled before start);
  · calls mark_running, then orchestrator.execute();
  · catches every exception → mark_error with "{type}: {msg}";
  · checks the status after execution (cancelled during the run →
    result discarded, status stays cancelled);
  · otherwise mark_complete with output_path = persisted manifest.
- wait(job_id, timeout): helper for tests.

Cancellation is cooperative and best-effort: DELETE /api/jobs/{id} marks
the job cancelled, and the worker observes this at 2 checkpoints (before
and after execute). While the long-running orchestrator.execute call is
in progress, the worker cannot interrupt it (the orchestrator has no
native cancel_event yet; this is a future improvement).
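
One possible shape for that future improvement is sketched below, assuming the long-running work can be split into units with a check between them. `RunCancelled` and `execute_with_cancel` are hypothetical names; nothing here reflects the actual RunOrchestrator signature.

```python
import threading
from typing import Any, Callable, Iterable


class RunCancelled(Exception):
    """Hypothetical exception raised when a run stops on request."""


def execute_with_cancel(
    documents: Iterable[Any],
    process: Callable[[Any], Any],
    cancel_event: threading.Event,
) -> list[Any]:
    """Process documents one by one, checking cancel_event between units."""
    results: list[Any] = []
    for doc in documents:
        # Cooperative check: the run stops at the next unit boundary,
        # never in the middle of a process(doc) call.
        if cancel_event.is_set():
            raise RunCancelled(f"cancelled after {len(results)} document(s)")
        results.append(process(doc))
    return results
```

With this shape, the DELETE handler would set the event in addition to marking the job cancelled, and the worker would stop at the next document instead of running to completion and discarding the result.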

picarones/interfaces/web/app.py (modified)
------------------------------------------
- WebAppState: new optional field job_runner: JobRunner | None.
- Lifespan hook (@asynccontextmanager): at boot, calls
  state.job_store.mark_orphaned_jobs_interrupted() if a store is
  configured and logs the count; any exception (for example from
  sqlite) is tolerated: it is logged as an error and the app still starts.
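
The boot-time cleanup can be illustrated with the standard library alone. This is a sketch under assumptions: `FakeJobStore`, `make_lifespan` and `boot_once` are hypothetical, and the real hook is defined inside create_app and passed as `FastAPI(lifespan=...)`.

```python
import asyncio
import logging
from contextlib import asynccontextmanager

logger = logging.getLogger(__name__)


class FakeJobStore:
    """Hypothetical in-memory store exposing the one method the hook needs."""

    def __init__(self, statuses: dict[str, str]) -> None:
        self.statuses = dict(statuses)

    def mark_orphaned_jobs_interrupted(self) -> int:
        count = 0
        for job_id, status in self.statuses.items():
            if status in ("pending", "running"):
                self.statuses[job_id] = "interrupted"
                count += 1
        return count


def make_lifespan(job_store):
    """Build a lifespan context manager bound to an optional store."""

    @asynccontextmanager
    async def lifespan(app):
        if job_store is not None:
            try:
                count = job_store.mark_orphaned_jobs_interrupted()
                if count > 0:
                    logger.info("%d orphaned job(s) marked interrupted.", count)
            except Exception as exc:
                # A broken store must not prevent the app from starting.
                logger.error("orphan cleanup failed: %s", exc)
        yield  # the application serves requests while suspended here

    return lifespan


async def boot_once(job_store) -> dict[str, str]:
    """Enter the lifespan once, the way an ASGI server would at startup."""
    async with make_lifespan(job_store)(None):
        return dict(job_store.statuses)
```

Running the cleanup before the `yield` means it completes before the first request is served, so the dashboard never shows a job as running when no process is working on it.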

picarones/interfaces/web/routers/jobs.py (modified)
---------------------------------------------------
- New endpoint POST /api/jobs with status_code=202:
  · accepts the YAML of a RunSpec as the raw body (Body(media_type="text/plain"));
  · rejects an empty body → 400;
  · parses and validates via load_run_spec_from_yaml → 400 if invalid;
  · output_dir = workspace.root / "runs" / {job_id};
  · delegates to state.job_runner.submit;
  · returns JobSubmitResponse with job_id + status="pending";
  · 503 if job_runner is not configured.
- Helper _require_job_runner(state), symmetric to _require_job_store.
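
From the client side, a submission might look like the sketch below, which only builds the request and parses the 202 response. The base URL is an assumption, `build_submit_request` and `parse_submit_response` are hypothetical helper names, and the YAML text is whatever RunSpec the caller already has.

```python
import json
import urllib.request


def build_submit_request(base_url: str, run_spec_yaml: str) -> urllib.request.Request:
    """Build a POST /api/jobs request carrying the raw RunSpec YAML."""
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/api/jobs",
        data=run_spec_yaml.encode("utf-8"),
        # The endpoint declares a text/plain body, not JSON.
        headers={"Content-Type": "text/plain"},
        method="POST",
    )


def parse_submit_response(body: bytes) -> str:
    """Extract job_id from the JobSubmitResponse JSON payload."""
    payload = json.loads(body.decode("utf-8"))
    return payload["job_id"]
```

Sending the request with `urllib.request.urlopen(req)` against a running server should yield a 202 whose body feeds `parse_submit_response`; a 400 indicates an empty or invalid RunSpec, and a 503 means no job runner is configured.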

Dedicated S48 tests (13 new)
----------------------------
- TestJobRunnerConstructor: rejects a non-JobStore, a non-callable factory,
  and a non-callable renderer.
- TestJobRunnerHappyPath: submit → mark_complete with output_path,
  unique UUID4 when no job_id is given, explicit job_id honored.
- TestJobRunnerErrorPath: orchestrator exception → mark_error with
  type + message.
- TestJobRunnerCancellation: cancel during execution → result
  discarded, status stays cancelled.
- TestLifespanHook: 2 zombie jobs (pending + running) at startup
  → marked interrupted; jobs already complete left untouched; /health
  responds.
- TestPostJobsEndpoint: valid YAML → 202 with job_id, invalid YAML
  → 400, empty body → 400 or 422 (Pydantic), no runner → 503.

Tests: 4933 passed, 11 skipped (vs 4920 before: +13 for S48).
Lint: ruff check picarones/ tests/ → All checks passed.

Why this fix now
----------------
The *"no technical debt"* directive required all shipped code to be
usable end to end. For jobs, S37 shipped persistence and reads
but not creation. Users could NOT submit a benchmark through the
API, even though the legacy picarones/web/jobs.py allowed it.

The S48 wiring finally delivers the implicit promise of a complete
jobs API: POST → 202, GET → status, DELETE → cancel.
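
Since there is no push notification, a client follows a job by polling. The loop below is a minimal sketch: `poll_until_terminal` is a hypothetical helper, `fetch_status` is whatever callable the client uses to read the status from GET /api/jobs/{job_id}, and the terminal set is taken from the statuses named in this commit.

```python
import time
from typing import Callable

# Statuses after which a job no longer changes, per the commit description.
TERMINAL_STATUSES = frozenset({"complete", "error", "cancelled", "interrupted"})


def poll_until_terminal(
    fetch_status: Callable[[], str],
    *,
    interval: float = 1.0,
    timeout: float = 600.0,
) -> str:
    """Call fetch_status until it returns a terminal status or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout} s")
        time.sleep(interval)
```

Injecting `fetch_status` keeps the loop independent of any HTTP library and easy to test with a canned sequence of statuses.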

https://claude.ai/code/session_011XQZNitg1rCgia8ZD1a2hP

Files changed (6)

README.md +1 -1
picarones/app/services/__init__.py +2 -0
picarones/app/services/job_runner.py +256 -0
picarones/interfaces/web/app.py +30 -1
picarones/interfaces/web/routers/jobs.py +121 -9
tests/app/services/test_sprint_a14_s48_job_runner.py +380 -0

README.md CHANGED

@@ -396,7 +396,7 @@ ruff check picarones/ tests/
 python -m mypy picarones/core/
 ```
-**Test suite**: ~4940 tests, ~3 min on a modern laptop. Coverage
+**Test suite**: ~4950 tests, ~3 min on a modern laptop. Coverage
 floor at 85% (currently ~87%). The `network` marker excludes tests
 requiring live HTTP. A handful of tests depend on optional engines
 (`pero-ocr`, `pytesseract`) and are skipped/fail gracefully when

picarones/app/services/__init__.py CHANGED

@@ -34,6 +34,7 @@ from picarones.app.services.corpus_service import (
     CorpusImportReport,
     CorpusService,
 )
+from picarones.app.services.job_runner import JobRunner
 from picarones.app.services.path_security import (
     PathValidationError,
     WorkspaceManager,
@@ -63,6 +64,7 @@ __all__ = [
     "CorpusImportReport",
     "CorpusService",
     "GroundTruthFactory",
+    "JobRunner",
     "OrchestrationResult",
     "PathValidationError",
     "PipelineInputsFactory",

picarones/app/services/job_runner.py ADDED

@@ -0,0 +1,256 @@

+"""``JobRunner`` (Sprint A14-S48).
+
+Fix for audit #2: before this sprint, ``JobStore`` (S37) existed with its
+``GET / DELETE /api/jobs`` endpoints, but there was no way to **create**
+a job through the API: no ``POST /api/jobs`` and no async orchestrator
+to drive the jobs.  ``mark_orphaned_jobs_interrupted()`` was
+documented but never called at boot.
+
+``JobRunner`` is the missing bridge between the web API and
+``RunOrchestrator``.  It:
+
+1. Creates a ``JobRecord`` in the ``JobStore`` (status ``pending``).
+2. Starts a **daemon thread** that runs the orchestrator
+   synchronously.
+3. Updates the status as it goes: ``running`` at start,
+   ``complete`` or ``error`` at the end.
+4. If the caller cancels via ``DELETE /api/jobs/{id}`` (which calls
+   ``store.mark_cancelled``), the thread sees it at the next check
+   and gives up; the partial result is discarded.
+
+Why a thread rather than asyncio
+--------------------------------
+``RunOrchestrator.execute`` is **synchronous** and uses an internal
+``ThreadPoolExecutor`` (``CorpusRunner``).  Wrapping it in asyncio
+would add needless complexity (mixed sync/async code, GIL).
+A ``threading.Thread(daemon=True)`` is the right tool here.
+
+Cooperative cancellation
+------------------------
+For S48, cancellation is **best-effort**: the thread checks
+``store.get(job_id).status == "cancelled"`` BEFORE and AFTER the call
+to ``orchestrator.execute``.  During execution (potentially
+several minutes), the thread cannot interrupt the orchestrator
+without native support (cf. ``CorpusRunner.run(cancel_event=...)``,
+not yet propagated up to ``RunOrchestrator``).
+
+Consequence: a ``DELETE /api/jobs/{id}`` while the thread is running
+marks the status as ``cancelled``, but the benchmark keeps going and
+its result is discarded at the end.  A future improvement
+would propagate the ``cancel_event`` down to the runner.
+
+Avoiding over-engineering
+-------------------------
+- No job queue with backpressure: one thread per submit.
+  For 100+ concurrent jobs, add a ``ThreadPoolExecutor`` at the
+  runner level.
+- No automatic retry on failure.
+- No SSE notification of status changes (the caller
+  polls ``GET /api/jobs/{id}``).
+"""
+from __future__ import annotations
+import logging
+import threading
+import uuid
+from pathlib import Path
+from typing import Any, Callable
+from picarones.adapters.storage import JobStore
+logger = logging.getLogger(__name__)
+# Factory: the caller supplies a callable that builds a
+# ``RunOrchestrator`` bound to a given ``output_dir``.  This inversion
+# keeps this module from importing ``RunOrchestrator`` directly
+# (potential import cycles) and lets tests inject a mock.
+OrchestratorFactory = Callable[[Path], Any]
+ReportRenderer = Callable[[Any, Path, str], Path]
+class JobRunner:
+    """Runs benchmark jobs in the background.
+
+    Parameters
+    ----------
+    job_store:
+        ``JobStore`` shared with the read endpoints
+        (``GET /api/jobs``, ``DELETE /api/jobs/{id}``).
+    orchestrator_factory:
+        Callable ``(output_dir: Path) -> RunOrchestrator`` that
+        builds one orchestrator per job, so that each job gets
+        its own isolated output_dir.
+    report_renderer:
+        Optional; passed to ``orchestrator.execute()`` to render
+        the HTML report.  If ``None``, no report is produced.
+
+    Notes
+    -----
+    The instance is thread-safe: ``submit`` is called from the
+    FastAPI thread, and the daemon thread writes to ``JobStore``, which
+    serializes its SQLite operations.
+    """
+    def __init__(
+        self,
+        job_store: JobStore,
+        orchestrator_factory: OrchestratorFactory,
+        report_renderer: ReportRenderer | None = None,
+    ) -> None:
+        if not isinstance(job_store, JobStore):
+            raise TypeError("job_store doit être un JobStore.")
+        if not callable(orchestrator_factory):
+            raise TypeError("orchestrator_factory doit être callable.")
+        if report_renderer is not None and not callable(report_renderer):
+            raise TypeError("report_renderer doit être callable ou None.")
+        self._store = job_store
+        self._factory = orchestrator_factory
+        self._report_renderer = report_renderer
+        # Tracks the threads that were started; useful for tests that
+        # wait for a submitted job to finish.
+        self._threads: dict[str, threading.Thread] = {}
+    # ──────────────────────────────────────────────────────────────────
+    # Public API
+    # ──────────────────────────────────────────────────────────────────
+    def submit(
+        self,
+        run_spec: Any,
+        output_dir: Path | str,
+        *,
+        job_id: str | None = None,
+        payload: dict | None = None,
+    ) -> str:
+        """Creates a job and starts running it in a background thread.
+
+        Returns
+        -------
+        str
+            ``job_id`` (generated if not provided).  Can be used to
+            query ``GET /api/jobs/{job_id}``.
+
+        Notes
+        -----
+        Not idempotent: without ``job_id``, every call creates a new
+        job under a fresh UUID4; with an explicit ``job_id`` that already
+        exists, ``JobStore.create`` raises ``JobStoreError``.
+        """
+        job_id = job_id or uuid.uuid4().hex
+        out_path = Path(output_dir)
+        # ``payload`` is serialized to JSON in the store; the job's
+        # output_dir is recorded in it for traceability.
+        record_payload = dict(payload or {})
+        record_payload.setdefault("output_dir", str(out_path))
+        self._store.create(job_id, payload=record_payload)
+        thread = threading.Thread(
+            target=self._run,
+            args=(job_id, run_spec, out_path),
+            daemon=True,
+            name=f"picarones-job-{job_id[:8]}",
+        )
+        self._threads[job_id] = thread
+        thread.start()
+        logger.info("[job_runner] job %s soumis (thread démarré).", job_id)
+        return job_id
+    def wait(self, job_id: str, timeout: float | None = None) -> bool:
+        """Waits for a job's thread to finish (useful in tests).
+
+        Returns
+        -------
+        bool
+            ``True`` if the thread has finished, ``False`` on timeout.
+        """
+        thread = self._threads.get(job_id)
+        if thread is None:
+            return True  # unknown job: treated as finished
+        thread.join(timeout=timeout)
+        return not thread.is_alive()
+    # ──────────────────────────────────────────────────────────────────
+    # Worker thread
+    # ──────────────────────────────────────────────────────────────────
+    def _run(
+        self,
+        job_id: str,
+        run_spec: Any,
+        output_dir: Path,
+    ) -> None:
+        """Logic executed in the daemon thread.  Catches every
+        exception and records it as an ``error`` status in the store.
+
+        Cooperative cancellation hooks:
+
+        - **Before** ``orchestrator.execute()``: if the status was
+          switched to ``cancelled`` between ``submit`` and the start
+          of the thread, execution is skipped.
+        - **After** ``orchestrator.execute()``: if the status was
+          switched to ``cancelled`` during execution, the result is
+          discarded (the status stays ``cancelled``).
+
+        Otherwise, the final status is ``complete`` or ``error``.
+        """
+        # 1. Pre-start check: cancelled before the thread got
+        #    to run?
+        rec = self._store.get(job_id)
+        if rec is None:
+            logger.warning(
+                "[job_runner] job %s introuvable au démarrage du "
+                "thread — abandon.", job_id,
+            )
+            return
+        if rec.status == "cancelled":
+            logger.info(
+                "[job_runner] job %s annulé avant démarrage — skip.",
+                job_id,
+            )
+            return
+        # 2. Mark as running.
+        try:
+            self._store.mark_running(job_id)
+        except Exception as exc:  # noqa: BLE001
+            logger.error(
+                "[job_runner] échec mark_running sur %s : %s — abandon.",
+                job_id, exc,
+            )
+            return
+        # 3. Actual execution.
+        try:
+            orchestrator = self._factory(output_dir)
+            result = orchestrator.execute(
+                run_spec,
+                report_renderer=self._report_renderer,
+            )
+        except Exception as exc:  # noqa: BLE001
+            error_msg = f"{type(exc).__name__}: {exc}"
+            logger.error(
+                "[job_runner] job %s en échec : %s",
+                job_id, error_msg,
+            )
+            self._store.mark_error(job_id, error_msg)
+            return
+        # 4. Post-execution check: cancelled while the run was in progress?
+        rec_after = self._store.get(job_id)
+        if rec_after is not None and rec_after.status == "cancelled":
+            logger.info(
+                "[job_runner] job %s annulé pendant l'exécution — "
+                "résultat discardé.", job_id,
+            )
+            return
+        # 5. Success: output_path = path of the persisted manifest.
+        manifest_path = result.persisted_files.get("manifest")
+        output_path_str = str(manifest_path) if manifest_path else ""
+        self._store.mark_complete(job_id, output_path=output_path_str)
+        logger.info("[job_runner] job %s terminé avec succès.", job_id)
+__all__ = ["JobRunner", "OrchestratorFactory", "ReportRenderer"]

picarones/interfaces/web/app.py CHANGED

@@ -36,6 +36,8 @@ mount des fichiers statiques.
 from __future__ import annotations
+import logging
+from contextlib import asynccontextmanager
 from dataclasses import dataclass
 from pathlib import Path
@@ -45,10 +47,13 @@ from fastapi.staticfiles import StaticFiles
 from fastapi.templating import Jinja2Templates
 from pydantic import BaseModel
+_logger = logging.getLogger(__name__)
 from picarones.adapters.storage import JobStore
 from picarones.app.services import (
     BenchmarkService,
     CorpusService,
+    JobRunner,
     RegistryService,
     RunOrchestrator,
     WorkspaceManager,
@@ -98,6 +103,7 @@ class WebAppState:
     benchmark: BenchmarkService
     orchestrator: RunOrchestrator
     job_store: JobStore | None = None
+    job_runner: JobRunner | None = None
     version: str = "1.0.0"
@@ -141,15 +147,38 @@ def create_app(state: WebAppState) -> FastAPI:
             f"reçu {type(state).__name__}.",
         )
+    # Lifespan hook (S48): clean up zombie jobs at boot.
+    # Any job still ``pending`` or ``running`` when the process
+    # starts is necessarily orphaned (the previous process died
+    # before finishing it).  They are switched to ``interrupted`` so
+    # the dashboard does not show a misleading state.
+    @asynccontextmanager
+    async def _lifespan(_app: FastAPI):
+        if state.job_store is not None:
+            try:
+                n = state.job_store.mark_orphaned_jobs_interrupted()
+                if n > 0:
+                    _logger.info(
+                        "[lifespan] %d job(s) orphelin(s) marqué(s) "
+                        "interrupted au boot.", n,
+                    )
+            except Exception as exc:  # noqa: BLE001 - defense in depth
+                _logger.error(
+                    "[lifespan] mark_orphaned_jobs_interrupted ÉCHOUÉ "
+                    "— jobs zombies possibles : %s", exc,
+                )
+        yield
     app = FastAPI(
         title="Picarones",
         description=(
             "Plateforme de benchmark OCR/HTR pour documents patrimoniaux. "
-            "API du nouveau monde (Sprint A14-S35)."
+            "API du nouveau monde (Sprint A14-S35+)."
         ),
         version=state.version,
         docs_url="/api/docs",
         redoc_url="/api/redoc",
+        lifespan=_lifespan,
     )
     # On stocke l'état dans app.state.picarones pour permettre aux

picarones/interfaces/web/routers/jobs.py CHANGED

@@ -1,19 +1,18 @@
-"""Jobs router: Sprint A14-S37.
-Endpoints for listing, reading and cancelling benchmark jobs
-persisted via ``JobStore`` (S37, ``picarones.adapters.storage``).
+"""Jobs router: Sprints A14-S37 + S48.
+Endpoints for managing benchmark jobs, backed by
+``JobStore`` (S37) + ``JobRunner`` (S48).
 Endpoints
 ---------
 - ``GET    /api/jobs``            : list of jobs (most recent first).
 - ``GET    /api/jobs/{job_id}``   : detail + progress.
+- ``POST   /api/jobs``            : creation + asynchronous launch.
 - ``DELETE /api/jobs/{job_id}``   : explicit cancellation.
-The **POST /api/jobs** endpoint (creation + asynchronous launch) is
-deliberately deferred to a dedicated runtime-integration sprint:
-it needs an execution thread wired to ``RunOrchestrator``
-(beyond the scope of S37, which delivers persistence + the read
-endpoints).
+S37 (initial) delivered the two ``GET`` endpoints and ``DELETE``
+(read + cancellation).  S48 adds ``POST``, which the rewrite audit
+(audit #2) had flagged as a **critical gap**.
 Avoiding over-engineering
 -------------------------
@@ -28,9 +27,14 @@ from __future__ import annotations
 import logging
-from fastapi import APIRouter, HTTPException, Request, status
+from fastapi import APIRouter, Body, HTTPException, Request, status
 from pydantic import BaseModel, Field
+from picarones.app.schemas.run_spec import (
+    RunSpecLoadError,
+    load_run_spec_from_yaml,
+)
 logger = logging.getLogger(__name__)
@@ -82,6 +86,19 @@ class JobCancelResponse(BaseModel):
     status: str
+class JobSubmitResponse(BaseModel):
+    """JSON response for ``POST /api/jobs`` (202 Accepted)."""
+    job_id: str
+    status: str = Field(
+        default="pending",
+        description=(
+            "Statut au moment de la soumission.  Le client poll "
+            "``GET /api/jobs/{job_id}`` pour suivre la progression."
+        ),
+    )
 # ──────────────────────────────────────────────────────────────────────
 # Helpers
 # ──────────────────────────────────────────────────────────────────────
@@ -99,6 +116,19 @@ def _require_job_store(state) -> "object":
     return state.job_store
+def _require_job_runner(state) -> "object":
+    if state.job_runner is None:
+        raise HTTPException(
+            status_code=status.HTTP_503_SERVICE_UNAVAILABLE,
+            detail=(
+                "Job runner non configuré dans WebAppState — "
+                "l'exécution asynchrone des jobs n'est pas activée. "
+                "Voir picarones.app.services.JobRunner pour le câblage."
+            ),
+        )
+    return state.job_runner
 def _to_summary(rec) -> JobSummary:
     return JobSummary(
         job_id=rec.job_id,
@@ -135,6 +165,88 @@ def _to_detail(rec) -> JobDetailResponse:
 # ──────────────────────────────────────────────────────────────────────
+@router.post(
+    "",
+    response_model=JobSubmitResponse,
+    status_code=status.HTTP_202_ACCEPTED,
+)
+async def submit_job(
+    request: Request,
+    run_spec_yaml: str = Body(
+        ...,
+        media_type="text/plain",
+        description=(
+            "Contenu YAML d'un ``RunSpec`` (cf. picarones.app.schemas."
+            "run_spec).  Le corps de la requête est le YAML brut."
+        ),
+    ),
+) -> JobSubmitResponse:
+    """Creates a job and starts running it in the background (S48).
+    The request body is the raw YAML of a ``RunSpec`` (same
+    fields as those accepted by the ``picarones-rewrite run`` CLI).
+    Behavior:
+    1. The YAML is parsed and validated (``load_run_spec_from_yaml``).
+       Format error → 400 with the loader's message.
+    2. A ``JobRecord`` is created with status ``pending`` and a
+       UUID4 ``job_id``.
+    3. A daemon thread is started to run the ``RunOrchestrator``
+       with the ``RunSpec``.
+    4. Immediate ``202 Accepted`` response with the ``job_id``; the
+       client polls ``GET /api/jobs/{job_id}`` to follow progress.
+    Concurrency
+    -----------
+    One thread per job; no queue or backpressure.  For 100+ concurrent
+    jobs, add a ``ThreadPoolExecutor`` at the ``JobRunner``
+    level (post-delivery).
+    """
+    state = request.app.state.picarones
+    runner = _require_job_runner(state)
+    if not run_spec_yaml or not run_spec_yaml.strip():
+        raise HTTPException(
+            status_code=status.HTTP_400_BAD_REQUEST,
+            detail="Corps de la requête vide — YAML RunSpec attendu.",
+        )
+    try:
+        run_spec = load_run_spec_from_yaml(run_spec_yaml)
+    except RunSpecLoadError as exc:
+        raise HTTPException(
+            status_code=status.HTTP_400_BAD_REQUEST,
+            detail=f"RunSpec invalide : {exc}",
+        ) from exc
+    # Output dir: a dedicated per-job subfolder in the workspace.  The
+    # JobRunner uses it to build an isolated RunOrchestrator.
+    import uuid
+    job_id_candidate = uuid.uuid4().hex
+    output_dir = (
+        state.workspace.root / "runs" / job_id_candidate
+    )
+    try:
+        job_id = runner.submit(
+            run_spec=run_spec,
+            output_dir=output_dir,
+            job_id=job_id_candidate,
+            payload={"corpus_name": run_spec.corpus_name or ""},
+        )
+    except Exception as exc:  # noqa: BLE001
+        logger.error(
+            "[jobs] échec de submit pour run_spec : %s", exc, exc_info=True,
+        )
+        raise HTTPException(
+            status_code=status.HTTP_500_INTERNAL_SERVER_ERROR,
+            detail=f"Échec de soumission du job : {type(exc).__name__}",
+        ) from exc
+    return JobSubmitResponse(job_id=job_id, status="pending")
 @router.get("", response_model=JobListResponse)
 async def list_jobs(request: Request) -> JobListResponse:
     """Lists jobs (most recent first)."""

tests/app/services/test_sprint_a14_s48_job_runner.py ADDED

@@ -0,0 +1,380 @@

+"""Sprint A14-S48: ``JobRunner`` + lifespan hook + ``POST /api/jobs``.
+
+Fix for audit #2: before this sprint, ``JobStore`` (S37) was only half
+wired up: no ``POST /api/jobs``, no lifespan hook, no
+async orchestrator.
+
+The tests cover the 3 workstreams:
+
+1. ``JobRunner`` (application service):
+   - submit + thread started, job marked ``running`` then ``complete``;
+   - orchestrator exception → ``error`` with message;
+   - cancellation before start → the thread skips execution;
+   - cancellation after start → result discarded.
+2. Lifespan hook: ``mark_orphaned_jobs_interrupted`` called at boot.
+3. ``POST /api/jobs``:
+   - valid YAML → 202 + job_id;
+   - invalid YAML → 400;
+   - empty body → 400;
+   - no job_runner configured → 503.
+"""
+from __future__ import annotations
+import time
+from pathlib import Path
+from unittest.mock import MagicMock
+import pytest
+from fastapi.testclient import TestClient
+from picarones.adapters.storage import JobStore
+from picarones.app.services import JobRunner
+from picarones.app.services import (
+    RegistryService,
+    WorkspaceManager,
+)
+from picarones.interfaces.web import WebAppState, create_app
+# ──────────────────────────────────────────────────────────────────────
+# Stub orchestrator + factory
+# ──────────────────────────────────────────────────────────────────────
+class _StubOrchestrator:
+    """Stub that simulates an orchestrator: success, failure, or delay."""
+    def __init__(
+        self,
+        *,
+        manifest_path: Path,
+        delay_seconds: float = 0.0,
+        raise_on_execute: Exception | None = None,
+    ) -> None:
+        self.manifest_path = manifest_path
+        self.delay_seconds = delay_seconds
+        self.raise_on_execute = raise_on_execute
+        self.execute_called = False
+    def execute(self, run_spec, *, report_renderer=None):
+        self.execute_called = True
+        if self.delay_seconds:
+            time.sleep(self.delay_seconds)
+        if self.raise_on_execute is not None:
+            raise self.raise_on_execute
+        result = MagicMock()
+        result.persisted_files = {"manifest": self.manifest_path}
+        return result
+def _make_factory(stub: _StubOrchestrator):
+    """Returns a factory `(output_dir) -> stub` for JobRunner."""
+    def _factory(output_dir):
+        return stub
+    return _factory
+# ──────────────────────────────────────────────────────────────────────
+# JobRunner unit tests
+# ──────────────────────────────────────────────────────────────────────
+class TestJobRunnerConstructor:
+    def test_rejects_non_jobstore(self) -> None:
+        with pytest.raises(TypeError, match="JobStore"):
+            JobRunner(
+                job_store="nope",  # type: ignore[arg-type]
+                orchestrator_factory=lambda d: None,
+            )
+    def test_rejects_non_callable_factory(self, tmp_path: Path) -> None:
+        store = JobStore(tmp_path / "jobs.db")
+        with pytest.raises(TypeError, match="orchestrator_factory"):
+            JobRunner(
+                job_store=store,
+                orchestrator_factory="nope",  # type: ignore[arg-type]
+            )
+    def test_rejects_non_callable_renderer(self, tmp_path: Path) -> None:
+        store = JobStore(tmp_path / "jobs.db")
+        with pytest.raises(TypeError, match="report_renderer"):
+            JobRunner(
+                job_store=store,
+                orchestrator_factory=lambda d: None,
+                report_renderer="nope",  # type: ignore[arg-type]
+            )
+class TestJobRunnerHappyPath:
+    def test_submit_creates_job_and_marks_complete(self, tmp_path: Path) -> None:
+        store = JobStore(tmp_path / "jobs.db")
+        manifest = tmp_path / "manifest.json"
+        manifest.write_text("{}", encoding="utf-8")
+        stub = _StubOrchestrator(manifest_path=manifest)
+        runner = JobRunner(store, _make_factory(stub))
+        job_id = runner.submit(
+            run_spec=MagicMock(),
+            output_dir=tmp_path / "run_out",
+        )
+        assert runner.wait(job_id, timeout=5.0)
+        assert stub.execute_called
+        rec = store.get(job_id)
+        assert rec is not None
+        assert rec.status == "complete"
+        assert rec.output_path == str(manifest)
+    def test_submit_returns_unique_uuid_when_no_id(
+        self, tmp_path: Path,
+    ) -> None:
+        store = JobStore(tmp_path / "jobs.db")
+        manifest = tmp_path / "manifest.json"
+        manifest.write_text("{}", encoding="utf-8")
+        stub = _StubOrchestrator(manifest_path=manifest)
+        runner = JobRunner(store, _make_factory(stub))
+        job_id_1 = runner.submit(
+            run_spec=MagicMock(),
+            output_dir=tmp_path / "out1",
+        )
+        job_id_2 = runner.submit(
+            run_spec=MagicMock(),
+            output_dir=tmp_path / "out2",
+        )
+        assert job_id_1 != job_id_2
+        runner.wait(job_id_1, timeout=5.0)
+        runner.wait(job_id_2, timeout=5.0)
+    def test_submit_stores_explicit_job_id(self, tmp_path: Path) -> None:
+        store = JobStore(tmp_path / "jobs.db")
+        manifest = tmp_path / "m.json"
+        manifest.write_text("{}", encoding="utf-8")
+        stub = _StubOrchestrator(manifest_path=manifest)
+        runner = JobRunner(store, _make_factory(stub))
+        returned = runner.submit(
+            run_spec=MagicMock(),
+            output_dir=tmp_path / "out",
+            job_id="my_explicit_id",
+        )
+        assert returned == "my_explicit_id"
+        runner.wait("my_explicit_id", timeout=5.0)
+        assert store.get("my_explicit_id") is not None
+class TestJobRunnerErrorPath:
+    def test_orchestrator_exception_marks_error(self, tmp_path: Path) -> None:
+        store = JobStore(tmp_path / "jobs.db")
+        stub = _StubOrchestrator(
+            manifest_path=tmp_path / "x",
+            raise_on_execute=RuntimeError("orchestrator boom"),
+        )
+        runner = JobRunner(store, _make_factory(stub))
+        job_id = runner.submit(
+            run_spec=MagicMock(),
+            output_dir=tmp_path / "out",
+        )
+        runner.wait(job_id, timeout=5.0)
+        rec = store.get(job_id)
+        assert rec is not None
+        assert rec.status == "error"
+        assert "RuntimeError" in rec.error
+        assert "orchestrator boom" in rec.error
+class TestJobRunnerCancellation:
+    def test_cancel_during_execution_discards_result(
+        self, tmp_path: Path,
+    ) -> None:
+        """Cancelling while the worker is running discards the result
+        (the status stays cancelled)."""
+        store = JobStore(tmp_path / "jobs.db")
+        manifest = tmp_path / "m.json"
+        manifest.write_text("{}", encoding="utf-8")
+        # Delay long enough to cancel before completion.
+        stub = _StubOrchestrator(
+            manifest_path=manifest, delay_seconds=0.3,
+        )
+        runner = JobRunner(store, _make_factory(stub))
+        job_id = runner.submit(
+            run_spec=MagicMock(),
+            output_dir=tmp_path / "out",
+        )
+        # Wait until mark_running has been called (the thread has started).
+        for _ in range(50):
+            time.sleep(0.01)
+            rec = store.get(job_id)
+            if rec is not None and rec.status == "running":
+                break
+        # Cancel mid-execution.
+        store.mark_cancelled(job_id)
+        # Wait for the thread to finish (~0.3 s).
+        runner.wait(job_id, timeout=5.0)
+        rec_final = store.get(job_id)
+        assert rec_final.status == "cancelled", (
+            f"Expected final status cancelled, got {rec_final.status}"
+        )
+# ──────────────────────────────────────────────────────────────────────
+# Lifespan hook (mark_orphaned_jobs_interrupted at boot)
+# ──────────────────────────────────────────────────────────────────────
+class TestLifespanHook:
+    def test_orphaned_jobs_marked_interrupted_on_app_start(
+        self, tmp_path: Path,
+    ) -> None:
+        """Precondition: ``pending`` and ``running`` jobs exist in the
+        store (simulates a crash of the previous process).
+        Action: the FastAPI app starts (lifespan hook).
+        Result: the orphaned jobs are marked ``interrupted``."""
+        # Phase 1: pre-populate the store (simulates the post-crash state).
+        db_path = tmp_path / "jobs.db"
+        store = JobStore(db_path)
+        store.create("zombie_pending")
+        store.create("zombie_running")
+        store.mark_running("zombie_running")
+        store.create("complete_one")
+        store.mark_complete("complete_one")
+        # Check the initial state.
+        assert store.get("zombie_pending").status == "pending"
+        assert store.get("zombie_running").status == "running"
+        assert store.get("complete_one").status == "complete"
+        # Phase 2: start the app; the lifespan hook runs.
+        workspace = WorkspaceManager(base_dir=tmp_path, session_id="s48")
+        registry = RegistryService.bootstrap_defaults()
+        state = WebAppState(
+            workspace=workspace,
+            registry=registry,
+            corpus=MagicMock(),
+            benchmark=MagicMock(),
+            orchestrator=MagicMock(),
+            job_store=store,  # pre-populated store
+        )
+        app = create_app(state)
+        # The lifespan hook runs when the TestClient context manager is entered.
+        with TestClient(app) as client:
+            # The hook ran at startup; check the store state.
+            assert store.get("zombie_pending").status == "interrupted"
+            assert store.get("zombie_running").status == "interrupted"
+            # Jobs already in a terminal state are left untouched.
+            assert store.get("complete_one").status == "complete"
+            # Sanity check: the app responds.
+            assert client.get("/health").status_code == 200
+# ──────────────────────────────────────────────────────────────────────
+# POST /api/jobs (end-to-end integration via TestClient)
+# ──────────────────────────────────────────────────────────────────────
+def _make_state_with_runner(tmp_path: Path) -> WebAppState:
+    """Build a complete WebAppState with a JobStore and a JobRunner.
+    The orchestrator is a stub that completes immediately, so the
+    POST tests can check the job status.
+    """
+    workspace = WorkspaceManager(base_dir=tmp_path, session_id="s48")
+    registry = RegistryService.bootstrap_defaults()
+    job_store = JobStore(tmp_path / "jobs.db")
+    manifest_path = tmp_path / "manifest.json"
+    manifest_path.write_text("{}", encoding="utf-8")
+    # Stub orchestrator factory.
+    def _factory(output_dir):
+        return _StubOrchestrator(manifest_path=manifest_path)
+    job_runner = JobRunner(
+        job_store=job_store,
+        orchestrator_factory=_factory,
+    )
+    return WebAppState(
+        workspace=workspace,
+        registry=registry,
+        corpus=MagicMock(),
+        benchmark=MagicMock(),
+        orchestrator=MagicMock(),
+        job_store=job_store,
+        job_runner=job_runner,
+    )
+_VALID_RUNSPEC_YAML = """
+corpus_dir: /tmp/c
+output_dir: /tmp/out
+pipelines:
+  - name: ocr_only
+    initial_inputs: [image]
+    steps:
+      - id: ocr
+        adapter_class: my_pkg.OCR
+        input_types: [image]
+        output_types: [raw_text]
+views: [text_final]
+""".strip()
+class TestPostJobsEndpoint:
+    def test_valid_yaml_returns_202_with_job_id(self, tmp_path: Path) -> None:
+        state = _make_state_with_runner(tmp_path)
+        app = create_app(state)
+        with TestClient(app) as client:
+            response = client.post("/api/jobs", content=_VALID_RUNSPEC_YAML)
+            assert response.status_code == 202, response.text
+            body = response.json()
+            assert "job_id" in body
+            assert body["status"] == "pending"
+            # The returned job_id is in the store.
+            assert state.job_store.get(body["job_id"]) is not None
+    def test_invalid_yaml_returns_400(self, tmp_path: Path) -> None:
+        state = _make_state_with_runner(tmp_path)
+        app = create_app(state)
+        with TestClient(app) as client:
+            response = client.post(
+                "/api/jobs",
+                content="not a valid runspec yaml: [",
+            )
+            assert response.status_code == 400
+            assert "RunSpec" in response.json()["detail"]
+    def test_empty_body_returns_400_or_422(self, tmp_path: Path) -> None:
+        """An empty body yields 400 (our own check) or 422 (pydantic
+        validation ahead of the handler).  Both are acceptable for
+        the user."""
+        state = _make_state_with_runner(tmp_path)
+        app = create_app(state)
+        with TestClient(app) as client:
+            response = client.post("/api/jobs", content="")
+            # FastAPI/Starlette may reject Body(...) with a 422 before
+            # reaching our handler; otherwise our check responds with 400.
+            assert response.status_code in (400, 422)
+    def test_no_job_runner_returns_503(self, tmp_path: Path) -> None:
+        """Without WebAppState.job_runner, POST /api/jobs returns 503."""
+        workspace = WorkspaceManager(base_dir=tmp_path, session_id="s48")
+        registry = RegistryService.bootstrap_defaults()
+        state = WebAppState(
+            workspace=workspace,
+            registry=registry,
+            corpus=MagicMock(),
+            benchmark=MagicMock(),
+            orchestrator=MagicMock(),
+            job_store=JobStore(tmp_path / "jobs.db"),
+            # job_runner defaults to None
+        )
+        app = create_app(state)
+        with TestClient(app) as client:
+            response = client.post("/api/jobs", content=_VALID_RUNSPEC_YAML)
+            assert response.status_code == 503
+            assert "Job runner" in response.json()["detail"]