| /goal Redact a document, review the suggested redactions, and save the reviewed outputs to file for the user. |
|
|
| | Parameter | Value | |
| |-------------|---------| |
| | `{FILE_NAME}` | `Partnership-Agreement-Toolkit_0_0.pdf` | |
| | `{INPUT_PATH}` | `workspace/{FILE_NAME}` | |
| | `{OUTPUT_BASE}` | `workspace/redact/{FILE_NAME}/` | |
| | `{GRADIO_URL}` | `http://host.docker.internal:7861` | |
| | `{PAGE_RANGE}` | `all` | |
| | `{VLM_BASE_URL}` | `http://host.docker.internal:8000` (Pass 2 only; append `/v1/chat/completions`) | |
| | `{VLM_MODEL}` | `Qwen/Qwen3.6-27B-MTP-GGUF` | |
|
|
| --- |
|
|
| ## Agent task (fixed workflow — do not skip) |
|
|
| Redact and review **`{FILE_NAME}`** from **`{INPUT_PATH}`**. |
|
|
| ### Required skills (read before starting — do not improvise) |
|
|
| **Before any API calls**, read the repo skills below in order. They contain endpoint details, download traps, CSV rules, and coverage/prune steps that this prompt does not repeat. |
|
|
| | Phase | Skill | Path | When | |
| |-------|--------|------|------| |
| | **1 — Initial redaction** | `doc-redaction-app` | `skills/doc-redaction-app/SKILL.md` | First: `/doc_redact`, download artifacts to `{OUTPUT_BASE}output_redact/` | |
| | **2 — Pass 1 review** | `doc-redaction-modifications` | `skills/doc-redaction-modifications/SKILL.md` | CSV edits, `verify_redaction_coverage`, suspicious-row prune, `/review_apply`, post-apply checks | |
| | **3 — Parallel Pass 1 (optional)** | `doc-redact-page-review` | `skills/doc-redact-page-review/SKILL.md` | **Only if** `{PAGE_RANGE}` is large and you split Pass 1 across subagents; parent still merges and applies **once** | |
| | **Pass 2 VLM (optional)** | `doc-redaction-modifications` § Pass 2 | same file | **Only if** Pass 2 criteria below are met — not for initial review | |
|
|
| **Rules:** |
|
|
| - Follow skill procedures for Gradio client usage, `handle_file`, path validation, and picking newest outputs by `st_mtime`. |
| - This prompt’s **User redaction requirements** (at the end) override generic examples in the skills for *what* to redact; skills define *how*. |
| - Do **not** skip reading `doc-redaction-modifications` — it defines Pass 1 completion (`pass_strict`, prune, single apply). |
| - Repo overview for agents: `AGENTS.md`. |
|
|
|
|
| ### Agent anti-confusion rules (read before acting) |
|
|
| These rules prevent common LLM mistakes on real runs. **Skills define mechanics; this section defines stop conditions.** |
|
|
| | Confusion | Wrong instinct | Correct approach | |
| |-----------|----------------|------------------| |
| | **`/review_apply` purpose** | “Draw overlays only” or “PyMuPDF script = real redaction” | **Only** `/review_apply` (or `/agent/apply_review_redactions`) produces deliverable `*_redacted.pdf` with text stripped. `/preview_boxes` is QA only. | |
| | **PyMuPDF in Pass 1** | Write a standalone script that redacts and saves the final PDF | **Allowed:** read word/span positions from the **original PDF**, normalize to **0–1**, append rows to `*_review_file.csv`, then **`/review_apply`**. **Forbidden:** saving a “final” PDF without going through `/review_apply`. | |
| | **How many applies?** | Re-call `/review_apply` after every small CSV tweak | **Batch** CSV fixes from the coverage JSON, then apply. Pass 1: **one** `/review_apply`; **at most one** extra apply only if Pass 2 CSV edits or a single coordinated fix pass still leaves `pass_strict: false`. Do **not** loop apply five+ times chasing the same pages. | |
| | **When is Pass 1 done?** | “Pixel-black = good enough” while `pass_strict: false` | **Default deliverable gate:** post-apply **`pass_strict: true`**. If you cannot reach it, **stop and report** — do not silently ship (see exception table below). | |
| | **`text_layer_leaks` + `coord_mismatch_or_image_text`** | Add ever-wider/full-page boxes until `get_text()` is clean | Often **decorative/image overlay** text. Check **`pixel_failures`** and manual pixel samples. If **pixels are black** but text stream still contains the term, **do not** keep adding full-span boxes — document as a known limitation or run **Pass 2** on those pages only. | |
| | **Coverage regex too narrow** | Only regex the org keyword (e.g. `Lambeth`) when user said “redact any names” | Derive **`must_redact`** from **all** user bullets (names, org terms, faces policy). For “any names”, include a name pattern (e.g. `\b[A-Z][a-z]{2,}(?:\s+[A-Z][a-z]{2,})+\b`) **plus** explicit name tokens from the review CSV / word OCR — not org terms alone. | |
| | **`verify_redaction_coverage` paths** | Upload CSV via `/gradio_api/upload` and pass `/tmp/gradio_tmp/...` to Agent API; use Pi-container paths on `/agent/*` | **Shared disk:** **`POST /agent/verify_redaction_coverage`** with **server paths** under app `OUTPUT_FOLDER`. **Split container (Pi + redaction service):** **pre-apply** — `python tools/verify_redaction_coverage.py` on **downloaded** CSV/PDF in your session workspace (edited CSV is not on the redaction server); **post-apply** — **`POST /agent/verify_redaction_coverage`** at `{GRADIO_URL}` with **server paths** from `/review_apply` (`extract_server_paths`). All Agent API paths must already exist on the redaction server — never Pi workspace or upload-temp paths. | |
| | **Draft vs deliverable PDF** | Post-apply checks on `*_redactions_for_review.pdf` or early `*_redacted.pdf` from `/doc_redact` | Verify against **post-apply** `*_redacted.pdf` from **`/review_apply`**, basename ending in `_redacted.pdf`. | |
| | **Initial `/doc_redact` output** | Treat first-run `*_redacted.pdf` as final | Treat as **draft**. Pass 1 review + **`/review_apply`** is required unless the user explicitly waives review. | |
|
|
| **Delivery exception (only if `pass_strict` cannot be reached after one apply + one fix pass):** |
|
|
| Fill this in the summary markdown — **do not deliver without it** when `pass_strict: false`: |
|
|
| | Field | Value | |
| |-------|--------| |
| | **Pages still failing strict** | _(e.g. 3, 5, 7, 19, 21)_ | |
| | **`leak_likely_causes`** | _(from coverage JSON per page)_ | |
| | **`pixel_failures` count** | _(0 = visually covered; >0 = real visual leak)_ | |
| | **User decision needed** | Accept text-stream artifact / run Pass 2 VLM on listed pages / manual review | |
|
|
| ### Two-pass model (Pass 1 is the deliverable) |
|
|
| **Default: Pass 1 only.** Pass 1 must be sufficient for delivery unless Pass 2 criteria below are met after Pass 1 completes. |
|
|
| **Do not run VLM on every page.** Do not spawn per-page VLM subagents unless I explicitly request Pass 2 or Pass 1 leaves pages in `pages_flagged_for_vlm`. |
|
|
| --- |
|
|
| ### Pass 1 (required — complete end-to-end) |
|
|
| 1. **Initial redaction** — `POST /doc_redact` (or Gradio `api_name="/doc_redact"`) with settings from **User redaction requirements** below. Save artifacts to **`{OUTPUT_BASE}output_redact/`**. |
|
|
| 2. **Review all pages in Pass 1** using **OCR / CSV / text only** (no VLM): |
| - Load `*_review_file.csv`, `*_ocr_results_with_words_*.csv`, `*_ocr_output_*.csv`, original PDF from the same run. |
| - Apply **User redaction requirements** (must redact / must not redact) programmatically to the review CSV. |
| - Align missing boxes from word OCR (merge same-line tokens; separate boxes across lines). Use PyMuPDF positions only to **author CSV rows** (normalized 0–1), not to write the final PDF directly. |
| - Run **`verify_redaction_coverage`** (CLI or `POST /agent/verify_redaction_coverage`) with `must_redact` / `must_not_redact` regex lists derived from **every** user requirement bullet (not just org keywords). |
| - Fix policy issues until **`pass_strict: true`** (`uncovered_terms`, `over_redacted`, `text_layer_leaks` cleared) — **batch fixes**, then re-verify; avoid re-applying after each row. |
| - **Prune suspicious rows** — short OCR fragments that do not match `must_redact` (`auto_prune_suspicious: true` or `--prune-suspicious`). Re-run coverage; target **`pass_with_cleanup: true`**. |
| - Optional: `/preview_boxes` on highest-risk pages only (not every page). |
|
|
| 3. **Apply (minimal calls)** — **`/review_apply`** once from the parent agent (original PDF + merged/pruned CSV). Download newest outputs by `st_mtime` to **`{OUTPUT_BASE}review/output_review_final/`**. **At most one** additional `/review_apply` if Pass 2 edits the CSV or a single coordinated fix pass is still required; see **Agent anti-confusion rules**. |
|
|
| 4. **Post-apply verification** |
| - Re-run `verify_redaction_coverage` with `redacted_pdf_path`. |
| - Optional term search (`POST /agent/word_level_ocr_text_search`) for key names from user requirements. |
|
|
| 5. **Pass 1 completion criteria** |
| - Post-apply coverage: **`pass_strict: true`** (required unless **Delivery exception** table in summary is filled — see anti-confusion rules) |
| - Practitioner / allow-list names not over-redacted (per user requirements) |
| - Deliverable PDF is **post-apply** `*_redacted.pdf` under **`{OUTPUT_BASE}review/output_review_final/`** |
| - Write a brief summary markdown under **`{OUTPUT_BASE}review/`** (what was done, coverage results, any pages still needing optional Pass 2 or user sign-off) |
|
|
| **Hard gates:** (1) deliverable = post-apply `*_redacted.pdf` only via `/review_apply`; (2) `pass_strict: true` required unless you fill the Delivery exception table; (3) at most two `/review_apply` calls total in Pass 1; (4) batch CSV fixes — no apply-per-row loops; (5) PyMuPDF may only help write CSV rows, not replace `/review_apply`. |
|
|
| --- |
|
|
| ### Pass 2 (optional — strict gate) |
|
|
| **Do not start Pass 2 unless one of the following is true:** |
|
|
| | Criterion | Action | |
| |-----------|--------| |
| | I explicitly ask for visual / VLM review | Run Pass 2 only on pages or range I specify, or on `pages_flagged_for_vlm` if I say “flagged pages only” | |
| | Post-apply coverage lists **`pages_flagged_for_vlm`** | VLM **only those pages** (sequential, max 1 concurrent on local VLM) | |
| | Page has **`uncovered_terms`** for must-redact regex after Pass 1 fixes | Targeted Pass 2 on that page | |
| | Page has **`text_layer_leaks`** or **`pixel_failures`** | Targeted Pass 2 on that page. If **`leak_likely_causes`** is `coord_mismatch_or_image_text` and **`pixel_failures` is empty**, prefer Pass 2 visual check over repeated full-span CSV boxes. | |
| | Handwriting, stamps, signatures, or ink **absent from word OCR** and suspected to contain policy PII | Targeted Pass 2 on that page after noting why in the summary | |
|
|
| **Do not run Pass 2 for:** |
|
|
| - **`pages_needing_csv_cleanup` alone** — fix with suspicious-row prune, not VLM |
| - Suspicious short OCR rows (`"-"`, `"."`, `"Ho"`, etc.) |
| - “Review every page visually for completeness” on large documents |
| - Bulk-adding every OCR token from VLM output (conservative CSV edits only) |
|
|
| If Pass 2 runs: render PNGs for flagged pages only → one VLM call per page → conservative CSV patch → **at most one** additional `/review_apply` → re-run Pass 1 coverage. |
|
|
| ### Deployment — Pass 2 VLM (use only if Pass 2 criteria above are met) |
|
|
| **Agent / operator fills this block** — not the end-user’s redaction policy. Use when Pass 2 visual review may run (e.g. face photos in scans, flagged pages after coverage). **Pass 1 does not call this endpoint.** |
|
|
| Omit this entire section or write **“N/A — no VLM deployed”** if Pass 2 will never run. |
|
|
| | Setting | Value | |
| |---------|--------| |
| | **Base URL** | `{VLM_BASE_URL}` | |
| | **Model** | `{VLM_MODEL}` | |
| | **API key** | _(e.g. `none` for local vLLM, or env var name like `VLM_API_KEY` — **do not paste secrets in chat**)_ | |
| | **Timeout (s)** | _(e.g. `240`)_ | |
| | **max_tokens** | _(e.g. `2500`)_ | |
| | **Notes** | _(e.g. reasoning model — check `reasoning_content`; sequential one page at a time)_ | |
|
|
| For **simple user sections** (non-expert bullets only), keep VLM connection details here — not under **User redaction requirements**. For **detailed user sections**, you may duplicate or move settings to **Pass 2 VLM endpoint** below instead. |
|
|
| --- |
|
|
| ### Technical constraints |
|
|
| - Gradio: **`{GRADIO_URL}`** |
| - Page scope: **`{PAGE_RANGE}`** |
| - Review CSV basename must contain `_review_file` |
| - CSV encoding: **`utf-8-sig`** |
| - Reuse same-page `image` column value when adding rows |
| - Long `httpx` read timeout for large PDFs (e.g. 1800s+) |
| - Human review of redacted material is still assumed downstream |
| - **Do not** bypass `/review_apply` with custom PyMuPDF “true redaction” scripts when `text_layer_leaks` appear — fix CSV boxes/coordinates and re-apply (see `doc-redaction-modifications` § Endpoint semantics) |
| - If resuming a partial run: record which `*_redacted.pdf` is post-apply, re-run `verify_redaction_coverage` with that path (do not trust stale JSON without checking the PDF it was run against) |
| - If the deployment has `POST_REDACT_PASS1_QA=True`, initial redaction may already emit `*_coverage_report.json` (and optional `*_review_file_pruned.csv`) — that is **deployment sanity QA**, not a substitute for this task's Pass 1 review and `/review_apply` |
| - Use `{DEFAULT_OCR_METHOD}` for the initial text extraction, and `{DEFAULT_PII_METHOD}` for the PII identification model, unless the user specifies otherwise. Use a very long timeout with this method - allow for two minutes to complete each page. |
| - {REMOTE_BACKEND_GUIDANCE} |
| - {VLM_FACES_GUIDANCE} |
| - {VLM_SIGNATURE_GUIDANCE} |
| - Ensure that all redaction boxes cover genuine PII and are not false positives, unless the terms are specified by the user for redaction. |
| - If you get stuck in a loop unable to redact, review pages, or apply modified redactions to a document, please refer back to the relevant skill (doc-redaction-app, or doc-redaction_modifications) for instructions on how to use the app properly. |
| - Ensure that you save all the relevant output files to the output folder, including the redacted PDF, the review PDF, and the CSV review file. |
|
|
| --- |
|
|
| ## Specific rules for long documents (100+ pages — operator / agent rules) |
|
|
| Use this section when **`{PAGE_RANGE}`** is **`all`** or spans **100+ pages** (e.g. 500-page bundles). Pass 1 remains the deliverable; Pass 2 stays **flagged-pages-only**. |
|
|
| ### When to activate |
|
|
| - Page count **≥ 100**, or expected OCR CSV **> 50k word rows**, or a single `/doc_redact` call is likely to exceed **~30 minutes**. |
| - Set **`{PAGE_RANGE}`** explicitly (e.g. `1-500`) so subagents and coverage know scope. |
|
|
| ### Programmatic review (mandatory at scale) |
|
|
| **Never load the full `*_review_file.csv` or word OCR CSV into chat context.** At 500 pages these files can exceed model context limits. |
|
|
| Instead: |
|
|
| 1. Edit CSVs with **Python/pandas scripts** on disk (regex policy, merge word boxes, add rows). |
| 2. Run **`verify_redaction_coverage`** via CLI or **`POST /agent/verify_redaction_coverage`**; read the **JSON report** only (`pages_with_policy_issues`, per-page flags). |
| 3. Fix **only flagged pages**; batch edits before re-running coverage (avoid verify-after-every-row). |
| 4. Use **`--prune-suspicious`** / `auto_prune_suspicious: true` once policy passes are stable. |
|
|
| Parallel page subagents ([`doc-redact-page-review`](../doc-redact-page-review/SKILL.md)): spawn for **policy-flagged pages** or page-specific rules — not necessarily all 500 pages. Batch **3–5 concurrent** children; parent **merges once** and calls **`/review_apply` once**. |
|
|
| ### Initial redaction — prefer CLI or chunks |
|
|
| | Approach | When | |
| |----------|------| |
| | **`python cli_redact.py`** (or **`POST /agent/redact_document`**) | Multi-hour jobs; avoids Gradio HTTP read timeouts | |
| | **Chunked `/doc_redact`** (`page_min` / `page_max`, e.g. 100 pages per run) | Single call would exceed client timeout or server RAM | |
| | **Single `/doc_redact`** | ≤ ~100 pages or mostly text with `EFFICIENT_OCR=True` and long `httpx` read timeout (1800s+) | |
|
|
| After chunked redact: merge review CSVs with **`combine_review_csvs`** (Agent API / Gradio), then **one** `/review_apply` on the **full original PDF**. |
|
|
| **Deployment toggles** (see `tools/config.py`, `config/app_config.env`): |
|
|
| - **`EFFICIENT_OCR=True`** — text extraction first; OCR only on sparse/image pages. |
| - **`OCR_FIRST_PASS_MAX_WORKERS`**, **`PADDLE_MAX_WORKERS`** — tune CPU/GPU parallelism. |
| - **`POST_REDACT_PASS1_QA=True`** — optional post-redact coverage + prune (sanity QA only; not a substitute for this task's Pass 1 review). |
|
|
| ### Tiered Pass 1 workflow (500+ pages) |
|
|
| 1. Initial redact (CLI or chunked) → save under **`{OUTPUT_BASE}output_redact/`**. |
| 2. Programmatic regex pass on full review CSV. |
| 3. **`verify_redaction_coverage`** → fix pages in `pages_with_policy_issues`. |
| 4. Prune suspicious rows → re-verify until **`pass_with_cleanup: true`**. |
| 5. **One** `/review_apply` → post-apply verify with `redacted_pdf_path`. |
| 6. Optional: **`find_duplicate_pages`** — review one page per duplicate cluster, propagate CSV patterns. |
| 7. Pass 2 VLM **only** on `pages_flagged_for_vlm` or user-specified ranges — **never** full-document visual sweep. |
|
|
| ### Resume checkpoints |
|
|
| If the agent session dies mid-task: |
|
|
| - After step 1: resume from CSV edit + verify (artifacts in `{OUTPUT_BASE}output_redact/`). |
| - After apply: record which `*_redacted.pdf` is post-apply; re-run verify against that path (do not trust stale JSON). |
|
|
| ### Pass 2 at scale |
|
|
| Do **not** run VLM on every page. At ~1–2 min/page sequential on local VLMs, 500 pages ≈ **8–17 hours** plus large image-token cost. Use Deployment — Pass 2 VLM only for explicit flagged subsets. |
|
|
| --- |
|
|
| ## User redaction requirements (authoritative for this task) |
|
|
| - All signatures should be redacted |
| - Any redaction box related to general country names should be removed |
| - All redactions for Rudy Giuliani should be removed |
| - All mentions of London, and 'Sister City' should be redacted |
|
|