agentic_document_redaction / skills /Example prompt partnership.txt
seanpedrickcase's picture
Sync Pi agent Space: Merge pull request #199 from seanpedrick-case/startup_optimise
a495708
Raw
History Blame Contribute Delete
17.6 kB
/goal Redact a document, review the suggested redactions, and save the reviewed outputs to file for the user.
| Parameter | Value |
|-------------|---------|
| `{FILE_NAME}` | `Partnership-Agreement-Toolkit_0_0.pdf` |
| `{INPUT_PATH}` | `workspace/{FILE_NAME}` |
| `{OUTPUT_BASE}` | `workspace/redact/{FILE_NAME}/` |
| `{GRADIO_URL}` | `http://host.docker.internal:7861` |
| `{PAGE_RANGE}` | `all` |
| `{VLM_BASE_URL}` | `http://host.docker.internal:8000` (Pass 2 only; append `/v1/chat/completions`) |
| `{VLM_MODEL}` | `Qwen/Qwen3.6-27B-MTP-GGUF` |
---
## Agent task (fixed workflow — do not skip)
Redact and review **`{FILE_NAME}`** from **`{INPUT_PATH}`**.
### Required skills (read before starting — do not improvise)
**Before any API calls**, read the repo skills below in order. They contain endpoint details, download traps, CSV rules, and coverage/prune steps that this prompt does not repeat.
| Phase | Skill | Path | When |
|-------|--------|------|------|
| **1 — Initial redaction** | `doc-redaction-app` | `skills/doc-redaction-app/SKILL.md` | First: `/doc_redact`, download artifacts to `{OUTPUT_BASE}output_redact/` |
| **2 — Pass 1 review** | `doc-redaction-modifications` | `skills/doc-redaction-modifications/SKILL.md` | CSV edits, `verify_redaction_coverage`, suspicious-row prune, `/review_apply`, post-apply checks |
| **3 — Parallel Pass 1 (optional)** | `doc-redact-page-review` | `skills/doc-redact-page-review/SKILL.md` | **Only if** `{PAGE_RANGE}` is large and you split Pass 1 across subagents; parent still merges and applies **once** |
| **Pass 2 VLM (optional)** | `doc-redaction-modifications` § Pass 2 | same file | **Only if** Pass 2 criteria below are met — not for initial review |
**Rules:**
- Follow skill procedures for Gradio client usage, `handle_file`, path validation, and picking newest outputs by `st_mtime`.
- This prompt’s **User redaction requirements** (at the end) override generic examples in the skills for *what* to redact; skills define *how*.
- Do **not** skip reading `doc-redaction-modifications` — it defines Pass 1 completion (`pass_strict`, prune, single apply).
- Repo overview for agents: `AGENTS.md`.
### Agent anti-confusion rules (read before acting)
These rules prevent common LLM mistakes on real runs. **Skills define mechanics; this section defines stop conditions.**
| Confusion | Wrong instinct | Correct approach |
|-----------|----------------|------------------|
| **`/review_apply` purpose** | “Draw overlays only” or “PyMuPDF script = real redaction” | **Only** `/review_apply` (or `/agent/apply_review_redactions`) produces deliverable `*_redacted.pdf` with text stripped. `/preview_boxes` is QA only. |
| **PyMuPDF in Pass 1** | Write a standalone script that redacts and saves the final PDF | **Allowed:** read word/span positions from the **original PDF**, normalize to **0–1**, append rows to `*_review_file.csv`, then **`/review_apply`**. **Forbidden:** saving a “final” PDF without going through `/review_apply`. |
| **How many applies?** | Re-call `/review_apply` after every small CSV tweak | **Batch** CSV fixes from the coverage JSON, then apply. Pass 1: **one** `/review_apply`; **at most one** extra apply only if Pass 2 CSV edits or a single coordinated fix pass still leaves `pass_strict: false`. Do **not** loop apply five+ times chasing the same pages. |
| **When is Pass 1 done?** | “Pixel-black = good enough” while `pass_strict: false` | **Default deliverable gate:** post-apply **`pass_strict: true`**. If you cannot reach it, **stop and report** — do not silently ship (see exception table below). |
| **`text_layer_leaks` + `coord_mismatch_or_image_text`** | Add ever-wider/full-page boxes until `get_text()` is clean | Often **decorative/image overlay** text. Check **`pixel_failures`** and manual pixel samples. If **pixels are black** but text stream still contains the term, **do not** keep adding full-span boxes — document as a known limitation or run **Pass 2** on those pages only. |
| **Coverage regex too narrow** | Only regex the org keyword (e.g. `Lambeth`) when user said “redact any names” | Derive **`must_redact`** from **all** user bullets (names, org terms, faces policy). For “any names”, include a name pattern (e.g. `\b[A-Z][a-z]{2,}(?:\s+[A-Z][a-z]{2,})+\b`) **plus** explicit name tokens from the review CSV / word OCR — not org terms alone. |
| **`verify_redaction_coverage` paths** | Upload CSV via `/gradio_api/upload` and pass `/tmp/gradio_tmp/...` to Agent API; use Pi-container paths on `/agent/*` | **Shared disk:** **`POST /agent/verify_redaction_coverage`** with **server paths** under app `OUTPUT_FOLDER`. **Split container (Pi + redaction service):** **pre-apply** — `python tools/verify_redaction_coverage.py` on **downloaded** CSV/PDF in your session workspace (edited CSV is not on the redaction server); **post-apply** — **`POST /agent/verify_redaction_coverage`** at `{GRADIO_URL}` with **server paths** from `/review_apply` (`extract_server_paths`). All Agent API paths must already exist on the redaction server — never Pi workspace or upload-temp paths. |
| **Draft vs deliverable PDF** | Post-apply checks on `*_redactions_for_review.pdf` or early `*_redacted.pdf` from `/doc_redact` | Verify against **post-apply** `*_redacted.pdf` from **`/review_apply`**, basename ending in `_redacted.pdf`. |
| **Initial `/doc_redact` output** | Treat first-run `*_redacted.pdf` as final | Treat as **draft**. Pass 1 review + **`/review_apply`** is required unless the user explicitly waives review. |
**Delivery exception (only if `pass_strict` cannot be reached after one apply + one fix pass):**
Fill this in the summary markdown — **do not deliver without it** when `pass_strict: false`:
| Field | Value |
|-------|--------|
| **Pages still failing strict** | _(e.g. 3, 5, 7, 19, 21)_ |
| **`leak_likely_causes`** | _(from coverage JSON per page)_ |
| **`pixel_failures` count** | _(0 = visually covered; >0 = real visual leak)_ |
| **User decision needed** | Accept text-stream artifact / run Pass 2 VLM on listed pages / manual review |
### Two-pass model (Pass 1 is the deliverable)
**Default: Pass 1 only.** Pass 1 must be sufficient for delivery unless Pass 2 criteria below are met after Pass 1 completes.
**Do not run VLM on every page.** Do not spawn per-page VLM subagents unless I explicitly request Pass 2 or Pass 1 leaves pages in `pages_flagged_for_vlm`.
---
### Pass 1 (required — complete end-to-end)
1. **Initial redaction** — `POST /doc_redact` (or Gradio `api_name="/doc_redact"`) with settings from **User redaction requirements** below. Save artifacts to **`{OUTPUT_BASE}output_redact/`**.
2. **Review all pages in Pass 1** using **OCR / CSV / text only** (no VLM):
- Load `*_review_file.csv`, `*_ocr_results_with_words_*.csv`, `*_ocr_output_*.csv`, original PDF from the same run.
- Apply **User redaction requirements** (must redact / must not redact) programmatically to the review CSV.
- Align missing boxes from word OCR (merge same-line tokens; separate boxes across lines). Use PyMuPDF positions only to **author CSV rows** (normalized 0–1), not to write the final PDF directly.
- Run **`verify_redaction_coverage`** (CLI or `POST /agent/verify_redaction_coverage`) with `must_redact` / `must_not_redact` regex lists derived from **every** user requirement bullet (not just org keywords).
- Fix policy issues until **`pass_strict: true`** (`uncovered_terms`, `over_redacted`, `text_layer_leaks` cleared) — **batch fixes**, then re-verify; avoid re-applying after each row.
- **Prune suspicious rows** — short OCR fragments that do not match `must_redact` (`auto_prune_suspicious: true` or `--prune-suspicious`). Re-run coverage; target **`pass_with_cleanup: true`**.
- Optional: `/preview_boxes` on highest-risk pages only (not every page).
3. **Apply (minimal calls)** — **`/review_apply`** once from the parent agent (original PDF + merged/pruned CSV). Download newest outputs by `st_mtime` to **`{OUTPUT_BASE}review/output_review_final/`**. **At most one** additional `/review_apply` if Pass 2 edits the CSV or a single coordinated fix pass is still required; see **Agent anti-confusion rules**.
4. **Post-apply verification**
- Re-run `verify_redaction_coverage` with `redacted_pdf_path`.
- Optional term search (`POST /agent/word_level_ocr_text_search`) for key names from user requirements.
5. **Pass 1 completion criteria**
- Post-apply coverage: **`pass_strict: true`** (required unless **Delivery exception** table in summary is filled — see anti-confusion rules)
- Practitioner / allow-list names not over-redacted (per user requirements)
- Deliverable PDF is **post-apply** `*_redacted.pdf` under **`{OUTPUT_BASE}review/output_review_final/`**
- Write a brief summary markdown under **`{OUTPUT_BASE}review/`** (what was done, coverage results, any pages still needing optional Pass 2 or user sign-off)
**Hard gates:** (1) deliverable = post-apply `*_redacted.pdf` only via `/review_apply`; (2) `pass_strict: true` required unless you fill the Delivery exception table; (3) at most two `/review_apply` calls total in Pass 1; (4) batch CSV fixes — no apply-per-row loops; (5) PyMuPDF may only help write CSV rows, not replace `/review_apply`.
---
### Pass 2 (optional — strict gate)
**Do not start Pass 2 unless one of the following is true:**
| Criterion | Action |
|-----------|--------|
| I explicitly ask for visual / VLM review | Run Pass 2 only on pages or range I specify, or on `pages_flagged_for_vlm` if I say “flagged pages only” |
| Post-apply coverage lists **`pages_flagged_for_vlm`** | VLM **only those pages** (sequential, max 1 concurrent on local VLM) |
| Page has **`uncovered_terms`** for must-redact regex after Pass 1 fixes | Targeted Pass 2 on that page |
| Page has **`text_layer_leaks`** or **`pixel_failures`** | Targeted Pass 2 on that page. If **`leak_likely_causes`** is `coord_mismatch_or_image_text` and **`pixel_failures` is empty**, prefer Pass 2 visual check over repeated full-span CSV boxes. |
| Handwriting, stamps, signatures, or ink **absent from word OCR** and suspected to contain policy PII | Targeted Pass 2 on that page after noting why in the summary |
**Do not run Pass 2 for:**
- **`pages_needing_csv_cleanup` alone** — fix with suspicious-row prune, not VLM
- Suspicious short OCR rows (`"-"`, `"."`, `"Ho"`, etc.)
- “Review every page visually for completeness” on large documents
- Bulk-adding every OCR token from VLM output (conservative CSV edits only)
If Pass 2 runs: render PNGs for flagged pages only → one VLM call per page → conservative CSV patch → **at most one** additional `/review_apply` → re-run Pass 1 coverage.
### Deployment — Pass 2 VLM (use only if Pass 2 criteria above are met)
**Agent / operator fills this block** — not the end-user’s redaction policy. Use when Pass 2 visual review may run (e.g. face photos in scans, flagged pages after coverage). **Pass 1 does not call this endpoint.**
Omit this entire section or write **“N/A — no VLM deployed”** if Pass 2 will never run.
| Setting | Value |
|---------|--------|
| **Base URL** | `{VLM_BASE_URL}` |
| **Model** | `{VLM_MODEL}` |
| **API key** | _(e.g. `none` for local vLLM, or env var name like `VLM_API_KEY` — **do not paste secrets in chat**)_ |
| **Timeout (s)** | _(e.g. `240`)_ |
| **max_tokens** | _(e.g. `2500`)_ |
| **Notes** | _(e.g. reasoning model — check `reasoning_content`; sequential one page at a time)_ |
For **simple user sections** (non-expert bullets only), keep VLM connection details here — not under **User redaction requirements**. For **detailed user sections**, you may duplicate or move settings to **Pass 2 VLM endpoint** below instead.
---
### Technical constraints
- Gradio: **`{GRADIO_URL}`**
- Page scope: **`{PAGE_RANGE}`**
- Review CSV basename must contain `_review_file`
- CSV encoding: **`utf-8-sig`**
- Reuse same-page `image` column value when adding rows
- Long `httpx` read timeout for large PDFs (e.g. 1800s+)
- Human review of redacted material is still assumed downstream
- **Do not** bypass `/review_apply` with custom PyMuPDF “true redaction” scripts when `text_layer_leaks` appear — fix CSV boxes/coordinates and re-apply (see `doc-redaction-modifications` § Endpoint semantics)
- If resuming a partial run: record which `*_redacted.pdf` is post-apply, re-run `verify_redaction_coverage` with that path (do not trust stale JSON without checking the PDF it was run against)
- If the deployment has `POST_REDACT_PASS1_QA=True`, initial redaction may already emit `*_coverage_report.json` (and optional `*_review_file_pruned.csv`) — that is **deployment sanity QA**, not a substitute for this task's Pass 1 review and `/review_apply`
- Use `{DEFAULT_OCR_METHOD}` for the initial text extraction, and `{DEFAULT_PII_METHOD}` for the PII identification model, unless the user specifies otherwise. Use a very long timeout with this method - allow for two minutes to complete each page.
- {REMOTE_BACKEND_GUIDANCE}
- {VLM_FACES_GUIDANCE}
- {VLM_SIGNATURE_GUIDANCE}
- Ensure that all redaction boxes cover genuine PII and are not false positives, unless the terms are specified by the user for redaction.
- If you get stuck in a loop unable to redact, review pages, or apply modified redactions to a document, please refer back to the relevant skill (doc-redaction-app, or doc-redaction_modifications) for instructions on how to use the app properly.
- Ensure that you save all the relevant output files to the output folder, including the redacted PDF, the review PDF, and the CSV review file.
---
## Specific rules for long documents (100+ pages — operator / agent rules)
Use this section when **`{PAGE_RANGE}`** is **`all`** or spans **100+ pages** (e.g. 500-page bundles). Pass 1 remains the deliverable; Pass 2 stays **flagged-pages-only**.
### When to activate
- Page count **≥ 100**, or expected OCR CSV **> 50k word rows**, or a single `/doc_redact` call is likely to exceed **~30 minutes**.
- Set **`{PAGE_RANGE}`** explicitly (e.g. `1-500`) so subagents and coverage know scope.
### Programmatic review (mandatory at scale)
**Never load the full `*_review_file.csv` or word OCR CSV into chat context.** At 500 pages these files can exceed model context limits.
Instead:
1. Edit CSVs with **Python/pandas scripts** on disk (regex policy, merge word boxes, add rows).
2. Run **`verify_redaction_coverage`** via CLI or **`POST /agent/verify_redaction_coverage`**; read the **JSON report** only (`pages_with_policy_issues`, per-page flags).
3. Fix **only flagged pages**; batch edits before re-running coverage (avoid verify-after-every-row).
4. Use **`--prune-suspicious`** / `auto_prune_suspicious: true` once policy passes are stable.
Parallel page subagents ([`doc-redact-page-review`](../doc-redact-page-review/SKILL.md)): spawn for **policy-flagged pages** or page-specific rules — not necessarily all 500 pages. Batch **3–5 concurrent** children; parent **merges once** and calls **`/review_apply` once**.
### Initial redaction — prefer CLI or chunks
| Approach | When |
|----------|------|
| **`python cli_redact.py`** (or **`POST /agent/redact_document`**) | Multi-hour jobs; avoids Gradio HTTP read timeouts |
| **Chunked `/doc_redact`** (`page_min` / `page_max`, e.g. 100 pages per run) | Single call would exceed client timeout or server RAM |
| **Single `/doc_redact`** | ≤ ~100 pages or mostly text with `EFFICIENT_OCR=True` and long `httpx` read timeout (1800s+) |
After chunked redact: merge review CSVs with **`combine_review_csvs`** (Agent API / Gradio), then **one** `/review_apply` on the **full original PDF**.
**Deployment toggles** (see `tools/config.py`, `config/app_config.env`):
- **`EFFICIENT_OCR=True`** — text extraction first; OCR only on sparse/image pages.
- **`OCR_FIRST_PASS_MAX_WORKERS`**, **`PADDLE_MAX_WORKERS`** — tune CPU/GPU parallelism.
- **`POST_REDACT_PASS1_QA=True`** — optional post-redact coverage + prune (sanity QA only; not a substitute for this task's Pass 1 review).
### Tiered Pass 1 workflow (500+ pages)
1. Initial redact (CLI or chunked) → save under **`{OUTPUT_BASE}output_redact/`**.
2. Programmatic regex pass on full review CSV.
3. **`verify_redaction_coverage`** → fix pages in `pages_with_policy_issues`.
4. Prune suspicious rows → re-verify until **`pass_with_cleanup: true`**.
5. **One** `/review_apply` → post-apply verify with `redacted_pdf_path`.
6. Optional: **`find_duplicate_pages`** — review one page per duplicate cluster, propagate CSV patterns.
7. Pass 2 VLM **only** on `pages_flagged_for_vlm` or user-specified ranges — **never** full-document visual sweep.
### Resume checkpoints
If the agent session dies mid-task:
- After step 1: resume from CSV edit + verify (artifacts in `{OUTPUT_BASE}output_redact/`).
- After apply: record which `*_redacted.pdf` is post-apply; re-run verify against that path (do not trust stale JSON).
### Pass 2 at scale
Do **not** run VLM on every page. At ~1–2 min/page sequential on local VLMs, 500 pages ≈ **8–17 hours** plus large image-token cost. Use Deployment — Pass 2 VLM only for explicit flagged subsets.
---
## User redaction requirements (authoritative for this task)
- All signatures should be redacted
- Any redaction box related to general country names should be removed
- All redactions for Rudy Giuliani should be removed
- All mentions of London, and 'Sister City' should be redacted