Spaces:

seanpedrickcase
/

agentic_document_redaction

Running

App Files Files Community

agentic_document_redaction / skills /Example prompt partnership.txt

seanpedrickcase

Sync Pi agent Space: Merge pull request #199 from seanpedrick-case/startup_optimise

a495708 7 days ago

Raw

History Blame Contribute Delete

17.6 kB

	/goal Redact a document, review the suggested redactions, and save the reviewed outputs to file for the user.

	\| Parameter \| Value \|
	\|-------------\|---------\|
	\| `{FILE_NAME}` \| `Partnership-Agreement-Toolkit_0_0.pdf` \|
	\| `{INPUT_PATH}` \| `workspace/{FILE_NAME}` \|
	\| `{OUTPUT_BASE}` \| `workspace/redact/{FILE_NAME}/` \|
	\| `{GRADIO_URL}` \| `http://host.docker.internal:7861` \|
	\| `{PAGE_RANGE}` \| `all` \|
	\| `{VLM_BASE_URL}` \| `http://host.docker.internal:8000` (Pass 2 only; append `/v1/chat/completions`) \|
	\| `{VLM_MODEL}` \| `Qwen/Qwen3.6-27B-MTP-GGUF` \|

	---

	## Agent task (fixed workflow — do not skip)

	Redact and review `{FILE_NAME}` from `{INPUT_PATH}`.

	### Required skills (read before starting — do not improvise)

	Before any API calls, read the repo skills below in order. They contain endpoint details, download traps, CSV rules, and coverage/prune steps that this prompt does not repeat.

	\| Phase \| Skill \| Path \| When \|
	\|-------\|--------\|------\|------\|
	\| 1 — Initial redaction \| `doc-redaction-app` \| `skills/doc-redaction-app/SKILL.md` \| First: `/doc_redact`, download artifacts to `{OUTPUT_BASE}output_redact/` \|
	\| 2 — Pass 1 review \| `doc-redaction-modifications` \| `skills/doc-redaction-modifications/SKILL.md` \| CSV edits, `verify_redaction_coverage`, suspicious-row prune, `/review_apply`, post-apply checks \|
	\| 3 — Parallel Pass 1 (optional) \| `doc-redact-page-review` \| `skills/doc-redact-page-review/SKILL.md` \| Only if `{PAGE_RANGE}` is large and you split Pass 1 across subagents; parent still merges and applies once \|
	\| Pass 2 VLM (optional) \| `doc-redaction-modifications` § Pass 2 \| same file \| Only if Pass 2 criteria below are met — not for initial review \|

	Rules:

	- Follow skill procedures for Gradio client usage, `handle_file`, path validation, and picking newest outputs by `st_mtime`.
	- This prompt’s User redaction requirements (at the end) override generic examples in the skills for what to redact; skills define how.
	- Do not skip reading `doc-redaction-modifications` — it defines Pass 1 completion (`pass_strict`, prune, single apply).
	- Repo overview for agents: `AGENTS.md`.


	### Agent anti-confusion rules (read before acting)

	These rules prevent common LLM mistakes on real runs. Skills define mechanics; this section defines stop conditions.

	\| Confusion \| Wrong instinct \| Correct approach \|
	\|-----------\|----------------\|------------------\|
	\| `/review_apply` purpose \| “Draw overlays only” or “PyMuPDF script = real redaction” \| Only `/review_apply` (or `/agent/apply_review_redactions`) produces deliverable `*_redacted.pdf` with text stripped. `/preview_boxes` is QA only. \|
	\| PyMuPDF in Pass 1 \| Write a standalone script that redacts and saves the final PDF \| Allowed: read word/span positions from the original PDF, normalize to 0–1, append rows to `_review_file.csv`, then `/review_apply`. Forbidden:* saving a “final” PDF without going through `/review_apply`. \|
	\| How many applies? \| Re-call `/review_apply` after every small CSV tweak \| Batch CSV fixes from the coverage JSON, then apply. Pass 1: one `/review_apply`; at most one extra apply only if Pass 2 CSV edits or a single coordinated fix pass still leaves `pass_strict: false`. Do not loop apply five+ times chasing the same pages. \|
	\| When is Pass 1 done? \| “Pixel-black = good enough” while `pass_strict: false` \| Default deliverable gate: post-apply `pass_strict: true`. If you cannot reach it, stop and report — do not silently ship (see exception table below). \|
	\| `text_layer_leaks` + `coord_mismatch_or_image_text` \| Add ever-wider/full-page boxes until `get_text()` is clean \| Often decorative/image overlay text. Check `pixel_failures` and manual pixel samples. If pixels are black but text stream still contains the term, do not keep adding full-span boxes — document as a known limitation or run Pass 2 on those pages only. \|
	\| Coverage regex too narrow \| Only regex the org keyword (e.g. `Lambeth`) when user said “redact any names” \| Derive `must_redact` from all user bullets (names, org terms, faces policy). For “any names”, include a name pattern (e.g. `\b[A-Z][a-z]{2,}(?:\s+[A-Z][a-z]{2,})+\b`) plus explicit name tokens from the review CSV / word OCR — not org terms alone. \|
	\| `verify_redaction_coverage` paths \| Upload CSV via `/gradio_api/upload` and pass `/tmp/gradio_tmp/...` to Agent API; use Pi-container paths on `/agent/` \| Shared disk:* `POST /agent/verify_redaction_coverage` with server paths under app `OUTPUT_FOLDER`. Split container (Pi + redaction service): pre-apply — `python tools/verify_redaction_coverage.py` on downloaded CSV/PDF in your session workspace (edited CSV is not on the redaction server); post-apply — `POST /agent/verify_redaction_coverage` at `{GRADIO_URL}` with server paths from `/review_apply` (`extract_server_paths`). All Agent API paths must already exist on the redaction server — never Pi workspace or upload-temp paths. \|
	\| Draft vs deliverable PDF \| Post-apply checks on `_redactions_for_review.pdf` or early `_redacted.pdf` from `/doc_redact` \| Verify against post-apply `_redacted.pdf` from `/review_apply`*, basename ending in `_redacted.pdf`. \|
	\| Initial `/doc_redact` output \| Treat first-run `_redacted.pdf` as final \| Treat as draft. Pass 1 review + `/review_apply`* is required unless the user explicitly waives review. \|

	Delivery exception (only if `pass_strict` cannot be reached after one apply + one fix pass):

	Fill this in the summary markdown — do not deliver without it when `pass_strict: false`:

	\| Field \| Value \|
	\|-------\|--------\|
	\| Pages still failing strict \| _(e.g. 3, 5, 7, 19, 21)_ \|
	\| `leak_likely_causes` \| _(from coverage JSON per page)_ \|
	\| `pixel_failures` count \| _(0 = visually covered; >0 = real visual leak)_ \|
	\| User decision needed \| Accept text-stream artifact / run Pass 2 VLM on listed pages / manual review \|

	### Two-pass model (Pass 1 is the deliverable)

	Default: Pass 1 only. Pass 1 must be sufficient for delivery unless Pass 2 criteria below are met after Pass 1 completes.

	Do not run VLM on every page. Do not spawn per-page VLM subagents unless I explicitly request Pass 2 or Pass 1 leaves pages in `pages_flagged_for_vlm`.

	---

	### Pass 1 (required — complete end-to-end)

	1. Initial redaction — `POST /doc_redact` (or Gradio `api_name="/doc_redact"`) with settings from User redaction requirements below. Save artifacts to `{OUTPUT_BASE}output_redact/`.

	2. Review all pages in Pass 1 using OCR / CSV / text only (no VLM):
	- Load `_review_file.csv`, `_ocr_results_with_words_.csv`, `_ocr_output_*.csv`, original PDF from the same run.
	- Apply User redaction requirements (must redact / must not redact) programmatically to the review CSV.
	- Align missing boxes from word OCR (merge same-line tokens; separate boxes across lines). Use PyMuPDF positions only to author CSV rows (normalized 0–1), not to write the final PDF directly.
	- Run `verify_redaction_coverage` (CLI or `POST /agent/verify_redaction_coverage`) with `must_redact` / `must_not_redact` regex lists derived from every user requirement bullet (not just org keywords).
	- Fix policy issues until `pass_strict: true` (`uncovered_terms`, `over_redacted`, `text_layer_leaks` cleared) — batch fixes, then re-verify; avoid re-applying after each row.
	- Prune suspicious rows — short OCR fragments that do not match `must_redact` (`auto_prune_suspicious: true` or `--prune-suspicious`). Re-run coverage; target `pass_with_cleanup: true`.
	- Optional: `/preview_boxes` on highest-risk pages only (not every page).

	3. Apply (minimal calls) — `/review_apply` once from the parent agent (original PDF + merged/pruned CSV). Download newest outputs by `st_mtime` to `{OUTPUT_BASE}review/output_review_final/`. At most one additional `/review_apply` if Pass 2 edits the CSV or a single coordinated fix pass is still required; see Agent anti-confusion rules.

	4. Post-apply verification
	- Re-run `verify_redaction_coverage` with `redacted_pdf_path`.
	- Optional term search (`POST /agent/word_level_ocr_text_search`) for key names from user requirements.

	5. Pass 1 completion criteria
	- Post-apply coverage: `pass_strict: true` (required unless Delivery exception table in summary is filled — see anti-confusion rules)
	- Practitioner / allow-list names not over-redacted (per user requirements)
	- Deliverable PDF is post-apply `_redacted.pdf` under `{OUTPUT_BASE}review/output_review_final/`*
	- Write a brief summary markdown under `{OUTPUT_BASE}review/` (what was done, coverage results, any pages still needing optional Pass 2 or user sign-off)

	Hard gates: (1) deliverable = post-apply `*_redacted.pdf` only via `/review_apply`; (2) `pass_strict: true` required unless you fill the Delivery exception table; (3) at most two `/review_apply` calls total in Pass 1; (4) batch CSV fixes — no apply-per-row loops; (5) PyMuPDF may only help write CSV rows, not replace `/review_apply`.

	---

	### Pass 2 (optional — strict gate)

	Do not start Pass 2 unless one of the following is true:

	\| Criterion \| Action \|
	\|-----------\|--------\|
	\| I explicitly ask for visual / VLM review \| Run Pass 2 only on pages or range I specify, or on `pages_flagged_for_vlm` if I say “flagged pages only” \|
	\| Post-apply coverage lists `pages_flagged_for_vlm` \| VLM only those pages (sequential, max 1 concurrent on local VLM) \|
	\| Page has `uncovered_terms` for must-redact regex after Pass 1 fixes \| Targeted Pass 2 on that page \|
	\| Page has `text_layer_leaks` or `pixel_failures` \| Targeted Pass 2 on that page. If `leak_likely_causes` is `coord_mismatch_or_image_text` and `pixel_failures` is empty, prefer Pass 2 visual check over repeated full-span CSV boxes. \|
	\| Handwriting, stamps, signatures, or ink absent from word OCR and suspected to contain policy PII \| Targeted Pass 2 on that page after noting why in the summary \|

	Do not run Pass 2 for:

	- `pages_needing_csv_cleanup` alone — fix with suspicious-row prune, not VLM
	- Suspicious short OCR rows (`"-"`, `"."`, `"Ho"`, etc.)
	- “Review every page visually for completeness” on large documents
	- Bulk-adding every OCR token from VLM output (conservative CSV edits only)

	If Pass 2 runs: render PNGs for flagged pages only → one VLM call per page → conservative CSV patch → at most one additional `/review_apply` → re-run Pass 1 coverage.

	### Deployment — Pass 2 VLM (use only if Pass 2 criteria above are met)

	Agent / operator fills this block — not the end-user’s redaction policy. Use when Pass 2 visual review may run (e.g. face photos in scans, flagged pages after coverage). Pass 1 does not call this endpoint.

	Omit this entire section or write “N/A — no VLM deployed” if Pass 2 will never run.

	\| Setting \| Value \|
	\|---------\|--------\|
	\| Base URL \| `{VLM_BASE_URL}` \|
	\| Model \| `{VLM_MODEL}` \|
	\| API key \| _(e.g. `none` for local vLLM, or env var name like `VLM_API_KEY` — do not paste secrets in chat)_ \|
	\| Timeout (s) \| _(e.g. `240`)_ \|
	\| max_tokens \| _(e.g. `2500`)_ \|
	\| Notes \| _(e.g. reasoning model — check `reasoning_content`; sequential one page at a time)_ \|

	For simple user sections (non-expert bullets only), keep VLM connection details here — not under User redaction requirements. For detailed user sections, you may duplicate or move settings to Pass 2 VLM endpoint below instead.

	---

	### Technical constraints

	- Gradio: `{GRADIO_URL}`
	- Page scope: `{PAGE_RANGE}`
	- Review CSV basename must contain `_review_file`
	- CSV encoding: `utf-8-sig`
	- Reuse same-page `image` column value when adding rows
	- Long `httpx` read timeout for large PDFs (e.g. 1800s+)
	- Human review of redacted material is still assumed downstream
	- Do not bypass `/review_apply` with custom PyMuPDF “true redaction” scripts when `text_layer_leaks` appear — fix CSV boxes/coordinates and re-apply (see `doc-redaction-modifications` § Endpoint semantics)
	- If resuming a partial run: record which `*_redacted.pdf` is post-apply, re-run `verify_redaction_coverage` with that path (do not trust stale JSON without checking the PDF it was run against)
	- If the deployment has `POST_REDACT_PASS1_QA=True`, initial redaction may already emit `_coverage_report.json` (and optional `_review_file_pruned.csv`) — that is deployment sanity QA, not a substitute for this task's Pass 1 review and `/review_apply`
	- Use `{DEFAULT_OCR_METHOD}` for the initial text extraction, and `{DEFAULT_PII_METHOD}` for the PII identification model, unless the user specifies otherwise. Use a very long timeout with this method - allow for two minutes to complete each page.
	- {REMOTE_BACKEND_GUIDANCE}
	- {VLM_FACES_GUIDANCE}
	- {VLM_SIGNATURE_GUIDANCE}
	- Ensure that all redaction boxes cover genuine PII and are not false positives, unless the terms are specified by the user for redaction.
	- If you get stuck in a loop unable to redact, review pages, or apply modified redactions to a document, please refer back to the relevant skill (doc-redaction-app, or doc-redaction_modifications) for instructions on how to use the app properly.
	- Ensure that you save all the relevant output files to the output folder, including the redacted PDF, the review PDF, and the CSV review file.

	---

	## Specific rules for long documents (100+ pages — operator / agent rules)

	Use this section when `{PAGE_RANGE}` is `all` or spans 100+ pages (e.g. 500-page bundles). Pass 1 remains the deliverable; Pass 2 stays flagged-pages-only.

	### When to activate

	- Page count ≥ 100, or expected OCR CSV > 50k word rows, or a single `/doc_redact` call is likely to exceed ~30 minutes.
	- Set `{PAGE_RANGE}` explicitly (e.g. `1-500`) so subagents and coverage know scope.

	### Programmatic review (mandatory at scale)

	*Never load the full `_review_file.csv` or word OCR CSV into chat context.** At 500 pages these files can exceed model context limits.

	Instead:

	1. Edit CSVs with Python/pandas scripts on disk (regex policy, merge word boxes, add rows).
	2. Run `verify_redaction_coverage` via CLI or `POST /agent/verify_redaction_coverage`; read the JSON report only (`pages_with_policy_issues`, per-page flags).
	3. Fix only flagged pages; batch edits before re-running coverage (avoid verify-after-every-row).
	4. Use `--prune-suspicious` / `auto_prune_suspicious: true` once policy passes are stable.

	Parallel page subagents ([`doc-redact-page-review`](../doc-redact-page-review/SKILL.md)): spawn for policy-flagged pages or page-specific rules — not necessarily all 500 pages. Batch 3–5 concurrent children; parent merges once and calls `/review_apply` once.

	### Initial redaction — prefer CLI or chunks

	\| Approach \| When \|
	\|----------\|------\|
	\| `python cli_redact.py` (or `POST /agent/redact_document`) \| Multi-hour jobs; avoids Gradio HTTP read timeouts \|
	\| Chunked `/doc_redact` (`page_min` / `page_max`, e.g. 100 pages per run) \| Single call would exceed client timeout or server RAM \|
	\| Single `/doc_redact` \| ≤ ~100 pages or mostly text with `EFFICIENT_OCR=True` and long `httpx` read timeout (1800s+) \|

	After chunked redact: merge review CSVs with `combine_review_csvs` (Agent API / Gradio), then one `/review_apply` on the full original PDF.

	Deployment toggles (see `tools/config.py`, `config/app_config.env`):

	- `EFFICIENT_OCR=True` — text extraction first; OCR only on sparse/image pages.
	- `OCR_FIRST_PASS_MAX_WORKERS`, `PADDLE_MAX_WORKERS` — tune CPU/GPU parallelism.
	- `POST_REDACT_PASS1_QA=True` — optional post-redact coverage + prune (sanity QA only; not a substitute for this task's Pass 1 review).

	### Tiered Pass 1 workflow (500+ pages)

	1. Initial redact (CLI or chunked) → save under `{OUTPUT_BASE}output_redact/`.
	2. Programmatic regex pass on full review CSV.
	3. `verify_redaction_coverage` → fix pages in `pages_with_policy_issues`.
	4. Prune suspicious rows → re-verify until `pass_with_cleanup: true`.
	5. One `/review_apply` → post-apply verify with `redacted_pdf_path`.
	6. Optional: `find_duplicate_pages` — review one page per duplicate cluster, propagate CSV patterns.
	7. Pass 2 VLM only on `pages_flagged_for_vlm` or user-specified ranges — never full-document visual sweep.

	### Resume checkpoints

	If the agent session dies mid-task:

	- After step 1: resume from CSV edit + verify (artifacts in `{OUTPUT_BASE}output_redact/`).
	- After apply: record which `*_redacted.pdf` is post-apply; re-run verify against that path (do not trust stale JSON).

	### Pass 2 at scale

	Do not run VLM on every page. At ~1–2 min/page sequential on local VLMs, 500 pages ≈ 8–17 hours plus large image-token cost. Use Deployment — Pass 2 VLM only for explicit flagged subsets.

	---

	## User redaction requirements (authoritative for this task)

	- All signatures should be redacted
	- Any redaction box related to general country names should be removed
	- All redactions for Rudy Giuliani should be removed
	- All mentions of London, and 'Sister City' should be redacted