---
name: doc-redaction-tabular
description: "Skill for redacting tabular files (CSV/XLSX/Parquet/DOCX) using the Document Redaction app. gradio_client-first with discovery via view_api()."
version: 1.4.0
author: repo-maintained
license: AGPL-3.0-only
changelog:
  - "v1.4.0 (Apr 21, 2026): Added /find_duplicate_tabular docs with handle_file caveat. Clarified HF Space '0 entities' quirk. Added Excel multi-sheet handling. Output file naming convention section expanded strategy comparison. Added error handling patterns and edge case notes."
---

## Purpose

Redact **tabular and semi-tabular** files using the app:
- CSV (`.csv`)
- Excel (`.xlsx`, `.xls`)
- Parquet (`.parquet`)
- Word (`.docx`)

Registered endpoints:
- **`/tabular_redact`** — short, stateless tabular redaction (preferred when present)
- **`/redact_data`** — main tabular redaction (long signature)
- **`/find_duplicate_tabular`** — detect and remove duplicate rows

## When to use this skill

Use when the document is **not a PDF/image** and you want redaction applied directly to
a table-like file.
For PDFs/images, use `doc-redaction-app` with `api_name="/redact_document"` instead.

## Decision tree (recommended)

1. **Try `gradio_client` first** — confirmed working for both endpoints.
2. **`docker cp` + raw Gradio HTTP** — fallback if gradio_client fails.
3. **Raw Gradio HTTP API** — last resort: `/gradio_api/upload` +
   `/gradio_api/call/...` + poll.

## Tabular Redaction (short: `/tabular_redact`, long: `/redact_data`)

### Quick example (CSV)

```python
from gradio_client import Client, handle_file
import os

BASE_URL = "http://host.docker.internal:7861"
client = Client(BASE_URL)
csv_path = "/path/to/table.csv"
output_folder = "/home/user/app/output/"   # CONTAINER path

wrapped_files = [handle_file(csv_path)]

result = client.predict(
    file_paths=wrapped_files,
    in_text="",                               # required str
    anon_strategy="replace with 'REDACTED'",  # see strategies
    chosen_cols=["Name", "Email"],            # columns to anonymize
    chosen_redact_entities=[
        "PERSON", "PHONE_NUMBER",
        "EMAIL_ADDRESS"
    ],
    in_allow_list=[],
    latest_file_completed=0,                  # 0 for single file
    out_message="Tabular redaction test",     # required str
    in_excel_sheets=[],                       # not Excel; see multi-sheet below
    first_loop_state=True,                    # always True for fresh run
    output_folder=output_folder,
    in_deny_list=[],
    max_fuzzy_spelling_mistakes_num=0,
    pii_identification_method="Local",
    chosen_redact_comprehend_entities=[],
    aws_access_key_textbox="",
    aws_secret_key_textbox="",
    do_initial_clean=True,
    language="en",
    progress="",
    custom_llm_instructions=["PERSON_NAME"],
    api_name="/redact_data"
)

# Returns 4 outputs: (str status, list[filepath], int count, list[filepath])
print(f"Status: {result[0]}")
print(f"Output file: {result[1][0] if result[1] else 'N/A'}")
print(f"Files redacted: {result[2]}")
```

### Excel multi-sheet handling

To discover sheet names, read the file locally first:
```python
import pandas as pd
sheets = pd.ExcelFile(csv_path).sheet_names
# e.g. ['Sheet1', 'Q1_Data', 'Summary']
```

Pass discovered names to `in_excel_sheets`:
```python
result = client.predict(
    ...,
    in_excel_sheets=["Sheet1", "Q1_Data"],  # sheet name strings
    ...
)
```

If the file has no sheets or the list is empty, the app processes all available sheets.

### HF Space (Hugging Face)

Public deployment: `https://seanpedrickcase-document-redaction.hf.space`
Same API. Key differences from local Docker:
- No "Local Inference Server" option
- No VLM face/signature entities
- Stricter file validation
- ~2–3× slower (free-tier CPU)

#### HF Space output access

`result[1]` and `result[3]` are **local temp paths** that
`gradio_client` already downloaded. Read directly:
```python
output_file = result[1][0]
with open(output_file, "r", encoding="utf-8-sig") as f:
    print(f.readline().strip())   # first row (may have BOM)
```

> **`client.download_file()` (singular) does NOT exist** on some
> `gradio_client` versions. Use the output paths directly.

#### HF Space "0 entities" quirk

On HF Space the entity count (`result[2]`) may show `0` even when
redaction is applied correctly. **Always verify by reading the
output file**, not the log count. This is a known display issue
with the spaCy model on HF.

### Container path caveat (Docker)

| Path | Works? |
|------|--------|
| `/home/user/app/output/` | Yes — container-internal, app's OUTPUT_FOLDER
| `/tmp/test_data/output/` (host) | No — server runs inside the container

**Workaround**: Use container paths. Retrieve files via `docker cp`:
```bash
# Find the file
docker exec doc_redaction-redaction-app-llama-1 \
  ls -lt /home/user/app/output/ | grep combined_case_notes

# Download
docker cp doc_redaction-redaction-app-llama-1:/home/user/app/output/combined_case_notes_anon_redact_replace.csv ./output.csv
```

## Anonymization strategies (all 4 tested)

| Strategy | Behavior | Example ("Jane Smith") |
|----------|----------|------------------------|
| `replace with 'REDACTED'` | Replaces with literal text `REDACTED` | `REDACTED` (8 chars)
| `redact completely` | Removes content (empty cell) | *(blank)*
| `mask` | Replaces with asterisks matching original length | `**********` (10 chars)
| `hash` | SHA-256 hash of original value (consistent per entity) | `ca85b082d2e6...` (94 chars)

### Choosing a strategy — use case guide

| Use case | Recommended | Why |
|----------|-------------|-----|
| GDPR compliance | `redact completely` | Maximum privacy, no length leakage |
| Audit trail | `replace with 'REDACTED'` | Clear indication of redaction occurred |
| Data science (preserve structure) | `mask` or `hash` | Keeps row/column dimensions intact |
| Cross-row correlation analysis | `hash` | Same entity → same hash across all rows |
| Legal review (need to know what was there) | `replace with 'REDACTED'` + log | Redaction text + decision log |
| Machine learning (feature engineering) | `mask` or `hash` | Preserves data shape for pipelines |
| Maximum privacy (no length info leaked) | `redact completely` | No clue about original value length |

## Key parameters

| Parameter | Type | Notes |
|-----------|------|-------|
| `chosen_cols` | list[str] | Column names to anonymize. If empty, all columns with PII are processed. Use exact column header names from the CSV/XLSX. |
| `chosen_redact_entities` | list[str] | **Local** (spaCy + custom recognizers) — full list in [`../doc-redaction-app/SKILL.md`](../doc-redaction-app/SKILL.md) §2e. Includes `TITLES`, `UKPOSTCODE`, `STREETNAME`, `CUSTOM`, `CUSTOM_FUZZY` |
| `chosen_redact_comprehend_entities` | list[str] | **AWS Comprehend** labels when `pii_identification_method="AWS Comprehend"` — full list in doc-redaction-app §2e; may also include `TITLES`, `UKPOSTCODE`, `STREETNAME`, `CUSTOM`, `CUSTOM_FUZZY` |
| `in_deny_list` / `in_allow_list` | list[str] | Terms always redacted (`CUSTOM` / `CUSTOM_FUZZY`) or never redacted (`allow_list`). See doc-redaction-app §2e |
| `max_fuzzy_spelling_mistakes_num` | int | 0 = exact match. 1–2 allows fuzzy matching (useful for typos in PII). |
| `do_initial_clean` | bool | True: strips whitespace, normalizes text. Recommended for dirty data. |
| `pii_identification_method` | str | `"Local"` (spaCy, no API keys) or `"AWS Comprehend"` (requires AWS creds). |
| `in_excel_sheets` | list[str] | For XLSX files: specific sheet names to process. Empty = all sheets. |
| `custom_llm_instructions` | list[str] | Entity types for LLM fallback (e.g., `["PERSON_NAME", "EMAIL_ADDRESS"]`). |
| `language` | str | `"en"`, `"fr"`, `"de"`, etc. Affects spaCy model language settings. |
| `latest_file_completed` | float | `0` for single file, `1.0` when last of a batch (for multi-file runs). |

## spaCy NER limitation

The default `Local` PII detection uses spaCy's English model. It **may miss short
names with initials** like "Alex D." — tested on `combined_case_notes.csv` (18 rows)
showed only "Jane Smith" (10 chars, full name) was detected in the Social Worker column,
but "Alex D." (7 chars, with initial) in the Client column was not. This is a known
spaCy limitation with short/abbreviated names.

**Workarounds:**
- `max_fuzzy_spelling_mistakes_num=1` for partial matches (won't help with initials)
- `custom_llm_instructions` with entity types for LLM-based fallback detection (requires external LLM)
- `in_deny_list` with known names to exclude from redaction (e.g., your own company name)
- `do_initial_clean=False` if cleaning is removing useful context

## Output file naming convention

The app generates files with predictable naming:
```
{original_name}_anon_{strategy}.csv
{original_name}_anon_{strategy}.csv_log.csv   (decision log)
```

| Original | Strategy | Output file |
|----------|----------|-------------|
| `combined_case_notes` | `replace with 'REDACTED'` | `combined_case_notes_anon_redact_replace.csv`
| `combined_case_notes` | `redact completely` | `combined_case_notes_anon_redact_remove.csv`
| `combined_case_notes` | `mask` | `combined_case_notes_anon_mask.csv`
| `combined_case_notes` | `hash` | `combined_case_notes_anon_hash.csv`

## Retrieving output files (Docker)

Outputs are written under the container's `OUTPUT_FOLDER`. Retrieve via:

```bash
# Find the file in container
docker exec doc_redaction-redaction-app-llama-1 \
  ls -lt /home/user/app/output/ | grep combined_case_notes | head -10
```

```bash
# Download the redacted CSV
docker cp doc_redaction-redaction-app-llama-1:/home/user/app/output/combined_case_notes_anon_redact_replace.csv ./output.csv
```

## Log file format (decision log)

The `_log.csv` file contains one row per PII entity detected:
```csv
entity_type,start,end,data_row,column,entity
PERSON,0,10,0,Social Worker,Jane Smith
```

- `entity_type` — spaCy entity label (PERSON, PHONE_NUMBER, etc.)
- `start,end` — character offsets in the cell value
- `data_row` — 0-indexed row number
- `column` — column name
- `entity` — the detected PII text

**Read with `encoding="utf-8-sig"` in Python** to strip the BOM
(`\ufeff`) that may appear at the start of CSV files.

### HF Space log caveat

On HF Space the log file path is returned in `result[3]` (not `result[2]`).
The entity count (`result[2]`) may show `0` even when redaction IS applied
correctly. **Always verify by reading the output file**, not the log count.

## Deployment variants

### HF Space (Hugging Face)

Public deployment: `https://seanpedrickcase-document-redaction.hf.space`
Same API. Key differences from local Docker:

| Aspect | Local Docker | HF Space |
|--------|-------------|----------|
| PII detection | Local, **Local Inference Server**, AWS Comprehend | Local, AWS Comprehend (no inference server) |
| OCR models | tesseract, paddle, **hybrid-paddle-inference-server**, inference-server | tesseract, paddle only |
| VLM entities | CUSTOM_VLM_FACES, CUSTOM_VLM_SIGNATURE available | **NOT** available (no GPU/VLM support) |
| efficient_ocr default | True | False (saves compute on free tier) |
| Speed | ~1.5s per request (GPU machine) | ~3–4.5s per request (~2–3× slower, CPU free tier) |
| File validation | Accepts any file type via API | Strict: only `.pdf, .jpg, .png, .json, .zip` for `/redact_document` |
| Output access | `docker cp` from container | Read `gradio_client` output paths directly (already cached locally) |

### Speed comparison (tested Apr 21, 2026)

| Strategy | Local Docker | HF Space |
|----------|-------------|----------|
| replace with REDACTED | ~1.5s | ~3.5s |
| redact completely | ~1.5s | ~3.4s |
| mask | ~1.5s | ~3.6s |
| hash | ~1.5s | ~4.5s |

## Error handling patterns

### Common errors and fixes

| Error | Cause | Fix |
|-------|-------|-----|
| `Cannot save file into a non-existent directory` | `output_folder` path doesn't exist in container | Use `/home/user/app/output/` or ensure the directory exists |
| `'meta' field must be explicitly provided. | Raw string passed to `files=` | Wrap with `handle_file(path)` |
| `Invalid file type. Please upload. | CSVs rejected by `/redact_document | Use `/redact_data` for tabular files |
| Timeout after ~3s | HF Space free tier spinning down | Accept ~3.5s runtime. No fix. |
| Entity count shows 0 | HF Space display issue | Read output file to verify. |

### Defensive coding pattern

```python
import os
from gradio_client import Client, handle_file

def safe_redact(client, csv_path, strategy="replace with 'REDACTED'"):
    """Redact a CSV file with error handling."""
    wrapped = [handle_file(csv_path)]
    
    result = client.predict(
        file_paths=wrapped,
        in_text="",
        anon_strategy=strategy,
        chosen_cols=["Name", "Email"],  # adjust to your columns
        out_message="Tabular redaction test",
        output_folder="/home/user/app/output/",
        api_name="/redact_data"
    )
    
    if not result[1]:
        raise RuntimeError("No output file returned")
    
    return result

# Usage with error handling
try:
    result = safe_redact(client, "/path/to/table.csv", "mask")
    print(f"Output: {result[1][0]}")
except Exception as e:
    print(f"Redaction failed: {e}")
```

## Edge cases & gotchas

- **CSV BOM**: CSV files may have a UTF-8 BOM (`\ufeff`) at the start. Read with `encoding="utf-8-sig"` in Python.
- **Empty columns**: If a column has no PII detected, it's left unchanged — verify your `chosen_cols` match actual headers.
- **Excel sheet names with special chars**: Use exact sheet names as they appear (e.g., `"Q1 '23 Data"`).
- **Large files**: spaCy NER on large CSVs can be slow. Test with a subset first.
- **DOCX tables**: Only tabular content in DOCX is redacted — body text paragraphs are NOT processed by `/redact_data`. Use `/redact_document` for full DOCX processing.
- **Parquet encoding**: Ensure Parquet files use UTF-8 string columns. Non-string types may be skipped by spaCy.
- **Multi-file batches**: Set `latest_file_completed=1.0` only on the last file in a batch to trigger final processing.
- **HF Space cold starts**: HF Space may take 10-30s on first call after idle. Subsequent calls are ~3-4s. |