---
title: TDB Intake
emoji: 🔬
colorFrom: blue
colorTo: green
sdk: streamlit
sdk_version: "1.39.0"
app_file: app.py
pinned: false
---

# Clinical Trial AI Reproduction Benchmark — Intake

A Streamlit intake form for trial statisticians. Submissions are saved to a **Hugging Face Dataset** repo. An **Admin** page (in the sidebar) lets reviewers triage submissions (`pending` / `reviewed` / `needs_fix`).

## What it does

- **Form (`app.py`)** — statisticians enter `trial_id`, `username`, and a list of questions. Each question has:
  - `design_element` (dropdown — when "Others" is picked, a free-text input appears)
  - `question_type` (dropdown — `extraction_only` / `derivation_required`)
  - `question` (free text)
  - **Rubrics** auto-generated by question type:
    - `extraction_only` → 1 rubric: `output.json`
    - `derivation_required` → 4 rubrics: `output.json` × {Inputs used, Calculated value, Method} + `output.R` × {Reproducibility}
  - Each rubric collects `points`, `tolerance`, `criterion`.
- **Admin page (`pages/1_Admin.py`)** — password-gated review console. A submission can be reviewed many times by different people: each review (status + reviewer name + comment) is written as its own file under `reviews/<submission>/`, and the page shows the full timeline. The current status is the most recent review's status. Submissions themselves are never modified.
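
The rubric rule in the form description above can be sketched as a small lookup table. This is a minimal illustration, not the code in `lib/schema.py`: `RUBRIC_TEMPLATES` and `default_rubrics` are hypothetical names, and the dimension label for `extraction_only` is left empty because it is not stated here.

```python
from typing import Dict, List

# (artifact, dimension) pairs per question type, as listed in this README.
# The extraction_only dimension label is not documented, so it stays empty.
RUBRIC_TEMPLATES = {
    "extraction_only": [("output.json", "")],
    "derivation_required": [
        ("output.json", "Inputs used"),
        ("output.json", "Calculated value"),
        ("output.json", "Method"),
        ("output.R", "Reproducibility"),
    ],
}


def default_rubrics(question_type: str) -> List[Dict[str, str]]:
    """Return blank rubric rows for a question type (hypothetical helper)."""
    if question_type not in RUBRIC_TEMPLATES:
        raise ValueError(f"unknown question_type: {question_type!r}")
    return [
        {"artifact": artifact, "dimension": dimension,
         "points": "", "criterion": "", "tolerance": ""}
        for artifact, dimension in RUBRIC_TEMPLATES[question_type]
    ]
```

The statistician then fills in `points`, `tolerance`, and `criterion` for each generated row.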

## Run locally

```bash
python -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt
streamlit run app.py
```

If the HF environment variables (listed under "Add Space secrets") are not set, submissions are written to `./data/submissions/<...>.json` on disk, which is fine for development.
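
For orientation, the local fallback could look roughly like the sketch below. It is not the actual `lib/storage.py`: `save_submission` is a hypothetical helper, and treating `HF_TOKEN` plus `HF_DATASET_REPO` as the switch between the two backends is an assumption.

```python
import json
import os
from pathlib import Path


def save_submission(payload: dict, filename: str, local_root: str = "data") -> str:
    """Write one submission and return where it went (hypothetical helper)."""
    body = json.dumps(payload, indent=2, ensure_ascii=False)
    path_in_repo = f"submissions/{filename}"
    token = os.environ.get("HF_TOKEN")
    repo = os.environ.get("HF_DATASET_REPO")

    if token and repo:
        # Imported lazily so local development works without the HF setup.
        from huggingface_hub import HfApi

        HfApi(token=token).upload_file(
            path_or_fileobj=body.encode("utf-8"),
            path_in_repo=path_in_repo,
            repo_id=repo,
            repo_type="dataset",
            revision=os.environ.get("HF_DATASET_BRANCH", "main"),
        )
        return f"{repo}:{path_in_repo}"

    # No HF configuration: fall back to <local_root>/submissions/ on disk.
    target = Path(local_root) / path_in_repo
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(body, encoding="utf-8")
    return str(target)
```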

## Deploy on Hugging Face Spaces

### 1. Create a private HF Dataset repo

- Sign in at <https://huggingface.co>
- Click your avatar → **New Dataset**
- Owner: your username (e.g. `ttt-77`)
- Name: e.g. `tdb-intake-submissions`
- Visibility: **Private**
- Create. Leave it empty.

### 2. Generate an HF access token

- <https://huggingface.co/settings/tokens> → **New token**
- Token type: **Write**
- Save the `hf_...` string.

### 3. Create the Space

- Click your avatar → **New Space**
- Name: e.g. `tdb-intake`
- SDK: **Streamlit**
- Visibility: your choice. Public works: the *form* is intended for public submission, and only the *data* needs to be private.
- Create — HF gives you a git repo URL.

### 4. Push this code to the Space

```bash
git remote add hf https://huggingface.co/spaces/<your-username>/tdb-intake
git push hf main
```

Or keep the code in a GitHub repo and sync it to the Space on every push, for example with a GitHub Actions workflow that pushes to the Space's git remote.

### 5. Add Space secrets

In the Space → **Settings → Variables and secrets** → add as **secrets**:

| Name | Value |
| --- | --- |
| `HF_TOKEN` | the token from step 2 |
| `HF_DATASET_REPO` | `<your-username>/tdb-intake-submissions` |
| `HF_DATASET_BRANCH` | `main` (optional, defaults to `main`) |
| `ADMIN_PASSWORD` | a password to share with reviewers |

The Space will restart automatically and pick up the new secrets.
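
Space secrets reach the app as environment variables. A minimal sketch of reading them, with the documented `main` default and a constant-time password check, is shown below. `Settings`, `load_settings`, and `password_ok` are hypothetical names and may not match what `lib/storage.py` does.

```python
import hmac
import os
from typing import Mapping, NamedTuple, Optional


class Settings(NamedTuple):
    token: Optional[str]
    dataset_repo: Optional[str]
    dataset_branch: str
    admin_password: Optional[str]


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Collect the four values from the table above (hypothetical helper)."""
    return Settings(
        token=env.get("HF_TOKEN"),
        dataset_repo=env.get("HF_DATASET_REPO"),
        dataset_branch=env.get("HF_DATASET_BRANCH") or "main",
        admin_password=env.get("ADMIN_PASSWORD"),
    )


def password_ok(candidate: str, settings: Settings) -> bool:
    """Constant-time comparison; always False when no password is configured."""
    if not settings.admin_password:
        return False
    return hmac.compare_digest(
        candidate.encode("utf-8"), settings.admin_password.encode("utf-8")
    )
```

Refusing access when `ADMIN_PASSWORD` is unset is a design choice in this sketch: it keeps a misconfigured Space from exposing the review console.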

### 6. Test

- Open the Space URL → fill the form → **Submit**. A new file lands in `submissions/<trial_id>__<username>__<timestamp>.json` in the dataset repo.
- Open the **Admin** page (left sidebar) → enter password → see the submission with status `pending` → add a review (your name + status + comment). It appears in the review timeline and a new file lands under `reviews/<submission>/`. Add more reviews to build up the history.

## Dataset layout

Submissions are **immutable**. Each review is a **separate file** — so a
submission can be reviewed many times by different people, and concurrent
reviews never conflict (each is a brand-new file, never an overwrite).

```text
submissions/<trial>__<user>__<stamp>.json            # the submission (never rewritten)
reviews/<trial>__<user>__<stamp>/<stamp>__<rev>.json # one file per review
```
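
As an illustration of why reviews never collide, the path of a new review can be derived from the submission path plus a fresh timestamp. In this sketch `review_path` is a hypothetical helper, and both the timestamp format and the reading of `<rev>` as a reviewer slug are assumptions.

```python
import re
from datetime import datetime, timezone
from pathlib import PurePosixPath
from typing import Optional


def review_path(submission_id: str, reviewer: str,
                now: Optional[datetime] = None) -> str:
    """Build reviews/<submission base>/<stamp>__<rev>.json (hypothetical helper)."""
    # "submissions/<trial>__<user>__<stamp>.json" -> "<trial>__<user>__<stamp>"
    base = PurePosixPath(submission_id).stem
    now = (now or datetime.now(timezone.utc)).astimezone(timezone.utc)
    # Microsecond resolution makes two reviews of one submission unlikely to clash.
    stamp = now.strftime("%Y%m%dT%H%M%S%fZ")
    # Assumed slug rule: lowercase, runs of non-alphanumerics become "-".
    slug = re.sub(r"[^A-Za-z0-9]+", "-", reviewer).strip("-").lower() or "anon"
    return f"reviews/{base}/{stamp}__{slug}.json"
```

Because every call yields a new file name, writing a review is always a create and never an update of an existing file.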

### Submission file (`submissions/*.json`)

```json
{
  "submissionId": "submissions/NCT0001__jdoe__2026-06-01T...Z.json",
  "submittedAt": "2026-06-01T...",
  "trial_id": "NCT0001",
  "username": "jdoe",
  "comparison": {
    "trial_id": "NCT0001",
    "username": "jdoe",
    "prompts": [
      {
        "id": "P-001",
        "design_element": "Sample size and power",
        "design_element_other": "",
        "question": "Total target PFS events",
        "question_type": "derivation_required",
        "rubrics": [
          {"artifact": "output.json", "dimension": "Inputs used",      "points": "5", "criterion": "...", "tolerance": "..."},
          {"artifact": "output.json", "dimension": "Calculated value", "points": "5", "criterion": "...", "tolerance": "±5%"},
          {"artifact": "output.json", "dimension": "Method",            "points": "5", "criterion": "...", "tolerance": "..."},
          {"artifact": "output.R",    "dimension": "Reproducibility",   "points": "5", "criterion": "...", "tolerance": "..."}
        ]
      }
    ]
  }
}
```

### Review file (`reviews/<submission>/*.json`)

```json
{
  "submissionId": "submissions/NCT0001__jdoe__2026-06-01T...Z.json",
  "at": "2026-06-01T16:00:00+00:00",
  "reviewer": "Dr. Lee",
  "status": "needs_fix",
  "note": "still missing the power assumption"
}
```

The **current status** of a submission is derived as the most recent review's
status (or `pending` if it has no reviews yet).
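
A minimal sketch of that derivation follows. `current_status` is a hypothetical helper, and it assumes every review carries an ISO 8601 `at` value with a UTC offset, as in the example file above.

```python
from datetime import datetime
from typing import Iterable, Mapping


def current_status(reviews: Iterable[Mapping[str, str]]) -> str:
    """Status of the most recent review, or "pending" when there are none."""
    reviews = list(reviews)
    if not reviews:
        return "pending"
    # Parse timestamps instead of comparing strings, so differing offsets still order correctly.
    latest = max(reviews, key=lambda r: datetime.fromisoformat(r["at"]))
    return latest["status"]
```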

### Load everything in Python

```python
import json
from pathlib import Path

from huggingface_hub import snapshot_download


def load_json(path):
    """Read one JSON file, closing the handle afterwards."""
    with open(path, encoding="utf-8") as fh:
        return json.load(fh)


# The repo is private: authenticate first (HF_TOKEN env var or `huggingface-cli login`).
local = Path(snapshot_download("ttt-77/tdb-intake-submissions", repo_type="dataset"))

# submissions keyed by base name (file name without ".json")
submissions = {f.stem: load_json(f) for f in local.glob("submissions/*.json")}

# reviews grouped by submission base name
reviews = {}
for f in local.glob("reviews/*/*.json"):
    reviews.setdefault(f.parent.name, []).append(load_json(f))
for items in reviews.values():
    items.sort(key=lambda r: r["at"])  # oldest first
```

## Project structure

```text
.
├── app.py                  # main intake form (entry point for HF Space)
├── pages/
│   └── 1_Admin.py          # admin review page (shown in sidebar)
├── lib/
│   ├── __init__.py
│   ├── schema.py           # constants, defaults, validators
│   └── storage.py          # HF Dataset I/O + local fs fallback + admin password check
├── requirements.txt
└── README.md
```

## Privacy notes

- The dataset repo should be **private**.
- `HF_TOKEN` and `ADMIN_PASSWORD` live only in Space secrets — never commit them.
- Rotate the token periodically.

## Extending with Python ML libs

Because the app is plain Python, NLP or model checks can be added in a few lines under `lib/`. Examples:

- `spaCy` for entity extraction on submitted SAP excerpts
- `sentence-transformers` for semantic dedup of similar questions
- `huggingface_hub.InferenceClient` for LLM-as-judge on the criterion text
- `pandas` directly in the admin page for batch stats / CSV export
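
As a starting point for the batch stats idea, the sketch below flattens one submission into one row per rubric, following the submission schema shown earlier. `rubric_rows` is a hypothetical helper and is not part of `lib/`.

```python
from typing import Any, Dict, List


def rubric_rows(submission: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Return one flat dict per rubric in a submission (hypothetical helper)."""
    rows = []
    for prompt in submission.get("comparison", {}).get("prompts", []):
        for rubric in prompt.get("rubrics", []):
            rows.append({
                "submissionId": submission.get("submissionId"),
                "trial_id": submission.get("trial_id"),
                "username": submission.get("username"),
                "prompt_id": prompt.get("id"),
                "question_type": prompt.get("question_type"),
                **rubric,  # artifact, dimension, points, criterion, tolerance
            })
    return rows
```

The resulting list can be passed straight to `pandas.DataFrame(rows)` and written out with `DataFrame.to_csv("rubrics.csv", index=False)`. Note that `points` is stored as a string in the submission file, so it needs converting before any numeric aggregation.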