JacobLinCool Codex committed on
Commit
8457788
·
verified ·
1 Parent(s): 92537b9

feat: add privacy filtering and execution modes


Co-authored-by: Codex <noreply@openai.com>

README.md CHANGED
@@ -20,11 +20,11 @@ it claimed completion.
20
 
21
  Built for the Build Small Hackathon. The frontend is a custom React field-notebook
22
  UI (a trail map of the session) served by `gradio.Server`; it calls the Python
23
- `analyze_trace` endpoint through `@gradio/client`. Both models run on the Space
24
- GPU through ZeroGPU: a quick `Qwen/Qwen3.5-9B` pass by default, and the larger
25
- `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` for deeper analysis. A verified
26
- deterministic codebook analyzer is the always-available recovery path and needs
27
- no model or GPU.
28
 
29
  ## Architecture
30
 
@@ -39,6 +39,8 @@ no model or GPU.
39
  renders (synthesizes the whole-session `verdict`, `captured`, `duration_total`).
40
  - `analyzer.py` / `parser.py` / `redaction.py` / `schemas.py` — the deterministic
41
  pipeline. `model_runtime.py` — the optional small-model assist on ZeroGPU.
 
 
42
 
43
  ## Run Locally
44
 
@@ -57,7 +59,7 @@ python3.11 -m unittest discover -s tests
57
 
58
  ## Analysis Engines
59
 
60
- - `Qwen3.5 9B — quick analysis`: default model pass on the Space GPU.
61
  - `NVIDIA Nemotron 3 Nano 30B-A3B — deeper analysis`: the larger model on the
62
  Space GPU for a richer memo.
63
  - `Rule-based — instant, no model`: local codebook analyzer, no model or GPU.
@@ -67,10 +69,41 @@ in model notes and returns the deterministic analysis instead of failing the
67
  whole Space.
68
 
69
  The model-backed analysis runs under `@spaces.GPU(size="xlarge")` so the weights
70
- load on Hugging Face ZeroGPU hardware; `Qwen/Qwen3.5-9B` and
71
  `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` are loaded with `transformers` and
72
- cached across requests. The rule-based engine runs on CPU and never requests a
73
- GPU slot, so it returns instantly.
74
 
75
  ## Agent Session Locations
76
 
@@ -89,5 +122,7 @@ ls ~/.pi/agent/sessions
89
 
90
  Agent traces can contain prompts, tool inputs, command outputs, local file paths,
91
  screenshots, secrets, private source code, and personal data. Review and redact
92
- before uploading or sharing publicly. The app defaults to basic regex redaction
93
- and exports only a redacted narrative text file.
 
 
 
20
 
21
  Built for the Build Small Hackathon. The frontend is a custom React field-notebook
22
  UI (a trail map of the session) served by `gradio.Server`; it calls the Python
23
+ `analyze_trace` endpoint through `@gradio/client`. Both analysis models run on the
24
+ Space GPU through ZeroGPU: a quick `openbmb/MiniCPM5-1B` pass by default, and the
25
+ larger `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` for deeper analysis. Redaction
26
+ adds a PII pass with `openai/privacy-filter`. A verified deterministic codebook
27
+ analyzer is the always-available recovery path and needs no model or GPU.
28
 
29
  ## Architecture
30
 
 
39
  renders (synthesizes the whole-session `verdict`, `captured`, `duration_total`).
40
  - `analyzer.py` / `parser.py` / `redaction.py` / `schemas.py` — the deterministic
41
  pipeline. `model_runtime.py` — the optional small-model assist on ZeroGPU.
42
+ `privacy_filter.py` — the optional `openai/privacy-filter` PII redaction pass.
43
+ `profiling.py` — logging + per-request stage timing and resource probes.
44
 
45
  ## Run Locally
46
 
 
59
 
60
  ## Analysis Engines
61
 
62
+ - `MiniCPM5 1B — quick analysis`: default model pass on the Space GPU.
63
  - `NVIDIA Nemotron 3 Nano 30B-A3B — deeper analysis`: the larger model on the
64
  Space GPU for a richer memo.
65
  - `Rule-based — instant, no model`: local codebook analyzer, no model or GPU.
 
69
  whole Space.
70
 
71
  The model-backed analysis runs under `@spaces.GPU(size="xlarge")` so the weights
72
+ load on Hugging Face ZeroGPU hardware; `openbmb/MiniCPM5-1B` and
73
  `nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16` are loaded with `transformers` and
74
+ cached across requests. The deterministic codebook analysis itself runs on CPU;
75
+ only the model assist and the `openai/privacy-filter` redaction pass use the GPU,
76
+ and both fall back gracefully (deterministic analysis / regex-only redaction)
77
+ when no GPU model is available.
78
+
79
+ ## Execution modes
80
+
81
+ Each `analyze_trace` call takes an `execution_mode`:
82
+
83
+ - `zerogpu` (default): the model passes run inside `@spaces.GPU` on the Space GPU.
84
+ - `cpu`: the model passes run on the Space (or local) CPU with **no GPU quota** —
85
+ slower, but it still works when ZeroGPU quota is exhausted. The frontend exposes
86
+ this as a **Run on** choice so users without quota can still use the app.
87
+
88
+ Model loading is device-aware (CUDA → Apple MPS → CPU), so the app also runs
89
+ locally for development; on a Mac the small models run on MPS, and the
90
+ deterministic engine needs no model at all. Because of the slower paths, the
91
+ frontend streams real progress — current stage, % complete, messages processed,
92
+ elapsed time, and a best-effort ETA — so a long run never looks stuck.
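The device preference described above can be sketched as a small pure function. This is a hypothetical illustration rather than the project's loader: `resolve_device` and its boolean arguments are assumed names, and a real loader would ask the ML framework which backends are available.

```python
def resolve_device(execution_mode: str, cuda_available: bool, mps_available: bool) -> str:
    """Pick a device string following the CUDA -> Apple MPS -> CPU order.

    Hypothetical helper: the name and arguments are assumptions for illustration.
    """
    if execution_mode == "cpu":
        # The explicit CPU mode never touches an accelerator.
        return "cpu"
    if cuda_available:
        return "cuda"
    if mps_available:
        return "mps"
    return "cpu"


# A Mac without CUDA but with MPS support.
print(resolve_device("zerogpu", cuda_available=False, mps_available=True))
```

Keeping the choice in one function makes the `cpu` escape hatch easy to test without any GPU present.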
93
+
94
+ ## Logging & profiling
95
+
96
+ The pipeline writes diagnostics to the standard logger (never the UI): per-request
97
+ message count, per-stage timing, total time, model load/inference time with the
98
+ device used, and a resource snapshot (process RSS, system memory, CPU, and
99
+ GPU/MPS memory). Set the level with `TFN_LOG_LEVEL` (default `INFO`; use `DEBUG`
100
+ for per-stage detail). Example summary line:
101
+
102
+ ```
103
+ analyze[zerogpu/minicpm] done in 19.4s | messages=4 redactions=2 episodes=1
104
+ | stages: extract=0ms, redact=9503ms, chart=4ms, classify=0ms, model_assist=9918ms
105
+ | rss=2180MB sysmem=68% mps=4732MB
106
+ ```
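How `profiling.py` reads `TFN_LOG_LEVEL` is not shown in this diff. A minimal sketch, assuming the variable holds a standard `logging` level name and that unknown values should fall back to `INFO` (`resolve_log_level` is a hypothetical helper):

```python
import logging
import os


def resolve_log_level(env: dict) -> int:
    """Map TFN_LOG_LEVEL to a numeric logging level, defaulting to INFO."""
    name = str(env.get("TFN_LOG_LEVEL", "INFO")).strip().upper()
    level = logging.getLevelName(name)
    # getLevelName returns an int for known names and a string otherwise.
    return level if isinstance(level, int) else logging.INFO


logging.basicConfig(level=resolve_log_level(dict(os.environ)))
```

Running the app with `TFN_LOG_LEVEL=DEBUG` would then enable the per-stage detail mentioned above, under the same assumption.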
107
 
108
  ## Agent Session Locations
109
 
 
122
 
123
  Agent traces can contain prompts, tool inputs, command outputs, local file paths,
124
  screenshots, secrets, private source code, and personal data. Review and redact
125
+ before uploading or sharing publicly. Redaction defaults to regex patterns plus a
126
+ model pass (`openai/privacy-filter`) that flags names, contacts, and other
127
+ personal data on the Space GPU; the regex pass is the always-available fallback
128
+ when the model is not loaded. The app exports only a redacted narrative text file.
analyzer.py CHANGED
@@ -3,15 +3,30 @@
3
  from __future__ import annotations
4
 
5
  import re
 
6
  from collections import Counter
7
  from datetime import datetime, timezone
8
  from pathlib import Path
9
  from typing import Iterable
10
 
11
- from model_runtime import MODEL_CHOICES, run_model_assist
12
  from parser import parse_trace
 
13
  from redaction import redact_text
14
- from schemas import AnalysisResult, DifficultyEpisode, MessageSpan, NarrativeMessage
15
 
16
 
17
  ANALYSIS_SCOPE = (
@@ -143,26 +158,51 @@ PROBLEM_EVIDENCE_SIGNALS = {
143
  ANALYSIS_STEPS = ("extract", "redact", "chart", "classify", "synthesize")
144
 
145
 
 
 
 
 
 
 
 
 
146
  def stream_deterministic_analysis(
147
  path: str | Path,
148
  *,
149
  include_user_context: bool = True,
150
  redact_secrets: bool = True,
151
  ignore_tool_calls: bool = True,
 
 
 
152
  ):
153
  """Run the deterministic pipeline as a generator.
154
 
155
- Yields ``("step", name)`` after each real stage completes (the names in
156
- :data:`ANALYSIS_STEPS`), then a final ``("result", (AnalysisResult, str))``.
157
- Callers that don't care about progress can just drain it for the tuple.
 
 
 
 
 
 
 
158
  """
159
 
 
 
 
160
  parsed_messages, agent_type = parse_trace(
161
  path,
162
  include_user_context=include_user_context,
163
  ignore_tool_calls=ignore_tool_calls,
164
  )
165
- yield ("step", "extract")
 
 
 
 
166
 
167
  redaction_count = 0
168
  privacy_notes = [
@@ -172,26 +212,79 @@ def stream_deterministic_analysis(
172
  if ignore_tool_calls:
173
  privacy_notes.append("Tool-call contents were ignored before analysis.")
174
 
 
175
  messages = parsed_messages
176
  if redact_secrets:
177
- redacted_messages: list[NarrativeMessage] = []
178
  all_notes: Counter[str] = Counter()
179
- for message in parsed_messages:
180
- red = redact_text(message.text)
181
- redaction_count += red.count
182
- for note in red.notes:
183
- label, _, count = note.partition(": ")
184
- all_notes[label] += int(count or 0)
185
- redacted_messages.append(
186
- NarrativeMessage(
187
- index=message.index,
188
- role=message.role,
189
- text=red.text,
190
- timestamp=message.timestamp,
191
- source=message.source,
192
  )
 
 
 
 
 
 
 
193
  )
 
194
  messages = redacted_messages
 
 
 
 
 
 
195
  if all_notes:
196
  privacy_notes.append(
197
  "Redactions applied: "
@@ -199,14 +292,22 @@ def stream_deterministic_analysis(
199
  + "."
200
  )
201
  else:
202
- privacy_notes.append("No likely secrets matched the built-in redaction patterns.")
203
  else:
204
  privacy_notes.append("Secret redaction was disabled by the user.")
205
- yield ("step", "redact")
 
 
 
 
206
 
 
207
  episodes = identify_episodes(messages)
208
- yield ("step", "chart")
 
 
209
 
 
210
  result = AnalysisResult(
211
  trace_title=derive_trace_title(path, agent_type),
212
  agent_type_guess=agent_type,
@@ -218,55 +319,194 @@ def stream_deterministic_analysis(
218
  redaction_count=redaction_count,
219
  engine="deterministic-codebook",
220
  )
221
- yield ("step", "classify")
 
222
 
 
223
  narrative_text = render_redacted_narrative(messages)
224
- yield ("step", "synthesize")
225
 
226
- yield ("result", (result, narrative_text))
227
 
228
 
229
- def apply_model_assist(
 
230
  result: AnalysisResult,
231
- narrative_text: str,
232
  analysis_engine: str,
233
  *,
234
  run=None,
235
  ) -> None:
236
- """Augment a deterministic result with model assist, with graceful fallback.
237
 
238
- ``run`` defaults to the module-level :func:`run_model_assist` (resolved at
239
- call time so tests can monkeypatch it); the Server passes a GPU-wrapped
240
- runner so model inference happens inside a ``@spaces.GPU`` allocation. The
241
- result is mutated in place; any failure leaves the deterministic result and
242
- records the reason in ``model_notes``.
243
  """
244
 
245
  if analysis_engine == "deterministic":
246
  return
247
  if analysis_engine not in MODEL_CHOICES:
248
  result.model_notes.append(
249
- f"Unknown analysis engine {analysis_engine!r}; deterministic analysis was returned."
250
  )
251
  return
252
- runner = run or run_model_assist
 
 
 
253
  try:
254
- assist = runner(
255
  engine=analysis_engine,
256
- result=result,
257
- narrative_text=narrative_text,
 
258
  )
259
  except Exception as exc:
260
  error_message = str(exc).strip().rstrip(".")
261
  result.model_notes.append(
262
- "Model assist was requested but unavailable: "
263
  f"{type(exc).__name__}: {error_message}. "
264
- "Deterministic analysis was returned."
265
  )
266
  else:
267
- result.engine = f"deterministic-codebook + {assist.model_id}"
268
- result.model_memo = assist.memo
269
- result.model_notes.append(assist.note)
 
 
 
270
 
271
 
272
  def analyze_trace_file(
@@ -282,6 +522,7 @@ def analyze_trace_file(
282
 
283
  result: AnalysisResult | None = None
284
  narrative_text = ""
 
285
  for kind, payload in stream_deterministic_analysis(
286
  path,
287
  include_user_context=include_user_context,
@@ -289,9 +530,9 @@ def analyze_trace_file(
289
  ignore_tool_calls=ignore_tool_calls,
290
  ):
291
  if kind == "result":
292
- result, narrative_text = payload
293
  assert result is not None
294
- apply_model_assist(result, narrative_text, analysis_engine)
295
  return result, narrative_text
296
 
297
 
 
3
  from __future__ import annotations
4
 
5
  import re
6
+ import time
7
  from collections import Counter
8
  from datetime import datetime, timezone
9
  from pathlib import Path
10
  from typing import Iterable
11
 
12
+ from model_runtime import MODEL_CHOICES, run_model_analysis
13
  from parser import parse_trace
14
+ from profiling import Profiler, get_logger
15
  from redaction import redact_text
16
+ from schemas import (
17
+ APPRAISALS,
18
+ DETOUR_TYPES,
19
+ DIFFICULTY_TYPES,
20
+ OUTCOME_CLAIMS,
21
+ RECOVERY_PATTERNS,
22
+ RESOLUTION_MODES,
23
+ AnalysisResult,
24
+ DifficultyEpisode,
25
+ MessageSpan,
26
+ NarrativeMessage,
27
+ )
28
+
29
+ logger = get_logger()
30
 
31
 
32
  ANALYSIS_SCOPE = (
 
158
  ANALYSIS_STEPS = ("extract", "redact", "chart", "classify", "synthesize")
159
 
160
 
161
+ def _accumulate_notes(counter: Counter[str], notes: Iterable[str]) -> None:
162
+ """Fold ``"label: count"`` note strings into a running counter."""
163
+
164
+ for note in notes:
165
+ label, _, count = note.partition(": ")
166
+ counter[label] += int(count or 0)
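As a quick illustration of the helper above, here is a standalone copy folding two batches of notes. The labels `api_key`, `email`, and `path` are made up for the example; the real labels come from `redaction.py`, which this diff does not show.

```python
from collections import Counter
from typing import Iterable


def accumulate_notes(counter: Counter, notes: Iterable[str]) -> None:
    """Fold "label: count" strings into a running counter."""
    for note in notes:
        label, _, count = note.partition(": ")
        # A note without ": " leaves count empty, which is treated as 0.
        counter[label] += int(count or 0)


totals = Counter()
accumulate_notes(totals, ["api_key: 2", "email: 1"])
accumulate_notes(totals, ["api_key: 1", "path"])
print(totals["api_key"], totals["email"], totals["path"])
```

A note whose count is not an integer (for example `"email: many"`) would raise `ValueError`, so the helper relies on the redactors emitting well-formed notes.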
167
+
168
+
169
  def stream_deterministic_analysis(
170
  path: str | Path,
171
  *,
172
  include_user_context: bool = True,
173
  redact_secrets: bool = True,
174
  ignore_tool_calls: bool = True,
175
+ model_redact=None,
176
+ profiler: Profiler | None = None,
177
+ stream_redact_progress: bool = False,
178
  ):
179
  """Run the deterministic pipeline as a generator.
180
 
181
+ Yields ``("progress", info)`` after each real stage completes, where ``info``
182
+ has a ``stage`` name (one of :data:`ANALYSIS_STEPS`) plus running counts, then
183
+ a final ``("result", (AnalysisResult, str, list[NarrativeMessage]))``. Callers
184
+ that don't care about progress can just drain it for the tuple.
185
+
186
+ ``model_redact`` is an optional ``(list[str]) -> list[RedactionResult]``
187
+ callable applied on top of regex redaction; the Server injects a GPU- or
188
+ CPU-bound ``openai/privacy-filter`` pass. It is absent locally and in tests,
189
+ so redaction falls back to regex only. ``profiler`` collects per-stage
190
+ timings; one is created if not supplied.
191
  """
192
 
193
+ prof = profiler or Profiler("deterministic")
194
+
195
+ _started = time.perf_counter()
196
  parsed_messages, agent_type = parse_trace(
197
  path,
198
  include_user_context=include_user_context,
199
  ignore_tool_calls=ignore_tool_calls,
200
  )
201
+ prof.record("extract", time.perf_counter() - _started)
202
+ message_count = len(parsed_messages)
203
+ prof.mark(messages=message_count, agent=agent_type)
204
+ logger.info("parsed %d narrative messages (agent=%s)", message_count, agent_type)
205
+ yield ("progress", {"stage": "extract", "messages": message_count})
206
 
207
  redaction_count = 0
208
  privacy_notes = [
 
212
  if ignore_tool_calls:
213
  privacy_notes.append("Tool-call contents were ignored before analysis.")
214
 
215
+ _redact_started = time.perf_counter()
216
  messages = parsed_messages
217
  if redact_secrets:
 
218
  all_notes: Counter[str] = Counter()
219
+ redacted_messages: list[NarrativeMessage] = []
220
+ model_used = False
221
+ model_failed = False
222
+
223
+ # Process in chunks so slow (CPU) runs can stream per-message progress.
224
+ # Without streaming (ZeroGPU) it is a single chunk = one GPU allocation;
225
+ # with streaming the update count is capped at ~30 regardless of size.
226
+ if stream_redact_progress and message_count:
227
+ chunk = max(1, (message_count + 29) // 30)
228
+ else:
229
+ chunk = message_count or 1
230
+
231
+ for start in range(0, message_count, chunk):
232
+ chunk_messages = parsed_messages[start : start + chunk]
233
+
234
+ # Pass 1: deterministic regex redaction (always available).
235
+ regex_results = [redact_text(message.text) for message in chunk_messages]
236
+ texts = [red.text for red in regex_results]
237
+
238
+ # Pass 2: optional model PII pass on top. The Server injects a GPU- or
239
+ # CPU-bound openai/privacy-filter pass; it is absent locally and in
240
+ # tests, so regex-only redaction is used. Once it is unavailable we
241
+ # stop retrying it for the rest of the trace.
242
+ model_results = None
243
+ if model_redact is not None and not model_failed:
244
+ try:
245
+ model_results = model_redact(texts)
246
+ model_used = True
247
+ except Exception as exc: # noqa: BLE001 - graceful degradation
248
+ privacy_notes.append(
249
+ "AI privacy filter was unavailable "
250
+ f"({type(exc).__name__}); regex redaction was applied."
251
+ )
252
+ model_failed = True
253
+ model_results = None
254
+
255
+ for i, message in enumerate(chunk_messages):
256
+ text = texts[i]
257
+ redaction_count += regex_results[i].count
258
+ _accumulate_notes(all_notes, regex_results[i].notes)
259
+ if model_results is not None:
260
+ text = model_results[i].text
261
+ redaction_count += model_results[i].count
262
+ _accumulate_notes(all_notes, model_results[i].notes)
263
+ redacted_messages.append(
264
+ NarrativeMessage(
265
+ index=message.index,
266
+ role=message.role,
267
+ text=text,
268
+ timestamp=message.timestamp,
269
+ source=message.source,
270
+ )
271
  )
272
+ yield (
273
+ "progress",
274
+ {
275
+ "stage": "redact",
276
+ "processed": min(start + chunk, message_count),
277
+ "total": message_count,
278
+ },
279
  )
280
+
281
  messages = redacted_messages
282
+
283
+ if model_used:
284
+ privacy_notes.append(
285
+ "AI privacy filter (openai/privacy-filter) screened for names, "
286
+ "contacts, and other personal data."
287
+ )
288
  if all_notes:
289
  privacy_notes.append(
290
  "Redactions applied: "
 
292
  + "."
293
  )
294
  else:
295
+ privacy_notes.append("No likely secrets matched the redaction patterns.")
296
  else:
297
  privacy_notes.append("Secret redaction was disabled by the user.")
298
+ prof.record("redact", time.perf_counter() - _redact_started)
299
+ prof.mark(redactions=redaction_count)
300
+ if not redact_secrets or message_count == 0:
301
+ # No chunk loop ran (redaction disabled or empty trace) — still advance.
302
+ yield ("progress", {"stage": "redact", "processed": message_count, "total": message_count})
303
 
304
+ _chart_started = time.perf_counter()
305
  episodes = identify_episodes(messages)
306
+ prof.record("chart", time.perf_counter() - _chart_started)
307
+ prof.mark(episodes=len(episodes))
308
+ yield ("progress", {"stage": "chart", "messages": message_count})
309
 
310
+ _classify_started = time.perf_counter()
311
  result = AnalysisResult(
312
  trace_title=derive_trace_title(path, agent_type),
313
  agent_type_guess=agent_type,
 
319
  redaction_count=redaction_count,
320
  engine="deterministic-codebook",
321
  )
322
+ prof.record("classify", time.perf_counter() - _classify_started)
323
+ yield ("progress", {"stage": "classify", "messages": message_count})
324
 
325
+ _synth_started = time.perf_counter()
326
  narrative_text = render_redacted_narrative(messages)
327
+ prof.record("synthesize", time.perf_counter() - _synth_started)
328
+ yield ("progress", {"stage": "synthesize", "messages": message_count})
329
+
330
+ yield ("result", (result, narrative_text, messages))
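The chunk-size rule used in the redaction loop can be checked in isolation. The sketch below copies the arithmetic from the function above into two hypothetical helpers (`redact_chunk_size` and `update_count` are names chosen for this example) and counts how many progress updates a trace of a given size would produce.

```python
def redact_chunk_size(message_count: int, stream_progress: bool) -> int:
    """Chunk size: ceil(n / 30) when streaming, otherwise one chunk for everything."""
    if stream_progress and message_count:
        return max(1, (message_count + 29) // 30)
    return message_count or 1


def update_count(message_count: int, stream_progress: bool) -> int:
    """Number of chunk iterations, i.e. redact progress events from the loop."""
    chunk = redact_chunk_size(message_count, stream_progress)
    return len(range(0, message_count, chunk))


for n in (0, 30, 100, 1000):
    print(n, redact_chunk_size(n, True), update_count(n, True))
```

Because the chunk size is `ceil(n / 30)`, the number of chunks never exceeds 30, while the non-streaming path always runs a single chunk and therefore makes a single call to the model redactor.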
331
+
332
+
333
+ _PRODUCTIVE_VALUES = {"yes", "no", "mixed", "unknown"}
334
+ _VALID_TONES = {"stable", "iterative", "detour", "partial", "risk", "unknown"}
335
+ _VALID_HONESTY = {"candid", "mixed", "overclaimed"}
336
+
337
+
338
+ def build_numbered_narrative(
339
+ messages: list[NarrativeMessage], *, char_budget: int = 16000, per_message: int = 320
340
+ ) -> str:
341
+ """Number the (redacted) messages by real index for the model.
342
+
343
+ Long traces are sampled evenly across the session (keeping the first and last)
344
+ so the model sees the whole timeline within its context budget; each line keeps
345
+ the message's real index and timestamp so the model can cite spans.
346
+ """
347
+
348
+ if not messages:
349
+ return ""
350
+ max_messages = max(1, char_budget // per_message)
351
+ if len(messages) <= max_messages:
352
+ chosen = messages
353
+ else:
354
+ stride = len(messages) / max_messages
355
+ picks = sorted({0, len(messages) - 1, *(int(i * stride) for i in range(max_messages))})
356
+ chosen = [messages[i] for i in picks if 0 <= i < len(messages)]
357
+ lines = []
358
+ for message in chosen:
359
+ snippet = " ".join(message.text.split())[:per_message]
360
+ lines.append(f"[{message.index}] {message.role} {message.timestamp or ''}: {snippet}")
361
+ return "\n".join(lines)
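The even sampling in `build_numbered_narrative` is easier to see on bare indices. This sketch reproduces only the index selection; `sample_indices` is a hypothetical name used for the example.

```python
def sample_indices(total: int, max_items: int) -> list:
    """Pick evenly spaced indices, always keeping the first and the last."""
    if total <= max_items:
        return list(range(total))
    stride = total / max_items
    picks = {0, total - 1, *(int(i * stride) for i in range(max_items))}
    return sorted(picks)


print(sample_indices(10, 4))
```

With the defaults above (`16000 // 320`), up to 50 evenly spaced messages are chosen. Since the last index is always added, the selection can contain one more entry than `max_items`.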
362
+
363
+
364
+ def build_codebook_hint(episodes: list[DifficultyEpisode]) -> str:
365
+ if not episodes:
366
+ return "(none)"
367
+ return "; ".join(
368
+ f"{ep.episode_id} msgs {ep.message_span.start_index}-{ep.message_span.end_index}"
369
+ for ep in episodes[:12]
370
+ )
371
+
372
+
373
+ def _coerce_code(value: object, vocab: dict[str, str]) -> str:
374
+ code = str(value or "").strip()
375
+ return code if code in vocab else "unknown"
376
+
377
+
378
+ # Weak models sometimes echo the schema placeholders verbatim; drop those.
379
+ _PLACEHOLDER_RE = re.compile(
380
+ r"^\s*(<.*>|<=.*|\d+(\s*-\s*\d+)?\s+sentences?.*|one key.*|short verbatim.*|up to \d+.*|a message index.*)\s*$",
381
+ re.IGNORECASE,
382
+ )
383
+
384
+
385
+ def _clean_text(value: object) -> str:
386
+ text = str(value or "").strip()
387
+ if not text or _PLACEHOLDER_RE.match(text):
388
+ return ""
389
+ return text
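To see what the placeholder filter drops, the pattern above can be exercised on a few strings. The sample inputs are invented for this sketch; only the regular expression is taken from the diff.

```python
import re

PLACEHOLDER_RE = re.compile(
    r"^\s*(<.*>|<=.*|\d+(\s*-\s*\d+)?\s+sentences?.*|one key.*|short verbatim.*|up to \d+.*|a message index.*)\s*$",
    re.IGNORECASE,
)


def clean_text(value: object) -> str:
    """Return stripped text, or "" when it looks like an echoed schema placeholder."""
    text = str(value or "").strip()
    if not text or PLACEHOLDER_RE.match(text):
        return ""
    return text


for sample in ("<one sentence>", "1-2 sentences on the cause", "The build failed twice."):
    print(repr(sample), "->", repr(clean_text(sample)))
```

The filter is deliberately blunt: a genuine sentence that happens to start with a placeholder phrase, such as "One key file was missing", is dropped as well.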
390
+
391
+
392
+ def _clean_verdict(verdict: dict) -> dict[str, str]:
393
+ tone = str(verdict.get("tone", "")).strip().lower()
394
+ honesty = str(verdict.get("honesty", "")).strip().lower()
395
+ return {
396
+ "tone": tone if tone in _VALID_TONES else "unknown",
397
+ "headline": _clean_text(verdict.get("headline")) or "Session analyzed by the model.",
398
+ "detail": _clean_text(verdict.get("detail")),
399
+ "honesty": honesty if honesty in _VALID_HONESTY else "mixed",
400
+ }
401
 
 
402
 
403
+ def _episode_from_model(
404
+ raw: dict, ordinal: int, index_to_timestamp: dict[int, str | None], max_index: int
405
+ ) -> DifficultyEpisode:
406
+ def clamp(value: object) -> int:
407
+ try:
408
+ return max(0, min(int(value), max_index))
409
+ except (TypeError, ValueError):
410
+ return 0
411
+
412
+ start = clamp(raw.get("start_index", 0))
413
+ end = clamp(raw.get("end_index", start))
414
+ if end < start:
415
+ start, end = end, start
416
+ start_time = index_to_timestamp.get(start)
417
+ end_time = index_to_timestamp.get(end)
418
+ span = MessageSpan(
419
+ start_index=start,
420
+ end_index=end,
421
+ start_time=start_time,
422
+ end_time=end_time,
423
+ duration_label=duration_label(start_time, end_time) if start_time and end_time else "unknown",
424
+ )
425
+ productive = str(raw.get("productive_detour", "unknown")).strip().lower()
426
+ quotes = [cleaned for q in (raw.get("evidence_quotes") or []) if (cleaned := _clean_text(q))][:3]
427
+ difficulty = _clean_text(raw.get("reported_difficulty"))
428
+ title = _clean_text(raw.get("title")) or (difficulty[:60] if difficulty else "Difficulty episode")
429
+ return DifficultyEpisode(
430
+ episode_id=f"E{ordinal:02d}",
431
+ title=title,
432
+ message_span=span,
433
+ initial_intention=_clean_text(raw.get("initial_intention")),
434
+ reported_difficulty=difficulty,
435
+ difficulty_type=_coerce_code(raw.get("difficulty_type"), DIFFICULTY_TYPES),
436
+ appraisal=_coerce_code(raw.get("appraisal"), APPRAISALS),
437
+ strategy_before=_clean_text(raw.get("strategy_before")),
438
+ strategy_after=_clean_text(raw.get("strategy_after")),
439
+ detour_type=_coerce_code(raw.get("detour_type"), DETOUR_TYPES),
440
+ resolution_mode=_coerce_code(raw.get("resolution_mode"), RESOLUTION_MODES),
441
+ recovery_pattern=_coerce_code(raw.get("recovery_pattern"), RECOVERY_PATTERNS),
442
+ outcome_claim=_coerce_code(raw.get("outcome_claim"), OUTCOME_CLAIMS),
443
+ productive_detour=productive if productive in _PRODUCTIVE_VALUES else "unknown",
444
+ evidence_quotes=quotes,
445
+ analyst_memo=_clean_text(raw.get("analyst_memo")),
446
+ )
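The span handling in `_episode_from_model` tolerates malformed model output. A standalone sketch of just that step, with `clamp_span` as a hypothetical name and the end index passed explicitly instead of defaulting to the start:

```python
def clamp_span(raw_start: object, raw_end: object, max_index: int) -> tuple:
    """Coerce model-supplied indices into an ordered (start, end) pair within range."""

    def clamp(value: object) -> int:
        try:
            return max(0, min(int(value), max_index))
        except (TypeError, ValueError):
            # Non-numeric or missing values fall back to the first message.
            return 0

    start, end = clamp(raw_start), clamp(raw_end)
    if end < start:
        start, end = end, start
    return start, end


print(clamp_span("7", 3, 5))
print(clamp_span(None, "n/a", 5))
```

Out-of-range values are pulled to the nearest valid index and reversed spans are swapped, so the resulting `MessageSpan` always refers to a valid index range.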
447
 
448
+
449
+ def apply_model_analysis(
450
  result: AnalysisResult,
451
+ messages: list[NarrativeMessage],
452
  analysis_engine: str,
453
  *,
454
  run=None,
455
  ) -> None:
456
+ """Replace the deterministic analysis with a model-produced one (codebook is the fallback).
457
 
458
+ ``run`` defaults to :func:`run_model_analysis` (resolved at call time so tests
459
+ can monkeypatch it); the Server passes a GPU- or CPU-bound runner. On success
460
+ the model's episodes, overall patterns, and verdict replace the rule-based
461
+ ones. On any failure the deterministic codebook result is kept and the reason
462
+ recorded in ``model_notes``.
463
  """
464
 
465
  if analysis_engine == "deterministic":
466
  return
467
  if analysis_engine not in MODEL_CHOICES:
468
  result.model_notes.append(
469
+ f"Unknown analysis engine {analysis_engine!r}; rule-based analysis was returned."
470
  )
471
  return
472
+
473
+ runner = run or run_model_analysis
474
+ numbered_narrative = build_numbered_narrative(messages)
475
+ codebook_hint = build_codebook_hint(result.episodes)
476
  try:
477
+ produced = runner(
478
  engine=analysis_engine,
479
+ numbered_narrative=numbered_narrative,
480
+ agent_type=result.agent_type_guess,
481
+ codebook_hint=codebook_hint,
482
  )
483
  except Exception as exc:
484
  error_message = str(exc).strip().rstrip(".")
485
  result.model_notes.append(
486
+ "Model analysis was requested but unavailable: "
487
  f"{type(exc).__name__}: {error_message}. "
488
+ "Rule-based analysis was returned."
489
  )
490
+ return
491
+
492
+ analysis = produced.analysis
493
+ index_to_timestamp = {message.index: message.timestamp for message in messages}
494
+ max_index = (len(messages) - 1) if messages else 0
495
+ episodes = [
496
+ _episode_from_model(raw, ordinal + 1, index_to_timestamp, max_index)
497
+ for ordinal, raw in enumerate(analysis.get("episodes", []))
498
+ ]
499
+ result.episodes = episodes
500
+ patterns = analysis.get("overall_patterns")
501
+ if isinstance(patterns, dict) and patterns:
502
+ result.overall_patterns = {key: str(value) for key, value in patterns.items()}
503
  else:
504
+ result.overall_patterns = summarize_patterns(episodes, messages)
505
+ verdict = analysis.get("verdict")
506
+ if isinstance(verdict, dict) and verdict:
507
+ result.session_verdict = _clean_verdict(verdict)
508
+ result.engine = produced.model_id
509
+ result.model_notes.append(produced.note)
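The fallback behaviour of `apply_model_analysis` can be reduced to a few lines. In this sketch `Result`, `apply_with_fallback`, and `out_of_quota` are stand-ins invented for the example; the real function also rebuilds episodes, patterns, and the verdict on success.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Result:
    """Stand-in for AnalysisResult with only the fields this sketch needs."""

    engine: str = "deterministic-codebook"
    model_notes: List[str] = field(default_factory=list)


def apply_with_fallback(result: Result, runner: Callable[[], str]) -> None:
    """Run the model; on any failure keep the rule-based result and record why."""
    try:
        model_id = runner()
    except Exception as exc:
        message = str(exc).strip().rstrip(".")
        result.model_notes.append(
            f"Model analysis was requested but unavailable: {type(exc).__name__}: {message}. "
            "Rule-based analysis was returned."
        )
        return
    result.engine = model_id


def out_of_quota() -> str:
    raise RuntimeError("ZeroGPU quota exhausted.")


failed = Result()
apply_with_fallback(failed, out_of_quota)
print(failed.engine, failed.model_notes)
```

Returning early from the `except` branch keeps the success path unindented, which matches the shape of the new function in this commit.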
510
 
511
 
512
  def analyze_trace_file(
 
522
 
523
  result: AnalysisResult | None = None
524
  narrative_text = ""
525
+ messages: list[NarrativeMessage] = []
526
  for kind, payload in stream_deterministic_analysis(
527
  path,
528
  include_user_context=include_user_context,
 
530
  ignore_tool_calls=ignore_tool_calls,
531
  ):
532
  if kind == "result":
533
+ result, narrative_text, messages = payload
534
  assert result is not None
535
+ apply_model_analysis(result, messages, analysis_engine)
536
  return result, narrative_text
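The generator protocol consumed above can be tried without a trace file. The sketch below uses a stub generator (`fake_stream`, with made-up payload values) that emits the same `("progress", ...)` and `("result", ...)` tuples, and a hypothetical `drain` helper that collects them.

```python
from typing import Iterator


def fake_stream() -> Iterator[tuple]:
    """Stub that mimics the event shapes of stream_deterministic_analysis."""
    yield ("progress", {"stage": "extract", "messages": 2})
    yield ("progress", {"stage": "redact", "processed": 2, "total": 2})
    yield ("result", ("analysis", "narrative", ["m0", "m1"]))


def drain(stream):
    """Collect the stage names seen and the final result payload."""
    stages, final = [], None
    for kind, payload in stream:
        if kind == "progress":
            stages.append(payload["stage"])
        elif kind == "result":
            final = payload
    return stages, final


stages, final = drain(fake_stream())
result, narrative_text, messages = final
print(stages, narrative_text, len(messages))
```

Callers that ignore progress, like `analyze_trace_file`, only need the `"result"` branch; a server endpoint can forward the `"progress"` payloads to the UI instead.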
537
 
538
 
app.py CHANGED
@@ -9,6 +9,7 @@ returns the frontend-ready view model.
9
  from __future__ import annotations
10
 
11
  import os
 
12
  from pathlib import Path
13
 
14
  import spaces
@@ -17,10 +18,13 @@ from fastapi.staticfiles import StaticFiles
17
  from gradio import Server
18
  from gradio.data_classes import FileData
19
 
20
- from analyzer import apply_model_assist, stream_deterministic_analysis
21
  from parser import TraceParseError
 
22
  from view_model import build_view_model
23
 
 
 
24
 
25
  HERE = Path(__file__).resolve().parent
26
  FRONTEND = HERE / "frontend"
@@ -51,8 +55,9 @@ messages and ignores raw tool telemetry.
51
 
52
  - `trace_file` (file): the session log
53
  - `include_user_context` (bool): include user prompts as framing
54
- - `redact_secrets` (bool): redact likely secrets before analysis
55
- - `analysis_engine` (str): `qwen` | `nemotron` | `deterministic`
 
56
 
57
  Returns a JSON view model: a whole-session `verdict`, per-episode difficulty
58
  `episodes`, and redacted export text.
@@ -74,18 +79,101 @@ def agents_md() -> str:
74
 
75
 
76
  @spaces.GPU(size="xlarge", duration=180)
77
- def _model_assist_gpu(*, engine, result, narrative_text):
78
- """Run model assist inside a ZeroGPU allocation."""
79
 
80
- from model_runtime import run_model_assist
 
81
 
82
- return run_model_assist(engine=engine, result=result, narrative_text=narrative_text)
83
 
 
84
 
85
- # Completed-step count for the frontend's 6-item checklist. Keep the final
86
- # synthesis row active until the final payload is ready, because model assist
87
- # runs after deterministic synthesis on the ZeroGPU path.
88
- _STEP_COUNT = {"extract": 2, "redact": 3, "chart": 4, "classify": 5, "synthesize": 5}
89
 
90
 
91
  def _file_fields(trace_file: object) -> tuple[str | None, str | None]:
@@ -101,42 +189,88 @@ def analyze_trace(
101
  trace_file: FileData,
102
  include_user_context: bool = True,
103
  redact_secrets: bool = True,
104
- analysis_engine: str = "qwen",
 
105
  ) -> dict:
106
  """Stream real progress, then the frontend view model, for one trace.
107
 
108
- Yields ``{"step": n}`` after each real pipeline stage (so the UI checklist
109
- tracks actual work), then a final ``{"step": 6, "result": <view model>}``.
 
 
 
 
110
  """
111
 
112
  path, orig_name = _file_fields(trace_file)
113
  if not path:
114
  raise ValueError("No uploaded file was received.")
115
 
 
 
 
 
 
 
 
 
 
 
 
 
 
116
  result = None
117
  narrative = ""
 
 
118
  try:
119
  for kind, payload in stream_deterministic_analysis(
120
  path,
121
  include_user_context=include_user_context,
122
  redact_secrets=redact_secrets,
123
  ignore_tool_calls=True,
 
 
 
124
  ):
125
- if kind == "step":
126
- yield {"step": _STEP_COUNT[payload]}
 
 
 
127
  elif kind == "result":
128
- result, narrative = payload
129
  except TraceParseError as exc:
130
  raise ValueError(str(exc)) from exc
131
 
132
  if analysis_engine != "deterministic":
133
- apply_model_assist(result, narrative, analysis_engine, run=_model_assist_gpu)
 
 
 
 
 
 
 
 
 
134
 
135
  if orig_name:
136
  agent = READABLE_AGENT.get(result.agent_type_guess, "Agent")
137
  result.trace_title = f"{agent} · {orig_name}"
138
 
139
- yield {"step": 6, "result": build_view_model(result, narrative)}
 
 
 
 
 
 
 
 
 
 
 
140
 
141
 
142
  if __name__ == "__main__":
 
9
  from __future__ import annotations
10
 
11
  import os
12
+ import time
13
  from pathlib import Path
14
 
15
  import spaces
 
18
  from gradio import Server
19
  from gradio.data_classes import FileData
20
 
21
+ from analyzer import apply_model_analysis, stream_deterministic_analysis
22
  from parser import TraceParseError
23
+ from profiling import Profiler, get_logger
24
  from view_model import build_view_model
25
 
26
+ logger = get_logger()
27
+
28
 
29
  HERE = Path(__file__).resolve().parent
30
  FRONTEND = HERE / "frontend"
 
55
 
56
  - `trace_file` (file): the session log
57
  - `include_user_context` (bool): include user prompts as framing
58
+ - `redact_secrets` (bool): regex + AI (`openai/privacy-filter`) PII redaction before analysis
59
+ - `analysis_engine` (str): `minicpm` | `nemotron` | `deterministic`
60
+ - `execution_mode` (str): `zerogpu` (default, uses the Space GPU) | `cpu` (no GPU quota, slower)
61
 
62
  Returns a JSON view model: a whole-session `verdict`, per-episode difficulty
63
  `episodes`, and redacted export text.
 
79
 
80
 
81
  @spaces.GPU(size="xlarge", duration=180)
82
+ def _model_analysis_gpu(*, engine, numbered_narrative, agent_type, codebook_hint):
83
+ """Run the primary model analysis inside a ZeroGPU allocation."""
84
+
85
+ from model_runtime import run_model_analysis
86
+
87
+ return run_model_analysis(
88
+ engine=engine,
89
+ numbered_narrative=numbered_narrative,
90
+ agent_type=agent_type,
91
+ codebook_hint=codebook_hint,
92
+ )
93
+
94
+
95
+ @spaces.GPU(size="xlarge", duration=120)
96
+ def _privacy_filter_gpu(texts):
97
+ """Run the openai/privacy-filter PII pass inside a ZeroGPU allocation."""
98
+
99
+ from privacy_filter import redact_texts
100
+
101
+ return redact_texts(texts)
102
+
103
 
104
+ def _cpu_privacy_filter(texts):
105
+ """Run the openai/privacy-filter PII pass on the local CPU (no GPU quota)."""
106
 
107
+ from privacy_filter import redact_texts
108
 
109
+ return redact_texts(texts, device="cpu")
110
 
111
+
112
+ def _cpu_model_analysis(*, engine, numbered_narrative, agent_type, codebook_hint):
113
+ """Run the primary model analysis on the local CPU (no GPU quota)."""
114
+
115
+ from model_runtime import run_model_analysis
116
+
117
+ return run_model_analysis(
118
+ engine=engine,
119
+ numbered_narrative=numbered_narrative,
120
+ agent_type=agent_type,
121
+ codebook_hint=codebook_hint,
122
+ device="cpu",
123
+ )
124
+
125
+
126
+ # Per stage: (frontend checklist index, cumulative %, label). The 6-item
127
+ # checklist is: 0 upload, 1 extract, 2 redact, 3 chart, 4 classify, 5 synthesize.
128
+ # Indices below are "rows completed" so the matching row shows as active.
129
+ _STAGE_PLAN = {
130
+ "extract": (2, 12, "Extracting narrative messages"),
131
+ "chart": (4, 55, "Charting difficulty episodes"),
132
+ "classify": (5, 62, "Classifying with the codebook"),
133
+ "synthesize": (5, 70, "Synthesizing field notes"),
134
+ }
135
+
136
+ # Redaction streams per-chunk progress; its % ramps across this band.
137
+ _REDACT_PCT = (12, 40)
138
+
139
+
140
+ def _progress_event(*, step, pct, label, elapsed, processed=None, total=None):
141
+ """Build one streamed progress payload (with a best-effort ETA)."""
142
+
143
+ event = {"step": step, "pct": pct, "stage": label, "elapsed": round(elapsed, 1)}
144
+ if 0 < pct < 100:
145
+ event["eta"] = round(elapsed * (100 - pct) / pct, 1)
146
+ if total is not None:
147
+ event["total"] = total
148
+ event["processed"] = processed if processed is not None else total
149
+ return event
150
+
151
+
152
+ def _stage_event(payload, *, elapsed, message_total):
153
+ """Translate a stream progress payload into a frontend event + running total."""
154
+
155
+ stage = payload["stage"]
156
+ if stage == "redact":
157
+ total = payload.get("total") or message_total or 0
158
+ processed = payload.get("processed", total)
159
+ frac = (processed / total) if total else 1.0
160
+ low, high = _REDACT_PCT
161
+ pct = round(low + (high - low) * frac)
162
+ step = 2 if (total and processed < total) else 3
163
+ event = _progress_event(
164
+ step=step,
165
+ pct=pct,
166
+ label="Redacting likely secrets",
167
+ elapsed=elapsed,
168
+ processed=processed,
169
+ total=total or None,
170
+ )
171
+ return event, (total or message_total)
172
+
173
+ step, pct, label = _STAGE_PLAN[stage]
174
+ total = payload.get("messages", message_total)
175
+ event = _progress_event(step=step, pct=pct, label=label, elapsed=elapsed, total=total)
176
+ return event, total
177
 
178
 
179
  def _file_fields(trace_file: object) -> tuple[str | None, str | None]:
 
189
  trace_file: FileData,
190
  include_user_context: bool = True,
191
  redact_secrets: bool = True,
192
+ analysis_engine: str = "minicpm",
193
+ execution_mode: str = "zerogpu",
194
  ) -> dict:
195
  """Stream real progress, then the frontend view model, for one trace.
196
 
197
+ Yields ``{"step", "pct", "stage", "elapsed", "eta", "total", "processed"}`` after each
198
+ real pipeline stage (so the UI shows true progress), then a final
199
+ ``{"step": 6, "pct": 100, "result": <view model>}``.
200
+
201
+ ``execution_mode`` is ``zerogpu`` (default; models run inside ``@spaces.GPU``)
202
+ or ``cpu`` (models run on the Space/local CPU, no GPU quota — slower).
203
  """
204
 
205
  path, orig_name = _file_fields(trace_file)
206
  if not path:
207
  raise ValueError("No uploaded file was received.")
208
 
209
+ use_cpu = execution_mode == "cpu"
210
+ redactor = _cpu_privacy_filter if use_cpu else _privacy_filter_gpu
211
+ analysis_runner = _cpu_model_analysis if use_cpu else _model_analysis_gpu
212
+
213
+ prof = Profiler(f"analyze[{execution_mode}/{analysis_engine}]")
214
+ logger.info(
215
+ "analyze_trace start: file=%r engine=%s mode=%s redact=%s",
216
+ orig_name,
217
+ analysis_engine,
218
+ execution_mode,
219
+ redact_secrets,
220
+ )
221
+
222
  result = None
223
  narrative = ""
224
+ messages = []
225
+ message_total = None
226
  try:
227
  for kind, payload in stream_deterministic_analysis(
228
  path,
229
  include_user_context=include_user_context,
230
  redact_secrets=redact_secrets,
231
  ignore_tool_calls=True,
232
+ model_redact=redactor,
233
+ profiler=prof,
234
+ stream_redact_progress=use_cpu,
235
  ):
236
+ if kind == "progress":
237
+ event, message_total = _stage_event(
238
+ payload, elapsed=prof.elapsed(), message_total=message_total
239
+ )
240
+ yield event
241
  elif kind == "result":
242
+ result, narrative, messages = payload
243
  except TraceParseError as exc:
244
  raise ValueError(str(exc)) from exc
245
 
246
  if analysis_engine != "deterministic":
247
+ yield _progress_event(
248
+ step=5,
249
+ pct=78,
250
+ label=f"Reading the trace with {analysis_engine}",
251
+ elapsed=prof.elapsed(),
252
+ total=message_total,
253
+ )
254
+ analysis_started = time.perf_counter()
255
+ apply_model_analysis(result, messages, analysis_engine, run=analysis_runner)
256
+ prof.record("model_analysis", time.perf_counter() - analysis_started)
257
 
258
  if orig_name:
259
  agent = READABLE_AGENT.get(result.agent_type_guess, "Agent")
260
  result.trace_title = f"{agent} · {orig_name}"
261
 
262
+ view = build_view_model(result, narrative)
263
+ prof.mark(engine=result.engine, mode=execution_mode)
264
+ prof.summary()
265
+ yield {
266
+ "step": 6,
267
+ "pct": 100,
268
+ "stage": "Field notes ready",
269
+ "elapsed": round(prof.elapsed(), 1),
270
+ "total": message_total,
271
+ "processed": message_total,
272
+ "result": view,
273
+ }
274
 
275
 
276
  if __name__ == "__main__":
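
Review note on the progress math in `app.py` above: the ETA in `_progress_event` is a linear extrapolation from elapsed time, and the redaction stage maps its processed/total fraction onto the `_REDACT_PCT` band. The sketch below restates both calculations as standalone functions so they can be checked in isolation. The names `eta_seconds` and `redact_pct` are hypothetical and do not appear in the commit; only the formulas and the `(12, 40)` band are taken from the diff.

```python
from __future__ import annotations

# Mirrors _REDACT_PCT in the diff: redaction progress ramps from 12% to 40%.
REDACT_BAND = (12, 40)


def eta_seconds(elapsed: float, pct: float) -> float | None:
    """Linear ETA: if `pct` percent took `elapsed` seconds, estimate the rest.

    Returns None at 0% (nothing to extrapolate from) and at 100% (done),
    matching the `0 < pct < 100` guard in `_progress_event`.
    """
    if 0 < pct < 100:
        return round(elapsed * (100 - pct) / pct, 1)
    return None


def redact_pct(processed: int, total: int, band: tuple[int, int] = REDACT_BAND) -> int:
    """Map redaction progress onto the percentage band.

    An unknown or zero total is treated as complete, as in `_stage_event`.
    """
    frac = (processed / total) if total else 1.0
    low, high = band
    return round(low + (high - low) * frac)
```

Because the estimate is linear, it will tend to overshoot early in a run, when a small percentage has absorbed one-off costs such as model loading. That is presumably why the docstring calls it a best-effort ETA.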
frontend/static/app.jsx CHANGED
@@ -31,19 +31,23 @@ function TopBar() {
31
  </div>
32
  </div>
33
  <div className="topbar__right mono">
34
- <span className="topbar__pill">narrative-only</span>
35
- <span className="topbar__pill">privacy-first</span>
36
  </div>
37
  </header>
38
  );
39
  }
40
 
41
  const ENGINES = [
42
- ["qwen", "Quick analysis", "Qwen3.5 9B"],
43
  ["nemotron", "Deeper analysis", "Nemotron 3 Nano 30B-A3B"],
44
  ["deterministic", "Rule-based", "no model, always on"],
45
  ];
46
 
 
 
 
 
 
47
  function Toggle({ on, set, label, sub, locked }) {
48
  return (
49
  <button className={"toggle" + (on ? " toggle--on" : "") + (locked ? " toggle--locked" : "")}
@@ -61,7 +65,8 @@ function LandingView({ onAnalyze, onSample, error }) {
61
  const [staged, setStaged] = React.useState(null); // { name, file }
62
  const [redact, setRedact] = React.useState(true);
63
  const [userCtx, setUserCtx] = React.useState(true);
64
- const [engine, setEngine] = React.useState("qwen");
 
65
  const [dragOver, setDragOver] = React.useState(false);
66
  const [copied, setCopied] = React.useState(false);
67
  const fileRef = React.useRef(null);
@@ -76,7 +81,7 @@ function LandingView({ onAnalyze, onSample, error }) {
76
  function pick() { if (fileRef.current) fileRef.current.click(); }
77
  function run() {
78
  if (!staged) return;
79
- onAnalyze({ file: staged.file, include_user_context: userCtx, redact_secrets: redact, analysis_engine: engine, engineLabel });
80
  }
81
 
82
  const AGENT_PROMPT = `Use this Space as a tool.
@@ -92,7 +97,7 @@ function LandingView({ onAnalyze, onSample, error }) {
92
  <TopBar />
93
 
94
  <section className="hero">
95
- <h1 className="hero__title">See how your coding agent<br /> got stuck, detoured, recovered<span className="hero__amp"> &amp; </span>claimed success.</h1>
96
  <p className="hero__sub">
97
  Upload a Codex, Claude Code, or Pi Agent session log. Trace Field Notes reads only the agent's
98
  <em> narrated</em> messages — what it planned, where it snagged, how it rerouted, and how honestly it called it done —
@@ -104,7 +109,7 @@ function LandingView({ onAnalyze, onSample, error }) {
104
  <span className="privacy__mark">!</span>
105
  <p>
106
  Agent traces can carry prompts, command output, local paths, screenshots, secrets, and private code.
107
- <b> Review and redact before uploading or sharing.</b> This app analyzes only visible narrative messages and ignores raw tool telemetry by default.
108
  </p>
109
  </div>
110
 
@@ -145,7 +150,7 @@ function LandingView({ onAnalyze, onSample, error }) {
145
  </div>
146
 
147
  <div className="opts">
148
- <Toggle on={redact} set={setRedact} label="Redact likely secrets" sub="emails, tokens, keys, paths" />
149
  <Toggle on={userCtx} set={setUserCtx} label="Include user context" sub="user prompts as framing" />
150
  <Toggle on={true} set={() => {}} locked label="Ignore tool contents" sub="locked for this release" />
151
  </div>
@@ -162,7 +167,22 @@ function LandingView({ onAnalyze, onSample, error }) {
162
  </button>
163
  ))}
164
  </div>
165
- <p className="engine__note muted">Quick uses Qwen3.5 9B on the Space GPU. Deeper uses Nemotron 3 Nano 30B-A3B. Rule-based needs no model and never fails.</p>
166
  </div>
167
 
168
  <div className="panel__actions">
@@ -235,7 +255,14 @@ const PIPELINE = [
235
  "Synthesizing field notes",
236
  ];
237
 
238
- function Analyzing({ label, step }) {
 
 
 
 
 
 
 
239
  return (
240
  <div className="analyzing">
241
  <div className="analyzing__card card card--raised">
@@ -247,6 +274,25 @@ function Analyzing({ label, step }) {
247
  <circle className="analyzing__dot" r="4.5" fill="var(--accent)" />
248
  </svg>
249
  <Kicker>Surveying the trace · {label}</Kicker>
250
  <ul className="analyzing__steps">
251
  {PIPELINE.map((s, i) => (
252
  <li key={s} className={i < step ? "done" : i === step ? "active" : ""}>
@@ -286,11 +332,13 @@ function App() {
286
  const [engineLabel, setEngineLabel] = React.useState("");
287
  const [error, setError] = React.useState("");
288
  const [step, setStep] = React.useState(0);
 
289
 
290
- async function analyze({ file, include_user_context, redact_secrets, analysis_engine, engineLabel }) {
291
  setError("");
292
  setEngineLabel(engineLabel || analysis_engine);
293
  setStep(0);
 
294
  setStage("analyzing");
295
  window.scrollTo({ top: 0 });
296
  try {
@@ -302,6 +350,7 @@ function App() {
302
  include_user_context: !!include_user_context,
303
  redact_secrets: !!redact_secrets,
304
  analysis_engine,
 
305
  });
306
  let result = null;
307
  for await (const msg of sub) {
@@ -313,6 +362,16 @@ function App() {
313
  } else if (typeof p.step === "number") {
314
  setStep(Math.min(p.step, PIPELINE.length - 1));
315
  }
 
 
 
 
 
 
 
 
 
 
316
  }
317
  } else if (msg.type === "status") {
318
  if (msg.stage === "error") throw new Error(msg.message || "The analyzer failed on the server.");
@@ -349,7 +408,7 @@ function App() {
349
  <div className="backdrop"><div className="grain" /><TopoBackground /></div>
350
  <div className="page">
351
  {stage === "landing" && <LandingView onAnalyze={analyze} onSample={loadSample} error={error} />}
352
- {stage === "analyzing" && <Analyzing label={engineLabel} step={step} />}
353
  {stage === "report" && (
354
  <div className="report-wrap">
355
  <button className="report-back btn btn--sm btn--ghost" onClick={reset}>← New trace</button>
 
31
  </div>
32
  </div>
33
  <div className="topbar__right mono">
34
+ <span className="topbar__pill">build small</span>
 
35
  </div>
36
  </header>
37
  );
38
  }
39
 
40
  const ENGINES = [
41
+ ["minicpm", "Quick analysis", "MiniCPM5 1B"],
42
  ["nemotron", "Deeper analysis", "Nemotron 3 Nano 30B-A3B"],
43
  ["deterministic", "Rule-based", "no model, always on"],
44
  ];
45
 
46
+ const EXEC_MODES = [
47
+ ["zerogpu", "GPU", "Space GPU · faster"],
48
+ ["cpu", "CPU", "no GPU quota · slower"],
49
+ ];
50
+
51
  function Toggle({ on, set, label, sub, locked }) {
52
  return (
53
  <button className={"toggle" + (on ? " toggle--on" : "") + (locked ? " toggle--locked" : "")}
 
65
  const [staged, setStaged] = React.useState(null); // { name, file }
66
  const [redact, setRedact] = React.useState(true);
67
  const [userCtx, setUserCtx] = React.useState(true);
68
+ const [engine, setEngine] = React.useState("minicpm");
69
+ const [execMode, setExecMode] = React.useState("zerogpu");
70
  const [dragOver, setDragOver] = React.useState(false);
71
  const [copied, setCopied] = React.useState(false);
72
  const fileRef = React.useRef(null);
 
81
  function pick() { if (fileRef.current) fileRef.current.click(); }
82
  function run() {
83
  if (!staged) return;
84
+ onAnalyze({ file: staged.file, include_user_context: userCtx, redact_secrets: redact, analysis_engine: engine, execution_mode: execMode, engineLabel });
85
  }
86
 
87
  const AGENT_PROMPT = `Use this Space as a tool.
 
97
  <TopBar />
98
 
99
  <section className="hero">
100
+ <h1 className="hero__title">See how your coding agent<br /> got stuck, detoured, recovered<span className="hero__amp"> &amp; </span><br />claimed success.</h1>
101
  <p className="hero__sub">
102
  Upload a Codex, Claude Code, or Pi Agent session log. Trace Field Notes reads only the agent's
103
  <em> narrated</em> messages — what it planned, where it snagged, how it rerouted, and how honestly it called it done —
 
109
  <span className="privacy__mark">!</span>
110
  <p>
111
  Agent traces can carry prompts, command output, local paths, screenshots, secrets, and private code.
112
+ <b> Review and redact before uploading or sharing.</b> This app analyzes only visible narrative messages, ignores raw tool telemetry by default, and scrubs secrets and personal data with pattern rules plus OpenAI's privacy-filter model.
113
  </p>
114
  </div>
115
 
 
150
  </div>
151
 
152
  <div className="opts">
153
+ <Toggle on={redact} set={setRedact} label="Redact secrets & personal data" sub="regex + AI: names, contacts, tokens, keys, paths" />
154
  <Toggle on={userCtx} set={setUserCtx} label="Include user context" sub="user prompts as framing" />
155
  <Toggle on={true} set={() => {}} locked label="Ignore tool contents" sub="locked for this release" />
156
  </div>
 
167
  </button>
168
  ))}
169
  </div>
170
+ <p className="engine__note muted">Quick uses MiniCPM5 1B on the Space GPU. Deeper uses Nemotron 3 Nano 30B-A3B. Rule-based needs no model and never fails.</p>
171
+ </div>
172
+
173
+ <div className="engine">
174
+ <Label>Run on</Label>
175
+ <div className="engine__opts">
176
+ {EXEC_MODES.map(([key, name, detail]) => (
177
+ <button key={key}
178
+ className={"engine__opt" + (execMode === key ? " engine__opt--on" : "")}
179
+ onClick={() => setExecMode(key)}>
180
+ <span className="engine__name">{name}</span>
181
+ <span className="engine__detail mono">{detail}</span>
182
+ </button>
183
+ ))}
184
+ </div>
185
+ <p className="engine__note muted">ZeroGPU is fast but spends your Space GPU quota. CPU needs no quota and still works if you've run out — just slower, so the progress bar will move more gradually.</p>
186
  </div>
187
 
188
  <div className="panel__actions">
 
255
  "Synthesizing field notes",
256
  ];
257
 
258
+ function fmtSeconds(s) {
259
+ if (s == null || isNaN(s)) return "—";
260
+ const m = Math.floor(s / 60), sec = Math.round(s % 60);
261
+ return m > 0 ? `${m}m ${sec}s` : `${sec}s`;
262
+ }
263
+
264
+ function Analyzing({ label, step, progress }) {
265
+ const pct = progress && typeof progress.pct === "number" ? Math.max(0, Math.min(100, progress.pct)) : null;
266
  return (
267
  <div className="analyzing">
268
  <div className="analyzing__card card card--raised">
 
274
  <circle className="analyzing__dot" r="4.5" fill="var(--accent)" />
275
  </svg>
276
  <Kicker>Surveying the trace · {label}</Kicker>
277
+ {pct != null && (
278
+ <div style={{ margin: "12px 0 2px" }}>
279
+ <div style={{ height: 6, borderRadius: 4, background: "var(--rule)", overflow: "hidden" }}>
280
+ <div style={{ width: pct + "%", height: "100%", background: "var(--accent)", transition: "width .45s ease" }} />
281
+ </div>
282
+ <div className="mono muted" style={{ display: "flex", justifyContent: "space-between", gap: 12, fontSize: 12, marginTop: 7 }}>
283
+ <span>{pct}%{progress.stage ? " · " + progress.stage : ""}</span>
284
+ <span>
285
+ {progress.total != null
286
+ ? (progress.processed != null && progress.processed < progress.total
287
+ ? progress.processed + "/" + progress.total
288
+ : progress.total) + " msgs · "
289
+ : ""}
290
+ {fmtSeconds(progress.elapsed)} elapsed
291
+ {progress.eta != null && pct < 100 ? " · ~" + fmtSeconds(progress.eta) + " left" : ""}
292
+ </span>
293
+ </div>
294
+ </div>
295
+ )}
296
  <ul className="analyzing__steps">
297
  {PIPELINE.map((s, i) => (
298
  <li key={s} className={i < step ? "done" : i === step ? "active" : ""}>
 
332
  const [engineLabel, setEngineLabel] = React.useState("");
333
  const [error, setError] = React.useState("");
334
  const [step, setStep] = React.useState(0);
335
+ const [progress, setProgress] = React.useState(null);
336
 
337
+ async function analyze({ file, include_user_context, redact_secrets, analysis_engine, execution_mode, engineLabel }) {
338
  setError("");
339
  setEngineLabel(engineLabel || analysis_engine);
340
  setStep(0);
341
+ setProgress(null);
342
  setStage("analyzing");
343
  window.scrollTo({ top: 0 });
344
  try {
 
350
  include_user_context: !!include_user_context,
351
  redact_secrets: !!redact_secrets,
352
  analysis_engine,
353
+ execution_mode,
354
  });
355
  let result = null;
356
  for await (const msg of sub) {
 
362
  } else if (typeof p.step === "number") {
363
  setStep(Math.min(p.step, PIPELINE.length - 1));
364
  }
365
+ if (typeof p.pct === "number") {
366
+ setProgress({
367
+ pct: p.pct,
368
+ elapsed: p.elapsed,
369
+ eta: p.eta,
370
+ total: p.total,
371
+ processed: p.processed,
372
+ stage: p.stage,
373
+ });
374
+ }
375
  }
376
  } else if (msg.type === "status") {
377
  if (msg.stage === "error") throw new Error(msg.message || "The analyzer failed on the server.");
 
408
  <div className="backdrop"><div className="grain" /><TopoBackground /></div>
409
  <div className="page">
410
  {stage === "landing" && <LandingView onAnalyze={analyze} onSample={loadSample} error={error} />}
411
+ {stage === "analyzing" && <Analyzing label={engineLabel} step={step} progress={progress} />}
412
  {stage === "report" && (
413
  <div className="report-wrap">
414
  <button className="report-back btn btn--sm btn--ghost" onClick={reset}>← New trace</button>
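
Review note on `fmtSeconds` in `app.jsx` above: it computes the minutes with `Math.floor(s / 60)` and the seconds with `Math.round(s % 60)` independently, so an input such as `59.6` yields `60s` rather than `1m 0s`. A minimal sketch of one way to avoid that, written in Python for consistency with the backend, is to round the total first and then split it. The function name `fmt_seconds` and the `placeholder` argument are hypothetical and are not part of the commit.

```python
import math


def fmt_seconds(s, placeholder="-"):
    """Format a non-negative duration in seconds as '42s' or '2m 5s'.

    Rounds the whole duration before splitting it, so the seconds part
    can never display as 60.
    """
    if s is None or (isinstance(s, float) and math.isnan(s)):
        return placeholder
    total = int(s + 0.5)  # round half up for non-negative input
    minutes, seconds = divmod(total, 60)
    return f"{minutes}m {seconds}s" if minutes > 0 else f"{seconds}s"
```

The same reordering should carry over to the JSX helper by rounding `s` once and then taking the quotient and remainder.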
frontend/static/components.jsx CHANGED
@@ -414,12 +414,22 @@ function ReportHeader({ data }) {
414
  }
415
 
416
  function ModelStatus({ data }) {
417
- const notes = (data.privacy_notes || []).filter((note) => String(note).startsWith("Model assist"));
 
 
418
  if (!notes.length) return null;
 
 
 
419
  return (
420
  <div className="privacy model-status">
421
- <span className="privacy__mark">!</span>
422
- <p><b>Model assist fell back to the rule-based analyzer.</b> {notes.join(" ")}</p>
 
 
 
 
 
423
  </div>
424
  );
425
  }
 
414
  }
415
 
416
  function ModelStatus({ data }) {
417
+ const notes = (data.privacy_notes || []).filter((note) =>
418
+ /^(Analysis produced|Model analysis|Model assist|Unknown analysis engine)/.test(String(note))
419
+ );
420
  if (!notes.length) return null;
421
+ const fellBack = notes.some((note) =>
422
+ /unavailable|rule-based analysis was returned|deterministic analysis was returned|unknown analysis engine/i.test(note)
423
+ );
424
  return (
425
  <div className="privacy model-status">
426
+ <span className="privacy__mark">{fellBack ? "!" : "✓"}</span>
427
+ <p>
428
+ <b>{fellBack
429
+ ? "Model unavailable — showing the rule-based analysis instead."
430
+ : "This report was written by the model."}</b>{" "}
431
+ {notes.join(" ")}
432
+ </p>
433
  </div>
434
  );
435
  }
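
Review note on `ModelStatus` in `components.jsx` above: the component infers the banner state purely from the wording of the strings in `privacy_notes`, so the backend note text and the two regular expressions have to stay in sync. The sketch below ports the same two patterns to Python as a rough way to check note strings against them. The function name `model_status` and the fallback sample string in the usage note are hypothetical; the patterns and the `Analysis produced by ...` note format are taken from the diff.

```python
import re

# Notes that belong to the model-status banner (same prefixes as the JSX filter).
STATUS_PREFIX = re.compile(
    r"^(Analysis produced|Model analysis|Model assist|Unknown analysis engine)"
)

# Wording that signals a fallback to the rule-based analyzer (case-insensitive).
FALLBACK = re.compile(
    r"unavailable|rule-based analysis was returned|"
    r"deterministic analysis was returned|unknown analysis engine",
    re.IGNORECASE,
)


def model_status(privacy_notes):
    """Return None when there is no banner, else the banner state and notes."""
    notes = [str(n) for n in (privacy_notes or []) if STATUS_PREFIX.match(str(n))]
    if not notes:
        return None
    fell_back = any(FALLBACK.search(n) for n in notes)
    return {"fell_back": fell_back, "notes": notes}
```

For example, `model_status(["Analysis produced by openbmb/MiniCPM5-1B."])` reports `fell_back` as `False`, while a note such as `"Model analysis unavailable."` (an invented string used only for illustration) reports `True`.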
model_runtime.py CHANGED
@@ -12,20 +12,31 @@ from __future__ import annotations
12
 
13
  import json
14
  import re
 
15
  from collections.abc import Mapping
16
  from dataclasses import dataclass
17
  from typing import Any, Callable
18
 
19
- from schemas import AnalysisResult
 
 
 
 
 
 
 
 
 
 
20
 
21
 
22
  PRIMARY_MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
23
- QUICK_MODEL_ID = "Qwen/Qwen3.5-9B"
24
  MODEL_MAX_NEW_TOKENS = 8192
25
 
26
  MODEL_CHOICES = {
27
- "qwen": {
28
- "label": "Qwen3.5 9B — quick analysis",
29
  "model_id": QUICK_MODEL_ID,
30
  },
31
  "nemotron": {
@@ -45,9 +56,9 @@ _MODEL_CACHE: dict[str, Any] = {}
45
 
46
 
47
  @dataclass(slots=True)
48
- class ModelAssistResult:
49
  model_id: str
50
- memo: dict[str, Any]
51
  note: str
52
 
53
 
@@ -59,38 +70,82 @@ def model_id_for_engine(engine: str) -> str | None:
59
  return str(model_id) if model_id else None
60
 
61
 
62
- def run_model_assist(
63
  *,
64
  engine: str,
65
- result: AnalysisResult,
66
- narrative_text: str,
 
67
  generate: GenerateFn | None = None,
68
- ) -> ModelAssistResult:
69
- """Run the selected model on the GPU and return a concise grounded memo."""
 
 
 
 
 
 
 
 
70
 
71
  model_id = model_id_for_engine(engine)
72
  if not model_id:
73
  raise ValueError(f"No model is configured for analysis engine {engine!r}.")
74
 
75
- prompt = build_model_prompt(result, narrative_text)
 
 
76
  messages = [
77
  {
78
  "role": "system",
79
  "content": (
80
- "You analyze visible coding-agent narrative messages. "
81
- "Do not infer hidden reasoning. Return JSON only."
 
82
  ),
83
  },
84
  {"role": "user", "content": prompt},
85
  ]
86
 
87
- generator = generate or _local_generator
88
- content = generator(messages, model_id=model_id, max_new_tokens=MODEL_MAX_NEW_TOKENS)
89
- memo = parse_model_json(content)
90
- return ModelAssistResult(
91
  model_id=model_id,
92
- memo=memo,
93
- note=f"Model assist completed on the Space GPU with {model_id}.",
94
  )
95
 
96
 
@@ -99,16 +154,18 @@ def _local_generator(
99
  *,
100
  model_id: str,
101
  max_new_tokens: int,
 
102
  ) -> str:
103
- """Generate text with a locally loaded model on the ZeroGPU device.
104
 
105
- Imported lazily: ``torch`` only needs to exist on the GPU Space, never for
106
- the deterministic path, tests, or local development.
 
107
  """
108
 
109
  import torch
110
 
111
- tokenizer, model = _load_model(model_id)
112
  chat_inputs = tokenizer.apply_chat_template(
113
  messages,
114
  add_generation_prompt=True,
@@ -163,78 +220,146 @@ def _move_to_device(value: Any, device: Any) -> Any:
163
  def _chat_template_kwargs(model_id: str) -> dict[str, Any]:
164
  """Model-specific chat-template controls."""
165
 
166
- if model_id.startswith("Qwen/"):
167
- return {"enable_thinking": True}
 
 
168
  return {}
169
 
170
 
171
- def _load_model(model_id: str) -> Any:
172
- """Lazily load and cache a (tokenizer, model) pair on the GPU.
173
 
174
  The cache keeps weights resident across requests so only the first call per
175
- model pays the load cost. ZeroGPU exposes CUDA inside the ``@spaces.GPU``
176
- context, which is where this runs.
 
177
  """
178
 
179
- cached = _MODEL_CACHE.get(model_id)
 
 
 
 
180
  if cached is not None:
181
  return cached
182
 
183
- import torch
184
  from transformers import AutoModelForCausalLM, AutoTokenizer
185
 
 
186
  tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
187
- model = AutoModelForCausalLM.from_pretrained(
188
- model_id,
189
- torch_dtype=torch.bfloat16,
190
- device_map="cuda",
191
- trust_remote_code=True,
192
- )
 
 
 
 
 
 
 
 
 
 
193
  model.eval()
194
- _MODEL_CACHE[model_id] = (tokenizer, model)
 
195
  return tokenizer, model
196
 
197
 
198
- def build_model_prompt(result: AnalysisResult, narrative_text: str) -> str:
199
- deterministic_json = json.dumps(result.to_dict(), ensure_ascii=False, indent=2)
200
- narrative_excerpt = narrative_text[:12000]
201
- return f"""Use the deterministic codebook analysis and redacted visible narrative below.
202
-
203
- Return JSON with exactly these keys:
204
- - executive_memo: 4-6 sentences for a developer
205
- - detour_memo: 2-4 sentences about productive detours vs wandering
206
- - outcome_audit_memo: 2-4 sentences about completion claims and caveats
207
- - caveats: array of short strings
208
 
209
- Rules:
210
- - Return one valid JSON object and nothing else.
211
- - The first non-whitespace character must be {{ and the last must be }}.
212
- - Analyze only visible narrative messages.
213
- - Do not claim to know hidden reasoning.
214
- - Cite episode IDs where useful.
215
- - Do not include raw secrets, tool outputs, or long quotes.
216
 
217
- Deterministic analysis:
218
- {deterministic_json}
219
-
220
- Redacted narrative excerpt:
221
- {narrative_excerpt}
222
  """
223
 
224
 
225
- def parse_model_json(content: str) -> dict[str, Any]:
226
- parsed = _loads_lenient(content)
227
 
228
- required = {
229
- "executive_memo": str,
230
- "detour_memo": str,
231
- "outcome_audit_memo": str,
232
- "caveats": list,
233
- }
234
- for key, expected_type in required.items():
235
- if key not in parsed or not isinstance(parsed[key], expected_type):
236
- raise ValueError(f"Model response missing {key!r} as {expected_type.__name__}.")
237
- parsed["caveats"] = [str(item) for item in parsed["caveats"][:6]]
238
  return parsed
239
 
240
 
 
12
 
13
  import json
14
  import re
15
+ import time
16
  from collections.abc import Mapping
17
  from dataclasses import dataclass
18
  from typing import Any, Callable
19
 
20
+ from profiling import get_logger
21
+ from schemas import (
22
+ APPRAISALS,
23
+ DETOUR_TYPES,
24
+ DIFFICULTY_TYPES,
25
+ OUTCOME_CLAIMS,
26
+ RECOVERY_PATTERNS,
27
+ RESOLUTION_MODES,
28
+ )
29
+
30
+ logger = get_logger()
31
 
32
 
33
  PRIMARY_MODEL_ID = "nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-BF16"
34
+ QUICK_MODEL_ID = "openbmb/MiniCPM5-1B"
35
  MODEL_MAX_NEW_TOKENS = 8192
36
 
37
  MODEL_CHOICES = {
38
+ "minicpm": {
39
+ "label": "MiniCPM5 1B — quick analysis",
40
  "model_id": QUICK_MODEL_ID,
41
  },
42
  "nemotron": {
 
56
 
57
 
58
  @dataclass(slots=True)
59
+ class ModelAnalysisResult:
60
  model_id: str
61
+ analysis: dict[str, Any]
62
  note: str
63
 
64
 
 
70
  return str(model_id) if model_id else None
71
 
72
 
73
+ def resolve_device(device: str | None = None) -> str:
74
+ """Pick the compute device: explicit override, else cuda -> mps -> cpu."""
75
+
76
+ if device:
77
+ return device
78
+ import torch
79
+
80
+ if torch.cuda.is_available():
81
+ return "cuda"
82
+ mps = getattr(torch.backends, "mps", None)
83
+ if mps is not None and mps.is_available():
84
+ return "mps"
85
+ return "cpu"
86
+
87
+
88
+ def run_model_analysis(
89
  *,
90
  engine: str,
91
+ numbered_narrative: str,
92
+ agent_type: str = "unknown",
93
+ codebook_hint: str = "",
94
  generate: GenerateFn | None = None,
95
+ device: str | None = None,
96
+ ) -> ModelAnalysisResult:
97
+ """Run the selected model as the primary analyst and return a field report.
98
+
99
+ The model identifies and classifies the difficulty episodes and writes the
100
+ session verdict directly from the visible narrative; the deterministic codebook
101
+ is only a fallback (used by the caller if this raises). ``device`` forces the
102
+ compute device for the default local generator; an injected ``generate`` is
103
+ used as-is.
104
+ """
105
 
106
  model_id = model_id_for_engine(engine)
107
  if not model_id:
108
  raise ValueError(f"No model is configured for analysis engine {engine!r}.")
109
 
110
+ prompt = build_analysis_prompt(
111
+ numbered_narrative, agent_type=agent_type, codebook_hint=codebook_hint
112
+ )
113
  messages = [
114
  {
115
  "role": "system",
116
  "content": (
117
+ "You are an expert analyst of coding-agent session traces. "
118
+ "Judge only the visible narrative; never invent hidden reasoning. "
119
+ "Return one JSON object and nothing else."
120
  ),
121
  },
122
  {"role": "user", "content": prompt},
123
  ]
124
 
125
+ started = time.perf_counter()
126
+ if generate is not None:
127
+ content = generate(messages, model_id=model_id, max_new_tokens=MODEL_MAX_NEW_TOKENS)
128
+ device_label = "injected"
129
+ else:
130
+ device_label = resolve_device(device)
131
+ content = _local_generator(
132
+ messages,
133
+ model_id=model_id,
134
+ max_new_tokens=MODEL_MAX_NEW_TOKENS,
135
+ device=device_label,
136
+ )
137
+ logger.info(
138
+ "model analysis: %s on %s in %.2fs (%d chars in)",
139
+ model_id,
140
+ device_label,
141
+ time.perf_counter() - started,
142
+ len(numbered_narrative),
143
+ )
144
+ analysis = parse_analysis_json(content)
145
+ return ModelAnalysisResult(
146
  model_id=model_id,
147
+ analysis=analysis,
148
+ note=f"Analysis produced by {model_id}.",
149
  )
150
 
151
 
 
154
  *,
155
  model_id: str,
156
  max_new_tokens: int,
157
+ device: str | None = None,
158
  ) -> str:
159
+ """Generate text with a locally loaded model on the chosen device.
160
 
161
+ Imported lazily: ``torch`` only needs to exist on the GPU Space (or a local
162
+ machine running the model), never for the deterministic path, tests, or
163
+ light local development.
164
  """
165
 
166
  import torch
167
 
168
+ tokenizer, model = _load_model(model_id, device=device)
169
  chat_inputs = tokenizer.apply_chat_template(
170
  messages,
171
  add_generation_prompt=True,
 
220
  def _chat_template_kwargs(model_id: str) -> dict[str, Any]:
221
  """Model-specific chat-template controls."""
222
 
223
+ if model_id.startswith("openbmb/"):
224
+ # MiniCPM5 supports hybrid reasoning; the quick engine keeps thinking
225
+ # off for fast, reliably parseable JSON output.
226
+ return {"enable_thinking": False}
227
  return {}
228
 
229
 
230
+ def _load_model(model_id: str, device: str | None = None) -> Any:
231
+ """Lazily load and cache a (tokenizer, model) pair on the chosen device.
232
 
233
  The cache keeps weights resident across requests so only the first call per
234
+ (model, device) pays the load cost. ZeroGPU exposes CUDA inside the
235
+ ``@spaces.GPU`` context; CPU/MPS support lets the app run off-Space (e.g. for
236
+ users without GPU quota, or local development).
237
  """
238
 
239
+ import torch
240
+
241
+ resolved = resolve_device(device)
242
+ cache_key = f"{model_id}@{resolved}"
243
+ cached = _MODEL_CACHE.get(cache_key)
244
  if cached is not None:
245
  return cached
246
 
 
247
  from transformers import AutoModelForCausalLM, AutoTokenizer
248
 
249
+ started = time.perf_counter()
250
  tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
251
+ if resolved == "cuda":
252
+ # The ZeroGPU Space path: load straight onto the GPU in bfloat16.
253
+ model = AutoModelForCausalLM.from_pretrained(
254
+ model_id,
255
+ dtype=torch.bfloat16,
256
+ device_map="cuda",
257
+ trust_remote_code=True,
258
+ )
259
+ else:
260
+ # CPU / Apple MPS: fp16 on MPS, fp32 on CPU for numerical stability.
261
+ dtype = torch.float16 if resolved == "mps" else torch.float32
262
+ model = AutoModelForCausalLM.from_pretrained(
263
+ model_id,
264
+ dtype=dtype,
265
+ trust_remote_code=True,
266
+ ).to(resolved)
267
  model.eval()
268
+ logger.info("loaded %s on %s in %.1fs", model_id, resolved, time.perf_counter() - started)
269
+ _MODEL_CACHE[cache_key] = (tokenizer, model)
270
  return tokenizer, model
271
 
272
 
+def _vocab_block(name: str, vocab: dict[str, str]) -> str:
+    return f"{name}:\n" + "\n".join(f"- {key}: {meaning}" for key, meaning in vocab.items())


+def build_analysis_prompt(
+    numbered_narrative: str, *, agent_type: str = "unknown", codebook_hint: str = ""
+) -> str:
+    narrative = numbered_narrative[:16000]
+    vocab = "\n\n".join(
+        [
+            _vocab_block("difficulty_type", DIFFICULTY_TYPES),
+            _vocab_block("appraisal", APPRAISALS),
+            _vocab_block("detour_type", DETOUR_TYPES),
+            _vocab_block("resolution_mode", RESOLUTION_MODES),
+            _vocab_block("recovery_pattern", RECOVERY_PATTERNS),
+            _vocab_block("outcome_claim", OUTCOME_CLAIMS),
+        ]
+    )
+    return f"""Read the agent's visible narrative and produce a structured field report as JSON.
+
+Identify the real DIFFICULTY EPISODES — moments where the agent hit a snag, reassessed,
+detoured, recovered, or claimed completion. Ignore instructions, skill files, prompts,
+or boilerplate the agent merely read or quoted; those are NOT difficulties. Merge
+duplicates. Prefer 1-8 substantive episodes; if there is genuinely no difficulty,
+return an empty episodes list.
+
+Return ONE JSON object (first character {{ and last character }}), no prose, EXACTLY:
+{{
+  "verdict": {{
+    "tone": one of ["stable","iterative","detour","partial","risk","unknown"],
+    "headline": "<= 12 words, plain language",
+    "detail": "2-4 sentences a developer can act on",
+    "honesty": one of ["candid","mixed","overclaimed"]
+  }},
+  "overall_patterns": {{
+    "difficulty_style": "1 sentence", "detour_style": "1 sentence",
+    "recovery_style": "1 sentence", "risk_or_caveat": "1 sentence"
+  }},
+  "episodes": [
+    {{
+      "start_index": <a message index shown below>,
+      "end_index": <a message index shown below>,
+      "title": "<= 10 words",
+      "initial_intention": "1 sentence", "reported_difficulty": "1-2 sentences",
+      "difficulty_type": "<one key below>", "appraisal": "<one key below>",
+      "strategy_before": "1 sentence", "strategy_after": "1 sentence",
+      "detour_type": "<one key below>", "resolution_mode": "<one key below>",
+      "recovery_pattern": "<one key below>", "outcome_claim": "<one key below>",
+      "productive_detour": one of ["yes","no","mixed","unknown"],
+      "evidence_quotes": ["short verbatim quote", "up to 3"],
+      "analyst_memo": "1-3 sentences of real insight, NOT a restatement of the codes"
+    }}
+  ]
+}}
+
+Controlled vocabulary (use these keys exactly):
+{vocab}
+
+Guidance:
+- Every field must contain real content drawn from the trace. NEVER output a
+  placeholder such as "<= 10 words", "1 sentence", or "<one key below>" literally.
+- difficulty_type, appraisal, detour_type, resolution_mode, recovery_pattern, and
+  outcome_claim must each be EXACTLY one key from the vocabulary above (lowercase,
+  with underscores). If unsure, use "unknown".
+- Be accurate, not generous. If the agent ended unresolved or overclaimed, say so in tone/honesty.
+- honesty = "overclaimed" when a success claim outruns the visible evidence.
+- start_index / end_index must be message indices that appear below.
+- Quote the agent's own words; keep the original language of the quote.
+- Do not include secrets or long tool dumps.
+
+Agent type: {agent_type}
+Rule-based pre-scan candidate spans (hints only — keep, drop, merge, or add freely): {codebook_hint or "(none)"}
+
+Numbered visible messages:
+{narrative}
 """


+def parse_analysis_json(content: str) -> dict[str, Any]:
+    """Validate the structural shape of the model's field report (codes coerced later)."""

+    parsed = _loads_lenient(content)
+    episodes = parsed.get("episodes")
+    if not isinstance(episodes, list):
+        raise ValueError("Model response did not include an 'episodes' list.")
+    parsed["episodes"] = [episode for episode in episodes if isinstance(episode, dict)]
+    if not isinstance(parsed.get("overall_patterns"), dict):
+        parsed["overall_patterns"] = {}
+    if not isinstance(parsed.get("verdict"), dict):
+        parsed["verdict"] = {}
     return parsed

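
Reviewer sketch: `parse_analysis_json` delegates to `_loads_lenient`, which is not shown in this hunk. The tests in this commit expect it to tolerate code fences, surrounding prose, and scratch braces inside a `<think>` block. One way to get that behavior, assuming the rule is "take the last top-level JSON object that decodes", is sketched below. The name `loads_lenient_sketch` is hypothetical and this is not the repository's implementation.

```python
import json
from typing import Any


def loads_lenient_sketch(content: str) -> dict[str, Any]:
    """Return the last top-level JSON object found anywhere in ``content``."""
    decoder = json.JSONDecoder()
    last: dict[str, Any] | None = None
    pos = 0
    while True:
        start = content.find("{", pos)
        if start == -1:
            break
        try:
            # raw_decode parses one JSON value starting at ``start`` and
            # returns it with the index just past its end.
            obj, end = decoder.raw_decode(content, start)
        except json.JSONDecodeError:
            pos = start + 1  # not JSON here; try the next brace
            continue
        if isinstance(obj, dict):
            last = obj
        pos = end  # skip past the decoded value, including nested objects
    if last is None:
        raise ValueError("No JSON object found in model response.")
    return last


raw = '<think>Draft {not json} {"draft": 1}</think>\n```json\n{"episodes": []}\n```'
report = loads_lenient_sketch(raw)
```

Scanning forward and keeping the last decodable object means a draft object emitted during reasoning never shadows the final answer.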
 
privacy_filter.py ADDED
@@ -0,0 +1,180 @@
+"""Optional model-based PII redaction using ``openai/privacy-filter``.
+
+The deterministic pipeline always runs regex redaction (:mod:`redaction`). On the
+Hugging Face Space GPU this module adds a second pass: a token-classification
+model (``openai/privacy-filter``) flags personal or sensitive spans that regex
+patterns miss — names, phone numbers, postal addresses, and the like — and masks
+them with typed placeholders.
+
+Heavy imports (``torch``/``transformers``) load lazily so the deterministic
+analyzer, the test suite, and local development keep working without GPU
+dependencies. If the model cannot be loaded, the caller falls back to regex-only
+redaction and records the reason in the privacy notes.
+"""
+
+from __future__ import annotations
+
+import functools
+import time
+from collections import Counter
+from typing import Any, Callable
+
+from model_runtime import resolve_device
+from profiling import get_logger
+from redaction import RedactionResult
+
+logger = get_logger()
+
+
+PRIVACY_MODEL_ID = "openai/privacy-filter"
+
+# Only mask spans the model is reasonably confident about.
+PRIVACY_MIN_SCORE = 0.5
+
+# Model entity group -> (placeholder written into the text, human label for notes).
+PII_TYPES: dict[str, tuple[str, str]] = {
+    "private_person": ("[REDACTED_NAME]", "personal name"),
+    "private_email": ("[REDACTED_EMAIL]", "email address"),
+    "private_phone": ("[REDACTED_PHONE]", "phone number"),
+    "private_address": ("[REDACTED_ADDRESS]", "postal address"),
+    "private_url": ("[REDACTED_URL]", "personal URL"),
+    "private_date": ("[REDACTED_DATE]", "personal date"),
+    "account_number": ("[REDACTED_ACCOUNT]", "account number"),
+    "secret": ("[REDACTED_SECRET]", "secret"),
+}
+
+# (texts) -> per-text list of {"start", "end", "label"} spans.
+DetectFn = Callable[[list[str]], list[list[dict[str, Any]]]]
+
+_PIPELINE_CACHE: dict[str, Any] = {}
+
+
+def redact_texts(
+    texts: list[str],
+    *,
+    detect: DetectFn | None = None,
+    device: str | None = None,
+) -> list[RedactionResult]:
+    """Detect and mask PII in each text, returning one result per input.
+
+    ``detect`` defaults to :func:`_local_detect` (the lazy model); tests inject a
+    stand-in so the masking logic runs without ``torch``. ``device`` forces the
+    compute device for the default detector (``cuda`` / ``mps`` / ``cpu``).
+    """
+
+    detector = detect or functools.partial(_local_detect, device=device)
+    spans_per_text = detector(texts)
+    return [_apply_spans(text, spans) for text, spans in zip(texts, spans_per_text)]
+
+
+def _merge_spans(text: str, spans: list[dict[str, Any]]) -> list[dict[str, Any]]:
+    """Drop malformed spans and merge same-label runs into clean, disjoint spans.
+
+    ``openai/privacy-filter`` uses BIOES tags, which the pipeline's IOB-oriented
+    "simple" aggregation can split into adjacent fragments of one entity (and a
+    leading separator can leave a one-character gap). Merging same-label spans
+    that overlap or sit within one character keeps each entity to a single
+    placeholder; a remaining different-label overlap is clipped to stay disjoint.
+    """
+
+    valid = [
+        span
+        for span in spans
+        if span.get("label") in PII_TYPES
+        and 0 <= int(span["start"]) < int(span["end"]) <= len(text)
+    ]
+    valid.sort(key=lambda span: (int(span["start"]), int(span["end"])))
+
+    merged: list[dict[str, Any]] = []
+    for span in valid:
+        start, end, label = int(span["start"]), int(span["end"]), span["label"]
+        if merged:
+            prev = merged[-1]
+            if label == prev["label"] and start <= prev["end"] + 1:
+                prev["end"] = max(prev["end"], end)
+                continue
+            if start < prev["end"]:  # different-label overlap: keep them disjoint
+                start = prev["end"]
+        if start >= end:
+            continue
+        merged.append({"start": start, "end": end, "label": label})
+    return merged
+
+
+def _apply_spans(text: str, spans: list[dict[str, Any]]) -> RedactionResult:
+    """Replace detected spans with typed placeholders, right-to-left."""
+
+    counts: Counter[str] = Counter()
+    redacted = text
+    for span in sorted(_merge_spans(text, spans), key=lambda span: span["start"], reverse=True):
+        placeholder, label = PII_TYPES[span["label"]]
+        redacted = redacted[: span["start"]] + placeholder + redacted[span["end"] :]
+        counts[label] += 1
+
+    notes = [f"{label}: {count}" for label, count in sorted(counts.items())]
+    return RedactionResult(text=redacted, notes=notes, count=sum(counts.values()))
+
+
+def _local_detect(texts: list[str], device: str | None = None) -> list[list[dict[str, Any]]]:
+    """Run ``openai/privacy-filter`` and return confident PII spans per text.
+
+    Imported lazily: ``transformers``/``torch`` only need to exist where the
+    model actually runs, never for the deterministic path, tests, or light local
+    development.
+    """
+
+    pipe = _load_pipeline(device=device)
+    started = time.perf_counter()
+    results: list[list[dict[str, Any]]] = []
+    for text in texts:
+        if not text.strip():
+            results.append([])
+            continue
+        entities = pipe(text)
+        spans = [
+            {
+                "start": int(entity["start"]),
+                "end": int(entity["end"]),
+                "label": entity["entity_group"],
+            }
+            for entity in entities
+            if entity.get("entity_group") in PII_TYPES
+            and entity.get("start") is not None
+            and entity.get("end") is not None
+            and float(entity.get("score", 1.0)) >= PRIVACY_MIN_SCORE
+        ]
+        results.append(spans)
+    detected = sum(len(spans) for spans in results)
+    logger.debug(
+        "privacy-filter scanned %d messages, %d raw spans in %.2fs",
+        len(texts),
+        detected,
+        time.perf_counter() - started,
+    )
+    return results
+
+
+def _load_pipeline(device: str | None = None) -> Any:
+    """Lazily build and cache the token-classification pipeline per device."""
+
+    resolved = resolve_device(device)
+    cached = _PIPELINE_CACHE.get(resolved)
+    if cached is not None:
+        return cached
+
+    from transformers import pipeline
+
+    # transformers pipeline device: 0 for cuda, "mps"/"cpu" otherwise.
+    pipe_device = 0 if resolved == "cuda" else resolved
+    started = time.perf_counter()
+    pipe = pipeline(
+        "token-classification",
+        model=PRIVACY_MODEL_ID,
+        aggregation_strategy="simple",
+        device=pipe_device,
+    )
+    logger.info(
+        "loaded %s on %s in %.1fs", PRIVACY_MODEL_ID, resolved, time.perf_counter() - started
+    )
+    _PIPELINE_CACHE[resolved] = pipe
+    return pipe
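
Reviewer sketch: `_apply_spans` in the file above replaces spans from right to left. The reason is that each replacement changes the string length, so working from the end keeps the offsets of the spans still to be processed valid. A minimal, self-contained illustration follows, assuming the spans are already disjoint and in range. The names `mask_right_to_left` and `PLACEHOLDERS` are hypothetical and exist only for this example.

```python
from typing import TypedDict


class Span(TypedDict):
    start: int
    end: int
    label: str


# Hypothetical placeholder table, used only by this sketch.
PLACEHOLDERS = {"name": "[NAME]", "phone": "[PHONE]"}


def mask_right_to_left(text: str, spans: list[Span]) -> str:
    """Replace each span with its placeholder, starting from the rightmost span."""
    for span in sorted(spans, key=lambda s: s["start"], reverse=True):
        # Everything left of span["start"] is untouched so far, which means the
        # offsets of the earlier spans still point at the right characters.
        text = text[: span["start"]] + PLACEHOLDERS[span["label"]] + text[span["end"] :]
    return text


sample = "Call Alice Smith at 555-1234 tomorrow."
name_at = sample.find("Alice Smith")
phone_at = sample.find("555-1234")
spans: list[Span] = [
    {"start": name_at, "end": name_at + len("Alice Smith"), "label": "name"},
    {"start": phone_at, "end": phone_at + len("555-1234"), "label": "phone"},
]
masked = mask_right_to_left(sample, spans)
```

Replacing left to right would need a running offset correction after every substitution; sorting in reverse avoids that bookkeeping.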
profiling.py ADDED
@@ -0,0 +1,125 @@
+"""Lightweight logging + profiling for the Trace Field Notes pipeline.
+
+Everything here writes to the standard logging system, never the UI. Set the log
+level with the ``TFN_LOG_LEVEL`` env var (default ``INFO``); use ``DEBUG`` for
+per-stage detail. Resource probes (process RSS, system memory, CPU, and
+GPU/MPS memory) are best-effort and degrade silently if a dependency is missing
+— so the deterministic path, the test suite, and local development never need
+``psutil`` or ``torch`` installed.
+"""
+
+from __future__ import annotations
+
+import logging
+import os
+import time
+from contextlib import contextmanager
+from typing import Any, Iterator
+
+
+def get_logger(name: str = "trace_field_notes") -> logging.Logger:
+    logger = logging.getLogger(name)
+    if not logger.handlers:
+        handler = logging.StreamHandler()
+        handler.setFormatter(
+            logging.Formatter("%(asctime)s [%(name)s] %(levelname)s %(message)s")
+        )
+        logger.addHandler(handler)
+        logger.setLevel(os.getenv("TFN_LOG_LEVEL", "INFO").upper())
+        logger.propagate = False
+    return logger
+
+
+logger = get_logger()
+
+
+def resource_snapshot() -> dict[str, Any]:
+    """Best-effort process + system resource probe. Never raises."""
+
+    snap: dict[str, Any] = {}
+    try:
+        import psutil
+
+        proc = psutil.Process()
+        snap["rss_mb"] = round(proc.memory_info().rss / 1024 / 1024, 1)
+        vm = psutil.virtual_memory()
+        snap["sys_mem_pct"] = vm.percent
+        snap["sys_mem_avail_mb"] = round(vm.available / 1024 / 1024, 1)
+        snap["cpu_pct"] = psutil.cpu_percent(interval=None)
+    except Exception:  # noqa: BLE001 - profiling must never break the request
+        pass
+    try:
+        import torch
+
+        if torch.cuda.is_available():
+            snap["accel"] = "cuda"
+            snap["accel_mem_mb"] = round(torch.cuda.memory_allocated() / 1024 / 1024, 1)
+        elif getattr(torch.backends, "mps", None) and torch.backends.mps.is_available():
+            snap["accel"] = "mps"
+            snap["accel_mem_mb"] = round(
+                torch.mps.current_allocated_memory() / 1024 / 1024, 1
+            )
+    except Exception:  # noqa: BLE001
+        pass
+    return snap
+
+
+def format_snapshot(snap: dict[str, Any]) -> str:
+    parts = []
+    if "rss_mb" in snap:
+        parts.append(f"rss={snap['rss_mb']}MB")
+    if "sys_mem_pct" in snap:
+        parts.append(f"sysmem={snap['sys_mem_pct']}%")
+    if "cpu_pct" in snap:
+        parts.append(f"cpu={snap['cpu_pct']}%")
+    if "accel_mem_mb" in snap:
+        parts.append(f"{snap.get('accel', 'accel')}={snap['accel_mem_mb']}MB")
+    return " ".join(parts) or "n/a"
+
+
+class Profiler:
+    """Accumulates per-stage timings + counts for one request and logs a summary."""
+
+    def __init__(self, label: str = "analyze") -> None:
+        self.label = label
+        self._t0 = time.perf_counter()
+        self.stages: list[tuple[str, float]] = []
+        self.meta: dict[str, Any] = {}
+
+    @contextmanager
+    def stage(self, name: str) -> Iterator[None]:
+        start = time.perf_counter()
+        logger.debug(
+            "%s: stage %r start | %s", self.label, name, format_snapshot(resource_snapshot())
+        )
+        try:
+            yield
+        finally:
+            dt = time.perf_counter() - start
+            self.stages.append((name, dt))
+            logger.debug("%s: stage %r done in %.3fs", self.label, name, dt)
+
+    def record(self, name: str, seconds: float) -> None:
+        """Record a stage duration measured by the caller (no context manager)."""
+
+        self.stages.append((name, seconds))
+        logger.debug("%s: stage %r done in %.3fs", self.label, name, seconds)
+
+    def mark(self, **kwargs: Any) -> None:
+        self.meta.update(kwargs)
+
+    def elapsed(self) -> float:
+        return time.perf_counter() - self._t0
+
+    def summary(self) -> None:
+        total = self.elapsed()
+        stage_str = ", ".join(f"{name}={dt * 1000:.0f}ms" for name, dt in self.stages)
+        meta_str = " ".join(f"{key}={value}" for key, value in self.meta.items())
+        logger.info(
+            "%s done in %.3fs | %s | stages: %s | %s",
+            self.label,
+            total,
+            meta_str or "-",
+            stage_str or "-",
+            format_snapshot(resource_snapshot()),
+        )
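
Reviewer sketch: `get_logger` passes the upper-cased `TFN_LOG_LEVEL` value straight to `Logger.setLevel`, which raises `ValueError` for a name it does not recognize, so a typo in the environment variable would fail at import time. If that is not intended, one possible guard is sketched below. `resolve_log_level` is a hypothetical helper, not part of this commit, and falling back to `INFO` is an assumption about the desired behavior.

```python
import logging
import os


def resolve_log_level(env_var: str = "TFN_LOG_LEVEL", default: int = logging.INFO) -> int:
    """Map the env var to a numeric logging level, falling back to ``default``."""
    name = os.getenv(env_var, "").strip().upper()
    # For a known level name, getLevelName returns its numeric value; for
    # anything else it returns a string such as "Level BOGUS".
    level = logging.getLevelName(name)
    return level if isinstance(level, int) else default


os.environ["TFN_LOG_LEVEL"] = "debug"
debug_level = resolve_log_level()

os.environ["TFN_LOG_LEVEL"] = "bogus"
fallback_level = resolve_log_level()
```

The result could then be passed to `logger.setLevel(...)` in place of the raw string.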
requirements.txt CHANGED
@@ -2,6 +2,7 @@ gradio>=6.16,<7
 huggingface_hub>=0.30
 spaces>=0.50
 torch>=2.4
-transformers>=4.57
+transformers>=5.6
 accelerate>=1.0
 einops>=0.8
+psutil>=5.9
schemas.py CHANGED
@@ -149,6 +149,7 @@ class AnalysisResult:
     engine: str = "deterministic-codebook"
     model_notes: list[str] = field(default_factory=list)
     model_memo: dict[str, Any] = field(default_factory=dict)
+    session_verdict: dict[str, Any] = field(default_factory=dict)

     def to_dict(self) -> dict[str, Any]:
         return {
@@ -163,4 +164,5 @@ class AnalysisResult:
             "engine": self.engine,
             "model_notes": self.model_notes,
             "model_memo": self.model_memo,
+            "session_verdict": self.session_verdict,
         }
tests/test_model_runtime.py CHANGED
@@ -14,16 +14,45 @@ from model_runtime import (
     QUICK_MODEL_ID,
     _chat_template_kwargs,
     _prepare_generation_inputs,
-    parse_model_json,
-    run_model_assist,
+    parse_analysis_json,
+    resolve_device,
+    run_model_analysis,
 )


-MEMO_JSON = {
-    "executive_memo": "The trace shows a visible upload-boundary correction.",
-    "detour_memo": "E01 narrows scope instead of changing the parser.",
-    "outcome_audit_memo": "The agent keeps a deployment caveat visible.",
-    "caveats": ["Model memo is based only on redacted narrative."],
+ANALYSIS_JSON = {
+    "verdict": {
+        "tone": "partial",
+        "headline": "Reroute landed with a caveat.",
+        "detail": "The agent caught a wrong assumption about the upload shape and narrowed the fix.",
+        "honesty": "candid",
+    },
+    "overall_patterns": {
+        "difficulty_style": "One localization snag.",
+        "detour_style": "A productive narrowing.",
+        "recovery_style": "Reflective.",
+        "risk_or_caveat": "Deployment path left unverified.",
+    },
+    "episodes": [
+        {
+            "start_index": 0,
+            "end_index": 3,
+            "title": "Upload boundary fix",
+            "initial_intention": "Inspect the failing upload path.",
+            "reported_difficulty": "The Gradio file object can arrive as a temporary path.",
+            "difficulty_type": "localization_difficulty",
+            "appraisal": "initial_hypothesis_wrong",
+            "strategy_before": "Fix the parser.",
+            "strategy_after": "Narrow the fix to the upload boundary.",
+            "detour_type": "scope_narrowing",
+            "resolution_mode": "defensive_handling",
+            "recovery_pattern": "reflective_recovery",
+            "outcome_claim": "resolved_with_caveat",
+            "productive_detour": "yes",
+            "evidence_quotes": ["my initial assumption about the upload shape was wrong"],
+            "analyst_memo": "The agent names the wrong assumption and picks the smaller change.",
+        }
+    ],
 }


@@ -37,7 +66,7 @@ class RecordingGenerator:
         self.calls.append(
             {"messages": messages, "model_id": model_id, "max_new_tokens": max_new_tokens}
         )
-        return json.dumps(MEMO_JSON)
+        return json.dumps(ANALYSIS_JSON)


 class FakeTensor:
@@ -57,46 +86,62 @@ class ModelRuntimeTests(unittest.TestCase):
         self.assertIn("NVIDIA Nemotron 3 Nano 30B-A3B", label)
         self.assertNotIn("small", label.lower())

-    def test_parse_model_json_validates_required_shape(self) -> None:
-        memo = parse_model_json(json.dumps(MEMO_JSON))
+    def test_minicpm_is_the_quick_engine(self) -> None:
+        self.assertEqual(MODEL_CHOICES["minicpm"]["model_id"], QUICK_MODEL_ID)
+        self.assertIn("MiniCPM5 1B", str(MODEL_CHOICES["minicpm"]["label"]))
+        self.assertNotIn("qwen", MODEL_CHOICES)

-        self.assertEqual(memo["executive_memo"], MEMO_JSON["executive_memo"])
-        self.assertEqual(memo["caveats"], MEMO_JSON["caveats"])
+    def test_minicpm_chat_template_disables_thinking(self) -> None:
+        self.assertEqual(_chat_template_kwargs(QUICK_MODEL_ID), {"enable_thinking": False})
+        self.assertEqual(_chat_template_kwargs(PRIMARY_MODEL_ID), {})
+
+    def test_resolve_device_honors_explicit_override(self) -> None:
+        self.assertEqual(resolve_device("cpu"), "cpu")
+        self.assertEqual(resolve_device("cuda"), "cuda")
+        self.assertEqual(resolve_device("mps"), "mps")
+
+    def test_parse_analysis_json_validates_shape(self) -> None:
+        parsed = parse_analysis_json(json.dumps(ANALYSIS_JSON))

-    def test_parse_model_json_recovers_from_code_fence(self) -> None:
-        memo = parse_model_json("```json\n" + json.dumps(MEMO_JSON) + "\n```")
+        self.assertEqual(len(parsed["episodes"]), 1)
+        self.assertEqual(parsed["verdict"]["tone"], "partial")

-        self.assertEqual(memo["detour_memo"], MEMO_JSON["detour_memo"])
+    def test_parse_analysis_json_recovers_from_code_fence(self) -> None:
+        parsed = parse_analysis_json("```json\n" + json.dumps(ANALYSIS_JSON) + "\n```")

-    def test_parse_model_json_extracts_object_from_prose(self) -> None:
-        raw = "Here is the analysis:\n" + json.dumps(MEMO_JSON) + "\nHope this helps."
-        memo = parse_model_json(raw)
+        self.assertEqual(parsed["episodes"][0]["difficulty_type"], "localization_difficulty")

-        self.assertEqual(memo["outcome_audit_memo"], MEMO_JSON["outcome_audit_memo"])
+    def test_parse_analysis_json_extracts_object_from_prose(self) -> None:
+        raw = "Here is the report:\n" + json.dumps(ANALYSIS_JSON) + "\nDone."
+        parsed = parse_analysis_json(raw)

-    def test_parse_model_json_uses_final_object_after_thinking_braces(self) -> None:
+        self.assertEqual(parsed["verdict"]["honesty"], "candid")
+
+    def test_parse_analysis_json_uses_final_object_after_thinking_braces(self) -> None:
         raw = (
             "<think>Draft {not json} and a scratch object "
             '{"draft": "ignore this"} before the final answer.</think>\n'
-            + json.dumps(MEMO_JSON)
+            + json.dumps(ANALYSIS_JSON)
         )
-        memo = parse_model_json(raw)
+        parsed = parse_analysis_json(raw)
+
+        self.assertEqual(len(parsed["episodes"]), 1)

-        self.assertEqual(memo["executive_memo"], MEMO_JSON["executive_memo"])
+    def test_parse_analysis_json_requires_episodes_list(self) -> None:
+        with self.assertRaises(ValueError):
+            parse_analysis_json(json.dumps({"verdict": {}, "overall_patterns": {}}))

-    def test_run_model_assist_uses_selected_model(self) -> None:
-        result, narrative = analyze_trace_file(Path("examples/sample_trace_redacted.jsonl"))
+    def test_run_model_analysis_uses_selected_model(self) -> None:
         generate = RecordingGenerator()

-        assist = run_model_assist(
+        produced = run_model_analysis(
             engine="nemotron",
-            result=result,
-            narrative_text=narrative,
+            numbered_narrative="[0] assistant 10:00: hello",
             generate=generate,
         )

-        self.assertEqual(assist.model_id, PRIMARY_MODEL_ID)
-        self.assertIn("upload-boundary", assist.memo["executive_memo"])
+        self.assertEqual(produced.model_id, PRIMARY_MODEL_ID)
+        self.assertEqual(len(produced.analysis["episodes"]), 1)
         self.assertEqual(generate.calls[0]["model_id"], PRIMARY_MODEL_ID)
         self.assertEqual(generate.calls[0]["max_new_tokens"], MODEL_MAX_NEW_TOKENS)

@@ -121,12 +166,6 @@ class ModelRuntimeTests(unittest.TestCase):
         self.assertEqual(generation_inputs["input_ids"], input_ids)
         self.assertEqual(generation_inputs["attention_mask"], attention_mask)
         self.assertEqual(prompt_tokens, 21)
-        self.assertEqual(input_ids.device, "cuda")
-        self.assertEqual(attention_mask.device, "cuda")
-
-    def test_qwen_chat_template_enables_thinking(self) -> None:
-        self.assertEqual(_chat_template_kwargs(QUICK_MODEL_ID), {"enable_thinking": True})
-        self.assertEqual(_chat_template_kwargs(PRIMARY_MODEL_ID), {})

     def test_analyzer_records_unknown_engine_note(self) -> None:
         result, _ = analyze_trace_file(
@@ -138,30 +177,61 @@ class ModelRuntimeTests(unittest.TestCase):
         self.assertIn("Unknown analysis engine", result.model_notes[0])

     def test_analyzer_model_error_note_avoids_double_period(self) -> None:
-        with patch("analyzer.run_model_assist", side_effect=ValueError("model unavailable.")):
+        with patch("analyzer.run_model_analysis", side_effect=ValueError("model unavailable.")):
             result, _ = analyze_trace_file(
                 Path("examples/sample_trace_redacted.jsonl"),
-                analysis_engine="qwen",
+                analysis_engine="minicpm",
             )

         self.assertTrue(result.model_notes)
         self.assertNotIn("..", result.model_notes[0])
         self.assertIn("ValueError: model unavailable.", result.model_notes[0])

-    def test_analyzer_records_model_engine_on_success(self) -> None:
-        with patch("analyzer.run_model_assist") as run_model_assist:
-            run_model_assist.return_value = types.SimpleNamespace(
+    def test_analyzer_replaces_analysis_on_model_success(self) -> None:
+        with patch("analyzer.run_model_analysis") as run:
+            run.return_value = types.SimpleNamespace(
                 model_id=PRIMARY_MODEL_ID,
-                memo=dict(MEMO_JSON),
-                note="ok",
+                analysis=dict(ANALYSIS_JSON),
+                note=f"Analysis produced by {PRIMARY_MODEL_ID}.",
             )
             result, _ = analyze_trace_file(
                 Path("examples/sample_trace_redacted.jsonl"),
                 analysis_engine="nemotron",
             )

-        self.assertIn(PRIMARY_MODEL_ID, result.engine)
-        self.assertNotIn("token", run_model_assist.call_args.kwargs)
+        self.assertEqual(result.engine, PRIMARY_MODEL_ID)
+        self.assertEqual(result.session_verdict["tone"], "partial")
+        self.assertEqual(result.episodes[0].episode_id, "E01")
+        self.assertEqual(result.episodes[0].difficulty_type, "localization_difficulty")
+
+    def test_analyzer_strips_placeholder_echoes(self) -> None:
+        bad = {
+            "verdict": {"tone": "stable", "headline": "<= 12 words", "detail": "2-4 sentences", "honesty": "candid"},
+            "overall_patterns": {},
+            "episodes": [
+                {
+                    "start_index": 0,
+                    "end_index": 0,
+                    "title": "<= 10 words",
+                    "reported_difficulty": "The build failed.",
+                    "difficulty_type": "environment_blocker",
+                    "analyst_memo": "1-3 sentences",
+                    "evidence_quotes": ["short verbatim quote", "the build failed"],
+                    "outcome_claim": "not_resolved",
+                }
+            ],
+        }
+        with patch("analyzer.run_model_analysis") as run:
+            run.return_value = types.SimpleNamespace(model_id=QUICK_MODEL_ID, analysis=bad, note="ok")
+            result, _ = analyze_trace_file(
+                Path("examples/sample_trace_redacted.jsonl"), analysis_engine="minicpm"
+            )
+
+        episode = result.episodes[0]
+        self.assertEqual(episode.title, "The build failed.")  # placeholder -> reported_difficulty
+        self.assertEqual(episode.analyst_memo, "")  # "1-3 sentences" stripped
+        self.assertEqual(episode.evidence_quotes, ["the build failed"])  # placeholder quote dropped
+        self.assertNotIn("<", result.session_verdict["headline"])


 if __name__ == "__main__":
tests/test_privacy_filter.py ADDED
@@ -0,0 +1,179 @@
+from __future__ import annotations
+
+import unittest
+from pathlib import Path
+
+from analyzer import stream_deterministic_analysis
+from privacy_filter import PII_TYPES, redact_texts
+from redaction import RedactionResult
+
+
+def fake_detect(texts: list[str]) -> list[list[dict]]:
+    """Stand-in detector: flags "Alice Smith" and "555-1234" without torch."""
+
+    results = []
+    for text in texts:
+        spans = []
+        person = text.find("Alice Smith")
+        if person != -1:
+            spans.append({"start": person, "end": person + len("Alice Smith"), "label": "private_person"})
+        phone = text.find("555-1234")
+        if phone != -1:
+            spans.append({"start": phone, "end": phone + len("555-1234"), "label": "private_phone"})
+        results.append(spans)
+    return results
+
+
+def _drain(stream):
+    result = None
+    for kind, payload in stream:
+        if kind == "result":
+            result = payload[0]
+    assert result is not None
+    return result
+
+
+class PrivacyFilterMaskingTests(unittest.TestCase):
+    def test_redact_texts_masks_detected_spans(self) -> None:
+        texts = ["Call Alice Smith at 555-1234 tomorrow.", "no pii here"]
+
+        results = redact_texts(texts, detect=fake_detect)
+
+        self.assertIsInstance(results[0], RedactionResult)
+        self.assertNotIn("Alice Smith", results[0].text)
+        self.assertNotIn("555-1234", results[0].text)
+        self.assertIn(PII_TYPES["private_person"][0], results[0].text)
+        self.assertIn(PII_TYPES["private_phone"][0], results[0].text)
+        self.assertEqual(results[0].count, 2)
+        self.assertEqual(results[1].count, 0)
+        self.assertEqual(results[1].text, "no pii here")
+
+    def test_notes_are_human_readable(self) -> None:
+        results = redact_texts(["Alice Smith"], detect=fake_detect)
+
+        self.assertIn("personal name: 1", results[0].notes)
+
+    def test_malformed_and_overlapping_spans_are_skipped(self) -> None:
+        def detect(texts: list[str]) -> list[list[dict]]:
+            return [
+                [
+                    {"start": 0, "end": 999, "label": "secret"},  # out of range
+                    {"start": 2, "end": 2, "label": "secret"},  # zero width
+                ]
+            ]
+
+        results = redact_texts(["abc"], detect=detect)
+
+        self.assertEqual(results[0].text, "abc")
+        self.assertEqual(results[0].count, 0)
+
+    def test_unknown_labels_are_ignored(self) -> None:
+        def detect(texts: list[str]) -> list[list[dict]]:
+            return [[{"start": 0, "end": 3, "label": "not_a_pii_type"}]]
+
+        results = redact_texts(["abc"], detect=detect)
+
+        self.assertEqual(results[0].text, "abc")
+        self.assertEqual(results[0].count, 0)
+
+     def test_bioes_fragments_merge_into_one_placeholder(self) -> None:
+         # The real model fragments "Alice Smith" into touching same-label spans
+         # ("Alice" + " Smith"); they must collapse to a single placeholder.
+         def detect(texts: list[str]) -> list[list[dict]]:
+             return [
+                 [
+                     {"start": 0, "end": 5, "label": "private_person"},  # Alice
+                     {"start": 5, "end": 11, "label": "private_person"},  # " Smith"
+                 ]
+             ]
+
+         results = redact_texts(["Alice Smith calls"], detect=detect)
+
+         self.assertEqual(results[0].text.count("[REDACTED_NAME]"), 1)
+         self.assertEqual(results[0].count, 1)
+         self.assertEqual(results[0].text, "[REDACTED_NAME] calls")
+
+     def test_same_label_spans_with_one_char_gap_merge(self) -> None:
+         def detect(texts: list[str]) -> list[list[dict]]:
+             return [
+                 [
+                     {"start": 0, "end": 5, "label": "private_person"},  # Alice
+                     {"start": 6, "end": 11, "label": "private_person"},  # Smith (gap = space)
+                 ]
+             ]
+
+         results = redact_texts(["Alice Smith"], detect=detect)
+
+         self.assertEqual(results[0].count, 1)
+
+     def test_different_label_adjacent_spans_stay_separate(self) -> None:
+         def detect(texts: list[str]) -> list[list[dict]]:
+             return [
+                 [
+                     {"start": 0, "end": 5, "label": "private_person"},
+                     {"start": 6, "end": 14, "label": "private_phone"},
+                 ]
+             ]
+
+         results = redact_texts(["Alice 555-1234"], detect=detect)
+
+         self.assertEqual(results[0].count, 2)
+         self.assertIn(PII_TYPES["private_person"][0], results[0].text)
+         self.assertIn(PII_TYPES["private_phone"][0], results[0].text)
+
+
+ class StreamRedactionIntegrationTests(unittest.TestCase):
+     SAMPLE = Path("examples/sample_trace_redacted.jsonl")
+
+     def test_stream_records_ai_privacy_note_when_model_runs(self) -> None:
+         def passthrough(texts: list[str]) -> list[RedactionResult]:
+             return [RedactionResult(text=text, notes=[], count=0) for text in texts]
+
+         result = _drain(stream_deterministic_analysis(self.SAMPLE, model_redact=passthrough))
+
+         self.assertTrue(any("AI privacy filter (openai/privacy-filter)" in note for note in result.privacy_notes))
+
+     def test_stream_falls_back_gracefully_when_model_unavailable(self) -> None:
+         def boom(texts: list[str]) -> list[RedactionResult]:
+             raise RuntimeError("no gpu here")
+
+         result = _drain(stream_deterministic_analysis(self.SAMPLE, model_redact=boom))
+
+         self.assertTrue(any("AI privacy filter was unavailable" in note for note in result.privacy_notes))
+         # Regex redaction still ran on the sample (it embeds an email + token).
+         self.assertGreater(result.redaction_count, 0)
+
+     def test_redact_progress_streams_per_chunk(self) -> None:
+         events = [
+             payload
+             for kind, payload in stream_deterministic_analysis(
+                 self.SAMPLE, stream_redact_progress=True
+             )
+             if kind == "progress" and payload.get("stage") == "redact"
+         ]
+
+         # 4-message sample -> chunk size 1 -> one redact event per message.
+         self.assertGreaterEqual(len(events), 2)
+         processed = [event["processed"] for event in events]
+         self.assertEqual(processed, sorted(processed))  # monotonically advancing
+         self.assertEqual(events[-1]["processed"], events[-1]["total"])  # finishes at total
+         self.assertTrue(all(event["total"] == events[0]["total"] for event in events))
+
+     def test_model_redaction_count_adds_to_regex_count(self) -> None:
+         def mask_first_word(texts: list[str]) -> list[RedactionResult]:
+             out = []
+             for text in texts:
+                 if text:
+                     out.append(RedactionResult(text="[REDACTED_NAME]" + text, notes=["personal name: 1"], count=1))
+                 else:
+                     out.append(RedactionResult(text=text, notes=[], count=0))
+             return out
+
+         regex_only = _drain(stream_deterministic_analysis(self.SAMPLE))
+         combined = _drain(stream_deterministic_analysis(self.SAMPLE, model_redact=mask_first_word))
+
+         self.assertGreater(combined.redaction_count, regex_only.redaction_count)
+
+
+ if __name__ == "__main__":
+     unittest.main()
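The masking tests above pin down a span-handling contract: out-of-range, zero-width, and unknown-label spans are dropped, same-label spans that touch or sit one character apart collapse into one placeholder, and adjacent spans with different labels stay separate. The following is a minimal sketch of logic that would satisfy that contract, not the code in `privacy_filter.py`. The `PII_TYPES` table layout, the `[REDACTED_PHONE]` placeholder, the `"phone number"` description, the `merge_spans` helper, and the `max_gap` parameter are all assumptions made for illustration; only `[REDACTED_NAME]` and the `"personal name: 1"` note format come from the tests.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical label table: label -> (placeholder, human-readable description).
# The real PII_TYPES in privacy_filter.py may have a different shape and entries.
PII_TYPES = {
    "private_person": ("[REDACTED_NAME]", "personal name"),
    "private_phone": ("[REDACTED_PHONE]", "phone number"),
}


@dataclass
class RedactionResult:
    text: str
    notes: list[str] = field(default_factory=list)
    count: int = 0


Detector = Callable[[list[str]], list[list[dict]]]


def merge_spans(text: str, spans: list[dict], max_gap: int = 1) -> list[dict]:
    """Drop invalid spans, then merge same-label spans at most max_gap apart."""
    valid = sorted(
        (
            s
            for s in spans
            if s.get("label") in PII_TYPES
            and 0 <= s["start"] < s["end"] <= len(text)
        ),
        key=lambda s: (s["start"], s["end"]),
    )
    merged: list[dict] = []
    for span in valid:
        last = merged[-1] if merged else None
        if last and span["label"] == last["label"] and span["start"] - last["end"] <= max_gap:
            # Touching, overlapping, or near-adjacent fragment of the same entity.
            last["end"] = max(last["end"], span["end"])
        elif last and span["start"] < last["end"]:
            # Overlaps the previous span but carries a different label: skip it.
            continue
        else:
            merged.append(dict(span))
    return merged


def redact_texts(texts: list[str], detect: Detector) -> list[RedactionResult]:
    results = []
    for text, spans in zip(texts, detect(texts)):
        merged = merge_spans(text, spans)
        counts: dict[str, int] = {}
        # Replace right to left so earlier offsets remain valid.
        for span in reversed(merged):
            placeholder, description = PII_TYPES[span["label"]]
            text = text[: span["start"]] + placeholder + text[span["end"] :]
            counts[description] = counts.get(description, 0) + 1
        notes = [f"{desc}: {n}" for desc, n in sorted(counts.items())]
        results.append(RedactionResult(text=text, notes=notes, count=len(merged)))
    return results
```

Merging before counting is what keeps `count` at 1 for a name the detector splits into fragments. Note that this sketch merges across any single-character gap, whereas the test only exercises a space.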
tests/test_profiling.py ADDED
@@ -0,0 +1,37 @@
+ from __future__ import annotations
+
+ import unittest
+
+ from profiling import Profiler, format_snapshot, resource_snapshot
+
+
+ class ProfilingTests(unittest.TestCase):
+     def test_resource_snapshot_never_raises_and_returns_dict(self) -> None:
+         snap = resource_snapshot()
+         self.assertIsInstance(snap, dict)
+
+     def test_format_snapshot_is_string(self) -> None:
+         self.assertIsInstance(format_snapshot(resource_snapshot()), str)
+         self.assertEqual(format_snapshot({}), "n/a")
+
+     def test_profiler_records_stages_meta_and_summarizes(self) -> None:
+         prof = Profiler("test")
+         prof.record("extract", 0.012)
+         prof.record("redact", 0.034)
+         prof.mark(messages=4, engine="deterministic")
+
+         self.assertEqual([name for name, _ in prof.stages], ["extract", "redact"])
+         self.assertEqual(prof.meta["messages"], 4)
+         self.assertGreaterEqual(prof.elapsed(), 0.0)
+         prof.summary()  # must not raise
+
+     def test_stage_context_manager_records_duration(self) -> None:
+         prof = Profiler("test")
+         with prof.stage("chart"):
+             pass
+         self.assertEqual(prof.stages[-1][0], "chart")
+         self.assertGreaterEqual(prof.stages[-1][1], 0.0)
+
+
+ if __name__ == "__main__":
+     unittest.main()
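The profiling tests only exercise the public surface of `profiling.py`: `Profiler.record`, `mark`, `elapsed`, `summary`, the `stage` context manager, and the two snapshot helpers. A rough sketch of a module that would pass them is shown here, under the assumption that stages are stored as `(name, seconds)` tuples and metadata as a plain dict (both implied by the assertions). The snapshot keys, the use of the Unix-only `resource` module, and the string returned by `summary()` are guesses; the tests require only that `summary()` does not raise and that `format_snapshot({})` returns `"n/a"`.

```python
import time
from contextlib import contextmanager


def resource_snapshot() -> dict:
    """Best-effort process stats; returns {} instead of raising."""
    try:
        import resource  # Unix-only; absent on Windows

        usage = resource.getrusage(resource.RUSAGE_SELF)
        # ru_maxrss is kilobytes on Linux and bytes on macOS.
        return {"max_rss": usage.ru_maxrss, "cpu_user_s": usage.ru_utime}
    except Exception:
        return {}


def format_snapshot(snapshot: dict) -> str:
    if not snapshot:
        return "n/a"
    return ", ".join(f"{key}={value}" for key, value in snapshot.items())


class Profiler:
    def __init__(self, name: str) -> None:
        self.name = name
        self.stages: list[tuple[str, float]] = []
        self.meta: dict = {}
        self._started = time.perf_counter()

    def record(self, stage: str, seconds: float) -> None:
        self.stages.append((stage, seconds))

    def mark(self, **meta) -> None:
        self.meta.update(meta)

    def elapsed(self) -> float:
        return time.perf_counter() - self._started

    @contextmanager
    def stage(self, name: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            # Record the duration even if the wrapped block raises.
            self.record(name, time.perf_counter() - start)

    def summary(self) -> str:
        parts = [f"{name}={seconds * 1000:.1f}ms" for name, seconds in self.stages]
        return f"[{self.name}] " + " ".join(parts) + f" total={self.elapsed():.3f}s"
```

Recording inside `finally` means a failing stage still shows up in the summary, which is usually the stage worth inspecting.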
view_model.py CHANGED
@@ -71,7 +71,7 @@ def build_view_model(
          "narrative_message_count": base["narrative_message_count"],
          "redaction_count": base["redaction_count"],
          "duration_total": _duration_total(raw_episodes),
-         "verdict": _verdict(episodes, base["overall_patterns"], result.model_memo),
+         "verdict": base.get("session_verdict") or _verdict(episodes, base["overall_patterns"], result.model_memo),
          "overall_patterns": base["overall_patterns"],
          "privacy_notes": list(base["privacy_notes"]) + list(base.get("model_notes") or []),
          "episodes": episodes,