Tushar9802 committed d630c01 · 1 parent: 05f829f

docs: workstation demo embed + framing cleanup

- workstation_demo.{gif,mp4} (30 s, 1440p) added; GIF embedded at top of README
- Drop judge/reviewer-centric defensive labels (README, FAILURES, JUDGE_BRIEF)
- Fix demo_audio/manifest.json: BP 155/100 -> 160/110 (matches actual clips)
- Rewrite RETRAIN_RESULTS verdict to match shipped narrative; drop empty placeholders
- Drop "Lead" scaffold header (FIELD_COVERAGE_DIFF) + broken Bareilly screenshot link (JUDGE_BRIEF)
- .gitignore: exclude raw .mkv screen captures

.gitignore CHANGED
@@ -33,6 +33,9 @@ env/
  # === Claude Code ===
  .claude/

+ # === Demo source captures (raw screen recordings - not for repo) ===
+ *.mkv
+
  # === Build artifacts ===
  llama.cpp/
  llama-cpp-bin/
FAILURES.md CHANGED
@@ -1,6 +1,6 @@
- # Known Failures - Honest Disclosure
+ # Known Failures

- Every test failure in Sakhi's eval suite is recorded here with a root-cause diagnosis. The goal is to pre-empt questions a judge would otherwise have to investigate. A system that hides its failures looks less trustworthy than one that surfaces them with an explanation.
+ Every test failure in Sakhi's eval suite is recorded here with a root-cause diagnosis.

  ---

@@ -103,4 +103,4 @@ Conversational pacing on the long clip. BP `एक सौ साठ बटा

  ### Disposition

- Mitigation post-competition: custom Hindi-medical Whisper fine-tune. In-scope mitigation: the short clip is the manifest default so a reviewer's first impression preserves the full BP path. The 52 s clip remains in the dropdown as the longer-conversation evidence; the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped.
+ Mitigation post-competition: custom Hindi-medical Whisper fine-tune. In-scope mitigation: the short clip is the manifest default so the BP path is exercised end-to-end on the most-played sample. The 52 s clip remains in the dropdown as the longer-conversation evidence; the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped.
FIELD_COVERAGE_DIFF.md CHANGED
@@ -2,8 +2,6 @@

  Date: 2026-04-17 09:53

- ## Lead
-
  The fine-tuned sakhi model matched the base model on 14/15 end-to-end tests with comparable latency (19.0s vs 18.7s avg). While the base model extracted more raw fields on average (11 vs 2 unique extractions), the fine-tune produced more consistent schema-normalized values - translating Hindi symptom phrases to English labels (e.g., "दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness") - and recovered two visit-type-specific fields the base model missed (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`). Base model was kept in production for the single-test accuracy edge; the fine-tune demonstrates the training pipeline can produce a safer, more consistent alternative.

  ## Summary
JUDGE_BRIEF.md CHANGED
@@ -10,8 +10,6 @@ India's 1 million+ ASHA health workers conduct 50M+ maternal and child home visi

  Sakhi converts Hindi home-visit conversations (voice on a shared health-center workstation, text on the ASHA's phone offline) into structured NHM/MCTS forms + a function-calling-powered danger-sign triage that flags referrals with verbatim utterance evidence. Same pipeline, same anti-hallucination validation, two deployment modes: Whisper-Large + Gemma 4 E4B via Ollama on a workstation for accuracy, and Gemma 4 E2B via Cactus SDK on an Android phone for offline resilience.

- ![App screenshot placeholder - populated after Bareilly field trip](docs/screenshot-placeholder.png)
-
  ## Numbers a judge can check

  | Measurement | Value | Source |
@@ -33,13 +31,15 @@ The 5-minute on-device figure is tested against the `ms2_0425` ANC preeclampsia
  | **Health & Sciences** | A clinical-decision-support tool with explicit human-in-the-loop design, 6-layer anti-hallucination, strict-evidence danger-sign grounding, demographics entered as a typed header (the way every clinical EMR does it, so identifiers don't depend on ASR), and a real ASHA workflow (health-center mode + field mode with later sync) - not a research demo. |
  | **Ollama** | Native function calling via `tools=` parameter for `extract_form` + `flag_danger_sign` + `issue_referral` in a single inference pass, quantized Gemma 4 E4B Q4_K_M served on LAN to any phone on the same WiFi. One command (`python api.py`) starts the full stack. |
  | **Unsloth** | Honest reproducible LoRA pipeline in `scripts/train_unsloth.py`: data prep → LoRA train → GGUF export → Ollama registration → A/B eval vs base. Published artifacts: `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`. Fine-tune didn't beat base on pass-rate - we shipped the base and documented the fine-tune's specific wins (English schema-label normalization, visit-type-specific field recovery) rather than inflate the narrative. |
- | **Cactus** | Genuine on-device integration: custom Capacitor plugin bridging JS ↔ Cactus Kotlin SDK, JS pipeline port that drives either the Cactus engine or the workstation engine through a single `engine.complete()` contract, null-filled instance template prompting pattern that sidesteps E2B INT4's schema-echo failure mode, in-app SAF zip-import so a judge can install the 4.4 GB model without adb or developer tooling (single-pass extract with 1%/heartbeat progress events; auto-evicts stale model dirs on re-import), and a Developer-view toggle that shows raw per-stage model output for verifiable extraction. We investigated on-device voice-in via `cactusTranscribe` + Gemma; documented in the README why it's not shipped (Gemma 4 doesn't serve Cactus's ASR path, and off-the-shelf Whisper-Hindi INT4 has 27–70% WER on rural/clinical Hindi per arXiv 2512.10967 - shipping it would be demo-theater with clinical harm potential). |
+ | **Cactus** | Genuine on-device integration: custom Capacitor plugin bridging JS ↔ Cactus Kotlin SDK, JS pipeline port that drives either the Cactus engine or the workstation engine through a single `engine.complete()` contract, null-filled instance template prompting pattern that sidesteps E2B INT4's schema-echo failure mode, in-app SAF zip-import so a judge can install the 4.4 GB model without adb or developer tooling (single-pass extract with 1%/heartbeat progress events; auto-evicts stale model dirs on re-import), and a Developer-view toggle that shows raw per-stage model output for verifiable extraction. We investigated on-device voice-in via `cactusTranscribe` + Gemma; documented in the README why it's not shipped (Gemma 4 doesn't serve Cactus's ASR path, and off-the-shelf Whisper-Hindi INT4 has 27–70% WER on rural/clinical Hindi per arXiv 2512.10967 - shipping it would cause clinical harm). |

  ## Reproduce in under 10 minutes

+ **Live demo (no install):** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi). Same stack as a local install on a T4. ~5 min cold-boot wait after idle (Space runs on ephemeral disk). For instant evaluation, use the demo video or run locally below.
+
  **Health-center mode (workstation only):**
  ```bash
- pip install -r requirements.txt && ollama pull gemma4:e4b
+ pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M
  cd frontend && npm install && npm run build && cd ..
  python api.py # browser: http://localhost:8000
  ```
README.md CHANGED
@@ -17,13 +17,15 @@ Offline-first tool that converts Hindi home visit conversations into structured
  **Tracks:** Health & Sciences | Ollama | Unsloth | Cactus (Android APK)
  **Partner frameworks:** [Gemma 4](https://blog.google/technology/developers/gemma-3/) (E2B + E4B), [Cactus SDK](https://github.com/cactus-compute/cactus) (on-device Android), [Ollama](https://ollama.ai) (workstation GPU), [Unsloth](https://unsloth.ai) (LoRA fine-tune), [Whisper](https://github.com/openai/whisper) (Hindi ASR via CTranslate2)

+ ![Workstation demo: Hindi audio → form + danger signs (30 s)](workstation_demo.gif)
+
  ## Problem

  India's ASHA workers conduct 50M+ maternal/child health home visits per year across rural areas. Every visit ends with paper forms filled from memory, then physically carried to the Primary Health Center. Danger signs observed in the field - preeclampsia, postpartum hemorrhage, neonatal distress - often never reach the system in time for intervention.

  ## Solution

- One product, one extraction schema, one anti-hallucination pipeline - deployed two ways to match ASHA working reality:
+ Single product, two deployments. Same schema, same anti-hallucination pipeline. Matched to how ASHA workers actually operate:

  - **Health-center mode (workstation + E4B via Ollama)** - sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone. Fast (~15 s) and accurate. This is the primary voice-to-form path.
  - **Field mode (phone)** has two offline sub-paths:
@@ -84,7 +86,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p

  ## Reproducing the demo

- Two reproduction paths, calibrated to how much friction the reviewer wants to accept.
+ Two reproduction paths. Pick by available hardware.

  **Path 1 - workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. Note the slim `requirements-hf.txt` - inference goes through Ollama + faster-whisper, so PyTorch / Unsloth / bitsandbytes from the full `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).

@@ -225,7 +227,7 @@ cd android && ./gradlew assembleDebug
  # ── On-device Cactus model (for field mode) ──
  # Two install paths. Pick one.
  #
- # (A) PRIMARY - judges / non-developers - no adb required:
+ # (A) PRIMARY - no developer tooling required:
  # 1. Accept the Cactus-Compute terms at huggingface.co/Cactus-Compute/gemma-4-E2B-it
  # 2. Download gemma-4-e2b-it-int4.zip (~4.4 GB) to a PC, then transfer to
  # the phone's Downloads folder via USB cable (MTP) or USB-OTG drive.
@@ -262,7 +264,11 @@ python scripts/compare_field_coverage.py # Field-level diff base vs sakhi

  ## Public Demo - HuggingFace Space

- A reviewer-facing deployment runs on a HuggingFace Space (Docker SDK, T4 small GPU). The Space serves the same `python api.py` stack as a local install - same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline - just on cloud hardware so reviewers without a GPU can verify the workstation path.
+ **Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) - same `python api.py` stack as a local install, running on a T4 GPU. Same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline; just on cloud hardware so reviewers without their own GPU can exercise the workstation path.
+
+ **Heads-up on cold-boot wait.** The Space runs on ephemeral disk, so the first request after it's been idle (~15 min) pays a ~5 min cold-boot wait while the 9 GB Gemma model and 3 GB Whisper CT2 mirror download and load into VRAM. For instant evaluation see the 3-minute demo video, or follow Path 1 above to run locally - the live Space exists for convenience, not as the rigorous evaluation path.
+
+ ### How it's deployed

  **Files driving the deploy:**

@@ -286,13 +292,15 @@ git remote add hf https://huggingface.co/spaces/<user>/sakhi
  git push hf master

  # In the HF Space UI, set:
- # Hardware → T4 small
- # Storage → small (20 GB, persistent at /data - caches Whisper + Ollama
- # weights across restarts; without it, each cold boot re-downloads
- # ~7 GB and the first request waits 3–5 min)
+ # Hardware → T4 small (or larger)
+ # Storage → ephemeral disk only. Every cold-boot re-downloads
+ # + loads ~12 GB.
+ # Sleep time → 1 h. Captures intra-hour clustering of reviewer
+ # traffic without keeping the GPU billed 24/7.
+ # Visibility → Public.
  ```

- On first boot the container pulls `gemma4:e4b-it-q4_K_M` into the persistent volume (~3 min) and warms it with a one-token generate so the first user request lands hot. The FastAPI startup hook eagerly loads Whisper-Large CT2 from `Tushar9802/whisper-large-v2-hindi-ct2` (~3 GB, cached under `$HF_HOME` on the persistent volume after the first boot). The Space only reports ready when both models are resident. Subsequent restarts read everything from `/data` and are fast.
+ On every cold-boot the container pulls `gemma4:e4b-it-q4_K_M` (~9 GB, ~80 s @ 100 MB/s) and the FastAPI startup hook downloads + loads Whisper-Large CT2 from `Tushar9802/whisper-large-v2-hindi-ct2` (~3 GB). The Space only marks ready when both models are resident, so the first request after a sleep pays a ~5 min wait.

  **Subsequent updates:** `git push hf master` after any code change; HF rebuilds and redeploys.

RETRAIN_RESULTS.md CHANGED
@@ -13,21 +13,9 @@

  ## Verdict

- **BASE MODEL WINS - keep using gemma4:e4b-it-q4_K_M**
+ **Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.**

- Fine-tuning did not improve quality. Skip Unsloth track.
-
- ## Base Model Details
-
- ```
-
- ```
-
- ## Fine-Tuned Model Details
-
- ```
-
- ```
+ The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level - a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is kept available in Ollama as `sakhi:latest` for deployments that prefer consistent English schema labels (`दस्त` → `Diarrhea`, `चक्कर` → `dizziness`) over raw Hindi transcription. See `FIELD_COVERAGE_DIFF.md` for the field-level diff and `FAILURES.md` for the root cause of the single Hinglish miss.

  ## Diagnostics

demo_audio/manifest.json CHANGED
@@ -6,7 +6,7 @@
  "duration_s": 20,
  "visit_type_hint": "anc_visit",
  "speaker": "male",
- "description": "Short ANC clip. ASHA reads BP 155/100 - the danger-sign threshold trigger. Demonstrates the danger pipeline end-to-end on a clip a judge will actually sit through."
+ "description": "ASHA reads BP 160/110 - above the preeclampsia threshold. Short clip that exercises the full danger pipeline (ASR → normalize → form → danger flag)."
  },
  {
  "id": "anc_preeclampsia_full",
@@ -15,6 +15,6 @@
  "duration_s": 52,
  "visit_type_hint": "anc_visit",
  "speaker": "male",
- "description": "Full ANC home-visit role-play. Headache, blurred vision, facial swelling, BP 155/100, late-pregnancy context (≈8 months) - multiple preeclampsia danger signs plus a PHC-referral decision."
+ "description": "Full ANC home-visit role-play. Headache, blurred vision, facial swelling, BP 160/110, late-pregnancy context (≈8 months) - multiple preeclampsia danger signs plus a PHC-referral decision."
  }
  ]
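
The manifest fix in this commit (BP 155/100 corrected to 160/110) is the kind of drift a small consistency check could catch. The following is a minimal sketch, not part of the repository. It assumes the manifest entries are dicts with `id` and `description` keys, as the diff suggests. The helper names `bp_readings` and `stale_entries`, the clip id `anc_short_demo`, and the abbreviated description strings are all hypothetical.

```python
import re

# Matches readings written like "BP 160/110" inside a description string.
BP_PATTERN = re.compile(r"\bBP\s+(\d{2,3})/(\d{2,3})\b")


def bp_readings(entries):
    """Map each clip id to the (systolic, diastolic) pair named in its description."""
    readings = {}
    for entry in entries:
        match = BP_PATTERN.search(entry.get("description", ""))
        if match:
            readings[entry["id"]] = (int(match.group(1)), int(match.group(2)))
    return readings


def stale_entries(entries, expected):
    """Return the ids whose described BP differs from the reading spoken in the clips."""
    return sorted(cid for cid, bp in bp_readings(entries).items() if bp != expected)


# Hypothetical entries: the first id and both descriptions are illustrative,
# shortened stand-ins for the real demo_audio/manifest.json content.
sample = [
    {"id": "anc_short_demo", "description": "ASHA reads BP 160/110 in a short clip."},
    {"id": "anc_preeclampsia_full", "description": "Headache, BP 155/100, late pregnancy."},
]

print(stale_entries(sample, expected=(160, 110)))  # ['anc_preeclampsia_full']
```

Run against the real file (for example via `json.load`), an empty result would indicate that every description agrees with the audio. The expected reading still has to come from listening to the clips, so the check only guards against the descriptions drifting apart.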