Spaces:

Tushar9802
/

sakhi

Sleeping

Tushar9802 commited on 29 days ago

Commit

d630c01

1 Parent(s): 05f829f

docs: workstation demo embed + framing cleanup

- workstation_demo.{gif,mp4} (30 s, 1440p) added; GIF embedded at top of README
- Drop judge/reviewer-centric defensive labels (README, FAILURES, JUDGE_BRIEF)
- Fix demo_audio/manifest.json: BP 155/100 -> 160/110 (matches actual clips)
- Rewrite RETRAIN_RESULTS verdict to match shipped narrative; drop empty placeholders
- Drop "Lead" scaffold header (FIELD_COVERAGE_DIFF) + broken Bareilly screenshot link (JUDGE_BRIEF)
- .gitignore: exclude raw .mkv screen captures

Files changed (7) hide show

.gitignore +3 -0
FAILURES.md +3 -3
FIELD_COVERAGE_DIFF.md +0 -2
JUDGE_BRIEF.md +4 -4
README.md +17 -9
RETRAIN_RESULTS.md +2 -14
demo_audio/manifest.json +2 -2

.gitignore CHANGED Viewed

@@ -33,6 +33,9 @@ env/
 # === Claude Code ===
 .claude/
 # === Build artifacts ===
 llama.cpp/
 llama-cpp-bin/

 # === Claude Code ===
 .claude/
+# === Demo source captures (raw screen recordings — not for repo) ===
+*.mkv
 # === Build artifacts ===
 llama.cpp/
 llama-cpp-bin/

FAILURES.md CHANGED Viewed

@@ -1,6 +1,6 @@
-# Known Failures — Honest Disclosure
-Every test failure in Sakhi's eval suite is recorded here with a root-cause diagnosis. The goal is to pre-empt questions a judge would otherwise have to investigate. A system that hides its failures looks less trustworthy than one that surfaces them with an explanation.
 ---
@@ -103,4 +103,4 @@ Conversational pacing on the long clip. BP `एक सौ साठ बटा
 ### Disposition
-Mitigation post-competition: custom Hindi-medical Whisper fine-tune. In-scope mitigation: the short clip is the manifest default so a reviewer's first impression preserves the full BP path. The 52 s clip remains in the dropdown as the longer-conversation evidence; the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped.

+# Known Failures
+Every test failure in Sakhi's eval suite is recorded here with a root-cause diagnosis.
 ---
 ### Disposition
+Mitigation post-competition: custom Hindi-medical Whisper fine-tune. In-scope mitigation: the short clip is the manifest default so the BP path is exercised end-to-end on the most-played sample. The 52 s clip remains in the dropdown as the longer-conversation evidence; the danger panel still extracts severe-hypertension from the verbatim `"बहुत ज़्यादा है"` framing even when the number is dropped.

FIELD_COVERAGE_DIFF.md CHANGED Viewed

@@ -2,8 +2,6 @@
 Date: 2026-04-17 09:53
-## Lead
 The fine-tuned sakhi model matched the base model on 14/15 end-to-end tests with comparable latency (19.0s vs 18.7s avg). While the base model extracted more raw fields on average (11 vs 2 unique extractions), the fine-tune produced more consistent schema-normalized values — translating Hindi symptom phrases to English labels (e.g., "दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness") — and recovered two visit-type-specific fields the base model missed (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`). Base model was kept in production for the single-test accuracy edge; the fine-tune demonstrates the training pipeline can produce a safer, more consistent alternative.
 ## Summary

 Date: 2026-04-17 09:53
 The fine-tuned sakhi model matched the base model on 14/15 end-to-end tests with comparable latency (19.0s vs 18.7s avg). While the base model extracted more raw fields on average (11 vs 2 unique extractions), the fine-tune produced more consistent schema-normalized values — translating Hindi symptom phrases to English labels (e.g., "दस्त" → "Diarrhea", "चक्कर आ रहे हैं" → "dizziness") — and recovered two visit-type-specific fields the base model missed (`anc_details.facility_or_home`, `visit_info.hbyc_visit_month`). Base model was kept in production for the single-test accuracy edge; the fine-tune demonstrates the training pipeline can produce a safer, more consistent alternative.
 ## Summary

JUDGE_BRIEF.md CHANGED Viewed

@@ -10,8 +10,6 @@ India's 1 million+ ASHA health workers conduct 50M+ maternal and child home visi
 Sakhi converts Hindi home-visit conversations (voice on a shared health-center workstation, text on the ASHA's phone offline) into structured NHM/MCTS forms + a function-calling-powered danger-sign triage that flags referrals with verbatim utterance evidence. Same pipeline, same anti-hallucination validation, two deployment modes: Whisper-Large + Gemma 4 E4B via Ollama on a workstation for accuracy, and Gemma 4 E2B via Cactus SDK on an Android phone for offline resilience.
-![App screenshot placeholder — populated after Bareilly field trip](docs/screenshot-placeholder.png)
 ## Numbers a judge can check
 | Measurement | Value | Source |
@@ -33,13 +31,15 @@ The 5-minute on-device figure is tested against the `ms2_0425` ANC preeclampsia
 | **Health & Sciences** | A clinical-decision-support tool with explicit human-in-the-loop design, 6-layer anti-hallucination, strict-evidence danger-sign grounding, demographics entered as a typed header (the way every clinical EMR does it, so identifiers don't depend on ASR), and a real ASHA workflow (health-center mode + field mode with later sync) — not a research demo. |
 | **Ollama** | Native function calling via `tools=` parameter for `extract_form` + `flag_danger_sign` + `issue_referral` in a single inference pass, quantized Gemma 4 E4B Q4_K_M served on LAN to any phone on the same WiFi. One command (`python api.py`) starts the full stack. |
 | **Unsloth** | Honest reproducible LoRA pipeline in `scripts/train_unsloth.py`: data prep → LoRA train → GGUF export → Ollama registration → A/B eval vs base. Published artifacts: `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`. Fine-tune didn't beat base on pass-rate — we shipped the base and documented the fine-tune's specific wins (English schema-label normalization, visit-type-specific field recovery) rather than inflate the narrative. |
-| **Cactus** | Genuine on-device integration: custom Capacitor plugin bridging JS ↔ Cactus Kotlin SDK, JS pipeline port that drives either the Cactus engine or the workstation engine through a single `engine.complete()` contract, null-filled instance template prompting pattern that sidesteps E2B INT4's schema-echo failure mode, in-app SAF zip-import so a judge can install the 4.4 GB model without adb or developer tooling (single-pass extract with 1%/heartbeat progress events; auto-evicts stale model dirs on re-import), and a Developer-view toggle that shows raw per-stage model output for verifiable extraction. We investigated on-device voice-in via `cactusTranscribe` + Gemma; documented in the README why it's not shipped (Gemma 4 doesn't serve Cactus's ASR path, and off-the-shelf Whisper-Hindi INT4 has 27–70% WER on rural/clinical Hindi per arXiv 2512.10967 — shipping it would be demo-theater with clinical harm potential). |
 ## Reproduce in under 10 minutes
 **Health-center mode (workstation only):**
 ```bash
-pip install -r requirements.txt && ollama pull gemma4:e4b
 cd frontend && npm install && npm run build && cd ..
 python api.py        # browser: http://localhost:8000
 ```

 Sakhi converts Hindi home-visit conversations (voice on a shared health-center workstation, text on the ASHA's phone offline) into structured NHM/MCTS forms + a function-calling-powered danger-sign triage that flags referrals with verbatim utterance evidence. Same pipeline, same anti-hallucination validation, two deployment modes: Whisper-Large + Gemma 4 E4B via Ollama on a workstation for accuracy, and Gemma 4 E2B via Cactus SDK on an Android phone for offline resilience.
 ## Numbers a judge can check
 | Measurement | Value | Source |
 | **Health & Sciences** | A clinical-decision-support tool with explicit human-in-the-loop design, 6-layer anti-hallucination, strict-evidence danger-sign grounding, demographics entered as a typed header (the way every clinical EMR does it, so identifiers don't depend on ASR), and a real ASHA workflow (health-center mode + field mode with later sync) — not a research demo. |
 | **Ollama** | Native function calling via `tools=` parameter for `extract_form` + `flag_danger_sign` + `issue_referral` in a single inference pass, quantized Gemma 4 E4B Q4_K_M served on LAN to any phone on the same WiFi. One command (`python api.py`) starts the full stack. |
 | **Unsloth** | Honest reproducible LoRA pipeline in `scripts/train_unsloth.py`: data prep → LoRA train → GGUF export → Ollama registration → A/B eval vs base. Published artifacts: `RETRAIN_RESULTS.md`, `FIELD_COVERAGE_DIFF.md`. Fine-tune didn't beat base on pass-rate — we shipped the base and documented the fine-tune's specific wins (English schema-label normalization, visit-type-specific field recovery) rather than inflate the narrative. |
+| **Cactus** | Genuine on-device integration: custom Capacitor plugin bridging JS ↔ Cactus Kotlin SDK, JS pipeline port that drives either the Cactus engine or the workstation engine through a single `engine.complete()` contract, null-filled instance template prompting pattern that sidesteps E2B INT4's schema-echo failure mode, in-app SAF zip-import so a judge can install the 4.4 GB model without adb or developer tooling (single-pass extract with 1%/heartbeat progress events; auto-evicts stale model dirs on re-import), and a Developer-view toggle that shows raw per-stage model output for verifiable extraction. We investigated on-device voice-in via `cactusTranscribe` + Gemma; documented in the README why it's not shipped (Gemma 4 doesn't serve Cactus's ASR path, and off-the-shelf Whisper-Hindi INT4 has 27–70% WER on rural/clinical Hindi per arXiv 2512.10967 — shipping it would cause clinical harm). |
 ## Reproduce in under 10 minutes
+**Live demo (no install):** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi). Same stack as a local install on a T4. ~5 min cold-boot wait after idle (Space runs on ephemeral disk). For instant evaluation, use the demo video or run locally below.
 **Health-center mode (workstation only):**
 ```bash
+pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M
 cd frontend && npm install && npm run build && cd ..
 python api.py        # browser: http://localhost:8000
 ```

README.md CHANGED Viewed

@@ -17,13 +17,15 @@ Offline-first tool that converts Hindi home visit conversations into structured
 **Tracks:** Health & Sciences | Ollama | Unsloth | Cactus (Android APK)
 **Partner frameworks:** [Gemma 4](https://blog.google/technology/developers/gemma-3/) (E2B + E4B), [Cactus SDK](https://github.com/cactus-compute/cactus) (on-device Android), [Ollama](https://ollama.ai) (workstation GPU), [Unsloth](https://unsloth.ai) (LoRA fine-tune), [Whisper](https://github.com/openai/whisper) (Hindi ASR via CTranslate2)
 ## Problem
 India's ASHA workers conduct 50M+ maternal/child health home visits per year across rural areas. Every visit ends with paper forms filled from memory, then physically carried to the Primary Health Center. Danger signs observed in the field — preeclampsia, postpartum hemorrhage, neonatal distress — often never reach the system in time for intervention.
 ## Solution
-One product, one extraction schema, one anti-hallucination pipeline — deployed two ways to match ASHA working reality:
 - **Health-center mode (workstation + E4B via Ollama)** — sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone. Fast (~15 s) and accurate. This is the primary voice-to-form path.
 - **Field mode (phone)** has two offline sub-paths:
@@ -84,7 +86,7 @@ The pipeline uses a hybrid design: form extraction via `format="json"` (proven p
 ## Reproducing the demo
-Two reproduction paths, calibrated to how much friction the reviewer wants to accept.
 **Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. Note the slim `requirements-hf.txt` — inference goes through Ollama + faster-whisper, so PyTorch / Unsloth / bitsandbytes from the full `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).
@@ -225,7 +227,7 @@ cd android && ./gradlew assembleDebug
 # ── On-device Cactus model (for field mode) ──
 # Two install paths. Pick one.
 #
-# (A) PRIMARY — judges / non-developers — no adb required:
 #   1. Accept the Cactus-Compute terms at huggingface.co/Cactus-Compute/gemma-4-E2B-it
 #   2. Download gemma-4-e2b-it-int4.zip (~4.4 GB) to a PC, then transfer to
 #      the phone's Downloads folder via USB cable (MTP) or USB-OTG drive.
@@ -262,7 +264,11 @@ python scripts/compare_field_coverage.py        # Field-level diff base vs sakhi
 ## Public Demo — HuggingFace Space
-A reviewer-facing deployment runs on a HuggingFace Space (Docker SDK, T4 small GPU). The Space serves the same `python api.py` stack as a local install — same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline — just on cloud hardware so reviewers without a GPU can verify the workstation path.
 **Files driving the deploy:**
@@ -286,13 +292,15 @@ git remote add hf https://huggingface.co/spaces/<user>/sakhi
 git push hf master
 # In the HF Space UI, set:
-#   Hardware  → T4 small
-#   Storage   → small (20 GB, persistent at /data — caches Whisper + Ollama
-#                weights across restarts; without it, each cold boot re-downloads
-#                ~7 GB and the first request waits 3–5 min)
 ```
-On first boot the container pulls `gemma4:e4b-it-q4_K_M` into the persistent volume (~3 min) and warms it with a one-token generate so the first user request lands hot. The FastAPI startup hook eagerly loads Whisper-Large CT2 from `Tushar9802/whisper-large-v2-hindi-ct2` (~3 GB, cached under `$HF_HOME` on the persistent volume after the first boot). The Space only reports ready when both models are resident. Subsequent restarts read everything from `/data` and are fast.
 **Subsequent updates:** `git push hf master` after any code change; HF rebuilds and redeploys.

 **Tracks:** Health & Sciences | Ollama | Unsloth | Cactus (Android APK)
 **Partner frameworks:** [Gemma 4](https://blog.google/technology/developers/gemma-3/) (E2B + E4B), [Cactus SDK](https://github.com/cactus-compute/cactus) (on-device Android), [Ollama](https://ollama.ai) (workstation GPU), [Unsloth](https://unsloth.ai) (LoRA fine-tune), [Whisper](https://github.com/openai/whisper) (Hindi ASR via CTranslate2)
+![Workstation demo: Hindi audio → form + danger signs (30 s)](workstation_demo.gif)
 ## Problem
 India's ASHA workers conduct 50M+ maternal/child health home visits per year across rural areas. Every visit ends with paper forms filled from memory, then physically carried to the Primary Health Center. Danger signs observed in the field — preeclampsia, postpartum hemorrhage, neonatal distress — often never reach the system in time for intervention.
 ## Solution
+Single product, two deployments. Same schema, same anti-hallucination pipeline. Matched to how ASHA workers actually operate:
 - **Health-center mode (workstation + E4B via Ollama)** — sub-center / PHC / camp with a shared workstation. Phone records Hindi audio → LAN upload → Whisper ASR + Gemma 4 E4B on GPU with native function calling → structured JSON back to phone. Fast (~15 s) and accurate. This is the primary voice-to-form path.
 - **Field mode (phone)** has two offline sub-paths:
 ## Reproducing the demo
+Two reproduction paths. Pick by available hardware.
 **Path 1 — workstation, ~5 minutes (recommended for reviewers).** Runs the full pipeline (Whisper + Gemma 4 E4B via Ollama) on any CUDA workstation with ≥10 GB VRAM (the E4B Q4_K_M model is ~9 GB resident). No phone needed; same extraction code, same anti-hallucination validation, same form output. With Ollama running, the three commands are `pip install -r requirements-hf.txt && ollama pull gemma4:e4b-it-q4_K_M && python api.py` then open `http://localhost:8000`. Note the slim `requirements-hf.txt` — inference goes through Ollama + faster-whisper, so PyTorch / Unsloth / bitsandbytes from the full `requirements.txt` are training-only and not needed here. Voice-to-form, text-to-form, and queue-and-sync flows all run on this stack. This is sufficient to verify our engineering claims (function calling, normalization, 6-layer validation, schema correctness).
 # ── On-device Cactus model (for field mode) ──
 # Two install paths. Pick one.
 #
+# (A) PRIMARY — no developer tooling required:
 #   1. Accept the Cactus-Compute terms at huggingface.co/Cactus-Compute/gemma-4-E2B-it
 #   2. Download gemma-4-e2b-it-int4.zip (~4.4 GB) to a PC, then transfer to
 #      the phone's Downloads folder via USB cable (MTP) or USB-OTG drive.
 ## Public Demo — HuggingFace Space
+**Try it live:** [https://huggingface.co/spaces/Tushar9802/sakhi](https://huggingface.co/spaces/Tushar9802/sakhi) — same `python api.py` stack as a local install, running on a T4 GPU. Same React UI, same FastAPI endpoints, same Whisper + Ollama pipeline; just on cloud hardware so reviewers without their own GPU can exercise the workstation path.
+**Heads-up on cold-boot wait.** The Space runs on ephemeral disk, so the first request after it's been idle (~15 min) pays a ~5 min cold-boot wait while the 9 GB Gemma model and 3 GB Whisper CT2 mirror download and load into VRAM. For instant evaluation see the 3-minute demo video, or follow Path 1 above to run locally — the live Space exists for convenience, not as the rigorous evaluation path.
+### How it's deployed
 **Files driving the deploy:**
 git push hf master
 # In the HF Space UI, set:
+#   Hardware       → T4 small (or larger)
+#   Storage        → ephemeral disk only. Every cold-boot re-downloads
+#                     + loads ~12 GB.
+#   Sleep time     → 1 h. Captures intra-hour clustering of reviewer
+#                     traffic without keeping the GPU billed 24/7.
+#   Visibility     → Public.
 ```
+On every cold-boot the container pulls `gemma4:e4b-it-q4_K_M` (~9 GB, ~80 s @ 100 MB/s) and the FastAPI startup hook downloads + loads Whisper-Large CT2 from `Tushar9802/whisper-large-v2-hindi-ct2` (~3 GB). The Space only marks ready when both models are resident, so the first request after a sleep pays a ~5 min wait.
 **Subsequent updates:** `git push hf master` after any code change; HF rebuilds and redeploys.

RETRAIN_RESULTS.md CHANGED Viewed

@@ -13,21 +13,9 @@
 ## Verdict
-**BASE MODEL WINS — keep using gemma4:e4b-it-q4_K_M**
-Fine-tuning did not improve quality. Skip Unsloth track.
-## Base Model Details
-```
-```
-## Fine-Tuned Model Details
-```
-```
 ## Diagnostics

 ## Verdict
+**Base wins on pass-rate (15/15 vs 14/15) and ships as the production path.**
+The fine-tune misses one Hinglish code-switching case where it raises referral urgency one level — a safer failure mode (over-refer rather than under-refer), but a miss against the rubric. It is kept available in Ollama as `sakhi:latest` for deployments that prefer consistent English schema labels (`दस्त` → `Diarrhea`, `चक्कर` → `dizziness`) over raw Hindi transcription. See `FIELD_COVERAGE_DIFF.md` for the field-level diff and `FAILURES.md` for the root cause of the single Hinglish miss.
 ## Diagnostics

demo_audio/manifest.json CHANGED Viewed

@@ -6,7 +6,7 @@
     "duration_s": 20,
     "visit_type_hint": "anc_visit",
     "speaker": "male",
-    "description": "Short ANC clip. ASHA reads BP 155/100 — the danger-sign threshold trigger. Demonstrates the danger pipeline end-to-end on a clip a judge will actually sit through."
   },
   {
     "id": "anc_preeclampsia_full",
@@ -15,6 +15,6 @@
     "duration_s": 52,
     "visit_type_hint": "anc_visit",
     "speaker": "male",
-    "description": "Full ANC home-visit role-play. Headache, blurred vision, facial swelling, BP 155/100, late-pregnancy context (≈8 months) — multiple preeclampsia danger signs plus a PHC-referral decision."
   }
 ]

     "duration_s": 20,
     "visit_type_hint": "anc_visit",
     "speaker": "male",
+    "description": "ASHA reads BP 160/110 — above the preeclampsia threshold. Short clip that exercises the full danger pipeline (ASR → normalize → form → danger flag)."
   },
   {
     "id": "anc_preeclampsia_full",
     "duration_s": 52,
     "visit_type_hint": "anc_visit",
     "speaker": "male",
+    "description": "Full ANC home-visit role-play. Headache, blurred vision, facial swelling, BP 160/110, late-pregnancy context (≈8 months) — multiple preeclampsia danger signs plus a PHC-referral decision."
   }
 ]