# Cohere Transcribe ASR: Setup Guide

## Overview

The voice journal pipeline uses **CohereLabs/cohere-transcribe-03-2026**, a 2B-parameter conformer encoder with a lightweight transformer decoder, trained from scratch for ASR. The model is gated on the Hugging Face Hub (you must accept the model terms once with your account) and supports 14 languages:

- **European:** English, French, German, Italian, Spanish, Portuguese, Greek, Dutch, Polish
- **APAC:** Chinese (Mandarin), Japanese, Korean, Vietnamese
- **MENA:** Arabic

The model is Apache 2.0 licensed and is integrated into the journal pipeline so players can speak during a game instead of typing.

## Why this configuration

1. **Sponsor visibility:** Cohere Labs is a hackathon sponsor.
2. **State-of-the-art accuracy:** 5.42 mean WER on the Open ASR Leaderboard (5.x–10.x WER across real-world domains) and 1.25 WER on LibriSpeech clean.
3. **Production runtime:** supports 🤗 Transformers (offline), vLLM, mlx-audio, Rust, and a WebGPU browser demo.
4. **Lazy loading:** the model is downloaded on first use, never at app startup, so demo boot is unaffected.

## Installation

### 1. Accept the model terms

Visit the model page on the Hugging Face Hub and click **Agree and access repository** while signed in with the account you plan to authenticate as.

### 2. Install dependencies

```bash
pip install 'transformers>=5.4.0' torch huggingface_hub \
    soundfile librosa sentencepiece protobuf
```

(These are added to `requirements.txt` for the demo; `transformers` is already present.)

### 3. Provide an HF token

Set a token in the environment so the gated model can be downloaded:

```bash
export HF_TOKEN=hf_xxx...        # Linux/macOS
$env:HF_TOKEN="hf_xxx..."        # PowerShell
```

On Hugging Face Spaces, create an `HF_TOKEN` secret (same name as `huggingface` in `modal_serve.py`/`modal_train.py`).
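
A missing token otherwise only shows up as a download error on the first transcription, so it can help to check for it explicitly. The following is a minimal sketch of such a check, not the code in `app/services/asr.py`. The helper name `resolve_hf_token` and the error wording are hypothetical; only the `HF_TOKEN` variable name comes from this guide.

```python
import os


def resolve_hf_token(env=None):
    """Return the Hugging Face token from HF_TOKEN, or raise with a setup hint.

    `resolve_hf_token` is a hypothetical helper, not part of the pipeline.
    """
    # Accept an explicit mapping so the check can be exercised without
    # touching the real process environment.
    env = os.environ if env is None else env
    token = (env.get("HF_TOKEN") or "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set. Accept the model terms on the Hub, then "
            "set HF_TOKEN before the first transcription."
        )
    return token
```

Calling a check like this just before the first model load, rather than at import time, keeps it compatible with lazy loading: the app still boots without a token, and the error only appears once voice input is used.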

## How the pipeline uses it

```text
Gradio microphone / upload
        ↓
app.py: record_journal(audio_path, language)
        ↓
app/services/asr.py: transcribe(audio_path, language)
        ↓
CohereAsrForConditionalGeneration  ←  CohereLabs/cohere-transcribe-03-2026
        ↓
transcript
        ↓
app/services/journal.py: create_journal_entry(...)
        ↓
app/logs/journals.jsonl
```

Each journal entry now carries:

- `transcript_source`: `"typed" | "asr" | "hybrid"`
- `audio_ref`: path of the recorded audio clip
- `asr`: `{ model, language, status, error }`

The `journal_recorded` event log also includes `transcript_source`, `asr_status`, and `asr_model` for full traceability.

## Skipping the model in tests

To run the demo without downloading the model, set either of:

```bash
CITYQUEST_SKIP_MODEL=1
CITYQUEST_FAST_TEST=1
```

When set, `app.services.asr.transcribe()` returns `status="skipped"` with an empty transcript. The journal pipeline silently falls back to typed input.

## Verification

```bash
$env:CITYQUEST_FAST_TEST="1"
.\.venv\Scripts\python.exe test_asr.py
.\.venv\Scripts\python.exe test_end_to_end.py
```

Expected: 30/30 ASR tests + 86/86 end-to-end tests pass in skip-mode.

## Limitations (per the model card)

1. **Single language per call:** pick the right language code; the model does not auto-detect or handle code-switching well.
2. **No diarization or timestamps:** only plain text is returned.
3. **Eager on silence:** prepend a VAD/silence gate if the recording has noisy backgrounds; otherwise the model may hallucinate.

## File map

| File | Purpose |
| --- | --- |
| `app/services/asr.py` | Lazy-loaded Cohere Transcribe wrapper. |
| `app/services/journal.py` | `transcribe_journal()` and `create_journal_entry()` now accept ASR metadata. |
| `app/schemas/journal_schema.json` | Optional `transcript_source`, `audio_ref`, `asr` fields. |
| `app.py` | Gradio audio component + `record_journal()` voice path. |
| `test_asr.py` | Skip-mode tests for the ASR pipeline. |
| `requirements.txt` | Optional ASR runtime deps. |
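
The skip-mode behavior described under "Skipping the model in tests" can be sketched roughly as below. This is an illustration under stated assumptions, not the contents of `app/services/asr.py` or `app/services/journal.py`. The helper names, the top-level `transcript` key, the rule that a flag counts as set only when it equals `"1"`, and the way typed and spoken text are merged for `"hybrid"` are all hypothetical. The flag names, the `asr` fields, and the three `transcript_source` values come from this guide.

```python
import os

MODEL_ID = "CohereLabs/cohere-transcribe-03-2026"
SKIP_FLAGS = ("CITYQUEST_SKIP_MODEL", "CITYQUEST_FAST_TEST")


def skip_requested(env=None):
    """True when either skip flag equals "1" (assumed convention)."""
    env = os.environ if env is None else env
    return any(env.get(flag) == "1" for flag in SKIP_FLAGS)


def skipped_result(language):
    """Hypothetical result shape: empty transcript plus the documented asr fields."""
    return {
        "transcript": "",
        "asr": {
            "model": MODEL_ID,
            "language": language,
            "status": "skipped",
            "error": None,
        },
    }


def pick_transcript(typed_text, asr_result):
    """Choose the journal text and its transcript_source.

    Assumed precedence: both present -> "hybrid", speech only -> "asr",
    otherwise fall back to typed input.
    """
    spoken = asr_result["transcript"].strip()
    typed = typed_text.strip()
    if spoken and typed:
        return f"{typed}\n{spoken}", "hybrid"
    if spoken:
        return spoken, "asr"
    return typed, "typed"
```

In skip mode the transcript is empty, so `pick_transcript` always lands on the typed branch, which matches the silent fallback to typed input described above.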