# Cohere Transcribe ASR — Setup Guide

## Overview

The voice journal pipeline uses **CohereLabs/cohere-transcribe-03-2026**, a
2 B-parameter conformer encoder + lightweight transformer decoder trained
from scratch for ASR. It is gated on the Hugging Face Hub (you must
accept the model terms once with your account) and supports 14
languages:

- **European:** English, French, German, Italian, Spanish, Portuguese,
  Greek, Dutch, Polish
- **APAC:** Chinese (Mandarin), Japanese, Korean, Vietnamese
- **MENA:** Arabic

The model is Apache 2.0 licensed and is integrated into the journal
pipeline so players can speak during a game instead of typing.

## Why this configuration

1. **Sponsor visibility** — Cohere Labs is a hackathon sponsor.
2. **State-of-the-art accuracy** — 5.42 mean WER on the Open ASR
   Leaderboard (5.x–10.x WER across real-world domains) and 1.25 WER on
   LibriSpeech clean.
3. **Production runtime** — supports 🤗 Transformers (offline),
   vLLM, mlx-audio, Rust, and a WebGPU browser demo.
4. **Lazy loading** — the model is downloaded on first use, never at
   app startup, so demo boot is unaffected.
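
The lazy-loading behaviour in item 4 can be sketched as a cached loader. This is a minimal illustration, not the code in `app/services/asr.py`: `make_lazy`, `_load_model`, and `load_calls` are hypothetical names, and the loader body is a stand-in for the real model download.

```python
import functools
from typing import Any, Callable, List


def make_lazy(loader: Callable[[], Any]) -> Callable[[], Any]:
    """Wrap an expensive loader so it runs at most once, on the first call."""

    @functools.lru_cache(maxsize=1)
    def get() -> Any:
        return loader()

    return get


# Records every real load so the laziness is observable.
load_calls: List[int] = []


def _load_model() -> dict:
    # Hypothetical stand-in for downloading and initialising the model.
    load_calls.append(1)
    return {"name": "CohereLabs/cohere-transcribe-03-2026"}


# Nothing is loaded here; the work happens on the first get_model() call.
get_model = make_lazy(_load_model)
```

Importing the module costs nothing; the first `get_model()` call pays the load cost and later calls reuse the cached object.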

## Installation

### 1. Accept the model terms

Visit <https://huggingface.co/CohereLabs/cohere-transcribe-03-2026>
and click **Agree and access repository** while signed in to the account
you plan to authenticate as.

### 2. Install dependencies

```bash
pip install 'transformers>=5.4.0' torch huggingface_hub \
            soundfile librosa sentencepiece protobuf
```

(These are added to `requirements.txt` for the demo; `transformers` is
already present.)
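
To confirm the install before launching the demo, a small check like the one below can report which modules are not importable. It is a sketch under assumptions: `missing_modules` is a hypothetical helper, and the list uses import names (for example `google.protobuf` for the `protobuf` package), which may differ from the pip names above.

```python
import importlib.util
from typing import Iterable, List

# Import names assumed to correspond to the packages installed above.
ASR_MODULES = [
    "transformers",
    "torch",
    "huggingface_hub",
    "soundfile",
    "librosa",
    "sentencepiece",
    "google.protobuf",
]


def missing_modules(names: Iterable[str]) -> List[str]:
    """Return the names that cannot be located, without importing them."""
    missing = []
    for name in names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # Raised when the parent package of a dotted name is absent,
            # or when a module is in a broken half-imported state.
            found = False
        if not found:
            missing.append(name)
    return missing


if __name__ == "__main__":
    gaps = missing_modules(ASR_MODULES)
    print("All ASR dependencies found." if not gaps else f"Missing: {gaps}")
```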

### 3. Provide an HF token

Set a token in the environment so the gated model can be downloaded:

```bash
export HF_TOKEN=hf_xxx...    # Linux/macOS
$env:HF_TOKEN="hf_xxx..."     # PowerShell
```

On Hugging Face Spaces, create an `HF_TOKEN` secret (same name as
`huggingface` in `modal_serve.py`/`modal_train.py`).
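
A missing token usually surfaces late, as a failed download. The sketch below fails early with a clearer message instead. `read_hf_token` is a hypothetical helper, not part of the demo code; it only reads the `HF_TOKEN` variable described above and does not contact the Hub.

```python
import os
from typing import Mapping, Optional


def read_hf_token(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the HF token from the environment or raise a clear error."""
    env = os.environ if env is None else env
    token = env.get("HF_TOKEN", "").strip()
    if not token:
        raise RuntimeError(
            "HF_TOKEN is not set; export it (or add it as a Spaces secret) "
            "so the gated model can be downloaded."
        )
    return token
```

Passing a mapping explicitly, as in `read_hf_token({"HF_TOKEN": "hf_xxx"})`, keeps the check easy to exercise in tests without touching the real environment.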

## How the pipeline uses it

```text
Gradio microphone / upload
   ↓
app.py: record_journal(audio_path, language)
   ↓
app/services/asr.py: transcribe(audio_path, language)
   ↓
CohereAsrForConditionalGeneration  ←  CohereLabs/cohere-transcribe-03-2026
   ↓
transcript
   ↓
app/services/journal.py: create_journal_entry(...)
   ↓
app/logs/journals.jsonl
```

Each journal entry now carries:

- `transcript_source` — `"typed" | "asr" | "hybrid"`
- `audio_ref` — path of the recorded audio clip
- `asr` — `{ model, language, status, error }`

The `journal_recorded` event log also includes `transcript_source`,
`asr_status`, and `asr_model` for full traceability.
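
As an illustration of the three fields listed above, the following sketch assembles them into a dictionary. The function names and the rule for choosing `"hybrid"` (both typed and ASR text present) are assumptions made for this example; the actual logic lives in `app/services/journal.py` and may differ.

```python
from typing import Any, Dict, Optional


def pick_transcript_source(typed_text: str, asr_text: str) -> str:
    """Assumed rule: both inputs present means hybrid, otherwise whichever exists."""
    has_typed = bool(typed_text.strip())
    has_asr = bool(asr_text.strip())
    if has_typed and has_asr:
        return "hybrid"
    if has_asr:
        return "asr"
    return "typed"


def build_asr_fields(
    typed_text: str,
    asr_text: str,
    audio_ref: Optional[str],
    model: str,
    language: str,
    status: str,
    error: Optional[str] = None,
) -> Dict[str, Any]:
    """Build the ASR-related part of a journal entry (hypothetical helper)."""
    return {
        "transcript_source": pick_transcript_source(typed_text, asr_text),
        "audio_ref": audio_ref,
        "asr": {
            "model": model,
            "language": language,
            "status": status,
            "error": error,
        },
    }
```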

## Skipping the model in tests

To run the demo without downloading the model, set either of these
environment variables:

```bash
CITYQUEST_SKIP_MODEL=1
CITYQUEST_FAST_TEST=1
```

When set, `app.services.asr.transcribe()` returns
`status="skipped"` with an empty transcript. The journal pipeline
silently falls back to typed input.
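
The skip check can be pictured roughly as follows. This is a sketch, not the wrapper's actual code: `skip_requested` and `transcribe_or_skip` are hypothetical names, the result keys are illustrative, and treating an empty value or `"0"` as "not set" is an assumption.

```python
import os
from typing import Any, Dict, Mapping, Optional

SKIP_VARS = ("CITYQUEST_SKIP_MODEL", "CITYQUEST_FAST_TEST")


def skip_requested(env: Optional[Mapping[str, str]] = None) -> bool:
    """True if either skip variable is set to something other than '' or '0'."""
    env = os.environ if env is None else env
    return any(env.get(name, "").strip() not in ("", "0") for name in SKIP_VARS)


def transcribe_or_skip(
    audio_path: str,
    language: str,
    env: Optional[Mapping[str, str]] = None,
) -> Dict[str, Any]:
    """Return a 'skipped' result in skip mode; the real model call is omitted."""
    if skip_requested(env):
        return {"transcript": "", "status": "skipped", "error": None}
    raise NotImplementedError("real model inference is outside this sketch")
```

A caller can then treat `status == "skipped"` the same way as an empty recording and keep whatever the player typed.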

## Verification

```powershell
$env:CITYQUEST_FAST_TEST="1"
.\.venv\Scripts\python.exe test_asr.py
.\.venv\Scripts\python.exe test_end_to_end.py
```

Expected: 30/30 ASR tests + 86/86 end-to-end tests pass in skip mode.

## Limitations (per the model card)

1. **Single language per call** — pick the right language code; the
   model does not auto-detect or handle code-switching well.
2. **No diarization or timestamps** — only plain text is returned.
3. **Eager on silence** — prepend a VAD/silence gate if the recording
   has silence or background noise; otherwise the model may hallucinate
   text.
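
For limitation 3, one crude option is an energy gate that rejects clips with almost no signal before they reach the model. The sketch below is an assumption-laden illustration, not a real VAD: it works on a plain sequence of float samples in the range -1.0 to 1.0, the thresholds are arbitrary starting points, and loud background noise will still pass it. A proper VAD is the better choice for noisy recordings.

```python
import math
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of a frame; 0.0 for an empty frame."""
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def has_signal_energy(
    samples: Sequence[float],
    frame_len: int = 400,          # 25 ms at 16 kHz (assumed sample rate)
    threshold: float = 0.01,       # frame RMS below this counts as silent
    min_active_ratio: float = 0.1, # share of frames that must be non-silent
) -> bool:
    """Return True if enough frames exceed the RMS threshold."""
    frames = [samples[i:i + frame_len] for i in range(0, len(samples), frame_len)]
    if not frames:
        return False
    active = sum(1 for frame in frames if rms(frame) >= threshold)
    return active / len(frames) >= min_active_ratio
```

A caller could skip transcription and keep the typed text whenever `has_signal_energy(...)` returns `False`.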

## File map

| File | Purpose |
| --- | --- |
| `app/services/asr.py` | Lazy-loaded Cohere Transcribe wrapper. |
| `app/services/journal.py` | `transcribe_journal()` and `create_journal_entry()` now accept ASR metadata. |
| `app/schemas/journal_schema.json` | Optional `transcript_source`, `audio_ref`, `asr` fields. |
| `app.py` | Gradio audio component + `record_journal()` voice path. |
| `test_asr.py` | Skip-mode tests for the ASR pipeline. |
| `requirements.txt` | Optional ASR runtime deps. |