---
license: apache-2.0
language:
- en
library_name: coreai
pipeline_tag: text-to-speech
tags:
- text-to-speech
- tts
- core-ai
- coreml
- on-device
- styletts2
- kokoro
base_model: hexgrad/Kokoro-82M
---

# Kokoro-82M: Core AI

[`hexgrad/Kokoro-82M`](https://huggingface.co/hexgrad/Kokoro-82M) (Apache-2.0) is a tiny, high-quality **StyleTTS2 + iSTFTNet** text-to-speech model (82M params, 24 kHz), converted here to Apple **Core AI** (`.aimodel`, iOS 27 / macOS 27). It is the first TTS model in the [CoreAI-Model-Zoo](https://github.com/john-rocky/coreai-model-zoo). The model is non-autoregressive: phonemes plus a voice/style vector go in, and a waveform comes out in one pass. It runs fully on-device, is English-first, and does grapheme→phoneme conversion on the host.

## Bundles

The acoustic graph has one data-dependent length (the duration→alignment expansion), so it is cut into **three voice-independent `.aimodel` bundles** with two cheap host steps between them:

| file | inputs → outputs |
|---|---|
| `kokoro_predictor.aimodel` | `input_ids[1,128]` i32, `ref_s[1,256]`, `attn_mask[1,128]` → `duration`, `d`, `t_en` |
| `kokoro_prosody.aimodel` | `d`, `t_en`, `aln[1,128,512]`, `ref_s`, `frame_mask[1,512]` → `asr`, `F0`, `N` |
| `kokoro_vocoder.aimodel` | `asr`, `F0`, `N`, `har`, `ref_s`, `frame_mask` → `audio[1, L·600]` |

`voices/*.pt` holds the **28 English voice packs** (Apache-2.0). The voice is the `ref_s` input: `ref_s = pack[len(ids) - 1]`. Quality leaders: `af_heart`, `af_bella`, `af_nicole`, `bf_emma`.

Token length **T** and frame length **L** are fixed **buckets** (128 / 512); the host left-pads to the bucket and trims the output. Longer text is split into sentences host-side.

Run the bundles on the Core AI **CPU** compute unit: about 0.75 s per utterance on an M4 Max, and about 335 MB total (fp32).

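
The duration→alignment expansion that sits between the predictor and prosody bundles can be sketched in NumPy as below. This is a minimal illustration, not the zoo's host code: the function names `build_alignment` and `trim_audio` are hypothetical, and it assumes that `duration` holds one predicted frame count per token slot, that a nonzero `attn_mask` entry marks a real token, and that valid frames are packed from the start of the frame bucket. The rounding rule (round, then at least one frame per real token) follows stock Kokoro; check the export script for the conventions the bundles actually use.

```python
import numpy as np

T_BUCKET, L_BUCKET = 128, 512   # token / frame bucket sizes from the table above
SAMPLES_PER_FRAME = 600         # the vocoder emits audio[1, L*600]


def build_alignment(duration, token_mask):
    """Expand per-token durations into a one-hot token-to-frame alignment.

    duration:   float array [T_BUCKET], predicted frames per token (assumed layout)
    token_mask: array [T_BUCKET], nonzero for real tokens (assumed convention)
    Returns aln [1, T_BUCKET, L_BUCKET], frame_mask [1, L_BUCKET], n_frames.
    """
    duration = np.asarray(duration, dtype=np.float32).reshape(-1)
    real = np.asarray(token_mask).reshape(-1) != 0
    # Whole frames: at least one per real token, none for padded slots.
    frames = np.where(real, np.maximum(np.rint(duration), 1), 0).astype(np.int64)
    n_frames = int(frames.sum())
    if n_frames > L_BUCKET:
        raise ValueError(
            f"{n_frames} frames exceed the {L_BUCKET}-frame bucket; split the text"
        )
    # Token index that owns each output frame, e.g. [0, 0, 1, 2, 2, 2].
    owner = np.repeat(np.arange(T_BUCKET), frames)
    aln = np.zeros((1, T_BUCKET, L_BUCKET), dtype=np.float32)
    aln[0, owner, np.arange(n_frames)] = 1.0
    # Assumption: valid frames occupy the first n_frames slots of the bucket.
    frame_mask = np.zeros((1, L_BUCKET), dtype=np.float32)
    frame_mask[0, :n_frames] = 1.0
    return aln, frame_mask, n_frames


def trim_audio(audio, n_frames):
    """Keep only the samples that belong to real frames (same packing assumption)."""
    return np.asarray(audio)[..., : n_frames * SAMPLES_PER_FRAME]
```

For example, durations of `[2.2, 0.3, 3.0]` on three real tokens round to 2, 1 and 3 frames, so `n_frames` is 6 and the trimmed audio is 3600 samples (0.15 s at 24 kHz). Raising an error when the frame count overflows the bucket matches the card's advice to split longer text into sentences on the host.
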
## Host steps

```
text ──(misaki G2P)──▶ ids ──▶ predictor ──▶ [build alignment] ──▶ prosody
     ──▶ [har = STFT(SineGen(f0_upsamp(F0)))] ──▶ vocoder ──▶ [trim] ──▶ 24 kHz audio
```

G2P is [misaki](https://github.com/hexgrad/misaki) (`misaki[en]`, no espeak for English); on-device, [MisakiSwift](https://github.com/mlalma/MisakiSwift) gives the same English phonemes. `har` (the STFT of the hn-nsf source) is a windowed FFT computed on the host. It is the one piece that must stay off the engine, because its `atan2` phase flips by 2π at the F0→0 pad boundary under fp32.

## Quality

The hn-nsf source phase is arbitrary (stock Kokoro randomizes it), so the gate is spectral: **magnitude-spectrogram correlation 0.999** vs the PyTorch reference (`af_heart`, multiple sentences). Raw waveform correlation is ~0.98, which reflects the bounded, inaudible effect of the bucket pad boundary.

## Convert / re-bucket

[`conversion/export_kokoro.py`](https://github.com/john-rocky/coreai-model-zoo/blob/main/conversion/export_kokoro.py) performs the conversion: `python export_kokoro.py --out-dir out`. `--verify` runs the engine-vs-torch spectral gate; `--token-bucket` / `--frame-bucket` re-size the buckets. The card and the full port write-up are in [`zoo/kokoro-82m.md`](https://github.com/john-rocky/coreai-model-zoo/blob/main/zoo/kokoro-82m.md).

## License

Apache-2.0 (model weights and the 28 English voices). The Core AI export code derives from Apple's BSD-3-Clause `coreai_models`.