
GGUF + pure-C++ runtime in CrispASR: OmniASR-CTC-300M

#3
by cstr - opened

We've added OmniASR-CTC-300M to CrispASR as the omniasr backend.

Three things bit hard during the port (all documented in our LEARNINGS.md "OmniASR-CTC: three critical findings"):

  1. Input must be layer-normalised waveform (zero mean, unit variance); without this the CTC head emits mostly blanks.
  2. CTC blank = token 0 (<s>), not token 1 (<pad>) like HF wav2vec2. fairseq2 convention.
  3. pos_conv padding = K // 2 (= 64 for K = 128), not (K-1) // 2 (= 63). This is what reproduces fairseq2's Conv1d same-padding.
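
For anyone porting this elsewhere, the three findings can be sketched in a few lines of plain Python. This is an illustration only, not CrispASR's actual C++ code: the function names are made up here, the `eps` value is an assumption, and computing the statistics over the whole utterance is also an assumption.

```python
import math

def normalise_waveform(samples, eps=1e-5):
    """Finding 1: zero-mean, unit-variance waveform.

    Assumption: statistics are taken over the whole utterance and
    eps=1e-5 is a typical layer-norm epsilon, not a confirmed value.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    scale = 1.0 / math.sqrt(var + eps)
    return [(s - mean) * scale for s in samples]

def ctc_greedy_decode(frame_ids, blank=0):
    """Finding 2: collapse repeated ids, then drop the blank (id 0 here)."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank:
            out.append(t)
        prev = t
    return out

def conv1d_out_len(length, kernel, padding, stride=1):
    """Finding 3: standard Conv1d output length (dilation 1)."""
    return (length + 2 * padding - kernel) // stride + 1

K = 128
frames = 100
len_k_half = conv1d_out_len(frames, K, K // 2)         # frames + 1
len_km1_half = conv1d_out_len(frames, K, (K - 1) // 2)  # frames - 1

# Per-frame argmax ids, with 0 as blank between the two 5s.
ids = [0, 5, 5, 0, 5, 7, 7, 0]
decoded_ok = ctc_greedy_decode(ids, blank=0)
decoded_wrong_blank = ctc_greedy_decode(ids, blank=1)
```

With an even kernel the two paddings differ by two output frames (`frames + 1` vs `frames - 1`), so neither yields exactly `frames` on its own; wav2vec2-style implementations typically trim one trailing frame after padding with `K // 2`. Decoding the same ids with `blank=1` leaves the id-0 frames in the output, which is the kind of garbage the wrong blank index produces.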

Architecture: 7-layer CNN (Conv1d strides [5,2,2,2,2,2,2] = 320× downsampling) + Linear(512→1024) + 24L transformer (d=1024, 16 heads, FFN=4096, pre-norm, GELU) + CTC head (1024→9812). Raw 16 kHz PCM, no mel. ~1600 languages.
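
A quick sanity check of the downsampling arithmetic, as a rough sketch: the strides and sample rate come from the description above, while `approx_frames` is a hypothetical helper that ignores kernel edge effects (the kernel sizes are not listed here), so the real frame count can be off by a frame or so.

```python
from math import prod

STRIDES = [5, 2, 2, 2, 2, 2, 2]  # Conv1d strides of the 7-layer CNN
SAMPLE_RATE = 16_000             # raw 16 kHz PCM input

downsampling = prod(STRIDES)                    # samples per output frame
frames_per_second = SAMPLE_RATE / downsampling  # encoder frame rate
frame_ms = 1000 * downsampling / SAMPLE_RATE    # duration covered by one frame hop

def approx_frames(num_samples):
    """Approximate CTC frame count; ignores kernel edge effects."""
    return num_samples // downsampling

ten_seconds = approx_frames(10 * SAMPLE_RATE)
```

So the transformer and CTC head run at roughly 50 frames per second, one frame per 20 ms of audio.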

CTC = no native punctuation; pair with --punc-model fullstop-punc-q4_k.gguf (XLM-R-large, DE/EN/FR/IT) or fireredpunc-q8_0.gguf (BERT-base, EN+CN).

Pre-quantised GGUFs (Apache-2.0): cstr/omniASR-CTC-300M-v2-GGUF

./build/bin/crispasr --backend omniasr -m omniasr-ctc-300m-q4_k.gguf -l fr \
    -f audio.wav --punc-model fullstop-punc-q4_k.gguf

CrispASR's omniasr backend also handles the 1B CTC variant and the autoregressive LLM variants: same source, GGUF metadata dispatch.
