GGUF + pure-C++ runtime in CrispASR — OmniASR-CTC-300M

by cstr - opened May 1

We've added OmniASR-CTC-300M to CrispASR as the omniasr backend.

Three things bit hard during the port (all documented in our LEARNINGS.md "OmniASR-CTC: three critical findings"):

1. Input must be a layer-normalised waveform (zero mean, unit variance). Without this the CTC head emits mostly blanks.
2. The CTC blank is token 0 (<s>), not token 1 (<pad>) as in HF wav2vec2. This is the fairseq2 convention.
3. pos_conv padding is K // 2 (= 64 for K = 128), not (K - 1) // 2. This is what reproduces same-padding for the fairseq2 Conv1d.
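
For anyone porting this elsewhere, the three findings can be sketched in a few lines of plain Python. This is an illustration only, not CrispASR's C++ code: the function names, the `eps` value, and the choice to normalise over the whole utterance are assumptions, and the last function is just the standard Conv1d output-length formula.

```python
import math

BLANK_ID = 0  # per the post: blank is token 0 (<s>), fairseq2 convention


def normalise_waveform(samples, eps=1e-5):
    """Zero-mean, unit-variance normalisation of raw PCM samples.

    eps is an assumed stabiliser, not a value taken from the post.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((x - mean) ** 2 for x in samples) / n
    scale = 1.0 / math.sqrt(var + eps)
    return [(x - mean) * scale for x in samples]


def ctc_greedy_decode(frame_ids, blank_id=BLANK_ID):
    """Greedy CTC: collapse consecutive repeats, then drop blanks."""
    out = []
    prev = None
    for t in frame_ids:
        if t != prev and t != blank_id:
            out.append(t)
        prev = t
    return out


def conv1d_out_len(n, kernel, padding, stride=1):
    """Standard Conv1d output length (dilation 1)."""
    return (n + 2 * padding - kernel) // stride + 1


K = 128
len_k_half = conv1d_out_len(100, K, K // 2)         # padding 64 -> 101 frames
len_k_minus = conv1d_out_len(100, K, (K - 1) // 2)  # padding 63 -> 99 frames
```

With an even kernel neither padding gives exactly n frames: K // 2 yields n + 1 and (K - 1) // 2 yields n - 1. wav2vec2-style implementations typically pad with K // 2 and drop the trailing frame; whether CrispASR trims in the same place is an assumption here.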

Architecture: 7-layer CNN (Conv1d strides [5,2,2,2,2,2,2] = 320× downsampling) + Linear(512→1024) + 24L transformer (d=1024, 16 heads, FFN=4096, pre-norm, GELU) + CTC head (1024→9812). Raw 16 kHz PCM, no mel. ~1600 languages.
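
The numbers in that summary are easy to sanity-check. A small sketch, using only the strides, sample rate, and transformer dimensions quoted above; the variable names are made up for illustration, and the frame count is approximate because the kernel sizes are not given here and edge effects shave off a frame or so.

```python
from math import prod

STRIDES = [5, 2, 2, 2, 2, 2, 2]  # CNN strides from the post
SAMPLE_RATE = 16_000             # raw 16 kHz PCM
D_MODEL, N_HEADS = 1024, 16

downsample = prod(STRIDES)                    # 320 samples per output frame
frames_per_second = SAMPLE_RATE / downsample  # 50.0
frame_ms = 1000 * downsample / SAMPLE_RATE    # 20.0 ms per frame
head_dim = D_MODEL // N_HEADS                 # 64


def approx_num_frames(num_samples):
    """Rough encoder frame count; ignores kernel-size edge effects."""
    return num_samples // downsample


# 10 s of audio gives roughly 500 CTC frames
ten_seconds = approx_num_frames(10 * SAMPLE_RATE)
```

So each CTC frame covers about 20 ms of audio, which is the granularity you get for any frame-level timestamps derived from the CTC output.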

CTC output has no native punctuation; pair it with --punc-model fullstop-punc-q4_k.gguf (XLM-R-large, DE/EN/FR/IT) or fireredpunc-q8_0.gguf (BERT-base, EN+CN).

Pre-quantised GGUFs (Apache-2.0): cstr/omniASR-CTC-300M-v2-GGUF

./build/bin/crispasr --backend omniasr -m omniasr-ctc-300m-q4_k.gguf -l fr \
    -f audio.wav --punc-model fullstop-punc-q4_k.gguf

CrispASR's omniasr backend also handles the 1B CTC variant and the autoregressive LLM variants. It is the same source for all of them, with dispatch on GGUF metadata.
