Zipformer Transducer XL 290M

Offline English ASR model based on the Icefall/K2 pruned transducer Zipformer recipe. The model is intended for Icefall transducer decoding.

Files

  • model.pt: Zipformer transducer model
  • bpe.model: SentencePiece BPE model
  • tokens.txt: Icefall token table exported from bpe.model
  • config.yaml: feature extraction, tokenizer, Zipformer model architecture, and decoding settings; also serves as the Hugging Face Hub download-metrics query file
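
As a rough illustration of how tokens.txt can be read: Icefall token tables are plain text with one "symbol id" pair per line. The helper name load_token_table and the sample entries below are illustrative assumptions; they are not copied from this repo's tokens.txt.

```python
from typing import Dict, Iterable


def load_token_table(lines: Iterable[str]) -> Dict[int, str]:
    """Parse 'symbol id' lines into an id -> symbol mapping."""
    table: Dict[int, str] = {}
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Split on the last whitespace run: the id is always the final field.
        symbol, token_id = line.rsplit(maxsplit=1)
        table[int(token_id)] = symbol
    return table


# Made-up sample entries in the same layout (not the real token table).
sample = ["<blk> 0", "<sos/eos> 1", "<unk> 2", "▁the 3"]
id_to_symbol = load_token_table(sample)
print(id_to_symbol[0])
```

For the real file, pass an open file handle instead of the sample list, for example open("tokens.txt", encoding="utf-8").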

Evaluation

Open ASR Leaderboard English short-form result, decoded with modified_beam_search and beam size 6:

Metric            Value
Average WER (%)   6.79
RTFx              100.06
Parameters        288M
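
For readers unfamiliar with the two metrics, here is a minimal sketch of how they are commonly defined: WER as word-level edit distance divided by the number of reference words, and RTFx as seconds of audio processed per second of compute. The function names are illustrative, and this is not the leaderboard's scoring code, which may differ (for example in text normalization).

```python
from typing import List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER in percent: word-level edit distance / number of reference words."""
    ref: List[str] = reference.split()
    hyp: List[str] = hypothesis.split()
    if not ref:
        raise ValueError("Reference must contain at least one word.")
    # previous[j] = edit distance between the reference prefix seen so far and hyp[:j]
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return 100.0 * previous[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio processed per second of compute."""
    return audio_seconds / processing_seconds
```

Under this definition, an RTFx of 100.06 corresponds to roughly 36 seconds of processing per hour of audio.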

Dataset WERs:

Dataset                  WER (%)
AMI                      14.43
Earnings22               9.38
GigaSpeech               10.46
LibriSpeech test-clean   2.08
LibriSpeech test-other   5.03
SPGISpeech               2.08
TED-LIUM                 4.05
VoxPopuli                6.81
Average                  6.79
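
The reported average is consistent with an unweighted mean over the eight test sets, which the following short check reproduces from the per-dataset values listed here. The variable names are illustrative.

```python
# Per-dataset WERs (percent), copied from the table above.
dataset_wer = {
    "AMI": 14.43,
    "Earnings22": 9.38,
    "GigaSpeech": 10.46,
    "LibriSpeech test-clean": 2.08,
    "LibriSpeech test-other": 5.03,
    "SPGISpeech": 2.08,
    "TED-LIUM": 4.05,
    "VoxPopuli": 6.81,
}

# Unweighted (macro) average: every dataset counts equally, regardless of size.
average_wer = sum(dataset_wer.values()) / len(dataset_wer)
print(f"{average_wer:.2f}")  # prints 6.79
```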

Training Data

The model was trained on a combined English training mixture built from the training portions of the datasets below.

Dataset                                          Train Hours   Source
LibriSpeech                                      960.0         ESB datasets
Earnings-22                                      105.0         ESB datasets
AMI Meeting Corpus                               78.0          ESB datasets
Common Voice Scripted Speech 25.0 - English      ~1,679.0      Mozilla Data Collective
Common Voice Spontaneous Speech 3.0 - English    ~3.6          Mozilla Data Collective
GigaSpeech XL                                    10,000.0      ESB datasets
SPGISpeech                                       4,900.0       ESB datasets
TED-LIUM Release 3                               454.0         ESB datasets
VoxPopuli                                        523.0         ESB datasets
Total                                            ~18,700       ESB datasets and Mozilla Data Collective
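
As a quick sanity check on the total, the snippet below sums the listed hours and prints each dataset's share of the mixture. The Common Voice figures are approximate in the table, so the sum is approximate too; the variable names are illustrative.

```python
# Training hours per dataset, copied from the table above
# (the two Common Voice entries are approximate).
train_hours = {
    "LibriSpeech": 960.0,
    "Earnings-22": 105.0,
    "AMI Meeting Corpus": 78.0,
    "Common Voice Scripted Speech 25.0 - English": 1679.0,
    "Common Voice Spontaneous Speech 3.0 - English": 3.6,
    "GigaSpeech XL": 10000.0,
    "SPGISpeech": 4900.0,
    "TED-LIUM Release 3": 454.0,
    "VoxPopuli": 523.0,
}

total_hours = sum(train_hours.values())
print(f"Total: {total_hours:,.1f} hours")

# Share of the mixture per dataset, largest first.
for name, hours in sorted(train_hours.items(), key=lambda item: item[1], reverse=True):
    print(f"{name:<46} {100.0 * hours / total_hours:5.1f}%")
```

By these figures, GigaSpeech XL and SPGISpeech together account for roughly 80% of the training hours.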

Common Voice Scripted Speech 25.0 - English and Common Voice Spontaneous Speech 3.0 - English are the English scripted and spontaneous Common Voice releases checked on May 12, 2026. Mozilla Data Collective lists them with March 30, 2026 and March 22, 2026 release dates, respectively.

Training transcripts were normalized directly with an LLM, using a self-hosted GLM-4.7 setup, plus custom agentic workflows built in collaboration with Claude Code 4.7 and Codex 5.4.

Training

The model was trained with the k2-fsa/Icefall framework, PyTorch 2.10, and CUDA 13.0 for 35 epochs on 8 NVIDIA B200 GPUs using bf16 automatic mixed precision. Training took approximately 4 days.

Usage With Icefall

From an Icefall checkout, download this model repo locally:

huggingface-cli download soundsgoodai/Zipformer-transducer-XL-290M \
  --local-dir Zipformer-transducer-XL-290M
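
The same download can likely be done from Python with the huggingface_hub package. This is a minimal sketch, assuming huggingface_hub is installed; the wrapper name download_model is illustrative.

```python
REPO_ID = "soundsgoodai/Zipformer-transducer-XL-290M"


def download_model(local_dir: str = "Zipformer-transducer-XL-290M") -> str:
    """Download the model repo and return the path of the local directory."""
    # Imported lazily so this module loads even without huggingface_hub installed.
    from huggingface_hub import snapshot_download

    return snapshot_download(repo_id=REPO_ID, local_dir=local_dir)
```

Calling download_model() should leave model.pt, bpe.model, tokens.txt, and config.yaml under the chosen directory.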

Run offline decoding through Icefall. Input audio must already be 16 kHz:

cd icefall/egs/librispeech/ASR

PYTHONPATH=../../.. python zipformer/pretrained.py \
  --checkpoint /path/to/Zipformer-transducer-XL-290M/model.pt \
  --tokens /path/to/Zipformer-transducer-XL-290M/tokens.txt \
  --method modified_beam_search \
  --beam-size 6 \
  --num-encoder-layers "2,2,4,5,4,2" \
  --downsampling-factor "1,2,4,8,4,2" \
  --feedforward-dim "512,1024,2048,3072,2048,1024" \
  --num-heads "4,4,4,8,4,4" \
  --encoder-dim "192,384,768,1024,768,384" \
  --encoder-unmasked-dim "192,192,320,384,320,192" \
  --query-head-dim "32" \
  --value-head-dim "12" \
  --pos-head-dim "4" \
  --pos-dim 48 \
  --cnn-module-kernel "31,31,15,15,15,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --context-size 2 \
  --causal false \
  --chunk-size "16,32,64,-1" \
  --left-context-frames "64,128,256,-1" \
  --use-transducer true \
  --use-ctc false \
  --use-attention-decoder false \
  --use-cr-ctc false \
  /path/to/audio_1.wav \
  /path/to/audio_2.wav

The architecture and decoding values above are also recorded in config.yaml.
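
Because the command is long, it can help to assemble it programmatically and check it before running. The sketch below copies the flag values from the command above and verifies that the per-stack lists all have the same number of entries, on the assumption that Icefall expects one entry per encoder stack. The names PER_STACK_ARGS, OTHER_ARGS, check_stack_lengths, and build_command are illustrative and not part of Icefall.

```python
from typing import Dict, List, Sequence

# Flags whose values list one entry per encoder stack.
PER_STACK_ARGS: Dict[str, str] = {
    "--num-encoder-layers": "2,2,4,5,4,2",
    "--downsampling-factor": "1,2,4,8,4,2",
    "--feedforward-dim": "512,1024,2048,3072,2048,1024",
    "--num-heads": "4,4,4,8,4,4",
    "--encoder-dim": "192,384,768,1024,768,384",
    "--encoder-unmasked-dim": "192,192,320,384,320,192",
    "--cnn-module-kernel": "31,31,15,15,15,31",
}

# Remaining flags, copied unchanged from the command above.
OTHER_ARGS: Dict[str, str] = {
    "--method": "modified_beam_search",
    "--beam-size": "6",
    "--query-head-dim": "32",
    "--value-head-dim": "12",
    "--pos-head-dim": "4",
    "--pos-dim": "48",
    "--decoder-dim": "512",
    "--joiner-dim": "512",
    "--context-size": "2",
    "--causal": "false",
    "--chunk-size": "16,32,64,-1",
    "--left-context-frames": "64,128,256,-1",
    "--use-transducer": "true",
    "--use-ctc": "false",
    "--use-attention-decoder": "false",
    "--use-cr-ctc": "false",
}


def check_stack_lengths(args: Dict[str, str]) -> int:
    """Return the shared number of stacks, or raise if the lists disagree."""
    lengths = {flag: len(value.split(",")) for flag, value in args.items()}
    if len(set(lengths.values())) != 1:
        raise ValueError(f"Per-stack lists differ in length: {lengths}")
    return next(iter(lengths.values()))


def build_command(model_dir: str, audio_paths: Sequence[str]) -> List[str]:
    """Assemble the argument list for zipformer/pretrained.py."""
    check_stack_lengths(PER_STACK_ARGS)
    command = [
        "python",
        "zipformer/pretrained.py",
        "--checkpoint", f"{model_dir}/model.pt",
        "--tokens", f"{model_dir}/tokens.txt",
    ]
    for flag, value in {**OTHER_ARGS, **PER_STACK_ARGS}.items():
        command += [flag, value]
    return command + list(audio_paths)
```

The resulting list can be passed to subprocess.run with cwd set to icefall/egs/librispeech/ASR and PYTHONPATH pointing at the Icefall checkout root, mirroring the shell invocation.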

Decoding Methods

The model supports the following Icefall decoding methods:

  • greedy_search
  • modified_beam_search
  • fast_beam_search

The reported Open ASR Leaderboard result uses modified_beam_search with beam size 6.
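
To give a feel for the simplest of these methods, here is a toy sketch of transducer greedy search with a limited label context (context size 2, as in the command-line settings). It is a simplified illustration, not Icefall's implementation: the joiner is a stand-in callable, the frames are plain integers, and the function names and the one-symbol-per-frame limit are assumptions made for the example.

```python
from typing import Any, Callable, List, Sequence


def greedy_search(
    encoder_frames: Sequence[Any],
    joiner: Callable[[Any, Sequence[int]], Sequence[float]],
    blank_id: int = 0,
    context_size: int = 2,
    max_symbols_per_frame: int = 1,
) -> List[int]:
    """Pick the best-scoring symbol per frame; blank advances to the next frame."""
    # Start from an all-blank label context; it is stripped from the result.
    hyp: List[int] = [blank_id] * context_size
    for frame in encoder_frames:
        for _ in range(max_symbols_per_frame):
            context = hyp[-context_size:]
            scores = joiner(frame, context)
            best = max(range(len(scores)), key=lambda i: scores[i])
            if best == blank_id:
                break  # blank: nothing emitted, move on to the next frame
            hyp.append(best)
    return hyp[context_size:]


def toy_joiner(frame: int, context: Sequence[int]) -> List[float]:
    """Toy rule: favor the frame's label unless it repeats the previous token."""
    scores = [0.0] * 4
    target = frame if frame != context[-1] else 0
    scores[target] = 1.0
    return scores


print(greedy_search([1, 1, 0, 2, 2, 3], toy_joiner))  # prints [1, 2, 3]
```

Beam-based methods such as modified_beam_search keep several competing hypotheses per frame instead of a single best one, which is why they take a beam size.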

Feature Extraction And Resampling

For this model, use Kaldi-style fbank features. kaldifeat and kaldi-native-fbank are the recommended feature extraction backends.

Audio should be mono 16 kHz before feature extraction. For sample-rate conversion, audioop.ratecv is recommended because it matches the resampling path used for evaluation. On Python 3.13 and newer, use an audioop-compatible package such as audioop-lts.

import audioop  # on Python 3.13+, provided by the audioop-lts package

import numpy as np
import numpy.typing as npt


def resample_to_16k_audioop(
    audio: npt.NDArray[np.float32],
    source_sample_rate: int,
) -> npt.NDArray[np.float32]:
    target_sample_rate = 16000
    max_signed_int = np.float32(32768.0)

    # Expected shape and dtype: (num_samples,), mono float32 audio in [-1.0, 1.0].
    if audio.ndim != 1:
        raise ValueError(f"Expected 1-D mono audio, got shape {audio.shape}.")
    if audio.dtype != np.float32:
        raise TypeError(f"Expected float32 audio, got {audio.dtype}.")

    # Scale to the int16 range, then clip so that +1.0 maps to 32767
    # instead of overflowing the int16 cast.
    scaled = np.clip(audio * max_signed_int, -32768.0, 32767.0)
    pcm16 = scaled.astype(np.int16).tobytes()
    resampled_pcm16, _ = audioop.ratecv(
        pcm16,
        2,  # int16 sample width in bytes
        1,  # mono
        source_sample_rate,
        target_sample_rate,
        None,  # no previous converter state (single-shot conversion)
    )
    # Convert back to float32 in [-1.0, 1.0).
    resampled_audio = np.frombuffer(resampled_pcm16, dtype=np.int16).astype(np.float32)
    resampled_audio /= max_signed_int
    return resampled_audio
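
Before deciding whether a file needs resampling or downmixing, its header can be inspected with the standard-library wave module. This is a minimal sketch that assumes PCM WAV input; the helper names read_wav_format and needs_conversion are illustrative.

```python
import wave
from typing import Tuple


def read_wav_format(path: str) -> Tuple[int, int]:
    """Return (sample_rate_hz, num_channels) from a PCM WAV header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def needs_conversion(path: str, target_rate: int = 16000) -> bool:
    """True if the file is not already mono audio at the target sample rate."""
    sample_rate, channels = read_wav_format(path)
    return sample_rate != target_rate or channels != 1
```

Files for which needs_conversion returns True should be converted to 16 kHz mono before feature extraction.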

Output Formatting

The model emits normalized English text with punctuation and capitalization. It does not capitalize the first word of a sentence automatically: a sentence-initial word is capitalized only if it would be capitalized anyway, such as a proper noun, honorific, acronym, or similar named expression.

The model also normalizes common written forms, including numbers, dates, and currency, to digit-based forms.

Open ASR Leaderboard Evaluation

The open_asr_leaderboard/soundsgoodai runner downloads this model repo, resamples audio to 16 kHz with audioop, computes fbank features with kaldi-native-fbank, and decodes with Icefall modified_beam_search.

cd open_asr_leaderboard/soundsgoodai
bash run_zipformer.sh

Notes

  • The model expects 16 kHz mono audio.
  • The packaged decoding configuration uses modified_beam_search with beam size 6.
  • The checkpoint is an offline ASR model and is not intended for streaming decoding.
