Zipformer Transducer XL 290M

Offline English ASR model based on the Icefall/k2 pruned transducer Zipformer recipe. The model is intended for Icefall transducer decoding.

Files

model.pt: Zipformer transducer model
bpe.model: SentencePiece BPE model
tokens.txt: Icefall token table exported from bpe.model
config.yaml: model architecture, feature extraction, tokenizer, Zipformer model, and decoding settings; it also serves as the Hugging Face Hub download-metrics query file

Evaluation

Open ASR Leaderboard English short-form result, decoded with modified_beam_search and beam size 6:

Metric	Value
Average WER	6.79
RTFx	100.06
Parameters	288M
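
As a rough reading aid for the RTFx figure, the sketch below converts it into an estimated decoding time. It assumes RTFx means audio duration divided by processing time, which is how the Open ASR Leaderboard reports it; the function name is illustrative only, and actual throughput depends on hardware and batching.

```python
# RTFx is taken here as (audio duration) / (processing time). Assumption:
# the figure was measured in the leaderboard's own hardware setup.
RTFX = 100.06  # value from the table above


def processing_seconds(audio_seconds: float, rtfx: float = RTFX) -> float:
    """Estimate wall-clock decoding time for a given amount of audio."""
    if rtfx <= 0:
        raise ValueError("rtfx must be positive")
    return audio_seconds / rtfx


if __name__ == "__main__":
    one_hour = 3600.0
    print(f"{processing_seconds(one_hour):.1f} s to decode one hour of audio")
```

At this rate, one hour of audio would take a little over half a minute to decode under comparable conditions.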

Dataset WERs:

Dataset	WER
AMI	14.43
Earnings22	9.38
GigaSpeech	10.46
LibriSpeech test-clean	2.08
LibriSpeech test-other	5.03
SPGISpeech	2.08
TED-LIUM	4.05
VoxPopuli	6.81
Average	6.79
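
The Average row is consistent with an unweighted mean over the eight test sets. A minimal check, assuming every dataset counts equally regardless of size:

```python
from statistics import fmean

# Per-dataset WERs (%) copied from the table above.
DATASET_WER = {
    "AMI": 14.43,
    "Earnings22": 9.38,
    "GigaSpeech": 10.46,
    "LibriSpeech test-clean": 2.08,
    "LibriSpeech test-other": 5.03,
    "SPGISpeech": 2.08,
    "TED-LIUM": 4.05,
    "VoxPopuli": 6.81,
}

# Unweighted (macro) average: each dataset contributes equally.
average_wer = fmean(DATASET_WER.values())

if __name__ == "__main__":
    print(f"Average WER: {average_wer:.2f}")
```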

Training Data

The model was trained on a combined English training mixture built from the training portions of the datasets below.

Dataset	Train Hours	Source
LibriSpeech	960.0	ESB datasets
Earnings-22	105.0	ESB datasets
AMI Meeting Corpus	78.0	ESB datasets
Common Voice Scripted Speech 25.0 - English	~1,679.0	Mozilla Data Collective
Common Voice Spontaneous Speech 3.0 - English	~3.6	Mozilla Data Collective
GigaSpeech XL	10,000.0	ESB datasets
SPGISpeech	4,900.0	ESB datasets
TED-LIUM Release 3	454.0	ESB datasets
VoxPopuli	523.0	ESB datasets
Total	~18,700	ESB datasets and Mozilla Data Collective
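
To see how the mixture is weighted, the sketch below sums the listed hours and prints each dataset's share. The Common Voice figures are approximate in the table, so the shares are approximate too.

```python
# Training hours copied from the table above (Common Voice values are approximate).
TRAIN_HOURS = {
    "LibriSpeech": 960.0,
    "Earnings-22": 105.0,
    "AMI Meeting Corpus": 78.0,
    "Common Voice Scripted Speech 25.0 - English": 1679.0,
    "Common Voice Spontaneous Speech 3.0 - English": 3.6,
    "GigaSpeech XL": 10000.0,
    "SPGISpeech": 4900.0,
    "TED-LIUM Release 3": 454.0,
    "VoxPopuli": 523.0,
}

total_hours = sum(TRAIN_HOURS.values())
# Fraction of the full mixture contributed by each dataset.
shares = {name: hours / total_hours for name, hours in TRAIN_HOURS.items()}

if __name__ == "__main__":
    print(f"Total: {total_hours:,.1f} h")
    for name, share in sorted(shares.items(), key=lambda item: -item[1]):
        print(f"{name}: {share:.1%}")
```

GigaSpeech XL and SPGISpeech together account for most of the training audio, which is worth keeping in mind when reading the per-dataset WERs.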

Common Voice Scripted Speech 25.0 - English and Common Voice Spontaneous Speech 3.0 - English are the English scripted and spontaneous Common Voice releases checked on May 12, 2026. Mozilla Data Collective lists them with March 30, 2026 and March 22, 2026 release dates, respectively.

The training transcripts were normalized with direct LLM normalization on a self-hosted GLM-4.7 setup, plus custom agentic workflows built in collaboration with Claude Code 4.7 and Codex 5.4.

Training

The model was trained with the k2-fsa/Icefall framework, PyTorch 2.10, and CUDA 13.0 for 35 epochs on 8 NVIDIA B200 GPUs using bf16 automatic mixed precision. Training took approximately 4 days.

Usage With Icefall

From an Icefall checkout, download this model repo locally:

huggingface-cli download soundsgoodai/Zipformer-transducer-XL-290M \
  --local-dir Zipformer-transducer-XL-290M

Run offline decoding through Icefall. Input audio must already be 16 kHz mono:

cd icefall/egs/librispeech/ASR

PYTHONPATH=../../.. python zipformer/pretrained.py \
  --checkpoint /path/to/Zipformer-transducer-XL-290M/model.pt \
  --tokens /path/to/Zipformer-transducer-XL-290M/tokens.txt \
  --method modified_beam_search \
  --beam-size 6 \
  --num-encoder-layers "2,2,4,5,4,2" \
  --downsampling-factor "1,2,4,8,4,2" \
  --feedforward-dim "512,1024,2048,3072,2048,1024" \
  --num-heads "4,4,4,8,4,4" \
  --encoder-dim "192,384,768,1024,768,384" \
  --encoder-unmasked-dim "192,192,320,384,320,192" \
  --query-head-dim "32" \
  --value-head-dim "12" \
  --pos-head-dim "4" \
  --pos-dim 48 \
  --cnn-module-kernel "31,31,15,15,15,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --context-size 2 \
  --causal false \
  --chunk-size "16,32,64,-1" \
  --left-context-frames "64,128,256,-1" \
  --use-transducer true \
  --use-ctc false \
  --use-attention-decoder false \
  --use-cr-ctc false \
  /path/to/audio_1.wav \
  /path/to/audio_2.wav

The architecture and decoding values above are also recorded in config.yaml.
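
The comma-separated encoder options in the command each carry one value per encoder stack, six in this configuration. When copying them into a script, a quick consistency check can catch a dropped or extra value. The sketch below is a standalone helper, not part of Icefall; the function names are hypothetical, and the check that each unmasked dimension does not exceed its encoder dimension reflects my understanding of the Icefall option help text rather than anything stated in this model card.

```python
# Per-stack encoder options copied from the command above.
PER_STACK_OPTIONS = {
    "num-encoder-layers": "2,2,4,5,4,2",
    "downsampling-factor": "1,2,4,8,4,2",
    "feedforward-dim": "512,1024,2048,3072,2048,1024",
    "num-heads": "4,4,4,8,4,4",
    "encoder-dim": "192,384,768,1024,768,384",
    "encoder-unmasked-dim": "192,192,320,384,320,192",
    "cnn-module-kernel": "31,31,15,15,15,31",
}


def parse_int_list(value: str) -> tuple:
    """Parse a comma-separated option value into a tuple of ints."""
    return tuple(int(item) for item in value.split(","))


def check_stack_options(options: dict) -> int:
    """Return the number of encoder stacks, or raise if the options disagree."""
    parsed = {name: parse_int_list(value) for name, value in options.items()}
    lengths = {len(values) for values in parsed.values()}
    if len(lengths) != 1:
        raise ValueError(f"Per-stack options have differing lengths: {sorted(lengths)}")
    # Assumed constraint: unmasked dim must not exceed the stack's encoder dim.
    for unmasked, dim in zip(parsed["encoder-unmasked-dim"], parsed["encoder-dim"]):
        if unmasked > dim:
            raise ValueError(f"encoder-unmasked-dim {unmasked} exceeds encoder-dim {dim}")
    return lengths.pop()


if __name__ == "__main__":
    print(f"{check_stack_options(PER_STACK_OPTIONS)} encoder stacks")
```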

Decoding Methods

The model supports the following Icefall decoding methods:

greedy_search
modified_beam_search
fast_beam_search

The reported Open ASR Leaderboard result uses modified_beam_search with beam size 6.

Feature Extraction And Resampling

For this model, use Kaldi-style fbank features. kaldifeat and kaldi-native-fbank are the recommended feature extraction backends.

Audio should be mono 16 kHz before feature extraction. For sample-rate conversion, audioop.ratecv is recommended because it matches the resampling path used for evaluation. On Python 3.13 and newer, use an audioop-compatible package such as audioop-lts.

import audioop
import numpy as np


def resample_to_16k_audioop(
    audio: np.typing.NDArray[np.float32],
    source_sample_rate: int,
) -> np.typing.NDArray[np.float32]:
    target_sample_rate = 16000
    max_signed_int = np.float32(32768.0)

    # Expected shape and dtype: (num_samples,), mono float32 audio in [-1.0, 1.0].
    if audio.ndim != 1:
        raise ValueError(f"Expected 1-D mono audio, got shape {audio.shape}.")
    if audio.dtype != np.float32:
        raise TypeError(f"Expected float32 audio, got {audio.dtype}.")

    # Scale first, then clip to the int16 range: +1.0 * 32768 would overflow int16.
    scaled = np.clip(audio * max_signed_int, -32768.0, 32767.0)
    pcm16 = scaled.astype(np.int16).tobytes()
    resampled_pcm16, _ = audioop.ratecv(
        pcm16,
        2,  # int16 sample width in bytes
        1,  # mono
        source_sample_rate,
        target_sample_rate,
        None,
    )
    resampled_audio = np.frombuffer(resampled_pcm16, dtype=np.int16).astype(np.float32)
    resampled_audio /= max_signed_int
    return resampled_audio
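
Once audio is at 16 kHz, the number of fbank frames it yields can be estimated before running feature extraction. The sketch below follows Kaldi's frame-count rule and assumes the common Kaldi defaults of a 25 ms window and 10 ms shift; neither those values nor the snip_edges setting are stated in this card, so treat config.yaml as authoritative.

```python
SAMPLE_RATE = 16000      # required by the model
FRAME_LENGTH_MS = 25.0   # assumption: Kaldi default window length
FRAME_SHIFT_MS = 10.0    # assumption: Kaldi default frame shift


def num_fbank_frames(num_samples: int, snip_edges: bool = False) -> int:
    """Estimate the number of Kaldi-style fbank frames for a 16 kHz signal."""
    window = int(SAMPLE_RATE * FRAME_LENGTH_MS / 1000)  # 400 samples
    shift = int(SAMPLE_RATE * FRAME_SHIFT_MS / 1000)    # 160 samples
    if snip_edges:
        # Only frames that fit entirely inside the signal are kept.
        if num_samples < window:
            return 0
        return 1 + (num_samples - window) // shift
    # Without snipping, the frame count is the rounded number of shifts.
    return (num_samples + shift // 2) // shift


if __name__ == "__main__":
    one_second = SAMPLE_RATE
    print(num_fbank_frames(one_second, snip_edges=True))
    print(num_fbank_frames(one_second, snip_edges=False))
```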

Output Formatting

The model emits normalized English text with punctuation and capitalization. Sentence-initial words are not capitalized automatically; a word is capitalized only when it normally would be, such as a proper noun, honorific, acronym, or similar named expression.

The model also normalizes common written forms, including numbers, dates, and currency, to digit-based forms.

Open ASR Leaderboard Evaluation

The open_asr_leaderboard/soundsgoodai runner downloads this model repo, resamples audio to 16 kHz with audioop, computes fbank features with kaldi-native-fbank, and decodes with Icefall modified_beam_search.

cd open_asr_leaderboard/soundsgoodai
bash run_zipformer.sh
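
For readers unfamiliar with the metric, WER is the word-level edit distance between reference and hypothesis divided by the number of reference words. The function below is a minimal sketch of that definition, not the leaderboard's implementation: the leaderboard applies its own text normalization before scoring, which this example omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the cat sit"))
```

Because this model emits punctuation, capitalization, and digit-based forms, scores computed without normalization will differ from the leaderboard figures.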

Notes

The model expects 16 kHz mono audio.
The packaged decoding configuration uses modified_beam_search with beam size 6.
The checkpoint is an offline ASR model and is not intended for streaming decoding.
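
Before decoding, the input format requirement can be checked with the standard library. This is a minimal sketch using the wave module, which reads uncompressed PCM WAV files only; the function names are illustrative and not part of Icefall.

```python
import wave

EXPECTED_SAMPLE_RATE = 16000
EXPECTED_CHANNELS = 1


def wav_format(path: str) -> tuple:
    """Return (sample_rate, num_channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate(), wav_file.getnchannels()


def is_model_ready(path: str) -> bool:
    """True if the file is already 16 kHz mono, as the model expects."""
    sample_rate, channels = wav_format(path)
    return sample_rate == EXPECTED_SAMPLE_RATE and channels == EXPECTED_CHANNELS
```

Files that fail the check need downmixing or resampling before they are passed to the decoder.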

References

Zipformer paper: Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, and Daniel Povey. "Zipformer: A faster and better encoder for automatic speech recognition." https://arxiv.org/abs/2310.11230
Icefall: https://github.com/k2-fsa/icefall
k2: https://github.com/k2-fsa/k2
kaldifeat: https://csukuangfj.github.io/kaldifeat/intro.html
kaldi-native-fbank: https://github.com/csukuangfj/kaldi-native-fbank
Python audioop.ratecv: https://docs.python.org/3.11/library/audioop.html#audioop.ratecv
audioop-lts for Python 3.13+: https://pypi.org/project/audioop-lts/
ESB datasets: https://huggingface.co/datasets/esb/datasets
LibriSpeech: https://www.openslr.org/12/
Earnings22: https://github.com/revdotcom/speech-datasets/tree/main/earnings22
AMI Meeting Corpus: https://www.idiap.ch/webarchives/sites/www.amiproject.org/ami-scientific-portal/meeting-corpus/
Common Voice Scripted Speech 25.0 - English: https://mozilladatacollective.com/datasets/cmndapwry02jnmh07dyo46mot
Common Voice Spontaneous Speech 3.0 - English: https://datacollective.mozillafoundation.org/datasets/cmn1pv5hi00uto1072y1074y7
GigaSpeech: https://github.com/SpeechColab/GigaSpeech
SPGISpeech: https://huggingface.co/datasets/kensho/spgispeech
TED-LIUM Release 3: https://lium.univ-lemans.fr/en/ted-lium3/
VoxPopuli: https://github.com/facebookresearch/voxpopuli
