Zipformer Transducer XL 290M
Offline English ASR model based on the Icefall/K2 pruned transducer Zipformer recipe. The model is intended for Icefall transducer decoding.
Files
- model.pt: Zipformer transducer model
- bpe.model: SentencePiece BPE model
- tokens.txt: Icefall token table exported from bpe.model
- config.yaml: model architecture, feature extraction, tokenizer, Zipformer model, decoding settings, and Hugging Face Hub download-metrics query file
Evaluation
Open ASR Leaderboard English short-form result, decoded with
modified_beam_search and beam size 6:
| Metric | Value |
|---|---|
| Average WER | 6.79 |
| RTFx | 100.06 |
| Parameters | 288M |
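For readers less familiar with the two metrics, the sketch below shows how they are commonly defined: WER as word-level edit distance divided by the number of reference words, and RTFx as audio duration divided by processing time. It is an illustration only. The function names are hypothetical, the leaderboard applies its own text normalization before scoring (not reproduced here), and the WER values in the table are assumed to be percentages.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("Reference transcript must contain at least one word.")
    # previous[j] holds the edit distance between ref[:i - 1] and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref)


def rtfx(audio_seconds: float, processing_seconds: float) -> float:
    """Inverse real-time factor: seconds of audio decoded per second of compute."""
    return audio_seconds / processing_seconds


# One substituted word out of four reference words.
print(word_error_rate("the cat sat down", "the cat sat town"))  # 0.25
# At RTFx 100.06, one hour of audio takes roughly 36 seconds to decode.
print(round(3600 / 100.06, 1))
```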
Dataset WERs:
| Dataset | WER |
|---|---|
| AMI | 14.43 |
| Earnings22 | 9.38 |
| GigaSpeech | 10.46 |
| LibriSpeech test-clean | 2.08 |
| LibriSpeech test-other | 5.03 |
| SPGISpeech | 2.08 |
| TED-LIUM | 4.05 |
| VoxPopuli | 6.81 |
| Average | 6.79 |
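The Average row is consistent with an unweighted mean over the eight test sets, which the short check below reproduces from the table values. That the leaderboard weights every dataset equally is inferred from the arithmetic, not stated in this card.

```python
# Per-dataset WERs copied from the table above.
dataset_wers = {
    "AMI": 14.43,
    "Earnings22": 9.38,
    "GigaSpeech": 10.46,
    "LibriSpeech test-clean": 2.08,
    "LibriSpeech test-other": 5.03,
    "SPGISpeech": 2.08,
    "TED-LIUM": 4.05,
    "VoxPopuli": 6.81,
}

# Unweighted (macro) average: every dataset counts equally, regardless of size.
average_wer = sum(dataset_wers.values()) / len(dataset_wers)
print(f"{average_wer:.2f}")  # 6.79
```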
Training Data
The model was trained on a combined English training mixture built from the training portions of the datasets below.
| Dataset | Train Hours | Source |
|---|---|---|
| LibriSpeech | 960.0 | ESB datasets |
| Earnings-22 | 105.0 | ESB datasets |
| AMI Meeting Corpus | 78.0 | ESB datasets |
| Common Voice Scripted Speech 25.0 - English | ~1,679.0 | Mozilla Data Collective |
| Common Voice Spontaneous Speech 3.0 - English | ~3.6 | Mozilla Data Collective |
| GigaSpeech XL | 10,000.0 | ESB datasets |
| SPGISpeech | 4,900.0 | ESB datasets |
| TED-LIUM Release 3 | 454.0 | ESB datasets |
| VoxPopuli | 523.0 | ESB datasets |
| Total | ~18,700 | ESB datasets and Mozilla Data Collective |
Common Voice Scripted Speech 25.0 - English and Common Voice Spontaneous Speech 3.0 - English are the English scripted and spontaneous Common Voice releases checked on May 12, 2026. Mozilla Data Collective lists them with March 30, 2026 and March 22, 2026 release dates, respectively.
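As a quick cross-check of the table, the snippet below sums the listed training hours and prints each dataset's share of the mixture. The two Common Voice figures are approximate in the table, so the total is approximate as well. Whether any per-dataset sampling weights were applied during training is not stated here, so the shares reflect raw hours only.

```python
# Training hours copied from the table above (Common Voice values are approximate).
train_hours = {
    "LibriSpeech": 960.0,
    "Earnings-22": 105.0,
    "AMI Meeting Corpus": 78.0,
    "Common Voice Scripted 25.0": 1679.0,
    "Common Voice Spontaneous 3.0": 3.6,
    "GigaSpeech XL": 10000.0,
    "SPGISpeech": 4900.0,
    "TED-LIUM Release 3": 454.0,
    "VoxPopuli": 523.0,
}

total_hours = sum(train_hours.values())
print(f"total: {total_hours:,.1f} h")

# Largest contributors first, with their share of the raw hours.
for name, hours in sorted(train_hours.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<30} {hours:>9,.1f} h  {hours / total_hours:6.1%}")
```

GigaSpeech XL and SPGISpeech together account for roughly four fifths of the raw hours.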
For data normalization, the training transcripts were processed with direct LLM normalization using a self-hosted GLM-4.7 setup, plus custom agentic workflows built in collaboration with Claude Code 4.7 and Codex 5.4.
Training
The model was trained with the k2-fsa/Icefall framework, PyTorch 2.10, and CUDA 13.0 for 35 epochs on 8 NVIDIA B200 GPUs using bf16 automatic mixed precision. Training took approximately 4 days.
Usage With Icefall
From an Icefall checkout, download this model repo locally:
```bash
huggingface-cli download soundsgoodai/Zipformer-transducer-XL-290M \
  --local-dir Zipformer-transducer-XL-290M
```
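After the download, it can be useful to confirm that the files listed in the Files section are present before starting a decoding run. The helper below is a minimal sketch using only the standard library. The function name is hypothetical, and the directory name assumes the `--local-dir` value from the command above.

```python
from pathlib import Path

# File names taken from the Files section of this model card.
EXPECTED_FILES = ("model.pt", "bpe.model", "tokens.txt", "config.yaml")


def missing_model_files(local_dir: str) -> list[str]:
    """Return the expected file names that are not present in local_dir."""
    root = Path(local_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_model_files("Zipformer-transducer-XL-290M")
    print("missing:", missing if missing else "none")
```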
Run offline decoding through Icefall. Input audio must already be 16 kHz:
```bash
cd icefall/egs/librispeech/ASR
PYTHONPATH=../../.. python zipformer/pretrained.py \
  --checkpoint /path/to/Zipformer-transducer-XL-290M/model.pt \
  --tokens /path/to/Zipformer-transducer-XL-290M/tokens.txt \
  --method modified_beam_search \
  --beam-size 6 \
  --num-encoder-layers "2,2,4,5,4,2" \
  --downsampling-factor "1,2,4,8,4,2" \
  --feedforward-dim "512,1024,2048,3072,2048,1024" \
  --num-heads "4,4,4,8,4,4" \
  --encoder-dim "192,384,768,1024,768,384" \
  --encoder-unmasked-dim "192,192,320,384,320,192" \
  --query-head-dim "32" \
  --value-head-dim "12" \
  --pos-head-dim "4" \
  --pos-dim 48 \
  --cnn-module-kernel "31,31,15,15,15,31" \
  --decoder-dim 512 \
  --joiner-dim 512 \
  --context-size 2 \
  --causal false \
  --chunk-size "16,32,64,-1" \
  --left-context-frames "64,128,256,-1" \
  --use-transducer true \
  --use-ctc false \
  --use-attention-decoder false \
  --use-cr-ctc false \
  /path/to/audio_1.wav \
  /path/to/audio_2.wav
```
The architecture and decoding values above are also recorded in
config.yaml.
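Because the encoder is described by several comma-separated lists, a typo in one of them is easy to miss. The sketch below checks that each per-stack option from the command above has either one value or one value per encoder stack. It assumes, based on how the Icefall Zipformer recipe is understood to handle these options, that a single value is broadcast to all stacks. The helper names are hypothetical and this is not part of Icefall.

```python
# Per-stack encoder options copied from the decoding command above.
PER_STACK_OPTIONS = {
    "num-encoder-layers": "2,2,4,5,4,2",
    "downsampling-factor": "1,2,4,8,4,2",
    "feedforward-dim": "512,1024,2048,3072,2048,1024",
    "num-heads": "4,4,4,8,4,4",
    "encoder-dim": "192,384,768,1024,768,384",
    "encoder-unmasked-dim": "192,192,320,384,320,192",
    "query-head-dim": "32",
    "value-head-dim": "12",
    "pos-head-dim": "4",
    "cnn-module-kernel": "31,31,15,15,15,31",
}


def parse_int_list(value: str) -> list[int]:
    """Parse a comma-separated option string such as "2,2,4" into integers."""
    return [int(item) for item in value.split(",")]


def check_per_stack_options(options: dict[str, str]) -> int:
    """Return the number of encoder stacks, raising if any list length disagrees."""
    num_stacks = len(parse_int_list(options["num-encoder-layers"]))
    for name, value in options.items():
        count = len(parse_int_list(value))
        # Assumption: one value means "same for every stack".
        if count not in (1, num_stacks):
            raise ValueError(
                f"--{name} has {count} values, expected 1 or {num_stacks}."
            )
    return num_stacks


num_stacks = check_per_stack_options(PER_STACK_OPTIONS)
total_layers = sum(parse_int_list(PER_STACK_OPTIONS["num-encoder-layers"]))
print(num_stacks, total_layers)  # 6 19
```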
Decoding Methods
The model supports the following Icefall decoding methods:
- greedy_search
- modified_beam_search
- fast_beam_search
The reported Open ASR Leaderboard result uses modified_beam_search with beam
size 6.
Feature Extraction And Resampling
For this model, use Kaldi-style fbank features. kaldifeat and
kaldi-native-fbank are the recommended feature extraction backends.
Audio should be mono 16 kHz before feature extraction. For sample-rate
conversion, audioop.ratecv is recommended because it matches the resampling
path used for evaluation. On Python 3.13 and newer, use an audioop-compatible
package such as audioop-lts.
```python
import audioop

import numpy as np
import numpy.typing as npt


def resample_to_16k_audioop(
    audio: npt.NDArray[np.float32],
    source_sample_rate: int,
) -> npt.NDArray[np.float32]:
    target_sample_rate = 16000
    max_signed_int = np.float32(32768.0)

    # Expected shape and dtype: (num_samples,), mono float32 audio in [-1.0, 1.0].
    if audio.ndim != 1:
        raise ValueError(f"Expected 1-D mono audio, got shape {audio.shape}.")
    if audio.dtype != np.float32:
        raise TypeError(f"Expected float32 audio, got {audio.dtype}.")

    # Scale to the int16 range, then clip so that +1.0 maps to 32767 instead of
    # overflowing int16.
    scaled = np.clip(audio * max_signed_int, -32768.0, 32767.0)
    pcm16 = scaled.astype(np.int16).tobytes()

    resampled_pcm16, _ = audioop.ratecv(
        pcm16,
        2,  # int16 sample width in bytes
        1,  # mono
        source_sample_rate,
        target_sample_rate,
        None,  # no resampler state carried over from a previous call
    )

    # Convert back to float32 in [-1.0, 1.0).
    resampled_audio = np.frombuffer(resampled_pcm16, dtype=np.int16).astype(np.float32)
    resampled_audio /= max_signed_int
    return resampled_audio
```
Output Formatting
The model emits normalized English text with punctuation and capitalization. It does not automatically capitalize the first word of every sentence unless that word is normally capitalized, such as a proper noun, honorific, acronym, or similar named expression.
The model also normalizes common written forms, including numbers, dates, and currency, to digit-based forms.
Open ASR Leaderboard Evaluation
The open_asr_leaderboard/soundsgoodai runner downloads this model repo,
resamples audio to 16 kHz with audioop, computes fbank features with
kaldi-native-fbank, and decodes with Icefall modified_beam_search.
```bash
cd open_asr_leaderboard/soundsgoodai
bash run_zipformer.sh
```
Notes
- The model expects 16 kHz mono audio.
- The packaged decoding configuration uses modified_beam_search with beam size 6.
- The checkpoint is an offline ASR model and is not intended for streaming decoding.
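Since the model expects 16 kHz mono input, a header check before decoding can catch mismatched files early. The following is a minimal sketch using Python's standard `wave` module, which reads only uncompressed PCM WAV files. The function name is hypothetical and is not part of Icefall.

```python
import wave


def check_wav_format(path: str, expected_rate: int = 16000) -> list[str]:
    """Return a list of format problems; an empty list means the header looks fine."""
    problems = []
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
    if rate != expected_rate:
        problems.append(f"sample rate is {rate} Hz, expected {expected_rate} Hz")
    if channels != 1:
        problems.append(f"found {channels} channels, expected mono")
    return problems
```

A file that reports a different sample rate would need resampling first, and a multi-channel file would need to be downmixed to mono.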
References
- Zipformer paper: Zengwei Yao, Liyong Guo, Xiaoyu Yang, Wei Kang, Fangjun Kuang, Yifan Yang, Zengrui Jin, Long Lin, and Daniel Povey. "Zipformer: A faster and better encoder for automatic speech recognition." https://arxiv.org/abs/2310.11230
- Icefall: https://github.com/k2-fsa/icefall
- k2: https://github.com/k2-fsa/k2
- kaldifeat: https://csukuangfj.github.io/kaldifeat/intro.html
- kaldi-native-fbank: https://github.com/csukuangfj/kaldi-native-fbank
- Python audioop.ratecv: https://docs.python.org/3.11/library/audioop.html#audioop.ratecv
- audioop-lts for Python 3.13+: https://pypi.org/project/audioop-lts/
- ESB datasets: https://huggingface.co/datasets/esb/datasets
- LibriSpeech: https://www.openslr.org/12/
- Earnings22: https://github.com/revdotcom/speech-datasets/tree/main/earnings22
- AMI Meeting Corpus: https://www.idiap.ch/webarchives/sites/www.amiproject.org/ami-scientific-portal/meeting-corpus/
- Common Voice Scripted Speech 25.0 - English: https://mozilladatacollective.com/datasets/cmndapwry02jnmh07dyo46mot
- Common Voice Spontaneous Speech 3.0 - English: https://datacollective.mozillafoundation.org/datasets/cmn1pv5hi00uto1072y1074y7
- GigaSpeech: https://github.com/SpeechColab/GigaSpeech
- SPGISpeech: https://huggingface.co/datasets/kensho/spgispeech
- TED-LIUM Release 3: https://lium.univ-lemans.fr/en/ted-lium3/
- VoxPopuli: https://github.com/facebookresearch/voxpopuli