X-ASR zh-TW/en streaming — native Traditional (deployed-quality, no post-processing)

Streaming zh/en code-switch ASR that outputs native Traditional Chinese (zh-TW) + English directly — no runtime OpenCC step — at the same accuracy as the deployed X-ASR model + OpenCC s2twp.

This is the deployed X-ASR 480 ms int8 streaming punct model (encoder.int8.onnx / decoder.onnx / joiner.int8.onnx, unchanged) with its tokens.txt relabeled: each Chinese token surface is mapped Simplified→Taiwan-Traditional with OpenCC s2twp. The model emits the same token IDs as the deployed model; the relabeled tokenizer renders them as Traditional. The OpenCC conversion is baked into the tokenizer, so there is zero post-processing at inference time and zero speed cost (it is the deployed model).

Accuracy (500 Common Voice 17 zh-TW clips, Traditional CER)

Pipeline	Traditional CER	Runtime OpenCC
deployed X-ASR + OpenCC `s2twp`	0.0683	yes (post-step)
this model (native Traditional)	0.0675	none
prior native demo (weak base, relabel)	0.137	none

It matches deployed + s2twp (0.0675 vs 0.0683, within bootstrap noise) and is ~2× better than the earlier native demo, which was built on a weak base checkpoint. Recognition is identical to the deployed model (orthography-neutralized CER ≈ 0.064); s2twp adds essentially no orthography error, so a native model cannot do better — this reaches that ceiling while removing the post-step. Real-time on a Jetson Nano at 2 CPU threads.

Usage (sherpa-onnx)

import sherpa_onnx
rec = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="tokens.txt", encoder="encoder.int8.onnx",
    decoder="decoder.onnx", joiner="joiner.int8.onnx",
    num_threads=2, provider="cpu", decoding_method="greedy_search")
# feed 16 kHz mono audio; pad ~2 s trailing silence to flush the streaming chunk.
# output is Traditional zh-TW + English, no OpenCC needed.

Caveats

Rare context-dependent one-to-many characters (e.g. 拉麵 vs 麵/面) use OpenCC's default per-character mapping; net accuracy is equal to phrase-aware s2twp on everyday speech.
Recipe, benchmarks, and the negative results behind this design (fine-tuning can't beat the deployed model; a native model can only match it): github.com/vieenrose/jetson-stt.

Downloads last month: -; Downloads are not tracked for this model. How to track

Luigi
/

x-asr-zh-tw-en-streaming-native-demo

X-ASR zh-TW/en streaming — native Traditional (deployed-quality, no post-processing)

Accuracy (500 Common Voice 17 zh-TW clips, Traditional CER)

Usage (sherpa-onnx)

Caveats

Space using Luigi/x-asr-zh-tw-en-streaming-native-demo 1