X-ASR zh-TW/en streaming — native Traditional (deployed-quality, no post-processing)
Streaming zh/en code-switch ASR that outputs native Traditional Chinese (zh-TW) + English directly —
no runtime OpenCC step — at the same accuracy as the deployed X-ASR model + OpenCC s2twp.
This is the deployed X-ASR 480 ms int8 streaming punct model (encoder.int8.onnx / decoder.onnx /
joiner.int8.onnx, unchanged) with its tokens.txt relabeled: each Chinese token surface is mapped
Simplified→Taiwan-Traditional with OpenCC s2twp. The model emits the same token IDs as the deployed model;
the relabeled tokenizer renders them as Traditional. The OpenCC conversion is baked into the tokenizer, so
there is zero post-processing at inference time and zero speed cost (it is the deployed model).
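The relabeling step described above can be sketched as follows. This is a minimal illustration, not the author's actual recipe: it assumes tokens.txt uses the usual one-entry-per-line `<surface> <id>` layout, that the `opencc` Python package is installed, and the file name `tokens.simplified.txt` and the helper names `relabel_tokens` and `main` are hypothetical.

```python
from typing import Callable, Iterable, List


def relabel_tokens(lines: Iterable[str], convert: Callable[[str], str]) -> List[str]:
    """Convert each token surface while leaving its integer ID untouched."""
    out = []
    for line in lines:
        line = line.rstrip("\n")
        if not line.strip():
            continue  # skip blank lines
        # Assumed layout: "<surface> <id>"; split on the last whitespace run.
        surface, token_id = line.rsplit(maxsplit=1)
        out.append(f"{convert(surface)} {token_id}")
    return out


def main(src: str = "tokens.simplified.txt", dst: str = "tokens.txt") -> None:
    # Assumption: the `opencc` package provides OpenCC("s2twp").convert.
    from opencc import OpenCC

    cc = OpenCC("s2twp")
    with open(src, encoding="utf-8") as f:
        relabeled = relabel_tokens(f, cc.convert)
    with open(dst, "w", encoding="utf-8") as f:
        f.write("\n".join(relabeled) + "\n")
```

Because only the surface strings change and the IDs stay fixed, the ONNX files need no modification; the recognizer simply looks up a different string for each ID it emits.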
Accuracy (500 Common Voice 17 zh-TW clips, Traditional CER)
| Pipeline | Traditional CER | Runtime OpenCC |
|---|---|---|
| deployed X-ASR + OpenCC s2twp | 0.0683 | yes (post-step) |
| this model (native Traditional) | 0.0675 | none |
| prior native demo (weak base, relabel) | 0.137 | none |
This model matches deployed + s2twp (0.0675 vs 0.0683, within bootstrap noise) and roughly halves the CER of the
earlier native demo (0.137), which was built on a weak base checkpoint. Recognition is identical to the deployed model
(orthography-neutralized CER ≈ 0.064). Because s2twp adds essentially no orthography error, a native model cannot
do better than deployed + s2twp; this model reaches that ceiling while removing the post-step. It runs in real time on a Jetson Nano with 2 CPU threads.
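For readers who want to reproduce this kind of comparison, here is a minimal sketch of corpus-level CER with a percentile bootstrap interval. It is not necessarily the procedure used for the numbers above: the function names, the resample count, and the `normalize` hook are assumptions for illustration. Passing a script converter as `normalize` is one plausible way to get an orthography-neutralized score.

```python
import random
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (reference, hypothesis)


def edit_distance(ref: str, hyp: str) -> int:
    """Character-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def corpus_cer(pairs: Sequence[Pair], normalize: Callable[[str], str] = lambda s: s) -> float:
    """Total edit errors divided by total reference characters."""
    errors = sum(edit_distance(normalize(r), normalize(h)) for r, h in pairs)
    chars = sum(len(normalize(r)) for r, _ in pairs)
    return errors / chars


def bootstrap_ci(pairs: Sequence[Pair], n_resamples: int = 1000, seed: int = 0) -> Tuple[float, float]:
    """95% percentile interval from resampling clips with replacement."""
    rng = random.Random(seed)
    stats: List[float] = sorted(
        corpus_cer([rng.choice(pairs) for _ in pairs]) for _ in range(n_resamples)
    )
    lo = stats[int(0.025 * n_resamples)]
    hi = stats[min(n_resamples - 1, int(0.975 * n_resamples))]
    return lo, hi
```

Two pipelines whose intervals overlap heavily, as one would expect for 0.0675 vs 0.0683 on 500 clips, cannot be separated by this test set alone.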
Usage (sherpa-onnx)
import sherpa_onnx
rec = sherpa_onnx.OnlineRecognizer.from_transducer(
tokens="tokens.txt", encoder="encoder.int8.onnx",
decoder="decoder.onnx", joiner="joiner.int8.onnx",
num_threads=2, provider="cpu", decoding_method="greedy_search")
# feed 16 kHz mono audio; pad ~2 s trailing silence to flush the streaming chunk.
# output is Traditional zh-TW + English, no OpenCC needed.
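The feeding and flushing steps in the comments above might look like the sketch below. It assumes a 16-bit mono 16 kHz WAV file and uses the sherpa-onnx streaming calls `create_stream`, `accept_waveform`, `input_finished`, `is_ready`, `decode_stream`, and `get_result`; the helper names `trailing_silence_samples` and `transcribe_wav` are hypothetical, and the 2-second pad simply follows the hint above.

```python
import wave


def trailing_silence_samples(sample_rate: int, seconds: float = 2.0) -> int:
    """Number of zero samples to append so the last streaming chunk is flushed."""
    return int(round(sample_rate * seconds))


def transcribe_wav(recognizer, path: str, pad_seconds: float = 2.0) -> str:
    """Decode one WAV file with an already constructed OnlineRecognizer."""
    import numpy as np  # imported lazily so the helper above has no dependencies

    with wave.open(path, "rb") as f:
        if f.getnchannels() != 1 or f.getsampwidth() != 2 or f.getframerate() != 16000:
            raise ValueError("expected 16 kHz, mono, 16-bit PCM audio")
        sample_rate = f.getframerate()
        pcm = np.frombuffer(f.readframes(f.getnframes()), dtype=np.int16)

    samples = pcm.astype(np.float32) / 32768.0  # scale to [-1, 1)
    silence = np.zeros(trailing_silence_samples(sample_rate, pad_seconds), dtype=np.float32)

    stream = recognizer.create_stream()
    stream.accept_waveform(sample_rate, samples)
    stream.accept_waveform(sample_rate, silence)  # flush the final chunk
    stream.input_finished()
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    return recognizer.get_result(stream)
```

With the `rec` object built above, `transcribe_wav(rec, "clip.wav")` would return the transcript as a string; `clip.wav` is a placeholder file name.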
Caveats
- Rare context-dependent one-to-many characters (e.g. 拉麵 vs 麵/面) use OpenCC's default per-character
mapping; net accuracy is equal to phrase-aware s2twp on everyday speech.
- Recipe, benchmarks, and the negative results behind this design (fine-tuning can't beat the deployed model; a native model can only match it): github.com/vieenrose/jetson-stt.