stt_ar_fastconformer_hybrid_large_streaming_pcd_v1.1-sherpa
Streaming Arabic FastConformer (RNNT decoder), exported as the sherpa-onnx
3-file split so sherpa-onnx's
OnlineRecognizer can run it on iOS / Android / desktop with cache-aware
encoder state propagation.
Mobile counterpart of …-v1.1-mirror; see the mirror repo's README for the
training recipe, dataset, and results
(in short: val_wer 1.50% at epoch 7, ~9× better than v1's 13.23%).
Geometry
att_context_size = [70, 13]
- Left context: 5.6 s of past audio (memory window, no extra latency)
- Right context (lookahead): 1.04 s, a constant emission delay
- Encoder subsampling: 8× (10 ms frames → 80 ms output steps)
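The durations above follow from the context sizes and the 80 ms encoder step. A minimal sketch of that arithmetic, assuming att_context_size counts encoder output steps (the constant and function names are illustrative, not part of any library):

```python
# Values taken from the Geometry list; names here are illustrative only.
FRAME_MS = 10                # input feature frame shift
SUBSAMPLING = 8              # encoder subsampling factor
ATT_CONTEXT_SIZE = (70, 13)  # (left, right), assumed to be in encoder output steps


def context_seconds(att_context_size, frame_ms=FRAME_MS, subsampling=SUBSAMPLING):
    """Convert (left, right) context in encoder steps to seconds."""
    step_ms = frame_ms * subsampling  # 80 ms per encoder output step
    left, right = att_context_size
    return left * step_ms / 1000, right * step_ms / 1000


left_s, right_s = context_seconds(ATT_CONTEXT_SIZE)
print(f"left={left_s:.2f} s, right={right_s:.2f} s")  # left=5.60 s, right=1.04 s
```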
Files
| file | size | purpose |
|---|---|---|
| encoder.onnx | 456 MB | streaming encoder with cache I/O slots |
| decoder.onnx | 16 MB | RNNT prediction network |
| joiner.onnx | 5.6 MB | joint network |
| tokens.txt | 13 KB | SentencePiece vocab (token\tid per line) |
| silero_vad.onnx | 2.3 MB | bundled Silero VAD (offline-fallback path) |
| STREAMING.marker | <1 KB | flag file so consumers can detect streaming bundle |
| README.md | n/a | this file |
Total: ~480 MB. Weights are fp32; int8 quantization (QDQ static, calibrated) is on the roadmap.
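A consumer could check a downloaded bundle along these lines. This is only a sketch: the file names come from the table above, while the helper name `inspect_bundle` and its return shape are assumptions made for illustration.

```python
from pathlib import Path

# File names taken from the Files table.
REQUIRED_FILES = ("encoder.onnx", "decoder.onnx", "joiner.onnx", "tokens.txt")
MARKER_FILE = "STREAMING.marker"


def inspect_bundle(bundle_dir):
    """Return (is_streaming, missing) for a directory holding this bundle.

    is_streaming: True if the STREAMING.marker flag file is present.
    missing: names of required model files that were not found.
    """
    root = Path(bundle_dir)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    is_streaming = (root / MARKER_FILE).is_file()
    return is_streaming, missing
```

An app might call `inspect_bundle(model_dir)` after download and fall back to a non-streaming path when the marker is absent.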
Usage (Dart / sherpa-onnx)
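import 'package:sherpa_onnx/sherpa_onnx.dart' as sherpa;

// Call sherpa.initBindings() once at startup before creating any recognizer.
final transducer = sherpa.OnlineTransducerModelConfig(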
encoder: 'encoder.onnx',
decoder: 'decoder.onnx',
joiner: 'joiner.onnx',
);
final model = sherpa.OnlineModelConfig(
transducer: transducer,
tokens: 'tokens.txt',
modelType: 'transducer',
provider: 'cpu',
);
final recognizer = sherpa.OnlineRecognizer(sherpa.OnlineRecognizerConfig(
model: model,
decodingMethod: 'greedy_search', // beam=1
enableEndpoint: true,
rule1MinTrailingSilence: 2.4,
rule2MinTrailingSilence: 1.2,
rule3MinUtteranceLength: 30.0,
));
final stream = recognizer.createStream();
stream.acceptWaveform(samples: micFrame, sampleRate: 16000);
while (recognizer.isReady(stream)) recognizer.decode(stream);
print(recognizer.getResult(stream).text);
Usage (Python / sherpa-onnx)
import sherpa_onnx
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
encoder="encoder.onnx",
decoder="decoder.onnx",
joiner="joiner.onnx",
tokens="tokens.txt",
num_threads=2,
provider="cpu",
decoding_method="greedy_search",
)
stream = recognizer.create_stream()
# stream.accept_waveform(16000, samples) # repeatedly
# while recognizer.is_ready(stream): recognizer.decode_stream(stream)
# print(recognizer.get_result(stream))
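The commented-out loop above can be wrapped in a small helper that also splits the output at endpoints. This is a sketch, not a reference implementation: it assumes the recognizer was created with `enable_endpoint_detection=True`, that `chunks` yields float32 sample arrays at 16 kHz, and the helper name `transcribe_chunks` is made up for illustration.

```python
def transcribe_chunks(recognizer, chunks, sample_rate=16000):
    """Feed audio chunks to one stream and collect text at each endpoint.

    `recognizer` is expected to behave like sherpa_onnx.OnlineRecognizer;
    `chunks` is any iterable of sample arrays (floats in [-1, 1]).
    """
    stream = recognizer.create_stream()
    segments = []
    for chunk in chunks:
        stream.accept_waveform(sample_rate, chunk)
        # Decode everything that is ready before looking at the result.
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
        if recognizer.is_endpoint(stream):
            text = recognizer.get_result(stream)
            if text:
                segments.append(text)
            recognizer.reset(stream)  # start a fresh utterance on the same stream
    # Keep whatever was decoded after the last endpoint.
    tail = recognizer.get_result(stream)
    if tail:
        segments.append(tail)
    return segments
```

Because of the 1.04 s lookahead, the last words of a recording may only appear after some trailing silence has been fed in, so callers may want to append a short block of zeros before reading the final result.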
Why a separate joiner.onnx?
sherpa-onnx's OnlineTransducerModelConfig calls the joiner separately
from the prediction network so the predictor state can be carried between
frames without re-running the joint. NeMo's default export bundles them
into a single decoder_joint.onnx; this repo's files were exported via
NeMo's per-submodule .export() to get the three-way split.
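To illustrate why the split is convenient, here is a generic greedy transducer loop in which the joint runs for every frame but the prediction network runs only when a token is emitted. It is a simplified sketch, not sherpa-onnx's actual code: `predictor` and `joiner` are hypothetical callables standing in for decoder.onnx and joiner.onnx, and `blank_id=0` is an assumption.

```python
def greedy_transducer_decode(encoder_frames, predictor, joiner,
                             blank_id=0, max_symbols_per_frame=10):
    """Greedy RNNT search over already-computed encoder frames.

    predictor(token_id, state) -> (predictor_out, new_state)
    joiner(encoder_frame, predictor_out) -> list of logits over the vocabulary
    Both callables are hypothetical stand-ins for the exported models.
    """
    tokens = []
    pred_out, state = predictor(blank_id, None)  # prime the predictor with blank
    for frame in encoder_frames:
        for _ in range(max_symbols_per_frame):
            logits = joiner(frame, pred_out)
            best = max(range(len(logits)), key=logits.__getitem__)
            if best == blank_id:
                break  # blank: advance to the next frame, predictor output reused
            tokens.append(best)
            # Non-blank: only now is the prediction network run again.
            pred_out, state = predictor(best, state)
    return tokens
```

With a fused decoder_joint.onnx, each of those joint evaluations would also have to pass through the prediction network.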
Decoding mode
Greedy only (beam = 1). Murattil's product invariant is faithful acoustic transcription so that the on-device mistake-detection layer sees what the user actually said, not what an LM thinks they meant.
Limitations
See the …-v1.1-mirror README for the full list. Most relevant for mobile:
- The fp32 encoder alone is 456 MB (~480 MB for the full bundle). Wi-Fi is recommended for first launch.
- Quran-only training. Don't expect strong results on general MSA or dialect.
- No mujawwad coverage, by design.
License
CC-BY-4.0 (inherits NVIDIA's base model license).