stt_ar_fastconformer_hybrid_large_streaming_pcd_v1.1-sherpa

Streaming Arabic FastConformer (RNNT decoder), exported as the sherpa-onnx 3-file split so sherpa-onnx's OnlineRecognizer can run it on iOS / Android / desktop with cache-aware encoder state propagation.

Mobile counterpart of …-v1.1-mirror. See the mirror repo's README for the training recipe, dataset, and results (in short: val_wer 1.50% at epoch 7, roughly 9× lower than v1's 13.23%).

Geometry

  • att_context_size = [70, 13]
  • Left context: 70 steps × 80 ms = 5.6 s of past audio (memory window, no extra latency)
  • Right context (lookahead): 13 steps × 80 ms = 1.04 s, a constant emission delay
  • Encoder subsampling: 8× (10 ms frames → 80 ms output steps)
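
The durations above follow from the context sizes and the encoder step. A minimal sketch of the arithmetic, assuming att_context_size is counted in 80 ms encoder output steps (the variable names are illustrative, not part of any API):

```python
# Assumption: att_context_size is expressed in encoder output steps.
ATT_CONTEXT_SIZE = (70, 13)   # (left, right) from this model card
INPUT_FRAME_MS = 10           # feature frame shift
SUBSAMPLING = 8               # encoder subsampling factor

step_ms = INPUT_FRAME_MS * SUBSAMPLING          # 80 ms per encoder step
left_s = ATT_CONTEXT_SIZE[0] * step_ms / 1000   # past audio visible to attention
right_s = ATT_CONTEXT_SIZE[1] * step_ms / 1000  # lookahead, i.e. emission delay

print(f"step={step_ms} ms, left={left_s} s, right={right_s} s")
```

Only the right context adds latency: the encoder has to wait for that much future audio before it can emit a step.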

Files

| File | Size | Purpose |
|---|---|---|
| encoder.onnx | 456 MB | streaming encoder with cache I/O slots |
| decoder.onnx | 16 MB | RNNT prediction network |
| joiner.onnx | 5.6 MB | joint network |
| tokens.txt | 13 KB | SentencePiece vocab (token\tid per line) |
| silero_vad.onnx | 2.3 MB | bundled Silero VAD (offline-fallback path) |
| STREAMING.marker | <1 KB | flag file so consumers can detect a streaming bundle |
| README.md | - | this file |

Total: ~480 MB. Weights are fp32; int8 quantization (QDQ static, calibrated) is on the roadmap.
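
A consumer might check a downloaded bundle before handing it to sherpa-onnx. The sketch below uses only the file names listed above; the helper names (missing_files, is_streaming_bundle, load_tokens) are hypothetical, and the tokens.txt parser assumes one token/id pair per line separated by whitespace.

```python
from pathlib import Path

# File names taken from the Files section; the helper names are hypothetical.
REQUIRED_FILES = ("encoder.onnx", "decoder.onnx", "joiner.onnx", "tokens.txt")
STREAMING_MARKER = "STREAMING.marker"


def missing_files(bundle_dir):
    """Return the required model files that are absent from bundle_dir."""
    root = Path(bundle_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


def is_streaming_bundle(bundle_dir):
    """True when the bundle is complete and carries the streaming flag file."""
    root = Path(bundle_dir)
    return not missing_files(root) and (root / STREAMING_MARKER).is_file()


def load_tokens(tokens_path):
    """Parse tokens.txt into {token: id}, assuming one token/id pair per line."""
    table = {}
    with open(tokens_path, encoding="utf-8") as f:
        for line in f:
            line = line.rstrip("\n")
            if not line:
                continue
            # Split on the last whitespace run: everything before it is the token.
            token, token_id = line.rsplit(None, 1)
            table[token] = int(token_id)
    return table
```

Checking for the marker first lets an app fall back to its offline path when the bundle is not a streaming one.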

Usage (Dart / sherpa-onnx)

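import 'package:sherpa_onnx/sherpa_onnx.dart' as sherpa;

// Assumption: sherpa.initBindings() has already been called once at app startup.
final transducer = sherpa.OnlineTransducerModelConfig(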
  encoder: 'encoder.onnx',
  decoder: 'decoder.onnx',
  joiner:  'joiner.onnx',
);
final model = sherpa.OnlineModelConfig(
  transducer: transducer,
  tokens: 'tokens.txt',
  modelType: 'transducer',
  provider: 'cpu',
);
final recognizer = sherpa.OnlineRecognizer(sherpa.OnlineRecognizerConfig(
  model: model,
  decodingMethod: 'greedy_search', // beam=1
  enableEndpoint: true,
  rule1MinTrailingSilence: 2.4,
  rule2MinTrailingSilence: 1.2,
  rule3MinUtteranceLength: 30.0,
));
final stream = recognizer.createStream();
// micFrame: a Float32List of 16 kHz mono samples in [-1, 1]; repeat for each captured buffer.
stream.acceptWaveform(samples: micFrame, sampleRate: 16000);
while (recognizer.isReady(stream)) {
  recognizer.decode(stream);
}
print(recognizer.getResult(stream).text);
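
The three endpoint rules in the config above can be read as follows. This is a simplified Python model of how sherpa-onnx's endpointing is documented to behave, not its actual implementation; the function and parameter names are illustrative.

```python
def is_endpoint(trailing_silence_s, utterance_length_s, decoded_something,
                rule1_min_trailing_silence=2.4,
                rule2_min_trailing_silence=1.2,
                rule3_min_utterance_length=30.0):
    """Simplified model of the three endpoint rules (defaults from the config above)."""
    # Rule 1: long silence, even if nothing has been decoded yet.
    rule1 = trailing_silence_s >= rule1_min_trailing_silence
    # Rule 2: shorter silence, but only after some text has been decoded.
    rule2 = bool(decoded_something) and trailing_silence_s >= rule2_min_trailing_silence
    # Rule 3: hard cap on utterance length, regardless of silence.
    rule3 = utterance_length_s >= rule3_min_utterance_length
    return rule1 or rule2 or rule3


# A 1.5 s pause ends the utterance only once something has been recognized.
print(is_endpoint(1.5, 3.0, decoded_something=False))
print(is_endpoint(1.5, 3.0, decoded_something=True))
```

With these values, a reciter who pauses for 1.2 s after speaking closes the current segment, while leading silence is tolerated for up to 2.4 s. After an endpoint fires, an app would typically read the result and reset the stream before continuing.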

Usage (Python / sherpa-onnx)

import sherpa_onnx
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    encoder="encoder.onnx",
    decoder="decoder.onnx",
    joiner="joiner.onnx",
    tokens="tokens.txt",
    num_threads=2,
    provider="cpu",
    decoding_method="greedy_search",
)
stream = recognizer.create_stream()
# stream.accept_waveform(16000, samples)  # repeatedly
# while recognizer.is_ready(stream): recognizer.decode_stream(stream)
# print(recognizer.get_result(stream))
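
The commented-out calls above can be wrapped into a small chunked feeding loop. This is a sketch, assuming `recognizer` is the sherpa_onnx.OnlineRecognizer built above and `samples` is a 1-D float32 NumPy array of 16 kHz mono audio scaled to [-1, 1]; iter_chunks and transcribe are hypothetical helper names, and the 100 ms chunk size is an arbitrary choice.

```python
def iter_chunks(samples, sample_rate=16000, chunk_ms=100):
    """Yield consecutive slices of `samples`, each at most chunk_ms long."""
    chunk_size = sample_rate * chunk_ms // 1000
    for start in range(0, len(samples), chunk_size):
        yield samples[start:start + chunk_size]


def transcribe(recognizer, samples, sample_rate=16000, chunk_ms=100):
    """Feed audio chunk by chunk, decoding whenever enough frames are buffered."""
    stream = recognizer.create_stream()
    for chunk in iter_chunks(samples, sample_rate, chunk_ms):
        stream.accept_waveform(sample_rate, chunk)
        while recognizer.is_ready(stream):
            recognizer.decode_stream(stream)
    # Signal end of audio so any remaining buffered frames can be decoded.
    stream.input_finished()
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    return recognizer.get_result(stream)
```

For live microphone input the same loop applies, with each captured buffer passed to accept_waveform as it arrives instead of slicing a preloaded array.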

Why a separate joiner.onnx?

sherpa-onnx's OnlineTransducerModelConfig loads the joiner separately from the prediction network. During decoding the joiner runs on every encoder frame, while the prediction network's output is carried across frames and recomputed only after a non-blank token is emitted. NeMo's default export bundles the two into a single decoder_joint.onnx; this repo's files were exported via NeMo's per-submodule .export() to get the three-way split.
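
To make the split concrete, here is a toy greedy transducer loop in Python. It is an illustration under simplifying assumptions (at most one symbol per encoder frame, stand-in callables instead of ONNX sessions), not sherpa-onnx's code. It shows the joiner running on every frame while the prediction network runs only after a non-blank token.

```python
def greedy_search(encoder_frames, decoder, joiner, blank_id=0):
    """Toy greedy RNNT decoding; returns (tokens, decoder_calls, joiner_calls)."""
    tokens = []
    decoder_out = decoder(tokens)      # prediction network on the empty history
    decoder_calls, joiner_calls = 1, 0
    for frame in encoder_frames:
        logits = joiner(frame, decoder_out)   # joiner runs on every frame
        joiner_calls += 1
        best = max(range(len(logits)), key=logits.__getitem__)
        if best != blank_id:
            tokens.append(best)
            decoder_out = decoder(tokens)     # refresh only after a new token
            decoder_calls += 1
    return tokens, decoder_calls, joiner_calls


# Stand-in models: each "frame" is simply the token id it should emit (0 = blank).
def toy_decoder(history):
    return len(history)


def toy_joiner(frame, decoder_out):
    return [1.0 if i == frame else 0.0 for i in range(4)]


result = greedy_search([0, 2, 0, 0, 3, 0], toy_decoder, toy_joiner)
print(result)
```

In the toy run, six frames trigger six joiner calls but only three prediction-network calls (one initial, one per emitted token), which is the saving a separate joiner.onnx makes possible.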

Decoding mode

Greedy only (beam = 1). Murattil's product invariant is faithful acoustic transcription so that the on-device mistake-detection layer sees what the user actually said, not what an LM thinks they meant.

Limitations

See the …-v1.1-mirror README for the full list. Most relevant for mobile:

  • The fp32 encoder alone is 456 MB (about 480 MB for the full bundle), so Wi-Fi is recommended for the first-launch download.
  • Quran-only training. Don't expect strong results on general MSA or dialect.
  • No mujawwad coverage, by design.

License

CC-BY-4.0 (inherits NVIDIA's base model license).
