How to use from the NeMo library
import nemo.collections.asr as nemo_asr
asr_model = nemo_asr.models.ASRModel.from_pretrained("Reza2kn/visualears-fastconformer-fa32m-streaming-bpe1024-litert-fp")

transcriptions = asr_model.transcribe(["file.wav"])

VisualEars FA32M Streaming BPE1024 - LiteRT FP

LiteRT/TFLite fixed-frame acoustic CTC-core export of Reza2kn/visualears-fastconformer-fa32m-streaming-bpe1024.

This is the FA32M length-aware core: it accepts precomputed NeMo-compatible log-mel features plus the real valid feature length, so short utterances do not get decoded as if all 2005 padded frames were valid.

Runtime contract

  • input 0 (serving_default_args_0): processed_signal float32 [1, 80, 2005]
  • input 1 (serving_default_args_1): processed_signal_length int64 [1]: valid log-mel frame count before zero padding
  • output 0 (serving_default_output_0_output): logits float32 [1, 252, 1025]
  • output 1 (serving_default_output_1_output): encoded_lengths int64 [1]
  • tokenizer blank id: 1024
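
The contract above can be exercised roughly as follows. This is a minimal sketch, not the repository's own runner: `run_litert` and `ctc_greedy_collapse` are hypothetical helper names, tensors are picked by rank rather than by name (an assumption that holds only while the model has one rank-3 and one rank-1 tensor on each side), and decoding is plain greedy CTC with blank id 1024. Mapping the resulting ids to text requires tokens.json, whose schema is not shown here.

```python
def ctc_greedy_collapse(frame_ids, valid_length, blank_id=1024):
    """Greedy CTC: merge repeated ids, then drop blanks, over the valid frames only."""
    tokens = []
    previous = blank_id
    for token_id in frame_ids[:valid_length]:
        # Emit a token only when it differs from the previous frame and is not blank.
        if token_id != previous and token_id != blank_id:
            tokens.append(int(token_id))
        previous = token_id
    return tokens


def run_litert(model_path, features, valid_frames):
    """Run one padded utterance; features must already have shape [1, 80, 2005]."""
    # Imported lazily so the decoding helper above works without these packages.
    import numpy as np
    from ai_edge_litert.interpreter import Interpreter

    interpreter = Interpreter(model_path=model_path)
    interpreter.allocate_tensors()

    for detail in interpreter.get_input_details():
        if len(detail["shape"]) == 3:
            # processed_signal: float32 [1, 80, 2005]
            value = np.asarray(features, dtype=np.float32)
        else:
            # processed_signal_length: int64 [1], frames before zero padding
            value = np.asarray([valid_frames], dtype=np.int64)
        interpreter.set_tensor(detail["index"], value)

    interpreter.invoke()

    # Assumption: logits is the only rank-3 output, encoded_lengths the only rank-1 output.
    by_rank = {
        len(detail["shape"]): interpreter.get_tensor(detail["index"])
        for detail in interpreter.get_output_details()
    }
    logits, encoded_lengths = by_rank[3], by_rank[1]

    frame_ids = logits[0].argmax(axis=-1).tolist()
    return ctc_greedy_collapse(frame_ids, int(encoded_lengths[0]))
```

Truncating at encoded_lengths before collapsing is what keeps frames that correspond to zero padding out of the transcript.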

Artifact

  • File: fastconformer_fa32m_ctc_fixed2005_len_fp.tflite
  • Size: 110,574,440 bytes
  • SHA256: ae671928398d98ad86a67926d60c53b8885e2224ba8b7beea3318718afb9bb84
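
A downloaded copy can be checked against the size and digest listed above. The following is a small sketch using only the Python standard library; the helper names are illustrative and the local path is whatever location the file was saved to.

```python
import hashlib
import os

EXPECTED_SIZE = 110_574_440
EXPECTED_SHA256 = "ae671928398d98ad86a67926d60c53b8885e2224ba8b7beea3318718afb9bb84"


def sha256_of_file(path, chunk_size=1 << 20):
    """Hash the file in chunks so the whole model is never held in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path, expected_size=EXPECTED_SIZE, expected_sha256=EXPECTED_SHA256):
    """Return True only if both the byte size and the SHA256 digest match."""
    if os.path.getsize(path) != expected_size:
        return False
    return sha256_of_file(path) == expected_sha256.lower()
```

For example, `verify_artifact("fastconformer_fa32m_ctc_fixed2005_len_fp.tflite")` should return True for an intact download.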

269-clip transcription parity

Source: PyTorch NeMo preprocessor + encoder + auxiliary CTC decoder in fp32, decoded during calibration export.
Candidate: this LiteRT/TFLite model, run through ai_edge_litert on the XNNPACK CPU backend.

Validation set: all 269 clips from Reza2kn/visualears-benchmark-269-gold.

Metric | Result
Exact transcript matches | 269 / 269
Exact transcript parity | 100.00%
Exact normalized transcript parity | 100.00%
Mean character similarity | 100.00%
Candidate non-empty rate | 98.88%
Source non-empty rate | 98.88%
Encoded length match rate | 100.00%

Result: passes the >98% transcription parity gate.
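
Metrics of this kind can be computed along the following lines. This is a sketch under stated assumptions, not the parity script shipped in scripts/: the normalization (whitespace collapsing) and the similarity measure (`difflib.SequenceMatcher.ratio`) are stand-ins, since the exact definitions are not given here. Note that a 98.88% non-empty rate on 269 clips corresponds to 266 non-empty transcripts, and that two empty strings count as an exact match.

```python
from difflib import SequenceMatcher


def parity_report(source, candidate, gate=0.98):
    """Compare reference (source) and candidate transcripts clip by clip."""
    if len(source) != len(candidate) or not source:
        raise ValueError("need two non-empty transcript lists of equal length")

    pairs = list(zip(source, candidate))
    total = len(pairs)

    exact = sum(s == c for s, c in pairs)
    # Assumed normalization: collapse runs of whitespace and trim the ends.
    normalized = sum(" ".join(s.split()) == " ".join(c.split()) for s, c in pairs)
    # Assumed similarity: difflib ratio, which is 1.0 for two empty strings.
    similarity = sum(SequenceMatcher(None, s, c).ratio() for s, c in pairs) / total

    return {
        "exact_matches": exact,
        "exact_parity": exact / total,
        "normalized_parity": normalized / total,
        "mean_char_similarity": similarity,
        "candidate_non_empty_rate": sum(bool(c.strip()) for c in candidate) / total,
        "source_non_empty_rate": sum(bool(s.strip()) for s in source) / total,
        "passes_gate": exact / total > gate,
    }
```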

Feature contract

Use the sidecars preprocessor.json and mel_filters_slaney_80x257.json:

  • sample rate: 16 kHz mono
  • preemphasis: 0.97
  • STFT: n_fft=512, win_length=400, hop_length=160, centered with reflect padding
  • mel: Slaney/librosa 80-bin filterbank from sidecar
  • log: natural log with tiny floor
  • no per-bin normalization (normalize=NA)
  • zero-pad/truncate features to 2005 frames, and pass true processed_signal_length
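
Two steps of this contract are easy to get wrong: the preemphasis filter and the padding bookkeeping. The sketch below illustrates both in plain Python lists for clarity (a real runtime would use arrays). The helper names are hypothetical, and the frame-count rule `1 + num_samples // hop_length` is the usual centered-STFT convention, assumed here rather than taken from preprocessor.json. The STFT, mel projection, and log steps are omitted.

```python
TARGET_FRAMES = 2005
HOP_LENGTH = 160
PREEMPHASIS = 0.97


def preemphasize(samples, coefficient=PREEMPHASIS):
    """y[0] = x[0]; y[n] = x[n] - coefficient * x[n - 1]."""
    if not samples:
        return []
    return [samples[0]] + [
        samples[n] - coefficient * samples[n - 1] for n in range(1, len(samples))
    ]


def centered_frame_count(num_samples, hop_length=HOP_LENGTH):
    """Frame count of a centered STFT (assumed convention, see lead-in)."""
    return 1 + num_samples // hop_length


def pad_or_truncate(features, target_frames=TARGET_FRAMES):
    """features: 80 rows of per-frame log-mel values, all rows the same length.

    Returns (padded_rows, valid_frames); valid_frames is the value to pass as
    processed_signal_length.
    """
    valid_frames = min(len(features[0]), target_frames)
    padded = [
        list(row[:target_frames]) + [0.0] * (target_frames - valid_frames)
        for row in features
    ]
    return padded, valid_frames
```

Under the assumed frame rule, one second of 16 kHz audio yields 101 frames, so 2005 frames cover roughly 20 seconds. The padded rows still need a leading batch axis to reach the [1, 80, 2005] input shape.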

Files

  • fastconformer_fa32m_ctc_fixed2005_len_fp.tflite: LiteRT/TFLite model
  • tokens.json: tokenizer pieces + blank id
  • preprocessor.json: feature settings
  • mel_filters_slaney_80x257.json: browser/runtime-compatible mel filters
  • validation/parity_full269_litert_fp_fp16.json: full transcript parity for FP and FP16
  • validation/fa32m_litert_export_manifest.json: calibration/export manifest
  • scripts/: export, conversion, quantization, and parity scripts

Provenance / conversion notes

  • Source model: Reza2kn/visualears-fastconformer-fa32m-streaming-bpe1024 / fa32m_streaming_bpe1024_final.nemo
  • Source SHA256: 034fb2afa19da13db8a120970a7f8d3e696987014cc62684ce50a1382d332448
  • Conversion: NeMo CTC encoder/auxiliary decoder → TorchScript → litert_torch → LiteRT/TFLite.
  • LiteRT workaround: relative positional encoding was fixed to the known 2005-frame contract to avoid dynamic scalar lowering in litert_torch; processed_signal_length remains a runtime input and drives padding/attention masking plus encoded_lengths.