The model sometimes drops full sentences

#42

by nenad1002 - opened May 8

I'm using nvidia/parakeet-tdt-0.6b-v3 for offline transcription of a 120 s English clip (16 kHz mono, single speaker, clean studio audio in the style of TED-LIUM). Two short sentences are consistently missing from the output in both of these setups:

The official nemo_toolkit[asr] pipeline (single-pass over the full audio, greedy_batch decoding, bf16 autocast on CUDA).
An ONNX-exported port of the same model.
Specifically, the model emits no tokens for these spans:

"So the climate changes will be terrible for them."
"Really advanced civilization is based on advances in energy."
Both are clearly intelligible in the input audio, and other ASR systems (Whisper, Canary) transcribe them correctly. There is no silence or background noise around the missing spans.
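One way to confirm such drops programmatically, rather than by reading the transcript, is sketched below. This is only an illustration: the helper names `normalize` and `find_dropped_sentences`, the 0.8 similarity threshold, and the hypothesis string in the demo are assumptions made up for the example, not part of NeMo or of the actual model output.

```python
import difflib
import re


def normalize(text):
    """Lowercase and strip punctuation so the comparison ignores formatting."""
    return re.sub(r"[^a-z0-9' ]+", " ", text.lower()).split()


def find_dropped_sentences(reference_sentences, hypothesis, threshold=0.8):
    """Return reference sentences that have no close match in the hypothesis.

    For each sentence, slide a window of the same word count over the
    hypothesis and keep the best difflib similarity ratio. A sentence counts
    as dropped when that best ratio stays below `threshold` (an assumed value).
    """
    hyp_words = normalize(hypothesis)
    dropped = []
    for sentence in reference_sentences:
        ref_words = normalize(sentence)
        n = len(ref_words)
        best = 0.0
        # max(1, ...) ensures one comparison even if the hypothesis is short.
        for start in range(max(1, len(hyp_words) - n + 1)):
            window = hyp_words[start:start + n]
            ratio = difflib.SequenceMatcher(None, ref_words, window).ratio()
            best = max(best, ratio)
        if best < threshold:
            dropped.append(sentence)
    return dropped


# Demo: the first reference sentence and the hypothesis are made-up stand-ins.
reference = [
    "We need new sources of power.",
    "So the climate changes will be terrible for them.",
    "Really advanced civilization is based on advances in energy.",
]
hypothesis = "we need new sources of power"
print(find_dropped_sentences(reference, hypothesis))
```

Running the same check on the NeMo output and on the ONNX output makes it easy to see whether both pipelines drop exactly the same spans.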

Repro (NeMo, single-pass over full audio):
import torch, soundfile as sf
import nemo.collections.asr as nemo_asr
from omegaconf import open_dict

asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3").eval().cuda()
cfg = asr.cfg.decoding
with open_dict(cfg): cfg.strategy = "greedy_batch"
asr.change_decoding_strategy(cfg)

audio, sr = sf.read("clip.wav", dtype="float32"); assert sr == 16000
x = torch.from_numpy(audio).unsqueeze(0).cuda()
xl = torch.tensor([len(audio)]).cuda()

with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    s, sl = asr.preprocessor(input_signal=x, length=xl)  # waveform -> mel features
    e, el = asr.encoder(audio_signal=s, length=sl)       # features -> encoder frames
    print(asr.decoding.rnnt_decoder_predictions_tensor(e, el)[0])

Questions:
Is this a known limitation of the TDT decoder for v3 (e.g., interaction between max_symbols_per_step and the duration head emitting non-zero skips on short utterances)?
Does NVIDIA recommend a non-greedy strategy (alsd / maes / beam) for offline transcription so that these drops can be avoided?
Are there any decoding parameters in cfg.decoding that should be tuned for clips with short, fast-paced sentences?
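
To make the first question concrete, here is a toy sketch of how a greedy token-and-duration loop walks through encoder frames. It is a simplified illustration under stated assumptions, not NeMo's implementation: `toy_tdt_greedy`, the `step` callback, the blank id, and the scripted predictions are all hypothetical. It shows that frames jumped over by a predicted duration are never queried for tokens, and that `max_symbols` only caps zero-duration emissions on a single frame. Whether either effect explains the drops in this clip is not established.

```python
from typing import Callable, List, Tuple

BLANK = 0  # assumed blank token id for this toy example


def toy_tdt_greedy(
    step: Callable[[int, List[int]], Tuple[int, int]],
    num_frames: int,
    max_symbols: int = 10,
) -> Tuple[List[int], List[int]]:
    """Toy greedy loop for a token-and-duration transducer.

    `step(t, emitted)` stands in for the joint network: it returns the argmax
    token and the argmax duration (frames to skip) at encoder frame `t`.
    Returns the emitted tokens and the list of frames that were visited.
    """
    emitted: List[int] = []
    visited: List[int] = []
    t = 0
    while t < num_frames:
        visited.append(t)
        symbols_here = 0
        while True:
            token, duration = step(t, emitted)
            if token == BLANK and duration == 0:
                duration = 1  # a blank must always advance time
            if token != BLANK:
                emitted.append(token)
                symbols_here += 1
            if duration > 0:
                t += duration  # jump ahead; skipped frames are never scored
                break
            if symbols_here >= max_symbols:
                t += 1  # cap on zero-duration emissions at one frame
                break
    return emitted, visited


# Scripted (made-up) predictions per frame: frame -> (token, duration).
script = {0: (5, 1), 1: (BLANK, 1), 2: (6, 4), 6: (7, 1), 7: (BLANK, 3)}
tokens, frames = toy_tdt_greedy(lambda t, emitted: script[t], num_frames=10)
skipped = sorted(set(range(10)) - set(frames))
print("tokens:", tokens)
print("visited frames:", frames)
print("never-visited frames:", skipped)
```

In this toy run the duration of 4 predicted at frame 2 means frames 3 to 5 are never examined, so any speech the encoder placed only in those frames could not produce tokens under greedy decoding.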
