The model sometimes drops full sentences
I'm using nvidia/parakeet-tdt-0.6b-v3 for offline transcription of a 120 s English clip (16 kHz mono, single speaker, clean studio audio, TED-LIUM style). Two short sentences are consistently missing from the output of both:
- The official nemo_toolkit[asr] pipeline (single pass over the full audio, greedy_batch decoding, bf16 autocast on CUDA).
- An ONNX-exported port of the same model.
Specifically, the model emits no tokens for these spans:
- "So the climate changes will be terrible for them."
- "Really advanced civilization is based on advances in energy."
Both are clearly intelligible in the input audio, and other ASR systems (Whisper, Canary) transcribe them correctly. There is no silence or background noise around the missing spans.
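For anyone who wants to check their own output for the same drops, here is a minimal sketch that flags reference sentences absent from a transcript. The helper names (`normalize`, `missing_sentences`) and the `hypothesis` string are placeholders for illustration, not real model output, and the exact-substring match is a deliberately strict assumption: a single substituted word would also be reported as missing.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, replace punctuation with spaces, collapse whitespace."""
    text = re.sub(r"[^a-z0-9' ]+", " ", text.lower())
    return " ".join(text.split())


def missing_sentences(reference_sentences, hypothesis):
    """Return reference sentences whose normalized text is not in the hypothesis."""
    # Pad with spaces so matches only start and end on word boundaries.
    hyp = f" {normalize(hypothesis)} "
    return [s for s in reference_sentences if f" {normalize(s)} " not in hyp]


reference = [
    "So the climate changes will be terrible for them.",
    "Really advanced civilization is based on advances in energy.",
]

# Placeholder transcript for illustration; substitute the real model output.
hypothesis = (
    "placeholder text so the climate changes will be terrible for them "
    "more placeholder text"
)

for sentence in missing_sentences(reference, hypothesis):
    print("MISSING:", sentence)
```

With the placeholder transcript above, only the second reference sentence is reported.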
Repro (NeMo, single-pass over full audio):
import soundfile as sf
import torch
import nemo.collections.asr as nemo_asr
from omegaconf import open_dict
# Load the checkpoint and switch to batched greedy decoding.
asr = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v3").eval().cuda()
cfg = asr.cfg.decoding
with open_dict(cfg):
    cfg.strategy = "greedy_batch"
asr.change_decoding_strategy(cfg)
# Read the whole clip as a single utterance (16 kHz mono expected).
audio, sr = sf.read("clip.wav", dtype="float32")
assert sr == 16000
x = torch.from_numpy(audio).unsqueeze(0).cuda()  # shape: (1, num_samples)
xl = torch.tensor([len(audio)]).cuda()
# Single pass: preprocessor -> encoder -> TDT decoder, under bf16 autocast.
with torch.inference_mode(), torch.autocast("cuda", dtype=torch.bfloat16):
    s, sl = asr.preprocessor(input_signal=x, length=xl)
    e, el = asr.encoder(audio_signal=s, length=sl)
    print(asr.decoding.rnnt_decoder_predictions_tensor(e, el)[0])
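One way to localise the drops in time is to look for long stretches that no transcript segment covers. The sketch below assumes segment timestamps are available as dictionaries with `start` and `end` keys in seconds (to my knowledge this is the shape NeMo returns under `timestamp['segment']` when calling `transcribe(..., timestamps=True)`, but treat that as an assumption). `find_gaps`, `min_gap`, and the sample segments are hypothetical and not measured from the clip.

```python
def find_gaps(segments, total_duration, min_gap=2.0):
    """Return (start, end) spans of at least min_gap seconds not covered by any segment."""
    gaps = []
    cursor = 0.0  # end time of the audio covered so far
    for seg in sorted(segments, key=lambda s: s["start"]):
        if seg["start"] - cursor >= min_gap:
            gaps.append((cursor, seg["start"]))
        cursor = max(cursor, seg["end"])
    # Also check the tail between the last segment and the end of the clip.
    if total_duration - cursor >= min_gap:
        gaps.append((cursor, total_duration))
    return gaps


# Made-up segment timestamps for illustration only.
segments = [
    {"start": 0.0, "end": 4.0, "segment": "..."},
    {"start": 4.5, "end": 9.0, "segment": "..."},
    {"start": 12.5, "end": 20.0, "segment": "..."},
]

for start, end in find_gaps(segments, total_duration=20.0):
    print(f"uncovered: {start:.1f}s - {end:.1f}s")
```

Since the clip has no real silences around the missing sentences, any uncovered span of a few seconds is a candidate location for dropped speech.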
Questions:
1. Is this a known limitation of the TDT decoder for v3 (e.g., interaction between max_symbols_per_step and the duration head emitting non-zero skips on short utterances)?
2. Does NVIDIA recommend a non-greedy strategy (alsd / maes / beam) for offline transcription to avoid these drops?
3. Are there any decoding parameters in cfg.decoding that should be tuned for clips with short, fast-paced sentences?
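To make the first question concrete, here is a toy sketch of a TDT-style greedy loop, under the assumption that each step predicts a token plus a duration and the frame index jumps ahead by that duration. This is not NeMo's implementation: `tdt_greedy`, `predict`, and the scripted values are made up for illustration. It only shows the two mechanisms the question refers to: frames inside a predicted skip are never evaluated, and a per-frame symbol cap forces the decoder to move on.

```python
BLANK = None  # stand-in for the blank symbol


def tdt_greedy(predict, num_frames, max_symbols=10):
    """Toy TDT-style greedy decoding.

    predict(t, num_emitted) -> (token_or_BLANK, duration) is a stub for the
    joint network. Returns (tokens, visited_frames).
    """
    tokens, visited = [], []
    t = 0
    while t < num_frames:
        visited.append(t)
        symbols_here = 0
        while True:
            token, duration = predict(t, len(tokens))
            if token is not BLANK:
                tokens.append(token)
                symbols_here += 1
            if duration > 0:
                t += duration  # frames inside the skip are never evaluated
                break
            if token is BLANK or symbols_here >= max_symbols:
                # Blank with duration 0, or the per-frame cap was hit:
                # force progress by one frame to avoid looping forever.
                t += 1
                break
    return tokens, visited


# Hand-written (token, duration) decisions per frame; purely illustrative.
script = {
    0: ("so", 1),
    1: ("the", 2),
    3: ("climate", 4),
    7: (BLANK, 1),
    8: ("changes", 1),
    9: (BLANK, 1),
}

tokens, visited = tdt_greedy(lambda t, n: script[t], num_frames=10)
skipped = [t for t in range(10) if t not in visited]
print(tokens)
print("visited:", visited, "skipped:", skipped)

# A stub that always predicts duration 0 emits exactly max_symbols tokens per frame.
capped, _ = tdt_greedy(lambda t, n: ("a", 0), num_frames=2, max_symbols=3)
print(len(capped))
```

In this toy run, frames 2, 4, 5, and 6 are never passed to `predict`, so anything the encoder represents only in those frames cannot produce a token. Whether something like this explains the dropped sentences in the real model is exactly what the question is asking.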