Embedl Parakeet Tdt 0.6B V3 (Quantized for TensorRT)

Deployable INT8-quantized version of nvidia/parakeet-tdt-0.6b-v3, optimized with embedl-deploy for low-latency NVIDIA TensorRT speech recognition on edge GPUs.

Upstream Model

nvidia/parakeet-tdt-0.6b-v3

Highlights

  • Mixed-precision INT8/FP16 quantization of the Conformer encoder via embedl-deploy. The TDT decoder stays in FP32 because it is small and autoregressive.
  • Drop-in replacement for the upstream nvidia/parakeet-tdt-0.6b-v3 encoder, with the same input: a log-mel spectrogram of 3000 frames × 128 bins.
  • Validated accuracy within about 0.16 pp average WER of the FP32 baseline on matched Open ASR Leaderboard subsets.
  • Ships the ONNX model (for TensorRT) and a runnable inference script. The TRT engine built on first run is cached next to the ONNX file.
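As a rough illustration of what INT8 quantization does to a tensor, the sketch below applies symmetric per-tensor quantization and dequantization in plain Python. It is not embedl-deploy's algorithm; the helper names (`scale_for`, `quantize`, `dequantize`) and the choice of a max-magnitude scale are assumptions made for the example.

```python
def scale_for(values, qmax=127):
    """Symmetric per-tensor scale: map the largest magnitude onto qmax (assumed calibration rule)."""
    amax = max(abs(v) for v in values)
    return amax / qmax if amax > 0 else 1.0


def quantize(x, scale, qmin=-128, qmax=127):
    """Float -> INT8 code: divide by the scale, round to nearest, clamp to the INT8 range."""
    return max(qmin, min(qmax, round(x / scale)))


def dequantize(q, scale):
    """INT8 code -> approximate float value."""
    return q * scale


activations = [-1.27, -0.4, 0.0, 0.33, 0.9]   # made-up values for illustration
scale = scale_for(activations)
codes = [quantize(v, scale) for v in activations]
restored = [dequantize(q, scale) for q in codes]
max_err = max(abs(a - r) for a, r in zip(activations, restored))

print(codes)
print(max_err <= scale / 2 + 1e-12)
```

The round-trip error of any in-range value is bounded by half the scale, which is why layers whose activations have a wide dynamic range are the usual candidates for staying in FP16 in a mixed-precision setup.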

Quick Start

Requires an NVIDIA GPU with driver ≥ 525 (CUDA 12.x). Install the CUDA 12 builds of PyTorch and TensorRT explicitly, because the unqualified torch and tensorrt packages on PyPI default to CUDA 13 and will fail on CUDA 12 drivers:

pip install huggingface_hub transformers soundfile librosa
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install tensorrt-cu12
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/parakeet-tdt-0.6b-v3-quantized-tensorrt', local_dir='.')"
python infer_trt.py --audio path/to/speech.wav

Files

| File | Purpose |
| --- | --- |
| embedl_parakeet-tdt-0.6b-v3_int8.onnx | INT8-quantized encoder ONNX with Q/DQ nodes. |
| infer_trt.py | Build a TRT engine from the ONNX and transcribe a WAV. |
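The build-once, reuse-later behavior could look roughly like the sketch below. It is not the contents of infer_trt.py: the cache file name (same stem with an `.engine` suffix), the helper names, and the choice of builder flags are assumptions, and the TensorRT calls follow the TensorRT 10 Python API.

```python
from pathlib import Path


def engine_path_for(onnx_path):
    """Hypothetical cache location: same folder and stem as the ONNX file, '.engine' suffix."""
    return Path(onnx_path).with_suffix(".engine")


def load_or_build_engine(onnx_path):
    """Return serialized engine bytes, invoking the TensorRT builder only on a cache miss."""
    cache = engine_path_for(onnx_path)
    if cache.exists():
        return cache.read_bytes()

    # Imported lazily so a cache hit does not need the builder at all.
    import tensorrt as trt

    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(0)
    parser = trt.OnnxParser(network, logger)
    if not parser.parse_from_file(str(onnx_path)):
        errors = [str(parser.get_error(i)) for i in range(parser.num_errors)]
        raise RuntimeError("ONNX parse failed: " + "; ".join(errors))

    config = builder.create_builder_config()
    # Assumed flags: allow both FP16 and INT8 kernels so the Q/DQ nodes are honored.
    config.set_flag(trt.BuilderFlag.FP16)
    config.set_flag(trt.BuilderFlag.INT8)
    serialized = builder.build_serialized_network(network, config)
    if serialized is None:
        raise RuntimeError("TensorRT engine build failed")

    with open(cache, "wb") as f:
        f.write(serialized)
    return cache.read_bytes()
```

A serialized engine is specific to the GPU and TensorRT version that built it, so a cached file copied from an L4 to a Jetson would need to be deleted and rebuilt.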

Demo: Nehru "Tryst with Destiny" (1947)

A 4 min 41 s archival speech in English with a strong regional accent and period audio quality, which makes it a stress test for any modern ASR model. The demo MP3 is decoded to 16 kHz mono and split into 28 s chunks (below the 30 s encoder window) before being fed through the INT8-quantized encoder.
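The chunking step can be sketched as below, assuming simple non-overlapping chunks. Whether infer_trt.py overlaps or pads chunks is not stated here, and the 10 ms hop (160 samples) is inferred from 3000 frames covering a 30 s window; `chunk_bounds` and `frames_for` are hypothetical helper names.

```python
SAMPLE_RATE = 16_000   # Hz, from the text
CHUNK_SECONDS = 28     # from the text, below the 30 s encoder window
HOP_SAMPLES = 160      # assumption: 10 ms hop, i.e. 3000 frames per 30 s


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) sample indices of consecutive, non-overlapping chunks."""
    chunk = sample_rate * chunk_seconds
    return [(s, min(s + chunk, num_samples)) for s in range(0, num_samples, chunk)]


def frames_for(num_samples, hop=HOP_SAMPLES):
    """Approximate number of log-mel frames produced by num_samples of audio."""
    return num_samples // hop


total_samples = round(280.9 * SAMPLE_RATE)   # demo duration from the results table
bounds = chunk_bounds(total_samples)

print(len(bounds), "chunks; last chunk is",
      (bounds[-1][1] - bounds[-1][0]) / SAMPLE_RATE, "s")
print(frames_for(SAMPLE_RATE * CHUNK_SECONDS), "frames per full chunk")
```

Under these assumptions a full 28 s chunk yields 2800 frames, so each chunk would have to be padded to the static 3000-frame input before inference.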

Mel-spectrogram of the Nehru Tryst-with-Destiny speech

Result (Embedl Parakeet INT8 encoder + upstream TDT decoder, scored against the verified ground-truth transcript after Whisper-style normalization):

| Metric | Value |
| --- | --- |
| Audio duration | 280.9 s (4 min 41 s) |
| Ground-truth words | 520 |
| Parakeet hypothesis words | 526 |
| Word Error Rate | 5.58 % |

The Parakeet INT8 output transcript is provided for direct comparison.
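For readers who want to reproduce the score, word error rate is the word-level edit distance between hypothesis and reference divided by the number of reference words. The function below is a minimal sketch of that definition; it assumes both strings are already normalized and is not the evaluation script used for the numbers above.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words consumed so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion of a reference word
                curr[j - 1] + 1,           # insertion of a hypothesis word
                prev[j - 1] + (r != h),    # match (cost 0) or substitution (cost 1)
            ))
        prev = curr
    return prev[-1] / len(ref)


# Toy example: one substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))

# With 520 reference words, 5.58 % corresponds to roughly 29 word errors.
print(round(29 / 520 * 100, 2))
```

The function divides by the reference length, so it raises `ZeroDivisionError` on an empty reference; a production scorer would handle that case explicitly.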

Try it yourself:

curl -O https://huggingface.co/datasets/embedl/documentation-images/resolve/main/parakeet-tdt-0.6b-v3-quantized-tensorrt/nehru_tryst.mp3
python infer_trt.py --audio nehru_tryst.mp3

Performance

Encoder benchmarked via trtexec, GPU compute time only (--noDataTransfers), CUDA Graph + Spin Wait enabled, 1000 iterations after a 2 s warm-up. Input is a static (1, 3000, 128) log-mel spectrogram (30 s window). Engine size is the on-disk .engine file; peak GPU memory is the engine plus the per-context activation pool reported by trtexec.
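A trtexec invocation matching that description might be assembled as in the sketch below. The flag names are standard trtexec options, but the exact command used for these benchmarks is not given, so the precision flags and the helper name `trtexec_command` are assumptions.

```python
import shlex


def trtexec_command(onnx_path, precision_flags=("--fp16", "--int8")):
    """Build an argv list for a compute-only trtexec benchmark (assumed flag set)."""
    return [
        "trtexec",
        f"--onnx={onnx_path}",
        *precision_flags,        # assumption: INT8 + FP16 kernels for the Q/DQ model
        "--noDataTransfers",     # time GPU compute only, no host/device copies
        "--useCudaGraph",
        "--useSpinWait",
        "--iterations=1000",     # run at least this many timed iterations
        "--warmUp=2000",         # warm-up in milliseconds (2 s)
    ]


cmd = trtexec_command("embedl_parakeet-tdt-0.6b-v3_int8.onnx")
print(shlex.join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` would launch the benchmark on a machine where trtexec is on the PATH; for the FP32 and FP16 baselines the precision flags would change to `()` and `("--fp16",)` respectively.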

NVIDIA L4 (TensorRT 10.16)

| Precision | Mean Latency | p95 | Throughput | Engine Size | Peak GPU Memory | Speedup vs FP32 |
| --- | --- | --- | --- | --- | --- | --- |
| FP32 | 32.82 ms | 33.62 ms | 30.5 qps | 2364 MiB | 2570 MiB | 1.00× |
| FP16 | 16.83 ms | 17.10 ms | 59.4 qps | 1177 MiB | 1280 MiB | 1.95× |
| Embedl INT8 + FP16 | 14.57 ms | 14.74 ms | 68.6 qps | 758 MiB | 862 MiB | 2.25× |

NVIDIA Jetson AGX Orin (TensorRT 10.3, JetPack 6)

| Precision | Mean Latency | p95 | Throughput | Engine Size | Peak GPU Memory | Speedup vs FP32 |
| --- | --- | --- | --- | --- | --- | --- |
| FP32 | 62.40 ms | 62.46 ms | 16.0 qps | 2331 MiB | 2526 MiB | 1.00× |
| FP16 | 35.62 ms | 35.66 ms | 28.1 qps | 1174 MiB | 1274 MiB | 1.75× |
| Embedl INT8 + FP16 | 34.32 ms | 34.35 ms | 29.1 qps | 706 MiB | 806 MiB | 1.82× |

The Embedl INT8 + FP16 engine is 2.25× faster than FP32 on L4 and 1.82× faster on AGX Orin, with an engine 3.1× to 3.3× smaller than FP32 across both targets.
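The throughput, speedup, and size-ratio figures follow directly from the measured latencies and engine sizes. The sketch below recomputes them from the two tables, assuming batch size 1 so that throughput is simply 1000 divided by the mean latency in milliseconds; the dictionary layout and the `derive` helper are illustrative only.

```python
# (mean latency in ms, engine size in MiB), copied from the two tables above.
RESULTS = {
    "L4": {"FP32": (32.82, 2364), "FP16": (16.83, 1177), "INT8+FP16": (14.57, 758)},
    "AGX Orin": {"FP32": (62.40, 2331), "FP16": (35.62, 1174), "INT8+FP16": (34.32, 706)},
}


def derive(latency_ms, engine_mib, base_latency_ms, base_engine_mib):
    """Derive throughput and FP32-relative ratios from raw measurements."""
    return {
        "qps": round(1000.0 / latency_ms, 1),
        "speedup": round(base_latency_ms / latency_ms, 2),
        "engine_ratio": round(base_engine_mib / engine_mib, 2),
    }


derived = {
    gpu: {name: derive(*vals, *rows["FP32"]) for name, vals in rows.items()}
    for gpu, rows in RESULTS.items()
}

for gpu, rows in derived.items():
    for name, metrics in rows.items():
        print(gpu, name, metrics)
```

On AGX Orin the INT8 + FP16 engine is only marginally faster than FP16 (34.32 ms against 35.62 ms), so most of the benefit on that target is the smaller engine and memory footprint.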

Accuracy (Open ASR Leaderboard)

Evaluated on the Open ASR Leaderboard test suites with the official Whisper-style English text normalizer (lowercase, number expansion, filler-word removal). Lower WER is better.

Full-dataset FP32 baseline (83,173 samples, 167.9 h audio)

Reference run with the upstream nvidia/parakeet-tdt-0.6b-v3 in FP32:

| Dataset | WER | Samples | Audio |
| --- | --- | --- | --- |
| AMI | 12.02% | 12,643 | 8.7 h |
| Earnings22 | 11.83% | 2,731 | 5.3 h |
| GigaSpeech | 9.78% | 19,931 | 35.4 h |
| LibriSpeech (clean) | 1.99% | 2,611 | 5.3 h |
| LibriSpeech (other) | 3.65% | 2,932 | 5.3 h |
| SPGISpeech | 3.87% | 39,341 | 100.0 h |
| TEDLIUM | 3.07% | 1,154 | 2.6 h |
| VoxPopuli | 6.13% | 1,830 | 4.8 h |
| Average | 6.54% | 83,173 | 167.9 h |

FP32 vs Embedl INT8: matched 500 samples per dataset

Both paths are evaluated on identical sample indices (evenly spaced across each dataset) to isolate quantization accuracy loss from sampling variance. Lower WER is better.
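One way to pick evenly spaced indices is sketched below. The exact selection rule used for this evaluation is not specified, so the integer-stride formula and the function name `evenly_spaced_indices` are assumptions; the point is that the same deterministic index list is reused for both the FP32 and INT8 runs.

```python
def evenly_spaced_indices(dataset_size, k=500):
    """Pick k indices spread uniformly over range(dataset_size), deterministically."""
    if dataset_size <= k:
        return list(range(dataset_size))
    # Integer arithmetic keeps the result reproducible across platforms.
    return [i * dataset_size // k for i in range(k)]


# LibriSpeech (clean) has 2,611 test samples in the full-dataset table.
idx = evenly_spaced_indices(2611)
print(len(idx), idx[:5], idx[-1])
```

Because the indices depend only on the dataset size, both precision paths see exactly the same utterances without any random seed to keep in sync.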

| Dataset | FP32 WER | Embedl INT8 WER | Δ WER |
| --- | --- | --- | --- |
| AMI | 12.54% | 13.54% | +1.00% |
| Earnings22 | 12.55% | 11.72% | −0.83% |
| GigaSpeech | 9.97% | 10.21% | +0.23% |
| LibriSpeech (clean) | 2.00% | 2.10% | +0.10% |
| LibriSpeech (other) | 3.32% | 3.70% | +0.38% |
| SPGISpeech | 4.04% | 4.15% | +0.11% |
| TEDLIUM | 2.68% | 2.76% | +0.08% |
| VoxPopuli | 6.07% | 6.25% | +0.18% |
| Average | 6.65% | 6.80% | +0.16% |

Embedl INT8 quantization adds only +0.16 pp absolute WER on average (6.65% to 6.80%), well within deployment tolerance. The AMI +1.00 pp outlier is within the expected variance for a 500-sample evaluation on spontaneous meeting speech, which has the highest natural variance of the 8 datasets. The Earnings22 −0.83 pp improvement is a sampling artifact of the 500-sample subset rather than a real accuracy gain.
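The averages can be rechecked from the per-dataset rows, as in the short sketch below. It treats the Average row as an unweighted mean over the 8 datasets, which reproduces the published figures; the relative-degradation number is an extra derived quantity, not one reported above.

```python
# (FP32 WER %, Embedl INT8 WER %) on the matched 500-sample subsets.
MATCHED = {
    "AMI": (12.54, 13.54),
    "Earnings22": (12.55, 11.72),
    "GigaSpeech": (9.97, 10.21),
    "LibriSpeech (clean)": (2.00, 2.10),
    "LibriSpeech (other)": (3.32, 3.70),
    "SPGISpeech": (4.04, 4.15),
    "TEDLIUM": (2.68, 2.76),
    "VoxPopuli": (6.07, 6.25),
}

fp32_mean = sum(fp32 for fp32, _ in MATCHED.values()) / len(MATCHED)
int8_mean = sum(int8 for _, int8 in MATCHED.values()) / len(MATCHED)
delta_pp = int8_mean - fp32_mean              # absolute change, percentage points
relative_pct = delta_pp / fp32_mean * 100     # change relative to the FP32 WER
worse = [name for name, (fp32, int8) in MATCHED.items() if int8 > fp32]

print(round(fp32_mean, 2), round(int8_mean, 2), round(delta_pp, 2))
print(round(relative_pct, 2), len(worse))
```

The unrounded difference is 0.1575 pp, which rounds to the +0.16 shown in the table even though subtracting the two rounded averages gives 0.15.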

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.

License

| Component | License |
| --- | --- |
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | Parakeet Tdt 0.6B V3 License |

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.
