Embedl Parakeet TDT 0.6B V3 (Quantized for TensorRT)

Deployable INT8-quantized version of nvidia/parakeet-tdt-0.6b-v3, optimized with embedl-deploy for low-latency NVIDIA TensorRT speech recognition on edge GPUs.

Highlights

Mixed-precision INT8/FP16 quantization of the Conformer encoder via embedl-deploy — the TDT decoder stays in FP32 (small, autoregressive).
Drop-in replacement for the upstream nvidia/parakeet-tdt-0.6b-v3 encoder, taking the same log-mel spectrogram input (3000 frames × 128 bins).
Average WER within about 0.16 pp of the FP32 baseline on matched Open ASR Leaderboard subsets.
Ships the ONNX model (for TensorRT) and a runnable inference script. On first run, the TensorRT engine is built and cached next to the ONNX file.
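
The build-once, reuse-afterwards behaviour mentioned in the last highlight can be sketched as follows. This is a minimal illustration and not the logic of infer_trt.py, which is not reproduced here: `load_or_build_engine` and `build_fn` are hypothetical names, the `.engine` suffix is taken from the Performance section, and the staleness check on modification time is an assumption.

```python
from pathlib import Path
from typing import Callable


def load_or_build_engine(onnx_path: Path, build_fn: Callable[[Path], bytes]) -> bytes:
    """Return serialized engine bytes, building only when no usable cache exists.

    `build_fn` is a hypothetical callable that turns an ONNX file into a
    serialized TensorRT engine; the real build step is not shown here.
    """
    engine_path = onnx_path.with_suffix(".engine")  # cached next to the ONNX file
    # Reuse the cache only if it is at least as new as the ONNX file (assumption).
    if engine_path.exists() and engine_path.stat().st_mtime >= onnx_path.stat().st_mtime:
        return engine_path.read_bytes()
    engine_bytes = build_fn(onnx_path)   # slow path: first run only
    engine_path.write_bytes(engine_bytes)
    return engine_bytes
```

An engine is specific to the GPU and TensorRT version it was built with, so a cached file copied from another machine should generally be rebuilt rather than reused.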

Quick Start

Requires an NVIDIA GPU with driver ≥ 525 (CUDA 12.x). Install the CUDA 12 builds of PyTorch and TensorRT explicitly — the unqualified torch and tensorrt packages on PyPI default to CUDA 13 and will fail on CUDA 12 drivers:

pip install huggingface_hub transformers soundfile librosa
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install tensorrt-cu12
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/parakeet-tdt-0.6b-v3-quantized-tensorrt', local_dir='.')"
python infer_trt.py --audio path/to/speech.wav

Files

File	Purpose
`embedl_parakeet-tdt-0.6b-v3_int8.onnx`	INT8-quantized encoder ONNX with Q/DQ nodes.
`infer_trt.py`	Build a TRT engine from the ONNX and transcribe a WAV.

Demo: Nehru "Tryst with Destiny" (1947)

A 4 min 41 s archival speech in English with a strong regional accent and period audio quality, which makes it a demanding test for a modern ASR model. The demo MP3 is decoded to 16 kHz mono and split into 28 s chunks (below the 30 s encoder window) before being fed through the INT8-quantized encoder.
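
The chunking step can be sketched in a few lines. This is an illustration under stated assumptions, not the code in infer_trt.py: chunks are taken as consecutive and non-overlapping, the helper names are hypothetical, and the frame rate of 100 frames per second is inferred from the 3000-frame, 30 s encoder window.

```python
SAMPLE_RATE = 16_000      # Hz, from the demo description
CHUNK_SECONDS = 28        # chunk length used in the demo
WINDOW_FRAMES = 3000      # encoder input length
WINDOW_SECONDS = 30       # duration covered by the encoder window


def chunk_bounds(num_samples: int, sample_rate: int = SAMPLE_RATE,
                 chunk_seconds: int = CHUNK_SECONDS) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive non-overlapping chunks."""
    step = chunk_seconds * sample_rate
    return [(s, min(s + step, num_samples)) for s in range(0, num_samples, step)]


def frames_for(num_samples: int, sample_rate: int = SAMPLE_RATE) -> int:
    """Approximate log-mel frame count, assuming 3000 frames per 30 s."""
    frames_per_second = WINDOW_FRAMES // WINDOW_SECONDS   # 100
    return num_samples * frames_per_second // sample_rate


total = round(280.9 * SAMPLE_RATE)   # demo duration from the results table
bounds = chunk_bounds(total)
print(len(bounds))                                # 11 chunks
print(frames_for(bounds[0][1] - bounds[0][0]))    # 2800 frames in a full chunk
```

Under these assumptions a full 28 s chunk yields 2800 frames, so each chunk would need 200 frames of padding to fill the static 3000-frame input.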

Figure: Mel-spectrogram of the Nehru "Tryst with Destiny" speech.

Result (Embedl Parakeet INT8 encoder + upstream TDT decoder, against the verified ground-truth transcript — Whisper-style normalized):

Metric	Value
Audio duration	280.9 s (4 min 41 s)
Ground-truth words	520
Parakeet hypothesis words	526
Word Error Rate	5.58 %
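
For readers who want to reproduce the metric, word error rate is the word-level edit distance (substitutions + deletions + insertions) divided by the number of reference words. The sketch below is a plain-Python version that assumes both strings have already been normalized; `word_error_rate` is an illustrative helper, not the evaluation code used for this card.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the ref prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, h in enumerate(hyp, start=1):
            curr[j] = min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution (or match, cost 0)
            )
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("a b c d", "a x c"))  # 0.5: one substitution, one deletion
```

With 520 reference words, the reported 5.58 % is consistent with 29 word-level edits (29 / 520 ≈ 0.0558).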

The Parakeet INT8 output transcript is provided for direct comparison.

Try it yourself:

curl -O https://huggingface.co/datasets/embedl/documentation-images/resolve/main/parakeet-tdt-0.6b-v3-quantized-tensorrt/nehru_tryst.mp3
python infer_trt.py --audio nehru_tryst.mp3

Performance

Encoder benchmarked via trtexec, GPU compute time only (--noDataTransfers), CUDA Graph + Spin Wait enabled, 1000 iterations after a 2 s warm-up. Input is a static (1, 3000, 128) log-mel spectrogram (30 s window). Engine size is the on-disk .engine file; peak GPU memory is the engine plus the per-context activation pool reported by trtexec.

NVIDIA L4 (TensorRT 10.16)

Precision	Mean Latency	p95	Throughput	Engine Size	Peak GPU Memory	Speedup vs FP32
FP32	32.82 ms	33.62 ms	30.5 qps	2364 MiB	2570 MiB	1.00×
FP16	16.83 ms	17.10 ms	59.4 qps	1177 MiB	1280 MiB	1.95×
Embedl INT8 + FP16	14.57 ms	14.74 ms	68.6 qps	758 MiB	862 MiB	2.25×

NVIDIA Jetson AGX Orin (TensorRT 10.3, JetPack 6)

Precision	Mean Latency	p95	Throughput	Engine Size	Peak GPU Memory	Speedup vs FP32
FP32	62.40 ms	62.46 ms	16.0 qps	2331 MiB	2526 MiB	1.00×
FP16	35.62 ms	35.66 ms	28.1 qps	1174 MiB	1274 MiB	1.75×
Embedl INT8 + FP16	34.32 ms	34.35 ms	29.1 qps	706 MiB	806 MiB	1.82×

The Embedl INT8 + FP16 engine is 2.25× faster than FP32 on L4 and 1.82× faster on AGX Orin, with a 3.1×–3.3× smaller engine than FP32 across both targets.
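
The derived columns can be re-checked from the raw measurements. The sketch below assumes throughput is the reciprocal of mean latency (one inference in flight), which matches the table values, and reads the activation pool as peak memory minus engine size; `derive` is a hypothetical helper.

```python
def derive(latency_ms: float, engine_mib: int, peak_mib: int,
           base_latency_ms: float, base_engine_mib: int) -> dict[str, float]:
    """Recompute the derived columns of one benchmark row."""
    return {
        "speedup": base_latency_ms / latency_ms,      # relative to FP32
        "qps": 1000.0 / latency_ms,                   # one inference in flight
        "size_ratio": base_engine_mib / engine_mib,   # FP32 engine / this engine
        "activation_mib": peak_mib - engine_mib,      # peak memory minus engine
    }


# (mean latency ms, engine MiB, peak GPU MiB) copied from the two tables above
rows = {
    "L4": {"FP32": (32.82, 2364, 2570), "INT8+FP16": (14.57, 758, 862)},
    "AGX Orin": {"FP32": (62.40, 2331, 2526), "INT8+FP16": (34.32, 706, 806)},
}

for device, table in rows.items():
    base_latency, base_engine, _ = table["FP32"]
    d = derive(*table["INT8+FP16"], base_latency, base_engine)
    print(f"{device}: {d['speedup']:.2f}x faster, {d['qps']:.1f} qps, "
          f"{d['size_ratio']:.1f}x smaller engine, {d['activation_mib']} MiB activations")
```

Note that on AGX Orin most of the gain over FP32 already comes from FP16 (1.75×), with INT8 adding comparatively little latency benefit but a substantially smaller engine.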

Accuracy (Open ASR Leaderboard)

Evaluated on the Open ASR Leaderboard test suites with the official Whisper-style English text normalizer (lowercase, number expansion, filler-word removal). Lower WER is better.

Full-dataset FP32 baseline (83,173 samples, 167.9 h audio)

Reference run with the upstream nvidia/parakeet-tdt-0.6b-v3 in FP32:

Dataset	WER	Samples	Audio
AMI	12.02%	12,643	8.7 h
Earnings22	11.83%	2,731	5.3 h
GigaSpeech	9.78%	19,931	35.4 h
LibriSpeech (clean)	1.99%	2,611	5.3 h
LibriSpeech (other)	3.65%	2,932	5.3 h
SPGISpeech	3.87%	39,341	100.0 h
TEDLIUM	3.07%	1,154	2.6 h
VoxPopuli	6.13%	1,830	4.8 h
Average	6.54%	83,173	167.9 h

FP32 vs Embedl INT8 — matched 500 samples per dataset

Both paths were evaluated on identical sample indices (evenly spaced across each dataset) to isolate quantization accuracy loss from sampling variance. Lower WER is better.

Dataset	FP32 WER	Embedl INT8 WER	Δ WER (pp)
AMI	12.54%	13.54%	+1.00
Earnings22	12.55%	11.72%	−0.83
GigaSpeech	9.97%	10.21%	+0.23
LibriSpeech (clean)	2.00%	2.10%	+0.10
LibriSpeech (other)	3.32%	3.70%	+0.38
SPGISpeech	4.04%	4.15%	+0.11
TEDLIUM	2.68%	2.76%	+0.08
VoxPopuli	6.07%	6.25%	+0.18
Average	6.65%	6.80%	+0.16
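
One way to pick evenly spaced, matched samples is sketched below. The exact selection rule used for this evaluation is not stated, so the integer-stride formula and the name `evenly_spaced_indices` are assumptions; the point is that the same index list is computed once and reused for both the FP32 and INT8 runs.

```python
def evenly_spaced_indices(dataset_size: int, num_samples: int = 500) -> list[int]:
    """Pick `num_samples` distinct indices spread uniformly over [0, dataset_size)."""
    if num_samples >= dataset_size:
        return list(range(dataset_size))   # small dataset: use everything
    # Integer stride keeps indices strictly increasing when dataset_size > num_samples.
    return [i * dataset_size // num_samples for i in range(num_samples)]


ami_indices = evenly_spaced_indices(12_643)   # AMI test-set size from the table above
print(len(ami_indices), ami_indices[0], ami_indices[-1])
```

Because the selection is deterministic, any WER difference between the two precisions on a given dataset comes from the model outputs rather than from which utterances were drawn.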

Embedl INT8 quantization adds +0.16 pp absolute WER on average across the eight datasets. The AMI +1.00 pp outlier is plausible for a 500-sample evaluation on spontaneous meeting speech, which has the highest natural variance of the eight datasets. Because both paths use identical sample indices, the Earnings22 −0.83 pp improvement cannot come from a sampling difference between the two runs; it is better read as small-sample noise than as a genuine accuracy gain from quantization.
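
The per-dataset deltas and the average can be recomputed from the matched-subset table. The sketch below works from the rounded table entries, so individual deltas may differ from the published column by 0.01 pp (for example GigaSpeech), and the mean delta comes out at about 0.158 pp, in line with the reported +0.16 pp.

```python
# (FP32 WER %, Embedl INT8 WER %) copied from the matched-subset table
matched = {
    "AMI": (12.54, 13.54),
    "Earnings22": (12.55, 11.72),
    "GigaSpeech": (9.97, 10.21),
    "LibriSpeech (clean)": (2.00, 2.10),
    "LibriSpeech (other)": (3.32, 3.70),
    "SPGISpeech": (4.04, 4.15),
    "TEDLIUM": (2.68, 2.76),
    "VoxPopuli": (6.07, 6.25),
}

# Absolute difference in percentage points, not a relative change.
deltas = {name: int8 - fp32 for name, (fp32, int8) in matched.items()}

# The "Average" row is an unweighted mean over datasets.
mean_fp32 = sum(fp32 for fp32, _ in matched.values()) / len(matched)
mean_int8 = sum(int8 for _, int8 in matched.values()) / len(matched)
mean_delta = mean_int8 - mean_fp32

print(f"mean FP32 {mean_fp32:.3f} %, mean INT8 {mean_int8:.3f} %, delta {mean_delta:.3f} pp")
print("largest regression:", max(deltas, key=deltas.get))
```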

Creating Your Own Optimized Models

This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models — see the documentation for installation and usage.

License

Component	License
Optimized model artifacts (this repo)	Embedl Models Community Licence v1.0 — no redistribution as a hosted service
Upstream architecture and weights	Parakeet Tdt 0.6B V3 License

Contact

We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.

Community & support

Need help with this model? Chat with the Embedl team and other engineers on Discord.

Quantization gotchas, hardware questions, fine-tuning tips — bring them all.

Join our Discord →
