Embedl Parakeet Tdt 0.6B V3 (Quantized for TensorRT)
Deployable INT8-quantized version of nvidia/parakeet-tdt-0.6b-v3,
optimized with embedl-deploy
for low-latency NVIDIA TensorRT speech recognition on edge GPUs.
Upstream Model
Highlights
- Mixed-precision INT8/FP16 quantization of the Conformer encoder via embedl-deploy; the TDT decoder stays in FP32 (small, autoregressive).
- Drop-in replacement for the upstream
nvidia/parakeet-tdt-0.6b-v3 encoder: same input, a log-mel spectrogram (3000 frames × 128 bins).
- Validated accuracy within 0.15 pp of the FP32 baseline on Open ASR Leaderboard.
- Ships an ONNX model (for TensorRT) and a runnable inference script. The TRT engine built on first run is cached next to the ONNX file.
Quick Start
Requires an NVIDIA GPU with driver ≥ 525 (CUDA 12.x). Install the CUDA 12
builds of PyTorch and TensorRT explicitly, because the unqualified torch and
tensorrt packages on PyPI default to CUDA 13 and will fail on CUDA 12
drivers:
pip install huggingface_hub transformers soundfile librosa
pip install torch --index-url https://download.pytorch.org/whl/cu128
pip install tensorrt-cu12
python -c "from huggingface_hub import snapshot_download; snapshot_download('embedl/parakeet-tdt-0.6b-v3-quantized-tensorrt', local_dir='.')"
python infer_trt.py --audio path/to/speech.wav
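Before running the script, it can help to confirm which builds actually got installed. The following is a minimal sketch, not part of this repository: it reads installed package versions with the standard library and applies a heuristic, namely that PyTorch wheels from the cu128 index carry a `+cu12x` suffix in their version string. The helper names are hypothetical.

```python
from importlib import metadata


def installed_versions(packages=("torch", "tensorrt-cu12", "tensorrt")):
    """Return {package: version string or None} for each distribution name."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None  # not installed in this environment
    return versions


def looks_like_cuda12_torch(version):
    """Heuristic (assumption): CUDA 12 wheels have '+cu12' in the version."""
    return version is not None and "+cu12" in version


if __name__ == "__main__":
    found = installed_versions()
    for name, version in found.items():
        print(f"{name}: {version or 'not installed'}")
    if not looks_like_cuda12_torch(found["torch"]):
        print("torch does not look like a CUDA 12 build; check the install commands")
```

If the unqualified tensorrt package shows up alongside tensorrt-cu12, that may indicate the CUDA 13 default was pulled in by another dependency.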
Files
| File | Purpose |
|---|---|
| embedl_parakeet-tdt-0.6b-v3_int8.onnx | INT8-quantized encoder ONNX with Q/DQ nodes. |
| infer_trt.py | Build a TRT engine from the ONNX and transcribe a WAV. |
Demo: Nehru "Tryst with Destiny" (1947)
A 4-minute archival speech in English with a strong regional accent and period audio quality, which makes it a stress test for any modern ASR model. The demo MP3 is decoded to 16 kHz mono and split into 28 s chunks (below the 30 s encoder window) before being fed through the INT8-quantized encoder.
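The chunking step can be sketched as follows. This is an illustration under stated assumptions, not the code in infer_trt.py: it assumes consecutive, non-overlapping chunks, and it derives a 10 ms frame hop from the stated 30 s window and 3000 frames. The function names are hypothetical.

```python
SAMPLE_RATE = 16_000      # Hz, as stated above
CHUNK_SECONDS = 28.0      # chunk length used in the demo
WINDOW_SECONDS = 30.0     # encoder window
WINDOW_FRAMES = 3000      # log-mel frames per window


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) sample indices of consecutive, non-overlapping chunks."""
    chunk_len = int(round(chunk_seconds * sample_rate))
    return [(start, min(start + chunk_len, num_samples))
            for start in range(0, num_samples, chunk_len)]


def frames_for(num_samples, sample_rate=SAMPLE_RATE):
    """Approximate frame count, assuming a uniform hop of 30 s / 3000 frames."""
    samples_per_frame = int(sample_rate * WINDOW_SECONDS / WINDOW_FRAMES)  # 160
    return num_samples // samples_per_frame


total = round(280.9 * SAMPLE_RATE)   # demo audio duration from the results table
bounds = chunk_bounds(total)
print(len(bounds), "chunks; last chunk has", bounds[-1][1] - bounds[-1][0], "samples")
print("frames in a full 28 s chunk:", frames_for(bounds[0][1] - bounds[0][0]))
```

Under these assumptions a full 28 s chunk yields fewer frames than the static 3000-frame input, so each chunk would need padding up to the window size before inference.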
Result (Embedl Parakeet INT8 encoder + upstream TDT decoder, scored against the verified ground-truth transcript after Whisper-style normalization):
| Metric | Value |
|---|---|
| Audio duration | 280.9 s (4 min 41 s) |
| Ground-truth words | 520 |
| Parakeet hypothesis words | 526 |
| Word Error Rate | 5.58 % |
Parakeet INT8 output transcript is provided for direct comparison.
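For readers who want to reproduce the metric, word error rate is the word-level edit distance divided by the number of reference words. The sketch below is a generic implementation, not the scoring script used for the table; it assumes both transcripts have already been normalized and simply splits on whitespace.

```python
def word_errors(reference, hypothesis):
    """Minimum number of word substitutions, deletions and insertions."""
    prev = list(range(len(hypothesis) + 1))
    for i, ref_word in enumerate(reference, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hypothesis, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference_text, hypothesis_text):
    """Word error rate of a hypothesis against a non-empty reference."""
    ref, hyp = reference_text.split(), hypothesis_text.split()
    return word_errors(ref, hyp) / len(ref)


# One substitution plus one deletion over four reference words.
print(wer("a b c d", "a x c"))
```

As a rough check on the table, 5.58 % of 520 reference words corresponds to about 29 word errors.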
Try it yourself:
curl -O https://huggingface.co/datasets/embedl/documentation-images/resolve/main/parakeet-tdt-0.6b-v3-quantized-tensorrt/nehru_tryst.mp3
python infer_trt.py --audio nehru_tryst.mp3
Performance
Encoder benchmarked via trtexec, GPU compute time only
(--noDataTransfers), CUDA Graph + Spin Wait enabled, 1000 iterations
after a 2 s warm-up. Input is a static (1, 3000, 128) log-mel
spectrogram (30 s window). Engine size is the on-disk .engine file;
peak GPU memory is the engine plus the per-context activation pool
reported by trtexec.
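A benchmark invocation along these lines can be assembled as shown below. This is a sketch, not the exact command used for the tables: the engine file name is hypothetical, and the precision flags are an assumption about how the INT8 + FP16 variant would be built. The remaining flags mirror the settings described above.

```python
import shlex

ONNX_PATH = "embedl_parakeet-tdt-0.6b-v3_int8.onnx"  # file listed in this repo
ENGINE_PATH = "encoder_int8.engine"                  # hypothetical output name
PRECISION_FLAGS = ["--fp16", "--int8"]               # assumption, not confirmed by the source

cmd = [
    "trtexec",
    f"--onnx={ONNX_PATH}",
    f"--saveEngine={ENGINE_PATH}",
    *PRECISION_FLAGS,
    "--noDataTransfers",   # GPU compute time only
    "--useCudaGraph",
    "--useSpinWait",
    "--warmUp=2000",       # warm-up duration in milliseconds
    "--iterations=1000",
]

# Print a copy-pasteable shell command instead of running it here.
print(shlex.join(cmd))
```

Printing rather than executing keeps the snippet usable on machines without TensorRT installed; pass `cmd` to `subprocess.run` on the target device.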
NVIDIA L4 (TensorRT 10.16)
| Precision | Mean Latency | p95 | Throughput | Engine Size | Peak GPU Memory | Speedup vs FP32 |
|---|---|---|---|---|---|---|
| FP32 | 32.82 ms | 33.62 ms | 30.5 qps | 2364 MiB | 2570 MiB | 1.00× |
| FP16 | 16.83 ms | 17.10 ms | 59.4 qps | 1177 MiB | 1280 MiB | 1.95× |
| Embedl INT8 + FP16 | 14.57 ms | 14.74 ms | 68.6 qps | 758 MiB | 862 MiB | 2.25× |
NVIDIA Jetson AGX Orin (TensorRT 10.3, JetPack 6)
| Precision | Mean Latency | p95 | Throughput | Engine Size | Peak GPU Memory | Speedup vs FP32 |
|---|---|---|---|---|---|---|
| FP32 | 62.40 ms | 62.46 ms | 16.0 qps | 2331 MiB | 2526 MiB | 1.00× |
| FP16 | 35.62 ms | 35.66 ms | 28.1 qps | 1174 MiB | 1274 MiB | 1.75× |
| Embedl INT8 + FP16 | 34.32 ms | 34.35 ms | 29.1 qps | 706 MiB | 806 MiB | 1.82× |
The Embedl INT8 + FP16 engine is 2.25× faster than FP32 on L4 and 1.82× faster on AGX Orin, with a 3.1× to 3.3× smaller engine than FP32 across both targets.
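The summary figures follow directly from the table rows. The short sketch below recomputes them from the reported mean latencies and engine sizes; it assumes throughput is simply the inverse of mean latency (one 30 s window per query), which matches the qps column.

```python
# Values copied from the FP32 and Embedl INT8 + FP16 rows of the tables above.
results = {
    "L4":       {"fp32_ms": 32.82, "int8_ms": 14.57, "fp32_mib": 2364, "int8_mib": 758},
    "AGX Orin": {"fp32_ms": 62.40, "int8_ms": 34.32, "fp32_mib": 2331, "int8_mib": 706},
}


def summarize(row):
    """Derive speedup, engine size ratio and throughput for one device."""
    return {
        "speedup": round(row["fp32_ms"] / row["int8_ms"], 2),
        "size_ratio": round(row["fp32_mib"] / row["int8_mib"], 1),
        "qps": round(1000.0 / row["int8_ms"], 1),
    }


for device, row in results.items():
    print(device, summarize(row))
```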
Accuracy (Open ASR Leaderboard)
Evaluated on the Open ASR Leaderboard test suites with the official Whisper-style English text normalizer (lowercase, number expansion, filler-word removal). Lower WER is better.
Full-dataset FP32 baseline (83,173 samples, 167.9 h audio)
Reference run with the upstream nvidia/parakeet-tdt-0.6b-v3 in FP32:
| Dataset | WER | Samples | Audio |
|---|---|---|---|
| AMI | 12.02% | 12,643 | 8.7 h |
| Earnings22 | 11.83% | 2,731 | 5.3 h |
| GigaSpeech | 9.78% | 19,931 | 35.4 h |
| LibriSpeech (clean) | 1.99% | 2,611 | 5.3 h |
| LibriSpeech (other) | 3.65% | 2,932 | 5.3 h |
| SPGISpeech | 3.87% | 39,341 | 100.0 h |
| TEDLIUM | 3.07% | 1,154 | 2.6 h |
| VoxPopuli | 6.13% | 1,830 | 4.8 h |
| Average | 6.54% | 83,173 | 167.9 h |
FP32 vs Embedl INT8: matched 500 samples per dataset
Both paths evaluated on identical sample indices (evenly spaced across each dataset) to isolate quantization accuracy loss from sampling variance. Lower WER is better.
| Dataset | FP32 WER | Embedl INT8 WER | Δ WER |
|---|---|---|---|
| AMI | 12.54% | 13.54% | +1.00% |
| Earnings22 | 12.55% | 11.72% | −0.83% |
| GigaSpeech | 9.97% | 10.21% | +0.23% |
| LibriSpeech (clean) | 2.00% | 2.10% | +0.10% |
| LibriSpeech (other) | 3.32% | 3.70% | +0.38% |
| SPGISpeech | 4.04% | 4.15% | +0.11% |
| TEDLIUM | 2.68% | 2.76% | +0.08% |
| VoxPopuli | 6.07% | 6.25% | +0.18% |
| Average | 6.65% | 6.80% | +0.16% |
Embedl INT8 quantization adds only +0.16 pp absolute WER on average, well within deployment tolerance. The AMI +1.00% outlier is within the expected variance for a 500-sample evaluation on spontaneous meeting speech (the highest natural variance of all 8 datasets). The Earnings22 −0.83% is a sampling artifact (different 500-sample distribution between the two measures).
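The Average row appears to be an unweighted mean over the eight datasets. The sketch below recomputes it from the rounded per-dataset values in the table, so small differences from the unrounded results are possible.

```python
# (FP32 WER %, Embedl INT8 WER %) per dataset, copied from the matched table.
matched = {
    "AMI": (12.54, 13.54),
    "Earnings22": (12.55, 11.72),
    "GigaSpeech": (9.97, 10.21),
    "LibriSpeech (clean)": (2.00, 2.10),
    "LibriSpeech (other)": (3.32, 3.70),
    "SPGISpeech": (4.04, 4.15),
    "TEDLIUM": (2.68, 2.76),
    "VoxPopuli": (6.07, 6.25),
}

n = len(matched)
mean_fp32 = sum(fp32 for fp32, _ in matched.values()) / n
mean_int8 = sum(int8 for _, int8 in matched.values()) / n
mean_delta = sum(int8 - fp32 for fp32, int8 in matched.values()) / n

print(f"FP32 {mean_fp32:.2f}%  INT8 {mean_int8:.2f}%  delta {mean_delta:+.2f} pp")
```

Averaging the per-dataset deltas gives +0.16 pp, while subtracting the two rounded averages gives 0.15 pp; both figures come from the same data and differ only in rounding order.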
Creating Your Own Optimized Models
This artifact was produced with embedl-deploy, Embedl's open-source PyTorch → TensorRT deployment library. You can apply the same workflow to your own models; see the documentation for installation and usage.
License
| Component | License |
|---|---|
| Optimized model artifacts (this repo) | Embedl Models Community Licence v1.0 (no redistribution as a hosted service) |
| Upstream architecture and weights | Parakeet Tdt 0.6B V3 License |
Contact
We offer engineering support for on-prem/edge deployments and partner co-marketing opportunities. Reach out at contact@embedl.com, or open an issue on GitHub.