Instructions to use nvidia/parakeet-unified-en-0.6b with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- NeMo
How to use nvidia/parakeet-unified-en-0.6b with NeMo:
    import nemo.collections.asr as nemo_asr
    asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-unified-en-0.6b")
    transcriptions = asr_model.transcribe(["file.wav"])
- Notebooks
- Google Colab
- Kaggle
Update README.md
README.md
CHANGED
|
@@ -241,7 +241,7 @@ pipeline_tag: automatic-speech-recognition
|
|
| 241 |
| [Model architecture](#model-architecture) | [Model size](#model-architecture) | [Language](#datasets) |
|
| 242 |
|---|---|---|
|
| 243 |
|
| 244 |
-
Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on transducer architecture (RNN-T) combining both offline and streaming inference (with a minimum latency of 160ms) in one model. It is trained mostly on the English part of the Granary dataset [
|
| 245 |
|
| 246 |
<figure align="center">
|
| 247 |
<img src="figures/wer_comparison.png" width="1250" />
|
|
@@ -284,9 +284,7 @@ This model is for transcription of English audio in offline and streaming modes.
|
|
| 284 |
|
| 285 |
**Architecture Type:** Unified-FastConformer-RNNT
|
| 286 |
|
| 287 |
-
The model is based on the FastConformer encoder architecture [
|
| 288 |
-
|
| 289 |
-
The paper with the details of the model architecture and training will be released soon.
|
| 290 |
|
| 291 |
**Network Architecture:**
|
| 292 |
|
|
@@ -391,7 +389,7 @@ We would recommend to use the following context parameters for different latenci
|
|
| 391 |
|
| 392 |
### Training Datasets
|
| 393 |
|
| 394 |
-
The majority of the training data comes from the English portion of the Granary dataset [
|
| 395 |
|
| 396 |
- YouTube-Commons (YTC) (109.5k hours)
|
| 397 |
- YODAS2 (102k hours)
|
|
@@ -482,12 +480,12 @@ Please report model quality, risk, security vulnerabilities or NVIDIA AI Concern
|
|
| 482 |
|
| 483 |
## References
|
| 484 |
|
| 485 |
-
|
| 486 |
|
| 487 |
-
[
|
| 488 |
|
| 489 |
-
[
|
| 490 |
|
| 491 |
-
[
|
| 492 |
|
| 493 |
-
[
|
|
|
|
| 241 |
| [Model architecture](#model-architecture) | [Model size](#model-architecture) | [Language](#datasets) |
|
| 242 |
|---|---|---|
|
| 243 |
|
| 244 |
+
Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on the transducer architecture (RNN-T) that combines offline and streaming inference (with a minimum latency of 160 ms) in one model [1]. It is trained mostly on the English part of the Granary dataset [4], which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech into the English alphabet, spaces, and apostrophes, with punctuation and capitalization support.
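
The 160 ms figure can be related to the encoder frame rate with a little arithmetic. The sketch below is illustrative only: it assumes a 10 ms feature hop (not stated in this README) combined with the 8x subsampling described in the architecture section, so that one encoder frame spans 80 ms, and it assumes latency is counted as chunk frames plus right-context frames. The function name `streaming_latency_ms` and its parameters are hypothetical.

```python
# Assumption: 10 ms hop between acoustic feature frames (not stated in the README).
FEATURE_HOP_MS = 10
# From the architecture description: initial 8x subsampling in the encoder.
SUBSAMPLING = 8


def streaming_latency_ms(chunk_frames, right_context_frames=0,
                         hop_ms=FEATURE_HOP_MS, subsampling=SUBSAMPLING):
    """Estimate algorithmic latency in ms for a chunk/right-context setting.

    Assumes latency = (chunk + right context) encoder frames, where one
    encoder frame covers hop_ms * subsampling milliseconds of audio.
    """
    frame_ms = hop_ms * subsampling
    return (chunk_frames + right_context_frames) * frame_ms


# Under these assumptions, two encoder frames give 160 ms.
print(streaming_latency_ms(2))
print(streaming_latency_ms(1, right_context_frames=1))
```

Under these assumptions, the minimum latency would correspond to two encoder frames in total; the actual recommended context parameters are the ones listed in the README itself.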
|
| 245 |
|
| 246 |
<figure align="center">
|
| 247 |
<img src="figures/wer_comparison.png" width="1250" />
|
|
|
|
| 284 |
|
| 285 |
**Architecture Type:** Unified-FastConformer-RNNT
|
| 286 |
|
| 287 |
+
The unified model architecture is presented in [1]. The model is based on the FastConformer encoder architecture [2] with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The model was trained jointly in offline and streaming modes. In offline mode, we used standard training with full-context self-attention and non-causal convolutions. In streaming mode, we applied chunked self-attention masks (including left, middle/chunk, and right context) together with Dynamic Chunked Convolutions inside each FastConformer layer [3] to adapt the model to both decoding scenarios. We also introduced a novel mode-consistency regularization loss to further reduce the gap between offline and streaming performance. All model parameters (encoder, predictor, and joint networks) are shared between offline and streaming modes, including the initial 8x subsampling with non-causal convolutions.
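
To make the chunked self-attention mask concrete, here is a minimal pure-Python sketch of one plausible way to build it. It is not taken from the paper or from NeMo: the function names, the parameter names, and the exact masking rule (each frame sees its own chunk, plus a fixed number of frames to the left and right of that chunk) are assumptions chosen for illustration.

```python
def chunked_attention_mask(num_frames, chunk_size, left_context, right_context):
    """Build a boolean mask; mask[i][j] is True when frame i may attend to frame j.

    Assumed rule (illustrative, not NeMo's implementation): frame i sees its
    own chunk, plus `left_context` frames before the chunk and
    `right_context` frames after it, clipped to the sequence bounds.
    """
    mask = []
    for i in range(num_frames):
        chunk_start = (i // chunk_size) * chunk_size
        chunk_end = chunk_start + chunk_size  # exclusive
        lo = max(0, chunk_start - left_context)
        hi = min(num_frames, chunk_end + right_context)
        mask.append([lo <= j < hi for j in range(num_frames)])
    return mask


def visible_frames(mask, i):
    """List the key-frame indices that query frame i may attend to."""
    return [j for j, allowed in enumerate(mask[i]) if allowed]


# Streaming-style mask: chunks of 2 frames, 2 frames of left and 1 of right context.
streaming = chunked_attention_mask(8, chunk_size=2, left_context=2, right_context=1)
print(visible_frames(streaming, 4))  # [2, 3, 4, 5, 6]

# Offline mode is the limiting case where one chunk covers the whole sequence.
offline = chunked_attention_mask(8, chunk_size=8, left_context=0, right_context=0)
print(all(all(row) for row in offline))  # True
```

Treating offline decoding as a single chunk spanning the whole utterance is one way to see how the same shared parameters can serve both modes: only the mask changes, not the weights.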
|
|
|
|
|
|
|
| 288 |
|
| 289 |
**Network Architecture:**
|
| 290 |
|
|
|
|
| 389 |
|
| 390 |
### Training Datasets
|
| 391 |
|
| 392 |
+
The majority of the training data comes from the English portion of the Granary dataset [4]:
|
| 393 |
|
| 394 |
- YouTube-Commons (YTC) (109.5k hours)
|
| 395 |
- YODAS2 (102k hours)
|
|
|
|
| 480 |
|
| 481 |
## References
|
| 482 |
|
| 483 |
+
[1] [Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization](https://arxiv.org/abs/2604.19079)
|
| 484 |
|
| 485 |
+
[2] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
|
| 486 |
|
| 487 |
+
[3] [Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR](https://arxiv.org/abs/2304.09325)
|
| 488 |
|
| 489 |
+
[4] [NVIDIA Granary](https://huggingface.co/datasets/nvidia/Granary)
|
| 490 |
|
| 491 |
+
[5] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
|