aandrusenko committed on
Commit d4ac992 · verified · 1 Parent(s): a37f219

Update README.md

Files changed (1)
  1. README.md +8 -10
README.md CHANGED
@@ -241,7 +241,7 @@ pipeline_tag: automatic-speech-recognition
  | [Model architecture](#model-architecture) | [Model size](#model-architecture) | [Language](#datasets) |
  |---|---|---|
 
- Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on the transducer architecture (RNN-T) that combines offline and streaming inference (with a minimum latency of 160 ms) in one model. It is trained mostly on the English part of the Granary dataset [3], which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech into English letters, spaces, and apostrophes, with punctuation and capitalization support.
 
  <figure align="center">
  <img src="figures/wer_comparison.png" width="1250" />
@@ -284,9 +284,7 @@ This model is for transcription of English audio in offline and streaming modes.
 
  **Architecture Type:** Unified-FastConformer-RNNT
 
- The model is based on the FastConformer encoder architecture [1] with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The model was trained jointly in offline and streaming modes. In offline mode, we used standard training with full-context self-attention and non-causal convolutions. In streaming mode, we applied chunked self-attention masks (including left, middle/chunk, and right context) together with Dynamic Chunked Convolutions inside each FastConformer layer [2] to adapt the model to both decoding scenarios. We also introduced a novel mode-consistency regularization loss to further reduce the gap between offline and streaming performance. All model parameters (encoder, predictor, and joint networks) are shared between offline and streaming modes, including the initial 8x subsampling with non-causal convolutions.
-
- The paper with the details of the model architecture and training will be released soon.
 
  **Network Architecture:**
 
@@ -391,7 +389,7 @@ We would recommend to use the following context parameters for different latenci
 
  ### Training Datasets
 
- The majority of the training data comes from the English portion of the Granary dataset [3]:
 
  - YouTube-Commons (YTC) (109.5k hours)
  - YODAS2 (102k hours)
@@ -482,12 +480,12 @@ Please report model quality, risk, security vulnerabilities or NVIDIA AI Concern
 
  ## References
 
- <!-- [1] [Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition](https://arxiv.org/abs/2312.17279) -->
 
- [1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
 
- [2] [Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR](https://arxiv.org/abs/2304.09325)
 
- [3] [NVIDIA Granary](https://huggingface.co/datasets/nvidia/Granary)
 
- [4] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)
 
@@ -241,7 +241,7 @@
  | [Model architecture](#model-architecture) | [Model size](#model-architecture) | [Language](#datasets) |
  |---|---|---|
 
+ Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on the transducer architecture (RNN-T) that combines offline and streaming inference (with a minimum latency of 160 ms) in one model [1]. It is trained mostly on the English part of the Granary dataset [4], which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech into English letters, spaces, and apostrophes, with punctuation and capitalization support.
 
  <figure align="center">
  <img src="figures/wer_comparison.png" width="1250" />
 
@@ -284,9 +284,7 @@
 
  **Architecture Type:** Unified-FastConformer-RNNT
 
+ The unified model architecture is presented in [1]. The model is based on the FastConformer encoder architecture [2] with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The model was trained jointly in offline and streaming modes. In offline mode, we used standard training with full-context self-attention and non-causal convolutions. In streaming mode, we applied chunked self-attention masks (including left, middle/chunk, and right context) together with Dynamic Chunked Convolutions inside each FastConformer layer [3] to adapt the model to both decoding scenarios. We also introduced a novel mode-consistency regularization loss to further reduce the gap between offline and streaming performance. All model parameters (encoder, predictor, and joint networks) are shared between offline and streaming modes, including the initial 8x subsampling with non-causal convolutions.
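
The chunked self-attention masking mentioned in the paragraph above can be illustrated with a small NumPy sketch. This is not taken from the README or from NeMo: the function name `chunked_attention_mask`, its parameters, and the convention that context sizes are counted in encoder frames (after subsampling) are all assumptions made for illustration. It assumes every frame in a chunk sees its own chunk plus a fixed number of left and right context frames.

```python
import numpy as np


def chunked_attention_mask(num_frames: int, chunk: int, left: int, right: int) -> np.ndarray:
    """Return a boolean [T, T] mask; mask[q, k] is True if query frame q may attend to key frame k.

    Hypothetical helper: sizes are in encoder frames, and all frames of a
    chunk share the same visible window (left context + chunk + right context).
    """
    idx = np.arange(num_frames)
    chunk_start = (idx // chunk) * chunk   # first frame of each query's chunk
    chunk_end = chunk_start + chunk - 1    # last frame of that chunk
    lo = chunk_start - left                # earliest visible key frame
    hi = chunk_end + right                 # latest visible key frame
    keys = idx[None, :]
    return (keys >= lo[:, None]) & (keys <= hi[:, None])


# Streaming-style mask: chunks of 2 frames, 2 frames of left and 1 of right context.
streaming_mask = chunked_attention_mask(num_frames=8, chunk=2, left=2, right=1)

# Offline-style mask: a single chunk covering the whole utterance gives full context.
offline_mask = chunked_attention_mask(num_frames=8, chunk=8, left=0, right=0)
```

Under these assumptions, the offline case is simply the limit where one chunk spans the entire input, which is one way to see how a single set of weights can serve both modes.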
 
 
 
  **Network Architecture:**
 
 
@@ -391,7 +389,7 @@
 
  ### Training Datasets
 
+ The majority of the training data comes from the English portion of the Granary dataset [4]:
 
  - YouTube-Commons (YTC) (109.5k hours)
  - YODAS2 (102k hours)
 
@@ -482,12 +480,12 @@
 
  ## References
 
+ [1] [Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization](https://arxiv.org/abs/2604.19079)
 
+ [2] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
 
+ [3] [Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR](https://arxiv.org/abs/2304.09325)
 
+ [4] [NVIDIA Granary](https://huggingface.co/datasets/nvidia/Granary)
 
+ [5] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)