nvidia/parakeet-unified-en-0.6b

@@ -241,7 +241,7 @@ pipeline_tag: automatic-speech-recognition
 | [Model architecture](#model-architecture) | [Model size](#model-architecture) | [Language](#datasets) |
 |---|---|---|
-Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on transducer architecture (RNN-T) combining both offline and streaming inference (with a minimum latency of 160ms) in one model. It is trained mostly on the English part of the Granary dataset [3], which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech to English alphabet, spaces, and apostrophes with punctuation and capitalization support.
 <figure align="center">
   <img src="figures/wer_comparison.png" width="1250" />
@@ -284,9 +284,7 @@ This model is for transcription of English audio in offline and streaming modes.
 **Architecture Type:** Unified-FastConformer-RNNT
-The model is based on the FastConformer encoder architecture [1] with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The model was trained jointly in offline and streaming modes. In the offline mode we used standard offline training with full-context self-attention and non-causal convolutions. In the streaming mode we applied chunked self-attention masks (including left, middle/chunk and right context) together with Dynamic Chunked Convolutions inside each FastConformer layer [2] to adapt the model to both decoding scenarios. We also introduced a novel mode-consistency regularization loss to further reduce the gap between offline and streaming performance. All the model parameters are shared between offline and streaming modes (encoder, predictor, and joint networks), including initial x8 subsampling with non-causal convolutions.
-The paper with the details of the model architecture and training will be released soon.
 **Network Architecture:**
@@ -391,7 +389,7 @@ We would recommend to use the following context parameters for different latenci
 ### Training Datasets
-The majority of the training data comes from the English portion of the Granary dataset [3]:
 - YouTube-Commons (YTC) (109.5k hours)
 - YODAS2 (102k hours)
@@ -482,12 +480,12 @@ Please report model quality, risk, security vulnerabilities or NVIDIA AI Concern
 ## References
-<!-- [1] [Stateful Conformer with Cache-based Inference for Streaming Automatic Speech Recognition](https://arxiv.org/abs/2312.17279) -->
-[1] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
-[2] [Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR](https://arxiv.org/abs/2304.09325)
-[3] [NVIDIA Granary](https://huggingface.co/datasets/nvidia/Granary)
-[4] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)

 | [Model architecture](#model-architecture) | [Model size](#model-architecture) | [Language](#datasets) |
 |---|---|---|
+Parakeet-unified-en-0.6b is an English automatic speech recognition (ASR) model based on the transducer (RNN-T) architecture that combines offline and streaming inference (with a minimum latency of 160 ms) in one model [1]. It is trained mostly on the English part of the Granary dataset [4], which contains approximately 250,000 hours of US English (en-US) speech across diverse acoustic conditions. The model transcribes speech into English letters, spaces, and apostrophes, with punctuation and capitalization support.
 <figure align="center">
   <img src="figures/wer_comparison.png" width="1250" />
 **Architecture Type:** Unified-FastConformer-RNNT
+The unified model architecture is presented in [1]. The model is based on the FastConformer encoder architecture [2] with 24 encoder layers and an RNNT (Recurrent Neural Network Transducer) decoder. The model was trained jointly in offline and streaming modes. In offline mode, we used standard training with full-context self-attention and non-causal convolutions. In streaming mode, we applied chunked self-attention masks (including left, middle/chunk, and right context) together with Dynamic Chunked Convolutions inside each FastConformer layer [3] to adapt the model to both decoding scenarios. We also introduced a novel mode-consistency regularization loss to further reduce the gap between offline and streaming performance. All model parameters (encoder, predictor, and joint networks) are shared between offline and streaming modes, including the initial 8x subsampling with non-causal convolutions.
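
To make the idea of a chunked self-attention mask with left, chunk, and right context more concrete, here is a minimal sketch in plain Python. It is not the model's implementation: the function name `chunked_attention_mask`, its parameters, and the exact masking rule (each frame sees its own chunk plus a fixed number of frames on either side of that chunk) are assumptions chosen for illustration. The actual scheme is described in [1].

```python
def chunked_attention_mask(num_frames, left, chunk, right):
    """Build a boolean mask where mask[i][j] is True if query frame i
    may attend to key frame j.

    Illustrative assumption: frames are grouped into consecutive chunks of
    `chunk` frames, and every frame in a chunk sees the whole chunk plus
    `left` frames before it and `right` frames after it.
    """
    if chunk <= 0 or left < 0 or right < 0:
        raise ValueError("chunk must be positive; left and right must be >= 0")

    mask = []
    for i in range(num_frames):
        chunk_start = (i // chunk) * chunk              # first frame of i's chunk
        chunk_end = min(chunk_start + chunk, num_frames)  # one past the last frame
        lo = max(0, chunk_start - left)                 # left context boundary
        hi = min(num_frames, chunk_end + right)         # right context boundary
        mask.append([lo <= j < hi for j in range(num_frames)])
    return mask


def render(mask):
    """Render the mask as text: 'x' = attention allowed, '.' = masked out."""
    return "\n".join("".join("x" if v else "." for v in row) for row in mask)


if __name__ == "__main__":
    # Streaming-style mask: 2 frames of left context, chunks of 2, 1 frame of lookahead.
    print(render(chunked_attention_mask(8, left=2, chunk=2, right=1)))
```

Under this rule, setting `chunk` to the full sequence length gives an all-True mask, which corresponds to the full-context attention used in offline mode.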
 **Network Architecture:**
 ### Training Datasets
+The majority of the training data comes from the English portion of the Granary dataset [4]:
 - YouTube-Commons (YTC) (109.5k hours)
 - YODAS2 (102k hours)
 ## References
+[1] [Reducing the Offline-Streaming Gap for Unified ASR Transducer with Consistency Regularization](https://arxiv.org/abs/2604.19079)
+[2] [Fast Conformer with Linearly Scalable Attention for Efficient Speech Recognition](https://arxiv.org/abs/2305.05084)
+[3] [Dynamic Chunk Convolution for Unified Streaming and Non-Streaming Conformer ASR](https://arxiv.org/abs/2304.09325)
+[4] [NVIDIA Granary](https://huggingface.co/datasets/nvidia/Granary)
+[5] [NVIDIA NeMo Framework](https://github.com/NVIDIA/NeMo)