---
base_model:
- openai/whisper-large-v3-turbo
base_model_relation: quantized
license: cc-by-4.0
pipeline_tag: automatic-speech-recognition
tags:
- automatic-speech-recognition
- audio
- speech
- whisper
- multilingual
- streaming
- coreml
- cuda
- nvidia
- apple-silicon
- on-device
language:
- en
- ar
- bg
- bn
- cs
- da
- de
- el
- es
- et
- fi
- fr
- hi
- hu
- id
- it
- lt
- lv
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sv
- uk
- vi
---

# Elastic model: thewhisper-large-v3-turbo

Project GitHub: [TheWhisper](https://github.com/TheStageAI/TheWhisper/tree/main)

**TheWhisper-Large-V3-Turbo** is a fine-tuned, high-performance variant of OpenAI’s Whisper Large V3 Turbo model, optimized by **TheStage AI** for **real-time**, **low-latency**, and **low-power** speech-to-text (ASR) inference across multiple platforms, including **NVIDIA GPUs** and **Apple Silicon (CoreML)**.

*Media: vanilla Whisper; TheStage AI Whisper.*

## Overview

---

ElasticModels are the models produced by TheStage AI ANNA: Automated Neural Networks Accelerator. ANNA lets you control model size, latency, and quality with a simple slider movement, routing different compression algorithms to different layers. For each model, we have produced a series of optimized models:

- **XL**: Mathematically equivalent neural network, optimized with our DNN compiler.
- **L**: Near-lossless model, with less than 1% degradation on the corresponding benchmarks.
- **M**: Faster model, with accuracy degradation of less than 1.5%.
- **S**: The fastest model, with accuracy degradation of less than 2%.

Models can be accessed via TheStage AI Python SDK (ElasticModels), or deployed as Docker containers with REST API endpoints (see the Serving with Docker Image section).

## System Requirements

---

| **Property** | **Value** |
| --- | --- |
| **GPU** | L40s, RTX 4090, RTX 5090, H100 |
| **Python Version** | 3.10-3.12 |
| **CPU** | Intel/AMD x86_64 |
| **CUDA Version** | 12.8+ |

## TheStage AI Access Token Setup

---

Install TheStage AI CLI and set up your API token:

```bash
pip install thestage
thestage config set --access-token <YOUR_ACCESS_TOKEN>
```

## ElasticModels

---

Elastic Models provides the same interface as Hugging Face Transformers. Here is an example of how to use the thewhisper-large-v3-turbo model.

### Installation

```bash
pip install 'thestage-elastic-models[nvidia]' \
    --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
pip install datasets==3.6.0 librosa soundfile  # only needed to load audio for the example below
```

### Usage

```python
import torch
from datasets import load_dataset
from elastic_models.transformers import AutoModelForSpeechSeq2Seq
from transformers import AutoProcessor

model_name = "TheStageAI/thewhisper-large-v3-turbo"
hf_token = ''  # your Hugging Face access token
device = torch.device("cuda")

processor = AutoProcessor.from_pretrained(
    model_name,
    token=hf_token
)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_name,
    token=hf_token,
    torch_dtype=torch.float16,
    mode='S'
).to(device)

# Load audio file
dataset = load_dataset(
    "hf-internal-testing/librispeech_asr_dummy",
    "clean",
    split="validation"
)
audio_sample = dataset[0]["audio"]

# Process audio
input_features = processor(
    audio_sample["array"],
    sampling_rate=audio_sample["sampling_rate"],
    return_tensors="pt"
).input_features.to(device, dtype=torch.float16)

# Generate transcription
with torch.inference_mode():
    predicted_ids = model.generate(input_features)

transcription = processor.batch_decode(
    predicted_ids,
    skip_special_tokens=True
)[0]
print(f"Transcription: {transcription}")
```

## TheWhisper SpeechKit

---

TheWhisper can also be used via `thestage_speechkit` for NVIDIA and Apple Silicon inference, including real-time streaming.

### Installation

```bash
# Install ffmpeg (required for audio processing)
# Ubuntu/Debian: apt install ffmpeg
# macOS: brew install ffmpeg

git clone https://github.com/TheStageAI/TheWhisper.git
cd TheWhisper
pip install '.[nvidia]'
# or pip install '.[apple]' for Apple Silicon
```

For TheStage AI optimized engines (NVIDIA only):

```bash
pip install 'thestage-elastic-models[nvidia]' --extra-index-url https://thestage.jfrog.io/artifactory/api/pypi/pypi-thestage-ai-production/simple
```

### NVIDIA Usage

```python
from thestage_speechkit.nvidia import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    model_size='S',
    chunk_length_s=15,
    batch_size=32,
    device='cuda'
)

result = model(
    "path_to_your_audio.wav",
    chunk_length_s=15,
    generate_kwargs={'do_sample': False, 'num_beams': 1, 'use_cache': True}
)
print(result["text"])
```

### Apple Silicon Usage

```python
from thestage_speechkit.apple import ASRPipeline

model = ASRPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    model_size='S',
    chunk_length_s=10
)

result = model(
    "path_to_your_audio.wav",
    chunk_length_s=10,
    generate_kwargs={'do_sample': False, 'num_beams': 1, 'use_cache': True}
)
print(result["text"])
```

### Streaming

```python
from thestage_speechkit.streaming import StreamingPipeline, MicStream, StdoutStream

streaming_pipe = StreamingPipeline(
    model='TheStageAI/thewhisper-large-v3-turbo',
    model_size='S',
    chunk_length_s=15,
    platform='apple',
    language='en'
)

mic_stream = MicStream(step_size_s=0.5)
output_stream = StdoutStream()

# Read microphone chunks until the stream ends
while True:
    chunk = mic_stream.next_chunk()
    if chunk is not None:
        approved_text, assumption = streaming_pipe(chunk)
        output_stream.write(approved_text, assumption)
    else:
        break
```

## Quality Benchmarks

---

We evaluated the models using the Hugging Face Open ASR Leaderboard methodology. For each model size (S, M, L, XL), we report Word Error Rate (WER) on standard English and multilingual speech recognition benchmarks.

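As a point of reference for the WER figures, the sketch below computes WER for a single reference/hypothesis pair with a word-level edit distance. It is an illustration only: the function name `word_error_rate` is hypothetical, the text is split on whitespace without any normalization, and leaderboard-style evaluations typically normalize text and aggregate errors over a whole dataset rather than per utterance.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + insertions + deletions) / number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("sat" -> "sit") and one deletion ("the") over 6 reference words
wer = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER: {wer * 100:.2f}%")  # WER: 33.33%
```
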
![Quality Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773836234-75d3e865-d537-475a-b991-fe8cda463108/TheWhisper_Large_Turbo_WER.png)

### Open ASR Leaderboard (English, WER %)

| **Dataset** | **S** | **M** | **L** | **XL** | **Original** |
| --- | --- | --- | --- | --- | --- |
| **LibriSpeech Clean** | 1.83 | 1.8 | 1.74 | 1.73 | 1.71 |
| **LibriSpeech Other** | 3.77 | 3.76 | 3.75 | 3.72 | 3.63 |
| **SPGISpeech** | 1.92 | 1.93 | 1.92 | 1.88 | 1.88 |
| **TED-LIUM** | 3.37 | 3.3 | 3.29 | 3.25 | 3.33 |
| **VoxPopuli** | 7.37 | 6.36 | 6.34 | 6.71 | 6.28 |
| **GigaSpeech** | 9.56 | 9.54 | 9.53 | 9.48 | 9.51 |
| **Earnings-22** | 11.57 | 11.15 | 11.09 | 11.21 | 10.89 |
| **AMI** | 9.6 | 9.35 | 9.38 | 9.14 | 9.2 |
| **Mean WER** | 6.12 | 5.9 | 5.88 | 5.89 | 5.8 |

### Multilingual (WER %)

| **Dataset** | **S** | **M** | **L** | **XL** | **Original** |
| --- | --- | --- | --- | --- | --- |
| **CoVoST2 DE** | 4.47 | 4.44 | 4.38 | 4.32 | 4.34 |
| **CoVoST2 ES** | 3.41 | 3.4 | 3.35 | 3.33 | 3.3 |
| **CoVoST2 FR** | 6.09 | 6.09 | 6.01 | 5.95 | 6.01 |
| **CoVoST2 IT** | 3.81 | 3.66 | 3.64 | 3.74 | 3.71 |
| **CoVoST2 PT** | 2.08 | 2.02 | 2.0 | 1.97 | 2.02 |
| **FLEURS DE** | 4.62 | 4.66 | 4.66 | 4.5 | 4.36 |
| **FLEURS ES** | 3.25 | 3.16 | 3.15 | 3.04 | 2.98 |
| **FLEURS FR** | 5.18 | 5.15 | 5.27 | 5.25 | 4.99 |
| **FLEURS IT** | 3.21 | 3.26 | 3.43 | 3.45 | 2.82 |
| **FLEURS PT** | 4.81 | 4.78 | 4.7 | 4.7 | 4.55 |
| **MLS French** | 4.67 | 4.54 | 4.48 | 4.31 | 3.77 |
| **MLS German** | 4.28 | 4.16 | 4.19 | 4.12 | 3.77 |
| **MLS Italian** | 6.89 | 7.01 | 7.11 | 7.36 | 6.12 |
| **MLS Portuguese** | 5.69 | 5.53 | 5.34 | 6.3 | 4.56 |
| **MLS Spanish** | 2.93 | 2.84 | 2.71 | 2.85 | 2.51 |
| **Mean WER** | 4.36 | 4.31 | 4.29 | 4.35 | 3.99 |

## Datasets

---

### English (Open ASR Leaderboard)

- **LibriSpeech Clean**: Read English speech from audiobooks, recorded in clean conditions. Tests baseline transcription accuracy on clear, well-articulated speech.
- **LibriSpeech Other**: Read English speech from audiobooks with more challenging acoustic conditions, including noisier recordings and less common speakers.
- **SPGISpeech**: Financial earnings calls and presentations, featuring domain-specific terminology, spontaneous speech, and diverse speaker accents.
- **TED-LIUM**: TED conference talks covering a wide range of topics, with diverse speakers, presentation styles, and varying audio quality.
- **VoxPopuli**: European Parliament event recordings in multiple languages, featuring political discourse, formal speech, and multilingual speakers.
- **GigaSpeech**: Large-scale multi-domain English speech corpus from audiobooks, podcasts, and YouTube, representing diverse acoustic conditions and speaking styles.
- **Earnings-22**: Corporate earnings calls with financial terminology, multiple speakers, and telephone-quality audio.
- **AMI**: Meeting recordings with overlapping speech, distant microphones, and natural conversational dynamics.

### Multilingual

- **CoVoST2**: Common Voice Speech-To-Text 2. Built on Mozilla's Common Voice recordings, providing speech-to-text evaluation across 21 languages with diverse speakers, accents, and recording conditions.
- **FLEURS**: Few-shot Learning Evaluation of Universal Representations of Speech. Covers 102 languages with read speech from Wikipedia passages.
- **MLS**: Multilingual LibriSpeech. Derived from read audiobooks in 8 languages, providing large-scale multilingual ASR evaluation data.

## Metrics

---

- **WER (Word Error Rate)**: Measures the proportion of word-level errors (substitutions, insertions, deletions) in the transcription compared to the reference text. Lower values indicate better accuracy.

## Latency Benchmarks

---

We measured RTFx (inverse real-time factor) for each model size on various GPUs. RTFx indicates how many times faster than real time the model transcribes audio. Higher RTFx is better.

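To make the RTFx figure concrete, here is a minimal timing sketch. It assumes RTFx is the audio duration divided by the wall-clock transcription time; `measure_rtfx`, the `transcribe` callable, and the optional `sync` hook are hypothetical names introduced for illustration (on a CUDA device, `sync` could be `torch.cuda.synchronize`). It is not the harness used to produce the published numbers.

```python
import time
from typing import Callable, Optional


def rtfx(audio_duration_s: float, processing_time_s: float) -> float:
    """Seconds of audio transcribed per second of wall-clock time."""
    if processing_time_s <= 0:
        raise ValueError("processing_time_s must be positive")
    return audio_duration_s / processing_time_s


def measure_rtfx(
    transcribe: Callable[[], object],
    audio_duration_s: float,
    sync: Optional[Callable[[], None]] = None,
) -> float:
    """Time one call to `transcribe` after an untimed warm-up call."""
    transcribe()  # warm-up run, excluded from the measurement
    if sync is not None:
        sync()  # make sure pending device work has finished before timing
    start = time.perf_counter()
    transcribe()
    if sync is not None:
        sync()  # wait for the timed run to complete on the device
    elapsed = time.perf_counter() - start
    return rtfx(audio_duration_s, elapsed)


# 600 s of audio transcribed in 4 s is 150x faster than real time
print(rtfx(600.0, 4.0))  # 150.0
```
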
![Latency Benchmarking](https://cdn.thestage.ai/production/cms_file_upload/1773836539-93e2e10c-e663-4396-ac64-d7f700246514/TheWhisper_Large_Turbo_RTF_h100.png)

### RTFx, batch size 1

| **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
| --- | --- | --- | --- | --- | --- |
| **H100** | 192.9 | 191.0 | 184.2 | 182.9 | 109.2 |
| **L40s** | 162.6 | 155.2 | 152.0 | 147.0 | 81.8 |
| **GeForce RTX 5090** | 172 | 164 | 169 | 149 | 114 |
| **GeForce RTX 4090** | 231.1 | 229.5 | 217.3 | 183.9 | 157.6 |

### RTFx, batched

| **GPU/Model Size** | **S** | **M** | **L** | **XL** | **Original** |
| --- | --- | --- | --- | --- | --- |
| **RTX 4090 (bs=24)** | 895 | 880 | 871 | 790 | 702 |
| **L40s (bs=32)** | 877 | 850 | 829 | 829 | 539 |
| **RTX 5090 (bs=32)** | 983 | 983 | 983 | 976 | 484 |
| **H100 (bs=64)** | 1950 | 1828 | 1820 | 1818 | 967 |

## Benchmarking Methodology

---

Benchmarking was performed on a single GPU using a 10-minute audio file resampled to 16 kHz mono. RTFx (inverse real-time factor) is calculated as `audio_duration / transcription_time`; higher values mean faster transcription relative to real time.

> **Algorithm summary:**
> 1. Load the thewhisper-large-v3-turbo model with the specified size (S, M, L, XL, original).
> 2. Load a 10-minute audio file and resample to 16 kHz mono.
> 3. Run a warm-up pass to initialize GPU caches.
> 4. Synchronize the GPU, record the start time.
> 5. Run the transcription pipeline with the specified batch size and chunk length.
> 6. Synchronize the GPU, record the end time.
> 7. Calculate RTFx as `audio_duration / time_taken`.

## Serving with Docker Image

---

For serving on NVIDIA GPUs, we provide ready-to-go Docker containers with OpenAI-compatible API endpoints. With these containers you can set up an inference endpoint on any cloud or serverless provider, as well as on on-premise servers. You can also use this container to run inference through TheStage AI platform.

### Prebuilt image from ECR

---

Pull the Docker image and start the inference container:

```bash
docker pull public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-stt-streaming-24.09c
```

```bash
docker run --rm -it \
    --name triton-stt \
    --gpus all \
    -p 127.0.0.1:80:80 \
    -v "$HOME/.cache:/opt/project/.cache/" \
    -e MODEL_REPO=TheStageAI/thewhisper-large-v3-turbo \
    -e MODEL_SIZE=<MODEL_SIZE> \
    -e MODEL_BATCH=<MODEL_BATCH> \
    -e PIPELINE_MAX_BATCH_SIZE=<PIPELINE_MAX_BATCH_SIZE> \
    -e CHUNK_LENGTH=<CHUNK_LENGTH> \
    -e HUGGINGFACE_ACCESS_TOKEN=<HUGGINGFACE_ACCESS_TOKEN> \
    -e THESTAGE_AUTH_TOKEN=<THESTAGE_AUTH_TOKEN> \
    public.ecr.aws/i3f7g5s7/thestage/elastic-models:0.2.0-stt-streaming-24.09c
```

| **Parameter** | **Description** |
| --- | --- |
| `<MODEL_SIZE>` | Available: S, M, L, XL. |
| `<MODEL_BATCH>` | Maximum batch size for the model. |
| `<PIPELINE_MAX_BATCH_SIZE>` | Maximum batch size for the ASR pipeline processing. |
| `<CHUNK_LENGTH>` | Audio chunk length in seconds (e.g., 10, 15, 20, 30). |
| `<HUGGINGFACE_ACCESS_TOKEN>` | Hugging Face access token. |
| `<THESTAGE_AUTH_TOKEN>` | TheStage token generated on the platform (Profile -> Access tokens). |

## Invocation

---

### CLI

```bash
elastic-models-client client stt --sample sample.wav --lang-id en
```

### cURL

```bash
curl -X POST http://127.0.0.1:80/v1/audio/transcriptions \
    -H "Authorization: Bearer 123" \
    -H "X-Lang-Id: en" \
    -H "X-Model-Name: --bs" \
    -F "file=@sample.wav"
```

## Endpoint Parameters

---

### Method

> **POST** `/v1/audio/transcriptions`

### Header Parameters

> `Authorization`: `string`
>
> Bearer token for authentication.

> `X-Lang-Id`: `string`
>
> Language of the audio (e.g., "en", "es", "fr").

> `X-Model-Name`: `string`
>
> Specifies the model to use for transcription. Format: `--bs`, where the size component is one of `S`, `M`, `L`, `XL`, `original` and the value following `bs` is the `MODEL_BATCH` configured during container startup.

### Input Body

> `file` : `binary`
>
> The audio file to transcribe (multipart/form-data).

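The same request can be issued from Python. The sketch below uses only the standard library and mirrors the cURL call: it builds a `multipart/form-data` body with a single `file` field and sets the three headers listed above. The helper names `build_transcription_request` and `transcribe` are hypothetical, the `X-Model-Name` value must be supplied by the caller in the format described above, and the response is returned as raw text because the response schema is not documented here.

```python
import os
import urllib.request
import uuid


def build_transcription_request(url, audio_bytes, filename, lang_id, model_name, token):
    """Build a POST request carrying the audio as a multipart 'file' field."""
    boundary = uuid.uuid4().hex
    body = b"".join([
        f"--{boundary}\r\n".encode(),
        f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'.encode(),
        b"Content-Type: application/octet-stream\r\n\r\n",
        audio_bytes,
        f"\r\n--{boundary}--\r\n".encode(),
    ])
    headers = {
        "Authorization": f"Bearer {token}",
        "X-Lang-Id": lang_id,
        "X-Model-Name": model_name,
        "Content-Type": f"multipart/form-data; boundary={boundary}",
    }
    return urllib.request.Request(url, data=body, headers=headers, method="POST")


def transcribe(path, model_name, token, lang_id="en",
               url="http://127.0.0.1:80/v1/audio/transcriptions"):
    """Send an audio file to the endpoint and return the raw response body."""
    with open(path, "rb") as f:
        audio_bytes = f.read()
    request = build_transcription_request(
        url, audio_bytes, os.path.basename(path), lang_id, model_name, token
    )
    with urllib.request.urlopen(request) as response:
        return response.read().decode("utf-8")
```

With a container running locally, calling `transcribe("sample.wav", model_name=..., token="123")` would send the same request as the cURL example, with `model_name` set to the value you would pass in `X-Model-Name`.
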
## Links

---

* __Platform__: [app.thestage.ai](https://app.thestage.ai)
* __Subscribe for updates__: [TheStageAI X](https://x.com/TheStageAI)
* __Contact email__: contact@thestage.ai