Whisper Large v3 Turbo — Hebrew (ivrit-ai) — ONNX timestamped variant for transformers.js
This repository hosts the in-browser-ready ONNX build of ivrit-ai/whisper-large-v3-turbo.
All credit for the Hebrew fine-tuning, training data curation, and
model weights goes to the ivrit-ai team (website
· GitHub · HuggingFace).
The fine-tune was trained on three ivrit-ai datasets:
crowd-transcribe-v5,
crowd-recital-whisper-training,
and knesset-plenums-whisper-training.
This repository contributes only the ONNX format conversion — no
additional training, fine-tuning, or model modification of any kind.
The conversion adds the cross_attentions.{0..3} outputs that
transformers.js v3's _extract_token_timestamps requires for word-level
timestamps. This is the same shape that
onnx-community/whisper-large-v3-turbo_timestamped ships, but with
ivrit-ai's Hebrew-fine-tuned weights instead of OpenAI's generic ones.
It also provides q8 quantized variants restricted to MatMul ops so the
model runs in ONNX Runtime Web's WASM build.
Why this exists
For in-browser Hebrew transcription with word-level timestamps, the options before this conversion were:
- onnx-community/whisper-large-v3-turbo_timestamped: generic OpenAI Whisper turbo, no Hebrew fine-tuning. Validated on a 1h 44m Hebrew podcast episode at 0.80× the word count of the ivrit-ai-served reference transcript: a consistent ~20% content deficit across the episode, including ~10 chunks where 50-60 words of dense conversation collapsed to 0-2 hallucinated tokens.
- ivrit-ai/whisper-large-v3-turbo itself: only available as PyTorch / safetensors weights. Not directly loadable by transformers.js without ONNX conversion. The optimum-cli default export doesn't include the cross-attention outputs needed for word timestamps.
This repository closes the gap: ivrit-ai's fine-tune in the exact ONNX shape transformers.js v3 expects.
Usage (transformers.js v3)
    import { pipeline } from '@huggingface/transformers'

    const pipe = await pipeline(
      'automatic-speech-recognition',
      'krivlin/whisper-large-v3-turbo-ivrit-ai-timestamped-onnx',
      { dtype: 'q8' }, // recommended: q8 encoder + fp32 decoder
    )

    const result = await pipe(audioFloat32Array, {
      chunk_length_s: 30,
      stride_length_s: 5,
      return_timestamps: 'word',
      language: 'he',
      task: 'transcribe',
    })
    // result.chunks: [{ text, timestamp: [start_sec, end_sec] }, ...]
Run inside a Web Worker. The page must be cross-origin isolated
(serve with Cross-Origin-Opener-Policy: same-origin and
Cross-Origin-Embedder-Policy: require-corp) so SharedArrayBuffer
is available — required for multi-threaded WASM.
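The isolation requirement can be checked at runtime before choosing a thread count. The following is a minimal sketch, not part of this repository: `pickWasmThreads` and its cap of 4 threads are hypothetical, while `crossOriginIsolated` and `navigator.hardwareConcurrency` are standard browser globals. Where the result gets applied (for example ONNX Runtime Web's `wasm.numThreads` setting) is an assumption to verify against the transformers.js version you use.

```javascript
// Response headers the hosting server must send for cross-origin isolation
// (values taken from the note above).
const ISOLATION_HEADERS = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
};

// Hypothetical helper: choose a WASM thread count.
// Without cross-origin isolation SharedArrayBuffer is unavailable,
// so the only safe choice is a single thread.
function pickWasmThreads(isolated, hardwareConcurrency, cap = 4) {
  if (!isolated) return 1;
  const cores =
    Number.isInteger(hardwareConcurrency) && hardwareConcurrency > 0
      ? hardwareConcurrency
      : 1; // unknown core count: stay single-threaded
  return Math.min(cores, cap);
}

// In a page or worker the inputs would come from the standard globals:
//   pickWasmThreads(self.crossOriginIsolated, navigator.hardwareConcurrency)
```

Falling back to one thread keeps the model usable on hosts that cannot set the two headers, at the cost of slower inference.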
Files
    config.json
    generation_config.json
    preprocessor_config.json
    tokenizer.json
    tokenizer_config.json
    special_tokens_map.json
    added_tokens.json
    merges.txt
    vocab.json
    normalizer.json
    onnx/
      encoder_model.onnx                    fp32 header (405 KB) + external data file
      encoder_model.onnx_data               fp32 external data (2.4 GB)
      encoder_model_q8.onnx                 q8 / 660 MB, MatMul-only dynamic quantization
      encoder_model_quantized.onnx          alias of _q8 (transformers.js v3 dtype='q8' file suffix)
      decoder_model_merged.onnx             fp32 / 909 MB (4 cross_attentions outputs added)
      decoder_model_merged_q8.onnx          fp32-equivalent (cross_attentions outputs inhibit quantization)
      decoder_model_merged_quantized.onnx   alias of _q8
Dtypes
| dtype | Encoder | Decoder | Total download | Notes |
|---|---|---|---|---|
| 'q8' (recommended) | q8 / 660 MB | fp32 / 909 MB | ~1.6 GB | Best size/quality tradeoff. Encoder MatMul-only int8; decoder unquantized due to cross_attentions outputs. |
| 'fp32' | fp32 / 2.4 GB | fp32 / 909 MB | ~3.3 GB | Reference precision. Verify if q8 quality regressions matter for your content. |
q4, int8, uint8, fp16, bnb4 variants are NOT shipped in this
build. Re-quantize with onnxruntime.quantization if needed.
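The totals in the table follow from the per-file sizes listed under Files. As a quick arithmetic check, here is a small sketch: the helper name `totalDownloadGb` is hypothetical, sizes are treated as decimal megabytes, and the tokenizer and config files are ignored as negligible.

```javascript
// File sizes in MB, copied from the Files section.
const SIZES_MB = {
  encoder: { q8: 660, fp32: 2400 }, // fp32 = 2.4 GB external data; 405 KB header ignored
  decoder: { q8: 909, fp32: 909 },  // the decoder is fp32 in both configurations
};

// Hypothetical helper: total download in GB (1 GB = 1000 MB), one decimal place.
function totalDownloadGb(dtype) {
  if (!Object.prototype.hasOwnProperty.call(SIZES_MB.encoder, dtype)) {
    throw new Error(`dtype '${dtype}' is not shipped in this build`);
  }
  const mb = SIZES_MB.encoder[dtype] + SIZES_MB.decoder[dtype];
  return Math.round(mb / 100) / 10;
}

// q8:   660 + 909 = 1569 MB
// fp32: 2400 + 909 = 3309 MB
```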
Conversion approach
Standard optimum-cli export onnx produces a Whisper decoder with KV-cache
outputs but no cross_attentions outputs, which transformers.js requires for
word-level timestamp extraction. The conversion takes three steps:
- Monkey-patch WhisperOnnxConfig.outputs to add cross_attentions.{N} per decoder layer, then call optimum.exporters.onnx.main_export with model_kwargs={"output_attentions": True}. Result: the decoder has 21 outputs total (4 cross_attentions added), matching the schema of onnx-community/whisper-large-v3-turbo_timestamped.
- Quantize the encoder with onnxruntime.quantization.quantize_dynamic restricted to op_types_to_quantize=['MatMul']. The default quantization includes Conv layers and produces ConvInteger ops that ONNX Runtime Web's WASM build does not support.
- Provide _quantized.onnx aliases for the q8 files: transformers.js v3.8.1 dtype 'q8' resolves to the _quantized filename suffix, not _q8.
The conversion runbook above is sufficient to reproduce from scratch
on any Python 3.9+ machine with optimum[exporters,onnxruntime]
installed — about 30 minutes end to end on Apple Silicon.
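The reason for the alias step is easier to see as a filename mapping. The sketch below only models the suffix behavior described above for the two dtypes this build ships. It is not the loader's actual code, and `resolveOnnxFile` and `aliasForQ8` are hypothetical names.

```javascript
// Suffix each shipped dtype resolves to, per the conversion notes:
// 'q8' maps to '_quantized'; the fp32 files carry no suffix.
const DTYPE_SUFFIX = { fp32: '', q8: '_quantized' };

// Hypothetical helper: the file a loader would request for a component.
function resolveOnnxFile(component, dtype) {
  if (!Object.prototype.hasOwnProperty.call(DTYPE_SUFFIX, dtype)) {
    throw new Error(`dtype '${dtype}' is not shipped in this build`);
  }
  return `onnx/${component}${DTYPE_SUFFIX[dtype]}.onnx`;
}

// Hypothetical helper: the alias name to create next to a *_q8.onnx file.
function aliasForQ8(fileName) {
  return fileName.replace(/_q8\.onnx$/, '_quantized.onnx');
}
```

With this mapping, `resolveOnnxFile('encoder_model', 'q8')` points at the alias rather than at `encoder_model_q8.onnx`, which is why both names exist in the onnx/ folder.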
Validation
Tested 2026-05-10 on real Chrome 147 / Apple Silicon M4 / 10 cores / crossOriginIsolated. Both fixtures are Hebrew speech.
| Metric | Generic Whisper turbo (Spike 1.2) | This model |
|---|---|---|
| Short fixture (60s) word count vs ivrit.ai-ref | 1.13× | 0.98× |
| Long fixture (1h 44m) word count vs ivrit.ai-ref | 0.80× | 0.979× |
| Long fixture runtime | 1.27× realtime | 0.94× realtime |
| Validity checks (cross-attn #551, in-bounds #1357, monotonic) | 4/5 | 5/5 ✓ |
| Leading-chunk zero-duration artifact | yes | no |
| 30-second chunks where ref ≥ 30 words but ours < 5 | 6 | 0 |
The "ivrit.ai-ref" baseline is the same ivrit-ai/whisper-large-v3-turbo
checkpoint run via CTranslate2 + ivrit-py + pyannote 3.1 on the same audio.
The 0.979× ratio means this ONNX build, served in-browser via
transformers.js, reaches 97.9% of the local-Python pipeline's word count
on Hebrew speech — at 0.94× realtime, no temperature-fallback wrapper
required.
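To reproduce this kind of comparison on your own audio, the word-count ratio and basic timestamp checks can be computed directly from the pipeline output. This is a sketch under stated assumptions: the exact definitions behind the "in-bounds" and "monotonic" checks in the table are not given here, so the ones below are plausible guesses, and all function names are hypothetical.

```javascript
// Count whitespace-separated words in a transcript string.
function countWords(text) {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
}

// Ratio of our word count to the reference word count
// (the table reports this as e.g. 0.979x).
function wordCountRatio(oursText, refText) {
  return countWords(oursText) / countWords(refText);
}

// Assumed definitions:
//   in bounds = every timestamp lies within [0, audioDurationSec] and end >= start
//   monotonic = start times never decrease from one chunk to the next
function checkTimestamps(chunks, audioDurationSec) {
  let inBounds = true;
  let monotonic = true;
  let prevStart = 0;
  for (const { timestamp: [start, end] } of chunks) {
    const stop = end ?? start; // tolerate a missing end time
    if (start < 0 || stop > audioDurationSec || stop < start) inBounds = false;
    if (start < prevStart) monotonic = false;
    prevStart = start;
  }
  return { inBounds, monotonic };
}
```

`checkTimestamps(result.chunks, durationSec)` takes the `chunks` array in the shape shown in the Usage section.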
License + attribution chain
| Layer | Source | License |
|---|---|---|
| Original architecture + base weights | openai/whisper-large-v3-turbo | MIT |
| Hebrew fine-tune (this repo's parent) | ivrit-ai/whisper-large-v3-turbo | Apache 2.0 |
| This repo (ONNX conversion only) | derivative work | Apache 2.0 |
If you use this model in research or production, please cite the ivrit-ai team for the Hebrew fine-tuning work, not this repository. This repo is just a format conversion to make their model usable in transformers.js.
@misc{ivritai-whisper-large-v3-turbo,
author = {ivrit-ai},
title = {whisper-large-v3-turbo (Hebrew finetune)},
year = {2025},
publisher = {HuggingFace},
howpublished = {\url{https://huggingface.co/ivrit-ai/whisper-large-v3-turbo}},
}
Known regressions vs the generic model
In two specific 30-second windows of the 1h 44m reference fixture, this model emitted ~24 and ~16 fewer words than the generic Whisper turbo (bucket 205 at 102.5 min and bucket 156 at 78 min). These are likely q8 quantization edge cases on specific phonetic content. The net long-fixture ratio (0.979×) still holds with these in play; open an issue on this repository if you reproduce them on different content.
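A per-bucket comparison like this can be sketched as follows, assuming word-level chunks (one word per chunk, as returned with return_timestamps: 'word') and a bucket index of floor(start / 30). That indexing is consistent with the bucket numbers quoted here (205 × 30 s = 102.5 min), but it is an inference, not a documented convention. The function names and the default threshold are hypothetical.

```javascript
const BUCKET_SEC = 30;

// Words per 30-second bucket, keyed by bucket index = floor(start / 30).
function wordsPerBucket(wordChunks) {
  const counts = new Map();
  for (const { timestamp: [start] } of wordChunks) {
    const bucket = Math.floor(start / BUCKET_SEC);
    counts.set(bucket, (counts.get(bucket) ?? 0) + 1);
  }
  return counts;
}

// Buckets where `ours` has at least `minDeficit` fewer words than `baseline`.
function bucketDeficits(baselineChunks, ourChunks, minDeficit = 15) {
  const base = wordsPerBucket(baselineChunks);
  const ours = wordsPerBucket(ourChunks);
  const out = [];
  for (const [bucket, n] of base) {
    const deficit = n - (ours.get(bucket) ?? 0);
    if (deficit >= minDeficit) {
      out.push({ bucket, startMin: (bucket * BUCKET_SEC) / 60, deficit });
    }
  }
  return out.sort((a, b) => a.bucket - b.bucket);
}
```

Running this against the generic model's output on the same audio yields the bucket index and start minute of each suspect window, which is the information an issue report would need.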
Two mitigation options if you hit this in real-episode usage:
- Load specific chunks at fp32 instead of q8 (one-off fp32 fallback for chunks failing a quality gate).
- Wrap inference in an OpenAI-Whisper-style temperature fallback loop (the same retry pattern used in the spike-1-4 worker).
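Both mitigations share the same retry shape: run, check a quality gate, and try the next configuration if the gate fails. The sketch below shows that shape for the temperature variant. It is not the spike-1-4 worker's code. `transcribe` and `passesGate` are hypothetical caller-supplied functions, and whether your transformers.js version honors a temperature option is something to verify. The temperature schedule matches the default used by OpenAI's reference Whisper implementation.

```javascript
// Temperatures tried in order (OpenAI Whisper's default fallback schedule).
const TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0];

// Hypothetical driver. `transcribe(audio, temperature)` runs one attempt and
// `passesGate(result)` decides whether to accept it; both come from the caller.
async function transcribeWithFallback(
  audio,
  transcribe,
  passesGate,
  temperatures = TEMPERATURES,
) {
  let last = null;
  for (const temperature of temperatures) {
    last = await transcribe(audio, temperature);
    if (passesGate(last)) {
      return { result: last, temperature, passed: true };
    }
  }
  // No attempt passed the gate: hand back the final attempt and let the
  // caller decide (for example, retry this chunk with the fp32 model).
  return {
    result: last,
    temperature: temperatures[temperatures.length - 1],
    passed: false,
  };
}
```

A simple gate could compare the chunk's word count against an expected minimum; the fp32 fallback fits the same loop if `transcribe` switches pipelines instead of temperatures.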
Maintainer
Conversion + validation by Kobi Rivlin (@krivlin) on 2026-05-10, as part of an in-browser Hebrew podcast editor project. All model authorship is ivrit-ai's; this repo's only contribution is the ONNX conversion shape.