Whisper Large v3 Turbo — Hebrew (ivrit-ai) — ONNX timestamped variant for transformers.js

This repository hosts the in-browser-ready ONNX build of ivrit-ai/whisper-large-v3-turbo.

All credit for the Hebrew fine-tuning, training data curation, and model weights goes to the ivrit-ai team (website · GitHub · HuggingFace). The fine-tune was trained on three ivrit-ai datasets: crowd-transcribe-v5, crowd-recital-whisper-training, and knesset-plenums-whisper-training.

This repository contributes only the ONNX format conversion: no additional training, fine-tuning, or model modification of any kind. The conversion does two things. First, it adds the cross_attentions.{0..3} outputs that transformers.js v3's _extract_token_timestamps requires for word-level timestamps (the same shape onnx-community/whisper-large-v3-turbo_timestamped ships, but with ivrit-ai's Hebrew-fine-tuned weights instead of OpenAI's generic ones). Second, it provides q8 quantized variants with quantization restricted to MatMul ops, so the model runs in ONNX Runtime Web's WASM build.

Why this exists

For in-browser Hebrew transcription with word-level timestamps, the options before this conversion were:

- onnx-community/whisper-large-v3-turbo_timestamped: generic OpenAI Whisper turbo, no Hebrew fine-tuning. On a 1h 44m Hebrew podcast episode it produced 0.80× the word count of the ivrit-ai-served reference transcript, a consistent ~20% content deficit across the episode, including ~10 chunks where 50-60 words of dense conversation collapsed to 0-2 hallucinated tokens.
- ivrit-ai/whisper-large-v3-turbo itself: only available as PyTorch / safetensors weights, so transformers.js cannot load it without ONNX conversion. The optimum-cli default export doesn't include the cross-attention outputs needed for word timestamps.

This repository closes the gap: ivrit-ai's fine-tune in the exact ONNX shape transformers.js v3 expects.

Usage (transformers.js v3)

import { pipeline } from '@huggingface/transformers'

const pipe = await pipeline(
  'automatic-speech-recognition',
  'krivlin/whisper-large-v3-turbo-ivrit-ai-timestamped-onnx',
  { dtype: 'q8' },  // recommended: q8 encoder + fp32 decoder
)

const result = await pipe(audioFloat32Array, {
  chunk_length_s: 30,
  stride_length_s: 5,
  return_timestamps: 'word',
  language: 'he',
  task: 'transcribe',
})

// result.chunks: [{ text, timestamp: [start_sec, end_sec] }, ...]
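
As one way to consume result.chunks, the sketch below groups word-level chunks into subtitle-style cues. It assumes only the chunk shape shown above ({ text, timestamp: [start_sec, end_sec] }). The helper name groupWordsIntoCues and the 5-second default are illustrative choices, not part of transformers.js.

```javascript
// Hypothetical helper (not part of transformers.js): group word-level chunks
// into cues that span at most maxCueSeconds from first word start to last word end.
function groupWordsIntoCues(chunks, maxCueSeconds = 5) {
  const cues = [];
  let current = null;
  for (const { text, timestamp } of chunks) {
    const [start, end] = timestamp;
    const stop = end ?? start; // defensive: treat a missing end time as zero-length
    const word = text.trim();
    if (!word) continue; // skip empty tokens
    // Open a new cue when none is open, or when adding this word would
    // stretch the current cue past the limit.
    if (current === null || stop - current.start > maxCueSeconds) {
      if (current !== null) cues.push(current);
      current = { start, end: stop, text: word };
    } else {
      current.end = stop;
      current.text += ' ' + word;
    }
  }
  if (current !== null) cues.push(current);
  return cues;
}
```

Calling groupWordsIntoCues(result.chunks) would return an array of { start, end, text } objects that can be rendered as captions or serialized to a subtitle format of your choice.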

Run inside a Web Worker. The page must be cross-origin isolated (serve with Cross-Origin-Opener-Policy: same-origin and Cross-Origin-Embedder-Policy: require-corp) so SharedArrayBuffer is available — required for multi-threaded WASM.
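
The isolation requirement can be checked at runtime before deciding how many WASM threads to request. The following is a minimal sketch: pickWasmThreads is a hypothetical helper, and the cap of 4 threads is an arbitrary illustrative default, not a value measured for this model.

```javascript
// Hypothetical helper: choose a WASM thread count from the page's isolation state.
//   isIsolated: the value of globalThis.crossOriginIsolated
//   cores:      the value of navigator.hardwareConcurrency (may be undefined)
//   maxThreads: arbitrary illustrative cap (assumption, tune for your app)
function pickWasmThreads(isIsolated, cores, maxThreads = 4) {
  // Without cross-origin isolation, SharedArrayBuffer is unavailable,
  // so multi-threaded WASM cannot be used.
  if (!isIsolated) return 1;
  const available = Number.isInteger(cores) && cores > 0 ? cores : 1;
  return Math.min(available, maxThreads);
}
```

Inside the worker this might be called as pickWasmThreads(self.crossOriginIsolated, navigator.hardwareConcurrency). The result would then be assigned to the ONNX Runtime Web thread setting that transformers.js exposes (env.backends.onnx.wasm.numThreads in the versions I am aware of; treat that path as an assumption and check it against your installed version).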

Files

config.json
generation_config.json
preprocessor_config.json
tokenizer.json
tokenizer_config.json
special_tokens_map.json
added_tokens.json
merges.txt
vocab.json
normalizer.json
onnx/
  encoder_model.onnx              fp32 header (405 KB) + external data file
  encoder_model.onnx_data         fp32 external data (2.4 GB)
  encoder_model_q8.onnx           q8 / 660 MB, MatMul-only dynamic quantization
  encoder_model_quantized.onnx    alias of _q8 (transformers.js v3 dtype='q8' file suffix)
  decoder_model_merged.onnx       fp32 / 909 MB (4 cross_attentions outputs added)
  decoder_model_merged_q8.onnx    fp32-equivalent (cross_attentions outputs inhibit quantization)
  decoder_model_merged_quantized.onnx  alias of _q8

Dtypes

| `dtype` | Encoder | Decoder | Total download | Notes |
|---|---|---|---|---|
| `'q8'` (recommended) | q8 / 660 MB | fp32 / 909 MB | ~1.6 GB | Best size/quality tradeoff. Encoder MatMul-only int8; decoder unquantized due to cross_attentions outputs. |
| `'fp32'` | fp32 / 2.4 GB | fp32 / 909 MB | ~3.3 GB | Reference precision. Verify if q8 quality regressions matter for your content. |

q4, int8, uint8, fp16, bnb4 variants are NOT shipped in this build. Re-quantize with onnxruntime.quantization if needed.
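
To make the dtype choice concrete, the sketch below maps each shipped dtype to the ONNX files it would load and the approximate download size. The sizes come from the table above, and the q8 to _quantized suffix mapping is the one described in this README. The describeDtype helper and the SHIPPED table are hypothetical names introduced for illustration.

```javascript
// Shipped dtypes only. Sizes in MB are taken from the Dtypes table;
// the fp32 encoder size is dominated by its external data file (~2.4 GB).
const SHIPPED = new Map([
  ['q8',   { suffix: '_quantized', encoderMB: 660,  decoderMB: 909 }],
  ['fp32', { suffix: '',           encoderMB: 2400, decoderMB: 909 }],
]);

// Hypothetical helper: resolve a dtype to file names and total download size.
function describeDtype(dtype) {
  const entry = SHIPPED.get(dtype);
  if (!entry) {
    throw new Error(`dtype '${dtype}' is not shipped in this build`);
  }
  return {
    encoder: `onnx/encoder_model${entry.suffix}.onnx`,
    decoder: `onnx/decoder_model_merged${entry.suffix}.onnx`,
    totalMB: entry.encoderMB + entry.decoderMB,
  };
}
```

A call such as describeDtype('q4') throws, which mirrors what happens in practice: requesting a dtype whose files are absent from the repository fails at load time.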

Conversion approach

The standard optimum-cli export onnx command produces a Whisper decoder with KV-cache outputs but no cross_attentions outputs, which transformers.js requires for word-level timestamp extraction. The conversion takes three steps:

1. Monkey-patch WhisperOnnxConfig.outputs to add cross_attentions.{N} per decoder layer, then call optimum.exporters.onnx.main_export with model_kwargs={"output_attentions": True}. Result: the decoder has 21 outputs in total (4 cross_attentions added), matching the schema of onnx-community/whisper-large-v3-turbo_timestamped.
2. Quantize the encoder with onnxruntime.quantization.quantize_dynamic restricted to op_types_to_quantize=['MatMul']. The default quantization includes Conv layers and produces ConvInteger ops that ONNX Runtime Web's WASM build does not support.
3. Provide _quantized.onnx aliases for the q8 files, because transformers.js v3.8.1 resolves dtype 'q8' to the _quantized filename suffix, not _q8.

The conversion runbook above is sufficient to reproduce from scratch on any Python 3.9+ machine with optimum[exporters,onnxruntime] installed — about 30 minutes end to end on Apple Silicon.
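
A quick sanity check after export is to confirm that the decoder exposes the expected 21 outputs. The sketch below builds that list of names. It assumes the Optimum-style naming for the merged decoder (logits, then present.{i}.{decoder|encoder}.{key|value}) plus the cross_attentions.{i} names used in this README, and the helper name expectedDecoderOutputs is hypothetical. Verify the names against your own export rather than taking them as given.

```javascript
// Hypothetical helper: list the output names a timestamped merged decoder
// is expected to expose. Naming convention is an assumption (Optimum-style).
function expectedDecoderOutputs(numDecoderLayers = 4) {
  const names = ['logits'];
  // KV-cache outputs: self-attention and cross-attention key/value per layer.
  for (let i = 0; i < numDecoderLayers; i++) {
    for (const branch of ['decoder', 'encoder']) {
      for (const kind of ['key', 'value']) {
        names.push(`present.${i}.${branch}.${kind}`);
      }
    }
  }
  // Outputs added by the conversion: one cross-attention tensor per layer.
  for (let i = 0; i < numDecoderLayers; i++) {
    names.push(`cross_attentions.${i}`);
  }
  return names;
}
```

With 4 decoder layers this yields 1 + 16 + 4 = 21 names, which can be compared against the outputNames of an ONNX Runtime inference session created from decoder_model_merged.onnx.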

Validation

Tested 2026-05-10 on real Chrome 147 / Apple Silicon M4 / 10 cores / crossOriginIsolated. Both fixtures are Hebrew speech.

| Metric | Generic Whisper turbo (Spike 1.2) | This model |
|---|---|---|
| Short fixture (60s) word count vs ivrit.ai-ref | 1.13× | 0.98× |
| Long fixture (1h 44m) word count vs ivrit.ai-ref | 0.80× | 0.979× |
| Long fixture runtime | 1.27× realtime | 0.94× realtime |
| Validity checks (cross-attn #551, in-bounds #1357, monotonic) | 4/5 | 5/5 ✓ |
| Leading-chunk zero-duration artifact | yes | no |
| 30-second chunks where ref ≥ 30 words but ours < 5 | 6 | 0 |

The "ivrit.ai-ref" baseline is the same ivrit-ai/whisper-large-v3-turbo checkpoint run via CTranslate2 + ivrit-py + pyannote 3.1 on the same audio. The 0.979× ratio means this ONNX build, served in-browser via transformers.js, reaches 97.9% of the local-Python pipeline's word count on Hebrew speech — at 0.94× realtime, no temperature-fallback wrapper required.
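
The word-count ratio and the collapsed-chunk count in the table can be computed from two word-level transcripts roughly as follows. This is a sketch of the metric definitions as stated above, not the actual validation script: the function names are hypothetical, and both inputs are assumed to use the { text, timestamp: [start_sec, end_sec] } chunk shape.

```javascript
// Count words per fixed-size time bucket, keyed by bucket index.
// A word is assigned to the bucket containing its start time.
function wordsPerBucket(chunks, bucketSeconds = 30) {
  const counts = new Map();
  for (const { text, timestamp: [start] } of chunks) {
    const words = text.trim().split(/\s+/).filter(Boolean).length;
    if (words === 0) continue;
    const bucket = Math.floor(start / bucketSeconds);
    counts.set(bucket, (counts.get(bucket) ?? 0) + words);
  }
  return counts;
}

// Hypothetical comparison: overall word-count ratio, plus the number of
// buckets where the reference has >= 30 words but ours has < 5.
function compareTranscripts(refChunks, ourChunks, bucketSeconds = 30) {
  const ref = wordsPerBucket(refChunks, bucketSeconds);
  const ours = wordsPerBucket(ourChunks, bucketSeconds);
  const total = (m) => [...m.values()].reduce((a, b) => a + b, 0);
  let collapsed = 0;
  for (const [bucket, refWords] of ref) {
    if (refWords >= 30 && (ours.get(bucket) ?? 0) < 5) collapsed++;
  }
  return { ratio: total(ours) / total(ref), collapsed };
}
```

The same per-bucket counts also make it easy to list individual buckets where one transcript falls well short of the other, which is useful when investigating specific 30-second windows.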

License + attribution chain

| Layer | Source | License |
|---|---|---|
| Original architecture + base weights | `openai/whisper-large-v3-turbo` | MIT |
| Hebrew fine-tune (this repo's parent) | `ivrit-ai/whisper-large-v3-turbo` | Apache 2.0 |
| This repo (ONNX conversion only) | derivative work | Apache 2.0 |

If you use this model in research or production, please cite the ivrit-ai team for the Hebrew fine-tuning work, not this repository. This repo is just a format conversion to make their model usable in transformers.js.

@misc{ivritai-whisper-large-v3-turbo,
  author       = {ivrit-ai},
  title        = {whisper-large-v3-turbo (Hebrew finetune)},
  year         = {2025},
  publisher    = {HuggingFace},
  howpublished = {\url{https://huggingface.co/ivrit-ai/whisper-large-v3-turbo}},
}

Known regressions vs the generic model

In two specific 30-second windows of the 1h 44m reference fixture, this model emitted ~24 and ~16 fewer words than the generic Whisper turbo (bucket 205 at 102.5 min and bucket 156 at 78 min). These are likely q8 quantization edge cases on specific phonetic content. The net long-fixture ratio (0.979×) already includes both windows. If you reproduce them on different content, please open an issue on this repository.

Two mitigation options if you hit this in real-episode usage:

- Load specific chunks at fp32 instead of q8 (one-off fp32 fallback for chunks failing a quality gate).
- Wrap inference in an OpenAI-Whisper-style temperature fallback loop (the same retry pattern used in the spike-1-4 worker).
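
A temperature fallback loop can be sketched as below. This is not the spike-1-4 worker code. The schedule is the default one from OpenAI's reference Whisper implementation, while transcribe and passesGate are caller-supplied, hypothetical callbacks: how you run a chunk at a given temperature and what counts as passing a quality gate are left to your application.

```javascript
// Default temperature schedule from OpenAI's reference Whisper decoder.
const FALLBACK_TEMPERATURES = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0];

// Retry a chunk at increasing temperatures until the result passes the gate.
//   transcribe(temperature) -> Promise<result>   (hypothetical, caller-supplied)
//   passesGate(result)      -> boolean           (hypothetical, caller-supplied)
async function transcribeWithFallback(
  transcribe,
  passesGate,
  temperatures = FALLBACK_TEMPERATURES,
) {
  let last = null;
  for (const temperature of temperatures) {
    last = await transcribe(temperature);
    if (passesGate(last)) {
      return { result: last, temperature, accepted: true };
    }
  }
  // Every attempt failed the gate: return the final attempt, flagged as rejected.
  return {
    result: last,
    temperature: temperatures[temperatures.length - 1],
    accepted: false,
  };
}
```

The same wrapper shape also fits the first option: instead of varying temperature, the callback could re-run a failing chunk through a second pipeline loaded with dtype 'fp32', and the gate could be as simple as a minimum word count for the chunk's duration.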

Maintainer

Conversion + validation by Kobi Rivlin (@krivlin) on 2026-05-10, as part of an in-browser Hebrew podcast editor project. All model authorship is ivrit-ai's; this repo's only contribution is the ONNX conversion shape.
