---
license: cc-by-nc-4.0
pipeline_tag: automatic-speech-recognition
datasets:
- ivrit-ai/crowd-transcribe-v5
- ivrit-ai/crowd-recital-whisper-training
language:
- he
base_model:
- Qwen/Qwen3-ASR-1.7B
tags:
- automatic-speech-recognition
- asr
- speech
- hebrew
- qwen
- fine-tuned
- speech-recognition
- ozlabs
model-index:
- name: Caspi-1.7B
  results: []
---

![banner-final](https://cdn-uploads.huggingface.co/production/uploads/6613ecd9d0b48c4213d1aa40/IXT1RTd8SjM9Z5CfFFS4a.png)

# Caspi-1.7B

## Hebrew ASR, done properly.

**Caspi-1.7B** is a Hebrew automatic speech recognition model built by fine-tuning **Qwen/Qwen3-ASR-1.7B** for real Hebrew speech.

Caspi exists for one reason. The base multilingual models are strong, but Hebrew deserves a model that is **actually tuned for Hebrew**: its vocabulary, phonetic edge cases, spelling patterns, and real-world audio conditions.

This model is aimed at **single-pass Hebrew ASR** with strong quality across conversational, crowd-sourced, and broadcast-style speech.

Despite major advances in speech recognition, **Hebrew ASR has seen relatively little dedicated model development**, with most systems relying on multilingual Whisper variants. Caspi aims to push Hebrew ASR forward by training directly on Hebrew speech data and optimizing for real-world Hebrew transcription.

### What Caspi is for

- **Hebrew transcription**
- **Single-pass ASR inference**
- **Offline and batch transcription**
- **Research, benchmarking, and production experimentation**
- **A stronger Hebrew-focused alternative to the multilingual base model**

---

## Why Caspi

Hebrew ASR is deceptively hard. Short function words, phonetically similar terms, compressed voice-note audio, domain-specific names, and inconsistent orthography can all wreck transcription quality. Caspi was trained specifically to push performance where generic multilingual checkpoints tend to slip.

Compared to the base model, Caspi is intended to provide:

- better **Hebrew recognition quality**
- stronger handling of **Hebrew vocabulary and orthographic patterns**
- improved robustness on **real Hebrew speech datasets**
- a more serious baseline for **Hebrew ASR evaluation and deployment**

This is not a general multilingual release. **Caspi is a Hebrew-specialized checkpoint.**

---

## Base model

- **Base checkpoint:** `Qwen/Qwen3-ASR-1.7B`
- **Model family:** Qwen3-ASR
- **Base paper:** *Qwen3-ASR Technical Report*

Caspi inherits the architecture and inference ecosystem of Qwen3-ASR, while adapting the model specifically for Hebrew ASR.

---

| Model | Supported Languages | Supported Dialects | Inference Mode | Audio Types |
|---|---|---|---|---|
| Caspi-1.7B | **Hebrew (he)**, Chinese (zh), English (en), Cantonese (yue), Arabic (ar), German (de), French (fr), Spanish (es), Portuguese (pt), Indonesian (id), Italian (it), Korean (ko), Russian (ru), Thai (th), Vietnamese (vi), Japanese (ja), Turkish (tr), Hindi (hi), Malay (ms), Dutch (nl), Swedish (sv), Danish (da), Finnish (fi), Polish (pl), Czech (cs), Filipino (fil), Persian (fa), Greek (el), Hungarian (hu), Macedonian (mk), Romanian (ro) | Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shandong, Shaanxi, Shanxi, Sichuan, Tianjin, Yunnan, Zhejiang, Cantonese (Hong Kong accent), Cantonese (Guangdong accent), Wu language, Minnan language | Offline / Streaming | Speech, Singing Voice, Songs with BGM |

---

## Training data

Caspi was fine-tuned on Hebrew speech-transcription data including:

- `ivrit-ai/crowd-transcribe-v5`
- `ivrit-ai/crowd-recital-whisper-training`

These datasets were used to adapt the multilingual base model toward stronger Hebrew recognition.

### Notes on the data

Training focused on **Hebrew audio + transcript pairs**.

As with any ASR system, performance is highly sensitive to:

- transcript consistency
- segmentation quality
- domain mismatch
- noise and compression
- spelling normalization

Some of the hardest Hebrew ASR failure modes remain:

- short function words
- phonetically similar forms
- noisy and low-bitrate audio
- proper nouns, abbreviations, and domain-heavy vocabulary

---

## Intended use

Caspi is intended for:

- Hebrew ASR research
- transcription of Hebrew recordings
- experimentation with Hebrew speech systems
- benchmarking Hebrew ASR models
- downstream speech products and prototypes

### Example use cases

- transcribing spoken Hebrew audio
- transcribing interviews and conversations
- transcribing voice notes
- evaluating Hebrew ASR quality across domains
- building Hebrew-first speech pipelines

---

## Evaluation

Caspi was evaluated on Hebrew ASR benchmarks and internal evaluation sets.

### Current evaluation sets

- `eval-d1`
- `eval-whatsapp`
- `hebrew-speech-kan`
- Matti Caspi Songs

### Results

*WER: Word Error Rate, lower is better*

| Dataset | Caspi WER | Ivrit v3 WER |
|---|---:|---:|
| eval-d1 | **4.2%** | 5.1% |
| eval-whatsapp | **6%** | 7.2% |
| hebrew-speech-kan | 7.1% | **6.4%** |
| Matti Caspi Songs | **2.4%** | 3.7% |
| average | **4.96%** | 5.6% |

### Takeaway

Caspi improves over the compared Hebrew Whisper baseline (Ivrit v3) on **eval-d1**, **eval-whatsapp**, **Matti Caspi Songs**, and the **overall average**. On **hebrew-speech-kan** it trails the baseline (7.1% vs. 6.4%), so it is competitive but not ahead on KAN-style broadcast speech.

That makes it a strong Hebrew ASR checkpoint for real-world use, especially on conversational and less curated audio.

### Evaluation notes

- If you publish benchmark claims, specify whether decoding used **greedy** or **beam search**
- Keep normalization policy consistent across models
- Comparisons are only meaningful if decoding and preprocessing conditions are matched fairly

---

## Inference

Caspi uses the same overall inference ecosystem as the base Qwen3-ASR model.

Depending on your setup, you can use:

- the `qwen-asr` package
- Transformers-based inference
- vLLM-based inference
- optional forced alignment via `Qwen/Qwen3-ForcedAligner-0.6B`

Because Caspi is a fine-tuned derivative of Qwen3-ASR-1.7B, usage is similar to the base model: replace the model name with `OzLabs/Caspi-1.7B`.

### Quick example

```python
import torch
from qwen_asr import Qwen3ASRModel

# Load the Hebrew fine-tuned checkpoint on a single GPU.
model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

# Force Hebrew decoding for a single file.
results = model.transcribe(
    audio="path/to/hebrew_audio.wav",
    language="Hebrew",
)

print(results[0].language)
print(results[0].text)
```

---

## Python package usage

### Transformers backend

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

# language=None leaves the language unspecified.
results = model.transcribe(
    audio="audio path / url",
    language=None,
)

print(results[0].language)
print(results[0].text)
```

### With timestamps

```python
import torch
from qwen_asr import Qwen3ASRModel

# Attach the forced aligner so transcribe() can return timestamps.
model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
    ),
)

# One language entry per audio entry.
results = model.transcribe(
    audio=[
        "path/to/audio",
        "audio.url",
    ],
    language=["Hebrew", "English"],
    return_time_stamps=True,
)

for r in results:
    print(r.language, r.text, r.time_stamps[0])
```

### vLLM backend

```python
import torch
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="OzLabs/Caspi-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs=dict(
            dtype=torch.bfloat16,
            device_map="cuda:0",
        ),
    )

    results = model.transcribe(
        audio=[
            "path/to/audio",
            "path/to/audio",
        ],
        language=["Hebrew", "English"],
        return_time_stamps=True,
    )

    for r in results:
        print(r.language, r.text, r.time_stamps[0])
```

### Serve with vLLM

```bash
qwen-asr-serve OzLabs/Caspi-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000
```

### Request example

```python
import requests
from qwen_asr import parse_asr_output

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

# A single user turn carrying one audio reference.
data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "path/to/audio"
                    },
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()["choices"][0]["message"]["content"]
print(content)

# Split the raw model output into language and transcript.
language, text = parse_asr_output(content)
print(language)
print(text)
```

---

## Streaming inference

**Caspi** supports streaming inference through the vLLM backend.

Streaming is useful when you want lower-latency transcription, but note:

* no batch inference in streaming mode
* no timestamps in streaming mode

See the upstream Qwen3-ASR examples for the streaming backend implementation.

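
Streaming backends generally consume audio in small fixed-size pieces rather than whole files. The sketch below illustrates only that chunking step in plain Python; it does not call the `qwen-asr` streaming API, which is documented upstream. The function name `iter_pcm_chunks`, the 16 kHz sample rate, and the 0.5 s chunk length are all assumptions made for illustration.

```python
from typing import Iterator, List, Sequence

SAMPLE_RATE_HZ = 16_000  # assumed input rate; check the base model's preprocessing
CHUNK_SECONDS = 0.5      # hypothetical chunk length; tune for your latency budget


def iter_pcm_chunks(
    samples: Sequence[float],
    sample_rate_hz: int = SAMPLE_RATE_HZ,
    chunk_seconds: float = CHUNK_SECONDS,
) -> Iterator[List[float]]:
    """Yield consecutive fixed-duration chunks; the last one may be shorter."""
    chunk_len = int(sample_rate_hz * chunk_seconds)
    if chunk_len <= 0:
        raise ValueError("chunk_seconds is too small for this sample rate")
    for start in range(0, len(samples), chunk_len):
        yield list(samples[start:start + chunk_len])


if __name__ == "__main__":
    # 20,000 samples of silence (1.25 s at 16 kHz) stand in for microphone input.
    audio = [0.0] * 20_000
    for index, chunk in enumerate(iter_pcm_chunks(audio)):
        # In a real pipeline, each chunk would be handed to the streaming backend here.
        print(index, len(chunk))
```

Smaller chunks reduce latency but give the recognizer less context per step, so the chunk length is a trade-off to measure rather than a fixed setting.
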
### Streaming demo

```bash
qwen-asr-demo-streaming \
  --asr-model-path OzLabs/Caspi-1.7B \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9
```

---

## Forced aligner

For timestamp prediction and alignment, Caspi can be used together with:

* `Qwen/Qwen3-ForcedAligner-0.6B`

### Example

```python
import torch
from qwen_asr import Qwen3ForcedAligner

model = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = model.align(
    audio="path/to/audio",
    text="איך זה שכוכב אחד מעז",
    language="Hebrew",
)

print(results[0])
print(results[0][0].text, results[0][0].start_time, results[0][0].end_time)
```

---

## Offline inference with vLLM

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(
    model="OzLabs/Caspi-1.7B"
)

audio_asset = AudioAsset("winning_call")

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio_url",
                "audio_url": {"url": audio_asset.url}
            }
        ]
    }
]

sampling_params = SamplingParams(temperature=0.01, max_tokens=256)

outputs = llm.chat(conversation, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```

---

## Recommended usage notes

For best results:

* use reasonably clean audio when possible
* segment long audio into shorter utterances
* keep sample rate aligned with the base model’s preprocessing expectations
* use **beam search** if latency allows
* apply consistent Hebrew text normalization during evaluation

---

## Limitations

Caspi is strong, but Hebrew ASR is still hard.

Common failure modes include:

* short phonetically similar words such as `על / אל`, `אם / עם`, `לא / לו`
* noisy or low-bitrate speech
* overlapping speakers
* accented or highly informal speech
* domain-specific names, abbreviations, and slang
* code-switching between Hebrew and other languages

Performance will vary depending on:

* recording quality
* segmentation quality
* speaker style
* domain match between train and test data

---

## Ethical considerations

ASR systems can mis-transcribe people’s speech, especially under:

* noisy conditions
* accented speech
* overlapping speakers
* low-quality microphones
* compressed audio pipelines

For sensitive, high-stakes, or public-facing use cases, transcripts should be reviewed by a human.

---

## Acknowledgements

Caspi is built on top of **Qwen3-ASR-1.7B** from the Qwen team.

We also thank the creators and contributors of the Hebrew datasets used for fine-tuning, especially the Ivrit.AI community datasets.

---

## Citation

If you use Caspi in research or applications, please cite both the original Qwen3-ASR work and this checkpoint.

### Base model

```bibtex
@article{Qwen3-ASR,
  title={Qwen3-ASR Technical Report},
  author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.21337},
  year={2026}
}
```

### Caspi

```bibtex
@misc{caspi_hebrew_asr,
  title={Caspi-1.7B: Hebrew ASR fine-tuned from Qwen3-ASR-1.7B},
  author={Oz Labs},
  year={2026},
  howpublished={Hugging Face model card}
}
```
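
## Appendix: WER sketch

The evaluation notes ask for a consistent normalization policy across models, because WER changes with how text is cleaned before scoring. The sketch below shows one possible way to normalize Hebrew text and compute a corpus-level WER in plain Python. The normalization rules (stripping niqqud, splitting on maqaf and hyphen, dropping other punctuation) and the function names are assumptions for illustration, not the policy or code used to produce the numbers reported above.

```python
import re
import unicodedata
from typing import List, Sequence


def normalize(text: str) -> List[str]:
    """Assumed normalization: strip diacritics and punctuation, then split on whitespace."""
    text = unicodedata.normalize("NFKD", text)
    # Drop combining marks (category Mn), which covers Hebrew niqqud and cantillation.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Mn")
    text = re.sub(r"[\u05BE\-]", " ", text)  # maqaf and hyphen separate words
    text = re.sub(r"[^\w\s]", "", text)      # remove remaining punctuation
    return text.lower().split()


def word_errors(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                         # deletion
                curr[j - 1] + 1,                     # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Total word errors divided by total reference words, after normalization."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    ref_tokens = [normalize(r) for r in refs]
    hyp_tokens = [normalize(h) for h in hyps]
    total_words = sum(len(r) for r in ref_tokens)
    if total_words == 0:
        raise ValueError("references contain no words after normalization")
    errors = sum(word_errors(r, h) for r, h in zip(ref_tokens, hyp_tokens))
    return errors / total_words


if __name__ == "__main__":
    references = ["שלום עולם", "אני לא יודע"]
    hypotheses = ["שלום עולם", "אני לו יודע"]  # one substitution: לא -> לו
    print(f"WER: {corpus_wer(references, hypotheses):.1%}")
```

Whatever rules you choose, apply the same `normalize` step to the references and to every model's output, otherwise differences in punctuation or niqqud handling can look like differences in recognition quality.
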