---
license: cc-by-nc-4.0
pipeline_tag: automatic-speech-recognition
datasets:
  - ivrit-ai/crowd-transcribe-v5
  - ivrit-ai/crowd-recital-whisper-training
language:
  - he
base_model:
  - Qwen/Qwen3-ASR-1.7B
tags:
  - automatic-speech-recognition
  - asr
  - speech
  - hebrew
  - qwen
  - fine-tuned
  - speech-recognition
  - ozlabs
model-index:
  - name: Caspi-1.7B
    results: []
---

![banner-final](https://cdn-uploads.huggingface.co/production/uploads/6613ecd9d0b48c4213d1aa40/IXT1RTd8SjM9Z5CfFFS4a.png)

# Caspi-1.7B

## Hebrew ASR, done properly.

**Caspi-1.7B** is a Hebrew automatic speech recognition model built by fine-tuning **Qwen/Qwen3-ASR-1.7B** for real Hebrew speech.

Caspi exists for one reason: the base multilingual models are strong, but Hebrew deserves a model that is **actually tuned for Hebrew** — its vocabulary, phonetic edge cases, spelling patterns, and real-world audio conditions.

This model is aimed at **single-pass Hebrew ASR** with strong quality across conversational, crowd-sourced, and broadcast-style speech.

Despite major advances in speech recognition, **Hebrew ASR has seen relatively little dedicated model development**, with most systems relying on multilingual Whisper variants.

Caspi aims to push Hebrew ASR forward by training directly on Hebrew speech data and optimizing for real-world Hebrew transcription.

### What Caspi is for

- **Hebrew transcription**
- **Single-pass ASR inference**
- **Offline and batch transcription**
- **Research, benchmarking, and production experimentation**
- **A stronger Hebrew-focused alternative to the multilingual base model**

---

## Why Caspi

Hebrew ASR is deceptively hard.

Short function words, phonetically similar terms, compressed voice-note audio, domain-specific names, and inconsistent orthography can all wreck transcription quality. Caspi was trained specifically to push performance where generic multilingual checkpoints tend to slip.

Compared to the base model, Caspi is intended to provide:

- better **Hebrew recognition quality**
- stronger handling of **Hebrew vocabulary and orthographic patterns**
- improved robustness on **real Hebrew speech datasets**
- a more serious baseline for **Hebrew ASR evaluation and deployment**

This is not a general multilingual release.  
**Caspi is a Hebrew-specialized checkpoint.**

---

## Base model

- **Base checkpoint:** `Qwen/Qwen3-ASR-1.7B`
- **Model family:** Qwen3-ASR
- **Base paper:** *Qwen3-ASR Technical Report*

Caspi inherits the architecture and inference ecosystem of Qwen3-ASR, while adapting the model specifically for Hebrew ASR.

---


| Model | Supported Languages | Supported Dialects | Inference Mode | Audio Types |
|---|---|---|---|---|
| Caspi-1.7B | **Hebrew (he)**, Chinese (zh), English (en), Cantonese (yue), Arabic (ar), German (de), French (fr), Spanish (es), Portuguese (pt), Indonesian (id), Italian (it), Korean (ko), Russian (ru), Thai (th), Vietnamese (vi), Japanese (ja), Turkish (tr), Hindi (hi), Malay (ms), Dutch (nl), Swedish (sv), Danish (da), Finnish (fi), Polish (pl), Czech (cs), Filipino (fil), Persian (fa), Greek (el), Hungarian (hu), Macedonian (mk), Romanian (ro) | Anhui, Dongbei, Fujian, Gansu, Guizhou, Hebei, Henan, Hubei, Hunan, Jiangxi, Ningxia, Shandong, Shaanxi, Shanxi, Sichuan, Tianjin, Yunnan, Zhejiang, Cantonese (Hong Kong accent), Cantonese (Guangdong accent), Wu language, Minnan language | Offline / Streaming | Speech, Singing Voice, Songs with BGM |


---

## Training data

Caspi was fine-tuned on Hebrew speech-transcription data including:

- `ivrit-ai/crowd-transcribe-v5`
- `ivrit-ai/crowd-recital-whisper-training`

These datasets were used to adapt the multilingual base model toward stronger Hebrew recognition.

### Notes on the data

Training focused on **Hebrew audio + transcript pairs**. As with any ASR system, performance is highly sensitive to:

- transcript consistency
- segmentation quality
- domain mismatch
- noise and compression
- spelling normalization

Some of the hardest Hebrew ASR failure modes remain:

- short function words
- phonetically similar forms
- noisy and low-bitrate audio
- proper nouns, abbreviations, and domain-heavy vocabulary

---

## Intended use

Caspi is intended for:

- Hebrew ASR research
- transcription of Hebrew recordings
- experimentation with Hebrew speech systems
- benchmarking Hebrew ASR models
- downstream speech products and prototypes

### Example use cases

- transcribing spoken Hebrew audio
- transcribing interviews and conversations
- transcribing voice notes
- evaluating Hebrew ASR quality across domains
- building Hebrew-first speech pipelines

---

## Evaluation

Caspi was evaluated on Hebrew ASR benchmarks and internal evaluation sets.

### Current evaluation sets

- `eval-d1`
- `eval-whatsapp`
- `hebrew-speech-kan`
- Matti Caspi Songs

### Results
*WER: Word Error Rate, lower is better*

| Dataset | Caspi WER | Ivrit v3 WER |
|---|---:|---:|
| eval-d1 | **4.2%** | 5.1% |
| eval-whatsapp | **6%** | 7.2% |
| hebrew-speech-kan | 7.1% | **6.4%** |
| Matti Caspi Songs | **2.4%** | 3.7% |
| average | **4.96%** | 5.6% |
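
For readers who want to reproduce this kind of number on their own data, here is a minimal sketch of how WER is conventionally computed: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `word_error_rate` is illustrative, and this is not the scoring script behind the table above.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# One confusable function word (אל vs. על) out of four reference words
print(word_error_rate("אני הולך אל הבית", "אני הולך על הבית"))  # 0.25
```

Corpus-level WER is usually the total number of errors divided by the total number of reference words, rather than the mean of per-utterance scores, so the two can differ on the same data.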

### Takeaway

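Caspi achieves a lower WER than the compared Hebrew Whisper baseline (Ivrit v3) on **eval-d1**, **eval-whatsapp**, and **Matti Caspi Songs**, as well as on the **overall average**. On **hebrew-speech-kan** (KAN-style broadcast speech) it remains competitive but trails the baseline by 0.7 points (7.1% vs. 6.4%).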

That makes it a strong Hebrew ASR checkpoint for real-world use, especially on conversational and less curated audio.

### Evaluation notes

- If you publish benchmark claims, specify whether decoding used **greedy** or **beam search**
- Keep normalization policy consistent across models
- Comparisons are only meaningful if decoding and preprocessing conditions are matched fairly
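
To illustrate what a consistent normalization policy can look like, the sketch below strips niqqud and cantillation marks, removes punctuation, and collapses whitespace before scoring. The rules and the name `normalize_hebrew` are assumptions chosen for illustration; they are not the policy used for the results reported here.

```python
import unicodedata

# Deleted outright so abbreviations such as צה"ל stay a single token:
# ASCII quotes, geresh (U+05F3), gershayim (U+05F4).
_QUOTE_LIKE = {'"', "'", "\u05f3", "\u05f4"}


def normalize_hebrew(text: str) -> str:
    """Illustrative policy: drop vowel points and punctuation, collapse spaces."""
    decomposed = unicodedata.normalize("NFD", text)
    kept = []
    for ch in decomposed:
        category = unicodedata.category(ch)
        if category == "Mn" or ch in _QUOTE_LIKE:
            continue  # niqqud, cantillation marks, quote-like characters
        if category.startswith("P"):
            kept.append(" ")  # other punctuation acts as a word boundary
        else:
            kept.append(ch)
    recomposed = unicodedata.normalize("NFC", "".join(kept))
    return " ".join(recomposed.split())


print(normalize_hebrew("שָׁלוֹם, עוֹלָם!"))  # שלום עולם
```

Whatever rules you choose, apply the same function to the references and to every model's hypotheses before computing WER.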

---

## Inference

Caspi uses the same overall inference ecosystem as the base Qwen3-ASR model.

Depending on your setup, you can use:

- the `qwen-asr` package
- Transformers-based inference
- vLLM-based inference
- optional forced alignment via `Qwen/Qwen3-ForcedAligner-0.6B`

Because Caspi is a fine-tuned derivative of Qwen3-ASR-1.7B, usage is similar to the base model — just replace the model name with `OzLabs/Caspi-1.7B`.

### Quick example

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="path/to/hebrew_audio.wav",
    language="Hebrew",
)

print(results[0].language)
print(results[0].text)
```

---

## Python package usage

### Transformers backend

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
)

results = model.transcribe(
    audio="audio path / url",  # local file path or URL
    language=None,  # None lets the model detect the language; pass "Hebrew" to force it
)

print(results[0].language)
print(results[0].text)
```

### With timestamps

```python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "OzLabs/Caspi-1.7B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    max_inference_batch_size=32,
    max_new_tokens=256,
    forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
    forced_aligner_kwargs=dict(
        dtype=torch.bfloat16,
        device_map="cuda:0",
    ),
)

results = model.transcribe(
    audio=[
        "path/to/audio",
        "audio.url",
    ],
    language=["Hebrew", "English"],
    return_time_stamps=True,
)

for r in results:
    print(r.language, r.text, r.time_stamps[0])
```

### vLLM backend

```python
import torch
from qwen_asr import Qwen3ASRModel

if __name__ == '__main__':
    model = Qwen3ASRModel.LLM(
        model="OzLabs/Caspi-1.7B",
        gpu_memory_utilization=0.7,
        max_inference_batch_size=128,
        max_new_tokens=4096,
        forced_aligner="Qwen/Qwen3-ForcedAligner-0.6B",
        forced_aligner_kwargs=dict(
            dtype=torch.bfloat16,
            device_map="cuda:0",
        ),
    )

    results = model.transcribe(
        audio=[
            "path/to/audio",
            "path/to/audio",
        ],
        language=["Hebrew", "English"],
        return_time_stamps=True,
    )

    for r in results:
        print(r.language, r.text, r.time_stamps[0])
```

### Serve with vLLM

```bash
qwen-asr-serve OzLabs/Caspi-1.7B --gpu-memory-utilization 0.8 --host 0.0.0.0 --port 8000
```

### Request example

```python
import requests
from qwen_asr import parse_asr_output

url = "http://localhost:8000/v1/chat/completions"
headers = {"Content-Type": "application/json"}

# A single user message carrying one audio item
data = {
    "messages": [
        {
            "role": "user",
            "content": [
                {
                    "type": "audio_url",
                    "audio_url": {
                        "url": "path/to/audio"
                    },
                }
            ],
        }
    ]
}

response = requests.post(url, headers=headers, json=data, timeout=300)
response.raise_for_status()
content = response.json()["choices"][0]["message"]["content"]
print(content)

# Split the raw model output into the language and the transcript
language, text = parse_asr_output(content)
print(language)
print(text)
```
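
If the audio lives on local disk rather than behind a URL, one option is to embed it in the request as a base64 data URL. The sketch below builds the same payload shape as the request above. It assumes the server accepts `data:` URLs in `audio_url`, which has not been verified here for `qwen-asr-serve`; the helper name `build_audio_request` and the default MIME type are illustrative.

```python
import base64
from pathlib import Path


def build_audio_request(audio_path: str, mime_type: str = "audio/wav") -> dict:
    """Build a chat-completions payload with the audio inlined as a data URL.

    Assumption: the serving endpoint accepts base64 data URLs for audio_url.
    """
    encoded = base64.b64encode(Path(audio_path).read_bytes()).decode("ascii")
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "audio_url",
                        "audio_url": {"url": f"data:{mime_type};base64,{encoded}"},
                    }
                ],
            }
        ]
    }
```

The returned dictionary can be passed as `json=` to `requests.post` exactly like `data` in the request example. Base64 inflates the payload by roughly a third, so this suits short clips better than long recordings.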

---

## Streaming inference

**Caspi** supports streaming inference through the vLLM backend.

Streaming is useful when you want lower-latency transcription, but note:

* no batch inference in streaming mode
* no timestamps in streaming mode

See the upstream Qwen3-ASR examples for the streaming backend implementation.

### Streaming demo

```bash
qwen-asr-demo-streaming \
  --asr-model-path OzLabs/Caspi-1.7B \
  --host 0.0.0.0 \
  --port 8000 \
  --gpu-memory-utilization 0.9
```

---

## Forced aligner

For timestamp prediction and alignment, Caspi can be used together with:

* `Qwen/Qwen3-ForcedAligner-0.6B`

### Example

```python
import torch
from qwen_asr import Qwen3ForcedAligner

model = Qwen3ForcedAligner.from_pretrained(
    "Qwen/Qwen3-ForcedAligner-0.6B",
    dtype=torch.bfloat16,
    device_map="cuda:0",
)

results = model.align(
    audio="path/to/audio",
    text="איך זה שכוכב אחד מעז",
    language="Hebrew",
)

print(results[0])
print(results[0][0].text, results[0][0].start_time, results[0][0].end_time)
```
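
Word-level alignments are often easier to consume once grouped into caption-like segments. The sketch below groups aligned words and starts a new segment at long pauses. It relies only on the `.text`, `.start_time`, and `.end_time` fields used above and assumes the times are in seconds, which should be checked against the upstream documentation. The `Word` class is a stand-in for illustration, and `group_words` with its thresholds is hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """Stand-in for one aligned item (assumed fields; times assumed in seconds)."""
    text: str
    start_time: float
    end_time: float


def group_words(words, max_gap=0.6, max_words=8):
    """Group aligned words into (start, end, text) segments.

    A new segment starts when the pause before a word exceeds max_gap
    or the current segment already holds max_words words.
    """
    segments = []
    current = []
    for word in words:
        long_pause = current and word.start_time - current[-1].end_time > max_gap
        if current and (long_pause or len(current) >= max_words):
            segments.append(current)
            current = []
        current.append(word)
    if current:
        segments.append(current)
    return [
        (seg[0].start_time, seg[-1].end_time, " ".join(w.text for w in seg))
        for seg in segments
    ]


demo = [Word("איך", 0.0, 0.4), Word("זה", 0.5, 0.9), Word("שכוכב", 2.0, 2.6)]
for start, end, text in group_words(demo):
    print(f"{start:.2f}-{end:.2f}: {text}")
```

With real output, the items of `results[0]` could be passed in place of `demo`, provided they expose the same three attributes.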

---

## Offline inference with vLLM

```python
from vllm import LLM, SamplingParams
from vllm.assets.audio import AudioAsset

llm = LLM(
    model="OzLabs/Caspi-1.7B"
)

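# Sample clip bundled with vLLM, used here as a placeholder;
# substitute a URL to your own Hebrew audio.
audio_asset = AudioAsset("winning_call")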

conversation = [
    {
        "role": "user",
        "content": [
            {
                "type": "audio_url",
                "audio_url": {"url": audio_asset.url}
            }
        ]
    }
]

sampling_params = SamplingParams(temperature=0.01, max_tokens=256)
outputs = llm.chat(conversation, sampling_params=sampling_params)
print(outputs[0].outputs[0].text)
```

---

## Recommended usage notes

For best results:

* use reasonably clean audio when possible
* segment long audio into shorter utterances
* keep sample rate aligned with the base model’s preprocessing expectations
* use **beam search** if latency allows
* apply consistent Hebrew text normalization during evaluation
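
As one simple way to segment long audio, the sketch below computes fixed-length windows with a small overlap, expressed as sample indices. The 30-second window, 1-second overlap, and 16 kHz sample rate in the example are illustrative assumptions, not requirements stated by this model card. Splitting at silences or with a voice activity detector generally avoids cutting words and is preferable when available.

```python
def chunk_bounds(num_samples, sample_rate, chunk_seconds=30.0, overlap_seconds=1.0):
    """Return (start, end) sample indices for overlapping fixed-length chunks."""
    if chunk_seconds <= overlap_seconds:
        raise ValueError("chunk_seconds must be larger than overlap_seconds")

    chunk = int(chunk_seconds * sample_rate)
    step = chunk - int(overlap_seconds * sample_rate)

    bounds = []
    start = 0
    while start < num_samples:
        end = min(start + chunk, num_samples)
        bounds.append((start, end))
        if end == num_samples:
            break  # the last chunk reached the end of the recording
        start += step
    return bounds


# A 70-second recording at an assumed 16 kHz sample rate
for start, end in chunk_bounds(70 * 16000, 16000):
    print(start / 16000, end / 16000)
```

Each `(start, end)` pair can be used to slice a waveform array before transcription. Words that fall in the overlap appear in both neighboring chunks and need to be deduplicated when the transcripts are merged.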

---

## Limitations

Caspi is strong, but Hebrew ASR is still hard.

Common failure modes include:

* short phonetically similar words such as `על / אל`, `אם / עם`, `לא / לו`
* noisy or low-bitrate speech
* overlapping speakers
* accented or highly informal speech
* domain-specific names, abbreviations, and slang
* code-switching between Hebrew and other languages

Performance will vary depending on:

* recording quality
* segmentation quality
* speaker style
* domain match between train and test data

---

## Ethical considerations

ASR systems can mis-transcribe people’s speech, especially under:

* noisy conditions
* accented speech
* overlapping speakers
* low-quality microphones
* compressed audio pipelines

For sensitive, high-stakes, or public-facing use cases, transcripts should be reviewed by a human.

---

## Acknowledgements

Caspi is built on top of **Qwen3-ASR-1.7B** from the Qwen team.

We also thank the creators and contributors of the Hebrew datasets used for fine-tuning, especially the Ivrit.AI community datasets.

---

## Citation

If you use Caspi in research or applications, please cite both the original Qwen3-ASR work and this checkpoint.

### Base model

```bibtex
@article{Qwen3-ASR,
  title={Qwen3-ASR Technical Report},
  author={Xian Shi and Xiong Wang and Zhifang Guo and Yongqi Wang and Pei Zhang and Xinyu Zhang and Zishan Guo and Hongkun Hao and Yu Xi and Baosong Yang and Jin Xu and Jingren Zhou and Junyang Lin},
  journal={arXiv preprint arXiv:2601.21337},
  year={2026}
}
```

### Caspi

```bibtex
@misc{caspi_hebrew_asr,
  title={Caspi-1.7B: Hebrew ASR fine-tuned from Qwen3-ASR-1.7B},
  author={Oz Labs},
  year={2026},
  howpublished={Hugging Face model card}
}
```