Whisper-large-v3-turbo Polish Medical ASR
A fine-tuned openai/whisper-large-v3-turbo specialized for Polish medical speech using an
anti-forgetting training recipe (knowledge distillation + medical oversampling + general-domain replay)
and partial LoRA merge (α=0.75 weight interpolation) for implicit regularization.
This model reduces content WER on held-out Polish medical test sets by 24–55% relative to the base model. Performance on general Polish speech (bigos) is essentially unchanged, and out-of-distribution European Parliament audio (VoxPopuli) improves by 42% relative.
Benchmark results (content WER — lowercase, no punctuation)
Held-out test sets (fair-eval methodology — no train/test text overlap):
| Test Set | Base whisper-large-v3-turbo | This model | Δ (pp) | Relative |
|---|---|---|---|---|
| admed_anoni (medical, synthetic) | 16.58 % | 11.30 % | −5.28 | −32 % |
| admed_human (medical, human read) | 17.07 % | 7.64 % | −9.43 | −55 % |
| gemini (medical test2) | 6.43 % | 4.88 % | −1.55 | −24 % |
| bigos (general Polish) | 5.37 % | 5.50 % | +0.13 | +2 % |
| VoxPopuli (OOD, EU Parliament) | 15.88 % | 9.28 % | −6.60 | −42 % |
No catastrophic forgetting: bigos (general Polish) is essentially unchanged (+0.13pp). The model even improves on out-of-distribution formal Polish (VoxPopuli −6.6pp) because fine-tuning with Polish data fixes a language-detection issue in the base model.
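The "content WER" used in the table lowercases text and removes punctuation before scoring. A minimal sketch of such a metric follows. It assumes whitespace tokenization, a hand-picked punctuation set, and per-utterance scoring, none of which the card specifies. The names `normalize` and `word_error_rate` are illustrative and are not the project's evaluation code.

```python
import string

# Assumption: ASCII punctuation plus a few typographic marks common in Polish text.
_PUNCTUATION = string.punctuation + "„”«»…"
_STRIP_TABLE = str.maketrans("", "", _PUNCTUATION)


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, and split on whitespace."""
    return text.lower().translate(_STRIP_TABLE).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference is empty after normalization")
    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("Pacjent zgłasza ból głowy.", "pacjent zgłasza ból głowy"))
print(word_error_rate("niereagujące źrenice", "nie reagujące źrenice"))
```

Under this scoring, casing and punctuation differences cost nothing, while a compound-word split such as "niereagujące" → "nie reagujące" counts as one substitution plus one insertion.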
Error quality (qualitative analysis, n=2000 medical samples)
| Metric | This model |
|---|---|
| Regressions (FT worse than base, >5pp per sample) | 59 / 2000 (2.95%) |
| Improvements (FT better than base, >5pp per sample) | 622 / 2000 (31.1%) |
| Unchanged | 1319 / 2000 (66.0%) |
Regression types are predominantly minor: Polish compound-word boundary shifts ("niereagujące" → "nie reagujące"), rare grammar insertions, and 2 repeat-loop hallucinations on audio where the base model also fails. No systematic formatting habits or deploy-hazardous patterns were detected.
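The per-sample buckets in the table can be reproduced from two lists of per-sample WER values. The sketch below assumes WER is expressed in percent and that a sample counts as changed only when the difference strictly exceeds 5 percentage points. `classify_samples` is a hypothetical helper, not the project's analysis script.

```python
def classify_samples(base_wer, ft_wer, threshold_pp=5.0):
    """Bucket samples by the per-sample WER change (fine-tuned minus base).

    Both inputs are per-sample WER values in percent, in the same order.
    """
    if len(base_wer) != len(ft_wer):
        raise ValueError("base_wer and ft_wer must have the same length")
    counts = {"regression": 0, "improvement": 0, "unchanged": 0}
    for base, fine_tuned in zip(base_wer, ft_wer):
        delta_pp = fine_tuned - base
        if delta_pp > threshold_pp:
            counts["regression"] += 1    # fine-tuned model is clearly worse
        elif delta_pp < -threshold_pp:
            counts["improvement"] += 1   # fine-tuned model is clearly better
        else:
            counts["unchanged"] += 1     # within the +/- threshold band
    return counts


print(classify_samples([10.0, 20.0, 5.0], [16.0, 8.0, 7.0]))
```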
Training recipe
| Component | Value |
|---|---|
| Base model | openai/whisper-large-v3-turbo |
| Adapter | LoRA r=64, α=128, dropout=0.0 |
| LoRA targets | encoder + decoder attention + FFN projections (49M trainable params) |
| Learning rate | 2e-4 (cosine, 10% warmup) |
| Epochs | 5 |
| Batch size | 16 × 4 GPUs |
| Precision | fp16, gradient checkpointing |
| Anti-forgetting | KD α=0.3, T=2.0 from frozen base |
| Data mix | Medical × 2 oversampled + bigos 10k general replay |
| Post-training | Partial LoRA merge: α=0.75 weight interpolation |
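For reference, the learning-rate row (2e-4, cosine, 10% warmup) can be sketched as a function of the optimizer step. This assumes linear warmup from zero and cosine decay to zero, which is a common convention but is not stated in the table. `lr_at_step` is an illustrative name.

```python
import math


def lr_at_step(step, total_steps, peak_lr=2e-4, warmup_ratio=0.10):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = int(total_steps * warmup_ratio)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup phase that has elapsed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


for step in (0, 50, 100, 550, 1000):
    print(step, lr_at_step(step, total_steps=1000))
```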
Partial merge technique
Instead of fully merging the LoRA adapter (which overfits on some test sets), we interpolate weights:
final_weights = 0.75 × merged_lora_weights + 0.25 × base_weights
This acts as implicit regularization, reducing per-sample regressions by ~20% compared to full merge while preserving nearly all of the medical WER improvement. The optimal α=0.75 was found via an 8-point sweep across [0.25, 0.85].
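The interpolation can be sketched as a per-parameter blend of two weight dictionaries. The function below is a hypothetical helper, not the project's merge script. It assumes both dictionaries share the same keys and that the values support scalar multiplication and addition, which holds for plain floats and for tensor types.

```python
def partial_merge(base_weights, merged_weights, alpha=0.75):
    """Blend fully merged LoRA weights with base weights, parameter by parameter.

    Because merged = base + delta for a merged LoRA adapter, the result is
    base + alpha * delta, i.e. the LoRA update scaled by alpha.
    """
    if base_weights.keys() != merged_weights.keys():
        raise ValueError("base and merged weights must have identical parameter names")
    return {
        name: alpha * merged_weights[name] + (1.0 - alpha) * base_weights[name]
        for name in base_weights
    }


# Toy example with scalar "weights"; the parameter names are made up.
base = {"encoder.w": 1.0, "decoder.w": -2.0}
merged = {"encoder.w": 2.0, "decoder.w": -2.0}
print(partial_merge(base, merged))
```

Parameters the adapter never touched are identical in both dictionaries, so the blend leaves them unchanged.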
Training: ~6h on 4×A100 (SXM4-40GB). Partial merge adds ~5 min of post-processing.
Datasets
| Dataset | Role | Samples (train) |
|---|---|---|
| lion-ai/admed_voice (admed_anoni) | Medical (synthetic) | 8,516 × 2 |
| lion-ai/admed_voice (admed_human) | Medical (human read) | 5,693 × 2 |
| lion-ai/pl_med_asr_test2 | Medical (test2) | 1,301 × 2 |
| lion-ai/bigos | General Polish (replay) | 10,000 |
Evaluation uses held-out test splits from the datasets above, plus 200 out-of-distribution samples from VoxPopuli European Parliament.
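The sample counts in the dataset table imply the following training mix. This is plain arithmetic on the listed numbers and assumes that "× 2" means each medical sample appears twice per epoch.

```python
# Train-split sizes as listed in the dataset table.
MEDICAL_TRAIN = {
    "admed_anoni": 8516,
    "admed_human": 5693,
    "pl_med_asr_test2": 1301,
}
GENERAL_REPLAY = 10_000   # bigos samples used for general-domain replay
OVERSAMPLE = 2            # each medical sample is repeated twice

medical_total = sum(MEDICAL_TRAIN.values()) * OVERSAMPLE
total = medical_total + GENERAL_REPLAY
medical_share = medical_total / total

print(f"medical samples per epoch: {medical_total}")
print(f"all samples per epoch:     {total}")
print(f"medical share:             {medical_share:.1%}")
```

Under that reading, roughly three quarters of each epoch is medical speech and one quarter is general-domain replay.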
Usage
```python
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "lion-ai/eskulap-asr-turbo-beta"
model = WhisperForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
processor = WhisperProcessor.from_pretrained(model_id, language="Polish", task="transcribe")

# Inference
audio, sr = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt").to("cuda")
inputs["input_features"] = inputs["input_features"].half()  # match the fp16 model weights
with torch.no_grad():
    predicted_ids = model.generate(**inputs, language="pl", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)
```
Why anti-forgetting matters
Naively fine-tuning Whisper on medical-only data destroys performance on general Polish. This recipe combines four techniques:
- Data replay — mixing general-domain (bigos) samples in training
- Knowledge distillation — KL divergence loss to frozen base preserves its output distribution
- Medical oversampling — repeats the medical training data 2× to shift the balance
- Partial merge — weight interpolation at deploy time provides implicit regularization
Result: strong medical WER improvement with no general-domain forgetting.
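The knowledge-distillation term can be illustrated for a single token position. The sketch below uses plain Python so the arithmetic is visible. It assumes the common formulation (1 − α) · CE + α · T² · KL(teacher ‖ student), with α = 0.3 and T = 2.0 taken from the training recipe. The card does not state how the two terms are combined, so the weighting and the T² factor are assumptions, and `kd_loss` is an illustrative name.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, target_index, kd_alpha=0.3, temperature=2.0):
    """Cross-entropy on the label plus KL divergence to the frozen teacher."""
    # Standard cross-entropy against the ground-truth token (temperature 1).
    ce = -math.log(softmax(student_logits)[target_index])
    # KL(teacher || student) on temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    # Assumed weighting; T^2 keeps the KL gradient scale comparable to CE.
    return (1.0 - kd_alpha) * ce + kd_alpha * temperature ** 2 * kl


same = kd_loss([2.0, 1.0, 0.0], [2.0, 1.0, 0.0], target_index=0)
drifted = kd_loss([2.0, 1.0, 0.0], [0.0, 1.0, 2.0], target_index=0)
print(same, drifted)
```

When the student matches the frozen base exactly, the KL term vanishes and only the scaled cross-entropy remains. The further the student's distribution drifts from the base model's, the larger the penalty.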
Known limitations
- Medical terminology with complex Polish morphology (e.g., "pozapiramidowego", "gruczolakorak") remains challenging — both base and fine-tuned models make similar errors on these terms (acoustic limit).
- ~3% of samples show minor regressions vs base (mostly compound-word boundary shifts).
- Trained primarily on read/dictated medical speech; spontaneous conversational medical speech may differ.
Related work
Part of the Eskulap project — Polish medical ASR research.
See also the smaller variant based on openai/whisper-medium.
Intended use and medical disclaimer
This model is an automatic speech recognition tool. It is not a certified medical device, not a diagnostic or therapeutic system, and not a standalone tool for clinical decision-making.
Outputs must not be used as the sole basis for diagnosis, treatment, clinical decisions, administrative decisions about patients, or medical record-keeping without appropriate verification by a qualified human. In any clinical setting, the deployer is responsible for validation, human oversight, risk assessment, and compliance with applicable healthcare, data-protection and patient-rights regulations (including GDPR).
License and usage
This model is provided under the Medalion Eskulap-ASR Community License v1.0. It is free for research, personal, non-commercial use, and commercial use by entities with annual gross revenue not exceeding 1,000,000 PLN. Larger commercial users require a separate commercial license from Medalion Technology P.S.A.
- Full license text: LICENSE
- Commercial licensing inquiries: kontakt@medalion.tech
Made by
TheLion.ai Research Group.
Project lead: Aleksander Obuchowski
Special thanks to:
- Maciej Gierczak
- Kinga Marszałkowska
- Mikołaj Badocha