To access the Eskulap-ASR model, you must accept the Medalion Eskulap-ASR Community License v1.0. The model is free for research, private and non-commercial use, and for commercial use by entities with annual gross revenue not exceeding 1,000,000 PLN. Larger commercial entities require a separate commercial license from Medalion Technology P.S.A.

Whisper-large-v3-turbo Polish Medical ASR

A fine-tuned openai/whisper-large-v3-turbo specialized for Polish medical speech using an anti-forgetting training recipe (knowledge distillation + medical oversampling + general-domain replay) and partial LoRA merge (α=0.75 weight interpolation) for implicit regularization.

This model reduces content WER on held-out Polish medical test sets by 24–55% relative to the base model. Performance on general Polish speech (bigos) is essentially unchanged (+0.13 pp), and content WER on out-of-distribution European Parliament audio (VoxPopuli) drops by 42% relative.

Benchmark results (content WER — lowercase, no punctuation)

Held-out test sets (fair-eval methodology — no train/test text overlap):

Test Set	Base whisper-large-v3-turbo	This model	Δ (pp)	Relative
admed_anoni (medical, synthetic)	16.58 %	11.30 %	−5.28	−32 %
admed_human (medical, human read)	17.07 %	7.64 %	−9.43	−55 %
gemini (medical test2)	6.43 %	4.88 %	−1.55	−24 %
bigos (general Polish)	5.37 %	5.50 %	+0.13	+2 %
VoxPopuli (OOD, EU Parliament)	15.88 %	9.28 %	−6.60	−42 %

No catastrophic forgetting: bigos (general Polish) is essentially unchanged (+0.13pp). The model even improves on out-of-distribution formal Polish (VoxPopuli −6.6pp) because fine-tuning with Polish data fixes a language-detection issue in the base model.
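
The Δ and Relative columns follow directly from the two WER columns. A minimal sketch of that arithmetic, using the admed_human row as input (the helper name `wer_change` is illustrative and not part of the project):

```python
def wer_change(base_wer: float, model_wer: float) -> tuple[float, float]:
    """Return (absolute change in percentage points, relative change in %)."""
    delta_pp = model_wer - base_wer
    relative_pct = 100.0 * delta_pp / base_wer
    return delta_pp, relative_pct


# admed_human row from the table above: 17.07 % -> 7.64 %
delta_pp, relative_pct = wer_change(17.07, 7.64)
print(f"{delta_pp:+.2f} pp, {relative_pct:+.0f} %")  # -9.43 pp, -55 %
```

Negative values mean the fine-tuned model makes fewer errors than the base model.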

Error quality (qualitative analysis, n=2000 medical samples)

Metric	This model
Regressions (FT worse than base, >5pp per sample)	59 / 2000 (2.95%)
Improvements (FT better than base, >5pp per sample)	622 / 2000 (31.1%)
Unchanged	1319 / 2000 (66.0%)

Regression types are predominantly minor: Polish compound-word boundary shifts ("niereagujące" → "nie reagujące"), rare grammar insertions, and 2 repeat-loop hallucinations on audio where the base model also fails. No systematic formatting habits or deploy-hazardous patterns were detected.
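
The three buckets in the table can be reproduced by comparing per-sample WER of the fine-tuned (FT) and base models against the 5 pp threshold. The sketch below assumes per-sample WERs are expressed in percent and that a difference of exactly 5 pp counts as unchanged; the function name and the sample values are hypothetical.

```python
from collections import Counter


def classify_sample(base_wer: float, ft_wer: float, threshold_pp: float = 5.0) -> str:
    """Label one sample by how the fine-tuned WER compares with the base WER."""
    diff = ft_wer - base_wer  # positive means the fine-tuned model is worse
    if diff > threshold_pp:
        return "regression"
    if diff < -threshold_pp:
        return "improvement"
    return "unchanged"


# Hypothetical per-sample WERs in percent: (base, fine-tuned)
pairs = [(20.0, 5.0), (10.0, 12.0), (0.0, 8.0), (33.3, 33.3)]
counts = Counter(classify_sample(base, ft) for base, ft in pairs)
print(dict(counts))
```

Using a threshold rather than a strict comparison keeps single-word differences on long utterances from being counted as regressions or improvements.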

Training recipe

Component	Value
Base model	`openai/whisper-large-v3-turbo`
Adapter	LoRA r=64, α=128, dropout=0.0
LoRA targets	encoder + decoder attention + FFN projections (49M trainable params)
Learning rate	2e-4 (cosine, 10% warmup)
Epochs	5
Batch size	16 × 4 GPUs
Precision	fp16, gradient checkpointing
Anti-forgetting	KD α=0.3, T=2.0 from frozen base
Data mix	Medical × 2 oversampled + bigos 10k general replay
Post-training	Partial LoRA merge: α=0.75 weight interpolation
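
The table gives a peak learning rate of 2e-4 with a cosine schedule and 10% warmup, but not the exact schedule shape. The sketch below assumes linear warmup followed by cosine decay to zero, which is a common default; the function name and the total step count are hypothetical.

```python
import math


def lr_at_step(step: int, total_steps: int, peak_lr: float = 2e-4,
               warmup_frac: float = 0.10) -> float:
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, int(warmup_frac * total_steps))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; the actual step count is not stated
for step in (0, 50, 100, 550, 1000):
    print(step, f"{lr_at_step(step, total_steps):.2e}")
```

With these assumptions the rate peaks at the end of warmup (step 100 here) and falls to half the peak midway through the decay phase.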

Partial merge technique

Instead of fully merging the LoRA adapter (which overfits on some test sets), we interpolate weights:

final_weights = 0.75 × merged_lora_weights + 0.25 × base_weights

This acts as implicit regularization, reducing per-sample regressions by ~20% compared to full merge while preserving nearly all of the medical WER improvement. The optimal α=0.75 was found via an 8-point sweep across [0.25, 0.85].
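
A minimal sketch of the interpolation formula above, applied per parameter. It assumes both checkpoints are available as name-to-weight mappings with identical keys; scalars stand in for weight tensors here, and the function name is illustrative rather than the project's actual script.

```python
def partial_merge(base_weights: dict, merged_lora_weights: dict, alpha: float = 0.75) -> dict:
    """Interpolate per parameter: alpha * merged + (1 - alpha) * base."""
    if base_weights.keys() != merged_lora_weights.keys():
        raise ValueError("base and merged checkpoints must share parameter names")
    return {
        name: alpha * merged_lora_weights[name] + (1.0 - alpha) * base_weights[name]
        for name in base_weights
    }


# Toy example with scalars standing in for weight tensors
base = {"w": 1.0}
lora_delta = {"w": 0.4}
merged = {"w": base["w"] + lora_delta["w"]}  # full merge: base + LoRA update

final = partial_merge(base, merged, alpha=0.75)
print(final["w"])  # same value as base + 0.75 * lora_delta, up to rounding
```

Because a fully merged weight is the base weight plus the LoRA update, this interpolation is equivalent to scaling the update by α. Parameters the adapter does not touch are identical in both checkpoints and come out unchanged. The same expression works elementwise on tensors, such as those in a PyTorch state dict.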

Training: ~6h on 4×A100 (SXM4-40GB). Partial merge adds ~5 min of post-processing.

Datasets

Dataset	Role	Samples (train)
`lion-ai/admed_voice` (admed_anoni)	Medical (synthetic)	8,516 × 2
`lion-ai/admed_voice` (admed_human)	Medical (human read)	5,693 × 2
`lion-ai/pl_med_asr_test2`	Medical (test2)	1,301 × 2
`lion-ai/bigos`	General Polish (replay)	10,000
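
The effect of 2× oversampling on the domain balance can be worked out from the sample counts in the table. This is plain arithmetic on those numbers; the variable names are illustrative.

```python
# Training-set sizes taken from the table above
medical = {"admed_anoni": 8516, "admed_human": 5693, "pl_med_asr_test2": 1301}
general_replay = 10_000
oversample = 2  # each medical sample appears twice per epoch

medical_unique = sum(medical.values())
medical_total = oversample * medical_unique
epoch_total = medical_total + general_replay

share_with = medical_total / epoch_total
share_without = medical_unique / (medical_unique + general_replay)

print(medical_total, epoch_total)  # 31020 41020
print(f"{share_without:.1%} -> {share_with:.1%}")  # 60.8% -> 75.6%
```

Oversampling raises the medical share of each epoch from roughly three fifths to roughly three quarters, while the 10,000 replay samples keep general Polish in every epoch.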

Evaluation uses held-out test splits from the datasets above, plus 200 out-of-distribution samples from VoxPopuli European Parliament.
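
The benchmark heading defines content WER as WER computed on lowercased text with punctuation removed. The project's exact normalizer is not documented here, so the sketch below is an assumption: it replaces every Unicode punctuation character with a space, splits on whitespace, and computes word-level edit distance in pure Python.

```python
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, replace punctuation with spaces, split on whitespace."""
    kept = (
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text.lower()
    )
    return "".join(kept).split()


def content_wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference has no words after normalization")
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1] / len(ref)


ref = "Zmiany skórne, niereagujące na leczenie."
hyp = "zmiany skórne nie reagujące na leczenie"
print(f"{content_wer(ref, hyp):.3f}")  # 0.400
```

The example shows why compound-word boundary shifts are costly under this metric: splitting "niereagujące" into "nie reagujące" counts as one substitution plus one insertion, so two errors on a five-word reference.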

Usage

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "lion-ai/eskulap-asr-turbo-beta"

# Load the model in half precision on the GPU, plus the matching processor
model = WhisperForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
processor = WhisperProcessor.from_pretrained(model_id, language="Polish", task="transcribe")

# Inference: load the audio resampled to 16 kHz mono, as Whisper expects
audio, sr = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to("cuda")
inputs["input_features"] = inputs["input_features"].half()  # match the fp16 model weights
with torch.no_grad():
    predicted_ids = model.generate(**inputs, language="pl", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Why anti-forgetting matters

Naively fine-tuning Whisper on medical-only data destroys performance on general Polish. This recipe combines four techniques:

Data replay — mixing general-domain (bigos) samples in training
Knowledge distillation — KL divergence loss to frozen base preserves its output distribution
Medical oversampling — repeats the medical training data 2× to shift the balance
Partial merge — weight interpolation at deploy time provides implicit regularization

Result: strong medical WER improvement with no general-domain forgetting.
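
The training recipe lists the distillation settings (KD α=0.3, T=2.0, frozen base as teacher) but not the exact loss formula. The sketch below assumes the common formulation (1 − α) · cross-entropy + α · T² · KL(teacher ‖ student), with both distributions softened by the temperature. It works on a single token position in pure Python; the function names and logits are hypothetical, and real training would operate on batched tensors.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits, teacher_logits, target_index,
            kd_alpha=0.3, temperature=2.0):
    """(1 - alpha) * cross-entropy + alpha * T^2 * KL(teacher || student)."""
    ce = -math.log(softmax(student_logits)[target_index])
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return (1.0 - kd_alpha) * ce + kd_alpha * temperature ** 2 * kl


teacher = [2.0, 0.5, -1.0]  # hypothetical logits from the frozen base model
student = [1.5, 1.0, -0.5]  # hypothetical logits from the model being trained
print(f"{kd_loss(student, teacher, target_index=0):.4f}")
```

The KL term is zero when the student reproduces the teacher's softened distribution, so it only penalizes drift away from the base model, while the cross-entropy term pulls toward the medical transcripts.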

Known limitations

Medical terminology with complex Polish morphology (e.g., "pozapiramidowego", "gruczolakorak") remains challenging — both base and fine-tuned models make similar errors on these terms (acoustic limit).
~3% of samples show minor regressions vs base (mostly compound-word boundary shifts).
Trained primarily on read/dictated medical speech; spontaneous conversational medical speech may differ.

Related work

Part of the Eskulap project — Polish medical ASR research. See also the smaller variant based on openai/whisper-medium.

Intended use and medical disclaimer

This model is an automatic speech recognition tool. It is not a certified medical device, not a diagnostic or therapeutic system, and not a standalone tool for clinical decision-making.

Outputs must not be used as the sole basis for diagnosis, treatment, clinical decisions, administrative decisions about patients, or medical record-keeping without appropriate verification by a qualified human. In any clinical setting, the deployer is responsible for validation, human oversight, risk assessment, and compliance with applicable healthcare, data-protection and patient-rights regulations (including GDPR).

License and usage

This model is provided under the Medalion Eskulap-ASR Community License v1.0. It is free for research, personal, non-commercial use, and commercial use by entities with annual gross revenue not exceeding 1,000,000 PLN. Larger commercial users require a separate commercial license from Medalion Technology P.S.A.

Full license text: LICENSE
Commercial licensing inquiries: kontakt@medalion.tech

Made by

TheLion.ai Research Group.

Project lead: Aleksander Obuchowski

Special thanks to:

Maciej Gierczak
Kinga Marszałkowska
Mikołaj Badocha
