

Whisper-large-v3-turbo Polish Medical ASR

A fine-tuned openai/whisper-large-v3-turbo specialized for Polish medical speech using an anti-forgetting training recipe (knowledge distillation + medical oversampling + general-domain replay) and partial LoRA merge (α=0.75 weight interpolation) for implicit regularization.

This model reduces content WER on held-out Polish medical test sets by 24–55% relative to the base model. Performance on general Polish speech (bigos) is essentially unchanged, and out-of-distribution European Parliament audio (VoxPopuli) improves by 42% relative.

Benchmark results (content WER — lowercase, no punctuation)

Held-out test sets (fair-eval methodology — no train/test text overlap):

| Test set | Base whisper-large-v3-turbo | This model | Δ (pp) | Relative |
|---|---|---|---|---|
| admed_anoni (medical, synthetic) | 16.58 % | 11.30 % | −5.28 | −32 % |
| admed_human (medical, human read) | 17.07 % | 7.64 % | −9.43 | −55 % |
| gemini (medical test2) | 6.43 % | 4.88 % | −1.55 | −24 % |
| bigos (general Polish) | 5.37 % | 5.50 % | +0.13 | +2 % |
| VoxPopuli (OOD, EU Parliament) | 15.88 % | 9.28 % | −6.60 | −42 % |

No catastrophic forgetting: bigos (general Polish) is essentially unchanged (+0.13pp). The model even improves on out-of-distribution formal Polish (VoxPopuli −6.6pp) because fine-tuning with Polish data fixes a language-detection issue in the base model.
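To make the "content WER" metric concrete, here is a minimal sketch of how such a score could be computed. The normalization rules (lowercasing, deleting every Unicode punctuation character) and the corpus-level aggregation are assumptions; the card does not specify the exact normalizer or scoring tool used, and the function names `normalize`, `word_errors`, and `content_wer` are illustrative.

```python
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, delete punctuation, and split on whitespace (assumed rules)."""
    text = text.lower()
    # Unicode categories starting with "P" cover all punctuation marks.
    text = "".join(ch for ch in text if not unicodedata.category(ch).startswith("P"))
    return text.split()


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def content_wer(references: list[str], hypotheses: list[str]) -> float:
    """Corpus-level WER: total word errors divided by total reference words."""
    errors = words = 0
    for ref, hyp in zip(references, hypotheses):
        r, h = normalize(ref), normalize(hyp)
        errors += word_errors(r, h)
        words += len(r)
    if words == 0:
        raise ValueError("references contain no words")
    return errors / words
```

Under this scoring, casing and punctuation differences cost nothing, while a compound-word split such as "niereagujące" versus "nie reagujące" counts as two word errors (one substitution and one insertion).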

Error quality (qualitative analysis, n=2000 medical samples)

| Metric | This model |
|---|---|
| Regressions (FT worse than base, >5pp per sample) | 59 / 2000 (2.95%) |
| Improvements (FT better than base, >5pp per sample) | 622 / 2000 (31.1%) |
| Unchanged | 1319 / 2000 (66.0%) |

Regression types are predominantly minor: Polish compound-word boundary shifts ("niereagujące" → "nie reagujące"), rare grammar insertions, and 2 repeat-loop hallucinations on audio where the base model also fails. No systematic formatting habits or deploy-hazardous patterns were detected.
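The per-sample bucketing in the table above can be sketched as follows. This assumes per-sample WER is stored as a fraction (0.10 for 10%) and that a sample counts as changed only when the difference exceeds 5 percentage points in either direction; `classify` and `summarize` are hypothetical helper names, not part of any released tooling.

```python
from collections import Counter


def classify(base_wer: float, ft_wer: float, threshold_pp: float = 5.0) -> str:
    """Bucket one sample by the change in WER, measured in percentage points."""
    delta_pp = (ft_wer - base_wer) * 100.0
    if delta_pp > threshold_pp:
        return "regression"
    if delta_pp < -threshold_pp:
        return "improvement"
    return "unchanged"


def summarize(pairs: list[tuple[float, float]]) -> dict[str, tuple[int, float]]:
    """Return (count, percent of samples) for each bucket."""
    counts = Counter(classify(base, ft) for base, ft in pairs)
    total = sum(counts.values())
    return {
        bucket: (counts[bucket], 100.0 * counts[bucket] / total)
        for bucket in ("regression", "improvement", "unchanged")
    }
```

The dead band around zero keeps small per-sample fluctuations from being counted as regressions or improvements.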

Training recipe

| Component | Value |
|---|---|
| Base model | openai/whisper-large-v3-turbo |
| Adapter | LoRA r=64, α=128, dropout=0.0 |
| LoRA targets | encoder + decoder attention + FFN projections (49M trainable params) |
| Learning rate | 2e-4 (cosine, 10% warmup) |
| Epochs | 5 |
| Batch size | 16 × 4 GPUs |
| Precision | fp16, gradient checkpointing |
| Anti-forgetting | KD α=0.3, T=2.0 from frozen base |
| Data mix | Medical × 2 oversampled + bigos 10k general replay |
| Post-training | Partial LoRA merge: α=0.75 weight interpolation |
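As an illustration of the learning-rate row, the sketch below implements a cosine schedule with 10% warmup and a peak of 2e-4. It assumes a linear warmup from zero and a decay to zero at the final step, which is a common convention but is not confirmed by the card. The function name `lr_at` and the step count used in the example are hypothetical.

```python
import math


def lr_at(step: int, total_steps: int, peak_lr: float = 2e-4,
          warmup_frac: float = 0.10) -> float:
    """Learning rate at a given optimizer step (assumed linear warmup, cosine decay)."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        # Linear ramp from 0 up to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Cosine decay from the peak down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))


# Example with a hypothetical run of 1000 optimizer steps.
schedule = [lr_at(s, 1000) for s in (0, 50, 100, 550, 1000)]
```

With 1000 steps, the rate climbs over the first 100 steps, peaks at 2e-4, passes through roughly half the peak at the midpoint of the decay, and reaches zero at the end.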

Partial merge technique

Instead of fully merging the LoRA adapter (which overfits on some test sets), we interpolate weights:

final_weights = 0.75 × merged_lora_weights + 0.25 × base_weights

This acts as implicit regularization, reducing per-sample regressions by ~20% compared to full merge while preserving nearly all of the medical WER improvement. The optimal α=0.75 was found via an 8-point sweep across [0.25, 0.85].
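The interpolation formula can be sketched in a few lines. The example below uses plain Python lists as stand-ins for weight tensors so it runs without any ML framework; `interpolate` is an illustrative name, and the two-parameter "state dicts" are made-up toy values. Note that the merge coefficient α=0.75 here is unrelated to the LoRA scaling α=128 and the KD weight α=0.3 from the training recipe.

```python
def interpolate(base: dict, merged: dict, alpha: float = 0.75) -> dict:
    """Return alpha * merged + (1 - alpha) * base for every parameter."""
    if base.keys() != merged.keys():
        raise ValueError("both state dicts must have the same parameter names")
    return {
        name: [alpha * m + (1.0 - alpha) * b
               for m, b in zip(merged[name], base[name])]
        for name in base
    }


# Toy example: flat lists stand in for real weight tensors.
base_weights = {"layer.weight": [1.0, 2.0]}
merged_lora_weights = {"layer.weight": [2.0, 4.0]}
final_weights = interpolate(base_weights, merged_lora_weights)
```

Because the fully merged weights equal the base weights plus the LoRA update, this interpolation is algebraically the same as adding only 75% of the LoRA update to the base model.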

Training: ~6h on 4×A100 (SXM4-40GB). Partial merge adds ~5 min of post-processing.

Datasets

| Dataset | Role | Samples (train) |
|---|---|---|
| lion-ai/admed_voice (admed_anoni) | Medical (synthetic) | 8,516 × 2 |
| lion-ai/admed_voice (admed_human) | Medical (human read) | 5,693 × 2 |
| lion-ai/pl_med_asr_test2 | Medical (test2) | 1,301 × 2 |
| lion-ai/bigos | General Polish (replay) | 10,000 |

Evaluation uses held-out test splits from the datasets above, plus 200 out-of-distribution samples from VoxPopuli European Parliament.
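The arithmetic behind the training mix in the table above can be checked with a short sketch. It assumes the "× 2" oversampling is implemented by repeating each medical training sample twice and then shuffling everything together with the replay set; the actual sampling mechanism is not described in the card. Sample contents are placeholder tuples, and `build_mix` is a hypothetical helper.

```python
import random

# Training-set sizes taken from the table above.
TRAIN_SIZES = {"admed_anoni": 8516, "admed_human": 5693, "pl_med_asr_test2": 1301}
REPLAY_SIZE = 10_000
MEDICAL_REPEAT = 2


def build_mix(medical: dict, replay: list, repeat: int = MEDICAL_REPEAT,
              seed: int = 0) -> list:
    """Repeat each medical set `repeat` times, add replay data, then shuffle."""
    mix = []
    for samples in medical.values():
        mix.extend(samples * repeat)
    mix.extend(replay)
    random.Random(seed).shuffle(mix)
    return mix


# Placeholder samples: (source name, index) instead of real audio/text pairs.
medical = {name: [(name, i) for i in range(n)] for name, n in TRAIN_SIZES.items()}
replay = [("bigos", i) for i in range(REPLAY_SIZE)]
mix = build_mix(medical, replay)
medical_share = sum(1 for name, _ in mix if name != "bigos") / len(mix)
```

Under these assumptions, one epoch contains 41,020 samples, of which about 76% are medical and about 24% are general-domain replay.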

Usage

import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "lion-ai/eskulap-asr-turbo-beta"
# Load the model in half precision on the GPU.
model = WhisperForConditionalGeneration.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda")
processor = WhisperProcessor.from_pretrained(model_id, language="Polish", task="transcribe")

# Inference: load the audio resampled to 16 kHz, the rate Whisper expects.
audio, sr = librosa.load("sample.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to("cuda")
# Cast the input features to fp16 so they match the model weights.
inputs["input_features"] = inputs["input_features"].half()
with torch.no_grad():
    predicted_ids = model.generate(**inputs, language="pl", task="transcribe")
transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
print(transcription)

Why anti-forgetting matters

Naively fine-tuning Whisper on medical-only data destroys performance on general Polish. This recipe combines four techniques:

  1. Data replay — mixing general-domain (bigos) samples in training
  2. Knowledge distillation — KL divergence loss to frozen base preserves its output distribution
  3. Medical oversampling — repeats the medical training data 2× to shift the balance
  4. Partial merge — weight interpolation at deploy time provides implicit regularization

Result: strong medical WER improvement with no general-domain forgetting.
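To illustrate the knowledge-distillation term, here is a minimal single-token sketch using the KD α=0.3 and T=2.0 values from the training recipe. The way the two terms are combined, (1 − α) × cross-entropy + α × T² × KL(teacher ‖ student), follows the common Hinton-style formulation and is an assumption; the card does not state the exact weighting. The logits are toy values and the function names are illustrative.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax with temperature scaling."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: list[float], teacher_logits: list[float], target: int,
            alpha: float = 0.3, temperature: float = 2.0) -> float:
    """Assumed mix: (1 - alpha) * CE(student, target) + alpha * T^2 * KL(teacher || student)."""
    # Standard cross-entropy against the ground-truth token.
    ce = -math.log(softmax(student_logits)[target])
    # KL divergence between temperature-softened teacher and student distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(pt * math.log(pt / ps) for pt, ps in zip(p_teacher, p_student) if pt > 0)
    return (1.0 - alpha) * ce + alpha * temperature ** 2 * kl


# Toy 3-token vocabulary: the student agrees or disagrees with the frozen base.
agree = kd_loss([2.0, 0.0, 0.0], [2.0, 0.0, 0.0], target=0)
disagree = kd_loss([2.0, 0.0, 0.0], [0.0, 2.0, 0.0], target=0)
```

When the student matches the frozen base model, the KL term vanishes and only the cross-entropy remains; any drift away from the base distribution adds a penalty, which is what discourages forgetting.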

Known limitations

  • Medical terminology with complex Polish morphology (e.g., "pozapiramidowego", "gruczolakorak") remains challenging — both base and fine-tuned models make similar errors on these terms (acoustic limit).
  • ~3% of samples show minor regressions vs base (mostly compound-word boundary shifts).
  • Trained primarily on read/dictated medical speech; spontaneous conversational medical speech may differ.

Related work

Part of the Eskulap project — Polish medical ASR research. See also the smaller variant based on openai/whisper-medium.

Intended use and medical disclaimer

This model is an automatic speech recognition tool. It is not a certified medical device, not a diagnostic or therapeutic system, and not a standalone tool for clinical decision-making.

Outputs must not be used as the sole basis for diagnosis, treatment, clinical decisions, administrative decisions about patients, or medical record-keeping without appropriate verification by a qualified human. In any clinical setting, the deployer is responsible for validation, human oversight, risk assessment, and compliance with applicable healthcare, data-protection and patient-rights regulations (including GDPR).

License and usage

This model is provided under the Medalion Eskulap-ASR Community License v1.0. It is free for research, personal, non-commercial use, and commercial use by entities with annual gross revenue not exceeding 1,000,000 PLN. Larger commercial users require a separate commercial license from Medalion Technology P.S.A.

Made by

TheLion.ai Research Group.

Project lead: Aleksander Obuchowski

Special thanks to:

  • Maciej Gierczak
  • Kinga Marszałkowska
  • Mikołaj Badocha