This model was created for the On Top of Pasketti: Children’s Speech Recognition Challenge - Phonetic Track competition. It was trained on a large-scale dataset specifically designed for children's speech recognition.
The model is based on Qwen/Qwen3-ASR-1.7B.
- Local validation CER (character error rate, lower is better): 0.2794
- Public Leaderboard CER: 0.2795
- Private Leaderboard CER: 0.2806
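
For readers unfamiliar with the metric, here is a minimal sketch of how a character error rate can be computed: total character-level edit distance divided by total reference length. This is an illustration only. The competition's actual scoring script, including any text normalization and whether it pools characters across utterances or averages per utterance, is not described here, so the pooled form below is an assumption, and the helper names `edit_distance` and `cer` are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(
                prev[j] + 1,             # deletion of a reference character
                cur[j - 1] + 1,          # insertion of a hypothesis character
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = cur
    return prev[-1]


def cer(references, hypotheses):
    """Pooled CER: total edits over total reference characters (assumed form)."""
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    total_chars = sum(len(r) for r in references)
    return total_edits / total_chars


# One substituted character out of three reference characters.
print(cer(["kæt"], ["kat"]))
```

Under this definition, a CER of about 0.28 means roughly 28 character edits per 100 reference characters.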
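Usage:

    import time

    import torch
    from qwen_asr import Qwen3ASRModel

    # Defined by the caller (not shown here):
    #   items    - list of dicts with "audio_path" and "audio_duration_sec" keys
    #   data_dir - pathlib.Path pointing at the audio root directory


    def get_dynamic_batches(items):
        # The batch size is derived from the first item's duration and targets
        # roughly 5000 seconds of audio per batch, capped at 64 clips.
        # Clips longer than 200 seconds are processed one at a time.
        total = len(items)
        i = 0
        while i < total:
            if items[i]['audio_duration_sec'] > 200:
                batch_size = 1
            else:
                batch_size = int(5000 / items[i]['audio_duration_sec']) + 1
                batch_size = min(batch_size, 64)
            yield items[i:i + batch_size]
            i += batch_size


    model = Qwen3ASRModel.from_pretrained(
        "ZFTurbo/Qwen3-ASR-Children-Phonetic",
        dtype=torch.bfloat16,
        device_map="cuda:0",
        max_inference_batch_size=64,
        max_new_tokens=-1,
    )

    predictions = {}
    processed = 0  # number of clips transcribed so far
    with torch.inference_mode():
        for batch in get_dynamic_batches(items):
            paths = []
            languages = []
            max_new_tokens = 0
            for item in batch:
                path = str(data_dir / item["audio_path"])
                paths.append(path)
                languages.append("English")
                # Allow up to 20 generated tokens per second of audio.
                max_new_tokens = max(max_new_tokens, int(item["audio_duration_sec"] * 20))
            print(
                "Batch size:", len(batch),
                "Duration:", batch[0]["audio_duration_sec"],
                "Processed:", processed,
                "Max tokens:", max_new_tokens,
            )
            cur_time = time.time()
            results = model.transcribe(
                audio=paths,
                language=languages,  # can also be set to None for automatic language detection
                return_time_stamps=False,
                max_new_tokens=max_new_tokens,
            )
            print("Batch time (s):", round(time.time() - cur_time, 2))
            # Key each prediction by its position in `items` so that later
            # batches do not overwrite earlier ones.
            for i, r in enumerate(results):
                predictions[processed + i] = r.text
            processed += len(batch)
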
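
To see how the duration-based batching in the usage snippet splits a list, the sketch below runs the same generator on synthetic durations without loading the model. The durations are invented for illustration, and sorting the clips longest first is an assumption about how the generator is meant to be fed. Because the batch size comes from the first clip only, an unsorted list could put many long clips into one batch.

```python
def get_dynamic_batches(items):
    # Copied from the usage snippet: batch size is derived from the
    # duration of the first item in each batch.
    total = len(items)
    i = 0
    while i < total:
        if items[i]['audio_duration_sec'] > 200:
            batch_size = 1
        else:
            batch_size = int(5000 / items[i]['audio_duration_sec']) + 1
            batch_size = min(batch_size, 64)
        yield items[i:i + batch_size]
        i += batch_size


# Synthetic clips, longest first: one 250 s clip, sixty 100 s clips,
# and one hundred 10 s clips (made-up values).
durations = [250.0] + [100.0] * 60 + [10.0] * 100
items = [{"audio_duration_sec": d} for d in durations]

batches = list(get_dynamic_batches(items))
sizes = [len(b) for b in batches]
seconds = [sum(x["audio_duration_sec"] for x in b) for b in batches]

for size, secs in zip(sizes, seconds):
    print(f"clips={size:3d}  total_audio={secs:7.1f}s")
```

On this input the long clip is processed alone, the 100 s clips land in batches of 51, and the short clips hit the cap of 64 per batch (the last batch simply takes what remains).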
More usage examples: https://github.com/ZFTurbo/Children-Speech-Recognition-Challenge-Solution