Whisper Large V3 Japanese Phone Accent

This Whisper model transcribes Japanese speech into Katakana with pitch accent annotations. It is built on whisper-large-v3-turbo and was fine-tuned on a 1/20 subset of the Galgame-Speech dataset and on the JSUT-5000 dataset.
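As a rough illustration of how the model might be called, the sketch below assumes the checkpoint loads through the standard Hugging Face `transformers` automatic-speech-recognition pipeline, which this card does not confirm. The helper name `transcribe` is hypothetical, and the exact shape of the returned string (how accent marks are encoded) is not documented here.

```python
MODEL_ID = "AkitoP/whisper-large-v3-japense-phone_accent"


def transcribe(audio_path: str) -> str:
    """Return the model's transcription for one audio file.

    Assumption: the checkpoint is compatible with the generic
    transformers ASR pipeline. Not verified against the author's setup.
    """
    # Imported lazily so this module can be loaded without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    # The ASR pipeline returns a dict whose "text" field holds the transcription.
    return asr(audio_path)["text"]
```

Calling `transcribe("sample.wav")` (a placeholder path) would download the weights on first use and should return Katakana text with whatever accent notation the model was trained to emit.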

Training Data:

Stage 1: Audio from the Galgame-Speech dataset was used. The text was converted into Katakana sequences with pitch accent annotations using pyopenjtalk.
Stage 2: The JSUT-5000 dataset was used, taking its original training set with pitch accent annotations. The data was split into 90% for training and 10% for evaluation.
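The 90/10 split in Stage 2 can be sketched as below. This is only an illustration: the card does not say whether the data was shuffled or which seed was used, so the shuffling, the `seed` parameter, the function name `split_train_eval`, and the placeholder utterance IDs are all assumptions.

```python
import random


def split_train_eval(items, eval_fraction=0.1, seed=0):
    """Shuffle items deterministically and hold out a fraction for evaluation."""
    shuffled = list(items)
    # A seeded Random instance keeps the split reproducible (assumed, not from the card).
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


# Hypothetical utterance IDs standing in for the real dataset entries.
utterance_ids = [f"utt_{i:04d}" for i in range(5000)]
train_ids, eval_ids = split_train_eval(utterance_ids)
```

Fixing the seed keeps the evaluation subset stable across runs, which matters when comparing CER between training stages.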

Evaluation Results:

The model achieved a CER (Character Error Rate) of approximately 4% on the JSUT-5000 test set, which is an improvement over the 7% CER of pyopenjtalk.
Training on Stage 1 alone gave a CER of 13%, with errors including specific misreadings and confusion between on'yomi (音読み) and kun'yomi (訓読み) readings. Stage 2 reduced these errors.
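For reference, CER is commonly computed as the character-level edit distance between the reference and the hypothesis, divided by the reference length. The sketch below follows that common definition. The card does not state how its CER was computed (for example, whether accent marks count as characters), so treat this as an assumption rather than the author's evaluation script.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of character insertions, deletions, and substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# One substituted character out of five gives a CER of 0.2.
example = cer("コンニチワ", "コンニチハ")
```

Under this definition, a 4% CER corresponds to roughly one character error per 25 reference characters.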

We are currently seeking Japanese pitch accent annotated datasets. If you have such data, please reach out!

Model details:

Model ID: AkitoP/whisper-large-v3-japense-phone_accent
Format: Safetensors
Model size: 0.8B params
Tensor type: F32
Base model: openai/whisper-large-v3-turbo (fine-tuned from openai/whisper-large-v3)
