---
license: cc-by-nc-4.0
datasets:
- NandemoGHS/Galgame_Gemini_Captions
language:
- ja
base_model:
- NandemoGHS/Anime-Llasa-3B
pipeline_tag: text-to-speech
---

# Anime-Llasa-3B-Captions

## Overview

This is Anime-Llasa-3B-Captions, a Text-to-Speech (TTS) model fine-tuned for Japanese, based on [NandemoGHS/Anime-Llasa-3B](https://huggingface.co/NandemoGHS/Anime-Llasa-3B).

This version has been further fine-tuned on additional data annotated with detailed audio metadata generated by Gemini 2.5 Pro.

## What's New: Fine-Tuning with Audio Metadata

The key improvement in this model is its training methodology. I used Gemini 2.5 Pro to generate detailed metadata (captions, speaker profiles, emotions, etc.) for the audio data.

The model was then fine-tuned on this dataset, learning to associate text with these descriptive tags. This lets you steer speech synthesis by specifying the desired audio characteristics in the prompt.

## How to Use: Controlling Speech Generation

You can control the generated speech in two main ways:

### 1. Using System Prompt Metadata

You can guide the speech synthesis by providing specific tags in the system prompt. The model expects the following tags (note: `emotion` tags are in English, while the others should be in Japanese):

* **`caption`**: (Required) A general description of the audio content.
* **`emotion`**: Emotion tag (e.g., `angry`, `sad`, `happy`, `serious`).
* **`profile`**: Speaker profile (e.g., `若い女性声`, `大人の男性声`).
* **`mood`**: Mood (e.g., `恥ずかしさ`, `悲しみ`).
* **`speed`**: Speaking speed (e.g., `ゆっくり`, `速い`).
* **`prosody`**: Prosody/Rhythm (e.g., `震え声`, `平坦`).
* **`pitch_timbre`**: Pitch/Timbre (e.g., `高め`, `低め`, `息多め`).
* **`style`**: Style (e.g., `ナレーション風`, `会話調`).
* **`notes`**: Special notes, such as sense of distance or breaths (距離感、ブレスなど).

### 2. Using In-Text Tags (Full-Width Parentheses)

Additionally, you can control the speech style directly within the transcription text by using full-width Japanese parentheses `()`.
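
Because these tags rely on full-width parentheses (U+FF08 and U+FF09) rather than the ASCII ones most keyboards produce, it can help to normalize the text before sending it to the model. The following is a minimal Python sketch, not part of the model's tooling: the helper names `normalize_style_tags` and `style_tag` are hypothetical, and the sketch assumes every parenthesis in the transcription is meant as a style tag.

```python
# Map ASCII (half-width) parentheses to their full-width counterparts.
# U+FF08 is the full-width "(" and U+FF09 is the full-width ")".
_TO_FULL_WIDTH = str.maketrans({"(": "\uff08", ")": "\uff09"})


def normalize_style_tags(text: str) -> str:
    """Return text with every half-width parenthesis made full-width.

    Assumption: all parentheses in the transcription are style tags.
    """
    return text.translate(_TO_FULL_WIDTH)


def style_tag(label: str) -> str:
    """Wrap a style label such as 囁き in full-width parentheses."""
    return f"\uff08{label}\uff09"


if __name__ == "__main__":
    # A tag typed with ASCII parentheses is rewritten to the full-width form.
    print(normalize_style_tags("これはテストです。(囁き)聞こえますか?"))
    # Build a tag directly from a label.
    print(style_tag("囁き") + "聞こえますか?")
```

Writing the parentheses as escape sequences keeps the helper correct even if an editor or pipeline silently converts full-width characters in the source file.
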
For example, adding `(囁き)` (whisper) to the text prompts the model to generate that part of the speech in a whispering voice.

**Example Input Text:**
`「これはテストです。(囁き)聞こえますか?」`

## Demo

For detailed usage instructions and to try the model, please see the Hugging Face Space:

[**Anime-Llasa-3B-Captions-Demo**](https://huggingface.co/spaces/OmniAICreator/Anime-Llasa-3B-Captions-Demo)

### Limitations

Please note that due to limitations in the amount and quality of the training data, **the model cannot be controlled perfectly**. The generated speech may not always reflect the specified tags precisely.

## Training Data

The dataset used for this fine-tuning, which includes the Gemini 2.5 Pro generated captions, is available here:

[NandemoGHS/Galgame_Gemini_Captions](https://huggingface.co/datasets/NandemoGHS/Galgame_Gemini_Captions)

## Old Versions

* [Anime-Llasa-3B](https://huggingface.co/NandemoGHS/Anime-Llasa-3B)

## License

This model is licensed under **CC-BY-NC-4.0**.

Additionally, as this model includes outputs from Gemini 2.5 Pro in its training data, **any use that competes with Gemini is prohibited.**
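
## Example: Assembling Metadata Tags

As a rough illustration of the tag list under "Using System Prompt Metadata", the Python sketch below checks a set of tags and joins them into text. The function name `build_metadata`, the `key: value` one-per-line layout, and the sample caption are all assumptions made for illustration; this card does not spell out the exact system prompt layout, so compare against the demo Space before relying on it.

```python
# Tag names in the order they are listed in this model card.
TAG_ORDER = (
    "caption", "emotion", "profile", "mood", "speed",
    "prosody", "pitch_timbre", "style", "notes",
)


def build_metadata(caption: str, **tags: str) -> str:
    """Validate tags and join them as `key: value` lines.

    The line layout is an assumption, not the confirmed prompt format.
    """
    if not caption.strip():
        raise ValueError("caption is required")
    unknown = set(tags) - set(TAG_ORDER)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    fields = {"caption": caption, **tags}
    # Emit only the tags that were provided, in the documented order.
    return "\n".join(
        f"{key}: {fields[key]}" for key in TAG_ORDER if key in fields
    )


if __name__ == "__main__":
    # Hypothetical caption; emotion is in English, other tags in Japanese.
    print(build_metadata(
        "若い女性が小声で話している。",
        emotion="sad",
        profile="若い女性声",
        speed="ゆっくり",
    ))
```

Rejecting unknown keys early catches typos such as `pitch` instead of `pitch_timbre`, which would otherwise pass silently into the prompt.
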