---
license: cc-by-nc-4.0
datasets:
- NandemoGHS/Galgame_Gemini_Captions
language:
- ja
base_model:
- NandemoGHS/Anime-Llasa-3B
pipeline_tag: text-to-speech
---
# Anime-Llasa-3B-Captions
## Overview
This is Anime-Llasa-3B-Captions, a Text-to-Speech (TTS) model fine-tuned for Japanese, based on [NandemoGHS/Anime-Llasa-3B](https://huggingface.co/NandemoGHS/Anime-Llasa-3B).
This version has been further fine-tuned with additional data, incorporating detailed audio metadata generated by Gemini 2.5 Pro.
## What's New: Fine-Tuning with Audio Metadata
The key improvement in this model is its training methodology. I used Gemini 2.5 Pro to generate detailed metadata (captions, speaker profiles, emotions, etc.) for the audio data. The model was then fine-tuned on this dataset, learning to associate text with these rich descriptive tags.
This lets you steer speech synthesis by specifying the desired audio characteristics in the prompt.
## How to Use: Controlling Speech Generation
You can control the generated speech in two main ways:
### 1. Using System Prompt Metadata
You can guide the speech synthesis by providing specific tags in the system prompt. The model expects the following format (note: `emotion` tags are in English, while others should be in Japanese):
* **`caption`**: (Required) A general description of the audio content.
* **`emotion`**: Emotion tag (e.g., `angry`, `sad`, `happy`, `serious`).
* **`profile`**: Speaker profile (e.g., `若い女性声`, `大人の男性声`).
* **`mood`**: Mood (e.g., `恥ずかしさ`, `悲しみ`).
* **`speed`**: Speaking speed (e.g., `ゆっくり`, `速い`).
* **`prosody`**: Prosody/Rhythm (e.g., `震え声`, `平坦`).
* **`pitch_timbre`**: Pitch/Timbre (e.g., `高め`, `低め`, `息多め`).
* **`style`**: Style (e.g., `ナレーション風`, `会話調`).
* **`notes`**: Special notes, such as perceived distance or breath sounds (距離感、ブレスなど).
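
As a rough illustration of the tag list above, the sketch below assembles a metadata string in Python. The tag names come from the list. The `key: value` one-line-per-tag layout, the `build_metadata` helper, and the sample caption are assumptions made for illustration only; check the demo Space for the exact format the model was trained on.

```python
# Tag names are taken from the list above. The "key: value" line layout is an
# ASSUMPTION for illustration, not the model's confirmed prompt format.
TAG_ORDER = ["caption", "emotion", "profile", "mood", "speed",
             "prosody", "pitch_timbre", "style", "notes"]


def build_metadata(tags: dict[str, str]) -> str:
    """Serialize tags into one 'key: value' line per tag (hypothetical layout)."""
    if not tags.get("caption"):
        raise ValueError("caption is required")
    unknown = set(tags) - set(TAG_ORDER)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    emotion = tags.get("emotion")
    if emotion is not None and not emotion.isascii():
        # The card says emotion tags are English; the other tags are Japanese.
        raise ValueError("emotion should be an English tag, e.g. 'sad'")
    return "\n".join(f"{key}: {tags[key]}" for key in TAG_ORDER if key in tags)


# Illustrative values only; the caption text is made up for this example.
prompt = build_metadata({
    "caption": "若い女性が小声で謝っている。",
    "emotion": "sad",
    "profile": "若い女性声",
    "speed": "ゆっくり",
})
print(prompt)
```

Validating the tags before building the prompt catches two easy mistakes early: a missing `caption` and a non-English `emotion` value.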
### 2. Using In-Text Tags (Full-Width Parentheses)
Additionally, you can control the speech style directly within the transcription text by using full-width Japanese parentheses `（ ）`.
For example, adding `（囁き）` (whisper) to the text will prompt the model to generate that part of the speech in a whispering voice.
**Example Input Text:**
`「これはテストです。（囁き）聞こえますか?」`
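
Because this section calls for full-width parentheses, a small helper can prevent ASCII `(` `)` from slipping into the input. The following is a minimal sketch; `inline_tag` and `to_fullwidth_parens` are hypothetical helper names, not part of the model's tooling.

```python
# Full-width parentheses are U+FF08 and U+FF09.
_PAREN_TABLE = str.maketrans("()", "\uff08\uff09")


def inline_tag(label: str) -> str:
    """Wrap a style label, e.g. 囁き (whisper), in full-width parentheses."""
    return f"\uff08{label}\uff09"


def to_fullwidth_parens(text: str) -> str:
    """Replace ASCII parentheses with their full-width equivalents."""
    return text.translate(_PAREN_TABLE)


text = "「これはテストです。" + inline_tag("囁き") + "聞こえますか?」"
print(text)
print(to_fullwidth_parens("(囁き)聞こえますか?"))
```

Whether the model also reacts to half-width parentheses is not stated here, so normalizing to the documented form is the safer choice.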
## Demo
For detailed usage instructions and to try the model, please see the Hugging Face Space:
[**Anime-Llasa-3B-Captions-Demo**](https://huggingface.co/spaces/OmniAICreator/Anime-Llasa-3B-Captions-Demo)
## Limitations
Please note that due to limitations in the amount and quality of the training data, **the model cannot be controlled perfectly**. The generated speech may not always reflect the specified tags precisely.
## Training Data
The dataset used for this fine-tuning, which includes the Gemini 2.5 Pro generated captions, is available here:
[NandemoGHS/Galgame_Gemini_Captions](https://huggingface.co/datasets/NandemoGHS/Galgame_Gemini_Captions)
## Old Versions
* [Anime-Llasa-3B](https://huggingface.co/NandemoGHS/Anime-Llasa-3B)
## License
This model is licensed under **CC-BY-NC-4.0**.
Additionally, as this model includes outputs from Gemini 2.5 Pro in its training data, **any use that competes with Gemini is prohibited.**