---
license: cc-by-nc-4.0
datasets:
- NandemoGHS/Galgame_Gemini_Captions
language:
- ja
base_model:
- NandemoGHS/Anime-Llasa-3B
pipeline_tag: text-to-speech
---

# Anime-Llasa-3B-Captions

## Overview

This is Anime-Llasa-3B-Captions, a Text-to-Speech (TTS) model fine-tuned for Japanese, based on [NandemoGHS/Anime-Llasa-3B](https://huggingface.co/NandemoGHS/Anime-Llasa-3B).

This version has been further fine-tuned with additional data, incorporating detailed audio metadata generated by Gemini 2.5 Pro.

## What's New: Fine-Tuning with Audio Metadata

The key improvement in this model is its training methodology. I used Gemini 2.5 Pro to generate detailed metadata (captions, speaker profiles, emotions, etc.) for the audio data. The model was then fine-tuned on this dataset, learning to associate text with these rich descriptive tags.

As a result, you can steer speech synthesis by specifying the desired audio characteristics in the prompt.

## How to Use: Controlling Speech Generation

You can control the generated speech in two main ways:

### 1. Using System Prompt Metadata

You can guide the speech synthesis by providing specific tags in the system prompt. The model expects the following format (note: `emotion` tags are in English, while others should be in Japanese):

* **`caption`**: (Required) A general description of the audio content.
* **`emotion`**: Emotion tag (e.g., `angry`, `sad`, `happy`, `serious`).
* **`profile`**: Speaker profile (e.g., `若い女性声`, `大人の男性声`).
* **`mood`**: Mood (e.g., `恥ずかしさ`, `悲しみ`).
* **`speed`**: Speaking speed (e.g., `ゆっくり`, `速い`).
* **`prosody`**: Prosody/Rhythm (e.g., `震え声`, `平坦`).
* **`pitch_timbre`**: Pitch/Timbre (e.g., `高め`, `低め`, `息多め`).
* **`style`**: Style (e.g., `ナレーション風`, `会話調`).
* **`notes`**: Special notes (e.g., `距離感` for perceived distance, `ブレス` for breaths).
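
The tag list above can be assembled programmatically. The following is a minimal sketch, not the model's official API: `build_metadata` and `render_system_prompt` are hypothetical helper names, and the `key: value` one-per-line layout is an assumption, since this card lists the tags but not their exact serialization. Check the demo Space for the format the model was actually trained on.

```python
from typing import Optional

# Tag order as listed in this model card. Only `caption` is marked required.
TAG_ORDER = ("caption", "emotion", "profile", "mood", "speed",
             "prosody", "pitch_timbre", "style", "notes")


def build_metadata(caption: str, **tags: Optional[str]) -> dict:
    """Collect tags into an ordered dict, checking the rules stated in the card."""
    if not caption or not caption.strip():
        raise ValueError("`caption` is required")
    unknown = set(tags) - set(TAG_ORDER)
    if unknown:
        raise ValueError(f"unknown tags: {sorted(unknown)}")
    emotion = tags.get("emotion")
    # The card says emotion tags are English; an ASCII check is a rough proxy.
    if emotion is not None and not emotion.isascii():
        raise ValueError("`emotion` should be an English tag such as 'happy'")
    merged = {"caption": caption.strip(), **tags}
    # Keep only tags that were actually provided, in the card's order.
    return {key: merged[key] for key in TAG_ORDER if merged.get(key)}


def render_system_prompt(metadata: dict) -> str:
    # ASSUMPTION: one "key: value" pair per line. The real layout expected by
    # the model is not specified in this card.
    return "\n".join(f"{key}: {value}" for key, value in metadata.items())


metadata = build_metadata(
    "若い女性が小声で話している",
    emotion="happy",
    profile="若い女性声",
    speed="ゆっくり",
)
system_prompt = render_system_prompt(metadata)
```

Keeping validation separate from rendering means only `render_system_prompt` needs to change once the exact prompt layout is confirmed.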

### 2. Using In-Text Tags (Full-Width Parentheses)

Additionally, you can control the speech style directly within the transcription text by using full-width Japanese parentheses `（ ）`.

For example, adding `（囁き）` (whisper) to the text prompts the model to generate that part of the speech in a whispering voice.

**Example Input Text:**
`「これはテストです。（囁き）聞こえますか?」`
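
Because full-width and ASCII parentheses are easy to confuse, a small helper can make sure in-text tags use the full-width code points (U+FF08 and U+FF09). This is a sketch with hypothetical helper names (`style_tag`, `normalize_parens`). Whether the model also reacts to ASCII parentheses is not stated in this card, so the normalization step is only a precaution.

```python
FULLWIDTH_OPEN = "\uff08"   # full-width left parenthesis
FULLWIDTH_CLOSE = "\uff09"  # full-width right parenthesis


def style_tag(label: str) -> str:
    """Wrap a style label such as 囁き in full-width parentheses."""
    return f"{FULLWIDTH_OPEN}{label}{FULLWIDTH_CLOSE}"


def normalize_parens(text: str) -> str:
    """Replace ASCII parentheses with their full-width counterparts."""
    table = str.maketrans({"(": FULLWIDTH_OPEN, ")": FULLWIDTH_CLOSE})
    return text.translate(table)


# Build the example sentence with a whisper tag before the second clause.
text = "「これはテストです。" + style_tag("囁き") + "聞こえますか?」"
```

Note that `normalize_parens` converts every ASCII parenthesis in the string, so apply it only to text where parentheses are meant as style tags.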

## Demo

For detailed usage instructions and to try the model, please see the Hugging Face Space:

[**Anime-Llasa-3B-Captions-Demo**](https://huggingface.co/spaces/OmniAICreator/Anime-Llasa-3B-Captions-Demo)

## Limitations

Please note that due to limitations in the amount and quality of the training data, **the model cannot be controlled perfectly**. The generated speech may not always reflect the specified tags precisely.

## Training Data

The dataset used for this fine-tuning, which includes the Gemini 2.5 Pro generated captions, is available here:

[NandemoGHS/Galgame_Gemini_Captions](https://huggingface.co/datasets/NandemoGHS/Galgame_Gemini_Captions)

## Old Versions

* [Anime-Llasa-3B](https://huggingface.co/NandemoGHS/Anime-Llasa-3B)

## License

This model is licensed under **CC-BY-NC-4.0**.

Additionally, because the training data includes outputs from Gemini 2.5 Pro, **any use that competes with Gemini is prohibited.**