Commit 6b26697
Parent(s): 82248a5
Add Darwin TTS compatibility repack without sample WAVs
- README.md +212 -0
- config.json +167 -0
- darwin_tts_blend.py +101 -0
- generation_config.json +12 -0
- merges.txt +0 -0
- model.safetensors +3 -0
- preprocessor_config.json +6 -0
- speech_tokenizer/config.json +94 -0
- speech_tokenizer/configuration.json +1 -0
- speech_tokenizer/model.safetensors +3 -0
- speech_tokenizer/preprocessor_config.json +10 -0
- tokenizer_config.json +316 -0
- vocab.json +0 -0
README.md
CHANGED
|
@@ -1,3 +1,215 @@
|
|
| 1 |
---
|
| 2 |
license: apache-2.0
|
| 3 |
---
|
| 1 |
---
|
| 2 |
+
base_model:
|
| 3 |
+
- FINAL-Bench/Darwin-TTS-1.7B-Cross
|
| 4 |
+
- Qwen/Qwen3-TTS-12Hz-1.7B-Base
|
| 5 |
+
- Qwen/Qwen3-1.7B
|
| 6 |
+
language:
|
| 7 |
+
- ko
|
| 8 |
+
- en
|
| 9 |
+
- ja
|
| 10 |
+
- zh
|
| 11 |
+
- de
|
| 12 |
+
- fr
|
| 13 |
+
- ru
|
| 14 |
+
- pt
|
| 15 |
+
- es
|
| 16 |
+
- it
|
| 17 |
license: apache-2.0
|
| 18 |
+
pipeline_tag: text-to-speech
|
| 19 |
+
tags:
|
| 20 |
+
- tts
|
| 21 |
+
- text-to-speech
|
| 22 |
+
- darwin
|
| 23 |
+
- qwen3
|
| 24 |
+
- qwen3-tts
|
| 25 |
+
- voice-cloning
|
| 26 |
+
- compatibility-fix
|
| 27 |
---
|
| 28 |
+
|
| 29 |
+
# Darwin-TTS-1.7B-Cross — Qwen3-TTS compatibility repack
|
| 30 |
+
|
| 31 |
+
This repository is a compatibility repack of [`FINAL-Bench/Darwin-TTS-1.7B-Cross`](https://huggingface.co/FINAL-Bench/Darwin-TTS-1.7B-Cross).
|
| 32 |
+
|
| 33 |
+
The original Darwin checkpoint appears to omit the `speech_tokenizer/` directory required by the standard `qwen-tts` loader. This repack adds the missing `speech_tokenizer/` files from [`Qwen/Qwen3-TTS-12Hz-1.7B-Base`](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base).
|
| 34 |
+
|
| 35 |
+
No model blending, training, fine-tuning, or behavioral changes were performed in this repack. The purpose is only to make the model load with:
|
| 36 |
+
|
| 37 |
+
```python
|
| 38 |
+
from qwen_tts import Qwen3TTSModel
|
| 39 |
+
|
| 40 |
+
model = Qwen3TTSModel.from_pretrained("zeropointnine/Darwin-TTS-1.7B-Cross-Qwen3Tokenizer")
|
| 41 |
+
```
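
To confirm that a local copy of this repack actually contains the added tokenizer assets, a check along the following lines may help. This is a minimal sketch: the helper name `missing_speech_tokenizer_files` is hypothetical, and the file names are taken from this commit's file list, not from the `qwen-tts` loader's source.

```python
from pathlib import Path

# File names as listed under speech_tokenizer/ in this commit.
REQUIRED_SPEECH_TOKENIZER_FILES = (
    "config.json",
    "configuration.json",
    "model.safetensors",
    "preprocessor_config.json",
)


def missing_speech_tokenizer_files(checkpoint_dir):
    """Return the speech_tokenizer/ files absent from a local checkpoint directory.

    Hypothetical helper for illustration; it only checks that the files exist,
    not that their contents are valid.
    """
    tokenizer_dir = Path(checkpoint_dir) / "speech_tokenizer"
    return [
        name
        for name in REQUIRED_SPEECH_TOKENIZER_FILES
        if not (tokenizer_dir / name).is_file()
    ]
```

An empty list means all four files are present; any returned names point at what a checkpoint is missing.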
|
| 42 |
+
|
| 43 |
+
# Provenance
|
| 44 |
+
|
| 45 |
+
- Main model weights and model card: FINAL-Bench/Darwin-TTS-1.7B-Cross
|
| 46 |
+
- Added tokenizer assets: Qwen/Qwen3-TTS-12Hz-1.7B-Base
|
| 47 |
+
- License: Apache 2.0, matching the upstream model cards.
|
| 48 |
+
|
| 49 |
+
### Original Darwin-TTS-1.7B-Cross model card follows below:
|
| 50 |
+
|
| 51 |
+
# 🧬 Darwin-TTS-1.7B-Cross
|
| 52 |
+
|
| 53 |
+
**World's first cross-modal FFN transfer from LLM to TTS — emotion-enhanced speech synthesis without any training.**
|
| 54 |
+
|
| 55 |
+
This model is a cross-modal application of the Darwin Family framework, introduced in the paper: [Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning](https://huggingface.co/papers/2605.14386).
|
| 56 |
+
|
| 57 |
+
**Authors:** Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim.
|
| 58 |
+
|
| 59 |
+
> Darwin-TTS blends 3% of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) FFN weights into [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS) talker module. No training, no data, no GPU hours — just weight-space arithmetic.
|
| 60 |
+
|
| 61 |
+
## Key Discovery
|
| 62 |
+
|
| 63 |
+
| Blend (α) | Emotion | Quality | Status |
|
| 64 |
+
|-----------|---------|---------|--------|
|
| 65 |
+
| 0% | Baseline | Normal | Original Qwen3-TTS |
|
| 66 |
+
| 1% | No change | Normal | Too subtle |
|
| 67 |
+
| **3%** | **Emotion appears** | **Normal** | **★ This model (default)** |
|
| 68 |
+
| 5% | Emotion intensified | Normal | ★★ Max stable |
|
| 69 |
+
| 10% | Broken | Failed | Infinite generation |
|
| 70 |
+
|
| 71 |
+
## Why It Works
|
| 72 |
+
|
| 73 |
+
Qwen3-1.7B (LLM) and the Qwen3-TTS-1.7B talker share an **identical backbone configuration**:
|
| 74 |
+
|
| 75 |
+
```
|
| 76 |
+
                     Qwen3-1.7B (LLM)   Qwen3-TTS talker   Match
|
| 77 |
+
hidden_size          2048               2048               ✅
|
| 78 |
+
intermediate_size    6144               6144               ✅
|
| 79 |
+
num_hidden_layers    28                 28                 ✅
|
| 80 |
+
num_attention_heads  16                 16                 ✅
|
| 81 |
+
num_key_value_heads  8                  8                  ✅
|
| 82 |
+
```
|
| 83 |
+
|
| 84 |
+
This means **zero SVD, zero truncation, zero layer mapping** — pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
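
The blend itself is ordinary linear interpolation. The sketch below illustrates it on plain Python lists so that it runs without PyTorch; the function names `lerp` and `blend_tensor` are illustrative assumptions, while the shipped script applies the same formula with `Tensor.lerp_`.

```python
FFN_PROJECTIONS = ("gate_proj", "up_proj", "down_proj")
NUM_LAYERS = 28

# Three projections per layer across 28 layers gives the 84 tensors quoted above.
NUM_FFN_TENSORS = len(FFN_PROJECTIONS) * NUM_LAYERS


def lerp(tts_value, llm_value, alpha):
    """Linear interpolation: alpha=0 keeps the TTS weight, alpha=1 takes the LLM weight."""
    return tts_value + alpha * (llm_value - tts_value)


def blend_tensor(tts_weights, llm_weights, alpha=0.03):
    """Blend two equally sized flat weight lists element by element."""
    if len(tts_weights) != len(llm_weights):
        raise ValueError("1:1 blending requires matching shapes")
    return [lerp(t, l, alpha) for t, l in zip(tts_weights, llm_weights)]
```

With `alpha=0.03`, each blended weight is 97% of the TTS value plus 3% of the LLM value.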
|
| 85 |
+
|
| 86 |
+
## Architecture
|
| 87 |
+
|
| 88 |
+
```
|
| 89 |
+
Qwen3-TTS-1.7B (4-module structure):
|
| 90 |
+
┌─────────────────────────────────────────────────────┐
|
| 91 |
+
│ talker (28L Qwen3 LM backbone) │
|
| 92 |
+
│ └── 84 FFN tensors blended with LLM (α=3%) │ ← MODIFIED
|
| 93 |
+
│ └── talker.model.layers.N.mlp.{gate,up,down} │
|
| 94 |
+
├─────────────────────────────────────────────────────┤
|
| 95 |
+
│ code_predictor (5L, h=1024) │ ← UNTOUCHED
|
| 96 |
+
├─────────────────────────────────────────────────────┤
|
| 97 |
+
│ speech_tokenizer (12Hz RVQ codec) │ ← UNTOUCHED
|
| 98 |
+
├─────────────────────────────────────────────────────┤
|
| 99 |
+
│ encoder/decoder (audio waveform) │ ← UNTOUCHED
|
| 100 |
+
└─────────────────────────────────────────────────────┘
|
| 101 |
+
|
| 102 |
+
FFN Source: Qwen3-1.7B (LLM)
|
| 103 |
+
└── model.layers.N.mlp.{gate,up,down}_proj.weight
|
| 104 |
+
└── Key mapping: model.layers.N → talker.model.layers.N (1:1)
|
| 105 |
+
```
|
| 106 |
+
|
| 107 |
+
Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original — preserving the audio codec pipeline entirely.
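
The selection and key-mapping rule described above can be sketched as a small pure function. This is an illustration under assumptions: `llm_key_for` is a hypothetical name, it uses a prefix check where `darwin_tts_blend.py` uses substring tests, and the example parameter names follow the `talker.model.layers.N.mlp.*` pattern from the diagram.

```python
FFN_PROJECTIONS = ("gate_proj", "up_proj", "down_proj")


def llm_key_for(tts_param_name):
    """Map a TTS parameter name to its Qwen3-1.7B key, or None if it stays untouched."""
    # Only the talker is blended; its nested code_predictor is skipped.
    if not tts_param_name.startswith("talker.") or "code_predictor" in tts_param_name:
        return None
    # Only FFN projections are blended; attention, norms and embeddings are skipped.
    if not any(proj in tts_param_name for proj in FFN_PROJECTIONS):
        return None
    # talker.model.layers.N... -> model.layers.N... (1:1 layer mapping)
    return tts_param_name[len("talker."):]
```

For example, `llm_key_for("talker.model.layers.0.mlp.gate_proj.weight")` returns `"model.layers.0.mlp.gate_proj.weight"`, while attention weights map to `None`.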
|
| 108 |
+
|
| 109 |
+
## Quick Start
|
| 110 |
+
|
| 111 |
+
### Option 1: Load pre-blended weights (this model)
|
| 112 |
+
|
| 113 |
+
```python
|
| 114 |
+
from qwen_tts import Qwen3TTSModel
|
| 115 |
+
import torch
|
| 116 |
+
|
| 117 |
+
# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
|
| 118 |
+
model = Qwen3TTSModel.from_pretrained(
|
| 119 |
+
"FINAL-Bench/Darwin-TTS-1.7B-Cross",
|
| 120 |
+
device_map="cuda:0",
|
| 121 |
+
dtype=torch.bfloat16
|
| 122 |
+
)
|
| 123 |
+
|
| 124 |
+
# Synthesize
|
| 125 |
+
wavs, sr = model.generate_voice_clone(
|
| 126 |
+
text="안녕하세요, 저는 다윈 인공지능입니다!",
|
| 127 |
+
ref_audio="your_voice.wav",
|
| 128 |
+
ref_text="ref",
|
| 129 |
+
x_vector_only_mode=True
|
| 130 |
+
)
|
| 131 |
+
```
|
| 132 |
+
|
| 133 |
+
### Option 2: Custom blend ratio (runtime blending)
|
| 134 |
+
|
| 135 |
+
```python
|
| 136 |
+
from darwin_tts_blend import blend_tts  # script shipped in this repository
|
| 137 |
+
model = blend_tts(alpha=0.05)  # blends Qwen3-1.7B FFN weights into the base TTS at load time (5%)
|
| 138 |
+
wavs, sr = model.generate_voice_clone(
|
| 139 |
+
text="정말 기쁜 소식이에요!",
|
| 140 |
+
ref_audio="voice.wav",
|
| 141 |
+
ref_text="ref",
|
| 142 |
+
x_vector_only_mode=True
|
| 143 |
+
)
|
| 144 |
+
```
|
| 145 |
+
|
| 146 |
+
### CLI
|
| 147 |
+
|
| 148 |
+
```bash
|
| 149 |
+
python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
|
| 150 |
+
```
|
| 151 |
+
|
| 152 |
+
## Installation
|
| 153 |
+
|
| 154 |
+
```bash
|
| 155 |
+
pip install torch qwen-tts safetensors soundfile huggingface_hub
|
| 156 |
+
```
|
| 157 |
+
|
| 158 |
+
## Research Background
|
| 159 |
+
|
| 160 |
+
### The Problem
|
| 161 |
+
Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:
|
| 162 |
+
- Thousands of hours of emotional speech data
|
| 163 |
+
- Hundreds of GPU hours for training
|
| 164 |
+
- Careful data curation and annotation
|
| 165 |
+
|
| 166 |
+
### The Darwin Approach
|
| 167 |
+
Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:
|
| 168 |
+
|
| 169 |
+
1. **Find architecture-compatible models** across modalities (LLM ↔ TTS)
|
| 170 |
+
2. **Blend FFN weights** at low ratios (3~5%) using simple lerp
|
| 171 |
+
3. **Preserve modality-specific components** (audio codec, tokenizer)
|
| 172 |
+
|
| 173 |
+
### Key Findings
|
| 174 |
+
|
| 175 |
+
1. **Cross-modal FFN transfer works** — LLM's language understanding patterns enhance TTS emotional expressiveness
|
| 176 |
+
2. **Sweet spot is 3~5%** — TTS is far more sensitive than LLM merging (which tolerates 7~93%)
|
| 177 |
+
3. **Same backbone is required** — Qwen3 × Qwen3 succeeded; cross-backbone merges (e.g., Llama) failed.
|
| 178 |
+
4. **10%+ destroys TTS** — LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
|
| 179 |
+
5. **Bidirectional potential** — LLM + TTS FFN may enable "Speaking LLM" (the GPT-4o direction)
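
The 655-second figure in finding 4 is consistent with generation running until the token cap. The sketch below does the arithmetic, assuming one generated talker token corresponds to one codec frame; `max_new_tokens` comes from `generation_config.json` and the 12.5 Hz frame rate from `speech_tokenizer/config.json` in this repository.

```python
MAX_NEW_TOKENS = 8192  # generation_config.json: "max_new_tokens"
FRAME_RATE_HZ = 12.5   # speech_tokenizer/config.json: encoder_config "_frame_rate"


def max_audio_seconds(max_new_tokens=MAX_NEW_TOKENS, frame_rate_hz=FRAME_RATE_HZ):
    """Longest possible output if the model never emits a stop token.

    Assumes one generated token per codec frame, which is an assumption of
    this sketch, not something stated in the model card.
    """
    return max_new_tokens / frame_rate_hz
```

Under that assumption the cap works out to about 655.4 seconds, which matches the runaway outputs reported for blends of 10% and above.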
|
| 180 |
+
|
| 181 |
+
## Model Details
|
| 182 |
+
|
| 183 |
+
- **Model type**: Text-to-Speech (cross-modal FFN blended)
|
| 184 |
+
- **Base models**: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
|
| 185 |
+
- **Parameters**: ~2.1B
|
| 186 |
+
- **Languages**: Korean, English, Japanese, Chinese + 6 more
|
| 187 |
+
- **License**: Apache 2.0
|
| 188 |
+
- **Blend ratio**: α=0.03 (3%)
|
| 189 |
+
- **FFN tensors modified**: 84 / 976 total (8.6%)
|
| 190 |
+
- **Build time**: ~2 minutes (no training)
|
| 191 |
+
|
| 192 |
+
## Citation
|
| 193 |
+
|
| 194 |
+
If you find this work useful in your research, please cite:
|
| 195 |
+
|
| 196 |
+
```bibtex
|
| 197 |
+
@article{kim2026darwin,
|
| 198 |
+
title={Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
|
| 199 |
+
author={Kim, Taebong and Hong, Youngsik and Kim, Minsik and Choi, Sunyoung and Jang, Jaewon and Shin, Junghoon and Kim, Minseo},
|
| 200 |
+
journal={arXiv preprint arXiv:2605.14386},
|
| 201 |
+
year={2026}
|
| 202 |
+
}
|
| 203 |
+
```
|
| 204 |
+
|
| 205 |
+
## Credits
|
| 206 |
+
|
| 207 |
+
**[VIDRAFT](https://vidraft.net)** (비드래프트): Darwin Evolutionary Merge Framework
|
| 208 |
+
|
| 209 |
+
Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).
|
| 210 |
+
|
| 211 |
+
## Related
|
| 212 |
+
|
| 213 |
+
- [Darwin-27B-Opus](https://huggingface.co/FINAL-Bench/Darwin-27B-Opus) — Darwin LLM Flagship
|
| 214 |
+
- [FINAL Bench](https://huggingface.co/FINAL-Bench) — Text AGI Benchmark
|
| 215 |
+
- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench) — CMA-ES + FFN crossbreeding
|
config.json
ADDED
|
@@ -0,0 +1,167 @@
|
| 1 |
+
{
|
| 2 |
+
"architectures": [
|
| 3 |
+
"Qwen3TTSForConditionalGeneration"
|
| 4 |
+
],
|
| 5 |
+
"assistant_token_id": 77091,
|
| 6 |
+
"im_end_token_id": 151645,
|
| 7 |
+
"im_start_token_id": 151644,
|
| 8 |
+
"tts_bos_token_id": 151672,
|
| 9 |
+
"tts_eos_token_id": 151673,
|
| 10 |
+
"tts_pad_token_id": 151671,
|
| 11 |
+
"model_type": "qwen3_tts",
|
| 12 |
+
"tokenizer_type": "qwen3_tts_tokenizer_12hz",
|
| 13 |
+
"tts_model_size": "1b7",
|
| 14 |
+
"tts_model_type": "base",
|
| 15 |
+
"speaker_encoder_config": {
|
| 16 |
+
"enc_dim": 2048,
|
| 17 |
+
"sample_rate": 24000
|
| 18 |
+
},
|
| 19 |
+
"talker_config": {
|
| 20 |
+
"attention_bias": false,
|
| 21 |
+
"attention_dropout": 0,
|
| 22 |
+
"code_predictor_config": {
|
| 23 |
+
"_name_or_path": "",
|
| 24 |
+
"add_cross_attention": false,
|
| 25 |
+
"architectures": null,
|
| 26 |
+
"attention_bias": false,
|
| 27 |
+
"attention_dropout": 0,
|
| 28 |
+
"bad_words_ids": null,
|
| 29 |
+
"begin_suppress_tokens": null,
|
| 30 |
+
"bos_token_id": null,
|
| 31 |
+
"chunk_size_feed_forward": 0,
|
| 32 |
+
"cross_attention_hidden_size": null,
|
| 33 |
+
"decoder_start_token_id": null,
|
| 34 |
+
"diversity_penalty": 0.0,
|
| 35 |
+
"do_sample": false,
|
| 36 |
+
"early_stopping": false,
|
| 37 |
+
"encoder_no_repeat_ngram_size": 0,
|
| 38 |
+
"eos_token_id": null,
|
| 39 |
+
"exponential_decay_length_penalty": null,
|
| 40 |
+
"finetuning_task": null,
|
| 41 |
+
"forced_bos_token_id": null,
|
| 42 |
+
"forced_eos_token_id": null,
|
| 43 |
+
"head_dim": 128,
|
| 44 |
+
"hidden_act": "silu",
|
| 45 |
+
"hidden_size": 1024,
|
| 46 |
+
"id2label": {
|
| 47 |
+
"0": "LABEL_0",
|
| 48 |
+
"1": "LABEL_1"
|
| 49 |
+
},
|
| 50 |
+
"initializer_range": 0.02,
|
| 51 |
+
"intermediate_size": 3072,
|
| 52 |
+
"is_decoder": false,
|
| 53 |
+
"is_encoder_decoder": false,
|
| 54 |
+
"label2id": {
|
| 55 |
+
"LABEL_0": 0,
|
| 56 |
+
"LABEL_1": 1
|
| 57 |
+
},
|
| 58 |
+
"layer_types": [
|
| 59 |
+
"full_attention",
|
| 60 |
+
"full_attention",
|
| 61 |
+
"full_attention",
|
| 62 |
+
"full_attention",
|
| 63 |
+
"full_attention"
|
| 64 |
+
],
|
| 65 |
+
"length_penalty": 1.0,
|
| 66 |
+
"max_length": 20,
|
| 67 |
+
"max_position_embeddings": 65536,
|
| 68 |
+
"max_window_layers": 28,
|
| 69 |
+
"min_length": 0,
|
| 70 |
+
"model_type": "qwen3_tts_talker_code_predictor",
|
| 71 |
+
"no_repeat_ngram_size": 0,
|
| 72 |
+
"num_attention_heads": 16,
|
| 73 |
+
"num_beam_groups": 1,
|
| 74 |
+
"num_beams": 1,
|
| 75 |
+
"num_code_groups": 16,
|
| 76 |
+
"num_hidden_layers": 5,
|
| 77 |
+
"num_key_value_heads": 8,
|
| 78 |
+
"num_return_sequences": 1,
|
| 79 |
+
"output_attentions": false,
|
| 80 |
+
"output_hidden_states": false,
|
| 81 |
+
"output_scores": false,
|
| 82 |
+
"pad_token_id": null,
|
| 83 |
+
"prefix": null,
|
| 84 |
+
"problem_type": null,
|
| 85 |
+
"pruned_heads": {},
|
| 86 |
+
"remove_invalid_values": false,
|
| 87 |
+
"repetition_penalty": 1.0,
|
| 88 |
+
"return_dict": true,
|
| 89 |
+
"return_dict_in_generate": false,
|
| 90 |
+
"rms_norm_eps": 1e-06,
|
| 91 |
+
"rope_scaling": null,
|
| 92 |
+
"rope_theta": 1000000,
|
| 93 |
+
"sep_token_id": null,
|
| 94 |
+
"sliding_window": null,
|
| 95 |
+
"suppress_tokens": null,
|
| 96 |
+
"task_specific_params": null,
|
| 97 |
+
"temperature": 1.0,
|
| 98 |
+
"tf_legacy_loss": false,
|
| 99 |
+
"tie_encoder_decoder": false,
|
| 100 |
+
"tie_word_embeddings": false,
|
| 101 |
+
"tokenizer_class": null,
|
| 102 |
+
"top_k": 50,
|
| 103 |
+
"top_p": 1.0,
|
| 104 |
+
"dtype": null,
|
| 105 |
+
"torchscript": false,
|
| 106 |
+
"typical_p": 1.0,
|
| 107 |
+
"use_bfloat16": false,
|
| 108 |
+
"use_cache": true,
|
| 109 |
+
"use_sliding_window": false,
|
| 110 |
+
"vocab_size": 2048
|
| 111 |
+
},
|
| 112 |
+
"codec_bos_id": 2149,
|
| 113 |
+
"codec_eos_token_id": 2150,
|
| 114 |
+
"codec_think_id": 2154,
|
| 115 |
+
"codec_language_id": {
|
| 116 |
+
"chinese": 2055,
|
| 117 |
+
"english": 2050,
|
| 118 |
+
"german": 2053,
|
| 119 |
+
"italian": 2070,
|
| 120 |
+
"portuguese": 2071,
|
| 121 |
+
"spanish": 2054,
|
| 122 |
+
"japanese": 2058,
|
| 123 |
+
"korean": 2064,
|
| 124 |
+
"french": 2061,
|
| 125 |
+
"russian": 2069
|
| 126 |
+
},
|
| 127 |
+
"codec_nothink_id": 2155,
|
| 128 |
+
"codec_pad_id": 2148,
|
| 129 |
+
"codec_think_bos_id": 2156,
|
| 130 |
+
"codec_think_eos_id": 2157,
|
| 131 |
+
"spk_id": {
|
| 132 |
+
},
|
| 133 |
+
"spk_is_dialect": {
|
| 134 |
+
},
|
| 135 |
+
"head_dim": 128,
|
| 136 |
+
"hidden_act": "silu",
|
| 137 |
+
"hidden_size": 2048,
|
| 138 |
+
"initializer_range": 0.02,
|
| 139 |
+
"intermediate_size": 6144,
|
| 140 |
+
"max_position_embeddings": 32768,
|
| 141 |
+
"model_type": "qwen3_tts_talker",
|
| 142 |
+
"num_attention_heads": 16,
|
| 143 |
+
"num_code_groups": 16,
|
| 144 |
+
"num_hidden_layers": 28,
|
| 145 |
+
"num_key_value_heads": 8,
|
| 146 |
+
"position_id_per_seconds": 13,
|
| 147 |
+
"rms_norm_eps": 1e-06,
|
| 148 |
+
"rope_scaling": {
|
| 149 |
+
"interleaved": true,
|
| 150 |
+
"mrope_section": [
|
| 151 |
+
24,
|
| 152 |
+
20,
|
| 153 |
+
20
|
| 154 |
+
],
|
| 155 |
+
"rope_type": "default",
|
| 156 |
+
"type": "default"
|
| 157 |
+
},
|
| 158 |
+
"rope_theta": 1000000,
|
| 159 |
+
"sliding_window": null,
|
| 160 |
+
"text_hidden_size": 2048,
|
| 161 |
+
"text_vocab_size": 151936,
|
| 162 |
+
"use_cache": true,
|
| 163 |
+
"use_sliding_window": false,
|
| 164 |
+
"vocab_size": 3072
|
| 165 |
+
},
|
| 166 |
+
"transformers_version": "4.57.3"
|
| 167 |
+
}
|
darwin_tts_blend.py
ADDED
|
@@ -0,0 +1,101 @@
|
+"""
+Darwin-TTS-1.7B-Cross: Cross-Modal LLM→TTS FFN Blending
+=========================================================
+World's first cross-modal FFN transfer from LLM to TTS.
+No training. 84 FFN tensors. Shape 100% match.
+
+Usage:
+    python darwin_tts_blend.py --alpha 3 --text "안녕하세요!"
+    python darwin_tts_blend.py --alpha 5 --ref voice.wav --text "Hello!"
+
+Alpha guide:
+    0 = Original Qwen3-TTS (no blending)
+    1 = Subtle (barely noticeable)
+    3 = Recommended (emotion appears) ★
+    5 = Maximum stable (emotion intensified) ★★
+    10 = BROKEN (do not use)
+"""
+import argparse
+import torch
+import numpy as np
+import soundfile as sf
+from pathlib import Path
+from safetensors import safe_open
+
+
+def load_llm_ffn(model_id="Qwen/Qwen3-1.7B"):
+    """Load FFN weights from Qwen3-1.7B LLM."""
+    from huggingface_hub import snapshot_download
+    path = snapshot_download(model_id, ignore_patterns=["*.bin", "*.ot", "*.msgpack"])
+    ffn = {}
+    for f in sorted(Path(path).rglob("*.safetensors")):
+        with safe_open(str(f), framework="pt") as s:
+            for k in s.keys():
+                if any(x in k for x in ["gate_proj", "up_proj", "down_proj"]):
+                    ffn[k] = s.get_tensor(k)
+    print(f"Loaded {len(ffn)} LLM FFN tensors")
+    return ffn
+
+
+def blend_tts(alpha=0.03, tts_model="Qwen/Qwen3-TTS-12Hz-1.7B-Base"):
+    """
+    Load TTS model and blend LLM FFN into talker.
+
+    Args:
+        alpha: Blend ratio (0.0 to 0.05 recommended, default 0.03)
+        tts_model: TTS model ID or path
+
+    Returns:
+        Blended Qwen3TTSModel ready for inference
+    """
+    from qwen_tts import Qwen3TTSModel
+
+    print(f"Loading TTS: {tts_model}")
+    model = Qwen3TTSModel.from_pretrained(
+        tts_model, device_map="cuda:0", dtype=torch.bfloat16
+    )
+
+    if alpha > 0:
+        llm_ffn = load_llm_ffn()
+        cnt = 0
+        for n, p in model.model.named_parameters():
+            if "talker" not in n or "code_predictor" in n:
+                continue
+            if not any(x in n for x in ["gate_proj", "up_proj", "down_proj"]):
+                continue
+            llm_key = n.replace("talker.", "")
+            if llm_key in llm_ffn:
+                with torch.no_grad():
+                    p.lerp_(llm_ffn[llm_key].to(p.device, p.dtype), alpha)
+                cnt += 1
+        print(f"Blended {cnt} FFN tensors (alpha={alpha}, shape 100% match)")
+
+    return model
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Darwin-TTS: LLM→TTS FFN Blending")
+    parser.add_argument("--alpha", type=int, default=3,
+                        help="Blend %% (0=original, 3=recommended, 5=max stable)")
+    parser.add_argument("--text", type=str,
+                        default="안녕하세요, 저는 다윈 인공지능입니다.")
+    parser.add_argument("--ref", type=str, default=None,
+                        help="Reference audio for voice cloning")
+    parser.add_argument("--output", type=str, default="darwin_output.wav")
+    args = parser.parse_args()
+
+    if args.ref is None:
+        args.ref = "/tmp/_darwin_ref.wav"
+        sf.write(args.ref,
+                 (0.1 * np.sin(2 * np.pi * 200 * np.linspace(0, 3, 72000))
+                  ).astype(np.float32), 24000)
+        print("Using default sine reference (provide --ref for better quality)")
+
+    model = blend_tts(alpha=args.alpha / 100.0)
+    wavs, sr = model.generate_voice_clone(
+        text=args.text, ref_audio=args.ref,
+        ref_text="ref", x_vector_only_mode=True
+    )
+    wav = wavs[0].cpu().numpy() if hasattr(wavs[0], "cpu") else np.array(wavs[0])
+    sf.write(args.output, wav, sr)
+    print(f"Saved: {args.output} ({len(wav)/sr:.1f}s)")
|
generation_config.json
ADDED
|
@@ -0,0 +1,12 @@
|
| 1 |
+
{
|
| 2 |
+
"do_sample": true,
|
| 3 |
+
"repetition_penalty": 1.05,
|
| 4 |
+
"temperature": 0.9,
|
| 5 |
+
"top_p": 1.0,
|
| 6 |
+
"top_k": 50,
|
| 7 |
+
"subtalker_dosample": true,
|
| 8 |
+
"subtalker_temperature": 0.9,
|
| 9 |
+
"subtalker_top_p": 1.0,
|
| 10 |
+
"subtalker_top_k": 50,
|
| 11 |
+
"max_new_tokens": 8192
|
| 12 |
+
}
|
merges.txt
ADDED
|
The diff for this file is too large to render.
See raw diff
|
|
|
model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:22231d6695626fa4bf1dcd561fcdf97d962b6257a533e06e27ef4fa06d884444
|
| 3 |
+
size 4539707356
|
preprocessor_config.json
ADDED
|
@@ -0,0 +1,6 @@
|
| 1 |
+
{
|
| 2 |
+
"padding_side": "left",
|
| 3 |
+
"padding_value": 0.0,
|
| 4 |
+
"processor_class": "Qwen3TTSProcessor",
|
| 5 |
+
"return_attention_mask": true
|
| 6 |
+
}
|
speech_tokenizer/config.json
ADDED
|
@@ -0,0 +1,94 @@
|
| 1 |
+
{
|
| 2 |
+
"architectures": [
|
| 3 |
+
"Qwen3TTSTokenizerV2Model"
|
| 4 |
+
],
|
| 5 |
+
"model_type": "qwen3_tts_tokenizer_12hz",
|
| 6 |
+
"encoder_valid_num_quantizers": 16,
|
| 7 |
+
"input_sample_rate": 24000,
|
| 8 |
+
"output_sample_rate": 24000,
|
| 9 |
+
"decode_upsample_rate": 1920,
|
| 10 |
+
"encode_downsample_rate": 1920,
|
| 11 |
+
"decoder_config": {
|
| 12 |
+
"attention_bias": false,
|
| 13 |
+
"attention_dropout": 0.0,
|
| 14 |
+
"latent_dim": 1024,
|
| 15 |
+
"codebook_dim": 512,
|
| 16 |
+
"codebook_size": 2048,
|
| 17 |
+
"decoder_dim": 1536,
|
| 18 |
+
"hidden_act": "silu",
|
| 19 |
+
"hidden_size": 512,
|
| 20 |
+
"intermediate_size": 1024,
|
| 21 |
+
"layer_scale_initial_scale": 0.01,
|
| 22 |
+
"max_position_embeddings": 8000,
|
| 23 |
+
"head_dim": 64,
|
| 24 |
+
"num_attention_heads": 16,
|
| 25 |
+
"num_hidden_layers": 8,
|
| 26 |
+
"num_key_value_heads": 16,
|
| 27 |
+
"num_quantizers": 16,
|
| 28 |
+
"num_semantic_quantizers": 1,
|
| 29 |
+
"rms_norm_eps": 1e-05,
|
| 30 |
+
"rope_theta": 10000,
|
| 31 |
+
"semantic_codebook_size": 4096,
|
| 32 |
+
"sliding_window": 72,
|
| 33 |
+
"upsample_rates": [
|
| 34 |
+
8,
|
| 35 |
+
5,
|
| 36 |
+
4,
|
| 37 |
+
3
|
| 38 |
+
],
|
| 39 |
+
"upsampling_ratios": [
|
| 40 |
+
2,
|
| 41 |
+
2
|
| 42 |
+
],
|
| 43 |
+
"vector_quantization_hidden_dimension": 512
|
| 44 |
+
},
|
| 45 |
+
"encoder_config": {
|
| 46 |
+
"_frame_rate": 12.5,
|
| 47 |
+
"attention_bias": false,
|
| 48 |
+
"attention_dropout": 0.0,
|
| 49 |
+
"audio_channels": 1,
|
| 50 |
+
"codebook_dim": 256,
|
| 51 |
+
"codebook_size": 2048,
|
| 52 |
+
"compress": 2,
|
| 53 |
+
"dilation_growth_rate": 2,
|
| 54 |
+
"dtype": "float32",
|
| 55 |
+
"head_dim": 64,
|
| 56 |
+
"hidden_act": "gelu",
|
| 57 |
+
"hidden_size": 512,
|
| 58 |
+
"initializer_range": 0.02,
|
| 59 |
+
"intermediate_size": 2048,
|
| 60 |
+
"kernel_size": 7,
|
| 61 |
+
"last_kernel_size": 3,
|
| 62 |
+
"layer_scale_initial_scale": 0.01,
|
| 63 |
+
"max_position_embeddings": 8000,
|
| 64 |
+
"norm_eps": 1e-05,
|
| 65 |
+
"normalize": false,
|
| 66 |
+
"num_attention_heads": 8,
|
| 67 |
+
"num_filters": 64,
|
| 68 |
+
"num_hidden_layers": 8,
|
| 69 |
+
"num_key_value_heads": 8,
|
| 70 |
+
"num_quantizers": 32,
|
| 71 |
+
"num_residual_layers": 1,
|
| 72 |
+
"num_semantic_quantizers": 1,
|
| 73 |
+
"pad_mode": "constant",
|
| 74 |
+
"residual_kernel_size": 3,
|
| 75 |
+
"rope_theta": 10000.0,
|
| 76 |
+
"sampling_rate": 24000,
|
| 77 |
+
"sliding_window": 250,
|
| 78 |
+
"transformers_version": "4.57.0.dev0",
|
| 79 |
+
"trim_right_ratio": 1.0,
|
| 80 |
+
"upsample_groups": 512,
|
| 81 |
+
"upsampling_ratios": [
|
| 82 |
+
8,
|
| 83 |
+
6,
|
| 84 |
+
5,
|
| 85 |
+
4
|
| 86 |
+
],
|
| 87 |
+
"use_cache": false,
|
| 88 |
+
"use_causal_conv": true,
|
| 89 |
+
"use_conv_shortcut": false,
|
| 90 |
+
"use_streaming": false,
|
| 91 |
+
"vector_quantization_hidden_dimension": 256
|
| 92 |
+
},
|
| 93 |
+
"transformers_version": "4.57.3"
|
| 94 |
+
}
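
The rate fields in this config can be cross-checked with a little arithmetic. The sketch below copies the values from the JSON above; treating the decoder's two ratio lists as factors that multiply together is an assumption suggested by the numbers, not something the config states.

```python
from math import prod

# Values copied from speech_tokenizer/config.json in this commit.
SAMPLE_RATE_HZ = 24000
ENCODE_DOWNSAMPLE_RATE = 1920
DECODE_UPSAMPLE_RATE = 1920
DECODER_UPSAMPLE_RATES = (8, 5, 4, 3)
DECODER_UPSAMPLING_RATIOS = (2, 2)


def frames_per_second(sample_rate_hz, downsample_rate):
    """Codec frames produced per second of audio."""
    return sample_rate_hz / downsample_rate


def decoder_total_upsampling(upsample_rates, upsampling_ratios):
    """Overall decoder upsampling factor, assuming the two lists multiply together."""
    return prod(upsample_rates) * prod(upsampling_ratios)


FRAME_RATE_HZ = frames_per_second(SAMPLE_RATE_HZ, ENCODE_DOWNSAMPLE_RATE)
TOTAL_UPSAMPLING = decoder_total_upsampling(DECODER_UPSAMPLE_RATES, DECODER_UPSAMPLING_RATIOS)
```

Here 24000 / 1920 gives the 12.5 Hz listed as `_frame_rate`, and 8 × 5 × 4 × 3 × 2 × 2 equals the `decode_upsample_rate` of 1920.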
|
speech_tokenizer/configuration.json
ADDED
|
@@ -0,0 +1 @@
|
| 1 |
+
{"framework": "pytorch", "task": "feature-extraction", "allow_remote": true}
|
speech_tokenizer/model.safetensors
ADDED
|
@@ -0,0 +1,3 @@
|
| 1 |
+
version https://git-lfs.github.com/spec/v1
|
| 2 |
+
oid sha256:836b7b357f5ea43e889936a3709af68dfe3751881acefe4ecf0dbd30ba571258
|
| 3 |
+
size 682293092
|
speech_tokenizer/preprocessor_config.json
ADDED
|
@@ -0,0 +1,10 @@
|
| 1 |
+
{
|
| 2 |
+
"chunk_length_s": null,
|
| 3 |
+
"feature_extractor_type": "EncodecFeatureExtractor",
|
| 4 |
+
"feature_size": 1,
|
| 5 |
+
"overlap": null,
|
| 6 |
+
"padding_side": "right",
|
| 7 |
+
"padding_value": 0.0,
|
| 8 |
+
"return_attention_mask": true,
|
| 9 |
+
"sampling_rate": 24000
|
| 10 |
+
}
|
tokenizer_config.json
ADDED
|
@@ -0,0 +1,316 @@
|
+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151669": {
+      "content": "<|audio_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151670": {
+      "content": "<|audio_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151671": {
+      "content": "<tts_pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151672": {
+      "content": "<tts_text_bos>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151673": {
+      "content": "<tts_text_eod>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151674": {
+      "content": "<tts_text_bos_single>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151675": {
+      "content": "<|audio_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>",
+    "<|audio_start|>",
+    "<|audio_end|>",
+    "<tts_pad>",
+    "<tts_text_bos>",
+    "<tts_text_bos_single>",
+    "<|audio_pad|>"
+  ],
+  "extra_special_tokens": {
+    "image_token": "<|image_pad|>",
+    "audio_token": "<|audio_pad|>",
+    "video_token": "<|video_pad|>",
+    "vision_bos_token": "<|vision_start|>",
+    "vision_eos_token": "<|vision_end|>",
+    "audio_bos_token": "<|audio_start|>",
+    "audio_eos_token": "<|audio_end|>"
+  },
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null,
+  "image_token": "<|image_pad|>",
+  "audio_token": "<|audio_pad|>",
+  "video_token": "<|video_pad|>",
+  "vision_bos_token": "<|vision_start|>",
+  "vision_eos_token": "<|vision_end|>",
+  "audio_bos_token": "<|audio_start|>",
+  "audio_eos_token": "<|audio_end|>"
+}
|
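The `added_tokens_decoder` map in tokenizer_config.json keys each added token by its ID and flags it with `"special"`. As a minimal sketch of how that structure can be inspected, the snippet below parses a three-entry excerpt copied from the diff and separates special tokens from regular ones. It assumes only Python's standard `json` module; `split_added_tokens` is a hypothetical helper name used for illustration, not part of any library.

```python
import json

# Excerpt copied from the tokenizer_config.json diff; the per-token fields
# lstrip/normalized/rstrip/single_word are omitted here for brevity.
CONFIG_EXCERPT = """
{
  "added_tokens_decoder": {
    "151643": {"content": "<|endoftext|>", "special": true},
    "151657": {"content": "<tool_call>", "special": false},
    "151675": {"content": "<|audio_pad|>", "special": true}
  },
  "eos_token": "<|im_end|>",
  "pad_token": "<|endoftext|>"
}
"""


def split_added_tokens(config):
    """Hypothetical helper: return (special, regular) dicts of id -> content."""
    special, regular = {}, {}
    for token_id, entry in config["added_tokens_decoder"].items():
        # JSON object keys are strings, so convert the token ID to an int.
        target = special if entry["special"] else regular
        target[int(token_id)] = entry["content"]
    return special, regular


config = json.loads(CONFIG_EXCERPT)
special, regular = split_added_tokens(config)
print("special:", special)
print("regular:", regular)
```

Running the same helper on the full file would presumably show the tool-call, FIM, and `<think>` tokens on the regular side, since the diff marks them `"special": false`.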
vocab.json
ADDED
The diff for this file is too large to render.