Add Darwin TTS compatibility repack without sample WAVs

Files changed (13)

README.md +212 -0
config.json +167 -0
darwin_tts_blend.py +101 -0
generation_config.json +12 -0
merges.txt +0 -0
model.safetensors +3 -0
preprocessor_config.json +6 -0
speech_tokenizer/config.json +94 -0
speech_tokenizer/configuration.json +1 -0
speech_tokenizer/model.safetensors +3 -0
speech_tokenizer/preprocessor_config.json +10 -0
tokenizer_config.json +316 -0
vocab.json +0 -0

README.md CHANGED

@@ -1,3 +1,215 @@
 ---
 license: apache-2.0
 ---

 ---
+base_model:
+- FINAL-Bench/Darwin-TTS-1.7B-Cross
+- Qwen/Qwen3-TTS-12Hz-1.7B-Base
+- Qwen/Qwen3-1.7B
+language:
+- ko
+- en
+- ja
+- zh
+- de
+- fr
+- ru
+- pt
+- es
+- it
 license: apache-2.0
+pipeline_tag: text-to-speech
+tags:
+- tts
+- text-to-speech
+- darwin
+- qwen3
+- qwen3-tts
+- voice-cloning
+- compatibility-fix
 ---
+# Darwin-TTS-1.7B-Cross — Qwen3-TTS compatibility repack
+This repository is a compatibility repack of [`FINAL-Bench/Darwin-TTS-1.7B-Cross`](https://huggingface.co/FINAL-Bench/Darwin-TTS-1.7B-Cross).
+The original Darwin checkpoint appears to omit the `speech_tokenizer/` directory required by the standard `qwen-tts` loader. This repack adds the missing `speech_tokenizer/` files from [`Qwen/Qwen3-TTS-12Hz-1.7B-Base`](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base).
+No model blending, training, fine-tuning, or behavioral changes were performed in this repack. The purpose is only to make the model load with:
+```python
+from qwen_tts import Qwen3TTSModel
+model = Qwen3TTSModel.from_pretrained("zeropointnine/Darwin-TTS-1.7B-Cross-Qwen3Tokenizer")
+```
+# Provenance
+- Main model weights and model card: FINAL-Bench/Darwin-TTS-1.7B-Cross
+- Added tokenizer assets: Qwen/Qwen3-TTS-12Hz-1.7B-Base
+- License: Apache 2.0, matching the upstream model cards.
+### Original Darwin-TTS-1.7B-Cross model card follows below:
+# 🧬 Darwin-TTS-1.7B-Cross
+**World's first cross-modal FFN transfer from LLM to TTS — emotion-enhanced speech synthesis without any training.**
+This model is a cross-modal application of the Darwin Family framework, introduced in the paper: [Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning](https://huggingface.co/papers/2605.14386).
+**Authors:** Taebong Kim, Youngsik Hong, Minsik Kim, Sunyoung Choi, Jaewon Jang, Junghoon Shin, Minseo Kim.
+> Darwin-TTS blends 3% of the FFN weights of [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) (LLM) into the talker module of [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) (TTS). No training, no data, no GPU hours: just weight-space arithmetic.
+## Key Discovery
+| Blend (α) | Emotion | Quality | Status |
+|-----------|---------|---------|--------|
+| 0% | Baseline | Normal | Original Qwen3-TTS |
+| 1% | No change | Normal | Too subtle |
+| **3%** | **Emotion appears** | **Normal** | **★ This model (default)** |
+| 5% | Emotion intensified | Normal | ★★ Max stable |
+| 10% | Broken | Failed | Infinite generation |
+## Why It Works
+Qwen3-1.7B (LLM) and Qwen3-TTS-1.7B's talker share **100% identical architecture**:
+```
+                    Qwen3-1.7B (LLM)    Qwen3-TTS talker    Match
+hidden_size         2048                 2048                ✅
+intermediate_size   6144                 6144                ✅
+num_hidden_layers   28                   28                  ✅
+num_attention_heads 16                   16                  ✅
+num_key_value_heads 8                    8                   ✅
+```
+This means **zero SVD, zero truncation, zero layer mapping** — pure 1:1 lerp blending across all 84 FFN tensors (gate_proj, up_proj, down_proj × 28 layers).
+## Architecture
+```
+Qwen3-TTS-1.7B (4-module structure):
+┌─────────────────────────────────────────────────────┐
+│ talker (28L Qwen3 LM backbone)                      │
+│   └── 84 FFN tensors blended with LLM (α=3%)       │ ← MODIFIED
+│       └── talker.model.layers.N.mlp.{gate,up,down}  │
+├─────────────────────────────────────────────────────┤
+│ code_predictor (5L, h=1024)                          │ ← UNTOUCHED
+├─────────────────────────────────────────────────────┤
+│ speech_tokenizer (12Hz RVQ codec)                    │ ← UNTOUCHED
+├─────────────────────────────────────────────────────┤
+│ encoder/decoder (audio waveform)                     │ ← UNTOUCHED
+└─────────────────────────────────────────────────────┘
+FFN Source: Qwen3-1.7B (LLM)
+└── model.layers.N.mlp.{gate,up,down}_proj.weight
+    └── Key mapping: model.layers.N → talker.model.layers.N (1:1)
+```
+Only the talker's FFN weights are modified. The code_predictor, speech_tokenizer, and encoder/decoder remain 100% original — preserving the audio codec pipeline entirely.
+## Quick Start
+### Option 1: Load pre-blended weights (this model)
+```python
+from qwen_tts import Qwen3TTSModel
+import torch
+# Load Darwin-TTS-1.7B-Cross (α=3% pre-blended)
+model = Qwen3TTSModel.from_pretrained(
+    "FINAL-Bench/Darwin-TTS-1.7B-Cross",
+    device_map="cuda:0",
+    dtype=torch.bfloat16
+)
+# Synthesize
+wavs, sr = model.generate_voice_clone(
+    text="안녕하세요, 저는 다윈 인공지능입니다!",
+    ref_audio="your_voice.wav",
+    ref_text="ref",
+    x_vector_only_mode=True
+)
+```
+### Option 2: Custom blend ratio (runtime blending)
+```python
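+# blend_tts is defined in darwin_tts_blend.py (shipped in this repository)
+from darwin_tts_blend import blend_tts
+# alpha is a fraction, not a percent: 0.05 blends 5% LLM FFN weights at load time
+model = blend_tts(alpha=0.05)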
+wavs, sr = model.generate_voice_clone(
+    text="정말 기쁜 소식이에요!",
+    ref_audio="voice.wav",
+    ref_text="ref",
+    x_vector_only_mode=True
+)
+```
+### CLI
+```bash
+python darwin_tts_blend.py --alpha 3 --text "Hello, Darwin!" --ref voice.wav --output speech.wav
+```
+## Installation
+```bash
+pip install torch qwen-tts safetensors soundfile huggingface_hub
+```
+## Research Background
+### The Problem
+Cross-modal capability transfer (e.g., adding emotion to TTS) traditionally requires:
+- Thousands of hours of emotional speech data
+- Hundreds of GPU hours for training
+- Careful data curation and annotation
+### The Darwin Approach
+Darwin's evolutionary merge framework, originally developed for LLM merging (Darwin LLM V7 achieved GPQA Diamond 86.9%, World #5), is extended to cross-modal transfer:
+1. **Find architecture-compatible models** across modalities (LLM ↔ TTS)
+2. **Blend FFN weights** at low ratios (3~5%) using simple lerp
+3. **Preserve modality-specific components** (audio codec, tokenizer)
+### Key Findings
+1. **Cross-modal FFN transfer works** — LLM's language understanding patterns enhance TTS emotional expressiveness
+2. **Sweet spot is 3~5%** — TTS is far more sensitive than LLM merging (which tolerates 7~93%)
+3. **Same backbone is required** — Qwen3 × Qwen3 succeeded; cross-backbone merges (e.g., Llama) failed.
+4. **10%+ destroys TTS** — LLM's "continue generating tokens" pattern overrides the TTS stop signal, causing 655-second outputs
+5. **Bidirectional potential** — LLM + TTS FFN may enable "Speaking LLM" (the GPT-4o direction)
+## Model Details
+- **Model type**: Text-to-Speech (cross-modal FFN blended)
+- **Base models**: Qwen3-TTS-1.7B-Base + Qwen3-1.7B (3% FFN)
+- **Parameters**: ~2.1B
+- **Languages**: Korean, English, Japanese, Chinese, German, French, Russian, Portuguese, Spanish, Italian
+- **License**: Apache 2.0
+- **Blend ratio**: α=0.03 (3%)
+- **FFN tensors modified**: 84 / 976 total (8.6%)
+- **Build time**: ~2 minutes (no training)
+## Citation
+If you find this work useful in your research, please cite:
+```bibtex
+@article{kim2026darwin,
+  title={Darwin Family: MRI-Trust-Weighted Evolutionary Merging for Training-Free Scaling of Language-Model Reasoning},
+  author={Kim, Taebong and Hong, Youngsik and Kim, Minsik and Choi, Sunyoung and Jang, Jaewon and Shin, Junghoon and Kim, Minseo},
+  journal={arXiv preprint arXiv:2605.14386},
+  year={2026}
+}
+```
+## Credits
+**[VIDRAFT](https://vidraft.net)** (비드래프트): Darwin Evolutionary Merge Framework
+Built on [Qwen3-TTS-1.7B](https://huggingface.co/Qwen/Qwen3-TTS-12Hz-1.7B-Base) and [Qwen3-1.7B](https://huggingface.co/Qwen/Qwen3-1.7B) by Alibaba Cloud (Apache 2.0).
+## Related
+- [Darwin-27B-Opus](https://huggingface.co/FINAL-Bench/Darwin-27B-Opus) — Darwin LLM Flagship
+- [FINAL Bench](https://huggingface.co/FINAL-Bench) — Text AGI Benchmark
+- [Darwin Evolutionary Merge Framework](https://huggingface.co/FINAL-Bench) — CMA-ES + FFN crossbreeding
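
Note on the README's "Why It Works" and "Architecture" sections: the 1:1 lerp and the `model.layers.N` to `talker.model.layers.N` key mapping can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the author's implementation: `ffn_keys` and `lerp` are hypothetical helper names, short lists stand in for real weight tensors, and the formula is the standard definition of linear interpolation (the same arithmetic `torch.Tensor.lerp_` performs in darwin_tts_blend.py).

```python
# Illustrative sketch only: plain lists stand in for weight tensors.
PROJECTIONS = ("gate_proj", "up_proj", "down_proj")
NUM_LAYERS = 28  # num_hidden_layers from the README table


def ffn_keys(prefix=""):
    """List FFN weight keys; prefix="talker." gives the TTS-side names."""
    return [
        f"{prefix}model.layers.{layer}.mlp.{proj}.weight"
        for layer in range(NUM_LAYERS)
        for proj in PROJECTIONS
    ]


def lerp(w_tts, w_llm, alpha):
    """Linear interpolation: w_tts + alpha * (w_llm - w_tts), element-wise."""
    return [t + alpha * (l - t) for t, l in zip(w_tts, w_llm)]


llm_keys = ffn_keys()
tts_keys = ffn_keys(prefix="talker.")

# The key mapping is a plain prefix strip, so the two lists pair up 1:1.
assert [k.replace("talker.", "", 1) for k in tts_keys] == llm_keys
print(len(tts_keys))  # 3 projections x 28 layers

# Toy weights: at alpha=0.03 each value moves 3% of the way toward the LLM.
blended = lerp([1.0, -2.0], [3.0, 2.0], alpha=0.03)
print(blended)
```

At `alpha=0` the function returns the TTS weights unchanged, which corresponds to the 0% baseline row of the README table.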

config.json ADDED

	@@ -0,0 +1,167 @@

+{
+  "architectures": [
+    "Qwen3TTSForConditionalGeneration"
+  ],
+  "assistant_token_id": 77091,
+  "im_end_token_id": 151645,
+  "im_start_token_id": 151644,
+  "tts_bos_token_id": 151672,
+  "tts_eos_token_id": 151673,
+  "tts_pad_token_id": 151671,
+  "model_type": "qwen3_tts",
+  "tokenizer_type": "qwen3_tts_tokenizer_12hz",
+  "tts_model_size": "1b7",
+  "tts_model_type": "base",
+  "speaker_encoder_config": {
+    "enc_dim": 2048,
+    "sample_rate": 24000
+  },
+  "talker_config": {
+    "attention_bias": false,
+    "attention_dropout": 0,
+    "code_predictor_config": {
+      "_name_or_path": "",
+      "add_cross_attention": false,
+      "architectures": null,
+      "attention_bias": false,
+      "attention_dropout": 0,
+      "bad_words_ids": null,
+      "begin_suppress_tokens": null,
+      "bos_token_id": null,
+      "chunk_size_feed_forward": 0,
+      "cross_attention_hidden_size": null,
+      "decoder_start_token_id": null,
+      "diversity_penalty": 0.0,
+      "do_sample": false,
+      "early_stopping": false,
+      "encoder_no_repeat_ngram_size": 0,
+      "eos_token_id": null,
+      "exponential_decay_length_penalty": null,
+      "finetuning_task": null,
+      "forced_bos_token_id": null,
+      "forced_eos_token_id": null,
+      "head_dim": 128,
+      "hidden_act": "silu",
+      "hidden_size": 1024,
+      "id2label": {
+        "0": "LABEL_0",
+        "1": "LABEL_1"
+      },
+      "initializer_range": 0.02,
+      "intermediate_size": 3072,
+      "is_decoder": false,
+      "is_encoder_decoder": false,
+      "label2id": {
+        "LABEL_0": 0,
+        "LABEL_1": 1
+      },
+      "layer_types": [
+        "full_attention",
+        "full_attention",
+        "full_attention",
+        "full_attention",
+        "full_attention"
+      ],
+      "length_penalty": 1.0,
+      "max_length": 20,
+      "max_position_embeddings": 65536,
+      "max_window_layers": 28,
+      "min_length": 0,
+      "model_type": "qwen3_tts_talker_code_predictor",
+      "no_repeat_ngram_size": 0,
+      "num_attention_heads": 16,
+      "num_beam_groups": 1,
+      "num_beams": 1,
+      "num_code_groups": 16,
+      "num_hidden_layers": 5,
+      "num_key_value_heads": 8,
+      "num_return_sequences": 1,
+      "output_attentions": false,
+      "output_hidden_states": false,
+      "output_scores": false,
+      "pad_token_id": null,
+      "prefix": null,
+      "problem_type": null,
+      "pruned_heads": {},
+      "remove_invalid_values": false,
+      "repetition_penalty": 1.0,
+      "return_dict": true,
+      "return_dict_in_generate": false,
+      "rms_norm_eps": 1e-06,
+      "rope_scaling": null,
+      "rope_theta": 1000000,
+      "sep_token_id": null,
+      "sliding_window": null,
+      "suppress_tokens": null,
+      "task_specific_params": null,
+      "temperature": 1.0,
+      "tf_legacy_loss": false,
+      "tie_encoder_decoder": false,
+      "tie_word_embeddings": false,
+      "tokenizer_class": null,
+      "top_k": 50,
+      "top_p": 1.0,
+      "dtype": null,
+      "torchscript": false,
+      "typical_p": 1.0,
+      "use_bfloat16": false,
+      "use_cache": true,
+      "use_sliding_window": false,
+      "vocab_size": 2048
+    },
+    "codec_bos_id": 2149,
+    "codec_eos_token_id": 2150,
+    "codec_think_id": 2154,
+    "codec_language_id": {
+        "chinese": 2055,
+        "english": 2050,
+        "german": 2053,
+        "italian": 2070,
+        "portuguese": 2071,
+        "spanish": 2054,
+        "japanese": 2058,
+        "korean": 2064,
+        "french": 2061,
+        "russian": 2069
+    },
+    "codec_nothink_id": 2155,
+    "codec_pad_id": 2148,
+    "codec_think_bos_id": 2156,
+    "codec_think_eos_id": 2157,
+    "spk_id": {
+    },
+    "spk_is_dialect": {
+    },
+    "head_dim": 128,
+    "hidden_act": "silu",
+    "hidden_size": 2048,
+    "initializer_range": 0.02,
+    "intermediate_size": 6144,
+    "max_position_embeddings": 32768,
+    "model_type": "qwen3_tts_talker",
+    "num_attention_heads": 16,
+    "num_code_groups": 16,
+    "num_hidden_layers": 28,
+    "num_key_value_heads": 8,
+    "position_id_per_seconds": 13,
+    "rms_norm_eps": 1e-06,
+    "rope_scaling": {
+      "interleaved": true,
+      "mrope_section": [
+        24,
+        20,
+        20
+      ],
+      "rope_type": "default",
+      "type": "default"
+    },
+    "rope_theta": 1000000,
+    "sliding_window": null,
+    "text_hidden_size": 2048,
+    "text_vocab_size": 151936,
+    "use_cache": true,
+    "use_sliding_window": false,
+    "vocab_size": 3072
+  },
+  "transformers_version": "4.57.3"
+}
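
Note on config.json: the README's architecture table can be checked against the `talker_config` block above. The sketch below is a hedged illustration: `mismatched_fields` is a hypothetical helper, the talker values are copied from this file, and the LLM values are copied from the README table rather than loaded from the actual Qwen3-1.7B config, so they are assumed accurate. The weight shapes follow the usual PyTorch `nn.Linear` convention of `(out_features, in_features)`.

```python
# Values copied from the talker_config block of config.json above.
talker_config = {
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}

# Values copied from the README table's "Qwen3-1.7B (LLM)" column.
# Assumption: the table is accurate; nothing is read from the LLM repo here.
llm_config = {
    "hidden_size": 2048,
    "intermediate_size": 6144,
    "num_hidden_layers": 28,
    "num_attention_heads": 16,
    "num_key_value_heads": 8,
}


def mismatched_fields(a, b):
    """Return the sorted field names whose values differ between a and b."""
    return sorted(k for k in a.keys() | b.keys() if a.get(k) != b.get(k))


mismatches = mismatched_fields(talker_config, llm_config)
print(mismatches or "all compared fields match")

# FFN weight shapes implied by these dimensions (out_features, in_features).
hidden = talker_config["hidden_size"]
inter = talker_config["intermediate_size"]
gate_up_shape = (inter, hidden)   # gate_proj and up_proj
down_shape = (hidden, inter)      # down_proj
print(gate_up_shape, down_shape)
```

Matching shapes are what allow an element-wise blend without any projection or truncation step.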

darwin_tts_blend.py ADDED

	@@ -0,0 +1,101 @@

+"""
+Darwin-TTS-1.7B-Cross: Cross-Modal LLM→TTS FFN Blending
+=========================================================
+World's first cross-modal FFN transfer from LLM to TTS.
+No training. 84 FFN tensors. Shape 100% match.
+Usage:
+    python darwin_tts_blend.py --alpha 3 --text "안녕하세요!"
+    python darwin_tts_blend.py --alpha 5 --ref voice.wav --text "Hello!"
+Alpha guide:
+    0  = Original Qwen3-TTS (no blending)
+    1  = Subtle (barely noticeable)
+    3  = Recommended (emotion appears) ★
+    5  = Maximum stable (emotion intensified) ★★
+    10 = BROKEN (do not use)
+"""
+import argparse
+import torch
+import numpy as np
+import soundfile as sf
+from pathlib import Path
+from safetensors import safe_open
+def load_llm_ffn(model_id="Qwen/Qwen3-1.7B"):
+    """Load FFN weights from Qwen3-1.7B LLM."""
+    from huggingface_hub import snapshot_download
+    path = snapshot_download(model_id, ignore_patterns=["*.bin", "*.ot", "*.msgpack"])
+    ffn = {}
+    for f in sorted(Path(path).rglob("*.safetensors")):
+        with safe_open(str(f), framework="pt") as s:
+            for k in s.keys():
+                if any(x in k for x in ["gate_proj", "up_proj", "down_proj"]):
+                    ffn[k] = s.get_tensor(k)
+    print(f"Loaded {len(ffn)} LLM FFN tensors")
+    return ffn
+def blend_tts(alpha=0.03, tts_model="Qwen/Qwen3-TTS-12Hz-1.7B-Base"):
+    """
+    Load TTS model and blend LLM FFN into talker.
+    Args:
+        alpha: Blend ratio (0.0 to 0.05 recommended, default 0.03)
+        tts_model: TTS model ID or path
+    Returns:
+        Blended Qwen3TTSModel ready for inference
+    """
+    from qwen_tts import Qwen3TTSModel
+    print(f"Loading TTS: {tts_model}")
+    model = Qwen3TTSModel.from_pretrained(
+        tts_model, device_map="cuda:0", dtype=torch.bfloat16
+    )
+    if alpha > 0:
+        llm_ffn = load_llm_ffn()
+        cnt = 0
+        for n, p in model.model.named_parameters():
+            if "talker" not in n or "code_predictor" in n:
+                continue
+            if not any(x in n for x in ["gate_proj", "up_proj", "down_proj"]):
+                continue
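+            # Map the TTS key onto the LLM key: talker.model.layers.N -> model.layers.N
+            llm_key = n.replace("talker.", "")
+            # Blend only when the LLM tensor exists and its shape matches exactly
+            if llm_key in llm_ffn and llm_ffn[llm_key].shape == p.shape: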
+                with torch.no_grad():
+                    p.lerp_(llm_ffn[llm_key].to(p.device, p.dtype), alpha)
+                cnt += 1
+        print(f"Blended {cnt} FFN tensors (alpha={alpha}, shape 100% match)")
+    return model
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description="Darwin-TTS: LLM→TTS FFN Blending")
+    parser.add_argument("--alpha", type=int, default=3,
+                        help="Blend %% (0=original, 3=recommended, 5=max stable)")
+    parser.add_argument("--text", type=str,
+                        default="안녕하세요, 저는 다윈 인공지능입니다.")
+    parser.add_argument("--ref", type=str, default=None,
+                        help="Reference audio for voice cloning")
+    parser.add_argument("--output", type=str, default="darwin_output.wav")
+    args = parser.parse_args()
+    if args.ref is None:
+        args.ref = "/tmp/_darwin_ref.wav"
+        sf.write(args.ref,
+                 (0.1 * np.sin(2 * np.pi * 200 * np.linspace(0, 3, 72000))
+                  ).astype(np.float32), 24000)
+        print("Using default sine reference (provide --ref for better quality)")
+    model = blend_tts(alpha=args.alpha / 100.0)
+    wavs, sr = model.generate_voice_clone(
+        text=args.text, ref_audio=args.ref,
+        ref_text="ref", x_vector_only_mode=True
+    )
+    wav = wavs[0].cpu().numpy() if hasattr(wavs[0], "cpu") else np.array(wavs[0])
+    sf.write(args.output, wav, sr)
+    print(f"Saved: {args.output} ({len(wav)/sr:.1f}s)")
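
Note on the CLI: the script accepts `--alpha` as an integer percent and divides by 100, and it will accept 10 even though the docstring marks that value as broken. A guard along the following lines could reject out-of-range values before any model is loaded. This is only a sketch: `percent_to_alpha` and `MAX_STABLE_PERCENT` are hypothetical names that do not exist in the script, and the 5% ceiling is taken from the "Max stable" row of the README table.

```python
MAX_STABLE_PERCENT = 5  # "Max stable" row of the README table


def percent_to_alpha(percent, max_percent=MAX_STABLE_PERCENT):
    """Convert a blend percent (e.g. 3) to a lerp fraction (e.g. 0.03).

    Raises ValueError for values outside [0, max_percent].
    """
    if not 0 <= percent <= max_percent:
        raise ValueError(
            f"alpha must be between 0 and {max_percent} percent, got {percent}"
        )
    return percent / 100.0


print(percent_to_alpha(3))

try:
    percent_to_alpha(10)
except ValueError as err:
    print(f"rejected: {err}")
```

Raising early keeps a mistyped value from triggering a multi-gigabyte download and a runaway generation.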

generation_config.json ADDED

	@@ -0,0 +1,12 @@

+{
+  "do_sample": true,
+  "repetition_penalty": 1.05,
+  "temperature": 0.9,
+  "top_p": 1.0,
+  "top_k": 50,
+  "subtalker_dosample": true,
+  "subtalker_temperature": 0.9,
+  "subtalker_top_p": 1.0,
+  "subtalker_top_k": 50,
+  "max_new_tokens": 8192
+}
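
Note on `max_new_tokens`: the "655-second outputs" mentioned in the README's Key Findings are consistent with this cap. The arithmetic below is a sketch under one assumption that the source does not state explicitly: each generated talker step corresponds to one codec frame at the 12.5 Hz `_frame_rate` listed in speech_tokenizer/config.json.

```python
MAX_NEW_TOKENS = 8192   # generation_config.json
FRAME_RATE_HZ = 12.5    # encoder_config._frame_rate in speech_tokenizer/config.json

# Assumption: one generated step per codec frame.
max_duration_s = MAX_NEW_TOKENS / FRAME_RATE_HZ
print(f"{max_duration_s:.2f} s")
```

Under that assumption, a generation that never emits the stop token runs until the cap and yields roughly eleven minutes of audio.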

merges.txt ADDED

The diff for this file is too large to render. See raw diff

model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:22231d6695626fa4bf1dcd561fcdf97d962b6257a533e06e27ef4fa06d884444
+size 4539707356
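
Note on the pointer file: the three lines above are a Git LFS pointer, not the weights themselves. Each line is a key and a value separated by a single space. The sketch below parses such a pointer; `parse_lfs_pointer` is a hypothetical helper written for illustration, and the pointer text is copied from this diff.

```python
def parse_lfs_pointer(text):
    """Parse a Git LFS pointer into version, hash algorithm, digest and size."""
    fields = dict(line.split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {
        "version": fields["version"],
        "algo": algo,
        "digest": digest,
        "size": int(fields["size"]),
    }


pointer_text = """version https://git-lfs.github.com/spec/v1
oid sha256:22231d6695626fa4bf1dcd561fcdf97d962b6257a533e06e27ef4fa06d884444
size 4539707356
"""

info = parse_lfs_pointer(pointer_text)
print(info["algo"], len(info["digest"]))
print(f"{info['size'] / 1e9:.2f} GB")
```

The digest can be compared against a locally computed SHA-256 of the downloaded file to confirm the download is intact.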

preprocessor_config.json ADDED

	@@ -0,0 +1,6 @@

+{
+  "padding_side": "left",
+  "padding_value": 0.0,
+  "processor_class": "Qwen3TTSProcessor",
+  "return_attention_mask": true
+}

speech_tokenizer/config.json ADDED

	@@ -0,0 +1,94 @@

+{
+  "architectures": [
+    "Qwen3TTSTokenizerV2Model"
+  ],
+  "model_type": "qwen3_tts_tokenizer_12hz",
+  "encoder_valid_num_quantizers": 16,
+  "input_sample_rate": 24000,
+  "output_sample_rate": 24000,
+  "decode_upsample_rate": 1920,
+  "encode_downsample_rate": 1920,
+  "decoder_config": {
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "latent_dim": 1024,
+    "codebook_dim": 512,
+    "codebook_size": 2048,
+    "decoder_dim": 1536,
+    "hidden_act": "silu",
+    "hidden_size": 512,
+    "intermediate_size": 1024,
+    "layer_scale_initial_scale": 0.01,
+    "max_position_embeddings": 8000,
+    "head_dim": 64,
+    "num_attention_heads": 16,
+    "num_hidden_layers": 8,
+    "num_key_value_heads": 16,
+    "num_quantizers": 16,
+    "num_semantic_quantizers": 1,
+    "rms_norm_eps": 1e-05,
+    "rope_theta": 10000,
+    "semantic_codebook_size": 4096,
+    "sliding_window": 72,
+    "upsample_rates": [
+      8,
+      5,
+      4,
+      3
+    ],
+    "upsampling_ratios": [
+      2,
+      2
+    ],
+    "vector_quantization_hidden_dimension": 512
+  },
+  "encoder_config": {
+    "_frame_rate": 12.5,
+    "attention_bias": false,
+    "attention_dropout": 0.0,
+    "audio_channels": 1,
+    "codebook_dim": 256,
+    "codebook_size": 2048,
+    "compress": 2,
+    "dilation_growth_rate": 2,
+    "dtype": "float32",
+    "head_dim": 64,
+    "hidden_act": "gelu",
+    "hidden_size": 512,
+    "initializer_range": 0.02,
+    "intermediate_size": 2048,
+    "kernel_size": 7,
+    "last_kernel_size": 3,
+    "layer_scale_initial_scale": 0.01,
+    "max_position_embeddings": 8000,
+    "norm_eps": 1e-05,
+    "normalize": false,
+    "num_attention_heads": 8,
+    "num_filters": 64,
+    "num_hidden_layers": 8,
+    "num_key_value_heads": 8,
+    "num_quantizers": 32,
+    "num_residual_layers": 1,
+    "num_semantic_quantizers": 1,
+    "pad_mode": "constant",
+    "residual_kernel_size": 3,
+    "rope_theta": 10000.0,
+    "sampling_rate": 24000,
+    "sliding_window": 250,
+    "transformers_version": "4.57.0.dev0",
+    "trim_right_ratio": 1.0,
+    "upsample_groups": 512,
+    "upsampling_ratios": [
+      8,
+      6,
+      5,
+      4
+    ],
+    "use_cache": false,
+    "use_causal_conv": true,
+    "use_conv_shortcut": false,
+    "use_streaming": false,
+    "vector_quantization_hidden_dimension": 256
+  },
+  "transformers_version": "4.57.3"
+}
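
Note on the codec rates: the numbers in this config are mutually consistent with a 12.5 Hz frame rate. The sketch below only checks the arithmetic. That the decoder's `upsample_rates` and `upsampling_ratios` multiply together to give `decode_upsample_rate` is an inference from the values, not something the config states.

```python
import math

# Values copied from speech_tokenizer/config.json above.
OUTPUT_SAMPLE_RATE = 24000
DECODE_UPSAMPLE_RATE = 1920
DECODER_UPSAMPLE_RATES = [8, 5, 4, 3]
DECODER_UPSAMPLING_RATIOS = [2, 2]

# Frames per second = audio samples per second / samples per frame.
frame_rate_hz = OUTPUT_SAMPLE_RATE / DECODE_UPSAMPLE_RATE
print(frame_rate_hz)

# Assumption: the total upsampling factor is the product of both lists.
total_upsample = math.prod(DECODER_UPSAMPLE_RATES) * math.prod(DECODER_UPSAMPLING_RATIOS)
print(total_upsample == DECODE_UPSAMPLE_RATE)
```

The resulting 12.5 frames per second agrees with the `_frame_rate` field in `encoder_config`.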

speech_tokenizer/configuration.json ADDED

	@@ -0,0 +1 @@


+{"framework": "pytorch", "task": "feature-extraction", "allow_remote": true}

speech_tokenizer/model.safetensors ADDED

	@@ -0,0 +1,3 @@

+version https://git-lfs.github.com/spec/v1
+oid sha256:836b7b357f5ea43e889936a3709af68dfe3751881acefe4ecf0dbd30ba571258
+size 682293092

speech_tokenizer/preprocessor_config.json ADDED

	@@ -0,0 +1,10 @@

+{
+  "chunk_length_s": null,
+  "feature_extractor_type": "EncodecFeatureExtractor",
+  "feature_size": 1,
+  "overlap": null,
+  "padding_side": "right",
+  "padding_value": 0.0,
+  "return_attention_mask": true,
+  "sampling_rate": 24000
+}
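
Note on the repack's purpose: the README says the upstream checkpoint appears to lack the `speech_tokenizer/` directory. A local snapshot can be checked for the four files this commit adds under that directory. The sketch below is illustrative: `missing_speech_tokenizer_files` is a hypothetical helper, the file list is taken from this commit's "Files changed" list, and whether the `qwen-tts` loader needs exactly these files is not confirmed by the source. A temporary directory stands in for a real snapshot.

```python
import tempfile
from pathlib import Path

# File list taken from this commit's "Files changed" section.
EXPECTED_FILES = [
    "speech_tokenizer/config.json",
    "speech_tokenizer/configuration.json",
    "speech_tokenizer/model.safetensors",
    "speech_tokenizer/preprocessor_config.json",
]


def missing_speech_tokenizer_files(repo_dir):
    """Return the expected relative paths that are absent under repo_dir."""
    root = Path(repo_dir)
    return [rel for rel in EXPECTED_FILES if not (root / rel).is_file()]


with tempfile.TemporaryDirectory() as tmp:
    # An empty directory mimics a checkpoint without the tokenizer assets.
    before = missing_speech_tokenizer_files(tmp)

    # Create empty placeholder files to mimic the repacked layout.
    for rel in EXPECTED_FILES:
        path = Path(tmp) / rel
        path.parent.mkdir(parents=True, exist_ok=True)
        path.touch()
    after = missing_speech_tokenizer_files(tmp)

print(len(before), len(after))
```

Pointing the helper at a real download directory instead of the temporary one gives a quick pre-flight check before loading the model.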

tokenizer_config.json ADDED

	@@ -0,0 +1,316 @@

+{
+  "add_bos_token": false,
+  "add_prefix_space": false,
+  "added_tokens_decoder": {
+    "151643": {
+      "content": "<|endoftext|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151644": {
+      "content": "<|im_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151645": {
+      "content": "<|im_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151646": {
+      "content": "<|object_ref_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151647": {
+      "content": "<|object_ref_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151648": {
+      "content": "<|box_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151649": {
+      "content": "<|box_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151650": {
+      "content": "<|quad_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151651": {
+      "content": "<|quad_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151652": {
+      "content": "<|vision_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151653": {
+      "content": "<|vision_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151654": {
+      "content": "<|vision_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151655": {
+      "content": "<|image_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151656": {
+      "content": "<|video_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151657": {
+      "content": "<tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151658": {
+      "content": "</tool_call>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151659": {
+      "content": "<|fim_prefix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151660": {
+      "content": "<|fim_middle|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151661": {
+      "content": "<|fim_suffix|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151662": {
+      "content": "<|fim_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151663": {
+      "content": "<|repo_name|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151664": {
+      "content": "<|file_sep|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151665": {
+      "content": "<tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151666": {
+      "content": "</tool_response>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151667": {
+      "content": "<think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151668": {
+      "content": "</think>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": false
+    },
+    "151669": {
+      "content": "<|audio_start|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151670": {
+      "content": "<|audio_end|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151671": {
+      "content": "<tts_pad>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151672": {
+      "content": "<tts_text_bos>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151673": {
+      "content": "<tts_text_eod>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151674": {
+      "content": "<tts_text_bos_single>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    },
+    "151675": {
+      "content": "<|audio_pad|>",
+      "lstrip": false,
+      "normalized": false,
+      "rstrip": false,
+      "single_word": false,
+      "special": true
+    }
+  },
+  "additional_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>",
+    "<|audio_start|>",
+    "<|audio_end|>",
+    "<tts_pad>",
+    "<tts_text_bos>",
+    "<tts_text_bos_single>",
+    "<|audio_pad|>"
+  ],
+  "extra_special_tokens": {
+    "image_token": "<|image_pad|>",
+    "audio_token": "<|audio_pad|>",
+    "video_token": "<|video_pad|>",
+    "vision_bos_token": "<|vision_start|>",
+    "vision_eos_token": "<|vision_end|>",
+    "audio_bos_token": "<|audio_start|>",
+    "audio_eos_token": "<|audio_end|>"
+  },
+  "bos_token": null,
+  "clean_up_tokenization_spaces": false,
+  "eos_token": "<|im_end|>",
+  "errors": "replace",
+  "model_max_length": 131072,
+  "pad_token": "<|endoftext|>",
+  "split_special_tokens": false,
+  "tokenizer_class": "Qwen2Tokenizer",
+  "unk_token": null,
+  "image_token": "<|image_pad|>",
+  "audio_token": "<|audio_pad|>",
+  "video_token": "<|video_pad|>",
+  "vision_bos_token": "<|vision_start|>",
+  "vision_eos_token": "<|vision_end|>",
+  "audio_bos_token": "<|audio_start|>",
+  "audio_eos_token": "<|audio_end|>"
+}
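
Note on token IDs: the special-token IDs declared at the top of config.json can be cross-checked against the `added_tokens_decoder` entries above. The sketch below copies the relevant values from both files by hand. `resolve_token_ids` is a hypothetical helper, and the reading of each ID's role (for example that `tts_eos_token_id` marks the end of the text segment) is inferred from the names only.

```python
# Copied from the top level of config.json.
config_ids = {
    "im_start_token_id": 151644,
    "im_end_token_id": 151645,
    "tts_pad_token_id": 151671,
    "tts_bos_token_id": 151672,
    "tts_eos_token_id": 151673,
}

# Copied from added_tokens_decoder in tokenizer_config.json (subset).
added_tokens = {
    151644: "<|im_start|>",
    151645: "<|im_end|>",
    151671: "<tts_pad>",
    151672: "<tts_text_bos>",
    151673: "<tts_text_eod>",
}


def resolve_token_ids(ids, decoder):
    """Map each config field to its token string, or None if the ID is unknown."""
    return {name: decoder.get(token_id) for name, token_id in ids.items()}


resolved = resolve_token_ids(config_ids, added_tokens)
for name, token in resolved.items():
    print(f"{name}: {token}")

unresolved = [name for name, token in resolved.items() if token is None]
print(unresolved or "every config ID has a tokenizer entry")
```

An unresolved ID would indicate that the model config and the tokenizer files come from incompatible revisions.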

vocab.json ADDED

The diff for this file is too large to render. See raw diff