---
license: mit
language:
- en
- zh
tags:
- text-to-speech
- tts
- speech-synthesis
- vibevoice
- bitsandbytes
- 4bit
- quantized
library_name: transformers
base_model: vibevoice/VibeVoice-7B
pipeline_tag: text-to-speech
---

# VibeVoice 7B - 4-bit Quantized (bitsandbytes NF4)

This is a **4-bit quantized** version of [VibeVoice 7B](https://huggingface.co/vibevoice/VibeVoice-7B) using bitsandbytes NF4 quantization.

## Model Details

| Property | Value |
|----------|-------|
| Base Model | vibevoice/VibeVoice-7B |
| Quantization | bitsandbytes NF4 (4-bit) |
| VRAM Usage | ~6.2 GB |
| Model Size | ~6.2 GB on disk |
| Sample Rate | 24 kHz |

### VRAM Comparison

| Mode | VRAM | Reduction |
|------|------|-----------|
| Full bfloat16 | ~17 GB | baseline |
| ao-int8 | ~9.4 GB | 45% |
| **bnb-4bit** | **~6.2 GB** | **64%** |

## Quick Start

### Installation

```bash
pip install transformers bitsandbytes torch torchaudio
pip install git+https://github.com/vibevoice-community/VibeVoice.git
```

### Usage

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load quantized model
model_id = "marksverdhai/vibevoice-7b-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_id,
    device_map={"": 0},  # Load on GPU 0
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained(model_id)

model.eval()
model.set_ddpm_inference_steps(num_steps=10)

# Generate speech
text = "Speaker 1: Hello! This is VibeVoice, a state-of-the-art text-to-speech model."
inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=None,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={"do_sample": False},
        verbose=False,
        is_prefill=False,
    )

# Get audio
audio = outputs.speech_outputs[0].squeeze().cpu()
sample_rate = 24000

# Save to file
import torchaudio
torchaudio.save("output.wav", audio.unsqueeze(0), sample_rate)
```

### Voice Cloning

```python
# With voice reference
inputs = processor(
    text=["Speaker 1: Hello, I can clone any voice!"],
    voice_samples=[["path/to/reference.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        is_prefill=True,  # Enable voice cloning
    )
```

## Quality Verification

This model was tested with Whisper transcription to verify output quality:

| Test Sentence | WER |
|---------------|-----|
| "Hello, this is a test." | 0% |
| "The quick brown fox jumps over the lazy dog." | 0% |
| "Good morning, how are you today?" | 0% |
| "Machine learning is transforming technology." | 0% |
| "Please remember to save your work frequently." | 0% |

**All test sentences achieved 0% Word Error Rate**, matching the full-precision model quality.
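
The table above reports WER but not how it was computed. As a rough illustration only, WER is commonly defined as the word-level edit distance between the reference text and the transcript, divided by the number of reference words. The sketch below assumes a simple normalization (lowercasing and stripping punctuation), which is not necessarily what was used for this model card. The helper names `normalize` and `word_error_rate` are hypothetical, and the Whisper transcription step itself is not shown.

```python
import re


def normalize(text: str) -> list[str]:
    """Lowercase, drop punctuation, split on whitespace (assumed normalization)."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the processed reference
    # prefix and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # The transcript string would come from an ASR model such as Whisper.
    print(word_error_rate("Hello, this is a test.", "hello this is a test"))
```

With this normalization, a transcript that differs from the reference only in case or punctuation scores 0%, so a 0% result says the words match, not that prosody or audio quality is identical.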

## Quantization Details

This model uses bitsandbytes NF4 quantization:

- **NF4 (NormalFloat4)**: Optimized 4-bit data type for neural network weights
- **Double Quantization**: Nested quantization for additional memory savings
- **Compute dtype**: bfloat16 for computations

The quantization is applied to the Qwen2 LLM backbone while preserving full precision for:

- Audio tokenizers (semantic and acoustic)
- Diffusion head

## Limitations

- Requires CUDA GPU with bitsandbytes support
- Slightly slower inference than full precision (~1.3x)
- Longer model load time (~65s vs ~24s)

## Citation

```bibtex
@misc{vibevoice2024,
  title={VibeVoice: Emotion-Aware Text-to-Speech},
  author={VibeVoice Team},
  year={2024},
  url={https://github.com/vibevoice-community/VibeVoice}
}
```

## License

MIT