---
license: mit
language:
  - en
  - zh
tags:
  - text-to-speech
  - tts
  - speech-synthesis
  - vibevoice
  - bitsandbytes
  - 4bit
  - quantized
library_name: transformers
base_model: vibevoice/VibeVoice-7B
pipeline_tag: text-to-speech
---

# VibeVoice 7B - 4-bit Quantized (bitsandbytes NF4)

This is a **4-bit quantized** version of [VibeVoice 7B](https://huggingface.co/vibevoice/VibeVoice-7B) using bitsandbytes NF4 quantization.

## Model Details

| Property | Value |
|----------|-------|
| Base Model | vibevoice/VibeVoice-7B |
| Quantization | bitsandbytes NF4 (4-bit) |
| VRAM Usage | ~6.2 GB |
| Model Size | ~6.2 GB on disk |
| Sample Rate | 24 kHz |

### VRAM Comparison

| Mode | VRAM | Reduction |
|------|------|-----------|
| Full bfloat16 | ~17 GB | baseline |
| ao-int8 | ~9.4 GB | 45% |
| **bnb-4bit** | **~6.2 GB** | **64%** |
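
The Reduction column follows directly from the VRAM figures in the table. As a quick check, the sketch below recomputes it from the approximate values above; the helper name `reduction_percent` is illustrative and not part of any library.

```python
# Approximate VRAM figures (GB) taken from the comparison table above.
BASELINE_GB = 17.0
MODES_GB = {"ao-int8": 9.4, "bnb-4bit": 6.2}


def reduction_percent(vram_gb, baseline_gb=BASELINE_GB):
    """Return the VRAM saving relative to the baseline, rounded to a whole percent."""
    return round(100 * (1 - vram_gb / baseline_gb))


reductions = {name: reduction_percent(gb) for name, gb in MODES_GB.items()}
for name, pct in reductions.items():
    print(f"{name}: {pct}% less VRAM than bfloat16")
```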

## Quick Start

### Installation

```bash
pip install transformers bitsandbytes torch torchaudio
pip install git+https://github.com/vibevoice-community/VibeVoice.git
```

### Usage

```python
import torch
from transformers import BitsAndBytesConfig
from vibevoice.modular.modeling_vibevoice_inference import VibeVoiceForConditionalGenerationInference
from vibevoice.processor.vibevoice_processor import VibeVoiceProcessor

# Load quantized model
model_id = "marksverdhai/vibevoice-7b-bnb-4bit"

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)

model = VibeVoiceForConditionalGenerationInference.from_pretrained(
    model_id,
    device_map={"": 0},  # Load on GPU 0
    quantization_config=bnb_config,
    torch_dtype=torch.bfloat16,
)
processor = VibeVoiceProcessor.from_pretrained(model_id)

model.eval()
model.set_ddpm_inference_steps(num_steps=10)

# Generate speech
text = "Speaker 1: Hello! This is VibeVoice, a state-of-the-art text-to-speech model."

inputs = processor(
    text=[text],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=None,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        generation_config={"do_sample": False},
        verbose=False,
        is_prefill=False,
    )

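# Get audio as a 1-D tensor; cast to float32 because torchaudio.save
# expects float32 or integer PCM data, and the model runs in bfloat16
audio = outputs.speech_outputs[0].squeeze().float().cpu()
sample_rate = 24000

# Save to file; torchaudio expects shape (channels, samples)
import torchaudio
torchaudio.save("output.wav", audio.unsqueeze(0), sample_rate)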
```

### Voice Cloning

```python
# With voice reference
inputs = processor(
    text=["Speaker 1: Hello, I can clone any voice!"],
    voice_samples=[["path/to/reference.wav"]],
    padding=True,
    return_tensors="pt",
    return_attention_mask=True,
)
inputs = {k: v.to("cuda") for k, v in inputs.items() if torch.is_tensor(v)}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        cfg_scale=1.3,
        tokenizer=processor.tokenizer,
        is_prefill=True,  # Enable voice cloning
    )
```

## Quality Verification

Generated audio was transcribed with Whisper and compared against the input text to check intelligibility:

| Test Sentence | WER |
|---------------|-----|
| "Hello, this is a test." | 0% |
| "The quick brown fox jumps over the lazy dog." | 0% |
| "Good morning, how are you today?" | 0% |
| "Machine learning is transforming technology." | 0% |
| "Please remember to save your work frequently." | 0% |

**All five test sentences achieved 0% word error rate (WER)**, matching the full-precision model on this test set.
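
For readers who want to repeat this kind of check, WER is the word-level edit distance between the reference text and the transcript, divided by the number of reference words. The sketch below is a minimal pure-Python version. The normalization (lowercasing and stripping punctuation) is an assumption made here; the exact normalization used for the table above is not documented. The transcript strings would come from a speech recognizer such as Whisper.

```python
import re


def normalize(text):
    """Lowercase and drop punctuation so 'Hello,' and 'hello' count as the same word."""
    return re.sub(r"[^\w\s']", "", text.lower()).split()


def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = normalize(reference), normalize(hypothesis)
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcript, shown only to illustrate the call
print(word_error_rate("Hello, this is a test.", "hello this is a test"))
```

A transcript that differs only in casing and punctuation scores 0.0 under this normalization; each dropped, inserted, or substituted word adds one error.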

## Quantization Details

This model uses bitsandbytes NF4 quantization:

- **NF4 (NormalFloat4)**: Optimized 4-bit data type for neural network weights
- **Double Quantization**: Nested quantization for additional memory savings
- **Compute dtype**: bfloat16 (weights are dequantized to bfloat16 for matrix multiplications)

Quantization is applied only to the Qwen2 LLM backbone. The following components are kept in full precision:
- Audio tokenizers (semantic and acoustic)
- Diffusion head
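
To make the saving from double quantization concrete, the sketch below estimates the storage overhead of the per-block scaling constants. The block sizes (64 weights per block, 256 scales per second-level block) and bit widths (32-bit scales, re-quantized to 8 bits) are assumptions taken from the QLoRA paper's description of the scheme; they are not values read from this checkpoint, and the function name is illustrative.

```python
def scale_overhead_bits(block_size=64, scale_bits=32, double_quant=False,
                        dq_block_size=256, dq_scale_bits=8):
    """Extra bits per weight spent on quantization constants (assumed QLoRA defaults)."""
    if not double_quant:
        # One full-precision scale per block of weights
        return scale_bits / block_size
    # Scales are themselves quantized to 8 bits, plus one 32-bit
    # second-level constant per dq_block_size first-level scales
    return dq_scale_bits / block_size + scale_bits / (block_size * dq_block_size)


plain = scale_overhead_bits()
nested = scale_overhead_bits(double_quant=True)
print(f"4-bit weights + fp32 scales:        {4 + plain:.3f} bits/weight")
print(f"4-bit weights + double-quantized:   {4 + nested:.3f} bits/weight")
```

Under these assumptions the quantized layers cost about 4.5 bits per weight without double quantization and about 4.13 bits with it, a saving of roughly 0.37 bits per weight.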

## Limitations

- Requires a CUDA GPU with bitsandbytes support
- Inference is slightly slower than full precision (~1.3x)
- Model loading takes longer (~65 s vs ~24 s)

## Citation

```bibtex
@misc{vibevoice2024,
  title={VibeVoice: Emotion-Aware Text-to-Speech},
  author={VibeVoice Team},
  year={2024},
  url={https://github.com/vibevoice-community/VibeVoice}
}
```

## License

MIT