---
license: apache-2.0
base_model: openbmb/VoxCPM2
base_model_relation: quantized
tags:
  - mlx
  - tts
  - text-to-speech
  - voice-cloning
  - voice-design
  - multilingual
  - quantized
  - apple-silicon
library_name: mlx
pipeline_tag: text-to-speech
language:
  - en
  - zh
  - id
  - ja
  - ko
  - multilingual
---
# VoxCPM2 MLX int8 (group-quantized)

8-bit group-quantized MLX weights of VoxCPM2 for Apple Silicon.

MLX port of [openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2) — a 2B-parameter
multilingual diffusion-autoregressive TTS model with **48 kHz** studio-quality output,
voice cloning, and instruction-driven voice design.

Part of [soniqo.audio](https://soniqo.audio) — an on-device speech toolkit for
Apple Silicon. Consumed by the open-source
[`speech-swift`](https://github.com/soniqo/speech-swift) library
(module `VoxCPM2TTS`).

**Bundle size**: 2.95 GB

## Use cases

- [Speech generation](https://soniqo.audio/speech-generation/) — 48 kHz TTS with voice design and multilingual support.
- [Voice cloning](https://soniqo.audio/voice-cloning/) — reference-audio cloning + ultimate cloning (audio + transcript).
- [CLI reference](https://soniqo.audio/cli/) — `speech speak --engine voxcpm2 ...` flags.
- [Getting started](https://soniqo.audio/getting-started/) — install `speech-swift` on macOS / iOS.

## Variants

| Variant | Size | Notes |
|---------|------|-------|
| [bf16](https://huggingface.co/aufklarer/VoxCPM2-MLX-bf16)  | ~5.0 GB | Reference quality, no Linear quantization. |
| [int8](https://huggingface.co/aufklarer/VoxCPM2-MLX-int8) | ~3.0 GB | 8-bit group quantization. Mean rel-L2 0.53 % vs bf16. |
| [int4](https://huggingface.co/aufklarer/VoxCPM2-MLX-int4) | ~1.9 GB | 4-bit group quantization. Mean rel-L2 9.04 % vs bf16. |
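
The table trades bundle size against weight error, so one way to choose a variant is to take the lowest-error bundle that fits a storage budget. The sketch below encodes the table's numbers; `Variant` and `pickVariant` are hypothetical helpers, not part of `speech-swift`, and the sizes are on-disk bundle sizes, so peak runtime memory is likely to be higher.

```swift
/// One row of the variants table.
struct Variant {
    let name: String
    let sizeGB: Double            // approximate bundle size from the table
    let meanRelL2Percent: Double  // mean rel-L2 vs bf16; 0 for bf16 itself
}

let variants = [
    Variant(name: "bf16", sizeGB: 5.0, meanRelL2Percent: 0.0),
    Variant(name: "int8", sizeGB: 3.0, meanRelL2Percent: 0.53),
    Variant(name: "int4", sizeGB: 1.9, meanRelL2Percent: 9.04),
]

/// Returns the lowest-error variant whose bundle fits in `budgetGB`,
/// or nil when even the smallest bundle is too large.
func pickVariant(budgetGB: Double) -> Variant? {
    return variants
        .filter { $0.sizeGB <= budgetGB }
        .min { $0.meanRelL2Percent < $1.meanRelL2Percent }
}

if let choice = pickVariant(budgetGB: 4.0) {
    print("aufklarer/VoxCPM2-MLX-\(choice.name)")  // prints "aufklarer/VoxCPM2-MLX-int8"
}
```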

## Capabilities

- **30 languages** including English, Chinese, Indonesian, Japanese, Korean
- **48 kHz output**
- **Zero-shot synthesis** — generate speech from text alone
- **Voice cloning** — clone a target speaker from a single reference clip
- **Voice design** — natural-language style control (e.g. *"young female voice, warm and gentle"*)
- **Ultimate cloning** — reference audio + transcript for prosody-preserving cloning
- **Streaming generation** — patch-level decoding for low-latency synthesis

## Quantization

- **Format**: MLX `QuantizedLinear`, 8 bits per element, group size 64,
  per-group scales and biases stored as float16.
- **What is quantized**: All `Linear` layers inside the LM backbones
  (`base_lm`, `residual_lm`), the DiT estimator decoder,
  `feat_encoder.encoder`, and the top-level projection heads
  (`enc_to_lm_proj`, `lm_to_dit_proj`, `res_to_dit_proj`,
  `fusion_concat_proj`, `stop_proj`, `stop_head`, `fsq_layer.*`, the
  time/delta-time MLPs).
- **What stays bfloat16**: All `audio_vae.*` weights, RMSNorm /
  LayerNorm gain tensors, RoPE lookup tables, Snake `alpha`,
  embedding tables, and 1-D parameters.
- **Round-trip fidelity vs bf16**: mean relative L2 error
  **0.53 %**, worst-layer relative L2 **0.78 %** (`stop_head`).
- **Size**: about 40 % smaller than the bf16 bundle (2.95 GB vs ~5.0 GB),
  with negligible quality impact in practice.
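
To make the format and the fidelity metric above concrete, here is a minimal sketch of 8-bit affine group quantization with group size 64, followed by a round-trip relative L2 measurement. It assumes a min/max affine scheme (code 0 maps to the group minimum, code 255 to the maximum); this card does not spell out the exact rounding or packing used in the bundle, so treat every name below as illustrative rather than as the converter's code. Scales and biases are kept as `Float` here instead of float16.

```swift
import Foundation

/// One quantized group: integer codes plus the affine parameters that decode them.
struct QuantizedGroup {
    let codes: [UInt8]  // one 8-bit code per weight
    let scale: Float    // step size between adjacent codes
    let bias: Float     // value represented by code 0 (the group minimum)
}

/// Affine-quantizes `weights` in consecutive groups of `groupSize` values.
func quantize(_ weights: [Float], groupSize: Int = 64) -> [QuantizedGroup] {
    precondition(weights.count % groupSize == 0, "length must be a multiple of groupSize")
    return stride(from: 0, to: weights.count, by: groupSize).map { (start: Int) -> QuantizedGroup in
        let group = Array(weights[start..<(start + groupSize)])
        let lo = group.min()!
        let hi = group.max()!
        // 255 steps span [lo, hi]; a constant group uses scale 1 to avoid dividing by zero.
        let scale: Float = hi > lo ? (hi - lo) / 255 : 1
        let codes = group.map { (w: Float) -> UInt8 in
            let level = ((w - lo) / scale).rounded()
            return UInt8(min(255, max(0, level)))
        }
        return QuantizedGroup(codes: codes, scale: scale, bias: lo)
    }
}

/// Reconstructs approximate weights from codes, scales, and biases.
func dequantize(_ groups: [QuantizedGroup]) -> [Float] {
    return groups.flatMap { g in g.codes.map { Float($0) * g.scale + g.bias } }
}

/// Relative L2 error: ||original - restored|| / ||original||.
func relativeL2(_ original: [Float], _ restored: [Float]) -> Float {
    var errSq: Float = 0
    var refSq: Float = 0
    for (a, b) in zip(original, restored) {
        errSq += (a - b) * (a - b)
        refSq += a * a
    }
    return (errSq / refSq).squareRoot()
}

/// Storage cost per weight: the code itself plus the group's scale and bias
/// (each `paramBits` wide) amortized over the group.
func bitsPerWeight(bits: Int, groupSize: Int, paramBits: Int = 16) -> Double {
    return Double(bits) + Double(2 * paramBits) / Double(groupSize)
}

// Synthetic weights, used only to exercise the round trip.
let weights = (0..<256).map { i in Float(sin(Double(i) * 0.37)) * 0.05 }
let restored = dequantize(quantize(weights))
print(relativeL2(weights, restored))
print(bitsPerWeight(bits: 8, groupSize: 64))  // prints "8.5"
```

With a float16 scale and bias per 64 weights, each quantized weight costs 8.5 bits, roughly 47 % less than 16-bit storage. The bundle as a whole shrinks somewhat less, presumably because the tensors listed above stay in bfloat16.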

## Usage with `speech-swift`

Load the bundle through the `VoxCPM2TTS` module of
[soniqo/speech-swift](https://github.com/soniqo/speech-swift):

```swift
import VoxCPM2TTS

let model = try await VoxCPM2TTSModel.fromPretrained(
    modelId: "aufklarer/VoxCPM2-MLX-int8"
)
let audio = try await model.generate(text: "Hello from VoxCPM2.", language: "english")
```

Or via the CLI:

```bash
speech speak "Hello from VoxCPM2." --engine voxcpm2 --voxcpm2-variant int8 -o hi.wav
```
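
If you want to store samples from the Swift API yourself instead of using the CLI's `-o` flag, a 48 kHz buffer can be wrapped in a WAV container by hand. The sketch below is a hypothetical helper, not part of `speech-swift`: it assumes you already hold mono `[Float]` samples in the range [-1, 1], which is not necessarily the type that `generate` returns, and it writes plain 16-bit PCM.

```swift
import Foundation

/// Encodes mono samples in [-1, 1] as a 16-bit PCM WAV file image.
func wavData(samples: [Float], sampleRate: Int = 48_000) -> Data {
    var data = Data()
    // Appends an integer in little-endian byte order, as the WAV format requires.
    func put<T: FixedWidthInteger>(_ value: T) {
        withUnsafeBytes(of: value.littleEndian) { data.append(contentsOf: $0) }
    }
    let dataSize = UInt32(samples.count * 2)  // 2 bytes per 16-bit sample

    data.append(contentsOf: Array("RIFF".utf8))
    put(36 + dataSize)                        // bytes remaining after this field
    data.append(contentsOf: Array("WAVEfmt ".utf8))
    put(UInt32(16))                           // size of the fmt chunk body
    put(UInt16(1))                            // format tag 1 = integer PCM
    put(UInt16(1))                            // channel count (mono)
    put(UInt32(sampleRate))
    put(UInt32(sampleRate * 2))               // byte rate = sample rate * block align
    put(UInt16(2))                            // block align = channels * bytes per sample
    put(UInt16(16))                           // bits per sample
    data.append(contentsOf: Array("data".utf8))
    put(dataSize)
    for s in samples {
        let clipped = max(-1, min(1, s))      // clamp so scaling cannot overflow Int16
        put(Int16((clipped * 32767).rounded()))
    }
    return data
}

// One second of a 440 Hz test tone stands in for model output.
let step = 2 * Double.pi * 440 / 48_000
let tone = (0..<48_000).map { i in Float(sin(step * Double(i))) * 0.2 }
let wav = wavData(samples: tone)
print(wav.count)  // prints "96044": 44 header bytes + 96000 sample bytes
// Persist with: try wav.write(to: URL(fileURLWithPath: "tone.wav"))
```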

## Source

This bundle is converted from the upstream PyTorch weights at
[openbmb/VoxCPM2](https://huggingface.co/openbmb/VoxCPM2).

## License

Apache 2.0 — inherited from the upstream openbmb/VoxCPM2 model.

## Responsible use

Voice cloning capability is included. Users are responsible for obtaining consent
for any voice that is cloned and for not using the model to impersonate individuals
without their permission, generate disinformation, or commit fraud.