vconnx-bicodec
ONNX artifacts for the bicodec engine of vconnx —
a pure-ONNX multi-engine voice-cloning library.
BiCodec (SparkAudio/Spark-TTS, 2025) achieves zero-shot any-to-any voice conversion via explicit factorization of speech into two complementary token streams:
- Semantic tokens (content): Wav2Vec2-XLSR-53 (hidden layers 11, 14, 16 averaged) → convolutional encoder → FactorizedVQ → (1, T) int64
- Global tokens (speaker): 128-bin Slaney mel spectrogram → ECAPA-TDNN + Perceiver + FSQ → (1, 1, 32) int32 (32 fixed tokens per utterance)
Voice conversion: source semantic tokens + reference global tokens → decoder → waveform. No auto-regressive language model; single forward pass per chunk.
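The data flow above can be sketched as plain function composition. This is a minimal sketch, not the vconnx implementation: the `Stages` container and its field names are hypothetical, and each field stands in for one exported component. Only the token shapes and dtypes are taken from the description above.

```python
from dataclasses import dataclass
from typing import Callable

import numpy as np


@dataclass
class Stages:
    # Hypothetical container: each callable stands in for one component
    # (an ONNX graph, or the numpy mel front end).
    wav2vec2: Callable[[np.ndarray], np.ndarray]
    semantic_encoder: Callable[[np.ndarray], np.ndarray]
    mel: Callable[[np.ndarray], np.ndarray]
    global_encoder: Callable[[np.ndarray], np.ndarray]
    decoder: Callable[[np.ndarray, np.ndarray], np.ndarray]


def convert(source_wave: np.ndarray, reference_wave: np.ndarray, stages: Stages) -> np.ndarray:
    # Content path: source audio -> Wav2Vec2 features -> semantic tokens (1, T) int64.
    features = stages.wav2vec2(source_wave)
    semantic = stages.semantic_encoder(features)
    assert semantic.ndim == 2 and semantic.dtype == np.int64

    # Speaker path: reference audio -> mel -> global tokens (1, 1, 32) int32.
    global_tokens = stages.global_encoder(stages.mel(reference_wave))
    assert global_tokens.shape == (1, 1, 32) and global_tokens.dtype == np.int32

    # One decoder call combines content from the source with the reference speaker.
    return stages.decoder(semantic, global_tokens)
```

Because the speaker is carried entirely by the 32 global tokens, they could in principle be computed once per reference and reused across many source utterances.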
License
Upstream weights: CC BY-NC-SA 4.0 — non-commercial use only.
These ONNX artifacts are derived from the SparkAudio/Spark-TTS-0.5B weights, which are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.
- You may not use these artifacts for commercial purposes.
- You must give appropriate credit to SparkAudio.
- Derivative works must carry the same CC BY-NC-SA 4.0 license.
Upstream code (SparkAudio/Spark-TTS): Apache-2.0.
Usage
from vconnx import VoiceCloner
cloner = VoiceCloner(engine="bicodec")
out = cloner.clone_voice("source.wav", "reference.wav", "out.wav")
print(cloner.sample_rate) # 16000
Install: pip install vconnx
Components
| File | Description | Size (fp32) | Size (INT8) |
|---|---|---|---|
| `wav2vec2_encoder.onnx` | Wav2Vec2-XLSR-53 encoder (layers 11/14/16 avg) | 819 MB | 206 MB |
| `semantic_encoder.onnx` | Conv encoder + FactorizedVQ → semantic tokens | 116 MB | 30 MB |
| `global_encoder.onnx` | ECAPA-TDNN + Perceiver + FSQ → global tokens | 22 MB | 6 MB |
| `decoder.onnx` | Token decoder → waveform | 368 MB | 158 MB |
| `mel_filterbank.npy` | 128-bin Slaney mel filterbank (numpy, 128×513) | 256 KB | — |
| `mel_config.json` | STFT parameters (n_fft=1024, hop=320, win=640) | ~1 KB | — |
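For planning downloads, the ONNX sizes in the table can be totalled with a few lines of arithmetic. The numbers below are copied from the table; the two small mel files are left out because they add well under 1 MB.

```python
# Sizes in MB, copied from the Components table: (fp32, INT8).
SIZES_MB = {
    "wav2vec2_encoder.onnx": (819, 206),
    "semantic_encoder.onnx": (116, 30),
    "global_encoder.onnx": (22, 6),
    "decoder.onnx": (368, 158),
}

total_fp32 = sum(fp32 for fp32, _ in SIZES_MB.values())
total_int8 = sum(int8 for _, int8 in SIZES_MB.values())

# Per-file size reduction factor when going from fp32 to INT8.
ratios = {name: fp32 / int8 for name, (fp32, int8) in SIZES_MB.items()}

print(total_fp32, total_int8)  # 1325 400
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x smaller")
```

The full fp32 set is about 1.3 GB and the INT8 set about 400 MB. The decoder shrinks the least of the four graphs (roughly 2.3x, against 3.7x to 4x for the others).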
Output sample rate: 16 kHz.
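The mel front end implied by `mel_filterbank.npy` and `mel_config.json` can be sketched in numpy. This is a minimal sketch under stated assumptions, not the library's code: the Hann window, the centered zero-padding of the 640-sample window to 1024, the reflect padding of the signal, and the use of magnitude without a log are all assumptions. Only the STFT parameters and the 128×513 filterbank shape come from the table.

```python
import numpy as np

N_FFT, HOP, WIN = 1024, 320, 640  # values listed for mel_config.json


def mel_spectrogram(wave: np.ndarray, filterbank: np.ndarray) -> np.ndarray:
    """wave: 1-D float array at 16 kHz; filterbank: (128, 513) array."""
    # Assumption: periodic Hann window of WIN samples, centered inside N_FFT.
    window = np.hanning(WIN + 1)[:-1]
    left = (N_FFT - WIN) // 2
    window = np.pad(window, (left, N_FFT - WIN - left))

    # Assumption: reflect-pad by N_FFT // 2 so each frame is centered on its hop.
    padded = np.pad(wave, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(padded) - N_FFT) // HOP

    # Gather overlapping frames with index arithmetic, then apply the window.
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = padded[idx] * window

    # Assumption: magnitude spectrum; rfft of 1024 points gives 513 bins.
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, 513)
    return filterbank @ spec.T                  # (128, n_frames)
```

With a hop of 320 samples at 16 kHz the frame rate is 50 frames per second, so under these padding assumptions one second of audio yields 51 frames.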
Parity vs PyTorch
| Component | Metric | Value | Result |
|---|---|---|---|
| wav2vec2_encoder | max_abs | 6.71e-4 | PASS |
| semantic_encoder | exact int match | True | PASS |
| global_encoder | exact int match | True | PASS |
| mel numpy vs torchaudio | max_abs | 0.00e+0 | PASS |
| decoder | max_abs | 2.53e-6 | PASS |
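The two metrics in the table are simple to reproduce when comparing ONNX outputs with PyTorch outputs. The helpers below are a sketch with hypothetical function names. The table does not state pass thresholds, so none are assumed here.

```python
import numpy as np


def max_abs(reference, candidate) -> float:
    """Largest element-wise absolute difference between two float outputs."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    return float(np.max(np.abs(ref - cand)))


def exact_int_match(reference, candidate) -> bool:
    """True when two token arrays have the same shape and identical values."""
    # np.array_equal compares values, so int32 vs int64 ids can still match.
    return bool(np.array_equal(np.asarray(reference), np.asarray(candidate)))


print(max_abs([1.0, 2.0], [1.0, 2.5]))    # 0.5
print(exact_int_match([3, 1, 4], [3, 1, 4]))  # True
```

Exact matching suits the token encoders because their outputs are discrete ids, while the continuous outputs of the Wav2Vec2 encoder and the decoder are compared by maximum absolute error.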
References
- Spark-TTS paper
- SparkAudio/Spark-TTS
- SparkAudio/Spark-TTS-0.5B (upstream weights)
- vconnx (inference library)