vconnx-bicodec

ONNX artifacts for the bicodec engine of vconnx — a pure-ONNX multi-engine voice-cloning library.

BiCodec (SparkAudio/Spark-TTS, 2025) achieves zero-shot any-to-any voice conversion via explicit factorization of speech into two complementary token streams:

  • Semantic tokens (content): Wav2Vec2-XLSR-53 (hidden layers 11, 14, 16 averaged) → convolutional encoder → FactorizedVQ → (1, T) int64
  • Global tokens (speaker): 128-bin Slaney mel spectrogram → ECAPA-TDNN + Perceiver + FSQ → (1, 1, 32) int32 (32 fixed tokens per utterance)
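
As an illustration of the "hidden layers 11, 14, 16 averaged" step in the semantic branch, here is a minimal NumPy sketch. It assumes the layer indices follow the Hugging Face `output_hidden_states` convention (index 0 is the embedding output), which this README does not confirm; the function name `average_hidden_layers` is hypothetical. According to the Components table, the exported wav2vec2_encoder.onnx already performs this averaging, so the sketch only shows what the operation amounts to.

```python
import numpy as np

# Layer indices named in the description above.
# Assumption: index 0 is the embedding output, as in Hugging Face hidden_states.
FEATURE_LAYERS = (11, 14, 16)


def average_hidden_layers(hidden_states, layers=FEATURE_LAYERS):
    """Average selected per-layer features.

    hidden_states: sequence of (T, D) arrays, one per layer.
    Returns a single (T, D) array: the element-wise mean over `layers`.
    """
    selected = np.stack([hidden_states[i] for i in layers], axis=0)  # (3, T, D)
    return selected.mean(axis=0)


# Toy input: 25 "layers" of shape (5, 4), layer i filled with the value i.
toy_states = [np.full((5, 4), float(i)) for i in range(25)]
features = average_hidden_layers(toy_states)
print(features.shape)
```

The averaged features are what the convolutional encoder and FactorizedVQ would then turn into semantic tokens.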

Voice conversion: source semantic tokens + reference global tokens → decoder → waveform. There is no autoregressive language model; each chunk needs a single forward pass.

License

Upstream weights: CC BY-NC-SA 4.0 — non-commercial use only.

These ONNX artifacts are derived from the SparkAudio/Spark-TTS-0.5B weights, which are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

  • You may not use these artifacts for commercial purposes.
  • You must give appropriate credit to SparkAudio.
  • Derivative works must carry the same CC BY-NC-SA 4.0 license.

Upstream code (SparkAudio/Spark-TTS): Apache-2.0.

Usage

from vconnx import VoiceCloner

cloner = VoiceCloner(engine="bicodec")
out = cloner.clone_voice("source.wav", "reference.wav", "out.wav")
print(cloner.sample_rate)   # 16000

Install: pip install vconnx

Components

| File | Description | Size (fp32) | Size (INT8) |
|---|---|---|---|
| wav2vec2_encoder.onnx | Wav2Vec2-XLSR-53 encoder (layers 11/14/16 avg) | 819 MB | 206 MB |
| semantic_encoder.onnx | Conv encoder + FactorizedVQ → semantic tokens | 116 MB | 30 MB |
| global_encoder.onnx | ECAPA-TDNN + Perceiver + FSQ → global tokens | 22 MB | 6 MB |
| decoder.onnx | Token decoder → waveform | 368 MB | 158 MB |
| mel_filterbank.npy | 128-bin Slaney mel filterbank (numpy, 128×513) | 256 KB | n/a |
| mel_config.json | STFT parameters (n_fft=1024, hop=320, win=640) | ~1 KB | n/a |

Output sample rate: 16 kHz.
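
To show how the STFT parameters and the 128×513 filterbank fit together, here is a rough NumPy sketch of a mel spectrogram. It is not vconnx's implementation: the centered reflect padding, the periodic Hann window, and the use of magnitude rather than power or log scaling are all assumptions, and `mel_spectrogram` is a hypothetical name. Only n_fft=1024, hop=320, win=640 and the filterbank shape come from the table above. In real use the filterbank would be loaded with `np.load("mel_filterbank.npy")`; a random matrix stands in for it here.

```python
import numpy as np

# STFT parameters as listed for mel_config.json.
N_FFT, HOP, WIN, N_MELS = 1024, 320, 640, 128


def mel_spectrogram(wav, filterbank):
    """Map a mono waveform (n_samples,) to a mel spectrogram (N_MELS, n_frames)."""
    assert filterbank.shape == (N_MELS, N_FFT // 2 + 1)  # 128 x 513
    # Assumption: centered frames via reflect padding (torch.stft-style).
    padded = np.pad(wav, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    # Assumption: periodic Hann window of length WIN, centered inside N_FFT.
    window = np.zeros(N_FFT)
    start = (N_FFT - WIN) // 2
    window[start:start + WIN] = np.hanning(WIN + 1)[:-1]
    # Gather overlapping frames: row k covers samples [k*HOP, k*HOP + N_FFT).
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = padded[idx] * window
    spec = np.abs(np.fft.rfft(frames, axis=1))  # (n_frames, 513) magnitudes
    return filterbank @ spec.T                  # (128, n_frames)


rng = np.random.default_rng(0)
wav = rng.standard_normal(16000)                           # 1 s at 16 kHz
stand_in_fb = np.abs(rng.standard_normal((N_MELS, 513)))   # placeholder filterbank
mel = mel_spectrogram(wav, stand_in_fb)
print(mel.shape)
```

With a hop of 320 samples at 16 kHz, each frame advances by 20 ms, so one second of audio gives roughly 50 frames.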

Parity vs PyTorch

| Component | Metric | Value | Result |
|---|---|---|---|
| wav2vec2_encoder | max_abs | 6.71e-4 | PASS |
| semantic_encoder | exact int match | True | PASS |
| global_encoder | exact int match | True | PASS |
| mel (numpy vs torchaudio) | max_abs | 0.00e+0 | PASS |
| decoder | max_abs | 2.53e-6 | PASS |
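
The two metrics in the parity table can be computed along these lines. This is a sketch, not the project's test harness: the helper names `max_abs` and `exact_int_match` are hypothetical, and the README does not state the tolerance behind each PASS, so none is hard-coded here.

```python
import numpy as np


def max_abs(reference, candidate):
    """Largest absolute element-wise difference between two float outputs."""
    ref = np.asarray(reference, dtype=np.float64)
    cand = np.asarray(candidate, dtype=np.float64)
    if ref.shape != cand.shape:
        raise ValueError(f"shape mismatch: {ref.shape} vs {cand.shape}")
    return float(np.max(np.abs(ref - cand)))


def exact_int_match(reference, candidate):
    """True when two token arrays have the same shape and identical values."""
    return bool(np.array_equal(np.asarray(reference), np.asarray(candidate)))


# Toy stand-ins for PyTorch (reference) and ONNX (candidate) outputs.
torch_wave = np.array([0.10, -0.20, 0.30])
onnx_wave = np.array([0.10, -0.20, 0.30 + 2e-6])
print(f"{max_abs(torch_wave, onnx_wave):.2e}")

torch_tokens = np.array([[3, 7, 7, 1]], dtype=np.int64)
onnx_tokens = np.array([[3, 7, 7, 1]], dtype=np.int64)
print(exact_int_match(torch_tokens, onnx_tokens))
```

Float outputs such as encoder features and waveforms are compared by maximum absolute error, while quantized token outputs are compared for exact equality, since a single differing token index selects a different codebook entry.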
