vconnx-bicodec

ONNX artifacts for the bicodec engine of vconnx — a pure-ONNX multi-engine voice-cloning library.

BiCodec (SparkAudio/Spark-TTS, 2025) achieves zero-shot any-to-any voice conversion via explicit factorization of speech into two complementary token streams:

Semantic tokens (content): Wav2Vec2-XLSR-53 (hidden layers 11, 14, 16 averaged) → convolutional encoder → FactorizedVQ → (1, T) int64
Global tokens (speaker): 128-bin Slaney mel spectrogram → ECAPA-TDNN + Perceiver + FSQ → (1, 1, 32) int32 (32 fixed tokens per utterance)
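The layer averaging in the semantic path can be illustrated with a small NumPy sketch. It assumes the encoder exposes its hidden states as an indexable sequence in which indices 11, 14 and 16 select the layers named above (the Hugging Face convention, where index 0 is the embedding output). The function name and the toy shapes are hypothetical, not part of vconnx.

```python
import numpy as np

LAYERS = (11, 14, 16)  # hidden-state indices named in the description above


def average_hidden_layers(hidden_states, layers=LAYERS):
    """Average the selected hidden states; each has shape (batch, frames, dim)."""
    stacked = np.stack([hidden_states[i] for i in layers], axis=0)
    return stacked.mean(axis=0)


# Toy stand-in for per-layer encoder outputs (assumption: 25 entries,
# i.e. the embedding output plus 24 transformer layers, feature size 1024).
rng = np.random.default_rng(0)
hidden_states = [
    rng.standard_normal((1, 49, 1024)).astype(np.float32) for _ in range(25)
]

features = average_hidden_layers(hidden_states)
print(features.shape)  # (1, 49, 1024)
```

In the exported graph this averaging is presumably already baked into `wav2vec2_encoder.onnx`, so callers would not need to repeat it.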

Voice conversion: source semantic tokens + reference global tokens → decoder → waveform. No autoregressive language model is involved; each chunk needs a single forward pass.
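The data flow can be sketched with stub functions standing in for the three ONNX graphs. Everything here is illustrative: `convert_voice`, the stub encoders and decoder, and the figure of 320 samples per semantic token are assumptions made for the example, not the vconnx API. The point is only that content comes from the source and speaker identity from the reference.

```python
import numpy as np

SAMPLES_PER_TOKEN = 320  # assumption for this toy example only


def convert_voice(source_wav, reference_wav, encode_semantic, encode_global, decode):
    """Combine content tokens from the source with speaker tokens from the reference."""
    semantic = encode_semantic(source_wav)    # content: (1, T) int64
    speaker = encode_global(reference_wav)    # speaker: (1, 1, 32) int32
    if semantic.ndim != 2 or semantic.dtype != np.int64:
        raise ValueError("semantic tokens must be (1, T) int64")
    if speaker.shape != (1, 1, 32) or speaker.dtype != np.int32:
        raise ValueError("global tokens must be (1, 1, 32) int32")
    return decode(semantic, speaker)


# Hypothetical stubs that only reproduce the documented shapes and dtypes.
def fake_semantic(wav):
    return np.zeros((1, len(wav) // SAMPLES_PER_TOKEN), dtype=np.int64)


def fake_global(wav):
    return np.zeros((1, 1, 32), dtype=np.int32)


def fake_decode(semantic, speaker):
    return np.zeros(semantic.shape[1] * SAMPLES_PER_TOKEN, dtype=np.float32)


source = np.zeros(16000, dtype=np.float32)     # 1 s of source speech
reference = np.zeros(48000, dtype=np.float32)  # 3 s of reference speech
out = convert_voice(source, reference, fake_semantic, fake_global, fake_decode)
print(out.shape)  # (16000,)
```

The output length follows the source, not the reference, because the reference contributes only the 32 fixed global tokens.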

License

Upstream weights: CC BY-NC-SA 4.0 — non-commercial use only.

These ONNX artifacts are derived from the SparkAudio/Spark-TTS-0.5B weights, which are licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

You may not use these artifacts for commercial purposes.
You must give appropriate credit to SparkAudio.
Derivative works must carry the same CC BY-NC-SA 4.0 license.

Upstream code (SparkAudio/Spark-TTS): Apache-2.0.

Usage

from vconnx import VoiceCloner

cloner = VoiceCloner(engine="bicodec")
out = cloner.clone_voice("source.wav", "reference.wav", "out.wav")
print(cloner.sample_rate)   # 16000

Install: pip install vconnx

Components

File	Description	Size (fp32)	Size (INT8)
`wav2vec2_encoder.onnx`	Wav2Vec2-XLSR-53 encoder (layers 11/14/16 avg)	819 MB	206 MB
`semantic_encoder.onnx`	Conv encoder + FactorizedVQ → semantic tokens	116 MB	30 MB
`global_encoder.onnx`	ECAPA-TDNN + Perceiver + FSQ → global tokens	22 MB	6 MB
`decoder.onnx`	Token decoder → waveform	368 MB	158 MB
`mel_filterbank.npy`	128-bin Slaney mel filterbank (numpy, 128×513)	256 KB	—
`mel_config.json`	STFT parameters (n_fft=1024, hop=320, win=640)	~1 KB	—

Output sample rate: 16 kHz.
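The STFT parameters and filterbank shape listed in the table are enough to sketch the mel front end in plain NumPy. This is a minimal sketch under stated assumptions, not the library's implementation: it assumes a periodic Hann window zero-padded to `n_fft`, centered frames with reflect padding, and a magnitude (power 1) spectrogram. The function name is hypothetical, and any log or normalization step is omitted.

```python
import numpy as np

N_FFT, HOP, WIN = 1024, 320, 640  # values listed for mel_config.json


def mel_spectrogram(wav, filterbank, power=1.0):
    """wav: (n_samples,) float array; filterbank: (n_mels, N_FFT // 2 + 1)."""
    # Periodic Hann window of length WIN, centered inside an N_FFT-long frame.
    window = np.zeros(N_FFT)
    left = (N_FFT - WIN) // 2
    window[left:left + WIN] = np.hanning(WIN + 1)[:-1]

    # Assumption: centered framing with reflect padding of N_FFT // 2 per side.
    padded = np.pad(wav, N_FFT // 2, mode="reflect")
    n_frames = 1 + (len(padded) - N_FFT) // HOP
    idx = np.arange(N_FFT)[None, :] + HOP * np.arange(n_frames)[:, None]
    frames = padded[idx] * window                        # (n_frames, N_FFT)

    spec = np.abs(np.fft.rfft(frames, axis=1)) ** power  # (n_frames, 513)
    return filterbank @ spec.T                           # (n_mels, n_frames)


# Toy filterbank with the documented shape; the real one would come from
# np.load("mel_filterbank.npy").
toy_filterbank = np.ones((128, N_FFT // 2 + 1))
mel = mel_spectrogram(np.zeros(16000), toy_filterbank)
print(mel.shape)  # (128, 51)
```

With a hop of 320 samples at 16 kHz, frames are 20 ms apart, so one second of audio gives 51 frames under the centering assumption.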

Parity vs PyTorch

Component	Metric	Value	Result
wav2vec2_encoder	max_abs	6.71e-4	PASS
semantic_encoder	exact int match	True	PASS
global_encoder	exact int match	True	PASS
mel numpy vs torchaudio	max_abs	0.00e+0	PASS
decoder	max_abs	2.53e-6	PASS
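The two metrics in the table can be computed as in the sketch below. The helper names are hypothetical, and no pass threshold is hard-coded because the tolerance used for the PASS column is not stated here.

```python
import numpy as np


def max_abs(reference, candidate):
    """Largest absolute element-wise difference between two float outputs."""
    reference = np.asarray(reference, dtype=np.float64)
    candidate = np.asarray(candidate, dtype=np.float64)
    return float(np.max(np.abs(reference - candidate)))


def exact_int_match(reference, candidate):
    """True only if both token arrays have the same shape and identical values."""
    return bool(np.array_equal(np.asarray(reference), np.asarray(candidate)))


# Toy values standing in for PyTorch and ONNX outputs.
torch_out = np.array([0.10, -0.25, 0.50])
onnx_out = torch_out + np.array([1e-6, -2e-6, 0.0])
print(f"{max_abs(torch_out, onnx_out):.2e}")

torch_tokens = np.array([[3, 17, 42]], dtype=np.int64)
onnx_tokens = np.array([[3, 17, 42]], dtype=np.int64)
print(exact_int_match(torch_tokens, onnx_tokens))  # True
```

Exact matching suits the two token encoders because a single flipped integer index changes which codebook entry the decoder sees, while float outputs such as waveforms are compared by maximum absolute error.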

References

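Spark-TTS paper: "Spark-TTS: An Efficient LLM-Based Text-to-Speech Model with Single-Stream Decoupled Speech Tokens", arXiv:2503.01710 (published Mar 3, 2025)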
SparkAudio/Spark-TTS
SparkAudio/Spark-TTS-0.5B (upstream weights)
vconnx (inference library)

