CosyVoice3 WolneLektury v7 Fine-Tune

This repository contains an inference checkpoint for a Polish CosyVoice3 fine-tune trained on audiobook-style Polish speech from Wolne Lektury data.

Important: LoRA Adapter, Not A Full Model

This is a LoRA adapter/checkpoint for the base CosyVoice3 model. It is not a full standalone model and does not include the 9 GB base model weights.

To run inference, you need all of the following:

  • The base model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
  • The original CosyVoice code or a compatible loader.
  • The files in this repository under step_007500/.
  • A loader that applies the LoRA weights and then loads the fine-tuned speech_embedding and llm_decoder modules.

The included examples/mac_tts_server.py is one such loader.

Prerequisites

  • Python 3.10 or newer is recommended.
  • ffmpeg must be installed on your system (e.g., brew install ffmpeg on macOS).
  • macOS with Apple Silicon (M1/M2/M3) is the primary target for the provided requirements, though it may work on other systems with adjusted dependencies.
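
The prerequisites above can be checked before installing anything. The following is a minimal sketch, not part of this repository: the helper name check_prerequisites is hypothetical, and it only checks the Python version and whether an ffmpeg executable is on PATH.

```python
import shutil
import sys


def check_prerequisites(min_python=(3, 10)):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    # Compare only (major, minor) against the recommended minimum.
    if sys.version_info[:2] < tuple(min_python):
        problems.append(
            f"Python {min_python[0]}.{min_python[1]}+ recommended, "
            f"found {sys.version_info.major}.{sys.version_info.minor}"
        )
    # shutil.which returns None when the executable is not on PATH.
    if shutil.which("ffmpeg") is None:
        problems.append("ffmpeg not found on PATH")
    return problems


if __name__ == "__main__":
    for problem in check_prerequisites():
        print("WARNING:", problem)
```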

What Is Included

step_007500/
  config.json
  lora_weights.pt
  speech_embedding.pt
  llm_decoder.pt
examples/
  mac_tts_server.py
requirements-macos-api.txt
checksums.sha256

The checkpoint contains:

  • LoRA weights for the CosyVoice3 LLM module.
  • A fine-tuned speech_embedding module.
  • A fine-tuned llm_decoder module.
  • Training/evaluation metadata in config.json.

The package intentionally does not include optimizer.pt or scheduler.pt, because those are only needed to resume training and are not required for inference.
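
Before starting a loader, it can help to confirm that the four inference files listed above are present. This is a small sketch; the function name missing_checkpoint_files is hypothetical and the check only looks at file presence, not content.

```python
from pathlib import Path

# File names taken from the step_007500/ listing in this README.
REQUIRED_FILES = (
    "config.json",
    "lora_weights.pt",
    "speech_embedding.pt",
    "llm_decoder.pt",
)


def missing_checkpoint_files(checkpoint_dir):
    """Return the required file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_checkpoint_files("cosyvoice3-wolnelektury-v7/step_007500")
    print("missing files:", missing or "none")
```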

Base Model And License Notes

The base model, FunAudioLLM/Fun-CosyVoice3-0.5B-2512, is published by FunAudioLLM under the Apache-2.0 license on Hugging Face.

The fine-tune data came from Polish Wolne Lektury audiobook material. Wolne Lektury materials are distributed under free licenses, commonly CC BY-SA 3.0 or the Free Art License, with attribution requirements. Treat this repository as: base model Apache-2.0, fine-tune trained on CC BY-SA-style Polish audiobook data.

Training Summary

The preserved checkpoint is v7, step 007500, timestamped 2026-05-16 07:11:32.

Training metadata from step_007500/config.json:

Field                    Value
step                     7500
loss                     3.7873
acc                      0.2557
validation split         0.05
batch size               4
gradient accumulation    8
max epochs               20
max speech length        750
bf16                     true
gradient checkpointing   true
label smoothing          0.1
gradient clip            1.0
LoRA rank                32
LoRA alpha               64
LoRA dropout             0.1
LoRA LR                  0.0001
speech embedding LR      0.0002
decoder LR               0.0002
warmup steps             300
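
Two derived quantities follow from the values above. The short calculation below is a sketch under two assumptions that the config does not state: that the batch size is per device on a single device, and that the adapter uses the conventional LoRA scaling factor alpha / rank.

```python
# Values copied from the training metadata above.
batch_size = 4
gradient_accumulation = 8
lora_rank = 32
lora_alpha = 64

# Samples contributing to each optimizer update (single device assumed).
effective_batch_size = batch_size * gradient_accumulation

# Conventional LoRA scaling applied to the low-rank update (assumed).
lora_scaling = lora_alpha / lora_rank

print(effective_batch_size)  # 32
print(lora_scaling)  # 2.0
```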

Evaluation settings preserved in the config:

  • Primary evaluation mode: cross_lingual_official
  • Evaluation sample count: 250
  • Evaluation seed: 20260515
  • Whisper model size used for evaluation: large-v3-turbo
  • Probe modes: dataset_eop_prompt, dataset_cached_spk, official_cached_spk
  • Probe sample count: 8

The prompt pattern used for character error rate (CER) evaluation was:

You are a helpful assistant.<|endofprompt|>{text}
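
Character error rate is conventionally the character-level edit distance between a reference text and a transcript, divided by the reference length. The sketch below shows that definition only. The text normalization (case, punctuation, whitespace) applied by the actual evaluation is not documented here, so none is applied; the function names are illustrative.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two strings, computed row by row."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (ref_char != hyp_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # The hypothesis would typically be an ASR transcript of the synthesized audio.
    print(cer("To jest test.", "To jest tekst."))
```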

How The Adapter Is Loaded

The included examples/mac_tts_server.py applies the checkpoint this way:

  1. Load the base CosyVoice3 model from pretrained_models/Fun-CosyVoice3-0.5B.
  2. Build a PEFT LoRA config from step_007500/config.json.
  3. Load lora_weights.pt into the LLM and merge it with merge_and_unload().
  4. Strict-load speech_embedding.pt.
  5. Strict-load llm_decoder.pt.

The expected LoRA target modules are:

q_proj, k_proj, v_proj, o_proj, down_proj
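
The five steps above can be sketched roughly as follows. This is not the code from examples/mac_tts_server.py: the config key names (lora_rank, lora_alpha, lora_dropout), the helper names, and the assumption that lora_weights.pt holds a state dict whose keys match a PEFT-wrapped LLM are all hypothetical. The three modules are passed in by the caller because their attribute paths inside the CosyVoice model object are not described here.

```python
import json
from pathlib import Path

# Target modules as listed in this README.
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "down_proj"]


def lora_kwargs(cfg):
    """Map config values to peft.LoraConfig keyword arguments.

    Key names are assumed; defaults mirror the training table.
    """
    return {
        "r": cfg.get("lora_rank", 32),
        "lora_alpha": cfg.get("lora_alpha", 64),
        "lora_dropout": cfg.get("lora_dropout", 0.1),
        "target_modules": list(TARGET_MODULES),
    }


def apply_checkpoint(llm, speech_embedding, llm_decoder, checkpoint_dir):
    """Apply LoRA to llm, merge it, then strict-load the two fine-tuned modules."""
    # Imported lazily so the helper above is usable without torch/peft installed.
    import torch
    from peft import LoraConfig, get_peft_model

    root = Path(checkpoint_dir)
    cfg = json.loads((root / "config.json").read_text())

    peft_llm = get_peft_model(llm, LoraConfig(**lora_kwargs(cfg)))
    lora_state = torch.load(root / "lora_weights.pt", map_location="cpu")
    # strict=False: the file is assumed to contain only the LoRA tensors.
    # Keys may need remapping if they were saved under different prefixes.
    peft_llm.load_state_dict(lora_state, strict=False)
    merged_llm = peft_llm.merge_and_unload()

    speech_embedding.load_state_dict(
        torch.load(root / "speech_embedding.pt", map_location="cpu"), strict=True
    )
    llm_decoder.load_state_dict(
        torch.load(root / "llm_decoder.pt", map_location="cpu"), strict=True
    )
    return merged_llm
```

Because strict=False silently ignores unmatched keys, it is worth inspecting the value returned by load_state_dict in a real loader to confirm that the LoRA tensors were actually applied.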

Setup

Clone CosyVoice and install dependencies:

git clone https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git submodule update --init --recursive
python -m venv .venv
source .venv/bin/activate
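# requirements-macos-api.txt ships with this fine-tune repo; download the repo first (see below)
pip install -r ../cosyvoice3-wolnelektury-v7/requirements-macos-api.txt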

Download the base model:

from modelscope import snapshot_download

snapshot_download(
    "FunAudioLLM/Fun-CosyVoice3-0.5B-2512",
    local_dir="pretrained_models/Fun-CosyVoice3-0.5B",
)

Download this fine-tune repo next to the CosyVoice checkout, for example:

cd ..
huggingface-cli download Stanslab/cosyvoice3-wolnelektury-v7 \
  --local-dir cosyvoice3-wolnelektury-v7

Run Inference Server

From inside the CosyVoice checkout:

# Option 1: Provide a default reference audio file at startup
python ../cosyvoice3-wolnelektury-v7/examples/mac_tts_server.py \
  --model-dir pretrained_models/Fun-CosyVoice3-0.5B \
  --checkpoint-dir ../cosyvoice3-wolnelektury-v7/step_007500 \
  --prompt-wav path/to/your/reference_audio.wav \
  --host 127.0.0.1 \
  --port 5055

# Option 2: Start without a default file (you must upload a WAV in the API call)
python ../cosyvoice3-wolnelektury-v7/examples/mac_tts_server.py \
  --model-dir pretrained_models/Fun-CosyVoice3-0.5B \
  --checkpoint-dir ../cosyvoice3-wolnelektury-v7/step_007500 \
  --port 5055

Note on --prompt-wav: This model requires a short reference audio clip (a WAV file with a few seconds of clean Polish speech) for cross-lingual or zero-shot synthesis. Provide a default file at startup, or upload one with each API request (see below).

Health check:

curl http://127.0.0.1:5055/health
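
Model loading can take a while, so a script may want to poll the health endpoint before sending requests. The sketch below uses only the standard library and treats any HTTP 200 response as healthy; the response body format of /health is not documented here and is not inspected. The function name wait_for_health is illustrative.

```python
import time
import urllib.request


def wait_for_health(url="http://127.0.0.1:5055/health", timeout_s=60.0, interval_s=1.0):
    """Poll url until it answers with HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=interval_s) as response:
                if response.status == 200:
                    return True
        except OSError:
            # Connection refused, timeouts, and HTTP error statuses land here.
            pass
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    print("server ready:", wait_for_health(timeout_s=5.0))
```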

Generate speech with cross-lingual mode:

# Using the default prompt wav (if provided at server startup):
curl -X POST http://127.0.0.1:5055/tts \
  -F "text=To jest test polskiej syntezy mowy." \
  -F "speed=1.0" \
  --output output.wav

# OR uploading a specific reference voice for this request:
curl -X POST http://127.0.0.1:5055/tts \
  -F "text=To jest test z dynamicznym głosem." \
  -F "prompt_wav=@my_voice_sample.wav" \
  --output output_dynamic.wav
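
The first curl request can also be issued from Python. The following is a standard-library sketch, assuming the server accepts the same multipart form fields (text, speed) that curl -F sends and returns the audio bytes in the response body. Uploading a prompt_wav file is left out for brevity, and the helper names are hypothetical.

```python
import urllib.request
import uuid


def encode_form(fields, boundary=None):
    """Encode text fields as multipart/form-data; returns (body, content_type)."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(f"--{boundary}\r\n".encode("utf-8"))
        parts.append(
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'.encode("utf-8")
        )
        parts.append(str(value).encode("utf-8") + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(text, out_path="output.wav", base_url="http://127.0.0.1:5055", speed="1.0"):
    """POST text to /tts and write the response bytes to out_path."""
    body, content_type = encode_form({"text": text, "speed": speed})
    request = urllib.request.Request(
        f"{base_url}/tts",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
    return out_path
```

Calling synthesize("To jest test polskiej syntezy mowy.") requires the server to be running with a default prompt WAV.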

The server automatically prepends:

You are a helpful assistant.<|endofprompt|>

unless the text already contains <|endofprompt|>.
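
The prepend rule described above amounts to the following logic. This is an illustrative sketch, not the server's actual code; the function name with_default_prompt is hypothetical.

```python
END_OF_PROMPT = "<|endofprompt|>"
DEFAULT_PREFIX = "You are a helpful assistant." + END_OF_PROMPT


def with_default_prompt(text):
    """Prepend the default instruction unless the text already carries the marker."""
    if END_OF_PROMPT in text:
        return text
    return DEFAULT_PREFIX + text


if __name__ == "__main__":
    print(with_default_prompt("To jest test."))
```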

Samples

To publish audio samples on the Hugging Face model page, generate an example after starting the server and upload it to a samples/ directory:

# Generate a sample
curl -X POST http://127.0.0.1:5055/tts \
  -F "text=Witaj, to jest przykładowy głos wygenerowany przez model CosyVoice3 dostrojony na danych Wolne Lektury." \
  --output sample_v7.wav

# Upload to Hugging Face
hf upload Stanslab/cosyvoice3-wolnelektury-v7 sample_v7.wav samples/sample_v7.wav

License

Model weights: Apache 2.0 (same as base CosyVoice3 model).
Training data: CC BY-SA 3.0 (Wolne Lektury).

Safety And Intended Use

This model is intended for research and experimentation with Polish text-to-speech fine-tuning. It should not be used to impersonate real people, generate fraudulent audio, create misleading endorsements, bypass biometric systems, or imply endorsement by Wolne Lektury, narrators, authors, or publishers.

Because the training data is audiobook-style speech, the model may reproduce narration patterns present in the data. Quality may vary for dialogue-heavy text, names, unusual punctuation, emotional speech, domain-specific text, and accents underrepresented in the dataset.

Checksums

See checksums.sha256. The original preserved full training archive, not included here, had SHA256:

898ab90db149f2e564356f34f21efb57d7f9828b31ef5c4d0066c627b4388749
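
Downloaded files can be checked against checksums.sha256 with a few lines of Python. The sketch below assumes the manifest uses the common sha256sum layout (hex digest, whitespace, relative path per line); that layout is an assumption, and the function names are illustrative.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 so large weights are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(manifest, root="."):
    """Return the names in the manifest whose digest does not match."""
    mismatches = []
    for line in Path(manifest).read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # sha256sum prefixes the name with '*' in binary mode.
        name = name.strip().lstrip("*")
        if sha256_of(Path(root) / name) != expected.lower():
            mismatches.append(name)
    return mismatches


if __name__ == "__main__":
    repo = "cosyvoice3-wolnelektury-v7"
    print("mismatches:", verify_checksums(f"{repo}/checksums.sha256", root=repo))
```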