CosyVoice3 WolneLektury v7 Fine-Tune

This repository contains an inference checkpoint for a Polish fine-tune of CosyVoice3, trained on audiobook-style speech from Wolne Lektury.

Important: LoRA Adapter, Not A Full Model

This is a LoRA adapter/checkpoint for the base CosyVoice3 model. It is not a full standalone model and does not include the 9 GB base model weights.

To run inference, you need all of the following:

The base model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
The original CosyVoice code or a compatible loader.
The files in this repository under step_007500/.
A loader that applies the LoRA weights and then loads the fine-tuned speech_embedding and llm_decoder modules.

The included examples/mac_tts_server.py is one such loader.

Prerequisites

Python 3.10 or newer is recommended.
ffmpeg must be installed on your system (e.g., brew install ffmpeg on macOS).
macOS with Apple Silicon (M1/M2/M3) is the primary target for the provided requirements, though it may work on other systems with adjusted dependencies.

What Is Included

step_007500/
  config.json
  lora_weights.pt
  speech_embedding.pt
  llm_decoder.pt
examples/
  mac_tts_server.py
requirements-macos-api.txt
checksums.sha256

The checkpoint contains:

LoRA weights for the CosyVoice3 LLM module.
A fine-tuned speech_embedding module.
A fine-tuned llm_decoder module.
Training/evaluation metadata in config.json.

The package intentionally omits optimizer.pt and scheduler.pt, which are needed only to resume training, not for inference.
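Before wiring the checkpoint into a loader, it can help to confirm that the four inference files are actually present. The following is a minimal sketch, not part of the repository; the helper name missing_checkpoint_files is hypothetical, and the file names are taken from the listing above.

```python
from pathlib import Path

# File names copied from the step_007500/ listing above.
REQUIRED_FILES = (
    "config.json",
    "lora_weights.pt",
    "speech_embedding.pt",
    "llm_decoder.pt",
)


def missing_checkpoint_files(checkpoint_dir):
    """Return the required file names that are absent from checkpoint_dir."""
    root = Path(checkpoint_dir)
    return [name for name in REQUIRED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    # Assumes the script is run from the root of this repository.
    missing = missing_checkpoint_files("step_007500")
    print("missing files:", missing or "none")
```

An empty result means the directory is complete for inference; optimizer.pt and scheduler.pt are deliberately not checked.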

Base Model And License Notes

The base model is published by FunAudioLLM on Hugging Face under the Apache-2.0 license:

Base model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
CosyVoice code: FunAudioLLM/CosyVoice

The fine-tune data came from Polish Wolne Lektury audiobook material. Wolne Lektury materials are distributed under free licenses, commonly CC BY-SA 3.0 or the Free Art License, with attribution requirements. Treat this repository as: base model Apache-2.0, fine-tune trained on CC BY-SA-style Polish audiobook data.

Training Summary

The preserved checkpoint is v7, step 007500, timestamped 2026-05-16 07:11:32.

Training metadata from step_007500/config.json:

Field	Value
step	7500
loss	3.7873
acc	0.2557
validation split	0.05
batch size	4
gradient accumulation	8
max epochs	20
max speech length	750
bf16	true
gradient checkpointing	true
label smoothing	0.1
gradient clip	1.0
LoRA rank	32
LoRA alpha	64
LoRA dropout	0.1
LoRA LR	0.0001
speech embedding LR	0.0002
decoder LR	0.0002
warmup steps	300
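
Two quantities follow directly from the table and are worth making explicit: the effective batch size per optimizer step and the LoRA scaling factor. The sketch below only does that arithmetic. It assumes the listed batch size is per device with no data parallelism, and that the standard alpha / rank LoRA scaling applies; neither is stated in the config summary.

```python
# Values copied from the training table above.
batch_size = 4
gradient_accumulation = 8
lora_rank = 32
lora_alpha = 64

# Samples contributing to one optimizer step (assumption: single device,
# batch size counted per device).
effective_batch = batch_size * gradient_accumulation

# Standard LoRA scaling of the low-rank update: alpha / rank.
lora_scale = lora_alpha / lora_rank

print("effective batch size:", effective_batch)
print("LoRA scale:", lora_scale)
```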

Evaluation settings preserved in the config:

Primary evaluation mode: cross_lingual_official
Evaluation sample count: 250
Evaluation seed: 20260515
Whisper model size used for evaluation: large-v3-turbo
Probe modes: dataset_eop_prompt, dataset_cached_spk, official_cached_spk
Probe sample count: 8

The prompt pattern used for CER evaluation was:

You are a helpful assistant.<|endofprompt|>{text}
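
For readers unfamiliar with the metric, character error rate (CER) is the character-level edit distance between a reference text and a transcript, divided by the reference length. The sketch below shows that calculation only. How the evaluation pipeline normalized text before scoring is not described here, so the lowercasing and whitespace collapsing are assumptions, and the function names are hypothetical.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance divided by reference length."""
    # Assumed normalization: lowercase and collapse runs of whitespace.
    ref = " ".join(reference.lower().split())
    hyp = " ".join(hypothesis.lower().split())
    if not ref:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

In an evaluation like the one described, the reference would be the input text and the hypothesis the Whisper transcript of the synthesized audio.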

How The Adapter Is Loaded

The included examples/mac_tts_server.py applies the checkpoint this way:

Load the base CosyVoice3 model from pretrained_models/Fun-CosyVoice3-0.5B.
Build a PEFT LoRA config from step_007500/config.json.
Load lora_weights.pt into the LLM and merge it with merge_and_unload().
Strict-load speech_embedding.pt.
Strict-load llm_decoder.pt.

The expected LoRA target modules are:

q_proj, k_proj, v_proj, o_proj, down_proj
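
The loading steps above can be sketched with PEFT as follows. This is an illustration under stated assumptions, not a copy of examples/mac_tts_server.py: the config key names ("lora_rank", "lora_alpha", "lora_dropout") are hypothetical, lora_weights.pt is assumed to hold a PEFT adapter state dict whose keys match the wrapped module, and the caller is assumed to pass in the relevant submodules of the already-loaded base model.

```python
import json
from pathlib import Path

# Target modules listed above.
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "down_proj"]


def lora_kwargs_from_config(config):
    """Map training metadata to keyword arguments for peft.LoraConfig.

    The key names read here are hypothetical; check config.json for the
    names actually used.
    """
    return {
        "r": int(config["lora_rank"]),
        "lora_alpha": int(config["lora_alpha"]),
        "lora_dropout": float(config["lora_dropout"]),
        "target_modules": list(TARGET_MODULES),
    }


def apply_checkpoint(llm, speech_embedding, llm_decoder, checkpoint_dir):
    """Apply the adapter to already-loaded base modules.

    Returns the merged LLM; the caller must assign it back onto the model.
    """
    # Imported lazily so the helper above works without torch or peft.
    import torch
    from peft import LoraConfig, get_peft_model, set_peft_model_state_dict

    ckpt = Path(checkpoint_dir)
    config = json.loads((ckpt / "config.json").read_text())

    # Build the LoRA config, then load and merge the adapter weights.
    peft_llm = get_peft_model(llm, LoraConfig(**lora_kwargs_from_config(config)))
    lora_state = torch.load(ckpt / "lora_weights.pt", map_location="cpu")
    set_peft_model_state_dict(peft_llm, lora_state)
    merged_llm = peft_llm.merge_and_unload()

    # Strict-load the two fully fine-tuned modules.
    for module, name in ((speech_embedding, "speech_embedding.pt"),
                         (llm_decoder, "llm_decoder.pt")):
        state = torch.load(ckpt / name, map_location="cpu")
        module.load_state_dict(state, strict=True)
    return merged_llm
```

Merging before inference removes the PEFT wrapper, so the rest of the CosyVoice code sees an ordinary module. The strict loads fail loudly if the module shapes do not match the base model.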

Setup

Clone CosyVoice and install dependencies:

git clone https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git submodule update --init --recursive
python -m venv .venv
source .venv/bin/activate
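# requirements-macos-api.txt ships with this fine-tune repo, so download the repo first (see below)
pip install -r ../cosyvoice3-wolnelektury-v7/requirements-macos-api.txt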

Download the base model:

from modelscope import snapshot_download

snapshot_download(
    "FunAudioLLM/Fun-CosyVoice3-0.5B-2512",
    local_dir="pretrained_models/Fun-CosyVoice3-0.5B",
)

Download this fine-tune repo next to the CosyVoice checkout, for example:

cd ..
huggingface-cli download Stanslab/cosyvoice3-wolnelektury-v7 \
  --local-dir cosyvoice3-wolnelektury-v7

Run Inference Server

From inside the CosyVoice checkout:

# Option 1: Provide a default reference audio file at startup
python ../cosyvoice3-wolnelektury-v7/examples/mac_tts_server.py \
  --model-dir pretrained_models/Fun-CosyVoice3-0.5B \
  --checkpoint-dir ../cosyvoice3-wolnelektury-v7/step_007500 \
  --prompt-wav path/to/your/reference_audio.wav \
  --host 127.0.0.1 \
  --port 5055

# Option 2: Start without a default file (you must upload a WAV in the API call)
python ../cosyvoice3-wolnelektury-v7/examples/mac_tts_server.py \
  --model-dir pretrained_models/Fun-CosyVoice3-0.5B \
  --checkpoint-dir ../cosyvoice3-wolnelektury-v7/step_007500 \
  --port 5055

Note on --prompt-wav: This model requires a short reference audio clip (WAV, a few seconds of clean Polish speech) for cross-lingual or zero-shot synthesis. Provide a default file at startup, or upload one with each API request (see below).
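
A quick sanity check on a reference clip can catch files that are far too short or too long before they reach the server. The sketch below uses only the standard library. The duration bounds are illustrative assumptions, not limits enforced by the model or the server, and the function names are hypothetical. The wave module reads PCM WAV only and raises wave.Error for other encodings.

```python
import wave


def wav_info(path):
    """Return (duration_seconds, sample_rate, channels) for a PCM WAV file."""
    with wave.open(str(path), "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
        return frames / rate, rate, wav.getnchannels()


def is_reasonable_prompt(path, min_seconds=2.0, max_seconds=30.0):
    """True when the clip length falls inside the illustrative bounds."""
    duration, _rate, _channels = wav_info(path)
    return min_seconds <= duration <= max_seconds
```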

Health check:

curl http://127.0.0.1:5055/health

Generate speech with cross-lingual mode:

# Using the default prompt wav (if provided at server startup):
curl -X POST http://127.0.0.1:5055/tts \
  -F "text=To jest test polskiej syntezy mowy." \
  -F "speed=1.0" \
  --output output.wav

# OR uploading a specific reference voice for this request:
curl -X POST http://127.0.0.1:5055/tts \
  -F "text=To jest test z dynamicznym głosem." \
  -F "prompt_wav=@my_voice_sample.wav" \
  --output output_dynamic.wav
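
The same requests can be issued from Python. The sketch below builds the multipart form body with the standard library, so it needs no extra packages. The form field names (text, speed, prompt_wav) and the URL come from the curl examples above. That the response body is the WAV data is an assumption based on curl writing it straight to output.wav, and the helper names are hypothetical.

```python
import os
import urllib.request
import uuid


def encode_multipart(fields, files=None):
    """Encode text fields and files as multipart/form-data.

    fields: dict of name -> value; files: dict of name -> (filename, bytes).
    Returns (body_bytes, content_type_header).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        head = f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
        parts.append(head.encode() + str(value).encode("utf-8") + b"\r\n")
    for name, (filename, data) in (files or {}).items():
        head = (f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
                f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n')
        parts.append(head.encode() + data + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(text, out_path, prompt_wav=None, speed=1.0,
               url="http://127.0.0.1:5055/tts"):
    """POST to /tts and write the response body to out_path."""
    files = None
    if prompt_wav is not None:
        with open(prompt_wav, "rb") as f:
            files = {"prompt_wav": (os.path.basename(prompt_wav), f.read())}
    body, content_type = encode_multipart({"text": text, "speed": speed}, files)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST")
    # Assumption: the response body is the generated WAV audio.
    with urllib.request.urlopen(request) as response, open(out_path, "wb") as out:
        out.write(response.read())
```

With the server running, synthesize("To jest test polskiej syntezy mowy.", "output.wav") mirrors the first curl call, and passing prompt_wav="my_voice_sample.wav" mirrors the second.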

The server automatically prepends:

You are a helpful assistant.<|endofprompt|>

unless the text already contains <|endofprompt|>.
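
The prepending rule can be expressed in a few lines. This is a sketch of the behavior as described above, not the server's actual code; the function and constant names are hypothetical.

```python
SYSTEM_PROMPT = "You are a helpful assistant."
END_OF_PROMPT = "<|endofprompt|>"


def with_prompt_prefix(text):
    """Prepend the default prefix unless the text already carries the marker."""
    if END_OF_PROMPT in text:
        return text
    return f"{SYSTEM_PROMPT}{END_OF_PROMPT}{text}"
```

Applying the function twice gives the same result as applying it once, so a client that adds the prefix itself gets the same input as one that leaves it to the server.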

Samples

To provide audio samples on your Hugging Face model page, you can generate an example after starting the server and upload it to a samples/ directory:

# Generate a sample
curl -X POST http://127.0.0.1:5055/tts \
  -F "text=Witaj, to jest przykładowy głos wygenerowany przez model CosyVoice3 dostrojony na danych Wolne Lektury." \
  --output sample_v7.wav

# Upload to Hugging Face
hf upload Stanslab/cosyvoice3-wolnelektury-v7 sample_v7.wav samples/sample_v7.wav

License

Model weights: Apache 2.0 (same as base CosyVoice3 model).
Training data: CC BY-SA 3.0 (Wolne Lektury).

Safety And Intended Use

This model is intended for research and experimentation with Polish text-to-speech fine-tuning. It should not be used to impersonate real people, generate fraudulent audio, create misleading endorsements, bypass biometric systems, or imply endorsement by Wolne Lektury, narrators, authors, or publishers.

Because the training data is audiobook-style speech, the model may reproduce narration patterns present in the data. Quality may vary for dialogue-heavy text, names, unusual punctuation, emotional speech, domain-specific text, and accents underrepresented in the dataset.

Checksums

See checksums.sha256. The full training archive, preserved separately and not included here, had the following SHA256:

898ab90db149f2e564356f34f21efb57d7f9828b31ef5c4d0066c627b4388749
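
Downloaded files can be checked against checksums.sha256 with a short script. The sketch below assumes the manifest uses the common sha256sum layout (a hex digest, whitespace, then a path relative to the manifest); that layout is an assumption, so adjust the parsing if the file differs. The function names are hypothetical.

```python
import hashlib
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so large weight files are not read at once."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(manifest_path):
    """Return the listed paths whose file is missing or whose hash differs."""
    manifest = Path(manifest_path)
    failures = []
    for line in manifest.read_text().splitlines():
        if not line.strip():
            continue
        expected, name = line.strip().split(maxsplit=1)
        name = name.lstrip("*")  # sha256sum marks binary mode with '*'
        target = manifest.parent / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from verify_checksums("checksums.sha256") means every listed file matched.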
