CosyVoice3 WolneLektury v7 Fine-Tune
This repository contains an inference checkpoint for a Polish CosyVoice3 fine-tune trained on audiobook-style Polish speech from Wolne Lektury data.
Important: LoRA Adapter, Not A Full Model
This is a LoRA adapter/checkpoint for the base CosyVoice3 model. It is not a full standalone model and does not include the 9 GB base model weights.
To run inference, you need all of the following:
- The base model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
- The original CosyVoice code or a compatible loader.
- The files in this repository under step_007500/.
- A loader that applies the LoRA weights and then loads the fine-tuned speech_embedding and llm_decoder modules.
The included examples/mac_tts_server.py is one such loader.
Prerequisites
- Python 3.10 or newer is recommended.
- ffmpeg must be installed on your system (e.g., brew install ffmpeg on macOS).
- macOS with Apple Silicon (M1/M2/M3) is the primary target for the provided requirements, though it may work on other systems with adjusted dependencies.
What Is Included
step_007500/
config.json
lora_weights.pt
speech_embedding.pt
llm_decoder.pt
examples/
mac_tts_server.py
requirements-macos-api.txt
checksums.sha256
The checkpoint contains:
- LoRA weights for the CosyVoice3 LLM module.
- A fine-tuned speech_embedding module.
- A fine-tuned llm_decoder module.
- Training/evaluation metadata in config.json.
The package intentionally does not include optimizer.pt or scheduler.pt, because those are only needed to resume training and are not required for inference.
Base Model And License Notes
The base model is published by FunAudioLLM under apache-2.0 on Hugging Face:
- Base model: FunAudioLLM/Fun-CosyVoice3-0.5B-2512
- CosyVoice code: FunAudioLLM/CosyVoice
The fine-tune data came from Polish Wolne Lektury audiobook material. Wolne Lektury materials are distributed under free licenses, commonly CC BY-SA 3.0 or the Free Art License, with attribution requirements. Treat this repository as: base model Apache-2.0, fine-tune trained on CC BY-SA-style Polish audiobook data.
Training Summary
The preserved checkpoint is v7, step 007500, timestamped 2026-05-16 07:11:32.
Training metadata from step_007500/config.json:
| Field | Value |
|---|---|
| step | 7500 |
| loss | 3.7873 |
| acc | 0.2557 |
| validation split | 0.05 |
| batch size | 4 |
| gradient accumulation | 8 |
| max epochs | 20 |
| max speech length | 750 |
| bf16 | true |
| gradient checkpointing | true |
| label smoothing | 0.1 |
| gradient clip | 1.0 |
| LoRA rank | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.1 |
| LoRA LR | 0.0001 |
| speech embedding LR | 0.0002 |
| decoder LR | 0.0002 |
| warmup steps | 300 |
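Two derived quantities follow from the table and can help when comparing runs. The arithmetic below is a small sketch: it assumes a single training device (the device count is not stated here) and the standard LoRA scaling of alpha divided by rank. The variable names are illustrative and are not claimed to be keys from config.json.

```python
# Values copied from the training metadata table above.
batch_size = 4
gradient_accumulation = 8
lora_rank = 32
lora_alpha = 64

# Assumption: one training device; multiply by the device count otherwise.
num_devices = 1

# Samples contributing to each optimizer step.
effective_batch_size = batch_size * gradient_accumulation * num_devices

# Standard LoRA scales the low-rank update by alpha / rank.
lora_scaling = lora_alpha / lora_rank

print(effective_batch_size)  # 32
print(lora_scaling)          # 2.0
```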
Evaluation settings preserved in the config:
- Primary evaluation mode: cross_lingual_official
- Evaluation sample count: 250
- Evaluation seed: 20260515
- Whisper model size used for evaluation: large-v3-turbo
- Probe modes: dataset_eop_prompt, dataset_cached_spk, official_cached_spk
- Probe sample count: 8
The prompt pattern used for CER evaluation was:
You are a helpful assistant.<|endofprompt|>{text}
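For reference, character error rate is conventionally the Levenshtein distance between the reference text and the transcript, divided by the reference length. The sketch below implements that textbook definition only. The text normalization used in this evaluation (casing, punctuation) is not documented here, so the functions compare strings exactly as given, and the names `edit_distance` and `cer` are illustrative.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

For example, `cer("głos", "glos")` counts one substitution over four reference characters.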
How The Adapter Is Loaded
The included examples/mac_tts_server.py applies the checkpoint this way:
- Load the base CosyVoice3 model from pretrained_models/Fun-CosyVoice3-0.5B.
- Build a PEFT LoRA config from step_007500/config.json.
- Load lora_weights.pt into the LLM and merge it with merge_and_unload().
- Strict-load speech_embedding.pt.
- Strict-load llm_decoder.pt.
The expected LoRA target modules are:
q_proj, k_proj, v_proj, o_proj, down_proj
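The LoRA part of those steps can be sketched as follows. This is not copied from mac_tts_server.py. The config key names (`lora_rank`, `lora_alpha`, `lora_dropout`) are hypothetical, the tensor naming inside lora_weights.pt is not documented here, and finding the LLM submodule inside the CosyVoice3 model is left to the caller. Treat it as an outline under those assumptions.

```python
import json
from pathlib import Path

# Target modules as listed above.
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj", "down_proj"]


def lora_kwargs_from_config(cfg: dict) -> dict:
    """Map training metadata to keyword arguments for peft.LoraConfig.

    Assumption: the key names "lora_rank", "lora_alpha" and "lora_dropout"
    are hypothetical; check step_007500/config.json for the real ones.
    """
    return {
        "r": int(cfg["lora_rank"]),
        "lora_alpha": int(cfg["lora_alpha"]),
        "lora_dropout": float(cfg["lora_dropout"]),
        "target_modules": list(TARGET_MODULES),
    }


def apply_lora(llm, checkpoint_dir: str):
    """Wrap `llm` with LoRA, load the adapter weights, and merge them.

    `llm` is whichever torch module holds the q_proj/k_proj/... layers.
    """
    # Imported lazily so the helper above works without torch or peft.
    import torch
    from peft import LoraConfig, get_peft_model

    ckpt = Path(checkpoint_dir)
    cfg = json.loads((ckpt / "config.json").read_text(encoding="utf-8"))
    peft_model = get_peft_model(llm, LoraConfig(**lora_kwargs_from_config(cfg)))

    state = torch.load(ckpt / "lora_weights.pt", map_location="cpu")
    # strict=False because the file is assumed to hold only adapter tensors;
    # unexpected keys indicate a naming mismatch and stop the load.
    result = peft_model.load_state_dict(state, strict=False)
    if result.unexpected_keys:
        raise RuntimeError(f"unexpected keys: {result.unexpected_keys[:5]}")
    return peft_model.merge_and_unload()
```

Raising on unexpected keys is a deliberate choice in this sketch: with `strict=False`, a key-prefix mismatch would otherwise leave the adapter untouched without any error.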
Setup
Clone CosyVoice and install dependencies:
git clone https://github.com/FunAudioLLM/CosyVoice.git
cd CosyVoice
git submodule update --init --recursive
python -m venv .venv
source .venv/bin/activate
pip install -r ../requirements-macos-api.txt
Download the base model:
from modelscope import snapshot_download
snapshot_download(
"FunAudioLLM/Fun-CosyVoice3-0.5B-2512",
local_dir="pretrained_models/Fun-CosyVoice3-0.5B",
)
Download this fine-tune repo next to the CosyVoice checkout, for example:
cd ..
huggingface-cli download Stanslab/cosyvoice3-wolnelektury-v7 \
--local-dir cosyvoice3-wolnelektury-v7
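After the download, a quick check that the checkpoint directory is complete can save a confusing failure at server startup. The helper below is a minimal sketch. The file list comes from the "What Is Included" section, and the function name is illustrative.

```python
from pathlib import Path

# Files listed under step_007500/ in "What Is Included".
EXPECTED_FILES = (
    "config.json",
    "lora_weights.pt",
    "speech_embedding.pt",
    "llm_decoder.pt",
)


def missing_checkpoint_files(checkpoint_dir: str) -> list[str]:
    """Return the expected checkpoint files that are absent from the directory."""
    root = Path(checkpoint_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]
```

Calling `missing_checkpoint_files("cosyvoice3-wolnelektury-v7/step_007500")` from the parent directory should return an empty list when the download succeeded.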
Run Inference Server
From inside the CosyVoice checkout:
# Option 1: Provide a default reference audio file at startup
python ../cosyvoice3-wolnelektury-v7/examples/mac_tts_server.py \
--model-dir pretrained_models/Fun-CosyVoice3-0.5B \
--checkpoint-dir ../cosyvoice3-wolnelektury-v7/step_007500 \
--prompt-wav path/to/your/reference_audio.wav \
--host 127.0.0.1 \
--port 5055
# Option 2: Start without a default file (you must upload a WAV in the API call)
python ../cosyvoice3-wolnelektury-v7/examples/mac_tts_server.py \
--model-dir pretrained_models/Fun-CosyVoice3-0.5B \
--checkpoint-dir ../cosyvoice3-wolnelektury-v7/step_007500 \
--port 5055
Note on --prompt-wav: This model requires a short reference audio clip (WAV, a few seconds of clean Polish speech) to perform cross-lingual or zero-shot synthesis. You can provide a default file at startup OR upload it dynamically with each API request (see below).
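Before using a clip as the reference, it can help to confirm how long it actually is. The sketch below uses only the standard-library `wave` module, which reads uncompressed PCM WAV files. No exact duration bounds are given here beyond "a few seconds", so the function only reports the length and leaves the judgment to you.

```python
import wave


def wav_duration_seconds(path: str) -> float:
    """Return the duration of a PCM WAV file in seconds."""
    with wave.open(path, "rb") as wav:
        return wav.getnframes() / wav.getframerate()
```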
Health check:
curl http://127.0.0.1:5055/health
Generate speech with cross-lingual mode:
# Using the default prompt wav (if provided at server startup):
curl -X POST http://127.0.0.1:5055/tts \
-F "text=To jest test polskiej syntezy mowy." \
-F "speed=1.0" \
--output output.wav
# OR uploading a specific reference voice for this request:
curl -X POST http://127.0.0.1:5055/tts \
-F "text=To jest test z dynamicznym głosem." \
-F "prompt_wav=@my_voice_sample.wav" \
--output output_dynamic.wav
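The same requests can be issued from Python without third-party packages. The sketch below builds the multipart form body by hand and posts it with `urllib`. The field names `text`, `speed` and `prompt_wav` are taken from the curl calls above. The helper names and the `audio/wav` content type are assumptions made for illustration.

```python
import urllib.request
import uuid
from typing import Optional


def build_multipart(fields: dict, files: dict) -> tuple[bytes, str]:
    """Encode form fields and files as multipart/form-data.

    `files` maps a field name to a (filename, content_bytes) pair.
    Returns the request body and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n".encode("utf-8")
        )
    for name, (filename, content) in files.items():
        header = (
            f'--{boundary}\r\nContent-Disposition: form-data; name="{name}"; '
            f'filename="{filename}"\r\nContent-Type: audio/wav\r\n\r\n'
        ).encode("utf-8")
        parts.append(header + content + b"\r\n")
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def synthesize(text: str, prompt_wav: Optional[bytes] = None,
               url: str = "http://127.0.0.1:5055/tts") -> bytes:
    """POST text (and optionally a reference WAV) to /tts; return the response bytes."""
    files = {"prompt_wav": ("prompt.wav", prompt_wav)} if prompt_wav else {}
    body, content_type = build_multipart({"text": text, "speed": "1.0"}, files)
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": content_type}, method="POST"
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the server running, `open("output.wav", "wb").write(synthesize("To jest test polskiej syntezy mowy."))` would mirror the first curl call.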
The server automatically prepends:
You are a helpful assistant.<|endofprompt|>
unless the text already contains <|endofprompt|>.
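That rule is simple enough to mirror on the client, for example when you need to see the exact string the model receives. The function below restates the behavior described above. It is not copied from mac_tts_server.py, and its name is illustrative.

```python
PROMPT_PREFIX = "You are a helpful assistant.<|endofprompt|>"
END_OF_PROMPT = "<|endofprompt|>"


def with_prompt_prefix(text: str) -> str:
    """Prepend the default prefix unless the text already carries the marker."""
    if END_OF_PROMPT in text:
        return text
    return PROMPT_PREFIX + text
```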
Samples
To provide audio samples on your Hugging Face model page, you can generate an example after starting the server and upload it to a samples/ directory:
# Generate a sample
curl -X POST http://127.0.0.1:5055/tts \
-F "text=Witaj, to jest przykładowy głos wygenerowany przez model CosyVoice3 dostrojony na danych Wolne Lektury." \
--output sample_v7.wav
# Upload to Hugging Face
hf upload Stanslab/cosyvoice3-wolnelektury-v7 sample_v7.wav samples/sample_v7.wav
License
Model weights: Apache 2.0 (same as base CosyVoice3 model).
Training data: CC BY-SA 3.0 (Wolne Lektury).
Safety And Intended Use
This model is intended for research and experimentation with Polish text-to-speech fine-tuning. It should not be used to impersonate real people, generate fraudulent audio, create misleading endorsements, bypass biometric systems, or imply endorsement by Wolne Lektury, narrators, authors, or publishers.
Because the training data is audiobook-style speech, the model may reproduce narration patterns present in the data. Quality may vary for dialogue-heavy text, names, unusual punctuation, emotional speech, domain-specific text, and accents underrepresented in the dataset.
Checksums
See checksums.sha256. The original preserved full training archive, not included here, had SHA256:
898ab90db149f2e564356f34f21efb57d7f9828b31ef5c4d0066c627b4388749
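To check downloaded files against the manifest, something like the sketch below should work. It assumes each line of checksums.sha256 has the form written by `sha256sum` (hex digest, whitespace, relative path). The actual layout of the file is not shown here, so adjust the parsing if it differs.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in chunks so large weight files are not read into memory at once."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_checksums(manifest: str, root: str = ".") -> list[str]:
    """Return the paths that are missing or whose SHA-256 differs from the manifest.

    Assumption: each manifest line is "<hex digest>  <relative path>",
    as written by `sha256sum`; the real file may differ.
    """
    failures = []
    for line in Path(manifest).read_text(encoding="utf-8").splitlines():
        if not line.strip():
            continue
        expected, name = line.split(maxsplit=1)
        # sha256sum marks binary mode with a leading "*" before the path.
        name = name.strip().lstrip("*")
        target = Path(root) / name
        if not target.is_file() or sha256_of(target) != expected.lower():
            failures.append(name)
    return failures
```

An empty list from `verify_checksums("checksums.sha256", ".")` means every listed file is present and matches.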