---
license: mit
language:
- en
tags:
- text-to-speech
- styletts2
- english
- regional-accent
- fine-tuned
pipeline_tag: text-to-speech
---

# StyleTTS2 fine-tuned for Northern English (Bolton/Lancashire)

Fine-tuned from `yl4579/StyleTTS2-LibriTTS` on a small (~163 min) corpus of clean, single-speaker recordings from three speakers from the Bolton / Lancashire region: **Sara Cox**, **Maxine Peake**, **Diane Morgan**.

## What this is

A 4-epoch fine-tune of StyleTTS2 on Northern English speech that produces **moderate Northern intonation with phonetic stability**. Specifically:

- ✅ Recognisable Northern intonation (FOOT-STRUT collapse on common words)
- ✅ Clean phonetics: no truncated word endings, no dropped function words
- ✅ Stays in the Bolton/Lancashire sub-region of accent space
- ✗ Less aggressive accent commitment than F5-TTS at comparable training depth

This is one of two checkpoints from the same project. See the [F5-TTS variant](https://huggingface.co/grahamathf/f5-tts-northern-english-ft) for the stronger-accent / less-precise alternative.

## Why epoch 4 specifically

We trained past epoch 4 and, from epoch 5 onward, observed the model drifting *past* the target accent: `down` was rendered as `doon` (a Geordie/Scots realisation rather than the Bolton/Lancashire target). Epoch 4 is the sweet spot: it has committed to Northern intonation but has not yet over-fit toward the broader Scots cluster. Details are in the [architecture trade-off write-up](https://gist.github.com/netlinux-ai/220b0990203b403cb786e457cdb7dd8e).
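Auditioning neighbouring checkpoints, as described above, means mapping an epoch count to a checkpoint file. The sketch below assumes the naming scheme of the upstream StyleTTS2 fine-tuning script, `epoch_2nd_%05d.pth` with a zero-based epoch index, under which the fourth completed epoch is `epoch_2nd_00003.pth` (the file name used in the Usage section). The helper name `checkpoint_for_epoch` is hypothetical and the zero-based indexing is an assumption; check your own log directory before relying on it.

```python
def checkpoint_for_epoch(completed_epochs: int) -> str:
    """Return the checkpoint file name written after `completed_epochs` epochs.

    Assumption: zero-based naming, i.e. epoch_2nd_00000.pth after the first epoch.
    """
    if completed_epochs < 1:
        raise ValueError("completed_epochs must be at least 1")
    return f"epoch_2nd_{completed_epochs - 1:05d}.pth"


# Candidates to audition side by side: the chosen epoch and its neighbours.
candidates = {n: checkpoint_for_epoch(n) for n in (3, 4, 5)}
for n, name in candidates.items():
    print(f"epoch {n}: {name}")
```

Synthesising the same probe sentence (for example one containing `down`) with each candidate makes the drift between epochs easier to hear.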
## Usage

```python
import torch
from styletts2_infer import build, make_sampler, compute_style, inference
import soundfile as sf

# Build the model from the training config and the fine-tuned checkpoint.
model, _ = build("config.yml", "epoch_2nd_00003.pth")
sampler = make_sampler(model)

# Extract a style vector from a reference recording of the target voice.
ref_s = compute_style(model, "ref.wav")

audio = inference(
    model, sampler,
    text="Hello from a fine-tuned model.",
    ref_s=ref_s,
)

# Write the result as 24 kHz audio.
sf.write("out.wav", audio, 24000)
```

The full inference helper (~150 LOC) lives in the companion file of the [minimal F5-TTS trainer gist](https://gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f); the same shape applies to StyleTTS2 inference.

## Architecture and base

- Base model: [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS), specifically `epochs_2nd_00020.pth` from the LibriTTS multi-speaker training.
- Fine-tune: 4 epochs at `lr=1e-4` (defaults from `config_ft.yml`), `batch_size=2`, `max_len=100` (memory-constrained for an RTX 3060 12 GB), `lambda_slm=0` (WavLM SLM disabled to fit in memory).

## Training corpus composition

| Speaker | Source | Segments | Duration |
|---|---|---|---|
| Sara Cox | "Till the Cows Come Home" + "Thrown" audiobooks | 641 | 101 min |
| Maxine Peake | BFI "Working Class Heroes" keynote | 249 | 36 min |
| Diane Morgan | BFI "Mandy" Q&A | 109 | 26 min |
| **Total** | | **999** | **163 min** |

All sources are clean single-speaker content; interview / multi-speaker sources were filtered out at the manifest level.
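Two of the figures above lend themselves to a quick arithmetic check: the corpus balance in the table and the clip length implied by `max_len=100`. The sketch below hard-codes the table values. The frame-to-seconds conversion assumes that `max_len` counts mel frames and that the mel hop length is 300 samples at 24 kHz, as in the upstream StyleTTS2 defaults; neither is stated in this card, so treat that part as an estimate. All names are illustrative.

```python
# (segments, minutes) per speaker, copied from the table above.
corpus = {
    "Sara Cox": (641, 101),
    "Maxine Peake": (249, 36),
    "Diane Morgan": (109, 26),
}

total_segments = sum(seg for seg, _ in corpus.values())
total_minutes = sum(mins for _, mins in corpus.values())


def duration_share(speaker: str) -> float:
    """Fraction of total corpus duration contributed by one speaker."""
    return corpus[speaker][1] / total_minutes


def max_len_seconds(max_len: int, hop_length: int = 300,
                    sample_rate: int = 24000) -> float:
    """Audio length of `max_len` mel frames (hop and rate are assumptions)."""
    return max_len * hop_length / sample_rate


print(f"total: {total_segments} segments, {total_minutes} min")
for speaker in corpus:
    print(f"{speaker}: {duration_share(speaker):.0%} of duration")
print(f"max_len=100 is about {max_len_seconds(100):.2f} s of audio per sample")
```

Under those assumptions a training sample covers a little over one second, which is short relative to a typical phrase.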
## Companion writeups

- [Architecture trade-off (F5 vs StyleTTS2)](https://gist.github.com/netlinux-ai/220b0990203b403cb786e457cdb7dd8e)
- [How human feedback steers TTS fine-tuning](https://gist.github.com/netlinux-ai/372458bf616ab963b1ae556d1faf7d0c)
- [Non-AVX2 CPU TTS compatibility notes](https://gist.github.com/netlinux-ai/7b88da46fd52153dd677cade2e6354f8)
- [Project canonical site](https://netlinux-ai.github.io/)

## Limitations

- **Not for commercial use of cloned voices.** Per yl4579/StyleTTS2's license terms, only use voices whose speakers consent to cloning, or publicly disclose synthesis. The training-corpus speakers did not consent to having their voices cloned; this model demonstrates the technique, not a production voice.
- **Bolton/Lancashire-specific.** The fine-tune drifts past this sub-region toward Geordie/Scots if pushed beyond epoch 4. For other Northern sub-regions you'd need different training data.
- **Memory-constrained training.** `max_len=100` (~1 sec per sample) limits the prosody-level patterns the model sees; longer-context training on a bigger GPU would likely produce stronger results.
- **Single-speaker training-data dominance.** Sara Cox is 62 % of the corpus by duration, so the model pulls toward audiobook-narration register.

## Citation

Built from yl4579/StyleTTS2 plus the corpus above. If you use this work, cite the underlying StyleTTS2 paper and link back to this repo:

```
@misc{styletts2-northern-english-ft-2026,
  author = {netlinux-ai},
  title = {StyleTTS2 fine-tuned for Northern English (Bolton/Lancashire)},
  year = {2026},
  url = {https://huggingface.co/grahamathf/styletts2-northern-english-ft},
}
```