---
license: mit
language:
- en
tags:
- text-to-speech
- styletts2
- english
- regional-accent
- fine-tuned
pipeline_tag: text-to-speech
---

# StyleTTS2 fine-tuned for Northern English (Bolton/Lancashire)

Fine-tuned from `yl4579/StyleTTS2-LibriTTS` on a small (~163 min) corpus of
British speech from three speakers from the Bolton / Lancashire region,
with only one voice per clip: **Sara Cox**, **Maxine Peake**, **Diane Morgan**.

## What this is

A 4-epoch fine-tune of StyleTTS2 on Northern English speech that produces
**moderate Northern intonation with phonetic stability**. Specifically:

- ✅ Recognisable Northern accent (FOOT and STRUT vowels merged on common words)
- ✅ Clean phonetics — no truncated word endings, no dropped function words
- ✅ Stays in the Bolton/Lancashire sub-region of accent space
- ✗ Less aggressive accent commitment than F5-TTS at comparable training depth

This is one of two checkpoints from the same project. See the
[F5-TTS variant](https://huggingface.co/grahamathf/f5-tts-northern-english-ft)
for the stronger-accent / less-precise alternative.

## Why epoch 4 specifically

We trained to epoch 5 and beyond and observed the model drifting *past* the
target accent: `down` was rendered as `doon` (a Geordie/Scots realisation
rather than the Bolton/Lancashire target). Epoch 4 is the sweet spot: it
commits to Northern intonation but has not yet over-fit toward the broader
Scots cluster. Details are in the
[architecture trade-off write-up](https://gist.github.com/netlinux-ai/220b0990203b403cb786e457cdb7dd8e).

## Usage

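```python
import torch
import soundfile as sf

# styletts2_infer is the inference helper described below
from styletts2_infer import build, make_sampler, compute_style, inference

# Load the model config and the fine-tuned checkpoint
model, _ = build("config.yml", "epoch_2nd_00003.pth")
sampler = make_sampler(model)

# Style vector computed from a reference recording
ref_s = compute_style(model, "ref.wav")

audio = inference(
    model, sampler,
    text="Hello from a fine-tuned model.",
    ref_s=ref_s,
)
sf.write("out.wav", audio, 24000)  # written at a 24 kHz sample rate
```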

The full inference helper (~150 LOC) is a companion file in the
[minimal F5-TTS trainer gist](https://gist.github.com/netlinux-ai/a7bbf6c64487bdc9ae5ff66731c5646f);
the same shape applies to StyleTTS2 inference.
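
The checkpoint loaded above is numbered `00003` although this card describes a
4-epoch fine-tune. A likely explanation, assumed here rather than confirmed by
the card, is that the fine-tuning script numbers saved checkpoints from zero.
Under that assumption, a small hypothetical helper (`checkpoint_name` is not
part of StyleTTS2) maps a 1-based epoch to its file name:

```python
def checkpoint_name(epoch: int) -> str:
    """Map a 1-based epoch number to a checkpoint file name.

    Assumption: files are named epoch_2nd_<index>.pth with a zero-based,
    five-digit index, which matches the file name used in the Usage snippet.
    """
    if epoch < 1:
        raise ValueError("epoch numbers start at 1")
    return f"epoch_2nd_{epoch - 1:05d}.pth"


# Checkpoints to compare when listening for drift around the chosen epoch
candidates = [checkpoint_name(e) for e in (3, 4, 5)]
print(candidates)
```

If the numbering in your own run differs, adjust the offset rather than the
epoch you intend to load.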

## Architecture and base

- Base model: [yl4579/StyleTTS2-LibriTTS](https://huggingface.co/yl4579/StyleTTS2-LibriTTS)
  — `epochs_2nd_00020.pth` from the LibriTTS multi-speaker training.
- Fine-tune: 4 epochs at `lr=1e-4` (defaults from `config_ft.yml`),
  `batch_size=2`, `max_len=100` (memory-constrained for RTX 3060 12GB),
  `lambda_slm=0` (WavLM SLM disabled to fit in memory).
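
The overrides listed above could be applied on top of the stock fine-tuning
config roughly as follows. This is a minimal sketch: the nesting of `lr` under
`optimizer_params` and `lambda_slm` under `loss_params` is an assumption about
the layout of `config_ft.yml`, the base values are placeholders, and
`apply_overrides` is a hypothetical helper rather than part of StyleTTS2.

```python
import copy

# Placeholder stand-in for the parsed config_ft.yml (values are NOT the
# upstream defaults, and the key nesting is an assumption).
base_config = {
    "batch_size": 0,
    "max_len": 0,
    "optimizer_params": {"lr": 1e-4},
    "loss_params": {"lambda_slm": 1.0},
}

# Settings reported in this model card
overrides = {
    "batch_size": 2,
    "max_len": 100,
    "optimizer_params": {"lr": 1e-4},
    "loss_params": {"lambda_slm": 0},
}


def apply_overrides(config: dict, changes: dict) -> dict:
    """Return a copy of config with nested changes merged in."""
    merged = copy.deepcopy(config)
    for key, value in changes.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


ft_config = apply_overrides(base_config, overrides)
print(ft_config)
```

Merging into a copy keeps the stock config untouched, which makes it easier to
diff the fine-tune settings against the defaults later.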

## Training corpus composition

| Speaker | Source | Segments | Duration |
|---|---|---|---|
| Sara Cox | "Till the Cows Come Home" + "Thrown" audiobooks | 641 | 101 min |
| Maxine Peake | BFI "Working Class Heroes" keynote | 249 | 36 min |
| Diane Morgan | BFI "Mandy" Q&A | 109 | 26 min |
| **Total** | | **999** | **163 min** |

All clean single-speaker content; interview / multi-speaker sources were
filtered out at the manifest level.
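
The table's totals and each speaker's share of the corpus can be checked with a
few lines of arithmetic. The sketch below uses only the segment counts and
durations from the table; because those durations are rounded to whole minutes,
the mean segment lengths it derives are approximate.

```python
# (segments, minutes) per speaker, copied from the table above
corpus = {
    "Sara Cox": (641, 101),
    "Maxine Peake": (249, 36),
    "Diane Morgan": (109, 26),
}

total_segments = sum(segments for segments, _ in corpus.values())
total_minutes = sum(minutes for _, minutes in corpus.values())


def speaker_stats(segments: int, minutes: int) -> tuple[int, float]:
    """Return (share of total duration in %, mean segment length in seconds)."""
    share_pct = round(100 * minutes / total_minutes)
    mean_seconds = round(60 * minutes / segments, 1)
    return share_pct, mean_seconds


stats = {name: speaker_stats(*row) for name, row in corpus.items()}

print(total_segments, total_minutes)
for name, (share_pct, mean_seconds) in stats.items():
    print(f"{name}: {share_pct}% of duration, ~{mean_seconds} s per segment")
```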

## Companion writeups

- [Architecture trade-off (F5 vs StyleTTS2)](https://gist.github.com/netlinux-ai/220b0990203b403cb786e457cdb7dd8e)
- [How human feedback steers TTS fine-tuning](https://gist.github.com/netlinux-ai/372458bf616ab963b1ae556d1faf7d0c)
- [Non-AVX2 CPU TTS compatibility notes](https://gist.github.com/netlinux-ai/7b88da46fd52153dd677cade2e6354f8)
- [Project canonical site](https://netlinux-ai.github.io/)

## Limitations

- **Not for commercial use of cloned voices.** Per yl4579/StyleTTS2's
  license terms, only use voices whose speakers consent to cloning, or
  publicly disclose synthesis. The training-corpus speakers did not consent
  to having their voices cloned — this model demonstrates the technique,
  not a production voice.
- **Bolton/Lancashire-specific.** The fine-tune drifts past this sub-region
  toward Geordie/Scots if pushed beyond epoch 4. For other Northern
  sub-regions you'd need different training data.
- **Memory-constrained training.** `max_len=100` (100 mel frames, ~1 sec per
  sample) limits the prosody-level patterns the model sees; longer-context
  training on a bigger GPU would likely produce stronger results.
- **Single-speaker training-data dominance.** Sara Cox is 62 % of the
  corpus by duration, so the model pulls toward audiobook-narration register.
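
The per-sample duration quoted for `max_len=100` in the list above can be
reproduced from frame arithmetic. This sketch assumes `max_len` counts mel
frames and that the mel hop size is 300 samples; the hop size is an assumption
about the preprocessing settings and is not stated in this card. The 24 kHz
sample rate matches the Usage snippet.

```python
SAMPLE_RATE = 24_000  # Hz, as used when writing out.wav in Usage
HOP_LENGTH = 300      # samples per mel frame (assumed, not stated in this card)


def frames_to_seconds(n_frames: int,
                      hop_length: int = HOP_LENGTH,
                      sample_rate: int = SAMPLE_RATE) -> float:
    """Convert a number of mel frames to seconds of audio."""
    return n_frames * hop_length / sample_rate


clip_seconds = frames_to_seconds(100)
print(f"max_len=100 covers about {clip_seconds:.2f} s of audio")
```

Under these assumptions each training sample spans little more than a second,
which is shorter than most phrases and is consistent with the limited prosody
coverage noted above.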

## Citation

Built from yl4579/StyleTTS2 plus the corpus above. If you use this work,
cite the underlying StyleTTS2 paper and link back to this repo:

```bibtex
@misc{styletts2-northern-english-ft-2026,
  author = {netlinux-ai},
  title  = {StyleTTS2 fine-tuned for Northern English (Bolton/Lancashire)},
  year   = {2026},
  url    = {https://huggingface.co/grahamathf/styletts2-northern-english-ft},
}
```