Orpheus 3B — Bangla TTS (High Data Fine-tune)

A fine-tuned version of Orpheus 3B for Bangla (Bengali) text-to-speech, using LoRA adapters. It was trained on ~99K Bangla speech samples for 30,000 steps on an H100 GPU.

Model Details

Property          | Value
Base Model        | canopylabs/orpheus-3b-0.1-pretrained
Architecture      | Llama 3B + LoRA adapters
Training Data     | ~99,000 Bangla speech samples (combined dataset)
Training Steps    | 30,000
Audio Codec       | SNAC 24kHz
Training Platform | Modal (H100 GPU) with Unsloth
Language          | Bangla (bn)
License           | Apache 2.0

What is Orpheus?

Orpheus TTS is a Llama-based text-to-speech model that generates audio as interleaved SNAC codec tokens. It supports emotional speech tags such as <laugh>, <sigh>, and <gasp>.
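
To make "interleaved SNAC codec tokens" concrete, here is a minimal sketch of how three SNAC codebook layers could be flattened into one token stream. The 7-slot frame order, the offset 128266, and the codebook size 4096 are assumptions taken from the upstream Orpheus reference preprocessing, not values stated in this card; `interleave_snac_layers` is a hypothetical helper name.

```python
# Assumptions (from upstream Orpheus code, not confirmed for this fine-tune):
AUDIO_TOKEN_OFFSET = 128266  # ID of the first audio-code token in the vocabulary
CODEBOOK_SIZE = 4096         # entries per SNAC codebook; each frame slot gets its own ID range


def interleave_snac_layers(layer1, layer2, layer3):
    """Flatten three SNAC layers into one list of token IDs, 7 tokens per frame.

    Each frame holds 1 coarse code, 2 mid-level codes, and 4 fine codes.
    """
    if len(layer2) != 2 * len(layer1) or len(layer3) != 4 * len(layer1):
        raise ValueError("expected layer lengths in a 1:2:4 ratio")

    tokens = []
    for i in range(len(layer1)):
        # Slot order within a frame: coarse, mid, fine, fine, mid, fine, fine
        frame = [
            layer1[i],
            layer2[2 * i],
            layer3[4 * i],
            layer3[4 * i + 1],
            layer2[2 * i + 1],
            layer3[4 * i + 2],
            layer3[4 * i + 3],
        ]
        # Shift each slot into its own non-overlapping ID range
        for slot, code in enumerate(frame):
            tokens.append(AUDIO_TOKEN_OFFSET + slot * CODEBOOK_SIZE + code)
    return tokens
```

Under these assumptions, one frame covers 1 + 2 + 4 = 7 codes, so the language model emits seven tokens for every coarse SNAC step.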

Usage

Note: The base model canopylabs/orpheus-3b-0.1-pretrained is gated, so you need a Hugging Face token with approved access.

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import snac
import numpy as np
import soundfile as sf

# 1. Load base model
base_model_id = "canopylabs/orpheus-3b-0.1-pretrained"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, token="YOUR_HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    token="YOUR_HF_TOKEN"
)

# 2. Resize embeddings (vocab size mismatch fix)
model.resize_token_embeddings(156940)

# 3. Load LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "EMTIAZZ/orpheus-3b-bangla-high-data-finetuning",
    token="YOUR_HF_TOKEN"
)
model = model.merge_and_unload()

# 4. Generate speech
text = "আমি বাংলায় কথা বলতে পারি।"
prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

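# Keep the model and the inputs on the same device (GPU if available)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device).eval()

inputs = tokenizer(prompt, return_tensors="pt").to(device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1200,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
    )

# `outputs` holds token IDs (prompt + generated SNAC codec tokens), not a waveform.
# The codec tokens must still be decoded with the SNAC 24 kHz decoder before
# the audio can be saved with soundfile.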
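
The snippet above stops at token generation. As a hedged sketch of the remaining step, the following shows one way to turn generated IDs into SNAC layers and a WAV file. It assumes the upstream Orpheus layout (audio codes start at ID 128266, 7 tokens per frame, 4096 codes per slot) and the `hubertsiuzdak/snac_24khz` checkpoint; none of these are confirmed by this card. It also simply filters IDs by range, which is a simplification of the upstream reference code. All helper names are hypothetical.

```python
# Assumptions (upstream Orpheus conventions, not confirmed for this fine-tune):
AUDIO_TOKEN_OFFSET = 128266
CODEBOOK_SIZE = 4096
FRAME_SIZE = 7


def extract_audio_codes(token_ids):
    """Keep IDs inside the audio-code range and drop an incomplete trailing frame."""
    upper = AUDIO_TOKEN_OFFSET + FRAME_SIZE * CODEBOOK_SIZE
    codes = [t - AUDIO_TOKEN_OFFSET for t in token_ids if AUDIO_TOKEN_OFFSET <= t < upper]
    usable = len(codes) - len(codes) % FRAME_SIZE
    return codes[:usable]


def split_into_layers(codes):
    """Undo the 7-slot interleaving: return (coarse, mid, fine) code lists."""
    layer1, layer2, layer3 = [], [], []
    for start in range(0, len(codes), FRAME_SIZE):
        # Remove the per-slot offset so every code is a plain codebook index
        frame = [c - slot * CODEBOOK_SIZE
                 for slot, c in enumerate(codes[start:start + FRAME_SIZE])]
        layer1.append(frame[0])
        layer2.extend([frame[1], frame[4]])
        layer3.extend([frame[2], frame[3], frame[5], frame[6]])
    return layer1, layer2, layer3


def decode_to_wav(layers, path="output.wav"):
    """Decode SNAC layers to audio and write a 24 kHz WAV file."""
    # Imported lazily so the pure-Python helpers above work without these packages
    import torch
    import soundfile as sf
    from snac import SNAC

    snac_model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    codes = [torch.tensor(layer, dtype=torch.long).unsqueeze(0) for layer in layers]
    with torch.no_grad():
        audio = snac_model.decode(codes)  # shape: (batch, 1, samples)
    sf.write(path, audio.squeeze().cpu().numpy(), 24000)
```

With the `outputs` tensor from the generation step, usage would look like `decode_to_wav(split_into_layers(extract_audio_codes(outputs[0].tolist())))`. If the model emits a code in the wrong slot, the resulting index falls outside 0..4095 and SNAC decoding will fail, so production code should validate the frames first.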

Emotional Speech Tags

The model supports speech emotion/style tags (inherited from Orpheus):

<laugh>   <chuckle>   <sigh>   <cough>   <sniffle>
<groan>   <yawn>      <gasp>

Example: "সে বলল <laugh> এটা সত্যিই মজার!"
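
Because only the eight tags listed above are documented, it can help to check input text for other tags before synthesis. This is a small illustrative sketch; `find_unsupported_tags` is a hypothetical helper, and how the model behaves on unknown tags is not described in this card.

```python
import re

# The eight tags listed in this card
SUPPORTED_TAGS = {"laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp"}

# Matches simple word-like tags such as <laugh>; chat-template markers
# like <|eot_id|> are deliberately not matched.
TAG_PATTERN = re.compile(r"<([A-Za-z_]+)>")


def find_unsupported_tags(text):
    """Return tag names found in `text` that are not in SUPPORTED_TAGS, in order."""
    return [name for name in TAG_PATTERN.findall(text) if name not in SUPPORTED_TAGS]
```

For example, `find_unsupported_tags("সে বলল <laugh> এটা সত্যিই মজার!")` returns an empty list, while a text containing `<giggle>` would report `"giggle"`.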

Training Details

  • Framework: Unsloth + HuggingFace Trainer
  • Method: LoRA (Low-Rank Adaptation)
  • Dataset: ~99K Bangla samples from multiple speakers, including the Adiba dataset
  • Hardware: H100 GPU on Modal
  • Training Length: 30,000 steps
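
For readers unfamiliar with the method, the standard LoRA formulation is summarized below. The rank r and scaling factor α used for this model are not stated in this card, so they are left symbolic.

```latex
% Standard LoRA update for a frozen weight matrix W_0 \in \mathbb{R}^{d \times k}
W' = W_0 + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d \times r},\quad
A \in \mathbb{R}^{r \times k},\quad
r \ll \min(d, k)
```

Only A and B are trained. Merging the adapter, as `merge_and_unload()` does in the usage snippet, folds the product BA back into the base weights so that inference needs no extra matrices.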

Comparison with Small Data Version

Model                  | Training Data | Steps  | Quality
This model (high data) | ~99K samples  | 30,000 | Better prosody, more natural
Small data version     | ~39K samples  | 4,500  | Faster to infer, lighter

Limitations

  • Requires a Hugging Face token to access the gated base model
  • Generation can be slow on CPU (recommended: a GPU with 16GB+ VRAM)
  • May occasionally pronounce loanwords with English phonemes
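
As rough context for the VRAM recommendation, the sketch below estimates the memory taken by the model weights alone. It assumes a nominal 3 billion parameters, taken from the model name; the exact count is not given in this card. It also ignores activations, the KV cache, and framework overhead, which is why real usage is higher.

```python
def estimate_weight_memory_gib(num_parameters: float, bytes_per_parameter: int) -> float:
    """Memory needed for the weights alone, in GiB (2**30 bytes)."""
    return num_parameters * bytes_per_parameter / 2**30


NOMINAL_PARAMETERS = 3e9  # assumption: "3B" read literally from the model name

fp16_gib = estimate_weight_memory_gib(NOMINAL_PARAMETERS, 2)  # float16: 2 bytes each
fp32_gib = estimate_weight_memory_gib(NOMINAL_PARAMETERS, 4)  # float32: 4 bytes each

print(f"float16 weights: ~{fp16_gib:.1f} GiB")
print(f"float32 weights: ~{fp32_gib:.1f} GiB")
```

Under these assumptions the float16 weights take about 5.6 GiB, so a 16GB card leaves headroom for generation buffers.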

Citation

If you use this model, please cite:

@misc{emtiaz2026orpheusbangla,
  author = {Emtiaz Uddin Ahmed},
  title  = {Orpheus 3B Bangla High-Data Fine-tune},
  year   = {2026},
  url    = {https://huggingface.co/EMTIAZZ/orpheus-3b-bangla-high-data-finetuning}
}

Author

GitHub · Portfolio
