Orpheus 3B — Bangla TTS (High Data Fine-tune)

Fine-tuned version of Orpheus 3B for Bangla (Bengali) text-to-speech using LoRA adapters. Trained on ~99K Bangla speech samples for 30,000 steps on an H100 GPU.

Model Details

Property	Value
Base Model	`canopylabs/orpheus-3b-0.1-pretrained`
Architecture	Llama 3B + LoRA adapters
Training Data	~99,000 Bangla speech samples (combined dataset)
Training Steps	30,000
Audio Codec	SNAC 24kHz
Training Platform	Modal (H100 GPU) with Unsloth
Language	Bangla (bn)
License	Apache 2.0

What is Orpheus?

Orpheus TTS is a Llama-based text-to-speech model that generates audio as interleaved SNAC codec tokens. It supports emotional speech tags like <laugh>, <sigh>, <gasp>, etc.

Usage

Note: The base model canopylabs/orpheus-3b-0.1-pretrained is gated, so you need a Hugging Face token for an account that has been granted access.
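To avoid pasting the token into source files, it can be read from the environment. The helper below is a minimal sketch: `get_hf_token` is a hypothetical name introduced here, and `HF_TOKEN` is assumed to be the variable where you store the token.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    `env_var` defaults to HF_TOKEN, which is an assumption about where
    the token is kept; pass a different name if your setup differs.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to a token with access to the gated base model."
        )
    return token
```

The returned value can be passed wherever the usage code shows `token="YOUR_HF_TOKEN"`.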

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import snac
import numpy as np
import soundfile as sf

# 1. Load base model
base_model_id = "canopylabs/orpheus-3b-0.1-pretrained"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, token="YOUR_HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    token="YOUR_HF_TOKEN"
)

# 2. Resize embeddings (vocab size mismatch fix)
model.resize_token_embeddings(156940)

# 3. Load LoRA adapter
model = PeftModel.from_pretrained(
    model,
    "EMTIAZZ/orpheus-3b-bangla-high-data-finetuning",
    token="YOUR_HF_TOKEN"
)
model = model.merge_and_unload()

# 4. Generate speech
text = "আমি বাংলায় কথা বলতে পারি।"
prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"

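# Keep the inputs on the same device as the model
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1200,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
    )

# `outputs` holds token IDs (the prompt followed by generated tokens), not a waveform.
# The generated audio tokens still have to be decoded with SNAC to obtain audio.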
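The snippet above stops at token IDs. As a rough sketch of the next step, the function below splits generated IDs into the three SNAC codebook layers. It assumes this fine-tune keeps the upstream Orpheus layout: audio token IDs start at 128266, each codebook has 4096 entries, and every frame is 7 tokens, each shifted by its slot index times 4096. These constants and the name `split_snac_layers` are assumptions, not values confirmed by this model card.

```python
from typing import List, Tuple

AUDIO_TOKEN_OFFSET = 128266  # assumed: first audio token ID (upstream Orpheus)
CODEBOOK_SIZE = 4096         # assumed: entries per SNAC codebook
FRAME_SIZE = 7               # 1 coarse + 2 medium + 4 fine codes per frame


def split_snac_layers(token_ids: List[int]) -> Tuple[List[int], List[int], List[int]]:
    """De-interleave Orpheus-style audio tokens into three SNAC layers."""
    # Drop text and special tokens; keep only IDs in the assumed audio range.
    codes = [t - AUDIO_TOKEN_OFFSET for t in token_ids if t >= AUDIO_TOKEN_OFFSET]
    n_frames = len(codes) // FRAME_SIZE  # ignore a trailing partial frame

    layer_1: List[int] = []
    layer_2: List[int] = []
    layer_3: List[int] = []
    for i in range(n_frames):
        frame = codes[FRAME_SIZE * i : FRAME_SIZE * (i + 1)]
        # Undo the per-slot shift so every code falls in [0, CODEBOOK_SIZE).
        slots = [code - k * CODEBOOK_SIZE for k, code in enumerate(frame)]
        layer_1.append(slots[0])
        layer_2.extend([slots[1], slots[4]])
        layer_3.extend([slots[2], slots[3], slots[5], slots[6]])
    return layer_1, layer_2, layer_3
```

To get a waveform, each layer would be wrapped as `torch.tensor(layer).unsqueeze(0)` and the three tensors passed as a list to the `decode` method of a SNAC model loaded with `snac.SNAC.from_pretrained("hubertsiuzdak/snac_24khz")`; the result can then be written with `soundfile.write` at 24000 Hz. If each frame covers 2048 samples, as is assumed here for the 24 kHz SNAC codec, `max_new_tokens=1200` yields at most 171 frames, or roughly 14.6 seconds of audio.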

Emotional Speech Tags

The model supports speech emotion/style tags (inherited from Orpheus):

<laugh>   <chuckle>   <sigh>   <cough>   <sniffle>
<groan>   <yawn>      <gasp>

Example: "সে বলল <laugh> এটা সত্যিই মজার!"
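Since a mistyped tag would simply be read as text, it can help to check input before generation. The following is a small sketch that flags tags outside the list above; `find_unsupported_tags` is a hypothetical helper, and the assumption that tags are lowercase ASCII words in angle brackets comes from the examples in this section.

```python
import re
from typing import List

# Tags listed in this model card
SUPPORTED_TAGS = {
    "laugh", "chuckle", "sigh", "cough",
    "sniffle", "groan", "yawn", "gasp",
}
# Assumed tag syntax: a lowercase ASCII word in angle brackets
TAG_PATTERN = re.compile(r"<([a-z]+)>")


def find_unsupported_tags(text: str) -> List[str]:
    """Return tag names found in `text` that are not in SUPPORTED_TAGS."""
    return [tag for tag in TAG_PATTERN.findall(text) if tag not in SUPPORTED_TAGS]
```

An empty result means every tag in the text is one of the eight listed above.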

Training Details

Framework: Unsloth + HuggingFace Trainer
Method: LoRA (Low-Rank Adaptation)
Dataset: ~99K Bangla samples from multiple speakers, including the Adiba dataset
Hardware: H100 GPU on Modal
Training Steps: 30,000

Comparison with Small Data Version

Model	Training Data	Steps	Notes
This model (high data)	~99K samples	30,000	Better prosody, more natural
Small data version	~39K samples	4,500	Faster to infer, lighter

Limitations

Requires a Hugging Face token to access the gated base model
Generation can be slow on CPU (recommended: GPU with 16GB+ VRAM)
May occasionally mix English phonemes for loanwords

Citation

If you use this model, please cite:

@misc{emtiaz2026orpheusbangla,
  author = {Emtiaz Uddin Ahmed},
  title  = {Orpheus 3B Bangla High-Data Fine-tune},
  year   = {2026},
  url    = {https://huggingface.co/EMTIAZZ/orpheus-3b-bangla-high-data-finetuning}
}

