Text-to-Speech
PEFT
Safetensors
Bengali
tts
bangla
bengali
orpheus
lora
fine-tuning
low-resource
speech-synthesis
Orpheus 3B — Bangla TTS (High Data Fine-tune)
A fine-tuned version of Orpheus 3B for Bangla (Bengali) text-to-speech, using LoRA adapters. It was trained on ~99K Bangla speech samples for 30,000 steps on an H100 GPU.
Model Details
| Property | Value |
|---|---|
| Base Model | canopylabs/orpheus-3b-0.1-pretrained |
| Architecture | Llama 3B + LoRA adapters |
| Training Data | ~99,000 Bangla speech samples (combined dataset) |
| Training Steps | 30,000 |
| Audio Codec | SNAC 24kHz |
| Training Platform | Modal (H100 GPU) with Unsloth |
| Language | Bangla (bn) |
| License | Apache 2.0 |
What is Orpheus?
Orpheus TTS is a Llama-based text-to-speech model that generates audio as interleaved SNAC codec tokens. It supports emotional speech tags like <laugh>, <sigh>, <gasp>, etc.
Usage
Note: The base model canopylabs/orpheus-3b-0.1-pretrained is gated; you need a Hugging Face token with approved access.
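Because the base model is gated, the token has to reach every `from_pretrained` call. The following is a minimal sketch of reading it from the environment instead of pasting it into the script. The helper name `get_hf_token` is hypothetical, and the choice of the `HF_TOKEN` variable is an assumption about how you store the token.

```python
import os


def get_hf_token(env_var: str = "HF_TOKEN") -> str:
    """Return the Hugging Face token stored in an environment variable.

    `get_hf_token` is a hypothetical helper, not part of any library.
    """
    token = os.environ.get(env_var, "").strip()
    if not token:
        raise RuntimeError(
            f"Set {env_var} to a Hugging Face token that has access "
            "to canopylabs/orpheus-3b-0.1-pretrained."
        )
    return token
```

The returned string can be passed wherever `token="YOUR_HF_TOKEN"` appears in the usage code.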
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import snac
import numpy as np
import soundfile as sf

# 1. Load base model
base_model_id = "canopylabs/orpheus-3b-0.1-pretrained"
tokenizer = AutoTokenizer.from_pretrained(base_model_id, token="YOUR_HF_TOKEN")
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    torch_dtype=torch.float16,
    token="YOUR_HF_TOKEN",
)

# 2. Resize embeddings (vocab size mismatch fix)
model.resize_token_embeddings(156940)

# 3. Load LoRA adapter and merge it into the base weights
model = PeftModel.from_pretrained(
    model,
    "EMTIAZZ/orpheus-3b-bangla-high-data-finetuning",
    token="YOUR_HF_TOKEN",
)
model = model.merge_and_unload()

# Move the model to the GPU when one is available (CPU generation is slow)
device = "cuda" if torch.cuda.is_available() else "cpu"
model = model.to(device)
model.eval()

# 4. Generate speech tokens
text = "আমি বাংলায় কথা বলতে পারি।"
prompt = f"<|begin_of_text|><|start_header_id|>user<|end_header_id|>\n\n{text}<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
inputs = tokenizer(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=1200,
        do_sample=True,
        temperature=0.6,
        top_p=0.95,
        repetition_penalty=1.1,
    )
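The snippet above stops at `model.generate`, which returns token ids rather than audio. The following is a minimal sketch of the next step, de-interleaving those ids into the three SNAC code layers. The constants (first audio token id 128266, 4096 entries per codebook, 7 tokens per frame, 2048 samples per frame) and the position-to-layer mapping follow the upstream Orpheus inference examples; they are assumptions and are not confirmed by this model card. The function names are hypothetical.

```python
# All constants below are assumptions taken from the upstream Orpheus
# examples, not values stated by this model card.
AUDIO_TOKEN_OFFSET = 128266  # id of the first audio token
CODEBOOK_SIZE = 4096         # entries per SNAC codebook
FRAME_SIZE = 7               # generated tokens per audio frame
SAMPLES_PER_FRAME = 2048     # audio samples decoded from one frame
SAMPLE_RATE = 24000          # SNAC 24 kHz codec


def split_frames(token_ids):
    """De-interleave generated token ids into three SNAC code layers."""
    upper = AUDIO_TOKEN_OFFSET + FRAME_SIZE * CODEBOOK_SIZE
    # Keep audio tokens only; text and control tokens fall outside the range.
    codes = [t - AUDIO_TOKEN_OFFSET for t in token_ids
             if AUDIO_TOKEN_OFFSET <= t < upper]
    layer_1, layer_2, layer_3 = [], [], []
    # A trailing incomplete frame is dropped by the integer division.
    for i in range(len(codes) // FRAME_SIZE):
        frame = codes[FRAME_SIZE * i: FRAME_SIZE * (i + 1)]
        # Position k within a frame is shifted by k * CODEBOOK_SIZE.
        frame = [c - k * CODEBOOK_SIZE for k, c in enumerate(frame)]
        layer_1.append(frame[0])
        layer_2.extend([frame[1], frame[4]])
        layer_3.extend([frame[2], frame[3], frame[5], frame[6]])
    return layer_1, layer_2, layer_3


def estimated_seconds(n_audio_tokens):
    """Rough audio duration for a given number of generated audio tokens."""
    return (n_audio_tokens // FRAME_SIZE) * SAMPLES_PER_FRAME / SAMPLE_RATE
```

To use it, pass the newly generated ids, for example `outputs[0][inputs["input_ids"].shape[1]:].tolist()`. Each returned list can then be wrapped in a tensor of shape `(1, N)` and handed to the SNAC decoder (`snac.SNAC.from_pretrained("hubertsiuzdak/snac_24khz")`, then `decode([...])`), and the resulting waveform written with `soundfile.write` at 24000 Hz. Under these assumptions, `max_new_tokens=1200` caps the output at roughly 14.6 seconds of audio.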
Emotional Speech Tags
The model supports speech emotion/style tags (inherited from Orpheus):
<laugh> <chuckle> <sigh> <cough> <sniffle>
<groan> <yawn> <gasp>
Example: "সে বলল <laugh> এটা সত্যিই মজার!"
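Since only the tags listed above are described as supported, it can help to check input text before synthesis. The following is a minimal sketch of such a check; `SUPPORTED_TAGS` mirrors the list in this section, and `unsupported_tags` is a hypothetical helper, not part of Orpheus or any library.

```python
import re

# Tags listed in this model card as supported.
SUPPORTED_TAGS = {
    "laugh", "chuckle", "sigh", "cough", "sniffle", "groan", "yawn", "gasp",
}


def unsupported_tags(text: str) -> list[str]:
    """Return the sorted, de-duplicated tag names in `text` that are not supported."""
    found = re.findall(r"<([^<>\s]+)>", text)
    return sorted({tag for tag in found if tag not in SUPPORTED_TAGS})
```

An empty result means every `<...>` tag in the text is one of the eight listed above; how the model treats any other tag is not documented here.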
Training Details
- Framework: Unsloth + HuggingFace Trainer
- Method: LoRA (Low-Rank Adaptation)
- Dataset: ~99K Bangla samples from multiple speakers, including the Adiba dataset
- Hardware: H100 GPU on Modal
- Training Length: 30,000 steps
Comparison with Small Data Version
| Model | Training Data | Steps | Quality |
|---|---|---|---|
| This model (high data) | ~99K samples | 30,000 | Better prosody, more natural |
| Small data version | ~39K samples | 4,500 | Faster to infer, lighter |
Limitations
- Requires a Hugging Face token to access the gated base model
- Generation can be slow on CPU (recommended: a GPU with 16 GB+ VRAM)
- May occasionally use English phonemes for loanwords
Citation
If you use this model, please cite:
@misc{emtiaz2026orpheusbangla,
  author = {Emtiaz Uddin Ahmed},
  title  = {Orpheus 3B Bangla High-Data Fine-tune},
  year   = {2026},
  url    = {https://huggingface.co/EMTIAZZ/orpheus-3b-bangla-high-data-finetuning}
}
Author
Model tree for EMTIAZZ/orpheus-3b-bangla-high-data-finetuning
Base model
meta-llama/Llama-3.2-3B-Instruct Finetuned
canopylabs/orpheus-3b-0.1-pretrained