---
license: agpl-3.0
datasets:
- MrDragonFox/Elise
language:
- en
base_model:
- sesame/csm-1b
pipeline_tag: text-to-speech
library_name: transformers
tags:
- generative-ai
---
# CSM Elise Voice Model LoRA

This model is a LoRA fine-tune of [sesame/csm-1b](https://huggingface.co/sesame/csm-1b) on the [Elise dataset](https://huggingface.co/datasets/MrDragonFox/Elise). Sample output files are included in the repository.

The sound quality seems better than with full-parameter fine-tuning, although more tuning is needed for consistent performance. In the samples, two distinct voice styles (soft and vibrant) can be heard depending on the prompt. Performance on longer inputs (more tokens) still needs to be validated.

A larger training set would be needed for a more consistent voice, as the current dataset is small and limited.

## Model Details
- **Base Model**: sesame/csm-1b
- **Training Data**: MrDragonFox/Elise dataset
- **Fine-tuning Approach**: Voice cloning through conditional speech generation using LoRA
- **Voice Characteristics**: Two distinct styles (soft and vibrant) heard in the samples, depending on the prompt
- **Training Parameters**:
  - Learning Rate: 1e-5
  - Epochs: 4
  - Batch Size: 1 with gradient accumulation steps of 4
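
As a rough illustration of the batch settings above: with a batch size of 1 and 4 gradient accumulation steps, each optimizer update sees 4 samples. The sketch below works this out. The function names, the single-device default, and the dataset size of 1,000 samples are hypothetical and used only for illustration; they are not figures from the actual training run, and a real trainer may round step counts slightly differently.

```python
import math


def effective_batch_size(per_device_batch_size: int,
                         grad_accum_steps: int,
                         num_devices: int = 1) -> int:
    """Number of samples contributing to one optimizer update."""
    return per_device_batch_size * grad_accum_steps * num_devices


def optimizer_steps(num_samples: int, epochs: int, effective_batch: int) -> int:
    """Approximate number of optimizer updates over the whole run."""
    return epochs * math.ceil(num_samples / effective_batch)


# Settings from this model card: batch size 1, accumulation 4, 4 epochs
effective = effective_batch_size(per_device_batch_size=1, grad_accum_steps=4)
print(effective)  # 4

# Hypothetical dataset size, for illustration only
print(optimizer_steps(num_samples=1000, epochs=4, effective_batch=effective))  # 1000
```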

## Quick Start

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from peft import PeftModel
import soundfile as sf
from IPython.display import Audio, display

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# Load model and processor
base_model_id = "sesame/csm-1b"
adapter_model_id = "keanteng/sesame-csm-elise-lora"  # LoRA adapter repository on the Hugging Face Hub

# Load processor
processor = AutoProcessor.from_pretrained(base_model_id)

# Load base model
base_model = CsmForConditionalGeneration.from_pretrained(
    base_model_id, 
    device_map=device,
    torch_dtype=torch.float16  # Use half precision for faster inference
)

# Load adapter and merge weights
model = PeftModel.from_pretrained(base_model, adapter_model_id)
model = model.merge_and_unload()  # Merge adapter weights into base model

# Optimize for generation
model.generation_config.max_length = 256
model.generation_config.use_cache = True
model.generation_config.cache_implementation = "static"

if hasattr(model, "depth_decoder"):
    model.depth_decoder.generation_config.cache_implementation = "static"
```

```python
# Define a simple input
conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": "Hello! I'm so happy to see you today!"}
    ]},
]

# Process input
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# Generate audio
audio = model.generate(**inputs, output_audio=True)

# Convert to numpy and save
audio_cpu = audio[0].to(torch.float32).cpu().numpy()
output_file = "output.wav"
sf.write(output_file, audio_cpu, 24000)

# Play audio inline when running in a notebook; otherwise just report the path
try:
    display(Audio(output_file))
except Exception:
    print(f"Audio saved to {output_file}")
```
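
Because behavior on longer inputs has not been validated and the Quick Start caps generation with `max_length = 256`, one possible workaround is to split long text into sentence-sized chunks and generate each chunk separately. The helper below is a minimal sketch of that idea, not part of this model or of `transformers`: `split_into_chunks` is a hypothetical name, and the character limit is an assumed stand-in for a token budget that would need tuning.

```python
import re


def split_into_chunks(text: str, max_chars: int = 200) -> list[str]:
    """Group whole sentences into chunks of at most max_chars characters.

    A single sentence longer than max_chars is kept as its own chunk.
    """
    # Split after sentence-ending punctuation followed by whitespace
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        if not sentence:
            continue
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would exceed the limit: start a new chunk
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


long_text = (
    "Hello! I'm so happy to see you today! "
    "It has been a long week. Shall we get started?"
)
for chunk in split_into_chunks(long_text, max_chars=40):
    print(chunk)
```

Each chunk could then be placed in the `"text"` field of its own `conversation` and generated as in the example above. Whether the voice stays consistent across separately generated chunks is untested here.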