---
license: agpl-3.0
datasets:
- MrDragonFox/Elise
language:
- en
base_model:
- sesame/csm-1b
pipeline_tag: text-to-speech
library_name: transformers
tags:
- generative-ai
---

# CSM Elise Voice Model LoRA

This model is a fine-tuned version of [sesame/csm-1b](https://huggingface.co/sesame/csm-1b), trained on the [Elise dataset](https://huggingface.co/datasets/MrDragonFox/Elise) with LoRA. Sample output files are included in the repository.

The sound quality seems better than with full-parameter fine-tuning, but more tweaking is needed to ensure consistent performance. In the samples, two distinct voice styles (soft and vibrant) can be heard depending on the prompt. Performance on longer inputs (more tokens) still needs to be validated. A larger training set would likely be required for a more consistent voice, since the current dataset is small and limited.

## Model Details

- **Base Model**: sesame/csm-1b
- **Training Data**: MrDragonFox/Elise dataset
- **Fine-tuning Approach**: Voice cloning through conditional speech generation using LoRA
- **Voice Characteristics**: [Describe voice qualities]
- **Training Parameters**:
  - Learning Rate: 1e-5
  - Epochs: 4
  - Batch Size: 1 with gradient accumulation steps of 4

## Quick Start

```python
import torch
from transformers import CsmForConditionalGeneration, AutoProcessor
from peft import PeftModel
import soundfile as sf
from IPython.display import Audio, display

# Device setup
device = "cuda" if torch.cuda.is_available() else "cpu"

# Model identifiers
base_model_id = "sesame/csm-1b"
adapter_model_id = "keanteng/sesame-csm-elise-lora"  # LoRA adapter repository

# Load processor
processor = AutoProcessor.from_pretrained(base_model_id)

# Load base model
base_model = CsmForConditionalGeneration.from_pretrained(
    base_model_id,
    device_map=device,
    torch_dtype=torch.float16  # Use half precision for faster inference
)

# Load adapter and merge its weights into the base model
model = PeftModel.from_pretrained(base_model, adapter_model_id)
model = model.merge_and_unload()

# Optimize for generation
model.generation_config.max_length = 256
model.generation_config.use_cache = True
model.generation_config.cache_implementation = "static"
if hasattr(model, "depth_decoder"):
    model.depth_decoder.generation_config.cache_implementation = "static"
```

```python
# Define a simple input (role "0" is the speaker ID)
conversation = [
    {"role": "0", "content": [
        {"type": "text", "text": "Hello! I'm so happy to see you today!"}
    ]},
]

# Process input
inputs = processor.apply_chat_template(
    conversation,
    tokenize=True,
    return_dict=True,
).to(device)

# Generate audio
audio = model.generate(**inputs, output_audio=True)

# Convert to a float32 numpy array and save at 24 kHz
audio_cpu = audio[0].to(torch.float32).cpu().numpy()
output_file = "output.wav"
sf.write(output_file, audio_cpu, 24000)

# Play audio if in a notebook
try:
    display(Audio(output_file))
except Exception:
    print(f"Audio saved to {output_file}")
```
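
Since performance on longer inputs has not been validated, one possible workaround is to split long text into sentence-sized chunks and generate audio for each chunk separately. The sketch below only covers the text-splitting step. It is an assumption, not part of this model or of the `transformers` API: the helper name `split_text` and the `max_chars=200` default are hypothetical and have not been tuned for this model.

```python
import re


def split_text(text, max_chars=200):
    """Group sentences into chunks of roughly max_chars characters.

    Hypothetical helper: the name and the 200-character default are
    illustrative assumptions. A single sentence longer than max_chars
    is kept whole rather than cut mid-sentence.
    """
    # Split after sentence-ending punctuation followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the current chunk.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk could then be used as the `"text"` entry of a conversation like the one in the Quick Start, with the resulting waveforms concatenated afterwards. Whether the voice stays consistent across separately generated chunks is untested.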