---
license: apache-2.0
tags:
- voxtral
- asr
- speech-to-text
- fine-tuning
- tonic

pipeline_tag: automatic-speech-recognition
base_model: {{base_model}}
{{#if has_hub_dataset_id}}
datasets:
- {{dataset_name}}
{{/if}}
{{#if author_name}}
author: {{author_name}}
{{/if}}
{{#if training_config_type}}
training_config: {{training_config_type}}
{{/if}}
{{#if trainer_type}}
trainer_type: {{trainer_type}}
{{/if}}
{{#if batch_size}}
batch_size: {{batch_size}}
{{/if}}
{{#if gradient_accumulation_steps}}
gradient_accumulation_steps: {{gradient_accumulation_steps}}
{{/if}}
{{#if learning_rate}}
learning_rate: {{learning_rate}}
{{/if}}
{{#if max_epochs}}
max_epochs: {{max_epochs}}
{{/if}}
{{#if max_seq_length}}
max_seq_length: {{max_seq_length}}
{{/if}}
{{#if hardware_info}}
hardware: "{{hardware_info}}"
{{/if}}
language:
- hi
- en
- fr
- de
- it
- pt
- nl
library_name: peft
---

# {{model_name}}

{{model_description}}

## Usage

```python
import torch
from transformers import AutoProcessor, AutoModelForSeq2SeqLM
import soundfile as sf

processor = AutoProcessor.from_pretrained("{{repo_name}}")
model = AutoModelForSeq2SeqLM.from_pretrained(
    "{{repo_name}}",
    torch_dtype=torch.float16 if torch.cuda.is_available() else torch.float32
)

audio, sr = sf.read("sample.wav")
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")
with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=256)
text = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(text)
```

## Training Configuration

- Base model: {{base_model}}
{{#if training_config_type}}- Config: {{training_config_type}}{{/if}}
{{#if trainer_type}}- Trainer: {{trainer_type}}{{/if}}

## Training Parameters

- Batch size: {{batch_size}}
- Grad accumulation: {{gradient_accumulation_steps}}
- Learning rate: {{learning_rate}}
- Max epochs: {{max_epochs}}
- Sequence length: {{max_seq_length}}

## Hardware

- {{hardware_info}}

## Notes

- This repository contains a fine-tuned Voxtral ASR model.