---
base_model:
- openbmb/MiniCPM-Llama3-V-2_5
datasets:
- MBZUAI/VideoInstruct-100K
- Share14/ShareGemini
- xjtupanda/T2Vid-Synthetic
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- MiniCPM-V
- finetune
- MLLM
---

# Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

💻 [GitHub](https://github.com/VITA-MLLM/Sparrow) | 📑 [Paper](https://arxiv.org/pdf/2411.19951)

## Model Summary

This model, named Sparrow, was presented in the paper "Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation". It builds upon the success of Multimodal Large Language Models (MLLMs) in vision understanding, specifically addressing the challenge of data efficiency in video-LLMs.

The paper revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Preliminary experiments revealed a low learning efficiency when simply scaling up video data samples, which was attributed to a lack of instruction diversity.

**Key Highlights:**
- **Data Augmentation Method (Sparrow):** Proposes a novel data augmentation method that synthesizes video-like samples from pure text instruction data.
- **Efficient Training:** Mixing these synthetic samples with real video data enables a more efficient training scheme, achieving performance comparable to or even superior to baselines trained with significantly more samples.
- **Long Video Understanding:** Demonstrates that incorporating these synthetic samples can enhance the performance of long video understanding without requiring explicit training on long video data.

The video-LLM is fine-tuned from the image-LLM [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5).

## How to Use

You can use the Sparrow model with the `transformers` library. For more detailed instructions on video loading (e.g., extracting frames) and advanced usage scenarios, please refer to the project's [GitHub repository](https://github.com/VITA-MLLM/Sparrow).

First, ensure you have the necessary dependencies installed:

```bash
pip install transformers torch accelerate
pip install -U flash-attn --no-build-isolation # For efficient training and inference
```

Here's a quick example to get started with inference using an image as a proxy for video frames. In a full video-LLM setup, you would process a sequence of frames from a video.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
import torch
from PIL import Image
import requests
from io import BytesIO

# The model is fine-tuned from openbmb/MiniCPM-Llama3-V-2_5
model_id = "openbmb/MiniCPM-Llama3-V-2_5"

# Load model, tokenizer, and processor
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.bfloat16, device_map="cuda", trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

model.eval()

# --- Example: Using a single image as a proxy for a video frame ---
# For a real video-text-to-text task, you would typically:
# 1. Load a video using libraries like `decord` or `imageio`.
# 2. Extract a sequence of representative frames from the video.
# 3. Pass these frames as a list of PIL Images to the processor.

# For demonstration, we use a single dummy image:
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
response = requests.get(image_url)
image_frame = Image.open(BytesIO(response.content)).convert("RGB")

# Prepare the conversation input with image and text
# For video input, 'content' would be a list like: [frame1, frame2, ..., question_text]
question = "Describe the scene shown in this image in detail."
messages = [{'role': 'user', 'content': [image_frame, question]}]

# Process inputs
inputs = processor(messages, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate text response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.05,
    )

response_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response_text)
```

## License

#### Model License

* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-V series model weights must strictly follow [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, are also available for free commercial use.

#### Statement
* As an LLM, MiniCPM-Llama3-V 2.5 generates contents by learning a large mount of texts, but it cannot comprehend, express personal opinions or make value judgement. Anything generated by MiniCPM-Llama3-V 2.5 does not represent the views and positions of the model developers
* We will not be liable for any problems arising from the use of the MinCPM-V open Source model, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, dissemination or misuse of the model.

## Training dataset
- 10K video instruction data from Video-ChatGPT
- 10K video caption data from ShareGemini
- 10K synthetic data derived from long text instruction data