---
base_model:
- openbmb/MiniCPM-Llama3-V-2_5
datasets:
- MBZUAI/VideoInstruct-100K
- Share14/ShareGemini
- xjtupanda/T2Vid-Synthetic
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- MiniCPM-V
- finetune
- MLLM
---

# Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

💻 [GitHub](https://github.com/VITA-MLLM/Sparrow) | 📑 [Paper](https://arxiv.org/pdf/2411.19951)

## Model Summary

This model, named Sparrow, was presented in the paper "Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation". It builds upon the success of Multimodal Large Language Models (MLLMs) in vision understanding, specifically addressing the challenge of data efficiency in video-LLMs. The paper revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Preliminary experiments revealed a low learning efficiency when simply scaling up video data samples, which was attributed to a lack of instruction diversity.

**Key Highlights:**

- **Data Augmentation Method (Sparrow):** Proposes a novel data augmentation method that synthesizes video-like samples from pure text instruction data.
- **Efficient Training:** Mixing these synthetic samples with real video data enables a more efficient training scheme, achieving performance comparable to or even superior to baselines trained with significantly more samples.
- **Long Video Understanding:** Demonstrates that incorporating these synthetic samples can enhance the performance of long video understanding without requiring explicit training on long video data.

The video-LLM is fine-tuned from the image-LLM [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5).

## How to Use

You can use the Sparrow model with the `transformers` library. For more detailed instructions on video loading (e.g., extracting frames) and advanced usage scenarios, please refer to the project's [GitHub repository](https://github.com/VITA-MLLM/Sparrow).
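The frame extraction mentioned above can be sketched roughly as follows. This is a minimal illustration, not the repository's loader: the helper names `uniform_frame_indices` and `load_video_frames`, the default of 16 frames, and the choice of uniform sampling are all assumptions made for the example. The second helper needs `decord` and `Pillow` installed.

```python
def uniform_frame_indices(total_frames, num_frames):
    """Pick `num_frames` indices spread evenly over a video (hypothetical helper).

    Each index is the centre of one of `num_frames` equal-length segments.
    If the video has fewer frames than requested, every frame is returned.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]


def load_video_frames(video_path, num_frames=16):
    """Decode evenly spaced frames as PIL images (assumes decord and Pillow)."""
    # Imported lazily so the index helper above works without these packages.
    from decord import VideoReader, cpu
    from PIL import Image

    reader = VideoReader(video_path, ctx=cpu(0))
    indices = uniform_frame_indices(len(reader), num_frames)
    batch = reader.get_batch(indices).asnumpy()  # shape: (N, H, W, 3)
    return [Image.fromarray(frame) for frame in batch]


if __name__ == "__main__":
    # Index selection only; no video file is needed for this check.
    print(uniform_frame_indices(100, 4))
```

The resulting list of PIL images is the kind of frame sequence a video-LLM consumes in place of a single image; the frame count is a trade-off between temporal coverage and context length.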

First, ensure you have the necessary dependencies installed:

```bash
pip install transformers torch accelerate
pip install -U flash-attn --no-build-isolation  # For efficient training and inference
```

Here's a quick example to get started with inference using an image as a proxy for video frames. In a full video-LLM setup, you would process a sequence of frames from a video.

```python
from io import BytesIO

import requests
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor, AutoTokenizer

# The model is fine-tuned from openbmb/MiniCPM-Llama3-V-2_5.
# Note: this ID loads the base image-LLM; replace it with the Sparrow
# checkpoint ID to run the fine-tuned weights.
model_id = "openbmb/MiniCPM-Llama3-V-2_5"

# Load model, tokenizer, and processor
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="cuda",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model.eval()

# --- Example: Using a single image as a proxy for a video frame ---
# For a real video-text-to-text task, you would typically:
# 1. Load a video using libraries like `decord` or `imageio`.
# 2. Extract a sequence of representative frames from the video.
# 3. Pass these frames as a list of PIL Images to the processor.

# For demonstration, we use a single dummy image:
image_url = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/transformers/tasks/car.jpg"
response = requests.get(image_url, timeout=30)
response.raise_for_status()  # Fail early if the download did not succeed
image_frame = Image.open(BytesIO(response.content)).convert("RGB")

# Prepare the conversation input with image and text
# For video input, 'content' would be a list like: [frame1, frame2, ..., question_text]
question = "Describe the scene shown in this image in detail."
messages = [{'role': 'user', 'content': [image_frame, question]}]

# Process inputs and move them to the model's device
inputs = processor(messages, return_tensors="pt")
inputs = {k: v.to(model.device) for k, v in inputs.items()}

# Generate text response
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=100,
        do_sample=False,
        repetition_penalty=1.05,
    )

# Decode only the newly generated tokens, skipping the prompt
response_text = tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True)
print(response_text)
```

## License

#### Model License

* The code in this repo is released under the [Apache-2.0](https://github.com/OpenBMB/MiniCPM/blob/main/LICENSE) License.
* The usage of MiniCPM-V series model weights must strictly follow the [MiniCPM Model License.md](https://github.com/OpenBMB/MiniCPM/blob/main/MiniCPM%20Model%20License.md).
* The models and weights of MiniCPM are completely free for academic research. After filling out a ["questionnaire"](https://modelbest.feishu.cn/share/base/form/shrcnpV5ZT9EJ6xYjh3Kx0J6v8g) for registration, they are also available for free commercial use.

#### Statement

* As an LLM, MiniCPM-Llama3-V 2.5 generates content by learning from a large amount of text, but it cannot comprehend, express personal opinions, or make value judgments. Anything generated by MiniCPM-Llama3-V 2.5 does not represent the views and positions of the model developers.
* We will not be liable for any problems arising from the use of the MiniCPM-V open-source model, including but not limited to data security issues, risk of public opinion, or any risks and problems arising from the misdirection, misuse, or dissemination of the model.

## Training dataset

- 10K video instruction data from Video-ChatGPT
- 10K video caption data from ShareGemini
- 10K synthetic data derived from long text instruction data
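
To make the synthetic portion more concrete, here is a rough sketch of how a long text instruction sample could be turned into a video-like sample: the long context is cut into segments, and each segment is rendered as an image that plays the role of a frame. This is only an illustration under assumptions. The helper names, the 40-word segment size, the 448x448 canvas, and the output dictionary layout are all hypothetical, and the actual synthesis pipeline is the one described in the paper and the GitHub repository.

```python
import textwrap


def split_into_segments(text, words_per_segment=40):
    """Cut a long context into consecutive word chunks (hypothetical helper)."""
    words = text.split()
    return [
        " ".join(words[i:i + words_per_segment])
        for i in range(0, len(words), words_per_segment)
    ]


def render_segment(segment, size=(448, 448), chars_per_line=50):
    """Draw one text segment onto a blank image (assumes Pillow is installed)."""
    # Imported lazily so the splitting helper works without Pillow.
    from PIL import Image, ImageDraw

    image = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(image)
    wrapped = textwrap.fill(segment, width=chars_per_line)
    draw.multiline_text((10, 10), wrapped, fill="black")
    return image


def build_video_like_sample(context, instruction, answer, words_per_segment=40):
    """Pair rendered 'frames' with the original instruction and answer."""
    segments = split_into_segments(context, words_per_segment)
    return {
        "frames": [render_segment(segment) for segment in segments],
        "instruction": instruction,
        "answer": answer,
    }


if __name__ == "__main__":
    # Text splitting only; rendering additionally requires Pillow.
    print(split_into_segments("one two three four five", words_per_segment=2))
```

Because the instruction and answer come unchanged from the text data, a sample built this way carries the instruction diversity of the text corpus while having the multi-frame shape of a video sample.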