---
base_model:
- openbmb/MiniCPM-Llama3-V-2_5
datasets:
- MBZUAI/VideoInstruct-100K
- Share14/ShareGemini
library_name: transformers
license: apache-2.0
pipeline_tag: video-text-to-text
tags:
- MiniCPM-V
- finetune
- MLLM
---

# Sparrow: Data-Efficient Video-LLM with Text-to-Image Augmentation

## Model Summary

This model is a part of the project [Sparrow](https://github.com/VITA-MLLM/Sparrow). It's a video-LLM fine-tuned from the image-LLM [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5).

**Abstract:** Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the study of the effectiveness of scaling with these data has been neglected for a long time. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples. Meanwhile, we find that incorporating these synthetic samples can enhance the performance of long video understanding without requiring training on long video data.

## Sample Usage

This model is designed for video-language understanding.
You can load it using the `transformers` library. Ensure `trust_remote_code=True` is set for proper model loading. For video input, you will typically provide a list of image frames (PIL Images).

**Prerequisites:** You might need `decord` to easily load video frames. Install it via `pip install decord`.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
from PIL import Image
import torch
import numpy as np
from decord import VideoReader, cpu  # For video loading

# Load model and processor
model_id = "VITA-MLLM/Sparrow-Llama3-V-2_5"  # Replace with the actual model ID if different
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,  # Use bfloat16 for better performance/memory
    device_map="auto",
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

# --- Example: Load video frames ---
video_path = "path/to/your/video.mp4"  # <--- IMPORTANT: Replace with your video file path!
video_frames = []
try:
    vr = VideoReader(video_path, ctx=cpu(0))
    # Sample a maximum of 32 frames uniformly for demonstration
    total_frames = len(vr)
    num_frames_to_sample = min(total_frames, 32)
    frame_indices = np.linspace(0, total_frames - 1, num_frames_to_sample, dtype=int)
    # Cast each NumPy integer to a plain Python int before indexing the reader
    video_frames = [Image.fromarray(vr[int(i)].asnumpy()) for i in frame_indices]
    print(f"Loaded {len(video_frames)} frames from {video_path}")
except Exception as e:
    print(f"Could not load video from {video_path}: {e}")
    print("Using placeholder images for demonstration. Please provide a valid video file.")
    video_frames = [Image.new("RGB", (224, 224), color="blue")] * 4  # Fallback to placeholder images

# --- Prepare prompt with video frames ---
# The