xjtupanda/MiniCPM-V-200K-video-finetune

@@ -22,9 +22,97 @@ tags:
 ## Model Summary
-This model is a part of the project [Sparrow](https://github.com/VITA-MLLM/Sparrow).  It's a video-LLM fine-tuned from the image-LLM
 [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5).
 ## License

 ## Model Summary
+This model is a part of the project [Sparrow](https://github.com/VITA-MLLM/Sparrow). It's a video-LLM fine-tuned from the image-LLM
 [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5).
+**Abstract:**
+Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the study of the effectiveness of scaling with these data has been neglected for a long time. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples. Meanwhile, we find that incorporating these synthetic samples can enhance the performance of long video understanding without requiring training on long video data.
+## Sample Usage
+This model is designed for video-language understanding. You can load it using the `transformers` library. Ensure `trust_remote_code=True` is set for proper model loading. For video input, you will typically provide a list of image frames (PIL Images).
+**Prerequisites:**
+You might need `decord` to easily load video frames. Install it via `pip install decord`.
+```python
+from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
+from PIL import Image
+import torch
+import numpy as np
+from decord import VideoReader, cpu # For video loading
+# Load model and processor
+model_id = "xjtupanda/MiniCPM-V-200K-video-finetune" # Assumed from this repository's name; replace with the actual model ID if different
+model = AutoModelForCausalLM.from_pretrained(
+    model_id,
+    torch_dtype=torch.bfloat16, # Use bfloat16 for better performance/memory
+    device_map="auto",
+    trust_remote_code=True
+)
+tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+# --- Example: Load video frames ---
+video_path = "path/to/your/video.mp4" # <--- IMPORTANT: Replace with your video file path!
+video_frames = []
+try:
+    vr = VideoReader(video_path, ctx=cpu(0))
+    # Sample a maximum of 32 frames uniformly for demonstration
+    total_frames = len(vr)
+    num_frames_to_sample = min(total_frames, 32)
+    frame_indices = np.linspace(0, total_frames - 1, num_frames_to_sample, dtype=int)
+    video_frames = [Image.fromarray(vr[int(i)].asnumpy()) for i in frame_indices] # Cast NumPy ints to plain Python ints before indexing
+    print(f"Loaded {len(video_frames)} frames from {video_path}")
+except Exception as e:
+    print(f"Could not load video from {video_path}: {e}")
+    print("Using placeholder images for demonstration. Please provide a valid video file.")
+    video_frames = [Image.new("RGB", (224, 224), color="blue")] * 4 # Fallback to placeholder images
+# --- Prepare prompt with video frames ---
+# The <video> tag is specific to MiniCPM-V models for indicating video/image input.
+# It should be repeated for each image frame provided.
+messages = [
+    {"role": "user", "content": "<video>" * len(video_frames) + "\nDescribe this video in detail."}
+]
+# Apply chat template and tokenize inputs
+inputs = processor.apply_chat_template(
+    messages,
+    video=video_frames, # Pass the list of PIL Images here
+    tokenize=True,
+    add_generation_prompt=True,
+    return_tensors="pt"
+)
+# Move inputs to appropriate device (e.g., GPU)
+inputs = {k: v.to(model.device) for k, v in inputs.items()}
+# --- Generate response ---
+with torch.no_grad():
+    generated_ids = model.generate(
+        input_ids=inputs["input_ids"],
+        attention_mask=inputs["attention_mask"],
+        image_pixel_values=inputs["image_pixel_values"], # Essential for vision inputs
+        max_new_tokens=256, # Adjust as needed
+        do_sample=True,
+        temperature=0.7,
+        top_p=0.9,
+    )
+# Decode and print the output
+response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+# Clean up any potential chat template artifacts at the beginning/end
+response = response.split('<|start_header_id|>assistant<|end_header_id|>')[-1].strip()
+print("\nGenerated Response:")
+print(response)
+```
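The uniform frame sampling step in the snippet above can also be checked on its own, without a video file or `decord`. The following is a minimal sketch in pure Python; the helper name `uniform_frame_indices` and its `max_frames` parameter are hypothetical and not part of this model's API. It uses integer arithmetic to mirror the intent of the `np.linspace(..., dtype=int)` call, so results may differ slightly from NumPy in rare rounding cases.

```python
from typing import List


def uniform_frame_indices(total_frames: int, max_frames: int = 32) -> List[int]:
    """Return up to `max_frames` frame indices spread evenly over a video.

    Hypothetical helper for illustration only; it is not part of the
    model's or decord's API.
    """
    if total_frames <= 0 or max_frames <= 0:
        return []  # Nothing to sample from an empty video
    num = min(total_frames, max_frames)
    if num == 1:
        return [0]  # A single frame: avoid dividing by zero below
    # Evenly spaced positions from the first to the last frame (floored)
    return [i * (total_frames - 1) // (num - 1) for i in range(num)]


print(uniform_frame_indices(10, 4))  # → [0, 3, 6, 9]
```

The first and last frames are always included when at least two frames are sampled, and videos shorter than `max_frames` simply return every frame index.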
 ## License