Improve model card: Add abstract and sample usage

#2
by nielsr HF Staff - opened
Files changed (1)
  1. README.md +89 -1
README.md CHANGED
@@ -22,9 +22,97 @@ tags:
 
  ## Model Summary
 
- This model is a part of the project [Sparrow](https://github.com/VITA-MLLM/Sparrow). It's a video-LLM fine-tuned from the image-LLM
+ This model is a part of the project [Sparrow](https://github.com/VITA-MLLM/Sparrow). It's a video-LLM fine-tuned from the image-LLM
  [MiniCPM-Llama3-V-2_5](https://huggingface.co/openbmb/MiniCPM-Llama3-V-2_5).
 
+ **Abstract:**
+ Recent years have seen the success of Multimodal Large Language Models (MLLMs) in the domain of vision understanding. The success of these models can largely be attributed to the dominant scaling law, which states that larger parameter sizes and data volumes contribute to better performance. Notably, data scaling has been primarily driven by automatic data pipelines, which focus on the self-instruction of LLMs. The paradigm has been taken for granted for quite some time, but the study of the effectiveness of scaling with these data has been neglected for a long time. In this context, this work revisits scaling with synthetic data and focuses on developing video-LLMs from a data-centric perspective. Our primary study approach involves fine-tuning pre-trained image-LLMs with video data and examining learning efficiency through data scaling. Results from our preliminary experiments reveal a low learning efficiency phenomenon when simply scaling up video data samples, which, through our probing, can be ascribed to a lack of instruction diversity. Aiming at this issue, we propose a data augmentation method called Sparrow, which synthesizes video-like samples from pure text instruction data. Mixing these synthetic samples with the video data enables a more efficient training scheme. Through comprehensive experiments, we demonstrate that our proposed method achieves performance comparable to or even superior to that of baselines trained with significantly more samples. Meanwhile, we find that incorporating these synthetic samples can enhance the performance of long video understanding without requiring training on long video data.
+
+ ## Sample Usage
+
+ This model is designed for video-language understanding. Load it with the `transformers` library, passing `trust_remote_code=True` so that the model's custom code is used. Video input is typically supplied as a list of frames (PIL images).
+
+ **Prerequisites:**
+ The example uses `decord` to read video frames. Install it with `pip install decord`.
+
+ ```python
+ from transformers import AutoModelForCausalLM, AutoTokenizer, AutoProcessor
+ from PIL import Image
+ import torch
+ import numpy as np
+ from decord import VideoReader, cpu  # For video loading
+
+ # Load model and processor
+ model_id = "VITA-MLLM/Sparrow-Llama3-V-2_5"  # Replace with the actual model ID if different
+ model = AutoModelForCausalLM.from_pretrained(
+     model_id,
+     torch_dtype=torch.bfloat16,  # Use bfloat16 for better performance/memory
+     device_map="auto",
+     trust_remote_code=True,
+ )
+ tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
+ processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
+
+ # --- Example: Load video frames ---
+ video_path = "path/to/your/video.mp4"  # <--- IMPORTANT: Replace with your video file path!
+ video_frames = []
+ try:
+     vr = VideoReader(video_path, ctx=cpu(0))
+     # Sample a maximum of 32 frames uniformly for demonstration
+     total_frames = len(vr)
+     num_frames_to_sample = min(total_frames, 32)
+     frame_indices = np.linspace(0, total_frames - 1, num_frames_to_sample, dtype=int)
+
+     # Cast each index to a plain int before indexing the reader
+     video_frames = [Image.fromarray(vr[int(i)].asnumpy()) for i in frame_indices]
+     print(f"Loaded {len(video_frames)} frames from {video_path}")
+ except Exception as e:
+     print(f"Could not load video from {video_path}: {e}")
+     print("Using placeholder images for demonstration. Please provide a valid video file.")
+     video_frames = [Image.new("RGB", (224, 224), color="blue")] * 4  # Fallback to placeholder images
+
+ # --- Prepare prompt with video frames ---
+ # The <video> tag is specific to MiniCPM-V models for indicating video/image input.
+ # It should be repeated for each image frame provided.
+ messages = [
+     {"role": "user", "content": "<video>" * len(video_frames) + "\nDescribe this video in detail."}
+ ]
+
+ # Apply chat template and tokenize inputs
+ inputs = processor.apply_chat_template(
+     messages,
+     video=video_frames,  # Pass the list of PIL Images here
+     tokenize=True,
+     add_generation_prompt=True,
+     return_tensors="pt",
+ )
+
+ # Move inputs to appropriate device (e.g., GPU)
+ inputs = {k: v.to(model.device) for k, v in inputs.items()}
+
+ # --- Generate response ---
+ with torch.no_grad():
+     generated_ids = model.generate(
+         input_ids=inputs["input_ids"],
+         attention_mask=inputs["attention_mask"],
+         image_pixel_values=inputs["image_pixel_values"],  # Essential for vision inputs
+         max_new_tokens=256,  # Adjust as needed
+         do_sample=True,
+         temperature=0.7,
+         top_p=0.9,
+     )
+
+ # Decode and print the output
+ response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]
+ # Clean up any potential chat template artifacts at the beginning/end
+ response = response.split('<|start_header_id|>assistant<|end_header_id|>')[-1].strip()
+
+ print("\nGenerated Response:")
+ print(response)
+
+ ```
 
  ## License
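
The frame-sampling step in the sample above (uniformly picking at most 32 frame indices) can be checked in isolation, without `decord` or `numpy`. The following is a minimal sketch, not code from the Sparrow repository: the helper name `uniform_frame_indices` is hypothetical, and it uses integer floor division, so an individual index may occasionally differ by one from the `np.linspace(..., dtype=int)` version in the README.

```python
from typing import List


def uniform_frame_indices(total_frames: int, max_frames: int = 32) -> List[int]:
    """Pick up to max_frames indices spread evenly over [0, total_frames - 1]."""
    if total_frames <= 0 or max_frames <= 0:
        return []  # Empty or unreadable video: nothing to sample
    n = min(total_frames, max_frames)
    if n == 1:
        return [0]
    # Floor division keeps the first index at 0 and the last at total_frames - 1.
    return [(i * (total_frames - 1)) // (n - 1) for i in range(n)]


if __name__ == "__main__":
    print(uniform_frame_indices(100, 5))  # [0, 24, 49, 74, 99]
```

A video shorter than the frame budget returns every frame index once, which matches the `min(total_frames, 32)` guard in the sample.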