metadata
library_name: peft
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- EasonFan/AirCopBench
tags:
- lora
- peft
- multimodal
- vision-language
- uav
- aerial
- visual-question-answering
- multi-agent-perception
pipeline_tag: image-text-to-text
aircop-7b — Qwen2.5-VL-7B fine-tuned on AirCopBench
LoRA adapter for Qwen/Qwen2.5-VL-7B-Instruct, supervised fine-tuned on the training split of AirCopBench, a multi-UAV collaborative aerial perception VQA benchmark.
Paper: https://arxiv.org/pdf/2511.11025
Task
Each sample shows the same scene, captured at the same moment by 2–6 UAV cameras from different viewpoints, and poses a 4-way multiple-choice question (object grounding, counting, matching, causal/collaboration assessment, etc.). The model answers with a single option letter.
Results (AirCopBench test, 1025 questions)
| Subset | Accuracy |
|---|---|
| Overall | 0.7532 (772/1025) |
| Real2 (2 real UAVs) | 0.5785 |
| Sim3 (3 sim UAVs) | 0.8244 |
| Sim5 (5 sim UAVs) | 0.7551 |
| Sim6 (6 sim UAVs) | 0.7634 |
Parse failures: 0.
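The following is a minimal sketch of how a score like the one above could be computed from raw model responses. The card does not describe the benchmark's actual evaluation script, so the helper names `parse_choice` and `accuracy` and the parsing rule (first standalone letter A to D) are assumptions made for illustration.

```python
import re


def parse_choice(response: str):
    """Return the first standalone option letter (A-D) in the response, or None.

    Assumed parsing rule; the official evaluation may differ.
    """
    match = re.search(r"\b([A-D])\b", response)
    return match.group(1) if match else None


def accuracy(responses, gold_letters):
    """Return (accuracy, parse_failures) over paired responses and gold letters."""
    parsed = [parse_choice(r) for r in responses]
    parse_failures = sum(p is None for p in parsed)
    correct = sum(p == g for p, g in zip(parsed, gold_letters))
    return correct / len(gold_letters), parse_failures


# The overall figure in the table is plain correct / total.
overall = round(772 / 1025, 4)  # 0.7532
```

Under this rule a response that contains no option letter counts as both a parse failure and a wrong answer.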
Training
- Method: LoRA SFT (rank 16, `lora_target: all`), 1 epoch, bf16, flash-attn 2
- Effective batch size 16 (per-device 8 × grad-accum 2), lr 1e-4 cosine, `image_max_pixels` 262144
- Framework: LLaMA-Factory, template `qwen2_vl`
- ~12.7k multi-image samples (Real2 / Sim3 / Sim5 / Sim6)
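As a quick check on the batch arithmetic above, the sketch below assumes the usual relation effective batch = per-device batch × gradient accumulation × number of devices. The device count and step count are derived under that assumption, not stated in the card, and the sample count is the approximate figure given above.

```python
import math

per_device_batch = 8
grad_accum = 2
effective_batch = 16
num_samples = 12_700  # approximate (~12.7k), not the exact training-set size

# Device count implied by the stated numbers (assumption: standard data parallelism).
num_devices = effective_batch // (per_device_batch * grad_accum)

# Approximate optimizer steps for the single epoch, counting a final partial batch.
steps_per_epoch = math.ceil(num_samples / effective_batch)

print(num_devices, steps_per_epoch)
```

With these inputs the stated numbers are consistent with a single device and roughly 800 optimizer steps in total.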
Usage
```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base = "Qwen/Qwen2.5-VL-7B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(base, dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(model, "EasonFan/aircop-7b")
processor = AutoProcessor.from_pretrained(base)

messages = [{"role": "user", "content": [
    {"type": "text", "text": "UAV1:"}, {"type": "image"},
    {"type": "text", "text": "UAV2:"}, {"type": "image"},
    {"type": "text", "text": "Question: ...\nOptions:\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer with only the letter."},
]}]
# build inputs with processor.apply_chat_template + processor(...) and call model.generate()
```
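The last step in the snippet above is left as a comment. One way to fill it in is sketched below, assuming `images` is a list of PIL images in the same order as the `{"type": "image"}` placeholders. The helper names `build_messages` and `answer` are hypothetical and not part of this repository; the prompt layout simply mirrors the `messages` example above, and the generation settings (greedy decoding, 8 new tokens) are illustrative choices.

```python
def build_messages(num_views, question, options):
    """Interleave 'UAVk:' labels with image placeholders, then append the question."""
    content = []
    for k in range(1, num_views + 1):
        content.append({"type": "text", "text": f"UAV{k}:"})
        content.append({"type": "image"})
    option_lines = "\n".join(
        f"{letter}. {text}" for letter, text in zip("ABCD", options)
    )
    content.append({
        "type": "text",
        "text": f"Question: {question}\nOptions:\n{option_lines}\nAnswer with only the letter.",
    })
    return [{"role": "user", "content": content}]


def answer(model, processor, images, messages, max_new_tokens=8):
    """Generate a short reply; `images` holds one PIL image per placeholder."""
    # Render the chat template to text, then let the processor expand image tokens.
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    inputs = processor(text=[text], images=images, return_tensors="pt").to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens, do_sample=False)
    # Drop the prompt tokens so only the newly generated part is decoded.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
```

With `model` and `processor` loaded as above, a call such as `answer(model, processor, images, build_messages(len(images), question, options))` would be expected to return the option letter as text.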