---
library_name: peft
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
datasets:
- EasonFan/AirCopBench
tags:
- lora
- peft
- multimodal
- vision-language
- uav
- aerial
- visual-question-answering
- multi-agent-perception
pipeline_tag: image-text-to-text
---

# aircop-7b: Qwen2.5-VL-7B fine-tuned on AirCopBench

LoRA adapter for **Qwen/Qwen2.5-VL-7B-Instruct**, supervised fine-tuned on the training split of **[AirCopBench](https://huggingface.co/datasets/EasonFan/AirCopBench)**, a multi-UAV collaborative aerial perception VQA benchmark.

Paper: https://arxiv.org/pdf/2511.11025

## Task

Each sample shows one scene captured at the same moment by 2–6 UAV cameras from different viewpoints and poses a 4-way multiple-choice question (object grounding, counting, matching, causal/collaboration assessment, etc.). The model answers with a single option letter.

## Results (AirCopBench test, 1025 questions)

| Subset | Accuracy |
|---|---|
| **Overall** | **0.7532** (772/1025) |
| Real2 (2 real UAVs) | 0.5785 |
| Sim3 (3 sim UAVs) | 0.8244 |
| Sim5 (5 sim UAVs) | 0.7551 |
| Sim6 (6 sim UAVs) | 0.7634 |

Parse failures: 0.
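
Because the model answers with a single option letter, accuracy and parse failures can be computed with a small scoring routine. The sketch below is illustrative only and is not the evaluation script behind the numbers above: `parse_option_letter` and `score` are hypothetical helpers, and the rule "first standalone uppercase letter A to D" is an assumption about how an answer might be parsed.

```python
import re
from typing import List, Optional, Tuple

# Assumption: an answer is the first standalone uppercase letter A-D in the output.
_LETTER = re.compile(r"\b([A-D])\b")


def parse_option_letter(output: str) -> Optional[str]:
    """Return the first standalone option letter, or None if no letter is found."""
    match = _LETTER.search(output.strip())
    return match.group(1) if match else None


def score(predictions: List[str], gold: List[str]) -> Tuple[float, int]:
    """Return (accuracy, parse_failures) for raw model outputs against gold letters."""
    parsed = [parse_option_letter(p) for p in predictions]
    correct = sum(p == g for p, g in zip(parsed, gold))
    failures = sum(p is None for p in parsed)
    return correct / len(gold), failures


# Toy example (made-up outputs, not benchmark data): 2 of 4 correct, 1 unparseable.
accuracy, failures = score(["B", "C.", "Answer: D", "not sure"], ["B", "C", "A", "A"])

# The reported overall accuracy follows directly from the counts in the table.
overall = round(772 / 1025, 4)
```

An unparseable output counts as wrong in this sketch, so accuracy is always taken over the full question set.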

## Training

- Method: LoRA SFT (rank 16, `lora_target: all`), 1 epoch, bf16, flash-attn 2
- Effective batch size 16 (per-device 8 × grad-accum 2), lr 1e-4 cosine, `image_max_pixels` 262144
- Framework: LLaMA-Factory, template `qwen2_vl`
- ~12.7k multi-image samples (Real2 / Sim3 / Sim5 / Sim6)

## Usage

```python
import torch
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base = "Qwen/Qwen2.5-VL-7B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(base, dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(model, "EasonFan/aircop-7b")
processor = AutoProcessor.from_pretrained(base)

messages = [{"role": "user", "content": [
    {"type": "text", "text": "UAV1:"},
    {"type": "image"},
    {"type": "text", "text": "UAV2:"},
    {"type": "image"},
    {"type": "text", "text": "Question: ...\nOptions:\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer with only the letter."},
]}]
# build inputs with processor.apply_chat_template + processor(...) and call model.generate()
```
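
The closing comment in the snippet above leaves input construction and generation to the reader. The following is a minimal sketch of those steps, not a verified reference implementation. `build_messages` and `answer_question` are hypothetical helper names, the prompt layout is assumed to match the two-view example above for any number of views, `images` is assumed to be a list of PIL images in UAV order, and `max_new_tokens=8` is an arbitrary choice.

```python
from typing import Any, Dict, List, Sequence


def build_messages(num_views: int, question: str, options: Sequence[str]) -> List[Dict[str, Any]]:
    """Build a single-turn chat message with one 'UAVi:' label and image slot per view."""
    content: List[Dict[str, Any]] = []
    for i in range(1, num_views + 1):
        content.append({"type": "text", "text": f"UAV{i}:"})
        content.append({"type": "image"})
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", options))
    prompt = f"Question: {question}\nOptions:\n{option_lines}\nAnswer with only the letter."
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]


def answer_question(model, processor, images, messages, max_new_tokens: int = 8) -> str:
    """Run one multi-image question through the model and return the decoded answer text.

    `model` and `processor` are the objects loaded in the Usage snippet.
    """
    # Render the chat template to text; each {"type": "image"} becomes an image placeholder.
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    # Tokenize the text and preprocess the images together.
    inputs = processor(text=[text], images=images, return_tensors="pt", padding=True)
    inputs = inputs.to(model.device)
    output_ids = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so only the newly generated answer is decoded.
    new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
    return processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()


# Example message construction for a hypothetical three-view question.
example = build_messages(3, "How many cars are visible?", ["1", "2", "3", "4"])
```

With the model loaded, a call such as `answer_question(model, processor, [img1, img2, img3], example)` would return the generated text, where `img1` to `img3` stand for images opened with `PIL.Image.open`.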