---
library_name: peft
license: apache-2.0
base_model: Qwen/Qwen2.5-VL-7B-Instruct
datasets:
  - EasonFan/AirCopBench
tags:
  - lora
  - peft
  - multimodal
  - vision-language
  - uav
  - aerial
  - visual-question-answering
  - multi-agent-perception
pipeline_tag: image-text-to-text
---

# aircop-7b — Qwen2.5-VL-7B fine-tuned on AirCopBench

LoRA adapter for **Qwen/Qwen2.5-VL-7B-Instruct**, supervised fine-tuned on the
training split of **[AirCopBench](https://huggingface.co/datasets/EasonFan/AirCopBench)**,
a multi-UAV collaborative aerial perception VQA benchmark.

Paper: https://arxiv.org/pdf/2511.11025

## Task

Each sample presents one scene captured at the same moment by several UAV cameras
(2, 3, 5, or 6, depending on the subset) from different viewpoints, followed by a
four-option multiple-choice question (object grounding, counting, matching,
causal/collaboration assessment, etc.). The model answers with a single option letter.
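
Because the expected reply is a single option letter, raw generations can be normalized with a small parser before scoring. The following is a minimal sketch, not the evaluation script behind the numbers in this card; `parse_option_letter` is a hypothetical helper name, and the accepted reply shapes are assumptions.

```python
import re
from typing import Optional

# Reply starts with an uppercase option letter, e.g. "B", "C.", "(D) ...".
_LEADING = re.compile(r"^\W*([A-D])\b")
# Letter follows the word "answer", e.g. "Answer: A", "The answer is B."
_AFTER_ANSWER = re.compile(r"(?i:answer)\W+(?:is\W+)?([A-D])\b")


def parse_option_letter(reply: str) -> Optional[str]:
    """Return the option letter (A-D) found in `reply`, or None if none is found."""
    for pattern in (_LEADING, _AFTER_ANSWER):
        match = pattern.search(reply)
        if match:
            return match.group(1)
    return None  # counted as a parse failure by the caller
```

Only uppercase letters are accepted so that the article "a" in free-form text is not mistaken for option A. A reply that begins with a capitalized "A" as an ordinary word would still be misread, so stricter rules may be needed for chattier models.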

## Results (AirCopBench test, 1025 questions)

| Subset | Accuracy |
|---|---|
| **Overall** | **0.7532** (772/1025) |
| Real2 (2 real UAVs) | 0.5785 |
| Sim3 (3 sim UAVs) | 0.8244 |
| Sim5 (5 sim UAVs) | 0.7551 |
| Sim6 (6 sim UAVs) | 0.7634 |

Parse failures: 0 (an option letter could be extracted from every response).
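
For readers reproducing the table, the bookkeeping for overall accuracy, per-subset accuracy, and parse failures can be sketched as below. The record layout `(subset, predicted letter, gold letter)` and the choice to count a parse failure as an incorrect answer are assumptions, not details taken from the original evaluation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

# (subset name, predicted letter or None on parse failure, gold letter)
Record = Tuple[str, Optional[str], str]


def score(records: Iterable[Record]) -> Tuple[Dict[str, float], int]:
    """Return accuracy per subset (plus "Overall") and the parse-failure count."""
    correct = defaultdict(int)
    total = defaultdict(int)
    parse_failures = 0
    for subset, pred, gold in records:
        if pred is None:
            parse_failures += 1  # assumption: still counted as a wrong answer
        for key in (subset, "Overall"):
            total[key] += 1
            correct[key] += int(pred == gold)
    accuracy = {key: correct[key] / total[key] for key in total}
    return accuracy, parse_failures


# Toy records for illustration only; these are not benchmark data.
demo = [("Real2", "A", "A"), ("Real2", None, "B"), ("Sim3", "C", "C"), ("Sim3", "D", "D")]
accuracy, failures = score(demo)
print(accuracy, failures)
```

As a consistency check on the table, 772 correct answers out of 1025 questions rounds to the reported 0.7532.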

## Training

- Method: LoRA SFT (rank 16, `lora_target: all`), 1 epoch, bf16, FlashAttention-2
- Effective batch size 16 (per-device batch 8 × gradient accumulation 2), learning rate 1e-4 with cosine schedule, `image_max_pixels` 262144
- Framework: LLaMA-Factory, template `qwen2_vl`
- Data: ~12.7k multi-image training samples (Real2 / Sim3 / Sim5 / Sim6)
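
The settings above imply a few derived quantities, worked out in the sketch below. It rests on stated assumptions: a single training device (the card does not give a device count), the rounded figure of 12.7k samples, and an aspect-preserving downscale as the way a pixel budget is enforced. `fit_to_pixel_budget` is a hypothetical helper and is not claimed to match LLaMA-Factory's exact preprocessing.

```python
import math
from typing import Tuple

per_device_batch = 8
grad_accum = 2
num_devices = 1  # assumption: implied by 8 x 2 = 16, not stated in the card
effective_batch = per_device_batch * grad_accum * num_devices  # 16

approx_samples = 12_700  # "~12.7k" from the card, rounded
steps_per_epoch = math.ceil(approx_samples / effective_batch)  # 794 for this rounded count

image_max_pixels = 262_144  # equals 512 * 512


def fit_to_pixel_budget(width: int, height: int,
                        max_pixels: int = image_max_pixels) -> Tuple[int, int]:
    """Shrink (width, height) to at most `max_pixels`, keeping the aspect ratio."""
    if width * height <= max_pixels:
        return width, height  # already within budget
    scale = math.sqrt(max_pixels / (width * height))
    return max(1, int(width * scale)), max(1, int(height * scale))


print(effective_batch, steps_per_epoch, fit_to_pixel_budget(1920, 1080))
```

Under these assumptions, one epoch corresponds to roughly 800 optimizer steps, and each view is capped at about the area of a 512×512 image.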

## Usage

```python
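import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor
from peft import PeftModel

base = "Qwen/Qwen2.5-VL-7B-Instruct"
model = AutoModelForImageTextToText.from_pretrained(base, dtype=torch.bfloat16, device_map="cuda")
model = PeftModel.from_pretrained(model, "EasonFan/aircop-7b")
processor = AutoProcessor.from_pretrained(base)

# One image per UAV view, in the same order as the {"type": "image"} slots below.
images = [Image.open("uav1.jpg"), Image.open("uav2.jpg")]  # placeholder paths

messages = [{"role": "user", "content": [
    {"type": "text", "text": "UAV1:"}, {"type": "image"},
    {"type": "text", "text": "UAV2:"}, {"type": "image"},
    {"type": "text", "text": "Question: ...\nOptions:\nA. ...\nB. ...\nC. ...\nD. ...\nAnswer with only the letter."},
]}]

# Render the chat prompt, then tokenize the text and images together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=images, return_tensors="pt").to(model.device)
with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=8, do_sample=False)
# Decode only the newly generated tokens (skip the prompt).
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip())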
```
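
The two-view message in the snippet above extends naturally to the 3-, 5-, and 6-view subsets. The helper below is a sketch of that pattern; `build_messages` is a hypothetical name, and the prompt wording simply mirrors the example above. The exact prompt used during training is not documented in this card, so treat the wording as an assumption.

```python
from typing import Dict, List, Sequence


def build_messages(num_views: int, question: str, options: Sequence[str]) -> List[Dict]:
    """Build a single-turn chat message with one labelled image slot per UAV view."""
    if len(options) != 4:
        raise ValueError("expected exactly four options (A-D)")
    content = []
    for i in range(1, num_views + 1):
        content.append({"type": "text", "text": f"UAV{i}:"})
        content.append({"type": "image"})  # filled by the i-th image passed to the processor
    option_lines = "\n".join(f"{letter}. {text}" for letter, text in zip("ABCD", options))
    prompt = f"Question: {question}\nOptions:\n{option_lines}\nAnswer with only the letter."
    content.append({"type": "text", "text": prompt})
    return [{"role": "user", "content": content}]
```

The list of images passed to the processor should then contain exactly `num_views` entries, in UAV order.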