---
license: cc-by-nc-sa-4.0
library_name: transformers
language:
- en
pipeline_tag: image-text-to-text
tags:
- visual-question-answering
- medical
- healthcare
- biology
- multimodal
- lvlm
- grounding
datasets: HPAI-BSC/Aloe-Beta-General-Collection
model_type: qwen2-vl
base\_model: Qwen/Qwen2-VL-7B-Instruct
---
**Aloe-Vision** is a **medical Large Vision–Language Model** built on **Qwen2-VL-Instruct** and released in **7B and 72B** sizes. It is trained on a balanced mixture of **\~3.5M samples** spanning **medical vs. general** and **multimodal vs. text-only** sources, rebalanced by **loss-contributing assistant tokens** to avoid long-answer bias. To **control leakage of evaluation images into the training data**, we apply **exact 64-bit image-hash matching** and remove every matching image from the training set. **Quality filtering** of the training data combines (1) **LVLM-based sample scoring (1–5 scale)** for image–question–answer coherence and relevance and (2) **answer perplexity checks** to flag trivial or noisy annotations. Thresholds are dataset-specific and manually tuned, so that low-quality outliers are removed while clinically meaningful diversity is preserved. The model is additionally fine-tuned on **17.2K adversarially perturbed medical samples** to improve robustness against sycophantic and misleading multimodal cues. The model is released for research purposes under **CC BY-NC-SA 4.0**.
---
## Model Details
* **Base model**: Qwen2-VL-Instruct (7B / 72B)
* **Variant**: Aloe-Vision-7B-AR (Adversarially Robust)
* **Training type**: Two-stage SFT (medical + adversarial fine-tuning)
* **Sizes**: 7B, 72B
* **Languages**: English
* **Images per turn**: Qwen2-VL style multi-image support
* **License**: **CC BY-NC-SA 4.0**
* **Developed by**: [HPAI — Barcelona Supercomputing Center (BSC)](https://hpai.bsc.es/)
* **Contact**: [hpai@bsc.es](mailto:hpai@bsc.es)
---
## Intended Use & Out-of-Scope
**Intended**: research on medical VQA and multimodal reasoning, dataset analysis, academic benchmarking.
**Out-of-scope**:
* clinical diagnosis/treatment, triage, or any unsupervised medical use.
* generation of harmful, misleading, or fraudulent medical content.
* processing of PHI or any personally identifiable patient data.
---
## How to Use
Aloe-Vision follows the **Qwen2-VL** chat template and processor API. Replace the image path(s) and prompt content to suit your use case.
### Python (Transformers)
```python
import torch
from transformers import AutoProcessor, AutoModelForVision2Seq
from qwen_vl_utils import process_vision_info  # pip install qwen-vl-utils

model_id = "HPAI-BSC/Aloe-Vision-7B-AR"

# Load the processor (tokenizer + image preprocessor) and the model in BF16.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/your_image.png"},
            {"type": "text", "text": "What abnormality do you observe? Be concise."},
        ],
    }
]

# Render the chat template and collect the images referenced in the messages.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding.
generated = model.generate(
    **inputs,
    max_new_tokens=256,
    do_sample=False,
    eos_token_id=processor.tokenizer.eos_token_id,
)

# Keep only the newly generated tokens (drop the prompt) before decoding.
new_tokens = generated[:, inputs.input_ids.shape[1]:]
output_text = processor.batch_decode(new_tokens, skip_special_tokens=True)[0]
print(output_text.strip())
```
**Grounding**: Aloe-Vision supports region-referenced grounding using Qwen2-VL **box marker tokens**.
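As a minimal sketch of how grounded output could be post-processed, the snippet below parses box markers from generated text and rescales them to pixel coordinates. It assumes Aloe-Vision emits the upstream Qwen2-VL format, `<|object_ref_start|>label<|object_ref_end|><|box_start|>(x1,y1),(x2,y2)<|box_end|>`, with coordinates on a 0 to 1000 grid; this card does not confirm the exact output format. The helper `parse_boxes` and the sample string are illustrative and not part of any library.

```python
import re

# Pattern for the upstream Qwen2-VL grounding format (assumed to carry over to Aloe-Vision).
BOX_PATTERN = re.compile(
    r"<\|object_ref_start\|>(?P<label>.*?)<\|object_ref_end\|>"
    r"<\|box_start\|>\((?P<x1>\d+),(?P<y1>\d+)\),\((?P<x2>\d+),(?P<y2>\d+)\)<\|box_end\|>"
)

def parse_boxes(text, width, height, grid=1000):
    """Return (label, (x1, y1, x2, y2)) tuples in pixel coordinates."""
    boxes = []
    for m in BOX_PATTERN.finditer(text):
        x1, y1, x2, y2 = (int(m.group(k)) for k in ("x1", "y1", "x2", "y2"))
        # Rescale from the normalized grid to the image size (integer pixels).
        boxes.append((
            m.group("label").strip(),
            (x1 * width // grid, y1 * height // grid,
             x2 * width // grid, y2 * height // grid),
        ))
    return boxes

# Hypothetical model output for a 1024x512 image.
sample = (
    "<|object_ref_start|>nodule<|object_ref_end|>"
    "<|box_start|>(250,100),(500,400)<|box_end|>"
)
print(parse_boxes(sample, width=1024, height=512))
# [('nodule', (256, 51, 512, 204))]
```

In the upstream tokenizer these markers are registered as special tokens, so decoding with `skip_special_tokens=True` may strip them; decoding with `skip_special_tokens=False` before parsing is the safer choice.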
---
## Training Summary
* **Training type**: Two-stage SFT (medical + adversarial fine-tuning)
* **Stack**: TRL + DeepSpeed **ZeRO-3**
* **Precision**: **BF16**
* **Global batch size**: **1024**
* **Micro batch size**: **16**
* **Epochs**: **1**
* **Sequence length**: **4096**
* **LR**: **3.75e-5**, **Cosine** schedule, **warmup 3%**
* **Optimizer**: **AdamW**
* **Grad checkpointing**: enabled
* **Parallelism**: DeepSpeed ZeRO-3
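
The learning-rate settings above can be sketched as a function of the optimizer step. This is an illustration only: it assumes linear warmup and cosine decay to zero (the card states only "cosine" with 3% warmup), and the step count is hypothetical.

```python
import math

def lr_at(step, total_steps, peak_lr=3.75e-5, warmup_ratio=0.03):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed shape)."""
    warmup_steps = max(1, round(total_steps * warmup_ratio))
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return peak_lr * 0.5 * (1.0 + math.cos(math.pi * progress))

total = 1000  # hypothetical number of optimizer steps
for s in (0, 30, 515, 1000):
    print(s, f"{lr_at(s, total):.3e}")
```

With 1000 steps, warmup ends at step 30, where the rate reaches the 3.75e-5 peak; it then falls to half the peak midway through the decay and to zero at the final step.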
### Compute
* **Cluster**: MareNostrum-5 (BSC)
* **Nodes/GPUs**: **8 nodes × 4× NVIDIA H100** (total 32 GPUs)
* **GPU hours**: **\~500**
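
The batch settings and GPU count imply a gradient-accumulation factor. The sketch below works it out, assuming the micro batch size is per GPU and that all 32 GPUs act as data-parallel ranks (typical under ZeRO-3, but not stated in this card). The function name is illustrative.

```python
def grad_accum_steps(global_batch, micro_batch, num_gpus):
    """Gradient-accumulation steps needed to reach the global batch size."""
    per_step = micro_batch * num_gpus  # samples per forward/backward across all ranks
    if global_batch % per_step != 0:
        raise ValueError("global batch must be a multiple of micro_batch * num_gpus")
    return global_batch // per_step

num_gpus = 8 * 4  # 8 nodes x 4 GPUs
print(grad_accum_steps(global_batch=1024, micro_batch=16, num_gpus=num_gpus))  # 2
```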
---
## Training Data
We construct a balanced mixture across two axes: **modality** (multimodal vs. text-only) and **domain** (medical vs. general). All sources are normalized to a unified TRL conversation schema. The medical multimodal portion covers both global image understanding and fine-grained region reasoning.
The dataset can be found in [HPAI-BSC/Aloe-Vision-Data](https://huggingface.co/datasets/HPAI-BSC/Aloe-Vision-Data).
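
Leakage control by exact 64-bit image-hash matching can be sketched as follows. The card does not say which hash function was used, so this example substitutes an 8-byte BLAKE2b digest of the raw file bytes; a hash computed over decoded pixels would be needed to catch re-encoded copies. The function names and the sample records are hypothetical.

```python
import hashlib

def hash64(image_bytes):
    """64-bit digest of raw image bytes (stand-in for the card's unspecified hash)."""
    return hashlib.blake2b(image_bytes, digest_size=8).hexdigest()

def remove_leaked(train_samples, eval_images):
    """Drop training samples whose image hash exactly matches any evaluation image."""
    eval_hashes = {hash64(img) for img in eval_images}
    return [s for s in train_samples if hash64(s["image"]) not in eval_hashes]

# Toy byte strings standing in for image files.
train = [{"id": "a", "image": b"img-1"}, {"id": "b", "image": b"img-2"}]
evals = [b"img-2", b"img-3"]
print([s["id"] for s in remove_leaked(train, evals)])  # ['a']
```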
---
## Evaluation
Aloe-Vision is evaluated across four task families: medical multimodal, medical text-only, general multimodal, and general text-only. Benchmarks are run with identical settings for Aloe-Vision and all baselines, so that results are reproducible and directly comparable.
**Benchmarks**:
* **PathMMU** (multi, medical, MCQ) — 1.1K
* **GMAI-MMBench** (multi, medical, MCQ) — 4.5K
* **OmniMedVQA** (multi, medical, MCQ) — 89K
* **ProbMed** (multi, medical, Y/N) — 57K
* **SLAKE** (multi, medical, open-ended; **LLM-as-judge**) — 2K
* **MMMU** (multi, general, MCQ) — 1.4K
* **MultiMedQA** (text, medical, MCQ) — 7K
* **MMLU** (text, general, MCQ) — 14K
**Evaluation protocol**
* Multimodal via **VLMEvalKit**, text-only via **lm-evaluation-harness**.
* **Decoding**: greedy, accuracy by exact match for MCQ and Y/N.
* **LLM-as-judge** (SLAKE): **Qwen2.5-VL-72B** with a rubric-based {0.0, 0.5, 1.0} scale.
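
The two scoring modes can be illustrated with a short sketch. It is not the VLMEvalKit or lm-evaluation-harness implementation: the answer normalization (trimming and case-folding) and the averaging of judge scores into a percentage are assumptions made for this example.

```python
def exact_match_accuracy(predictions, references):
    """Percentage of predictions equal to the reference after trimming and case-folding."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = sum(
        p.strip().casefold() == r.strip().casefold()
        for p, r in zip(predictions, references)
    )
    return 100.0 * hits / len(references)

def judge_score(scores, allowed=(0.0, 0.5, 1.0)):
    """Mean rubric score as a percentage; rejects values outside the rubric."""
    if any(s not in allowed for s in scores):
        raise ValueError("score outside rubric scale")
    return 100.0 * sum(scores) / len(scores)

print(exact_match_accuracy(["A", "b ", "C", "D"], ["A", "B", "D", "D"]))  # 75.0
print(judge_score([1.0, 0.5, 0.0, 1.0]))  # 62.5
```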
### Results
| Model                                |  OmniMedVQA  | GMAI-MMBench |    PathMMU   |    ProbMed   |     SLAKE    |     MMMU     |  MultiMedQA  |     MMLU     |
| ------------------------------------ | :----------: | :----------: | :----------: | :----------: | :----------: | :----------: | :----------: | :----------: |
| **Kimi-VL-A3B-Instruct *(general)*** | 71.30 | 46.20 | 49.65 | 78.91 | 65.06 | 52.00 | 59.21 | 69.04 |
| **MiMo-7B-RL *(general)*** | 63.80 | 43.82 | 51.75 | 74.80 | 61.13 | 21.67 | 55.88 | 68.42 |
| **Qwen2-VL-7B *(general)*** | 71.40 | 46.42 | 54.90 | 72.87 | 64.11 | 50.44 | 59.67 | 67.82 |
| **InternVL3.5-8B *(general)*** | **87.20** | **57.96** | 65.06 | **79.51** | 75.31 | 54.67 | **63.95** | **75.56** |
| **HuatuoGPT-Vision-7B** | 71.40 | 47.23 | 57.09 | 76.14 | 60.65 | 39.89 | 57.93 | 67.61 |
| **Lingshu-7B**                       |     79.50    |     52.31    |   **66.55**  |     79.00    |   **80.18**  |   **57.89**  |     62.09    |     69.37    |
| **Chiron-o1-8B** | 71.40 | 41.41 | 55.87 | 73.73 | 66.49 | 43.22 | 59.65 | 71.56 |
| **Aloe-Vision-7B** | 76.50 | 52.79 | 61.82 | 76.69 | 65.40 | 45.11 | 58.48 | 65.95 |
| **Aloe-Vision-7B-AR** | 77.60 | 53.95 | 65.32 | 79.35 | 63.39 | 48.33 | 61.82 | 66.31 |
---
## Adversarial Robustness
To improve robustness against noisy or misleading inputs, we conducted an additional fine-tuning stage focused on adversarial robustness. This stage aimed to mitigate common LVLM vulnerabilities such as sycophantic behavior or misleading multimodal cues. An adversarial benchmark was first created by applying controlled perturbations to existing medical datasets (distinct from those used in evaluation). These perturbations introduced conflicting or false multimodal signals (e.g., mismatched region annotations or incorrect textual hints).
Using this adversarially transformed dataset, we trained the Aloe-Vision-7B-AR variant with a single-stage post-training SFT on 17.2K adversarial samples. The adversarial fine-tuning used the same optimization setup as the base model and ran for 1 epoch.
Relative to Aloe-Vision-7B, this procedure improved accuracy in every adversarial category, with the largest gains on the detection perturbations. On the standard benchmarks, scores rose on seven of the eight tasks; SLAKE was the exception (65.40 → 63.39).
The following table reports model accuracy (%) under different **adversarial perturbations**.
Columns correspond to:
* **Cap** = misleading captions inserted into the *image*
* **Pmt** = misleading captions in the *prompt*
* **Syc** = sycophantic prompt bias
* **Leg** = misleading legends inserted into the *image*
| **Model** | **Cls Base** | **Cap** | **Pmt** | **Syc** | **Det Base** | **Cap** | **Pmt** | **Syc** | **Leg** |
| :--------------------- | :----------: | :------: | :------: | :------: | :----------: | :------: | :------: | :------: | :------: |
| MiMo-VL-7B | 54.4 | 1.2 | 1.8 | 6.9 | 64.8 | 5.9 | 3.2 | 8.2 | 35.9 |
| Qwen2-VL-7B | 52.5 | 0.5 | 2.0 | 11.4 | 62.7 | 27.1 | 13.2 | 9.8 | 37.0 |
| InternVL3.5-8B | 66.6 | 0.8 | 2.6 | 20.6 | 72.8 | 32.4 | 24.8 | 10.2 | 47.9 |
| HuatuoGPT-Vision-7B | 57.9 | **19.4** | 6.2 | 29.4 | 61.1 | 40.8 | 4.8 | 7.2 | 47.1 |
| Lingshu-7B | **79.5** | 2.5 | 20.2 | 44.8 | 76.8 | 18.2 | 16.1 | 27.3 | 51.3 |
| Chiron-o1-8B | 48.7 | 7.1 | 7.4 | **56.6** | 58.1 | 27.1 | 12.6 | 32.9 | 39.6 |
| **Aloe-Vision-7B** | 59.7 | 3.9 | 14.7 | 42.6 | 61.7 | 53.0 | 16.0 | 14.3 | 50.9 |
| **Aloe-Vision-7B-AR** | 65.8 | 14.2 | **44.2** | 50.2 | **78.7** | **75.0** | **70.6** | **71.1** | **72.0** |
---
## Safety, Risks & Limitations
* **Not a medical device**. Do **not** rely on outputs for diagnosis/treatment.
* **Failure modes**: may hallucinate, misinterpret findings, or over-generalize across modalities and specialties.
* **Sensitive content**: can produce unsafe content if prompted adversarially.
**Recommended practice**
* Keep a **qualified clinician** in the loop for any medically relevant use.
---
**Clinical safety**: Aloe-Vision is a research model. It must **not** be used for diagnosis, treatment, or clinical decision-making. Always place a qualified human in the loop.
---
## Citation
Paper not published yet.
---
## Acknowledgments
Developed by the **High Performance Artificial Intelligence (HPAI)** group at **Barcelona Supercomputing Center (BSC)**. Contact: **[hpai@bsc.es](mailto:hpai@bsc.es)**.