---
base_model: "Qwen/Qwen2-VL-2B-Instruct"
library_name: transformers
tags:
  - image-text-to-text
  - regression
arxiv: 2507.14997
github: https://github.com/royhj/RvTC
---

# Regression via Transformer-Based Classification (RvTC) 
- **Model:** Qwen2-VL-2B-Image-Only

[GitHub](https://github.com/royhj/RvTC) - [arXiv](https://arxiv.org/abs/2507.14997)

## Model Description

This checkpoint is a Qwen2-VL-2B-Instruct model fine-tuned for image aesthetic assessment with the RvTC (Regression via Transformer-Based Classification) framework.
It was trained on **images only**, without textual context.


## Base Model

- **Architecture:** Qwen2-VL-2B-Instruct
- **Source:** Qwen/Qwen2-VL-2B-Instruct

## Training Configuration

- **Dataset:** AVA (Aesthetic Visual Analysis)
- **Training Mode:** Image-only (no textual prompts)
- **Epochs:** 2
- **Learning Rate:** 1e-5
- **Batch Size:** 128 (training)
- **Optimizer:** AdamW with a cosine learning-rate schedule
- **Warmup Ratio:** 0.03
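
As a rough illustration of how the learning rate and warmup ratio above interact, the sketch below computes the learning rate at a given step. It assumes linear warmup followed by cosine decay to zero, which is the usual shape of a cosine schedule with warmup; the exact scheduler used for this checkpoint is not stated here. The function name `lr_at` and the `total_steps` value in the usage lines are hypothetical.

```python
import math

BASE_LR = 1e-5       # learning rate from the configuration above
WARMUP_RATIO = 0.03  # fraction of training steps spent warming up


def lr_at(step: int, total_steps: int) -> float:
    """Learning rate at `step`, assuming linear warmup then cosine decay to 0."""
    warmup_steps = int(WARMUP_RATIO * total_steps)
    if step < warmup_steps:
        # Linear ramp from 0 up to BASE_LR.
        return BASE_LR * step / max(1, warmup_steps)
    # Cosine decay from BASE_LR down to 0 over the remaining steps.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


if __name__ == "__main__":
    total_steps = 1000  # hypothetical; depends on dataset size and batch size
    for step in (0, 15, 30, 500, 1000):
        print(step, lr_at(step, total_steps))
```

With a warmup ratio of 0.03, the peak learning rate is reached after the first 3% of steps and decays for the remaining 97%.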

## Binning Configuration

- **Number of Bins:** 51
- **Value Range:** [1.81, 8.60] (range of scores in the training set)
- **Method:** Uniform binning for regression via classification
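
The binning above can be sketched as follows. This is a minimal illustration, not the repository's implementation: it assumes the 51 bins are equal-width intervals spanning [1.81, 8.60] (each about 0.133 wide), that out-of-range scores are clamped to the first or last bin, and that a predicted bin is decoded to the centre of its interval. The function names `score_to_bin` and `bin_to_score` are hypothetical.

```python
NUM_BINS = 51
V_MIN, V_MAX = 1.81, 8.60                # score range of the training set
BIN_WIDTH = (V_MAX - V_MIN) / NUM_BINS   # width of each equal-width bin


def score_to_bin(score: float) -> int:
    """Map a score to a class index in [0, NUM_BINS - 1], clamping outliers."""
    idx = int((score - V_MIN) / BIN_WIDTH)
    return max(0, min(NUM_BINS - 1, idx))


def bin_to_score(idx: int) -> float:
    """Decode a class index to the centre of its bin (assumed decoding rule)."""
    return V_MIN + (idx + 0.5) * BIN_WIDTH


if __name__ == "__main__":
    score = 5.42  # hypothetical mean aesthetic score
    idx = score_to_bin(score)
    print(idx, bin_to_score(idx))
```

Under these assumptions, encoding a score and decoding it again changes it by at most half a bin width, which bounds the quantization error introduced by treating regression as classification.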

## Performance

Evaluated on the AVA test set (19,930 samples):

- **Pearson Correlation (PLCC):** 0.841
- **Spearman Correlation (SRCC):** 0.843
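
For reference, the two metrics can be computed as in the sketch below, which uses only the standard library. PLCC is the Pearson correlation between predicted and ground-truth scores, and SRCC is the Pearson correlation between their ranks, with ties given their average rank. The helper names and the toy scores are hypothetical, and this is not the evaluation script used to produce the numbers above.

```python
import math


def pearson(x, y):
    """Pearson linear correlation coefficient (PLCC) of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values):
    """1-based ranks of `values`, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x, y):
    """Spearman rank correlation coefficient (SRCC)."""
    return pearson(average_ranks(x), average_ranks(y))


if __name__ == "__main__":
    predicted = [5.1, 6.3, 4.2, 7.0, 5.8]     # hypothetical model outputs
    ground_truth = [5.4, 6.0, 4.5, 6.8, 5.2]  # hypothetical mean scores
    print("PLCC:", pearson(predicted, ground_truth))
    print("SRCC:", spearman(predicted, ground_truth))
```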

## Citation

```bibtex
@inproceedings{jennings2025language,
  title={Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression},
  author={Jennings, Roy H. and Paikin, Genady and Shaul, Roy and Soloveichik, Evgeny},
  booktitle={2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026},
  organization={IEEE}
}
```