---
base_model: "Qwen/Qwen2-VL-2B-Instruct"
library_name: transformers
tags:
- image-text-to-text
- regression
arxiv: 2507.14997
github: https://github.com/royhj/RvTC
---

# Regression via Transformer-Based Classification (RvTC)

- **Model:** Qwen2-VL-2B-Image-Only
- **Code:** [github](https://github.com/royhj/RvTC)
- **Paper:** [arxiv](https://arxiv.org/abs/2507.14997)

## Model Description

A Qwen2-VL-2B-Instruct model fine-tuned for image aesthetic assessment with the RvTC (Regression via Transformer-Based Classification) framework. This checkpoint was trained in **image-only** mode, without textual context.

## Base Model

- **Architecture:** Qwen2-VL-2B-Instruct
- **Source:** Qwen/Qwen2-VL-2B-Instruct

## Training Configuration

- **Dataset:** AVA (Aesthetic Visual Analysis)
- **Training Mode:** Image-only (no textual prompts)
- **Epochs:** 2
- **Learning Rate:** 1e-5
- **Batch Size:** 128 (training)
- **Optimizer:** AdamW with cosine scheduler
- **Warmup Ratio:** 0.03

## Binning Configuration

- **Number of Bins:** 51
- **Value Range:** [1.81, 8.60] (the range of the training set)
- **Method:** Uniform binning for regression via classification

## Performance

Evaluated on the AVA test set (19,930 samples):

- **Pearson Correlation (PLCC):** 0.841
- **Spearman Correlation (SRCC):** 0.843

## Citation

```bibtex
@inproceedings{jennings2025language,
  title={Language Integration in Fine-Tuning Multimodal Large Language Models for Image-Based Regression},
  author={Roy H. Jennings and Genady Paikin and Roy Shaul and Evgeny Soloveichik},
  booktitle={2026 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)},
  year={2026},
  organization={IEEE}
}
```
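
## Binning Sketch

To make the Binning Configuration above concrete, here is a minimal sketch of how uniform binning could turn a continuous aesthetic score into a class index and back. The bin count (51) and value range ([1.81, 8.60]) come from this card. Everything else is an assumption for illustration and may differ from the released code: the helper names `score_to_bin` and `bin_to_score`, the choice of 51 equal-width intervals, clamping of out-of-range scores, and decoding to the bin center.

```python
# Illustrative sketch only; not taken from the RvTC repository.
# NUM_BINS and the [LOW, HIGH] range are from the model card.
# Equal-width intervals, clamping, and bin-center decoding are assumptions.

NUM_BINS = 51
LOW, HIGH = 1.81, 8.60
BIN_WIDTH = (HIGH - LOW) / NUM_BINS


def score_to_bin(score: float) -> int:
    """Map a continuous score to a class index in [0, NUM_BINS - 1]."""
    # Clamp so scores outside the training range land in the edge bins.
    clamped = min(max(score, LOW), HIGH)
    index = int((clamped - LOW) / BIN_WIDTH)
    # A score equal to HIGH would otherwise index one past the last bin.
    return min(index, NUM_BINS - 1)


def bin_to_score(index: int) -> float:
    """Map a class index back to the center of its bin."""
    if not 0 <= index < NUM_BINS:
        raise ValueError(f"bin index {index} is outside [0, {NUM_BINS - 1}]")
    return LOW + (index + 0.5) * BIN_WIDTH


if __name__ == "__main__":
    label = score_to_bin(5.0)       # class label used as the training target
    decoded = bin_to_score(label)   # continuous value recovered at inference
    print(label, round(decoded, 3))
```

Under these assumptions, a round trip through the two helpers loses at most half a bin width, about 0.067 on this score scale.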