---
license: apache-2.0
language:
- en
- zh
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- aesthetic-assessment
- portrait-craft
- lora
- knowledge-distillation
- qwen3.5
base_model: Qwen/Qwen3.5-4B-VL
---

# PortraitCraft Track-1 — Qwen3.5-4B-VL Fine-tuned

A 4B vision-language model fine-tuned for portrait composition aesthetic assessment, submitted to the **CVPR 2026 Workshop PortraitCraft Challenge Track-1**.

The model jointly predicts:

- **13 fine-grained aesthetic criteria** (each predicted as a continuous 0-10 score, then mapped to a level `A`, `B`, or `C` in the final output)
- **An overall aesthetic score** (integer 1-100)
- **A 4-way multiple-choice VQA answer**

All three are emitted in a single strict-JSON output.

## Quick start

```bash
pip install -r inference/requirements.txt

# Run inference on the official Track-1 test set:
bash run_inference.sh \
    /path/to/track_1_test.json \
    /path/to/PortraitCraft   # directory containing images_00/ images_01/ ...
```

This produces `submission.json` and `submission.zip` in the repo root.

## Inference

Inference uses **4-pass test-time augmentation** (two resolutions, each with the original and the horizontally flipped image):

1. Standard resolution (`max_pixels=1003520`), original image
2. Standard resolution, horizontally flipped image
3. High resolution (`max_pixels=2007040`), original image
4. High resolution, horizontally flipped image

The continuous criterion scores from all four passes are averaged first and only then mapped to discrete levels by fixed thresholds (`<5 → A`, `≥5 and <7 → B`, `≥7 → C`). The VQA answer is taken from pass 1 (standard resolution, original image).
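
The average-then-threshold step can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names `score_to_level` and `fuse_criteria` are hypothetical, and the sketch assumes each pass has already been parsed into a dict mapping criterion name to a 0-10 score.

```python
from statistics import mean


def score_to_level(score: float) -> str:
    """Map a 0-10 score to a level using the thresholds stated in this card."""
    if score < 5.0:
        return "A"
    if score < 7.0:
        return "B"
    return "C"


def fuse_criteria(pass_scores: list[dict[str, float]]) -> dict[str, dict[str, str]]:
    """Average each criterion over all passes, then threshold the mean once."""
    fused = {}
    for name in pass_scores[0]:
        avg = mean(scores[name] for scores in pass_scores)
        fused[name] = {"level": score_to_level(avg)}
    return fused


# Hypothetical scores for one criterion across the four passes.
passes = [
    {"Sharpness": 4.9},
    {"Sharpness": 4.9},
    {"Sharpness": 4.9},
    {"Sharpness": 7.5},
]
print(fuse_criteria(passes))  # → {'Sharpness': {'level': 'B'}}
```

In this example a per-pass majority vote would give `A`, while the averaged score (5.55) falls in the `B` band. Thresholding only once, after averaging, keeps the information carried by the continuous scores.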

## Output schema

```json
{
  "image_path": "...",
  "criteria": {
    "Color Harmony": {"level": "A|B|C"},
    "Visual Style Consistency": {"level": "A|B|C"},
    "Sharpness": {"level": "A|B|C"},
    "Light and Shadow Modeling": {"level": "A|B|C"},
    "Creativity and Originality": {"level": "A|B|C"},
    "Exposure Control": {"level": "A|B|C"},
    "Application of Classical Composition Principles": {"level": "A|B|C"},
    "Depth of Field and Layering": {"level": "A|B|C"},
    "Visual Center Stability": {"level": "A|B|C"},
    "Visual Flow Guidance": {"level": "A|B|C"},
    "Structural Support Stability": {"level": "A|B|C"},
    "Appropriateness of Negative Space": {"level": "A|B|C"},
    "Subject Integrity": {"level": "A|B|C"}
  },
  "total_score": 65,
  "question": "...",
  "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
  "answer": "A|B|C|D"
}
```
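
Before submitting, it can help to sanity-check each record against the schema above. The following is a rough sketch under assumptions: `validate_record` is a hypothetical helper that is not part of this repository, and it only checks the `criteria`, `total_score`, and `answer` fields.

```python
CRITERIA = [
    "Color Harmony",
    "Visual Style Consistency",
    "Sharpness",
    "Light and Shadow Modeling",
    "Creativity and Originality",
    "Exposure Control",
    "Application of Classical Composition Principles",
    "Depth of Field and Layering",
    "Visual Center Stability",
    "Visual Flow Guidance",
    "Structural Support Stability",
    "Appropriateness of Negative Space",
    "Subject Integrity",
]


def validate_record(record: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passed."""
    errors = []
    criteria = record.get("criteria")
    if not isinstance(criteria, dict):
        criteria = {}
        errors.append("criteria: missing or not an object")
    for name in CRITERIA:
        entry = criteria.get(name)
        level = entry.get("level") if isinstance(entry, dict) else None
        if level not in {"A", "B", "C"}:
            errors.append(f"criteria[{name!r}]: level must be A, B, or C")
    for name in sorted(set(criteria) - set(CRITERIA)):
        errors.append(f"criteria[{name!r}]: unexpected criterion")
    score = record.get("total_score")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(score, bool) or not isinstance(score, int) or not 1 <= score <= 100:
        errors.append("total_score: must be an integer in 1-100")
    if record.get("answer") not in {"A", "B", "C", "D"}:
        errors.append("answer: must be A, B, C, or D")
    return errors
```

One possible use is to load `submission.json` with `json.load` and run `validate_record` on each entry, assuming the file holds a list of such records.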

## Environment

Pinned versions for reproducibility (see `inference/requirements.txt`):

| Package | Version |
|---|---|
| vllm | 0.19.1 |
| transformers | 5.5.4 |
| torch | 2.10.0 (CUDA 12.x) |
| Pillow | 11.3.0 |

To reproduce our results as closely as possible, we recommend running on NVIDIA H20 GPUs (matching the training/inference setup).
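
The pinned versions in the table above can be compared against the active environment with a short script. This is a sketch only: `find_mismatches` is a hypothetical helper, and it assumes that a local build tag on the installed version (for example a `+cu128` suffix on `torch`) should be ignored when comparing.

```python
from importlib import metadata

# Versions copied from the table in this card.
PINNED = {
    "vllm": "0.19.1",
    "transformers": "5.5.4",
    "torch": "2.10.0",
    "Pillow": "11.3.0",
}


def find_mismatches(pinned=PINNED, get_version=metadata.version):
    """Return {package: (wanted, installed)} for every package that differs or is missing."""
    mismatches = {}
    for package, wanted in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop a local build tag such as "+cu128" before comparing.
        base = installed.split("+", 1)[0] if installed else None
        if base != wanted:
            mismatches[package] = (wanted, installed)
    return mismatches


if __name__ == "__main__":
    for package, (wanted, installed) in find_mismatches().items():
        print(f"{package}: expected {wanted}, found {installed}")
```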

## License

Apache 2.0 (inherited from the Qwen3.5-4B-VL base model).

## Citation

If you use this model, please cite the PortraitCraft challenge and the Qwen3.5 base model.