---
license: apache-2.0
language:
- en
- zh
library_name: transformers
pipeline_tag: image-text-to-text
tags:
- vision-language
- multimodal
- aesthetic-assessment
- portrait-craft
- lora
- knowledge-distillation
- qwen3.5
base_model: Qwen/Qwen3.5-4B-VL
---

# PortraitCraft Track-1: Qwen3.5-4B-VL Fine-tuned

A 4B vision-language model fine-tuned for portrait composition aesthetic assessment, submitted to the **CVPR 2026 Workshop PortraitCraft Challenge Track-1**.

The model jointly predicts:

- **13 fine-grained aesthetic criteria** (each as a continuous 0-10 score)
- **An overall aesthetic score** (integer 1-100)
- **A 4-way multiple-choice VQA answer**

All three are emitted in a single strict-JSON output.

## Quick start

```bash
pip install -r inference/requirements.txt

# Run inference on the official Track-1 test set:
bash run_inference.sh \
    /path/to/track_1_test.json \
    /path/to/PortraitCraft   # directory containing images_00/ images_01/ ...
```

This produces `submission.json` and `submission.zip` in the repo root.

## Inference

At inference time the pipeline uses **4-pass test-time augmentation** (two resolutions, each with and without a horizontal flip):

1. Standard resolution (`max_pixels=1003520`) + original image
2. Standard resolution + horizontally flipped image
3. High resolution (`max_pixels=2007040`) + original image
4. High resolution + horizontally flipped image

The continuous criterion scores from all four passes are averaged first, and only the averaged score is mapped to a discrete level by fixed thresholds (`s < 5 → A`, `5 ≤ s < 7 → B`, `s ≥ 7 → C`). The VQA answer is taken from the standard-resolution original pass.

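
The average-then-threshold step described above can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function names `score_to_level` and `aggregate_criteria` are hypothetical, the per-pass scores are made up, and the boundary handling (exactly 5 maps to B, exactly 7 maps to C) is an assumed reading of the stated thresholds.

```python
from statistics import mean


def score_to_level(score: float) -> str:
    """Map an averaged 0-10 score to a level (assumed: <5 -> A, 5 to <7 -> B, >=7 -> C)."""
    if score < 5.0:
        return "A"
    if score < 7.0:
        return "B"
    return "C"


def aggregate_criteria(passes: list[dict[str, float]]) -> dict[str, dict[str, str]]:
    """Average each criterion over all TTA passes, then discretize the mean."""
    names = passes[0].keys()
    return {
        name: {"level": score_to_level(mean(p[name] for p in passes))}
        for name in names
    }


# Made-up scores for the four passes (standard, standard+flip, high, high+flip).
passes = [
    {"Sharpness": 6.8, "Color Harmony": 4.0},
    {"Sharpness": 7.4, "Color Harmony": 4.6},
    {"Sharpness": 7.1, "Color Harmony": 5.2},
    {"Sharpness": 6.9, "Color Harmony": 5.0},
]
result = aggregate_criteria(passes)
print(result)
```

In this example, thresholding each pass separately would give Sharpness the levels B, C, C, B, which is a tie; averaging the continuous scores first avoids that ambiguity and yields a single level per criterion.
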
## Output schema

```json
{
  "image_path": "...",
  "criteria": {
    "Color Harmony": {"level": "A|B|C"},
    "Visual Style Consistency": {"level": "A|B|C"},
    "Sharpness": {"level": "A|B|C"},
    "Light and Shadow Modeling": {"level": "A|B|C"},
    "Creativity and Originality": {"level": "A|B|C"},
    "Exposure Control": {"level": "A|B|C"},
    "Application of Classical Composition Principles": {"level": "A|B|C"},
    "Depth of Field and Layering": {"level": "A|B|C"},
    "Visual Center Stability": {"level": "A|B|C"},
    "Visual Flow Guidance": {"level": "A|B|C"},
    "Structural Support Stability": {"level": "A|B|C"},
    "Appropriateness of Negative Space": {"level": "A|B|C"},
    "Subject Integrity": {"level": "A|B|C"}
  },
  "total_score": 65,
  "question": "...",
  "options": {"A": "...", "B": "...", "C": "...", "D": "..."},
  "answer": "A|B|C|D"
}
```

## Environment

Pinned versions for reproducibility (see `inference/requirements.txt`):

| Package | Version |
|---|---|
| vllm | 0.19.1 |
| transformers | 5.5.4 |
| torch | 2.10.0 (CUDA 12.x) |
| Pillow | 11.3.0 |

For best reproduction we recommend running on NVIDIA H20 GPUs (matching the training/inference setup).

## License

Apache 2.0 (inherited from the Qwen3.5-4B-VL base model).

## Citation

If you use this model, please cite the PortraitCraft challenge and the Qwen3.5 base model.