---
license: apache-2.0
language:
- en
- zh
- th
base_model:
- prithivMLmods/DeepCaption-VLA-7B
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- trl
- text-generation-inference
- BLIP3-o
- Image-Caption
- VisionLanguageAttribution
- VisualUnderstanding
- AttributeCaptioning
- VLA
- High-Fidelity
- partial-abliteration
---

![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/XrstquEgbF3UUdWS6KRTp.png)

# **DeepCaption-VLA-V2.0-7B**

> **DeepCaption-VLA-V2.0-7B** is an advanced fine-tuned version of **Qwen2.5-VL-7B-Instruct**, specialized for **Image Captioning** and **Vision Language Attribution (VLA)**. This enhanced release focuses on generating **precise, attribute-rich captions** that capture **visual properties, object attributes, and scene details** across diverse image types and aspect ratios.
>
> Version **V2.0** introduces **significant improvements in multilingual inference**, delivering higher captioning quality and attribution accuracy in languages including **Chinese (Zh)**, **Thai (Th)**, and others.

[![Download Demo Notebook](https://img.shields.io/badge/Open%20Demo%20Notebook-DeepCaption--VLA--7B--v2.0-blue?style=for-the-badge&logo=jupyter)](https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption_VLA_V2_0_7B/DeepCaption_VLA_V2_0_7Bipynb.ipynb)


## Key Highlights

1. **Vision Language Attribution (VLA):** Fine-tuned to attribute and define visual properties of objects, scenes, and environments with greater semantic precision.
2. **Detailed Object Definitions:** Generates attribute-rich captions, offering deeper visual understanding compared to generic captioning models.
3. **High-Fidelity Descriptions:** Excels at describing general, artistic, technical, abstract, and low-context images with enhanced descriptive detail.
4. **Robust Across Aspect Ratios:** Maintains caption accuracy across various formats: wide, tall, square, or irregular.
5. **Variational Detail Control:** Supports both concise summaries and fine-grained visual attributions depending on prompt structure.
6. **Enhanced Multilingual Inference (New in V2.0):** Optimized for generating accurate and descriptive captions in multiple languages, including **English, Chinese (Zh), Thai (Th)**, and more.
7. **Built on Qwen2.5-VL Architecture:** Leverages the multimodal reasoning capabilities and instruction-following strengths of Qwen2.5-VL-7B.

> model type: experimental

---

## Sample Inferences [en, zh, thai] - <span style="color:red;">[DeepCaption-VLA-V2.0-7B]</span>

| Image 1                                                                                                                                                      | Image 2                                                                                                                                                      |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| ![output\_08ab9086-6734-4d7d-a325-a8468dac32a9-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/mh1hT92b8Ze6HGdSWQVJU.jpeg) | ![output\_9e6c2b4e-a250-4eef-a45d-8ee9d901fdb4-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/P5TWKcdtwPL5u6GTeD3p2.jpeg) |
| Image 3                                                                                                                                                      | Image 4                                                                                                                                                      |
| ![output\_50c5b853-e849-453e-8d6a-cd55446b7e5e-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/sDlEEFa9qiB7J_cu1unHR.jpeg) | ![output\_56cd6bc4-1f6e-4834-b949-a386fcef1037-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/fTtUeJWVFxA-ziWvXJfVg.jpeg) |
| Image 5                                                                                                                                                      | Image 6                                                                                                                                                      |
| ![output\_56627187-e752-4cdf-93b9-776377908382-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/72S68uC6ngwWLPo0cn2yd.jpeg) | ![output\_cd987d54-5812-41d9-8f75-71036d1f4bd3-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/KWAjGp2vJMjxdiCyCzcGL.jpeg) |
| Image 7 [zh]                                                                                                                                                      | Image 8                                                                                                                                                      |
| ![output\_d5f58601-e303-4ea8-9ee2-ea935dcac1b5-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/i_VKaqUQ5wFtpe7RXDucv.jpeg) | ![output\_d113bd7f-7d7f-4524-a941-ecd4fcd97eb0-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/bpIvfyMHW9zOqz-9VmqMS.jpeg) |
| Image 9 [zh]                                                                                                                                                      | Image 10 [thai]                                                                                                                                                    |
| ![output\_d5217ad1-10de-4bce-811c-b10658eecd7f-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/QIU-N-Te3J9u7xo10pJE6.jpeg) | ![output\_f0387f11-4a61-4848-8cba-e32e422374b2-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/uvV4jIduKKqvRX52slUlG.jpeg) |

---

## Comparison of Inference: Qwen2.5-VL-7B vs. <span style="color:red;">DeepCaption-VLA-V2.0-7B</span>

| **Qwen2.5-VL-7B-Instruct**                                                                                                                                   | **DeepCaption-VLA-V2.0-7B**                                                                                                                                  |
| ------------------------------------------------------------------------------------------------------------------------------------------------------------ | ------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| ![output\_5430db23-c599-440f-aa4a-b05ff91d9d91-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/5R0uCbtXoc2Y5Oasioy3f.jpeg) | ![output\_5d06a443-21ad-4bfd-8de5-6345f3383b62-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/jvi5mflhct3WGCXFrWc8A.jpeg) |
| ![output\_baf07cf2-07d7-4877-98db-7fa040745e23-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/USEkEYrGo_whZaDx-N7Bo.jpeg) | ![DeepCaption-VLA-V2.0-7B output](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/atiIvDdMVCsHd6Ip4WOyQ.jpeg)                 |

---

## Example of a Recommended System Instruction

```python
CAPTION_SYSTEM_PROMPT = """
You are an AI assistant that rigorously follows this response protocol:

1. For every input image, your primary task is to write a **precise caption**. The caption must capture the **essence of the image** in clear, concise, and contextually accurate language.

2. Along with the caption, provide a structured set of **attributes** that describe the visual elements. Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics.

3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format.  
   - Use the syntax: `{class_name==write_the_core_theme}`  
   - Example: `{class_name==dog_playing}` or `{class_name==city_sunset}`  

4. Maintain the following strict format in your output:
   - **Caption:** <one-sentence description>  
   - **Attributes:** <comma-separated list of visual attributes>  
   - **{class_name==core_theme}**

5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required.

6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name.
"""
```
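
If the model follows the output format requested in the system instruction above, the response can be split into its three fields with a small parser. The sketch below is illustrative only: `parse_caption_output` and `ParsedCaption` are hypothetical names, not part of any library, and real model output may deviate from the requested format, so missing fields come back as `None` or an empty list rather than raising.

```python
import re
from typing import List, Optional, TypedDict


class ParsedCaption(TypedDict):
    caption: Optional[str]
    attributes: List[str]
    class_name: Optional[str]


def parse_caption_output(text: str) -> ParsedCaption:
    """Split a Caption / Attributes / class_name response into fields (hypothetical helper)."""
    # Labels may or may not be wrapped in markdown bold markers (**).
    caption = re.search(r"\*{0,2}Caption:\*{0,2}\s*(.+)", text)
    attributes = re.search(r"\*{0,2}Attributes:\*{0,2}\s*(.+)", text)
    # The class name uses the {class_name==...} syntax from the system instruction.
    class_name = re.search(r"\{class_name==([^}]+)\}", text)

    attribute_list: List[str] = []
    if attributes:
        attribute_list = [a.strip() for a in attributes.group(1).split(",") if a.strip()]

    return {
        "caption": caption.group(1).strip() if caption else None,
        "attributes": attribute_list,
        "class_name": class_name.group(1).strip() if class_name else None,
    }


# Hand-written sample in the requested format (not real model output).
sample = (
    "**Caption:** A brown dog runs across a grassy field.\n"
    "**Attributes:** dog, brown fur, grass, running, daylight\n"
    "**{class_name==dog_playing}**"
)
parsed = parse_caption_output(sample)
print(parsed["class_name"], parsed["attributes"])
```

Returning empty values instead of raising keeps batch captioning jobs running when a single response is malformed; callers can filter on `caption is None` afterwards.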

---

## Quick Start with Transformers

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model; dtype and device placement are selected automatically.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepCaption-VLA-V2.0-7B", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-V2.0-7B")

# A single-turn conversation: one image plus a text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with detailed attributes and properties."},
        ],
    }
]

# Render the chat template to a prompt string and extract the image/video inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the device chosen by device_map="auto" instead of hard-coding "cuda".
inputs = inputs.to(model.device)

# Generate, then strip the prompt tokens so only the newly generated caption is decoded.
# Raise max_new_tokens if captions come back truncated.
generated_ids = model.generate(**inputs, max_new_tokens=128)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```
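
The Quick Start sends only a user turn. To apply the recommended system instruction from the earlier section, one option is to prepend a system message to the conversation. The helper below is a minimal sketch under that assumption: `build_messages` is a hypothetical name, and `STAND_IN_SYSTEM_PROMPT` is a shortened placeholder used only to keep the example self-contained; in practice you would pass `CAPTION_SYSTEM_PROMPT`.

```python
from typing import Any, Dict, List, Optional


def build_messages(
    image: str, prompt: str, system_prompt: Optional[str] = None
) -> List[Dict[str, Any]]:
    """Build a chat message list with an optional system turn (hypothetical helper)."""
    messages: List[Dict[str, Any]] = []
    if system_prompt:
        # The system turn goes first so the chat template places it before the user turn.
        messages.append(
            {"role": "system", "content": [{"type": "text", "text": system_prompt.strip()}]}
        )
    # The user turn mirrors the Quick Start: one image followed by a text instruction.
    messages.append(
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": prompt},
            ],
        }
    )
    return messages


# Shortened placeholder; substitute CAPTION_SYSTEM_PROMPT in real use.
STAND_IN_SYSTEM_PROMPT = "Write a precise caption, a list of attributes, and a class_name."

messages = build_messages(
    "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
    "Caption this image.",
    system_prompt=STAND_IN_SYSTEM_PROMPT,
)
```

The resulting `messages` list can replace the one in the Quick Start; the remaining steps (`apply_chat_template`, `process_vision_info`, `generate`) stay the same. Calling `build_messages` without `system_prompt` produces the same single user turn the Quick Start uses.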

---

## Intended Use

* Generating attribute-rich image captions for research, dataset creation, and AI training.
* Vision-language attribution for object detection, scene understanding, and dataset annotation.
* Supporting creative, artistic, and technical applications requiring descriptive image understanding.
* Captioning across varied aspect ratios, non-standard datasets, and multilingual contexts.

## Limitations

* May over-attribute or infer properties not explicitly visible in ambiguous or low-resolution images.
* Caption tone and level of detail may vary depending on prompt phrasing.
* Not suitable for tasks that require filtered captions; explicit or sensitive content may still appear in the output.
* Performance may degrade slightly on highly synthetic or abstract visual domains.