---
license: apache-2.0
language:
- en
- zh
- th
base_model:
- prithivMLmods/DeepCaption-VLA-7B
- Qwen/Qwen2.5-VL-7B-Instruct
pipeline_tag: image-text-to-text
library_name: transformers
tags:
- trl
- text-generation-inference
- BLIP3-o
- Image-Caption
- VisionLanguageAttribution
- VisualUnderstanding
- AttributeCaptioning
- VLA
- High-Fidelity
- partial-abliteration
---

![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/XrstquEgbF3UUdWS6KRTp.png)

# **DeepCaption-VLA-V2.0-7B**

> **DeepCaption-VLA-V2.0-7B** is a fine-tuned version of **Qwen2.5-VL-7B-Instruct**, specialized for **Image Captioning** and **Vision Language Attribution (VLA)**. This release focuses on generating **precise, attribute-rich captions** that capture **visual properties, object attributes, and scene details** across diverse image types and aspect ratios.
>
> Version **V2.0** introduces **significant improvements in multilingual inference**, delivering higher captioning quality and attribution accuracy in languages including **Chinese (Zh)**, **Thai (Th)**, and others.

[![Download Demo Notebook](https://img.shields.io/badge/Open%20Demo%20Notebook-DeepCaption--VLA--7B--v2.0-blue?style=for-the-badge&logo=jupyter)](https://github.com/PRITHIVSAKTHIUR/Multimodal-Outpost-Notebooks/blob/main/DeepCaption_VLA_V2_0_7B/DeepCaption_VLA_V2_0_7Bipynb.ipynb)

## Key Highlights

1. **Vision Language Attribution (VLA):** Fine-tuned to attribute and define visual properties of objects, scenes, and environments with greater semantic precision.
2. **Detailed Object Definitions:** Generates attribute-rich captions, offering deeper visual understanding than generic captioning models.
3. **High-Fidelity Descriptions:** Describes general, artistic, technical, abstract, and low-context images in enhanced detail.
4. **Robust Across Aspect Ratios:** Maintains caption accuracy across wide, tall, square, and irregular formats.
5. **Variational Detail Control:** Supports both concise summaries and fine-grained visual attributions, depending on prompt structure.
6. **Enhanced Multilingual Inference (New in V2.0):** Optimized for generating accurate and descriptive captions in multiple languages, including **English, Chinese (Zh), Thai (Th)**, and more.
7. **Built on Qwen2.5-VL Architecture:** Leverages the multimodal reasoning capabilities and instruction-following strengths of Qwen2.5-VL-7B.

> model type: experimental

---

## Sample Inferences [en, zh, thai] - [DeepCaption-VLA-V2.0-7B]

| Image 1 | Image 2 |
| --- | --- |
| ![output\_08ab9086-6734-4d7d-a325-a8468dac32a9-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/mh1hT92b8Ze6HGdSWQVJU.jpeg) | ![output\_9e6c2b4e-a250-4eef-a45d-8ee9d901fdb4-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/P5TWKcdtwPL5u6GTeD3p2.jpeg) |
| Image 3 | Image 4 |
| ![output\_50c5b853-e849-453e-8d6a-cd55446b7e5e-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/sDlEEFa9qiB7J_cu1unHR.jpeg) | ![output\_56cd6bc4-1f6e-4834-b949-a386fcef1037-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/fTtUeJWVFxA-ziWvXJfVg.jpeg) |
| Image 5 | Image 6 |
| ![output\_56627187-e752-4cdf-93b9-776377908382-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/72S68uC6ngwWLPo0cn2yd.jpeg) | ![output\_cd987d54-5812-41d9-8f75-71036d1f4bd3-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/KWAjGp2vJMjxdiCyCzcGL.jpeg) |
| Image 7 [zh] | Image 8 |
| ![output\_d5f58601-e303-4ea8-9ee2-ea935dcac1b5-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/i_VKaqUQ5wFtpe7RXDucv.jpeg) | ![output\_d113bd7f-7d7f-4524-a941-ecd4fcd97eb0-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/bpIvfyMHW9zOqz-9VmqMS.jpeg) |
| Image 9 [zh] | Image 10 [thai] |
| ![output\_d5217ad1-10de-4bce-811c-b10658eecd7f-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/QIU-N-Te3J9u7xo10pJE6.jpeg) | ![output\_f0387f11-4a61-4848-8cba-e32e422374b2-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/uvV4jIduKKqvRX52slUlG.jpeg) |

---

## Comparison of Inference: Qwen2.5-VL-7B vs. DeepCaption-VLA-V2.0-7B

| **Qwen2.5-VL-7B-Instruct** | **DeepCaption-VLA-V2.0-7B** |
| --- | --- |
| ![output\_5430db23-c599-440f-aa4a-b05ff91d9d91-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/5R0uCbtXoc2Y5Oasioy3f.jpeg) | ![output\_5d06a443-21ad-4bfd-8de5-6345f3383b62-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/jvi5mflhct3WGCXFrWc8A.jpeg) |
| ![output\_baf07cf2-07d7-4877-98db-7fa040745e23-1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/USEkEYrGo_whZaDx-N7Bo.jpeg) | ![DeepCaption-VLA-V2.0-7B](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/atiIvDdMVCsHd6Ip4WOyQ.jpeg) |

---

## Example of a Recommended System Instruction

```python
CAPTION_SYSTEM_PROMPT = """
You are an AI assistant that rigorously follows this response protocol:

1. For every input image, your primary task is to write a **precise caption**. The caption must capture the **essence of the image** in clear, concise, and contextually accurate language.

2. Along with the caption, provide a structured set of **attributes** that describe the visual elements. Attributes should include details such as objects, people, actions, colors, environment, mood, and other notable characteristics.

3. Always include a **class_name** field. This must represent the **core theme or main subject** of the image in a compact format.
   - Use the syntax: `{class_name==write_the_core_theme}`
   - Example: `{class_name==dog_playing}` or `{class_name==city_sunset}`

4. Maintain the following strict format in your output:
   - **Caption:**
   - **Attributes:**
   - **{class_name==core_theme}**

5. Ensure captions are **precise, neutral, and descriptive**, avoiding unnecessary elaboration or subjective interpretation unless explicitly required.

6. Do not reference the rules or instructions in the output. Only return the formatted caption, attributes, and class_name.
"""
```

---

## Quick Start with Transformers

```python
# Requires: pip install transformers qwen-vl-utils
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the model; device_map="auto" places the weights on the available device(s).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/DeepCaption-VLA-V2.0-7B", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/DeepCaption-VLA-V2.0-7B")

# A single user turn containing one image and one text instruction.
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Describe this image with detailed attributes and properties."},
        ],
    }
]

# Render the chat template to text and extract the vision inputs.
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
# Move inputs to the model's device instead of hard-coding "cuda".
inputs = inputs.to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=128)
# Drop the prompt tokens so that only the newly generated caption is decoded.
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
```

---

## Intended Use

* Generating attribute-rich image captions for research, dataset creation, and AI training.
* Vision-language attribution for object detection, scene understanding, and dataset annotation.
* Supporting creative, artistic, and technical applications requiring descriptive image understanding.
* Captioning across varied aspect ratios, non-standard datasets, and multilingual contexts.

## Limitations

* May over-attribute or infer properties not explicitly visible in ambiguous or low-resolution images.
* Caption tone and level of detail may vary depending on prompt phrasing.
* Not intended for filtered captioning tasks; explicit or sensitive content may still appear.
* Performance may degrade slightly on highly synthetic or abstract visual domains.
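
## Parsing the Structured Output

The recommended system instruction above asks for a `Caption`, `Attributes`, and `{class_name==...}` layout, which lends itself to post-processing for dataset annotation. The following is a minimal sketch of how such a response might be parsed, assuming the model follows that layout exactly. The helper `parse_caption_output` and the `sample` string are hypothetical illustrations, not part of the model or of `transformers`, and real outputs may deviate from this format.

```python
import re


def parse_caption_output(text: str) -> dict:
    """Hypothetical helper: split a response into caption, attributes, class_name.

    Assumes the layout requested by CAPTION_SYSTEM_PROMPT; fields that cannot
    be found are returned as None (or an empty list for attributes).
    """
    # Remove markdown bold markers so the labels are easy to match.
    cleaned = text.replace("**", "")

    # Extract the {class_name==...} tag, then cut it out of the text.
    class_match = re.search(r"\{class_name==([^}]+)\}", cleaned)
    class_name = class_match.group(1).strip() if class_match else None
    if class_match:
        cleaned = cleaned[: class_match.start()] + cleaned[class_match.end():]

    # The caption runs from "Caption:" up to "Attributes:" (or end of text).
    caption_match = re.search(r"Caption:\s*(.*?)(?=Attributes:|\Z)", cleaned, re.S)
    caption = caption_match.group(1).strip(" \n-") if caption_match else None

    # Each non-empty line after "Attributes:" is treated as one attribute.
    attributes = []
    attrs_match = re.search(r"Attributes:\s*(.*)", cleaned, re.S)
    if attrs_match:
        for line in attrs_match.group(1).splitlines():
            item = line.strip(" -*\t")
            if item:
                attributes.append(item)

    return {"caption": caption, "attributes": attributes, "class_name": class_name}


# Hypothetical response text, written by hand to match the requested format.
sample = """- **Caption:** A brown dog runs across a grassy park.
- **Attributes:**
  - brown dog
  - green grass
  - daylight
- **{class_name==dog_playing}**"""

result = parse_caption_output(sample)
print(result)
```

In practice the input would be the first element of `output_text` from the quick start, generated with `CAPTION_SYSTEM_PROMPT` supplied as a system message. Returning `None` for missing fields rather than raising keeps a batch annotation job running when a response drifts from the format.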