---
license: apache-2.0
base_model:
- prithivMLmods/Qwen3-VL-32B-Instruct-abliterated-v1
pipeline_tag: image-text-to-text
tags:
- text-generation-inference
- uncensored
- abliterated
- unfiltered
- unredacted
- vllm
- pytorch
- fp8
- f8_e4m3
- max
- agent
language:
- en
library_name: transformers
---

![1](https://cdn-uploads.huggingface.co/production/uploads/65bb837dbfb878f46c77de4c/arngxpHFh7IIjFKtwtfjM.png)

# **Qwen3-VL-32B-Instruct-Unredacted-MAX-FP8**

> **Qwen3-VL-32B-Instruct-Unredacted-MAX-FP8** is an FP8-compressed evolution built on top of **prithivMLmods/Qwen3-VL-32B-Instruct-abliterated-v1**. This variant uses **BF16 · FP8 (F8_E4M3)** precision formats to reduce memory footprint and improve inference efficiency, while preserving the unredacted multimodal reasoning strengths of the original 32B architecture.
> The result is a capable 32B vision-language model optimized for unrestricted, detailed reasoning and captioning across complex visual inputs, with improved hardware efficiency.

> [!important]
> FP8 (8-bit floating point) weight and activation quantization using hardware acceleration on GPUs – [FP8 W8A8](https://docs.vllm.ai/en/stable/features/quantization/fp8/). Quantization W8A8 FP8-dynamic recipe – [examples](https://github.com/vllm-project/llm-compressor/tree/main/examples/quantization_w8a8_fp8).

## Key Highlights

* **BF16 · FP8 (F8_E4M3) Compression**: Transformer Engine–based FP8 quantization reduces VRAM usage and improves throughput while maintaining strong multimodal reasoning fidelity.
* **Unredacted MAX Training**: Retains the abliterated fine-tuning strategy designed to minimize internal refusal behaviors and improve instruction adherence.
* **32B Parameter Architecture**: Built on top of `prithivMLmods/Qwen3-VL-32B-Instruct-abliterated-v1` (derived from Qwen/Qwen3-VL-32B-Instruct), retaining the reasoning capacity of the 32B architecture while benefiting from FP8 efficiency gains.
* **Unrestricted Multimodal Reasoning**: Designed for deep analysis of artistic, forensic, technical, or abstract visual content without standard safety-driven refusals.
* **High-Fidelity Captions**: Produces dense, descriptive outputs suitable for dataset generation, metadata enrichment, or accessibility use cases.
* **Dynamic Resolution Support**: Retains Qwen3-VL’s ability to process varying image resolutions and aspect ratios effectively.
* **Optimized Deployment**: FP8 compression enables smoother deployment on Hopper and compatible GPU architectures.

## Quick Start with Transformers

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the 32B Instruct Unredacted MAX FP8 model
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/Qwen3-VL-32B-Instruct-Unredacted-MAX-FP8",
    torch_dtype="auto",
    device_map="auto"
)

processor = AutoProcessor.from_pretrained(
    "prithivMLmods/Qwen3-VL-32B-Instruct-Unredacted-MAX-FP8"
)

# A single user turn containing one image and one text instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Provide a detailed caption and reasoning for this image."},
        ],
    }
]

# Render the chat template to a prompt string and collect the vision inputs
text = processor.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
image_inputs, video_inputs = process_vision_info(messages)

# Move inputs to the model's device (device_map="auto" decides the placement)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=256)

# Drop the prompt tokens so only the newly generated tokens are decoded
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]

output_text = processor.batch_decode(
    generated_ids_trimmed,
    skip_special_tokens=True,
    clean_up_tokenization_spaces=False
)
print(output_text)
```

## Intended Use

* **Advanced Red-Teaming**: Evaluating multimodal robustness and probing behavioral edge cases.
* **Complex Data Archiving**: Generating detailed captions for medical, artistic, historical, or research datasets.
* **Refusal Mechanism Research**: Studying behavioral shifts in vision-language models after abliterated fine-tuning.
* **Creative Storytelling**: Producing detailed visual descriptions for narrative and world-building projects.

## Limitations & Risks

> **Critical Note**: This model is designed to minimize built-in refusal mechanisms.

* **Sensitive Content Exposure**: The model may generate explicit or controversial descriptions if prompted accordingly.
* **User Responsibility**: Generated outputs must be handled responsibly and used within ethical and legal boundaries.
* **Hardware Requirements**: While lighter than full-precision 32B variants, the FP8 checkpoint still requires compatible GPU support and sufficient VRAM for high-resolution image processing and extended generations.

## Acknowledgements

I would like to acknowledge the following works:

* Uncensor any LLM with abliteration – [Maxime Labonne](https://huggingface.co/mlabonne)
* Using FP8 and FP4 with Transformer Engine – [docs.nvidia](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html)
* Remove Refusals with Transformers – [Sumandora](https://github.com/Sumandora/remove-refusals-with-transformers)
* LLM Compressor – [vllm-project](https://github.com/vllm-project/llm-compressor)
* FP8 Floating-Point 8: An Introduction to Efficient, Lower-Precision AI Training – [nvidia](https://developer.nvidia.com/blog/floating-point-8-an-introduction-to-efficient-lower-precision-ai-training/)
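
## Appendix: Rough Weight-Memory Estimate

The hardware requirements noted under Limitations & Risks can be put in rough numbers. The sketch below is a back-of-the-envelope estimate only: it assumes a nominal 32e9 parameters, 1 byte per FP8 weight, and 2 bytes per BF16 weight. The helper `weight_memory_gib` and its `fp8_fraction` argument are hypothetical names introduced for illustration, and the actual split between FP8 and BF16 tensors in this checkpoint is not stated here. Activations, the KV cache, and vision-encoder buffers are not counted, so real VRAM usage will be higher.

```python
def weight_memory_gib(num_params: float, fp8_fraction: float = 1.0) -> float:
    """Approximate weight storage in GiB for a mixed FP8/BF16 checkpoint.

    fp8_fraction is the (assumed) share of parameters stored in FP8;
    the remainder is assumed to be stored in BF16.
    """
    if not 0.0 <= fp8_fraction <= 1.0:
        raise ValueError("fp8_fraction must be between 0 and 1")
    fp8_bytes = num_params * fp8_fraction * 1            # FP8: 1 byte per parameter
    bf16_bytes = num_params * (1.0 - fp8_fraction) * 2   # BF16: 2 bytes per parameter
    return (fp8_bytes + bf16_bytes) / 2**30


NUM_PARAMS = 32e9  # nominal "32B"; the exact parameter count is an assumption

bf16_only = weight_memory_gib(NUM_PARAMS, fp8_fraction=0.0)
fp8_only = weight_memory_gib(NUM_PARAMS, fp8_fraction=1.0)

print(f"BF16 weights: ~{bf16_only:.1f} GiB")
print(f"FP8 weights:  ~{fp8_only:.1f} GiB")
```

Under these assumptions the weights alone come to roughly 60 GiB in BF16 and roughly 30 GiB in FP8. A mixed BF16 · FP8 checkpoint falls somewhere between the two, depending on which layers are left unquantized.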