---
license: apache-2.0
base_model: Qwen/Qwen3-VL-4B-Instruct
tags:
- vision
- gui-agent
- fine-tuned
- qwen3-vl
library_name: transformers
pipeline_tag: image-text-to-text
---

# Fine-tuned Qwen3-VL-4B for GUI Click Actions

This model is a fine-tuned version of [Qwen/Qwen3-VL-4B-Instruct](https://huggingface.co/Qwen/Qwen3-VL-4B-Instruct), trained on GUI trajectory data to predict click actions.

## Training Details

- **Base Model:** Qwen/Qwen3-VL-4B-Instruct
- **Training Checkpoint:** Step 137, Epoch 0
- **Task:** Predict click coordinates from a screenshot and an instruction

## Usage

### With Transformers

```python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from PIL import Image

model = Qwen3VLForConditionalGeneration.from_pretrained(
    "BLR2/qwen3-vl-4b-gui-agent",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("BLR2/qwen3-vl-4b-gui-agent")

# Load your screenshot
image = Image.open("screenshot.png")
instruction = "Click on the search button"

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": instruction},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=50)

# Drop the prompt tokens so that only the newly generated text is decoded
generated = outputs[0][inputs["input_ids"].shape[1]:]
response = processor.decode(generated, skip_special_tokens=True)
print(response)  # Outputs coordinates like "0.5234 0.7891"
```

### With vLLM

```bash
vllm serve BLR2/qwen3-vl-4b-gui-agent --dtype bfloat16
```

```python
import base64

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

# For vLLM with vision, encode the image as base64 and send it as a data URL
with open("screenshot.png", "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")

response = client.chat.completions.create(
    model="BLR2/qwen3-vl-4b-gui-agent",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{base64_image}"}},
                {"type": "text", "text": "Click on the search button"},
            ],
        }
    ],
    max_tokens=50,
)
print(response.choices[0].message.content)
```

## Output Format

The model outputs normalized coordinates in the format `x y`, where both values are in the range [0, 1].

To convert to pixel coordinates:

```python
# `output` is the response string, e.g. "0.5234 0.7891";
# `image_width` and `image_height` are the screenshot dimensions in pixels.
x_norm, y_norm = map(float, output.split())
x_pixel = int(x_norm * image_width)
y_pixel = int(y_norm * image_height)
```
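
The conversion above assumes the response contains exactly two numbers, both strictly inside the range. A more defensive variant might look like the sketch below. `parse_click` is a hypothetical helper, not part of this model or of any library. It assumes that stray text around the coordinates should be ignored and that out-of-range values should be clamped, so adjust those choices to match what the model actually returns in your setup.

```python
import re

# Hypothetical helper for illustration; not shipped with the model.
_NUMBER = re.compile(r"[-+]?\d*\.?\d+")


def parse_click(output: str, image_width: int, image_height: int) -> tuple[int, int]:
    """Turn a response such as "0.5234 0.7891" into pixel coordinates."""
    numbers = _NUMBER.findall(output)
    if len(numbers) < 2:
        raise ValueError(f"expected two coordinates, got: {output!r}")

    # Assumption: values outside [0, 1] are clamped rather than rejected.
    x_norm, y_norm = (min(max(float(n), 0.0), 1.0) for n in numbers[:2])

    # Cap at the last valid pixel index so that 1.0 stays inside the image.
    x_pixel = min(int(x_norm * image_width), image_width - 1)
    y_pixel = min(int(y_norm * image_height), image_height - 1)
    return x_pixel, y_pixel


print(parse_click("0.5234 0.7891", 1920, 1080))  # (1004, 852)
```

Capping at `image_width - 1` and `image_height - 1` matters only at the right and bottom edges, where a normalized value of exactly 1.0 would otherwise map one pixel past the image.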