Qwen3.5-4B GUI Grounding — v2 (SFT LoRA)

LoRA adapter for Qwen3.5-4B, fine-tuned for GUI grounding: given a screenshot and a natural-language instruction, the model predicts the (x, y) click coordinate of the target UI element.

Results — ScreenSpot-V2

Split	Correct	Total	Accuracy
Desktop	320	334	95.8%
Mobile	474	501	94.6%
Web	394	437	90.2%
Overall	1188	1272	93.4%
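Accuracy in the table is correct divided by total for each split, and the Overall row pools all 1272 samples rather than averaging the three split percentages (averaging them would give 93.5%). The sketch below recomputes the table from its own counts. The `hit` helper is an assumption: it encodes the point-in-bounding-box criterion that ScreenSpot benchmarks commonly use, since the card does not publish its evaluation script.

```python
# Per-split (correct, total) counts copied from the ScreenSpot-V2 table above.
SPLITS = {
    "Desktop": (320, 334),
    "Mobile": (474, 501),
    "Web": (394, 437),
}


def hit(point, bbox):
    """Assumed ScreenSpot-style criterion: the click (x, y) counts as correct
    if it lies inside the target box (left, top, right, bottom), edges inclusive."""
    x, y = point
    left, top, right, bottom = bbox
    return left <= x <= right and top <= y <= bottom


def accuracy(correct, total):
    """Accuracy as a percentage, rounded to one decimal place like the table."""
    return round(100 * correct / total, 1)


per_split = {name: accuracy(c, t) for name, (c, t) in SPLITS.items()}

# The Overall row pools every sample instead of averaging split percentages.
all_correct = sum(c for c, _ in SPLITS.values())
all_total = sum(t for _, t in SPLITS.values())
overall = accuracy(all_correct, all_total)

print(per_split)                        # {'Desktop': 95.8, 'Mobile': 94.6, 'Web': 90.2}
print(all_correct, all_total, overall)  # 1188 1272 93.4
```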

Training Data

~23.5K samples from 3 GUI grounding datasets covering desktop, web, and mobile platforms.

Output Format

<|box_start|>(x,y)<|box_end|>
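A completion in this format can be parsed with a regular expression. The sketch below is a minimal parser written against the single example above; the name `parse_click` is hypothetical, and accepting optional spaces and decimal values is a permissive assumption beyond what the card specifies.

```python
import re

# Matches "<|box_start|>(x,y)<|box_end|>". Whitespace around the numbers and
# decimal values are tolerated as an assumption, not part of the documented format.
_BOX_RE = re.compile(
    r"<\|box_start\|>\s*\(\s*(\d+(?:\.\d+)?)\s*,\s*(\d+(?:\.\d+)?)\s*\)\s*<\|box_end\|>"
)


def parse_click(text):
    """Return the first (x, y) pair in normalized [0, 1000] space, or None."""
    match = _BOX_RE.search(text)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


print(parse_click("<|box_start|>(512,87)<|box_end|>"))  # (512.0, 87.0)
print(parse_click("no coordinates here"))               # None
```

Returning `None` instead of raising lets a caller count malformed generations as misses during evaluation.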

Coordinates are normalized to the range [0, 1000] relative to the image size. To convert them to pixel coordinates:

pixel_x = x / 1000 * image_width
pixel_y = y / 1000 * image_height
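The same conversion in code, as a minimal sketch: `to_pixels` is a hypothetical helper name, and rounding to integer pixels and clamping to the image bounds are safeguards added here so the result is a valid click target. They are not part of the formula above.

```python
def to_pixels(x, y, image_width, image_height):
    """Map a point from [0, 1000] normalized space to integer pixel coordinates."""
    px = x / 1000 * image_width
    py = y / 1000 * image_height
    # Added safeguard: round to the nearest pixel and keep the point inside the image.
    px = min(max(round(px), 0), image_width - 1)
    py = min(max(round(py), 0), image_height - 1)
    return px, py


print(to_pixels(500, 250, 1920, 1080))    # (960, 270)
print(to_pixels(1000, 1000, 1920, 1080))  # (1919, 1079)
```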

Usage

Requires transformers>=5.2.0 and peft.

from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
import torch

# Load the base model in bfloat16, then attach the LoRA adapter on top of it.
base = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-4B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "mdabis/qwen35-gui-grounding_v2")

# The processor (tokenizer and image preprocessing) comes from the base model.
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-4B")

Version History

Version	ScreenSpot-V2
v1	92.5%
v2	93.4%

Access

Model weights are gated. Request access to download. Training configuration details are included with the model files.

Model Tree

Base model: Qwen/Qwen3.5-4B-Base
Fine-tuned: Qwen/Qwen3.5-4B
Adapter (this model): mdabis/qwen35-gui-grounding_v2