showlab/ShowUI-desktop
Viewer β’ Updated β’ 7.5k β’ 1.42k β’ 37
LoRA adapter for Qwen3.5-4B fine-tuned on GUI grounding: given a screenshot and a natural language instruction, predict the (x, y) click coordinate of the target UI element.
| Split | Correct | Total | Accuracy |
|---|---|---|---|
| Desktop | 320 | 334 | 95.8% |
| Mobile | 474 | 501 | 94.6% |
| Web | 394 | 437 | 90.2% |
| Overall | 1188 | 1272 | 93.4% |
~23.5K samples from 3 GUI grounding datasets covering desktop, web, and mobile platforms.
<|box_start|>(x,y)<|box_end|>
Coordinates are in [0, 1000] normalized space. To convert to pixel coordinates:
pixel_x = x / 1000 * image_width
pixel_y = y / 1000 * image_height
Requires transformers>=5.2.0 and peft.
from transformers import AutoProcessor, Qwen3_5ForConditionalGeneration
from peft import PeftModel
import torch
base = Qwen3_5ForConditionalGeneration.from_pretrained("Qwen/Qwen3.5-4B", torch_dtype=torch.bfloat16)
model = PeftModel.from_pretrained(base, "dabism23/qwen35-gui-grounding_v2")
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-4B")
| Version | ScreenSpot-V2 |
|---|---|
| v1 | 92.5% |
| v2 | 93.4% |
Model weights are gated. Request access to download. Training configuration details are included with the model files.