---
license: apache-2.0
base_model: inclusionAI/GUI-G2-3B
tags:
- gui-grounding
- screen-understanding
- vision-language-model
- icon-detection
- screenspot
- visual-search
language:
- en
library_name: transformers
pipeline_tag: image-to-text
datasets:
- HongxinLi/ScreenSpot_v2
metrics:
- accuracy
---
# GUI-G2-3B + CCF: Inference-time icon refinement for screen grounding
A drop-in inference wrapper that improves GUI-G2-3B's icon-grounding accuracy by **+2.2pp on ScreenSpot-v2** at zero training cost. The base weights are unchanged; everything is in the inference pipeline.
> **Try it live (Azure A100, scale-to-zero)**: <https://guigrounding.whiteplant-27564a0e.eastus.azurecontainerapps.io>
> Warm latency: **~250-400ms server time / ~700-900ms wall time** for fast mode (CCF), **~900ms server time / ~1.6s wall** for accurate mode (6-pass self-consistency with real agreement-based confidence). The playground also streams the coarse CCF prediction at ~600ms wall so the dot appears tentatively before the refined pass completes. Cold start ~90s the first time after idle.

*Real samples from ScreenSpot-v2. Each tile shows the same instruction predicted by GUI-G2-3B alone (red X) and GUI-G2-3B + CCF (green check, inside the ground-truth bbox). The full per-sample run backing these picks is in [`benchmarks/demo_candidates.jsonl`](https://github.com/LufeMC/gui-g2-3b-ccf/blob/main/benchmarks/demo_candidates.jsonl) on the GitHub repo, so the picks are verifiable.*
## What this is
[GUI-G2-3B](https://huggingface.co/inclusionAI/GUI-G2-3B) (89.2% on ScreenSpot-v2) is a strong open-source 3B grounding model. Its main weakness is on small icons, where it lands at 80.5% vs 96.0% on text. We add a single inference-time technique -- Cursor-Centric Focusing (CCF) -- that wraps the base model with a coarse-then-refined prediction loop:
1. Run the model once on the full screenshot to get a coarse `(x, y)` prediction
2. Crop a window around that point at 2x zoom (so small icons become big icons)
3. Run the model again on the crop, then map the refined prediction back to original coordinates
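The geometry of steps 2 and 3 can be sketched in a few lines. This is a minimal illustration, not the code in `cursor_ccf.py`: the helper names `crop_box_around` and `map_back` are hypothetical, and it assumes the window is `1/zoom` of each image dimension, is shifted to stay inside the screenshot, and is upscaled by `zoom` before the second pass.

```python
from typing import Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in original pixels


def crop_box_around(x: float, y: float, img_w: int, img_h: int,
                    zoom: float = 2.0) -> Box:
    """Window of size (img_w/zoom, img_h/zoom) centred on (x, y).

    Hypothetical helper: the window is shifted (not shrunk) when the
    coarse point is close to an edge, so it always stays inside the image.
    """
    win_w = int(round(img_w / zoom))
    win_h = int(round(img_h / zoom))
    left = int(round(x - win_w / 2))
    top = int(round(y - win_h / 2))
    left = max(0, min(left, img_w - win_w))
    top = max(0, min(top, img_h - win_h))
    return left, top, left + win_w, top + win_h


def map_back(crop_x: float, crop_y: float, box: Box,
             zoom: float = 2.0) -> Tuple[float, float]:
    """Map a point predicted on the upscaled crop back to original pixels."""
    left, top, _, _ = box
    return left + crop_x / zoom, top + crop_y / zoom


# Coarse prediction near the top-right corner of a 1920x1080 screenshot
box = crop_box_around(1800, 100, 1920, 1080)
refined = map_back(1700, 180, box)
```

With PIL, the crop itself would be `image.crop(box)` followed by a resize to `zoom` times the window size; the refined prediction on that upscaled crop is what `map_back` converts.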
CCF generalizes the technique from the [GUI-Cursor paper](https://arxiv.org/abs/2509.21552) to any bbox-style grounding model. We add three engineering details that make it work in production:
- **Greedy-only.** Earlier stochastic-sampling implementations regressed because temperature noise corrupted already-correct predictions. Both passes are `do_sample=False`.
- **Coarse downsizing.** The coarse pass runs at 1.5M pixels (vs 12.8M native), cutting wall time roughly 50% on 1920x1080 screenshots without measurable accuracy impact -- only the refined pass needs native resolution on the cropped region.
- **Type-aware gate (optional).** A short keyword classifier on the instruction skips the refinement pass when the target is obviously a text element (where refinement adds drift). Adds +1.2pp on mobile vs ungated CCF.
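To make the type-aware gate concrete, here is a rough sketch of what a keyword classifier of this kind might look like. The keyword lists and the names `looks_like_text_target` and `should_refine` are assumptions made for illustration; the actual `classify_instruction` in `cursor_ccf.py` may use different keywords and a different return type.

```python
# Hypothetical keyword lists, chosen only for illustration.
ICON_HINTS = ("icon", "logo", "button", "arrow", "toggle", "checkbox")
TEXT_HINTS = ("text", "label", "link", "field", "title", "heading")


def looks_like_text_target(instruction: str) -> bool:
    """True when the instruction obviously refers to a text element."""
    lowered = instruction.lower()
    # Icon-like words win: refinement is where icons gain the most.
    if any(hint in lowered for hint in ICON_HINTS):
        return False
    return any(hint in lowered for hint in TEXT_HINTS)


def should_refine(instruction: str) -> bool:
    """Run the second (zoomed) pass unless the target is obviously text."""
    return not looks_like_text_target(instruction)
```

Ambiguous instructions fall through to refinement in this sketch, so the gate only skips the second pass when it has positive evidence of a text target.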
## Benchmark
ScreenSpot-v2, full set (1,272 samples), greedy decoding, MAX_PIXELS=12,845,056, H200 GPU with flash-attn 2.
| Configuration | Overall | Desktop | Mobile | Web | Icon | Text |
|---|---|---|---|---|---|---|
| GUI-G2-3B (baseline) | 89.2% | 91.3% | 88.0% | 84.2% | 80.5% | 96.0% |
| **GUI-G2-3B + CCF (this)** | 88.9% | 91.3% | 88.0% | **88.1%** | **82.7%** | 93.7% |
| GUI-G2-3B + CCF + type-gate | 88.9% | 91.0% | **89.2%** | 87.0% | 82.7% | 93.7% |
Headline numbers vs the unmodified base:
- **Icon: +2.2pp** (80.5% -> 82.7%) -- icons are GUI-G2-3B's hardest split, and the one customers most need help on
- **Web: +3.9pp** (84.2% -> 88.1%) -- web pages have the highest density of small clickable elements
- Text: -2.3pp (96.0% -> 93.7%) -- the cost of applying CCF to every sample; the optional `--type-gate` flag targets this drift, although in the table above the text split stays at 93.7% with the gate on
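For readers who want to reproduce the deltas, the sketch below shows how a click-accuracy number and a percentage-point difference can be computed. It assumes the usual point-in-box criterion (a prediction counts as a hit when the click point falls inside the ground-truth box) and boxes given as `(x1, y1, x2, y2)`; the function names are illustrative and not taken from this repo's evaluation script.

```python
def is_hit(point, bbox):
    """True when the predicted click point lies inside the ground-truth box."""
    x, y = point
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def accuracy(predictions, bboxes):
    """Fraction of samples whose prediction is a hit; None counts as a miss."""
    hits = sum(p is not None and is_hit(p, b)
               for p, b in zip(predictions, bboxes))
    return hits / len(bboxes)


def pp_delta(new_pct, base_pct):
    """Difference between two percentages, in percentage points."""
    return round(new_pct - base_pct, 1)


icon_delta = pp_delta(82.7, 80.5)   # icon split, CCF vs baseline
web_delta = pp_delta(88.1, 84.2)    # web split
text_delta = pp_delta(93.7, 96.0)   # text split
```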
For comparison with other published 3B models on ScreenSpot-v2:
| Model | Overall | Notes |
|---|---|---|
| Jedi-3B | 88.6% | |
| UI-R1-E-3B | 89.5% | |
| GUI-G2-3B (our base) | 89.2% | |
| **GUI-G2-3B + CCF (this)** | **88.9%** | +2.2pp on icons; inference-time only, no extra training |
| GUI-Actor-3B | 91.0% | (closed) |
## Quickstart
```python
import re

import torch
from PIL import Image
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from cursor_ccf import CCFConfig, ccf_predict_bbox, classify_instruction

# Load the base model exactly as you would normally
model_id = "inclusionAI/GUI-G2-3B"
processor = AutoProcessor.from_pretrained(
    model_id, min_pixels=3136, max_pixels=12_845_056,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
model.eval()


def predict_gui_g2(image, instruction):
    """Single forward pass returning ((cx, cy), raw_text). The prompt
    matches GUI-G2's training format exactly so output coords are in
    the processor's smart_resize space; we rescale to the original image."""
    prompt = (
        "Outline the position corresponding to the instruction: {}. "
        "The output should be only [x1,y1,x2,y2]."
    ).format(instruction)
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True,
    )
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, padding=True, return_tensors="pt",
    ).to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=32, do_sample=False)
    # Decode only the newly generated tokens, not the prompt
    response = processor.batch_decode(
        [output[0][inputs.input_ids.shape[1]:]],
        skip_special_tokens=True,
    )[0]
    m = re.search(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", response)
    if not m:
        return (None, None), response
    x1, y1, x2, y2 = map(int, m.groups())
    abs_cx, abs_cy = (x1 + x2) / 2, (y1 + y2) / 2
    # Rescale from processed-pixel space back to original-image pixels
    proc_w = inputs["image_grid_thw"][0][2].item() * 14
    proc_h = inputs["image_grid_thw"][0][1].item() * 14
    orig_w, orig_h = image.size
    return (abs_cx * orig_w / proc_w, abs_cy * orig_h / proc_h), response


# Plug into CCF
def predict_with_ccf(image, instruction, type_gate=True):
    cfg = CCFConfig(
        zoom_factor=2.0,
        coarse_max_pixels=1_500_000,
        instruction_classifier_fn=classify_instruction if type_gate else None,
    )

    def inner(img, instr):
        (x, y), raw = predict_gui_g2(img, instr)
        return (x, y) if x is not None else None, raw

    result = ccf_predict_bbox(inner, image, instruction, cfg)
    if result is None:
        return None
    return (result.x, result.y), result.stage


# Run it
image = Image.open("screenshot.png").convert("RGB")
prediction = predict_with_ccf(image, "click the settings icon")
if prediction is None:
    # predict_with_ccf returns None when CCF yields no result
    print("No prediction returned")
else:
    (x, y), stage = prediction
    print(f"Click at ({x:.0f}, {y:.0f}) [stage={stage}]")
```
See [`predict.py`](predict.py) in this repo for a complete runnable example.
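The rescale step at the end of `predict_gui_g2` is easy to get wrong, so here is a worked example with plain numbers. It is a sketch under assumptions: the function name `rescale_to_original` is illustrative, and the 78x138 patch grid (a 1932x1092 processed image) is an assumed value for a 1920x1080 screenshot; the real grid comes from `inputs["image_grid_thw"]` and depends on the processor settings.

```python
PATCH = 14  # patch size used by the Quickstart's rescale step


def rescale_to_original(px, py, grid_h, grid_w, orig_w, orig_h, patch=PATCH):
    """Convert a point from processed-image pixels to original-image pixels."""
    proc_w = grid_w * patch  # width of the image the model actually saw
    proc_h = grid_h * patch  # height of the image the model actually saw
    return px * orig_w / proc_w, py * orig_h / proc_h


# Assumed grid of 78 x 138 patches -> processed size 1932 x 1092.
# The centre of the processed image maps to the centre of the original.
center = rescale_to_original(966, 546, grid_h=78, grid_w=138,
                             orig_w=1920, orig_h=1080)
```

Skipping this step leaves predictions a few pixels off on large screenshots, which is enough to miss a small icon.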
## When to use which configuration
- **Plain CCF (`type_gate=False`)** -- best for icon-heavy workloads (mobile app screenshots, dense web UIs). Maximum icon recall.
- **CCF + type-gate (`type_gate=True`)** -- best for mixed text/icon workloads. Gains 1.2pp on mobile over ungated CCF (88.0% -> 89.2%) at the cost of slightly lower web (88.1% -> 87.0%) and desktop (91.3% -> 91.0%). Recommended default.
- **No CCF (just the base model)** -- best for latency-critical paths where the +2.2pp icon win isn't worth a 2x inference cost.
## Latency
CCF doubles inference time per sample (two forward passes). The `coarse_max_pixels=1500000` setting brings the cost back closer to 1.3-1.5x baseline rather than 2x. On an H200 with flash-attn:
| Setup | Per-sample time (1920x1080 image) |
|---|---|
| Base model | ~3-5s |
| Base + CCF (full-res coarse) | ~8-15s |
| Base + CCF (coarse downsize, recommended) | ~5-9s |
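As a rough illustration of what the coarse downsize does to image dimensions, the sketch below scales a screenshot so its pixel count fits under a budget while keeping the aspect ratio. The function name `coarse_size` and the floor-rounding are assumptions; the wrapper in `cursor_ccf.py` and the processor's own resizing may round differently.

```python
import math


def coarse_size(width, height, max_pixels=1_500_000):
    """Largest (w, h) with the same aspect ratio and w * h <= max_pixels."""
    if width * height <= max_pixels:
        return width, height  # already within budget, leave untouched
    scale = math.sqrt(max_pixels / (width * height))
    # Floor both sides so the product can never exceed the budget
    return max(1, int(width * scale)), max(1, int(height * scale))


w, h = coarse_size(1920, 1080)
```

Only the coarse pass uses the reduced size; the refined pass still sees the crop at native resolution.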
## Files in this repo
| File | Purpose |
|---|---|
| `cursor_ccf.py` | Core CCF logic + the type-aware classifier. Pure Python + PIL, no torch dependency for the math. |
| `predict.py` | Self-contained runnable example: loads GUI-G2-3B, applies CCF, prints predictions. |
| `requirements.txt` | Pinned dependency versions known to work with the model. |
## Methodology notes (the engineering, not just the math)
CCF, the Phase 4 result in this repo, is **the only technique of our own from a 9-phase project that improved the hard splits while staying within 0.3pp of the GUI-G2-3B base overall** (88.9% vs 89.2%). Across that project we also tried:
- Multi-step cursor-movement RL (GUI-Cursor paper replication): -15pp at 3B scale
- Bbox SFT on 6K mixed-source samples: -7pp (catastrophic forgetting)
- 7B teacher distillation (GUI-G2-7B -> 3B): -1.5pp overall, +4.6pp web, +2.2pp icon, -4.5pp text
The pattern across all training experiments was the same: the hard splits (icon, web) improved at the cost of the easy splits (text), and overall accuracy never beat the base. The lesson we kept: **at 3B + a few-thousand-sample fine-tuning budget, GUI-G2-3B is near its achievable optimum.** Inference-time wrappers like CCF that don't touch the weights win the hard splits while paying a much smaller tax elsewhere (-0.3pp overall for CCF, vs -1.5pp to -15pp for the training runs).
Full project writeup with per-experiment numbers: see [benchmarks/results.md](https://github.com/LufeMC/gui-g2-3b-ccf/blob/main/benchmarks/results.md) in the [GitHub repo](https://github.com/LufeMC/gui-g2-3b-ccf).
## Citation
```bibtex
@misc{guig2_3b_ccf,
title = {GUI-G2-3B + CCF: inference-time icon refinement for screen grounding},
author = {Moncer, Luis F.},
year = {2026},
note = {Inference-time wrapper around inclusionAI/GUI-G2-3B; technique generalized from arXiv:2509.21552 (GUI-Cursor)}
}
@misc{guig2,
title = {GUI-G2-3B},
author = {inclusionAI},
year = {2025},
url = {https://huggingface.co/inclusionAI/GUI-G2-3B}
}
@misc{guicursor,
title = {GUI-Cursor: Cursor-Centric Focusing for GUI Grounding via Multi-Step RL},
year = {2025},
eprint = {2509.21552},
archiveprefix = {arXiv}
}
```
## License
Apache 2.0 (matches the base [GUI-G2-3B license](https://huggingface.co/inclusionAI/GUI-G2-3B)).