---
license: apache-2.0
base_model: inclusionAI/GUI-G2-3B
tags:
- gui-grounding
- screen-understanding
- vision-language-model
- icon-detection
- screenspot
- visual-search
language:
- en
library_name: transformers
pipeline_tag: image-to-text
datasets:
- HongxinLi/ScreenSpot_v2
metrics:
- accuracy
---

# GUI-G2-3B + CCF: Inference-time icon refinement for screen grounding

A drop-in inference wrapper that improves GUI-G2-3B's icon-grounding accuracy by **+2.2pp on ScreenSpot-v2** at zero training cost. The base weights are unchanged; everything is in the inference pipeline.

> **Try it live (Azure A100, scale-to-zero)**:
> Warm latency: **~250-400ms server time / ~700-900ms wall time** for fast mode (CCF), **~900ms server time / ~1.6s wall** for accurate mode (6-pass self-consistency with real agreement-based confidence). The playground also streams the coarse CCF prediction at ~600ms wall so the dot appears tentatively before the refined pass completes. Cold start ~90s the first time after idle.

![Side-by-side: 4 ScreenSpot-v2 icons where GUI-G2-3B baseline misses (red X) and CCF hits (green check). Blue boxes are ground truth.](comparison.png)

*Real samples from ScreenSpot-v2. Each tile shows the same instruction predicted by GUI-G2-3B alone (red X) and GUI-G2-3B + CCF (green check, inside the ground-truth bbox). The full per-sample run backing these picks is in [`benchmarks/demo_candidates.jsonl`](https://github.com/LufeMC/gui-g2-3b-ccf/blob/main/benchmarks/demo_candidates.jsonl) on the GitHub repo, so the picks are verifiable.*

## What this is

[GUI-G2-3B](https://huggingface.co/inclusionAI/GUI-G2-3B) (89.2% on ScreenSpot-v2) is a strong open-source 3B grounding model. Its main weakness is on small icons, where it lands at 80.5% vs 96.0% on text. We add a single inference-time technique -- Cursor-Centric Focusing (CCF) -- that wraps the base model with a coarse-then-refined prediction loop:

1. Run the model once on the full screenshot to get a coarse `(x, y)` prediction
2. Crop a window around that point at 2x zoom (so small icons become big icons)
3. Run the model again on the crop, then map the refined prediction back to original coordinates

CCF generalizes the technique from the [GUI-Cursor paper](https://arxiv.org/abs/2509.21552) to any bbox-style grounding model. We add three engineering details that make it work in production:

- **Greedy-only.** Earlier stochastic-sampling implementations regressed because temperature noise corrupted already-correct predictions. Both passes are `do_sample=False`.
- **Coarse downsizing.** The coarse pass runs at 1.5M pixels (vs 12.8M native), cutting wall time roughly 50% on 1920x1080 screenshots without measurable accuracy impact -- only the refined pass needs native resolution on the cropped region.
- **Type-aware gate (optional).** A short keyword classifier on the instruction skips the refinement pass when the target is obviously a text element (where refinement adds drift). Adds +1.2pp on mobile vs ungated CCF.

## Benchmark

ScreenSpot-v2, full set (1,272 samples), greedy decoding, MAX_PIXELS=12,845,056, H200 GPU with flash-attn 2.
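
For readers reproducing the numbers, the sketch below shows how a point-in-box accuracy of the kind reported in the table that follows can be computed. It assumes the usual ScreenSpot convention that a prediction counts as correct when the predicted click point falls inside the ground-truth box, and it assumes boxes are given as `(x1, y1, x2, y2)` in original-image pixels. The helper names `is_hit` and `accuracy` are hypothetical and are not taken from this repo's evaluation code.

```python
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]
# Assumed box format: (x1, y1, x2, y2) in original-image pixels.
Box = Tuple[float, float, float, float]


def is_hit(point: Optional[Point], box: Box) -> bool:
    """Return True when the predicted click lands inside the ground-truth box."""
    if point is None:  # a failed parse is counted as a miss
        return False
    x, y = point
    x1, y1, x2, y2 = box
    return x1 <= x <= x2 and y1 <= y <= y2


def accuracy(points: Sequence[Optional[Point]], boxes: Sequence[Box]) -> float:
    """Fraction of samples whose predicted point falls inside its box."""
    if len(points) != len(boxes):
        raise ValueError("points and boxes must have the same length")
    if not boxes:
        return 0.0
    return sum(is_hit(p, b) for p, b in zip(points, boxes)) / len(boxes)
```

Per-split figures (icon vs text, desktop vs mobile vs web) would come from calling `accuracy` on the matching subset of samples.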

| Configuration | Overall | Desktop | Mobile | Web | Icon | Text |
|---|---|---|---|---|---|---|
| GUI-G2-3B (baseline) | 89.2% | 91.3% | 88.0% | 84.2% | 80.5% | 96.0% |
| **GUI-G2-3B + CCF (this)** | 88.9% | 91.3% | 88.0% | **88.1%** | **82.7%** | 93.7% |
| GUI-G2-3B + CCF + type-gate | 88.9% | 91.0% | **89.2%** | 87.0% | 82.7% | 93.7% |

Headline numbers vs the unmodified base:

- **Icon: +2.2pp** (80.5% -> 82.7%) -- icons are GUI-G2-3B's hardest split, and the one customers most need help on
- **Web: +3.9pp** (84.2% -> 88.1%) -- web pages have the highest density of small clickable elements
- Text: -2.3pp (96.0% -> 93.7%) -- the cost of universal CCF; mitigated by the optional `--type-gate` flag
- Overall: -0.3pp (89.2% -> 88.9%) -- in aggregate, the text regression slightly outweighs the icon gain

For comparison with other published 3B models on ScreenSpot-v2:

| Model | Overall | Notes |
|---|---|---|
| Jedi-3B | 88.6% | |
| UI-R1-E-3B | 89.5% | |
| GUI-G2-3B (our base) | 89.2% | |
| **GUI-G2-3B + CCF (this)** | **88.9%** with **+2.2pp on icons** | Inference-time only; no extra training |
| GUI-Actor-3B | 91.0% | (closed) |

## Quickstart

```python
import re

import torch
from PIL import Image
from qwen_vl_utils import process_vision_info
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration

from cursor_ccf import CCFConfig, ccf_predict_bbox, classify_instruction

# Load the base model exactly as you would normally
model_id = "inclusionAI/GUI-G2-3B"
processor = AutoProcessor.from_pretrained(
    model_id,
    min_pixels=3136,
    max_pixels=12_845_056,
)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",
    device_map="auto",
)
model.eval()


def predict_gui_g2(image, instruction):
    """Single forward pass returning ((cx, cy), raw_text).

    The prompt matches GUI-G2's training format exactly so output coords
    are in the processor's smart_resize space; we rescale to the original
    image.
    """
    prompt = (
        "Outline the position corresponding to the instruction: {}. "
        "The output should be only [x1,y1,x2,y2]."
    ).format(instruction)
    messages = [{"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": prompt},
    ]}]
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    image_inputs, _ = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        padding=True,
        return_tensors="pt",
    ).to(model.device)

    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=32, do_sample=False)

    # Decode only the newly generated tokens, not the prompt
    response = processor.batch_decode(
        [output[0][inputs.input_ids.shape[1]:]],
        skip_special_tokens=True,
    )[0]

    m = re.search(r"\[(\d+),\s*(\d+),\s*(\d+),\s*(\d+)\]", response)
    if not m:
        return (None, None), response
    x1, y1, x2, y2 = map(int, m.groups())
    abs_cx, abs_cy = (x1 + x2) / 2, (y1 + y2) / 2

    # Rescale from processed-pixel space back to original-image pixels
    # (image_grid_thw is (t, h, w) in 14-pixel patches)
    proc_w = inputs["image_grid_thw"][0][2].item() * 14
    proc_h = inputs["image_grid_thw"][0][1].item() * 14
    orig_w, orig_h = image.size
    return (abs_cx * orig_w / proc_w, abs_cy * orig_h / proc_h), response


# Plug into CCF
def predict_with_ccf(image, instruction, type_gate=True):
    cfg = CCFConfig(
        zoom_factor=2.0,
        coarse_max_pixels=1_500_000,
        instruction_classifier_fn=classify_instruction if type_gate else None,
    )

    def inner(img, instr):
        (x, y), raw = predict_gui_g2(img, instr)
        return (x, y) if x is not None else None, raw

    result = ccf_predict_bbox(inner, image, instruction, cfg)
    if result is None:
        return None
    return (result.x, result.y), result.stage


# Run it
image = Image.open("screenshot.png").convert("RGB")
prediction = predict_with_ccf(image, "click the settings icon")
if prediction is None:
    # predict_with_ccf returns None when CCF produces no result
    print("No prediction")
else:
    (x, y), stage = prediction
    print(f"Click at ({x:.0f}, {y:.0f}) [stage={stage}]")
```

See [`predict.py`](predict.py) in this repo for a complete runnable example.

## When to use which configuration

- **Plain CCF (`type_gate=False`)** -- best for icon-heavy workloads (mobile app screenshots, dense web UIs). Maximum icon recall.
- **CCF + type-gate (`type_gate=True`)** -- best for mixed text/icon workloads. Gains 1.2pp on mobile (88.0% -> 89.2%) at the cost of slightly lower web (88.1% -> 87.0%). Recommended default.
- **No CCF (just the base model)** -- best for latency-critical paths where the +2.2pp icon win isn't worth a 2x inference cost.

## Latency

CCF adds a second forward pass, which roughly doubles inference time per sample. The `coarse_max_pixels=1500000` setting brings the cost back closer to 1.3-1.5x baseline rather than 2x. On an H200 with flash-attn:

| Setup | Per-sample time (1920x1080 image) |
|---|---|
| Base model | ~3-5s |
| Base + CCF (full-res coarse) | ~8-15s |
| Base + CCF (coarse downsize, recommended) | ~5-9s |

## Files in this repo

| File | Purpose |
|---|---|
| `cursor_ccf.py` | Core CCF logic + the type-aware classifier. Pure Python + PIL, no torch dependency for the math. |
| `predict.py` | Self-contained runnable example: loads GUI-G2-3B, applies CCF, prints predictions. |
| `requirements.txt` | Pinned dependency versions known to work with the model. |

## Methodology notes (the engineering, not just the math)

The Phase 4 result in this repo is **the only "ours" finding from a 9-phase project that improved GUI-G2-3B's hard splits at a near-zero overall cost** (-0.3pp overall, vs -1.5pp for the best training run). Across that project we also tried:

- Multi-step cursor-movement RL (GUI-Cursor paper replication): -15pp at 3B scale
- Bbox SFT on 6K mixed-source samples: -7pp (catastrophic forgetting)
- 7B teacher distillation (GUI-G2-7B -> 3B): -1.5pp overall, +4.6pp web, +2.2pp icon, -4.5pp text

The pattern across all training experiments was the same: the hard splits (icon, web) improved at the cost of the easy splits (text), and overall accuracy never beat the base. The lesson we kept: **at 3B + a few-thousand-sample fine-tuning budget, GUI-G2-3B is near its achievable optimum.** Inference-time wraps like CCF that don't touch the weights win the hard splits while paying a smaller easy-split tax (text -2.3pp for CCF vs -4.5pp for distillation).
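
To make concrete what an inference-time wrap like CCF does without touching the weights, here is a minimal sketch of the crop-and-remap geometry behind the coarse-then-refined loop. It is not the code in `cursor_ccf.py`: the function names `crop_window` and `map_back` are hypothetical, and it assumes that "2x zoom" means cropping a window of `width / zoom` by `height / zoom` around the coarse point, shifted to stay inside the image, which the model then sees at a larger scale.

```python
from typing import Tuple

Window = Tuple[float, float, float, float]  # (left, top, right, bottom)


def crop_window(cx: float, cy: float, img_w: float, img_h: float,
                zoom: float = 2.0) -> Window:
    """Window of size (img_w / zoom, img_h / zoom) around the coarse point.

    Assumption: the window is shifted, not shrunk, when the coarse point
    sits near an image border.
    """
    if zoom < 1.0:
        raise ValueError("zoom must be >= 1.0")
    win_w, win_h = img_w / zoom, img_h / zoom
    # Center on the coarse point, then clamp so the window stays in-bounds.
    left = min(max(cx - win_w / 2, 0.0), img_w - win_w)
    top = min(max(cy - win_h / 2, 0.0), img_h - win_h)
    return (left, top, left + win_w, top + win_h)


def map_back(rx: float, ry: float, window: Window,
             crop_w: float, crop_h: float) -> Tuple[float, float]:
    """Map a point predicted on the (possibly resized) crop to original pixels.

    (crop_w, crop_h) is the pixel size of the crop as the refined prediction
    was expressed, for example after upscaling the window by the zoom factor.
    """
    left, top, right, bottom = window
    return (left + rx * (right - left) / crop_w,
            top + ry * (bottom - top) / crop_h)
```

For a 1920x1080 screenshot with a coarse point at `(1900, 20)`, `crop_window` yields the top-right quadrant `(960.0, 0.0, 1920.0, 540.0)`; a refined prediction at the center of that crop upscaled to 1920x1080, `(960, 540)`, maps back to `(1440.0, 270.0)` in the original image.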

Full project writeup with per-experiment numbers: see [benchmarks/results.md](https://github.com/LufeMC/gui-g2-3b-ccf/blob/main/benchmarks/results.md) in the [GitHub repo](https://github.com/LufeMC/gui-g2-3b-ccf).

## Citation

```bibtex
@misc{guig2_3b_ccf,
  title  = {GUI-G2-3B + CCF: inference-time icon refinement for screen grounding},
  author = {Moncer, Luis F.},
  year   = {2026},
  note   = {Inference-time wrapper around inclusionAI/GUI-G2-3B; technique generalized from arXiv:2509.21552 (GUI-Cursor)}
}

@misc{guig2,
  title  = {GUI-G2-3B},
  author = {inclusionAI},
  year   = {2025},
  url    = {https://huggingface.co/inclusionAI/GUI-G2-3B}
}

@misc{guicursor,
  title         = {GUI-Cursor: Cursor-Centric Focusing for GUI Grounding via Multi-Step RL},
  year          = {2025},
  eprint        = {2509.21552},
  archiveprefix = {arXiv}
}
```

## License

Apache 2.0 (matches the base [GUI-G2-3B license](https://huggingface.co/inclusionAI/GUI-G2-3B)).