Mano-CUA-4B-Thinking-1.1-MLX-8bit

Mano-CUA is the Computer Use Agent model in the Mano open-source model series. It is a GUI-VLA (Visual Language Agent) model designed for edge devices that autonomously completes complex desktop GUI operations through visual understanding.

This is the MLX 8-bit quantized version, optimized for Apple Silicon (Mac mini / MacBook). For the full-precision fp16 version, see Mano-CUA-4B-Thinking-1.1.

Main Capabilities

  • Complex GUI Automation: Autonomously operates complex interfaces containing hundreds of interactive elements
  • Cross-System Data Integration: Extracts and integrates data from multiple sources through purely visual interaction, with no API access required
  • Long-Task Planning and Execution: Supports enterprise business process automation spanning dozens to hundreds of steps
  • Intelligent Report Generation: Automatically generates structured documents such as data analysis reports and work summaries

Technical Background

Mano-CUA builds on the technical framework of the Mano project (see the Mano Technical Report). It combines the Mano-Action bidirectional self-reinforcement learning method, three-stage progressive training (SFT → offline reinforcement learning → online reinforcement learning), a "think-act-verify" reasoning loop, and a closed-loop data circulation system to achieve high-precision GUI understanding and operation. The edge version is optimized through mixed-precision quantization, visual token pruning, and edge inference adaptation, enabling models with large parameter counts to run efficiently on edge devices such as the Mac mini, MacBook, and computing sticks.

Quick Start

Requirements

  • macOS with Apple Silicon (M1+)
  • Python >= 3.12
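These two requirements can be checked from Python before installing anything. The sketch below is only an illustration: the helper name `meets_requirements` is hypothetical, and it assumes that `platform.machine()` reports `arm64` on Apple Silicon, which is not the case for an x86_64 Python build running under Rosetta.

```python
import platform
import sys


def meets_requirements(system, machine, python_version):
    """Return True for macOS on Apple Silicon with Python >= 3.12.

    Hypothetical helper: the arguments mirror platform.system(),
    platform.machine(), and sys.version_info so the check can be tested
    on any host.
    """
    is_macos = system == "Darwin"
    is_apple_silicon = machine == "arm64"
    is_recent_python = tuple(python_version[:2]) >= (3, 12)
    return is_macos and is_apple_silicon and is_recent_python


if __name__ == "__main__":
    ok = meets_requirements(platform.system(), platform.machine(), sys.version_info)
    print("Requirements met" if ok else "Requirements not met")
```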

Installation

With Cider (recommended, includes W8A8 acceleration on M5+):

pip install mlx-vlm
pip install git+https://github.com/Mininglamp-AI/cider.git

Without Cider:

pip install mlx-vlm

Single-Step Demo

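import mlx_vlm as pm
# custom_generate comes from a vlm_service module that this README does not otherwise describe;
# it must be importable (for example, on PYTHONPATH) for the demo to run.
from vlm_service import custom_generate
from PIL import Image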

# 1. Load model
model, processor = pm.load("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1-MLX-8bit")

# 2. Load a screenshot
img = Image.open("screenshot.png")
ratio = 1280 / img.width
img = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)

# 3. Build prompt
task = "Click the search bar and type hello"

prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

## Output Format
<action>action</action>

## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
type(content='') # type the content.
hotkey(key='') # Trigger a keyboard shortcut.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
wait(duration='') # Sleep for specified duration (in seconds).
finish() # The task is completed.
stop(reason='') # If the item can not found in the image, give the reason

## User Instruction
{task}"""

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt_text},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
prompt = prompt.replace("<image>", "<|vision_start|><|image_pad|><|vision_end|>")

# 4. Run inference
result = custom_generate(
    model, processor, prompt,
    [img],
    max_tokens=512,
    temperature=0.0,
    prefill_step_size=2048,
)

print(f"Tokens: {result.generation_tokens}, Speed: {result.generation_tps:.1f} tok/s")
print(result.text)

Output Format

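The model outputs three XML-style tags: its reasoning (<think>), a short description of the step (<action_desp>), and the action call itself (<action>):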

<think>The search bar is at the top of the page...</think>
<action_desp>Click the search bar to focus it</action_desp>
<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>
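Downstream code usually needs the three fields separately. The following is a minimal sketch of one way to split them, assuming the tags always appear in the form shown above; `parse_response` is a hypothetical helper, not part of mlx-vlm or Cider. It uses regular expressions rather than an XML parser because tokens such as `<|box_start|>` are not well-formed XML.

```python
import re


def parse_response(text):
    """Extract the think / action_desp / action fields from a model response.

    Missing tags are returned as None so the caller can decide how to react.
    """
    fields = {}
    for tag in ("think", "action_desp", "action"):
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields


sample = (
    "<think>The search bar is at the top of the page...</think>\n"
    "<action_desp>Click the search bar to focus it</action_desp>\n"
    "<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>"
)
parsed = parse_response(sample)
print(parsed["action"])
```

In the single-step demo, `result.text` would be passed in place of `sample`.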

Coordinates are normalized to the [0, 1000] range. To convert them to pixel coordinates:

pixel_x = int(x / 1000 * screen_width)
pixel_y = int(y / 1000 * screen_height)
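As a sketch of how this conversion might be applied to a full action string, the hypothetical helper below finds every `<|box_start|>(x,y)<|box_end|>` point and scales it with the formula above. The function name and the regular expression are assumptions for illustration. A value of 1000 maps to `screen_width` or `screen_height`, which is one past the last pixel, so a caller may want to clamp the result.

```python
import re

# Matches one normalized point such as <|box_start|>(500,38)<|box_end|>.
BOX_PATTERN = re.compile(r"<\|box_start\|>\((\d+),\s*(\d+)\)<\|box_end\|>")


def box_to_pixels(action, screen_width, screen_height):
    """Return every point in the action as (pixel_x, pixel_y), in order of appearance."""
    points = []
    for x, y in BOX_PATTERN.findall(action):
        pixel_x = int(int(x) / 1000 * screen_width)
        pixel_y = int(int(y) / 1000 * screen_height)
        points.append((pixel_x, pixel_y))
    return points


click = "click(start_box='<|box_start|>(500,38)<|box_end|>')"
print(box_to_pixels(click, 1920, 1080))  # [(960, 41)]
```

A `drag` action contains two points, so the returned list has two entries (start, then end).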

W8A8 Acceleration (M5+ only)

On Apple M5 or later, enable INT8 acceleration for ~15-19% faster prefill:

from cider import convert_model, is_available

if is_available():
    convert_model(model.language_model)

Full Action Space

Action        Syntax                                                                                        Description
open_app      open_app(app_name='')                                                                         Open an application
open_url      open_url(url='')                                                                              Open a URL
click         click(start_box='<|box_start|>(x,y)<|box_end|>')                                              Left click
doubleclick   doubleclick(start_box='<|box_start|>(x,y)<|box_end|>')                                        Double click
triple_click  triple_click(start_box='<|box_start|>(x,y)<|box_end|>')                                       Triple click (select line)
right_single  right_single(start_box='<|box_start|>(x,y)<|box_end|>')                                       Right click
hover         hover(start_box='<|box_start|>(x,y)<|box_end|>')                                              Mouse hover
type          type(content='text')                                                                          Type text
hotkey        hotkey(key='cmd+c')                                                                           Keyboard shortcut
hotkey_click  hotkey_click(start_box='<|box_start|>(x,y)<|box_end|>', key='shift')                          Modifier + click
scroll        scroll(start_box='<|box_start|>(x,y)<|box_end|>', direction='down', amount='3')               Scroll
drag          drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x2,y2)<|box_end|>')  Drag and drop
wait          wait(duration='2')                                                                            Wait (seconds)
finish        finish()                                                                                      Task completed
stop          stop(reason='...')                                                                            Task infeasible
call_user     call_user()                                                                                   Request human help
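Before an action is executed, it can help to split the call into its name and keyword arguments and to reject anything outside the action space. The sketch below is one possible approach, not the project's own parser: `parse_action`, `KNOWN_ACTIONS`, and `HANDOFF_ACTIONS` are hypothetical names, and the code assumes that every argument is a single-quoted string, as in the syntax listed above. Treating `finish`, `stop`, and `call_user` as the points where an agent loop stops or hands control back is likewise an assumption.

```python
import re

# Action names taken from the action space listed above.
KNOWN_ACTIONS = {
    "open_app", "open_url", "click", "doubleclick", "triple_click",
    "right_single", "hover", "type", "hotkey", "hotkey_click",
    "scroll", "drag", "wait", "finish", "stop", "call_user",
}
# Assumption: after these actions a caller would stop or ask the user.
HANDOFF_ACTIONS = {"finish", "stop", "call_user"}

CALL_PATTERN = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)
# key='value' pairs; backslash-escaped characters are allowed inside the value.
ARG_PATTERN = re.compile(r"(\w+)\s*=\s*'((?:[^'\\]|\\.)*)'")


def parse_action(call):
    """Split an action call string into (name, kwargs), validating the name."""
    match = CALL_PATTERN.match(call)
    if match is None:
        raise ValueError(f"not an action call: {call!r}")
    name, arg_text = match.groups()
    if name not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    return name, dict(ARG_PATTERN.findall(arg_text))


name, args = parse_action(
    "scroll(start_box='<|box_start|>(500,500)<|box_end|>', direction='down', amount='3')"
)
print(name, args["direction"], args["amount"])
print(parse_action("finish()")[0] in HANDOFF_ACTIONS)
```

Argument values stay as strings, so numeric fields such as `amount` and `duration` still need converting by the caller.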

Other Versions

Version            Repo                                 Description
fp16               Mano-CUA-4B-Thinking-1.1             Full precision, for archival / re-quantization / GPU inference
MLX-8bit (this)    Mano-CUA-4B-Thinking-1.1-MLX-8bit    MLX 8-bit quantized, recommended for Apple Silicon local inference

Contact
