Mano-CUA-4B-Thinking-1.1-MLX-8bit

Mano-CUA is the Computer Use Agent (CUA) model in the Mano open-source model series. It is a GUI-VLA (Visual Language Agent) model designed specifically for edge devices, and it completes complex desktop GUI operations autonomously through visual understanding.

This is the MLX 8-bit quantized version, optimized for Apple Silicon (Mac mini / MacBook). For the full-precision fp16 version, see Mano-CUA-4B-Thinking-1.1.

Main Capabilities

Complex GUI Automation: Autonomously complete complex interface operations containing hundreds of interactive elements
Cross-System Data Integration: Extract and integrate multi-source data through pure visual interaction, without requiring APIs
Long-Task Planning Execution: Support enterprise-level business process automation spanning dozens to hundreds of steps
Intelligent Report Generation: Automatically generate structured documents such as data analysis reports and work summaries

Technical Background

Mano-CUA builds on the complete technical framework of the Mano project (see the Mano Technical Report). It combines the Mano-Action bidirectional self-reinforcement learning method, three-stage progressive training (SFT → offline reinforcement learning → online reinforcement learning), a "think-act-verify" loop reasoning mechanism, and a closed-loop data circulation system to achieve high-precision GUI understanding and operation. The edge version is optimized through mixed-precision quantization, visual token pruning, and edge inference adaptation, so that models with large parameter counts can run efficiently on edge devices such as Mac mini, MacBook, and computing sticks.

Quick Start

Requirements

macOS with Apple Silicon (M1+)
Python >= 3.12
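
The two requirements above can be checked before installing anything. The following is a minimal sketch using only the standard library; `check_requirements` is a hypothetical helper name, and it assumes a native (non-Rosetta) Python, where `platform.machine()` reports `arm64` on Apple Silicon.

```python
import platform
import sys


def check_requirements(system: str, machine: str, version: tuple) -> list:
    """Return a list of human-readable problems; an empty list means the host looks fine."""
    problems = []
    if system != "Darwin":
        problems.append(f"macOS required, found {system}")
    if machine != "arm64":
        # A Python running under Rosetta reports x86_64 even on Apple Silicon.
        problems.append(f"Apple Silicon (arm64) required, found {machine}")
    if tuple(version[:2]) < (3, 12):
        problems.append(f"Python >= 3.12 required, found {version[0]}.{version[1]}")
    return problems


if __name__ == "__main__":
    issues = check_requirements(platform.system(), platform.machine(), sys.version_info)
    print("OK" if not issues else "\n".join(issues))
```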

Installation

With Cider (recommended, includes W8A8 acceleration on M5+):

pip install mlx-vlm
pip install git+https://github.com/Mininglamp-AI/cider.git

Without Cider:

pip install mlx-vlm

Single-Step Demo

import mlx_vlm as pm
from vlm_service import custom_generate
from PIL import Image

# 1. Load model
model, processor = pm.load("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1-MLX-8bit")

# 2. Load a screenshot and scale it to a width of 1280 px, keeping the aspect ratio
img = Image.open("screenshot.png")
ratio = 1280 / img.width
img = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)

# 3. Build prompt
task = "Click the search bar and type hello"

prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.

## Output Format
<action>action</action>

## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
type(content='') # type the content.
hotkey(key='') # Trigger a keyboard shortcut.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
wait(duration='') # Sleep for specified duration (in seconds).
finish() # The task is completed.
stop(reason='') # If the item can not found in the image, give the reason

## User Instruction
{task}"""

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": prompt_text},
]
prompt = processor.tokenizer.apply_chat_template(
    messages, tokenize=False, add_generation_prompt=True
)
# Expand any literal "<image>" placeholder into the vision placeholder tokens
prompt = prompt.replace("<image>", "<|vision_start|><|image_pad|><|vision_end|>")

# 4. Run inference
result = custom_generate(
    model, processor, prompt,
    [img],
    max_tokens=512,
    temperature=0.0,
    prefill_step_size=2048,
)

print(f"Tokens: {result.generation_tokens}, Speed: {result.generation_tps:.1f} tok/s")
print(result.text)
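
The demo above produces a single action. For multi-step tasks, one possible shape of the outer loop is sketched below. `run_episode`, `step`, and `execute` are hypothetical names, not part of mlx-vlm or Cider. How the action history and new screenshots are fed back into the prompt is not specified here, so that part is left to the caller-supplied `step` function.

```python
from typing import Callable, List

# Actions after which the loop should stop instead of executing anything.
TERMINAL_PREFIXES = ("finish(", "stop(", "call_user(")


def run_episode(
    step: Callable[[List[str]], str],
    execute: Callable[[str], None],
    max_steps: int = 50,
) -> List[str]:
    """Call `step` repeatedly until a terminal action or `max_steps` is reached.

    `step` receives the action history and returns the next action string
    (the body of the <action> tag); `execute` performs it on the desktop.
    """
    history: List[str] = []
    for _ in range(max_steps):
        action = step(history).strip()
        history.append(action)
        if action.startswith(TERMINAL_PREFIXES):
            break
        execute(action)
    return history
```

The `max_steps` cap is a safety net for the case where the model never emits a terminal action; its default of 50 is arbitrary.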

Output Format

The model outputs structured XML:

<think>The search bar is at the top of the page...</think>
<action_desp>Click the search bar to focus it</action_desp>
<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>
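
A minimal sketch of splitting such a response into its three parts follows. `parse_response` is a hypothetical helper. It uses regular expressions rather than an XML parser because the `<|box_start|>` tokens inside `<action>` are not well-formed XML.

```python
import re

_TAGS = ("think", "action_desp", "action")


def parse_response(text: str) -> dict:
    """Extract the body of each known tag; tags that are missing map to None."""
    parsed = {}
    for tag in _TAGS:
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, flags=re.DOTALL)
        parsed[tag] = match.group(1).strip() if match else None
    return parsed
```

With the example above, `parse_response(result.text)["action"]` would hold the `click(...)` call as a string.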

Coordinates are normalized to the [0, 1000] range on both axes. To convert them to pixel coordinates:

pixel_x = int(x / 1000 * screen_width)
pixel_y = int(y / 1000 * screen_height)
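
The conversion can be combined with extracting the coordinates from an action string. This is a sketch with a hypothetical helper name, `boxes_to_pixels`; it assumes the screenshot covers the full screen, so normalized coordinates map directly onto the screen size regardless of the resize to 1280 px.

```python
import re

# Matches one "<|box_start|>(x,y)<|box_end|>" group and captures x and y.
_BOX = re.compile(r"<\|box_start\|>\(\s*(\d+)\s*,\s*(\d+)\s*\)<\|box_end\|>")


def boxes_to_pixels(action: str, screen_width: int, screen_height: int) -> list:
    """Return one (pixel_x, pixel_y) pair per box found in the action string."""
    return [
        (int(int(x) / 1000 * screen_width), int(int(y) / 1000 * screen_height))
        for x, y in _BOX.findall(action)
    ]


click = "click(start_box='<|box_start|>(500,38)<|box_end|>')"
print(boxes_to_pixels(click, 1920, 1080))  # [(960, 41)]
```

A `drag` action yields two pairs, the start point followed by the end point.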

W8A8 Acceleration (M5+ only)

On Apple M5 or later, enable INT8 acceleration for ~15-19% faster prefill:

from cider import convert_model, is_available

if is_available():
    convert_model(model.language_model)
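
If the same script has to run on machines with and without Cider installed, the import can be guarded. The sketch below wraps the snippet above in a hypothetical helper, `maybe_accelerate`; it uses only the two Cider functions shown above and falls back to the plain MLX 8-bit path otherwise.

```python
def maybe_accelerate(model) -> bool:
    """Try the W8A8 conversion; return True only if it was actually applied."""
    try:
        from cider import convert_model, is_available
    except ImportError:
        return False  # Cider is not installed: keep the model unchanged
    if not is_available():
        return False  # Cider is installed but reports acceleration as unavailable
    convert_model(model.language_model)
    return True
```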

Full Action Space

| Action | Syntax | Description |
| --- | --- | --- |
| open_app | `open_app(app_name='')` | Open an application |
| open_url | `open_url(url='')` | Open a URL |
| click | `click(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Left click |
| doubleclick | `doubleclick(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Double click |
| triple_click | `triple_click(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Triple click (select line) |
| right_single | `right_single(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Right click |
| hover | `hover(start_box='<\|box_start\|>(x,y)<\|box_end\|>')` | Mouse hover |
| type | `type(content='text')` | Type text |
| hotkey | `hotkey(key='cmd+c')` | Keyboard shortcut |
| hotkey_click | `hotkey_click(start_box='<\|box_start\|>(x,y)<\|box_end\|>', key='shift')` | Modifier + click |
| scroll | `scroll(start_box='<\|box_start\|>(x,y)<\|box_end\|>', direction='down', amount='3')` | Scroll |
| drag | `drag(start_box='<\|box_start\|>(x1,y1)<\|box_end\|>', end_box='<\|box_start\|>(x2,y2)<\|box_end\|>')` | Drag and drop |
| wait | `wait(duration='2')` | Wait (seconds) |
| finish | `finish()` | Task completed |
| stop | `stop(reason='...')` | Task infeasible |
| call_user | `call_user()` | Request human help |
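
Before an action can be executed, its string form has to be split into a name and arguments. The sketch below shows one way to do that for the syntax in the table; `parse_action` and `KNOWN_ACTIONS` are hypothetical names. It assumes every argument is a single-quoted keyword value and that a quote inside a value, if it occurs at all, is backslash-escaped, which is not specified here.

```python
import re

KNOWN_ACTIONS = {
    "open_app", "open_url", "click", "doubleclick", "triple_click",
    "right_single", "hover", "type", "hotkey", "hotkey_click",
    "scroll", "drag", "wait", "finish", "stop", "call_user",
}

_CALL = re.compile(r"^\s*(\w+)\((.*)\)\s*$", re.DOTALL)
_KWARG = re.compile(r"(\w+)='((?:\\'|[^'])*)'")


def parse_action(action: str) -> tuple:
    """Split "name(key='value', ...)" into (name, {key: value}); values stay strings."""
    match = _CALL.match(action)
    if match is None:
        raise ValueError(f"not an action call: {action!r}")
    name, body = match.groups()
    if name not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {name}")
    return name, dict(_KWARG.findall(body))
```

Rejecting unknown names early keeps a malformed model output from reaching whatever code drives the mouse and keyboard.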

Other Versions

| Version | Repo | Description |
| --- | --- | --- |
| fp16 | Mano-CUA-4B-Thinking-1.1 | Full precision, for archival / re-quantization / GPU inference |
| MLX-8bit (this) | Mano-CUA-4B-Thinking-1.1-MLX-8bit | MLX 8-bit quantized, recommended for Apple Silicon local inference |

Contact

Website: https://github.com/Mininglamp-AI/Mano-P
Email: model@mininglamp.com

Paper

Mano Report • 2509.17336 • Published Sep 22, 2025