LocateAnything-3B

LocateAnything-3B is a specialized vision-language model developed by NVIDIA for visual grounding and object localization tasks. This repository contains GGUF quantized variants of the model optimized for efficient local inference.

Unlike general-purpose multimodal assistants, LocateAnything-3B is designed to identify the precise spatial location of objects, regions, text, GUI elements, and entities referenced in natural language prompts. It pairs visual understanding with localization output, which suits it to perception-oriented workflows such as grounding, GUI automation, and document analysis.

The model introduces Parallel Box Decoding (PBD), which predicts a complete geometric structure in one parallel step instead of generating its coordinate tokens autoregressively. This is intended to improve both throughput and localization quality.


Model Overview

  • Model Name: LocateAnything-3B
  • Base Model: nvidia/LocateAnything-3B
  • Architecture: Vision-Language Model with Parallel Box Decoding
  • Parameter Count: 3 Billion
  • Modalities: Text, Image
  • Primary Languages: English
  • Developer: NVIDIA
  • License: nvidia-license
  • Core Capability: General-purpose visual grounding and localization

Quantization Formats

This repository provides the following GGUF quantizations of LocateAnything-3B. All size reductions are relative to the 16-bit model (6.34 GB).

IQ3_M

  • Size reduction of approximately 76.18% (1.51 GB) compared to 16-bit (6.34 GB)
  • Aggressive 3-bit quantization optimized for maximum memory efficiency
  • Suitable for low-memory multimodal deployment environments
  • Enables practical execution of visual grounding models on consumer hardware
  • Fine-grained localization precision and dense grounding performance may degrade compared to higher-precision variants

IQ4_NL

  • Size reduction of approximately 70.35% (1.88 GB) compared to 16-bit (6.34 GB)
  • Advanced 4-bit non-linear quantization designed to better preserve localization fidelity and multimodal understanding quality
  • Better suited for grounding workflows requiring stronger coordinate consistency and object localization accuracy
  • Designed to reduce quantization loss compared to more aggressive formats
  • May add slight computational overhead during inference

IQ4_XS

  • Size reduction of approximately 71.77% (1.79 GB) compared to 16-bit (6.34 GB)
  • Balanced 4-bit quantization focused on efficient inference and dependable visual grounding performance
  • Provides a practical balance between memory efficiency, localization quality, and runtime speed
  • Suitable for general-purpose grounding applications and multimodal deployment scenarios
  • Maintains stable performance across most practical localization workloads
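
The reduction percentages listed for each format follow directly from the file sizes. As a minimal sketch, the arithmetic can be reproduced as shown below; the function name `size_reduction_pct` is illustrative and not part of any tool.

```python
# File sizes in GB, as listed in this section.
FP16_SIZE_GB = 6.34
QUANT_SIZES_GB = {"IQ3_M": 1.51, "IQ4_NL": 1.88, "IQ4_XS": 1.79}


def size_reduction_pct(quant_gb: float, base_gb: float = FP16_SIZE_GB) -> float:
    """Return how much smaller the quantized file is than the base, in percent."""
    return round((1 - quant_gb / base_gb) * 100, 2)


if __name__ == "__main__":
    for name, size_gb in QUANT_SIZES_GB.items():
        print(f"{name}: {size_reduction_pct(size_gb)}% smaller than 16-bit")
```

The same function can be used to compare any other quantization against the 16-bit baseline by passing its file size.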

Training Background (Original Model)

LocateAnything-3B was trained with a focus on large-scale visual grounding, multimodal localization, and spatial reasoning across diverse visual domains.

Pretraining

  • Trained using large-scale multimodal datasets spanning natural scenes, documents, robotics, driving environments, and GUIs
  • Focus on aligning visual representations with natural language descriptions
  • Optimized for precise object localization and grounding tasks

Grounding Optimization

  • Enhanced using large-scale grounding datasets containing millions of localization annotations
  • Introduces Parallel Box Decoding to improve throughput and geometric consistency
  • Optimized for referring expression grounding, dense localization, and multimodal perception workflows

Key Capabilities

  • Visual Grounding: Localizes objects and entities referenced through natural language descriptions.

  • Referring Expression Grounding: Identifies precise regions corresponding to textual descriptions within images.

  • Multi-Object Localization: Supports grounding multiple entities within complex visual scenes.

  • GUI Element Grounding: Locates interface components and screen elements in GUI environments.

  • Text Localization: Supports grounding of textual regions embedded within visual inputs.

  • Efficient Local Deployment: Quantized variants enable practical multimodal inference on consumer hardware.


Usage Example

Using llama.cpp

./llama-mtmd-cli \
  -m SandlogicTechnologies/LocateAnything-3B_IQ4_NL.gguf \
  --mmproj SandlogicTechnologies/mmproj-LocateAnything-3B-BF16.gguf \
  --image image.png \
  -p "Locate the red car in the image."
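
The command above can also be assembled programmatically, for example when running the same model over several images. The sketch below only builds the argument list from the flags shown in the example (`-m`, `--mmproj`, `--image`, `-p`). The helper name `build_mtmd_command` and the prompt text are illustrative assumptions, and the prompt format the model expects is not specified here.

```python
import shlex
from typing import List


def build_mtmd_command(model: str, mmproj: str, image: str, prompt: str,
                       binary: str = "./llama-mtmd-cli") -> List[str]:
    """Assemble the llama-mtmd-cli argument list with the flags used in the example."""
    return [binary, "-m", model, "--mmproj", mmproj, "--image", image, "-p", prompt]


if __name__ == "__main__":
    cmd = build_mtmd_command(
        model="SandlogicTechnologies/LocateAnything-3B_IQ4_NL.gguf",
        mmproj="SandlogicTechnologies/mmproj-LocateAnything-3B-BF16.gguf",
        image="image.png",
        prompt="Locate the submit button.",  # illustrative prompt only
    )
    # Print a copy-pasteable shell command; pass `cmd` to subprocess.run to execute it.
    print(shlex.join(cmd))
```

Passing the arguments as a list rather than a single string avoids shell quoting problems when prompts contain spaces or quotes.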

Recommended Use Cases

  • Visual Grounding Systems: Build applications capable of locating objects from natural language descriptions.

  • Multimodal Perception Pipelines: Integrate visual localization capabilities into perception-oriented AI systems.

  • GUI Understanding Workflows: Support interface automation and GUI element identification tasks.

  • Document and Scene Understanding: Ground textual descriptions within complex visual environments.

  • Research and Experimentation: Evaluate multimodal localization techniques and grounding workflows.
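
For the evaluation use case, one common way to compare localization output, for instance between quantization variants, is intersection over union (IoU) between a predicted box and a reference box. The sketch below assumes boxes have already been parsed into `(x1, y1, x2, y2)` pixel tuples. The model's raw output format is not described here, so parsing is left out, and the name `iou` and the sample boxes are illustrative.

```python
from typing import Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), with x1 <= x2 and y1 <= y2


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes; 0.0 if they do not overlap."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    reference = (10.0, 10.0, 110.0, 110.0)  # hypothetical box from a higher-precision run
    candidate = (20.0, 20.0, 120.0, 120.0)  # hypothetical box from a quantized run
    print(f"IoU = {iou(reference, candidate):.3f}")
```

An IoU of 1.0 means the boxes coincide and 0.0 means they do not overlap, so averaging it over a fixed set of prompts gives a simple measure of how far a quantized variant drifts from a reference.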


Acknowledgments

These quantized models are based on the original work by the NVIDIA Research team.

Special thanks to:

  • The NVIDIA team for developing and releasing the LocateAnything-3B model.
  • Georgi Gerganov and the llama.cpp open-source community for enabling efficient GGUF quantization and inference.

Contact

For questions, feedback, or support, please reach out at support@sandlogic.com or visit https://www.sandlogic.com/
