How to use from the LiteRT-LM library
# LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM)
# and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter).
# For platform-specific integration guides, please refer to the official developer website:
# https://ai.google.dev/edge/litert-lm

# To try LiteRT-LM, the easiest way is to use our CLI tool.
# 1. Install the LiteRT-LM CLI tool:
pip install litert-lm

# 2. Download and run this model locally:
# See: https://ai.google.dev/edge/litert-lm/cli
litert-lm run \
  --from-huggingface-repo=litert-community/Qwen3-0.6B \
  model.litertlm \
  --prompt="Write me a poem"

litert-community/Qwen3-0.6B

Main model card: Qwen/Qwen3-0.6B

This repository contains LiteRT-LM variants of Qwen3-0.6B for Android and desktop deployment.

Available Artifacts

| File | Quantization | Context | Size |
|---|---|---|---|
| Qwen3-0.6B.litertlm | dynamic INT8 weights, float KV | 4096 | 586 MB |
| Qwen3-0.6B.mediatek.mt6993.litertlm | a16w8 NPU-targeted | 4096 | 992 MB |
| qwen3_0_6b_mixed_int4.litertlm | TorchAO mixed INT4, float KV | 2048 | 474.61 MiB |

Conversion Notes

The mixed INT4 .litertlm artifact was produced with a TorchAO-based quantize-first recipe from the original Hugging Face checkpoint. This is a mixed quantization layout rather than a uniform all-INT4 model: eligible linear projection weights are stored as blockwise INT4 with group size 32 and floating-point scales, token embedding weights use weight-only INT8 quantization, and normalization/reduction paths plus KV cache tensors remain floating point.
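
To make the "blockwise INT4 with group size 32 and floating-point scales" layout concrete, here is a minimal pure-Python sketch of group-wise weight quantization. It is an illustration only, not the TorchAO recipe used for this artifact: the symmetric range [-8, 7], the max-abs scale choice, and the function names are all assumptions made for the example.

```python
import math

GROUP_SIZE = 32          # group size stated in the conversion notes
INT4_MIN, INT4_MAX = -8, 7  # signed 4-bit range (assumed symmetric scheme)


def quantize_groupwise_int4(weights, group_size=GROUP_SIZE):
    """Quantize a flat list of floats to INT4 codes, one float scale per group."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of the group size")
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumed scale choice: map the largest magnitude in the group to 7.
        scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(INT4_MIN, min(INT4_MAX, q)))
    return codes, scales


def dequantize_groupwise_int4(codes, scales, group_size=GROUP_SIZE):
    """Reconstruct approximate float weights from codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


# Two groups of 32 synthetic weights.
weights = [0.1 * math.sin(i) for i in range(64)]
codes, scales = quantize_groupwise_int4(weights)
restored = dequantize_groupwise_int4(codes, scales)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(f"groups={len(scales)} max_abs_error={max_error:.5f}")
```

Because each group carries its own scale, one outlier weight only coarsens the 32 values that share its group instead of the whole tensor, which is the usual motivation for small group sizes.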

The mixed INT4 bundle also uses LiteRT-LM StableHLO composite ops for attention/cache execution, including odml.runtime_bmm and odml.cache_update.

Qwen3-0.6B.litertlm is a separate dynamic INT8 artifact. It was converted through the LiteRT Torch (litert-torch) path and quantized with AI Edge Quantizer. This artifact is independent from qwen3_0_6b_mixed_int4.litertlm, which uses the TorchAO-based mixed INT4 recipe described above.

Android Performance Examples

These are representative measurements from retail devices to give a rough sense of on-device runtime behavior, not a direct comparison between hardware platforms. All numbers were collected with LiteRT-LM's litert_lm_advanced_main launched from an adb command line on the connected device; they are not app-level measurements from an integrated Android application.

Hardware benchmark disclosure: Results were measured by us on retail devices purchased through normal channels. These results are not affiliated with, sponsored by, endorsed by, or verified by Samsung, vivo, Qualcomm, MediaTek, Google, MLCommons, or Hugging Face. Results depend on device SKU, OS build, thermal state, battery mode, backend, model quantization, runtime version, and benchmark settings.

qwen3_0_6b_mixed_int4.litertlm

Context: 2048. Shape: 256 prefill tokens / 256 decode tokens. Rows use LiteRT-LM v0.13.1. Values report the warmed iteration from a two-iteration run unless noted.

| Example device | Backend | Prefill (tok/s) | Decode (tok/s) | TTFT (s) | Peak Private Footprint |
|---|---|---|---|---|---|
| Samsung SM-S937U1 | GPU OpenCL | 1844.95 | 69.38 | 0.150 | 585 MB |
| vivo V2502A | GPU OpenCL | 1055.89 | 22.34 | 0.285 | 1856 MB |
| TECNO LJ9 | GPU OpenCL | 637.01 | 33.51 | 0.430 | 1832 MB |
| Samsung SM-S937U1 | CPU | 576.59 | 12.90 | 0.520 | 2895 MB |
| TECNO LJ9 | CPU | 231.15 | 8.33 | 1.230 | 2890 MB |
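
As a rough way to read the table above, the sketch below turns TTFT and decode throughput into an approximate wall-clock time for a 256-token response. The formula (TTFT plus decode tokens divided by decode rate) is a first-order assumption, not something the benchmark tool reports, and the function name is hypothetical; the inputs are copied from the table.

```python
def estimate_total_latency_s(ttft_s, decode_tokens, decode_tok_per_s):
    """First-order estimate: time to first token plus steady-state decode time."""
    if decode_tok_per_s <= 0:
        raise ValueError("decode_tok_per_s must be positive")
    return ttft_s + decode_tokens / decode_tok_per_s


# (device, backend, TTFT in seconds, decode tok/s) from the table above.
ROWS = [
    ("Samsung SM-S937U1", "GPU OpenCL", 0.150, 69.38),
    ("vivo V2502A", "GPU OpenCL", 0.285, 22.34),
    ("TECNO LJ9", "GPU OpenCL", 0.430, 33.51),
    ("Samsung SM-S937U1", "CPU", 0.520, 12.90),
    ("TECNO LJ9", "CPU", 1.230, 8.33),
]
DECODE_TOKENS = 256  # decode length used for these rows

for device, backend, ttft_s, decode_tps in ROWS:
    total = estimate_total_latency_s(ttft_s, DECODE_TOKENS, decode_tps)
    print(f"{device:<18} {backend:<10} ~{total:.1f} s")
```

For long responses the decode rate dominates the estimate, so a backend with a faster prefill but slower decode can still finish later.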

Qwen3-0.6B.litertlm

Context: 4096. Samsung and TECNO rows use 256 prefill tokens / 256 decode tokens with LiteRT-LM v0.13.1. The vivo rows are previously published 4096-context reference results; TTFT, peak footprint, and exact prompt/decode shape were not recorded in this update.

| Example device | Backend | Prefill (tok/s) | Decode (tok/s) | TTFT (s) | Peak Private Footprint |
|---|---|---|---|---|---|
| Samsung SM-S937U1 | GPU OpenCL | 646.33 | 25.31 | 0.440 | 2940 MB |
| TECNO LJ9 | GPU OpenCL | 254.24 | 12.10 | 1.090 | 4283 MB |
| vivo V2502A | GPU OpenCL | 580 | 21 | - | - |
| Samsung SM-S937U1 | CPU | 212.07 | 13.02 | 1.280 | 2697 MB |
| TECNO LJ9 | CPU | 95.14 | 9.32 | 2.800 | 2699 MB |
| vivo V2502A | CPU | 165 | 9 | - | - |

Qwen3-0.6B.mediatek.mt6993.litertlm

Context: 4096. This is a previously published MediaTek MT6993 NPU reference result; TTFT, peak footprint, and exact prompt/decode shape were not recorded in this update.

| Example device | Backend | Prefill (tok/s) | Decode (tok/s) | TTFT (s) | Peak Private Footprint |
|---|---|---|---|---|---|
| vivo V2502A | NPU | 1472 | 36 | - | - |

Desktop Smoke Benchmark

Benchmarked on AMD Radeon AI PRO R9700 via LiteRT-LM WebGPU with 256 prefill tokens and 32 decode tokens.

| Backend | Prefill (tok/s) | Decode (tok/s) | TTFT (s) | Peak Private Footprint |
|---|---|---|---|---|
| GPU WebGPU | 4257.13 | 142.07 | 0.07 | 803 MB |

Try It

Install uv and run:

uv tool install litert-lm
uvx litert-lm run --from-huggingface-repo=litert-community/Qwen3-0.6B qwen3_0_6b_mixed_int4.litertlm --prompt="What is the capital of France?"
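
If you want to drive the same command from a script, a minimal Python sketch follows. It only reuses the subcommand and flags shown above (`run`, `--from-huggingface-repo`, `--prompt`); the helper names are hypothetical, and the assumption that the model's response arrives on stdout has not been verified against the CLI.

```python
import shlex
import subprocess


def build_run_command(repo, model_file, prompt):
    """Assemble the argv list for the `litert-lm run` invocation shown above."""
    return [
        "litert-lm",
        "run",
        f"--from-huggingface-repo={repo}",
        model_file,
        f"--prompt={prompt}",
    ]


def run_prompt(repo, model_file, prompt):
    """Run the CLI and return its stdout (assumed to contain the response)."""
    cmd = build_run_command(repo, model_file, prompt)
    result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return result.stdout


cmd = build_run_command(
    "litert-community/Qwen3-0.6B",
    "qwen3_0_6b_mixed_int4.litertlm",
    "What is the capital of France?",
)
# Print the shell-quoted command without executing it.
print(shlex.join(cmd))
```

Passing the arguments as a list avoids shell quoting problems when the prompt contains spaces or punctuation. Call `run_prompt(...)` with the same arguments once the CLI is installed.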

Integration

Ready to integrate this into your product? Get started in the LiteRT-LM documentation: https://ai.google.dev/edge/litert-lm

Citation

@misc{qwen3technicalreport,
      title={Qwen3 Technical Report},
      author={Qwen Team},
      year={2025},
      eprint={2505.09388},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2505.09388},
}