---
language:
- en
license: apache-2.0
license_link: https://huggingface.co/Qwen/Qwen3-0.6B/blob/main/LICENSE
pipeline_tag: text-generation
base_model: Qwen/Qwen3-0.6B
base_model_relation: quantized
library_name: litert-lm
tags:
- litert-lm
- litertlm
- qwen
- Qwen3
---

# litert-community/Qwen3-0.6B

Main model card: [Qwen/Qwen3-0.6B](https://huggingface.co/Qwen/Qwen3-0.6B)

This repository contains LiteRT-LM variants of Qwen3-0.6B for Android and desktop deployment.

## Available Artifacts

| File | Quantization | Context | Size |
|---|---|---:|---:|
| `Qwen3-0.6B.litertlm` | dynamic INT8 weights, float KV | 4096 | 586 MB |
| `Qwen3-0.6B.mediatek.mt6993.litertlm` | a16w8 NPU-targeted | 4096 | 992 MB |
| `qwen3_0_6b_mixed_int4.litertlm` | TorchAO mixed INT4, float KV | 2048 | 474.61 MiB |

## Conversion Notes

The mixed INT4 `.litertlm` artifact was produced with a TorchAO-based quantize-first recipe from the original Hugging Face checkpoint. This is a mixed quantization layout rather than a uniform all-INT4 model: eligible linear projection weights are stored as blockwise INT4 with group size 32 and floating-point scales, token embedding weights use weight-only INT8 quantization, and normalization/reduction paths plus KV cache tensors remain floating point. The mixed INT4 bundle also uses LiteRT-LM StableHLO composite ops for attention/cache execution, including `odml.runtime_bmm` and `odml.cache_update`.

`Qwen3-0.6B.litertlm` is a separate dynamic INT8 artifact. It was converted through the LiteRT Torch (`litert-torch`) path and quantized with AI Edge Quantizer. This artifact is independent from `qwen3_0_6b_mixed_int4.litertlm`, which uses the TorchAO-based mixed INT4 recipe described above.

## Android Performance Examples

These are representative measurements from retail devices to give a rough sense of on-device runtime behavior, not a direct comparison between hardware platforms.
All numbers were collected with LiteRT-LM's `litert_lm_advanced_main` launched from an adb command line on the connected device; they are not app-level measurements from an integrated Android application.

Hardware benchmark disclosure: Results were measured by us on retail devices purchased through normal channels. These results are not affiliated with, sponsored by, endorsed by, or verified by Samsung, vivo, Qualcomm, MediaTek, Google, MLCommons, or Hugging Face. Results depend on device SKU, OS build, thermal state, battery mode, backend, model quantization, runtime version, and benchmark settings.

### `qwen3_0_6b_mixed_int4.litertlm`

Context: 2048. Shape: 256 prefill tokens / 256 decode tokens. Rows use LiteRT-LM v0.13.1. Values report the warmed iteration from a two-iteration run unless noted.

| Example device | Backend | Prefill (tok/s) | Decode (tok/s) | TTFT (s) | Peak Private Footprint |
|---|---|---:|---:|---:|---:|
| Samsung SM-S937U1 | GPU OpenCL | 1844.95 | 69.38 | 0.150 | 585 MB |
| vivo V2502A | GPU OpenCL | 1055.89 | 22.34 | 0.285 | 1856 MB |
| TECNO LJ9 | GPU OpenCL | 637.01 | 33.51 | 0.430 | 1832 MB |
| Samsung SM-S937U1 | CPU | 576.59 | 12.90 | 0.520 | 2895 MB |
| TECNO LJ9 | CPU | 231.15 | 8.33 | 1.230 | 2890 MB |

### `Qwen3-0.6B.litertlm`

Context: 4096. Samsung and TECNO rows use 256 prefill tokens / 256 decode tokens with LiteRT-LM v0.13.1. The vivo rows are previously published 4096-context reference results; TTFT, peak footprint, and exact prompt/decode shape were not recorded in this update.

| Example device | Backend | Prefill (tok/s) | Decode (tok/s) | TTFT (s) | Peak Private Footprint |
|---|---|---:|---:|---:|---:|
| Samsung SM-S937U1 | GPU OpenCL | 646.33 | 25.31 | 0.440 | 2940 MB |
| TECNO LJ9 | GPU OpenCL | 254.24 | 12.10 | 1.090 | 4283 MB |
| vivo V2502A | GPU OpenCL | 580 | 21 | - | - |
| Samsung SM-S937U1 | CPU | 212.07 | 13.02 | 1.280 | 2697 MB |
| TECNO LJ9 | CPU | 95.14 | 9.32 | 2.800 | 2699 MB |
| vivo V2502A | CPU | 165 | 9 | - | - |

### `Qwen3-0.6B.mediatek.mt6993.litertlm`

Context: 4096. This is a previously published MediaTek MT6993 NPU reference result; TTFT, peak footprint, and exact prompt/decode shape were not recorded in this update.

| Example device | Backend | Prefill (tok/s) | Decode (tok/s) | TTFT (s) | Peak Private Footprint |
|---|---|---:|---:|---:|---:|
| vivo V2502A | NPU | 1472 | 36 | - | - |

## Desktop Smoke Benchmark

Benchmarked on AMD Radeon AI PRO R9700 via LiteRT-LM WebGPU with 256 prefill tokens and 32 decode tokens.

| Backend | Prefill (tok/s) | Decode (tok/s) | TTFT (s) | Peak Private Footprint |
|---|---:|---:|---:|---:|
| GPU WebGPU | 4257.13 | 142.07 | 0.07 | 803 MB |

## Try It

Install [uv](https://docs.astral.sh/uv/getting-started/installation/) and run:

```bash
uv tool install litert-lm
uvx litert-lm run --from-huggingface-repo=litert-community/Qwen3-0.6B qwen3_0_6b_mixed_int4.litertlm --prompt="What is the capital of France?"
```

## Integration

Ready to integrate this into your product? Get started in the [LiteRT-LM documentation](https://ai.google.dev/edge/litert-lm/overview).

### Citation

```
@misc{qwen3technicalreport,
  title={Qwen3 Technical Report},
  author={Qwen Team},
  year={2025},
  eprint={2505.09388},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2505.09388},
}
```
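
## Appendix: Blockwise INT4 Layout Sketch

The Conversion Notes describe linear projection weights stored as blockwise INT4 with group size 32 and floating-point scales. The following is a minimal pure-Python sketch of that general idea, not the TorchAO recipe used to build the artifact. It assumes signed, symmetric quantization (no zero point) with one scale per group of 32 weights; the function names `quantize_blockwise_int4` and `dequantize_blockwise_int4` are hypothetical and exist only for this illustration.

```python
from typing import List, Tuple

GROUP_SIZE = 32              # group size stated in the Conversion Notes
INT4_MIN, INT4_MAX = -8, 7   # assumption: signed 4-bit codes, symmetric scheme


def quantize_blockwise_int4(
    row: List[float], group_size: int = GROUP_SIZE
) -> Tuple[List[int], List[float]]:
    """Quantize one weight row into INT4 codes plus one float scale per group.

    Hypothetical helper for illustration; the real recipe may differ.
    """
    if len(row) % group_size != 0:
        raise ValueError("row length must be a multiple of group_size")
    codes: List[int] = []
    scales: List[float] = []
    for start in range(0, len(row), group_size):
        group = row[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude in the group to code 7.
        # An all-zero group gets scale 1.0 to avoid dividing by zero.
        scale = max_abs / INT4_MAX if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(INT4_MIN, min(INT4_MAX, q)))
    return codes, scales


def dequantize_blockwise_int4(
    codes: List[int], scales: List[float], group_size: int = GROUP_SIZE
) -> List[float]:
    """Reconstruct approximate float weights from codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(codes)]


if __name__ == "__main__":
    # Two groups of 32 synthetic weights with alternating signs.
    weights = [((-1) ** i) * i / 64.0 for i in range(64)]
    codes, scales = quantize_blockwise_int4(weights)
    restored = dequantize_blockwise_int4(codes, scales)
    worst = max(abs(w - r) for w, r in zip(weights, restored))
    print("groups:", len(scales), "max abs error:", worst)
```

Because each group carries its own scale, an outlier weight only degrades the precision of the 32 values that share its group rather than the whole row, which is the usual motivation for small group sizes in low-bit weight quantization.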