license: apache-2.0
base_model:
  - google/gemma-4-E2B-it
tags:
  - litert-lm

litert-community/gemma-4-E2B-it-litert-lm

Main Model Card: google/gemma-4-E2B-it

This model card provides the Gemma 4 E2B model in a way that is ready for deployment on Android, iOS, Desktop, IoT and Web.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. This particular Gemma 4 model is small, which makes it well suited to on-device use cases. Running the model on device gives users private access to generative AI technology without requiring an internet connection.

These models are provided in the .litertlm format for use with the LiteRT-LM framework. LiteRT-LM is a specialized orchestration layer built directly on top of LiteRT, Google’s high-performance multi-platform runtime trusted by millions of Android and edge developers. LiteRT provides the foundational hardware acceleration via XNNPack for CPU and ML Drift for GPU. LiteRT-LM adds the specialized GenAI libraries and APIs, such as KV-cache management, prompt templating, and function calling. This integrated stack is the same technology powering the Google AI Edge Gallery showcase app.

LiteRT-LM uses the state-of-the-art Gemma 4 mobile quantization scheme, which mixes 2-bit, 4-bit, and 8-bit weights. For text-only use cases, the in-memory weight footprint can be as low as 0.8 GB, while the runtime uses memory mapping to support the 1.12 GB of embedding parameters. This approach gives significant working-memory savings on some platforms, as shown in the more detailed data below. Additionally, the Vision and Audio models are loaded on demand to further reduce memory consumption.
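To illustrate why memory mapping helps with a large embedding table, here is a minimal Python sketch. It is not LiteRT-LM's implementation: the file layout, the `read_row_mmap` helper, and the row size are all hypothetical. It only shows the general idea that a memory-mapped file lets a process read the rows it needs without first copying the whole file into memory.

```python
import mmap
import os
import tempfile

ROW_BYTES = 64  # hypothetical size of one embedding row, in bytes


def read_row_mmap(path: str, row: int, row_bytes: int) -> bytes:
    """Return one fixed-size row from a file through a read-only memory map."""
    with open(path, "rb") as f:
        # Length 0 maps the whole file; the OS pages data in only when touched.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            start = row * row_bytes
            return mm[start:start + row_bytes]


# Build a small stand-in "embedding table": 1000 rows, each filled with one byte value.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(1000):
        tmp.write(bytes([i % 256]) * ROW_BYTES)
    table_path = tmp.name

row = read_row_mmap(table_path, row=42, row_bytes=ROW_BYTES)
os.remove(table_path)
print(len(row), row[0])
```

Because a token lookup touches only a handful of rows per step, the pages that actually become resident can be far smaller than the file on disk.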

Try Gemma 4 E2B


Android	iOS	Desktop	IoT	Web

Build with Gemma 4 E2B and LiteRT-LM

Ready to integrate this into your product? Get started here.

Gemma 4 E2B Performance on LiteRT-LM

All benchmarks were taken using 1024 prefill tokens and 256 decode tokens with a context length of 2048 tokens via LiteRT-LM. The model can support a context length of up to 32k tokens. CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads. Time-to-first-token does not include load time. Benchmarks were run with caches enabled and initialized, so latency and memory usage may differ on the first run. Model size is the size of the file on disk.
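As a rough way to read the tables, time-to-first-token can be approximated as the prefill token count divided by the prefill rate, and the decode time as the decode token count divided by the decode rate. The sketch below is a back-of-the-envelope estimate, not the benchmark harness; the `estimate_latency` helper is hypothetical, and the input figures are the S26 Ultra CPU values from the Android table.

```python
from typing import Tuple


def estimate_latency(
    prefill_tokens: int,
    decode_tokens: int,
    prefill_tps: float,
    decode_tps: float,
) -> Tuple[float, float]:
    """Return (estimated time-to-first-token, estimated total time) in seconds."""
    ttft = prefill_tokens / prefill_tps      # time to process the prompt
    decode_time = decode_tokens / decode_tps  # time to generate the reply
    return ttft, ttft + decode_time


# S26 Ultra, CPU backend: 557 prefill tokens/sec, 46.9 decode tokens/sec.
ttft, total = estimate_latency(1024, 256, prefill_tps=557, decode_tps=46.9)
print(f"estimated TTFT: {ttft:.1f} s, estimated total: {total:.1f} s")
```

For this row the estimate rounds to the reported 1.8 s. Reported values can deviate from this simple ratio on some platforms (the Web row, for example), so treat it as an approximation only.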

CPU memory was measured using rusage::ru_maxrss on Android, Linux, and Raspberry Pi; task_vm_info::phys_footprint on iOS and MacBook; and process_memory_counters::PrivateUsage on Windows.
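For readers who want a comparable number from their own process, the `ru_maxrss` counter is also exposed by Python's standard `resource` module on Unix-like systems. This is only a sketch of reading the same metric, not the tooling used for the benchmarks above; the `peak_rss_mb` helper is hypothetical, and the module is not available on Windows.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in megabytes (Unix only)."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux but in bytes on macOS.
    if sys.platform == "darwin":
        return peak / (1024 * 1024)
    return peak / 1024


print(f"peak RSS: {peak_rss_mb():.1f} MB")
```

Because `ru_maxrss` is a high-water mark, it reflects the largest resident footprint reached so far, not the current usage.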

Android

Note: On supported Android devices, Gemma 4 is available through Android AI Core as Gemini Nano, which is the recommended path for production applications.

Device	Backend	Prefill (tokens/sec)	Decode (tokens/sec)	Time-to-first-token (sec)	Model size (MB)	CPU Memory (MB)
S26 Ultra	CPU	557	46.9	1.8	2583	1733
S26 Ultra	GPU	3,808	52.1	0.3	2583	676

🚨 NEW: Android with Speculative Decoding 🚨

The numbers in this section include speculative decoding. Speculative decoding is an optimization that accelerates LLMs by using a small, fast "draft" model to quickly predict multiple upcoming tokens, while a larger "target" model verifies those tokens in parallel. Its effectiveness is task dependent, because the draft model predicts the correct tokens more easily for some tasks than for others. The metrics in this section were collected from a variety of sample prompts and grouped into categories by task type. The baseline measurements are an average across all task types. The number of input and output tokens varied across prompts. Note that if you downloaded this model before May 5, 2026, you need to re-download it to use speculative decoding. Speculative decoding is available on CPU and GPU on Mobile and Desktop.
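The draft-then-verify loop described above can be sketched with toy models. This is a simplified greedy variant under stated assumptions, not LiteRT-LM's implementation: `target_next` and `draft_next` are hypothetical stand-ins for real models, and the target's parallel verification is simulated with a plain loop and counted as one pass. Production systems typically use a probabilistic accept/reject rule instead of exact token matching.

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(ctx: List[int]) -> int:
    """Toy 'target' model: a deterministic function of the context."""
    return (ctx[-1] * 7 + len(ctx)) % 50


def draft_next(ctx: List[int]) -> int:
    """Toy 'draft' model: matches the target except at every 4th position."""
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Plain one-token-at-a-time decoding, used here as the reference output."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Return (tokens, number of target verification passes)."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model proposes up to k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))
        # 2. The target checks every proposed position (simulated as one pass).
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        else:
            # Every draft token matched: the target supplies one extra token.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out, target_passes


prompt = [1, 2, 3]
tokens, passes = speculative_decode(target_next, draft_next, prompt, n_new=20)
reference = greedy_decode(target_next, prompt, n_new=20)
print(tokens == reference, passes)
```

In this greedy variant the output is identical to decoding with the target alone, and the benefit comes from needing fewer target passes than generated tokens. The more often the draft agrees with the target, the more tokens each pass accepts, which is why the gain depends on the task.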

Device	Backend	Task Type	Speculative Decoding?	Decode (tokens/sec)	CPU Memory (MB)
S26 Ultra	CPU	Baseline	No	40.7	1362
S26 Ultra	CPU	Summarize text	Yes	47.5	1582
S26 Ultra	CPU	Code snippet	Yes	36.3	1440
S26 Ultra	CPU	Rewrite tone	Yes	47.1	1408
S26 Ultra	CPU	Free form	Yes	38.1	1459
S26 Ultra	GPU	Baseline	No	51.5	791
S26 Ultra	GPU	Summarize text	Yes	91.7	817
S26 Ultra	GPU	Code snippet	Yes	84.4	788
S26 Ultra	GPU	Rewrite tone	Yes	87.4	762
S26 Ultra	GPU	Free form	Yes	66.5	804
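The relative effect of speculative decoding can be read off the table by dividing each decode rate by the baseline for the same backend. The short sketch below does that arithmetic with the figures reported above; the variable names are illustrative, and because the baseline is an average across all task types, the ratios are indicative, not exact per-task speedups.

```python
# Decode rates (tokens/sec) copied from the S26 Ultra table above.
baseline = {"CPU": 40.7, "GPU": 51.5}
with_speculative = {
    ("CPU", "Summarize text"): 47.5,
    ("CPU", "Code snippet"): 36.3,
    ("CPU", "Rewrite tone"): 47.1,
    ("CPU", "Free form"): 38.1,
    ("GPU", "Summarize text"): 91.7,
    ("GPU", "Code snippet"): 84.4,
    ("GPU", "Rewrite tone"): 87.4,
    ("GPU", "Free form"): 66.5,
}

# Ratio above 1.0 means faster than the baseline for that backend.
speedup = {
    key: round(tps / baseline[key[0]], 2) for key, tps in with_speculative.items()
}

for (backend, task), ratio in speedup.items():
    print(f"{backend:3s} {task:15s} {ratio:.2f}x")
```

On these figures the GPU ratios range from about 1.29x to 1.78x, while the CPU ratios range from about 0.89x to 1.17x, meaning two of the CPU task types decode more slowly than the baseline average.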

iOS

Device	Backend	Prefill (tokens/sec)	Decode (tokens/sec)	Time-to-first-token (sec)	Model size (MB)	CPU/GPU Memory (MB)
iPhone 17 Pro	CPU	532	25.0	1.9	2583	607
iPhone 17 Pro	GPU	2,878	56.5	0.3	2583	1450

Linux

Device	Backend	Prefill (tokens/sec)	Decode (tokens/sec)	Time-to-first-token (sec)	Model size (MB)	CPU Memory (MB)
Arm 2.3 & 2.8GHz	CPU	260	35.0	4.0	2583	1628
NVIDIA GeForce RTX 4090	GPU	11,234	143.4	0.1	2583	913

macOS

Device	Backend	Prefill (tokens/sec)	Decode (tokens/sec)	Time-to-first-token (sec)	Model size (MB)	CPU/GPU Memory (MB)
MacBook Pro M4 Max	CPU	901	41.6	1.1	2583	736
MacBook Pro M4 Max	GPU	7,835	160.2	0.1	2583	1623

Windows

Device	Backend	Prefill (tokens/sec)	Decode (tokens/sec)	Time-to-first-token (sec)	Model size (MB)	CPU Memory (MB)
Intel LunarLake	CPU	435	29.8	2.39	2583	3505
Intel LunarLake	GPU	3,751	48.4	0.29	2583	3540

Web

Device	Backend	Prefill (tokens/sec)	Decode (tokens/sec)	Time-to-first-token (sec)	Model size (MB)	GPU Memory (MB)
MacBook Pro M4 Max	WebGPU	4,853	73	1.09	2008	~1800

On Web, LiteRT-LM uses a model specially optimized for the platform's unique memory constraints. This Web model is currently text-only.

IoT

Device	Backend	Prefill (tokens/sec)	Decode (tokens/sec)	Time-to-first-token (sec)	Model size (MB)	CPU Memory (MB)
Raspberry Pi 5 16GB	CPU	133	7.6	7.8	2583	1546
Jetson Orin Nano	CPU	109	12.2	9.4	2583	3681
Jetson Orin Nano	GPU	1,142	24.2	0.9	2583	2739
Qualcomm Dragonwing IQ8 (IQ-8275)	NPU	3,747	31.7	0.3	2967	1869

The NPU model was benchmarked with a context length of 4096 tokens, not the 2048 tokens used for the other rows.

Running Gemma 4 E2B on Web with MediaPipe

You can also run Gemma through the MediaPipe LLM Inference Engine, although this route is currently in maintenance mode. To add it to an existing MediaPipe flow, download the gemma-4-E2B-it-web.task model file and run it with our sample web page, or follow the guide to add it to your own app.