---
license: apache-2.0
---

# litert-community/gemma-4-E2B-it-litert-lm

Main Model Card: [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)

This model card provides the Gemma 4 E2B model in a form that is ready for deployment on Android, iOS, Desktop, IoT, and Web.

Gemma is a family of lightweight, state-of-the-art open models from Google, built from the same research and technology used to create the Gemini models. This particular Gemma 4 model is small, which makes it well suited to on-device use cases. By running this model on device, users can have private access to Generative AI technology without requiring an internet connection.

These models are provided in the `.litertlm` format for use with the LiteRT-LM framework. LiteRT-LM is a specialized orchestration layer built directly on top of LiteRT, Google’s high-performance multi-platform runtime trusted by millions of Android and edge developers. LiteRT provides the foundational hardware acceleration via XNNPack for CPU and ML Drift for GPU. LiteRT-LM adds the specialized GenAI libraries and APIs, such as KV-cache management, prompt templating, and function calling. This integrated stack is the same technology powering the Google AI Edge Gallery showcase app.

The model file size is 2.58 GB, which includes a text decoder with 0.79 GB of weights and 1.1 GB of embedding parameters. The LiteRT-LM framework always keeps the main weights in memory but memory-maps the embedding parameters, because only a fraction of them is required for each inference. The vision and audio models are loaded as needed to further reduce memory consumption.

## Try Gemma 4 E2B

| [Android](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) | [iOS](https://apps.apple.com/us/app/google-ai-edge-gallery/id6749645337) | [Desktop](https://ai.google.dev/edge/litert-lm/cli) | [IoT](https://ai.google.dev/edge/litert-lm/cli) | [Web](#gemma-4-e2b-performance-on-web) |
| :---: | :---: | :---: | :---: | :---: |

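The overview notes that the embedding parameters are memory-mapped because each inference touches only a fraction of them. The sketch below illustrates that general idea with Python's standard `mmap` module. It is not LiteRT-LM's implementation: the file layout, `EMBED_DIM`, `VOCAB_SIZE`, and the helper names are all hypothetical and chosen only to keep the example small.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical toy dimensions; real embedding tables are far larger.
EMBED_DIM = 4
VOCAB_SIZE = 1000
ROW_BYTES = EMBED_DIM * 4  # each value is stored as a 4-byte float32


def write_table(path):
    """Write a toy table in which row i holds EMBED_DIM copies of float(i)."""
    with open(path, "wb") as f:
        for token_id in range(VOCAB_SIZE):
            row = [float(token_id)] * EMBED_DIM
            f.write(struct.pack(f"<{EMBED_DIM}f", *row))


def lookup(mapped, token_id):
    """Decode one row directly from the mapping at its byte offset."""
    offset = token_id * ROW_BYTES
    return struct.unpack_from(f"<{EMBED_DIM}f", mapped, offset)


def embed_tokens(path, token_ids):
    """Map the file read-only and fetch only the rows for the given tokens."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return [lookup(mapped, t) for t in token_ids]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        table_path = os.path.join(tmp, "embeddings.bin")
        write_table(table_path)
        print(embed_tokens(table_path, [3, 42, 999]))
```

With a mapping like this, the operating system pages in only the parts of the file that are actually read, so resident memory tends to track the rows in use rather than the full table size.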
## Build with Gemma 4 E2B and LiteRT-LM

Ready to integrate this into your product? Get started [here](https://ai.google.dev/edge/litert-lm/overview).

## Gemma 4 E2B Performance on LiteRT-LM

All benchmarks were taken using 1024 prefill tokens and 256 decode tokens with a context length of 2048 tokens via LiteRT-LM. The model can support up to 32k context length. Inference on CPU is accelerated via the LiteRT XNNPACK delegate with 4 threads. Time-to-first-token does not include load time. Benchmarks were run with caches enabled and initialized; during the first run, latency and memory usage may differ. Model size is the size of the file on disk.

CPU memory was measured using `rusage::ru_maxrss` on Android, Linux, and Raspberry Pi, `task_vm_info::phys_footprint` on iOS and MacBook, and `process_memory_counters::PrivateUsage` on Windows.

The model uses the Gemma quantization scheme, which employs a mixture of 2-bit, 4-bit, and 8-bit weights.

**Android**

*Note: On [supported Android devices](https://developers.google.com/ml-kit), Gemma 4 is available through Android AI Core as [Gemini Nano](https://developer.android.com/ai/gemini-nano#architecture), which is the recommended path for production applications.*

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| S26 Ultra | CPU | 557 | 46.9 | 1.8 | 2583 | 1733 |
| S26 Ultra | GPU | 3,808 | 52.1 | 0.3 | 2583 | 676 |

**iOS**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| iPhone 17 Pro | CPU | 532 | 25.0 | 1.9 | 2583 | 607 |
| iPhone 17 Pro | GPU | 2,878 | 56.5 | 0.3 | 2583 | 1450 |

**Linux**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Arm 2.3 & 2.8GHz | CPU | 260 | 35.0 | 4.0 | 2583 | 1628 |
| NVIDIA GeForce RTX 4090 | GPU | 11,234 | 143.4 | 0.1 | 2583 | 913 |

**macOS**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| MacBook Pro M4 | CPU | TODO | TODO | TODO | TODO | TODO |
| MacBook Pro M4 | GPU | TODO | TODO | TODO | TODO | TODO |

**Windows**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Windows | CPU | TODO | TODO | TODO | TODO | TODO |
| Windows | GPU | TODO | TODO | TODO | TODO | TODO |

**IoT**

| Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (MB) |
| :---- | :---- | :---- | :---- | :---- | :---- | :---- |
| Raspberry Pi 5 16GB | CPU | 133 | 7.6 | 7.8 | 2583 | 1546 |
| Qualcomm IQ-8275 EVK | NPU* | 2,371 | 18.8 | 0.5 | 2688 | 1471 |

\* The NPU model is benchmarked with a context length of 4096 tokens.

## Gemma 4 E2B Performance on Web