---
license: apache-2.0
---

# litert-community/gemma-4-E2B-it-litert-lm

Main Model Card: [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)

## Try Gemma 4 E2B

| [Android](https://play.google.com/store/apps/details?id=com.google.ai.edge.gallery&pli=1) | [iOS](https://apps.apple.com/us/app/google-ai-edge-gallery/id6749645337) | [Desktop](https://ai.google.dev/edge/litert-lm/cli) | [IoT](https://ai.google.dev/edge/litert-lm/cli) | [Web](#gemma-4-e2b-performance-on-web) |
| :---: | :---: | :---: | :---: | :---: |

## Build with Gemma 4 E2B and LiteRT-LM

Ready to integrate this into your product? Get started [here](https://ai.google.dev/edge/litert-lm/overview).

## Gemma 4 E2B Performance on LiteRT-LM

All benchmarks were taken using 1024 prefill tokens and 256 decode tokens with a context length of 2048 tokens via LiteRT-LM. The model can support up to 32k context length. CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads. Time-to-first-token does not include load time. Benchmarks were run with caches enabled and initialized; during the first run, latency and memory usage may differ. Model size is the size of the file on disk.

CPU memory was measured using rusage::ru_maxrss on Android, Linux and Raspberry Pi, task_vm_info::phys_footprint on iOS and MacBook, and process_memory_counters::PrivateUsage on Windows.

### Android

Benchmarked on S26 Ultra.

*Note: On [supported Android devices](https://developers.google.com/ml-kit), Gemma 4 is available through Android AI Core as [Gemini Nano](https://developer.android.com/ai/gemini-nano#architecture), which is the recommended path for production applications.*


| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |

### iOS

Benchmarked on iPhone 17 Pro.

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |

### Linux

Benchmarked on NVIDIA GeForce RTX 4090.

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |
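
The methodology for this section reports CPU memory on Linux, Android and Raspberry Pi via `rusage::ru_maxrss`. As a rough illustration of what that counter measures, the sketch below reads the same field for the current process through Python's standard `resource` module. It is not the LiteRT-LM benchmark harness; the helper name `peak_rss_mb` is hypothetical, and treating 1 MB as 1024 * 1024 bytes is an assumption, since the card does not say which definition it uses. The `resource` module is not available on Windows.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Return this process's peak resident set size in MB.

    Hypothetical helper for illustration; assumes 1 MB = 1024 * 1024 bytes.
    """
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    bytes_per_unit = 1 if sys.platform == "darwin" else 1024
    return peak * bytes_per_unit / (1024 * 1024)


if __name__ == "__main__":
    print(f"peak RSS: {peak_rss_mb():.1f} MB")
```

Because `ru_maxrss` is a high-water mark, it reflects the largest resident set reached at any point in the process lifetime, not the footprint at the moment it is read.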

### MacBook

Benchmarked on MacBook Pro M4.

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |

### Windows

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |

### IoT

Raspberry Pi 5 16GB

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |

Qualcomm IQ-8275 EVK

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
| :--- | :--- | :--- | :--- | :--- | :--- | :--- |
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
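
To make the throughput columns in the tables above easier to interpret, here is a minimal sketch of how a tokens/sec figure relates to the benchmark setup of 1024 prefill tokens and 256 decode tokens. It assumes throughput is simply the token count divided by the wall-clock time of that phase, which the card does not state explicitly. The `PhaseTiming` type, the `tokens_per_second` helper and the durations are hypothetical placeholders, not measured results.

```python
from dataclasses import dataclass


@dataclass
class PhaseTiming:
    """Hypothetical record of one benchmark phase."""
    tokens: int     # tokens processed in this phase
    seconds: float  # wall-clock time spent in this phase


def tokens_per_second(timing: PhaseTiming) -> float:
    """Throughput as token count over elapsed time (an assumed definition)."""
    if timing.seconds <= 0:
        raise ValueError("seconds must be positive")
    return timing.tokens / timing.seconds


# Token counts come from the benchmark setup; the durations are
# made-up placeholders, NOT measurements from any device.
prefill = PhaseTiming(tokens=1024, seconds=2.0)
decode = PhaseTiming(tokens=256, seconds=16.0)

# Prefill plus decode must fit in the 2048-token benchmark context.
assert prefill.tokens + decode.tokens <= 2048

print(tokens_per_second(prefill))  # 512.0
print(tokens_per_second(decode))   # 16.0
```

Prefill and decode are reported separately because prefill processes the whole prompt in bulk while decode produces tokens one at a time, so the two rates are not directly comparable.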

## Gemma 4 E2B Performance on Web