---
license: apache-2.0
---

# litert-community/gemma-4-E2B-it-litert-lm

Main Model Card: [google/gemma-4-E2B-it](https://huggingface.co/google/gemma-4-E2B-it)

## Try Gemma 4 E2B

## Build with Gemma 4 E2B and LiteRT-LM

Ready to integrate this into your product? Get started [here](https://ai.google.dev/edge/litert-lm/overview).

## Gemma 4 E2B Performance on LiteRT-LM

All benchmarks were taken using 1024 prefill tokens and 256 decode tokens with a context length of 2048 tokens via LiteRT-LM. The model can support up to 32k context length. CPU inference is accelerated via the LiteRT XNNPACK delegate with 4 threads. Time-to-first-token does not include load time. Benchmarks were run with caches enabled and initialized; during the first run, latency and memory usage may differ. Model size is the size of the file on disk.

CPU memory was measured using `rusage::ru_maxrss` on Android, Linux and Raspberry Pi, `task_vm_info::phys_footprint` on iOS and MacBook, and `process_memory_counters::PrivateUsage` on Windows.

### Android

Benchmarked on S26 Ultra.

*Note: On [supported Android devices](https://developers.google.com/ml-kit), Gemma 4 is available through Android AI Core as [Gemini Nano](https://developer.android.com/ai/gemini-nano#architecture), which is the recommended path for production applications.*

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
|---|---|---|---|---|---|---|
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |

### iOS

Benchmarked on iPhone 17 Pro.

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
|---|---|---|---|---|---|---|
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |

### Linux

Benchmarked on NVIDIA GeForce RTX 4090.

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
|---|---|---|---|---|---|---|
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |

### MacBook

Benchmarked on MacBook Pro M4.

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
|---|---|---|---|---|---|---|
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |

### Windows

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
|---|---|---|---|---|---|---|
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
| GPU | TODO | TODO | TODO | TODO | TODO | TODO |

### IoT

Raspberry Pi 5 16GB

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
|---|---|---|---|---|---|---|
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |

Qualcomm IQ-8275 EVK

| Backend | Quantization scheme | Prefill (tokens/sec) | Decode (tokens/sec) | Time-to-first-token (sec) | Model size (MB) | CPU Memory (RSS in MB) |
|---|---|---|---|---|---|---|
| CPU | TODO | TODO | TODO | TODO | TODO | TODO |
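
For readers who want to sanity-check their own runs against the columns in this section, here is a minimal Python sketch of two of the quantities reported: peak memory read from `ru_maxrss` (the counter named in the methodology for Android, Linux and Raspberry Pi) and throughput in tokens per second. It is not the LiteRT-LM benchmark harness. The helper names `peak_rss_mb` and `tokens_per_second` are hypothetical, the conversion assumes 1 MB = 1024 × 1024 bytes, and defining throughput as token count divided by elapsed wall time is an assumption about how the table values are computed.

```python
import resource
import sys


def peak_rss_mb() -> float:
    """Peak resident set size of the current process, in MB.

    Assumption: 1 MB = 1024 * 1024 bytes; the card does not say which
    definition of MB its tables use.
    """
    raw = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is reported in kilobytes on Linux and in bytes on macOS.
    peak_bytes = raw if sys.platform == "darwin" else raw * 1024
    return peak_bytes / (1024 * 1024)


def tokens_per_second(num_tokens: int, elapsed_seconds: float) -> float:
    """Throughput for one phase (prefill or decode).

    Assumption: throughput is simply tokens processed / wall-clock time.
    """
    if elapsed_seconds <= 0:
        raise ValueError("elapsed_seconds must be positive")
    return num_tokens / elapsed_seconds


if __name__ == "__main__":
    # Hypothetical timings for illustration only; these are not measurements.
    prefill_rate = tokens_per_second(1024, 2.0)
    decode_rate = tokens_per_second(256, 8.0)
    print(f"prefill: {prefill_rate:.1f} tokens/sec")
    print(f"decode: {decode_rate:.1f} tokens/sec")
    print(f"peak RSS: {peak_rss_mb():.1f} MB")
```

The `resource` module is only available on Unix-like systems, so this sketch does not cover the Windows `PrivateUsage` counter or the iOS and macOS `phys_footprint` counter mentioned above. It also reports the memory of the Python process that calls it, so it is only meaningful when inference runs in that same process.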

## Gemma 4 E2B Performance on Web