How to use litert-community/gemma-4-E2B-it-litert-lm with LiteRT-LM:
# LiteRT-LM runs on various platforms (Android, iOS, Windows, Linux, macOS, IoT, Web/WASM)
# and supports many APIs (C++, Python, Kotlin, Swift, JavaScript, Flutter).
# For platform-specific integration guides, please refer to the official developer website:
# https://ai.google.dev/edge/litert-lm

# To try LiteRT-LM, the easiest way is to use our CLI tool.
# 1. Install the LiteRT-LM CLI tool:
pip install litert-lm

# 2. Download and run this model locally:
# See: https://ai.google.dev/edge/litert-lm/cli
litert-lm run \
  --from-huggingface-repo=litert-community/gemma-4-E2B-it-litert-lm \
  model.litertlm \
  --prompt="Write me a poem"
Update README.md
README.md
CHANGED
@@ -46,23 +46,23 @@ It uses the Gemma quantization scheme that employs a mixture of 2bit, 4bit and 8

 | Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
-| **S26 Ultra** | CPU |
-| **S26 Ultra** | GPU |
+| **S26 Ultra** | CPU | 557 | 46.9 | 1.8 | 2583 | 1733 |
+| **S26 Ultra** | GPU | 3,808 | 52.1 | 0.3 | 2583 | 676 |


 **iOS**

 | Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
-| **iPhone 17 Pro** | CPU |
-| **iPhone 17 Pro** | GPU |
+| **iPhone 17 Pro** | CPU | 532 | 25.0 | 1.9 | 2583 | 607 |
+| **iPhone 17 Pro** | GPU | 2,878 | 56.5 | 0.3 | 2583 | 1450 |

 **Linux**

 | Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
-| **Arm 2.3 & 2.8GHz** | CPU |
-| **NVIDIA GeForce RTX 4090** | GPU |
+| **Arm 2.3 & 2.8GHz** | CPU | 260 | 35.0 | 4.0 | 2583 | 1628 |
+| **NVIDIA GeForce RTX 4090** | GPU | 11,234 | 143.4 | 0.1 | 2583 | 913 |

 **macOS**

@@ -82,8 +82,10 @@ It uses the Gemma quantization scheme that employs a mixture of 2bit, 4bit and 8

 | Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
-| **Raspberry Pi 5 16GB** | CPU |
-| **Qualcomm IQ-8275 EVK** | NPU |
+| **Raspberry Pi 5 16GB** | CPU | 133 | 7.6 | 7.8 | 2583 | 1546 |
+| **Qualcomm IQ-8275 EVK** | NPU* | 2371 | 18.8 | 0.5 | 2688 | 1471 |
+
+\* NPU model is benchmarked with 4096 context length


 ## Gemma 4 E2B Performance on Web
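For readers comparing the rows added in this change, the sketch below estimates a rough end-to-end response time per device as time-to-first-token plus output tokens divided by decode speed. The figures are copied from the added rows. The names `BenchRow` and `estimated_response_seconds` are hypothetical, the formula is only an approximation, and it assumes a prompt comparable to the one used in the benchmark, whose length is not stated in this hunk. Note that the NPU row was measured with a 4096 context length, so it is not directly comparable to the others.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchRow:
    """One benchmark row; field names are illustrative, not part of LiteRT-LM."""
    device: str
    backend: str
    prefill_tps: float  # Prefill (tokens/sec)
    decode_tps: float   # Decode (tokens/sec)
    ttft_s: float       # Time-to-first-token (sec)


# Values copied from the rows added in this diff.
ROWS = [
    BenchRow("S26 Ultra", "CPU", 557, 46.9, 1.8),
    BenchRow("S26 Ultra", "GPU", 3808, 52.1, 0.3),
    BenchRow("iPhone 17 Pro", "CPU", 532, 25.0, 1.9),
    BenchRow("iPhone 17 Pro", "GPU", 2878, 56.5, 0.3),
    BenchRow("Arm 2.3 & 2.8GHz", "CPU", 260, 35.0, 4.0),
    BenchRow("NVIDIA GeForce RTX 4090", "GPU", 11234, 143.4, 0.1),
    BenchRow("Raspberry Pi 5 16GB", "CPU", 133, 7.6, 7.8),
    BenchRow("Qualcomm IQ-8275 EVK", "NPU", 2371, 18.8, 0.5),
]


def estimated_response_seconds(row: BenchRow, output_tokens: int) -> float:
    """Approximate total latency: reported TTFT plus decode time.

    Assumption: the prompt is similar in length to the benchmark prompt,
    so the reported TTFT can be reused unchanged.
    """
    return row.ttft_s + output_tokens / row.decode_tps


if __name__ == "__main__":
    for row in ROWS:
        total = estimated_response_seconds(row, output_tokens=256)
        print(f"{row.device} ({row.backend}): ~{total:.1f} s for 256 output tokens")
```

For short replies the time-to-first-token dominates, while for long replies the decode rate does, so the ranking between two backends on the same device can change with output length.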