litert-community/gemma-4-E2B-it-litert-lm

LiteRT-LM


marissaw committed on Apr 1

Commit 3f65644 (verified) · 1 Parent(s): df86f75

Update README.md

Files changed (1)

README.md +10 -8

@@ -46,23 +46,23 @@ It uses the Gemma quantization scheme that employs a mixture of 2bit, 4bit and 8
 | Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
-| **S26 Ultra** | CPU | TODO | TODO | TODO | TODO | TODO |
-| **S26 Ultra** | GPU | TODO | TODO | TODO | TODO | TODO |
 **iOS**
 | Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
-| **iPhone 17 Pro** | CPU | TODO | TODO | TODO | TODO | TODO |
-| **iPhone 17 Pro** | GPU | TODO | TODO | TODO | TODO | TODO |
 **Linux**
 | Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
-| **Arm 2.3 & 2.8GHz** | CPU | TODO | TODO | TODO | TODO | TODO |
-| **NVIDIA GeForce RTX 4090** | GPU | TODO | TODO | TODO | TODO | TODO |
 **macOS**
@@ -82,8 +82,10 @@ It uses the Gemma quantization scheme that employs a mixture of 2bit, 4bit and 8
 | Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
-| **Raspberry Pi 5 16GB** | CPU | TODO | TODO | TODO | TODO | TODO |
-| **Qualcomm IQ-8275 EVK** | NPU | TODO | TODO | TODO | TODO | TODO |
 ## Gemma 4 E2B Performance on Web

 | Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| **S26 Ultra** | CPU | 557 | 46.9 | 1.8 | 2583 | 1733 |
+| **S26 Ultra** | GPU | 3,808 | 52.1 | 0.3 | 2583 | 676 |
 **iOS**
 | Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| **iPhone 17 Pro** | CPU | 532 | 25.0 | 1.9 | 2583 | 607 |
+| **iPhone 17 Pro** | GPU | 2,878 | 56.5 | 0.3 | 2583 | 1450 |
 **Linux**
 | Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| **Arm 2.3 & 2.8GHz** | CPU | 260 | 35.0 | 4.0 | 2583 | 1628 |
+| **NVIDIA GeForce RTX 4090** | GPU | 11,234 | 143.4 | 0.1 | 2583 | 913 |
 **macOS**
 | Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
 | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+| **Raspberry Pi 5 16GB** | CPU | 133 | 7.6 | 7.8 | 2583 | 1546 |
+| **Qualcomm IQ-8275 EVK** | NPU* | 2371 | 18.8 | 0.5 | 2688 | 1471 |
+\* NPU model is benchmarked with 4096 context length
 ## Gemma 4 E2B Performance on Web
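
As a rough way to read the rows added in this commit, the sketch below computes GPU-over-CPU throughput ratios for the two phones. The figures are copied from the added rows; the `rows` dictionary and the `gpu_speedup` helper are hypothetical names introduced for illustration and are not part of the README or of LiteRT-LM.

```python
# Prefill and decode throughput (tokens/sec), copied from the rows added in this commit.
# The dictionary layout and the helper name are illustrative assumptions.
rows = {
    "S26 Ultra": {"CPU": (557, 46.9), "GPU": (3808, 52.1)},
    "iPhone 17 Pro": {"CPU": (532, 25.0), "GPU": (2878, 56.5)},
}


def gpu_speedup(device):
    """Return (prefill_ratio, decode_ratio) of GPU over CPU for one device."""
    cpu_prefill, cpu_decode = rows[device]["CPU"]
    gpu_prefill, gpu_decode = rows[device]["GPU"]
    return gpu_prefill / cpu_prefill, gpu_decode / cpu_decode


for device in rows:
    prefill_ratio, decode_ratio = gpu_speedup(device)
    print(f"{device}: prefill x{prefill_ratio:.1f}, decode x{decode_ratio:.1f}")
```

On these numbers the GPU backend helps prefill far more than decode on both phones, which is consistent with the lower time-to-first-token reported for the GPU rows.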