LiteRT-LM
marissaw committed on
Commit
bff610f
·
verified ·
1 Parent(s): fe8a431

Update README.md

Files changed (1)
  1. README.md +11 -1
README.md CHANGED
@@ -81,11 +81,21 @@ It uses the Gemma quantization scheme that employs a mixture of 2bit, 4bit and 8
  | Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) |
  | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
  | Raspberry Pi 5 16GB | CPU | 133 | 7.6 | 7.8 | 2583 | 1546 |
- | Qualcomm IQ-8275 EVK | NPU* | 2371 | 18.8 | 0.5 | 2688 | 1471 |
+ | Qualcomm IQ-8275 EVK | NPU* | 2,371 | 18.8 | 0.5 | 2688 | 1471 |
 
  \* NPU model is benchmarked with 4096 context length
 
 
  ## Gemma 4 E2B Performance on Web
 
+ Running Gemma inference on the web is currently supported through [LLM Inference Engine](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js) and uses the *gemma-4-E2B-it-web.task* model file. To try it out, download [the web model](https://huggingface.co/litert-community/gemma-4-E2B-it-litert-lm/blob/main/gemma-4-E2B-it-web.task) and run it with our [sample web page](https://github.com/google-ai-edge/mediapipe-samples/blob/main/examples/llm_inference/js/README.md), or follow the [guide](https://ai.google.dev/edge/mediapipe/solutions/genai/llm_inference/web_js) to add it to your own app.
 
+ Benchmarks were run in Chrome on a 2024 MacBook Pro (Apple M4 Max) with 1024 prefill tokens and 256 decode tokens; the model itself supports context lengths up to 128K.
+
+ | Device | Backend | Prefill (tokens/sec) | Decode (tokens/sec) | <span style="white-space: nowrap;">Time-to-first</span>-token (sec) | Model size (MB) | CPU Memory (MB) | GPU Memory (MB) |
+ | :---- | :---- | :---- | :---- | :---- | :---- | :---- | :---- |
+ | Web | GPU | 4,676 | 73.9 | 1.1 | 1.5 | 1546 | 1.8 |
+
+ <small>\* GPU memory is measured as the "GPU Process" memory for all of Chrome while the model is running. It was 130 MB when inactive, before any model loading took place.
+
+ \* CPU memory is measured for the entire tab while the model is running. It was 55 MB when inactive, before any model loading took place.</small>
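
To relate the web benchmark numbers to each other, here is a minimal sketch of the arithmetic. It assumes that prefill time is roughly the prompt length divided by prefill throughput, and that whatever remains of the reported time-to-first-token is fixed overhead. The README does not state this breakdown. The function names are hypothetical, and the constants are copied from the Web / GPU row and the benchmark setup (1024 prefill tokens, 256 decode tokens).

```python
def prefill_seconds(prompt_tokens: int, prefill_tps: float) -> float:
    """Time to process the prompt at the reported prefill rate (a lower bound)."""
    return prompt_tokens / prefill_tps


def decode_seconds(output_tokens: int, decode_tps: float) -> float:
    """Time to generate the output tokens at the reported decode rate."""
    return output_tokens / decode_tps


# Values from the Web / GPU row and the stated benchmark setup.
PROMPT_TOKENS = 1024
OUTPUT_TOKENS = 256
PREFILL_TPS = 4676.0
DECODE_TPS = 73.9
REPORTED_TTFT = 1.1  # seconds

prefill = prefill_seconds(PROMPT_TOKENS, PREFILL_TPS)
decode = decode_seconds(OUTPUT_TOKENS, DECODE_TPS)

# Assumption: the part of TTFT not explained by prefill throughput is overhead.
overhead = REPORTED_TTFT - prefill

# Assumption: end-to-end latency is TTFT plus steady-state decode time.
total = REPORTED_TTFT + decode

print(f"prefill  ~ {prefill:.2f} s")
print(f"overhead ~ {overhead:.2f} s")
print(f"decode   ~ {decode:.2f} s")
print(f"total    ~ {total:.2f} s")
```

Under these assumptions, prefill accounts for about 0.22 s of the 1.1 s time-to-first-token, and generating 256 tokens adds about 3.5 s. The same calculation on the Raspberry Pi 5 row (1024 / 133 ≈ 7.7 s) lands close to its reported 7.8 s, provided that benchmark also used a 1024-token prompt.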