litert-community
/

gemma-4-E2B-it-litert-lm

LiteRT-LM

marissaw committed on May 5

Commit

b4f4f4d

verified ·

1 Parent(s): 6e5c4f1

Update README.md

README.md +18 -1

@@ -16,7 +16,7 @@ Gemma is a family of lightweight, state-of-the-art open models from Google, buil
 These models are provided in the `.litertlm` format for use with the LiteRT-LM framework. LiteRT-LM is a specialized orchestration layer built directly on top of LiteRT, Google’s high-performance multi-platform runtime trusted by millions of Android and edge developers. LiteRT provides the foundational hardware acceleration via XNNPack for CPU and ML Drift for GPU. LiteRT-LM adds the specialized GenAI libraries and APIs, such as KV-cache management, prompt templating, and function calling. This integrated stack is the same technology powering the Google AI Edge Gallery showcase app.
-The model file size is 2.58 GB, which includes a text decoder with 0.79GB of weights and 1.12GB of embedding parameters. LiteRT-LM framework always keeps main weights in memory, while the embedding parameters are memory mapped which enables significant working memory savings on some platforms as seen in the detailed data below. The vision and audio models are loaded as needed to further reduce memory consumption.
 ## Try Gemma 4 E2B
@@ -49,6 +49,23 @@ CPU memory was measured using, `rusage::ru_maxrss` on Android, Linux and Raspber
 | S26 Ultra | CPU | 557 | 46.9 | 1.8 | 2583 | 1733 |
 | S26 Ultra | GPU | 3,808 | 52.1 | 0.3 | 2583 | 676 |
 **iOS**

 These models are provided in the `.litertlm` format for use with the LiteRT-LM framework. LiteRT-LM is a specialized orchestration layer built directly on top of LiteRT, Google’s high-performance multi-platform runtime trusted by millions of Android and edge developers. LiteRT provides the foundational hardware acceleration via XNNPack for CPU and ML Drift for GPU. LiteRT-LM adds the specialized GenAI libraries and APIs, such as KV-cache management, prompt templating, and function calling. This integrated stack is the same technology powering the Google AI Edge Gallery showcase app.
+The model file size is 2.59 GB, which includes a text decoder with 0.79GB of weights and 1.12GB of embedding parameters. LiteRT-LM framework always keeps main weights in memory, while the embedding parameters are memory mapped which enables significant working memory savings on some platforms as seen in the detailed data below. The vision and audio models are loaded as needed to further reduce memory consumption.
 ## Try Gemma 4 E2B
 | S26 Ultra | CPU | 557 | 46.9 | 1.8 | 2583 | 1733 |
 | S26 Ultra | GPU | 3,808 | 52.1 | 0.3 | 2583 | 676 |
+**🚨 NEW: Android with Speculative Decoding 🚨**
+*The numbers in this section include speculative decoding. Speculative decoding is an optimization that accelerates LLMs by using a small, fast "draft" model to quickly predict multiple upcoming tokens, while a larger "target" model then verifies those tokens in parallel. The effectiveness of speculative decoding is task-dependent because the "draft" model can predict the correct tokens more easily for some tasks than for others. The metrics in this section were collected from a variety of sample prompts and grouped into categories by task type. The baseline measurements are an average across all task types. The number of input and output tokens varied across prompts. Note that if you downloaded this model before May 5, 2026, you should re-download it to use speculative decoding. Speculative decoding is available on CPU and GPU on Mobile and Desktop.*
+| Device &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;| Backend | Task Type | Speculative Decoding? | Decode (tokens/sec) | CPU Memory (MB) |
+| :---- | :---- | :---- | :---- | :---- | :---- |
+| S26 Ultra | CPU | Baseline | No | 40.7 | 1362 |
+| S26 Ultra | CPU | Summarize text | Yes | 47.5 | 1582 |
+| S26 Ultra | CPU | Code snippet | Yes | 36.3 | 1440 |
+| S26 Ultra | CPU | Rewrite tone | Yes | 47.1 | 1408 |
+| S26 Ultra | CPU | Free form | Yes | 38.1 | 1459 |
+| S26 Ultra | GPU | Baseline | No | 51.5 | 791 |
+| S26 Ultra | GPU | Summarize text | Yes | 91.7 | 817 |
+| S26 Ultra | GPU | Code snippet | Yes | 84.4 | 788 |
+| S26 Ultra | GPU | Rewrite tone | Yes | 87.4 | 762 |
+| S26 Ultra | GPU | Free form | Yes | 66.5 | 804 |
 **iOS**
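
Reviewer note: the draft-and-verify loop described in the added README paragraph can be illustrated with a toy sketch. This is not LiteRT-LM code and uses no LiteRT-LM API. `target_next`, `draft_next`, and `speculative_decode` are hypothetical names, the "models" are simple arithmetic rules, and acceptance is plain greedy matching, which is an assumption about the general technique rather than a description of this runtime. The sketch only shows why output stays identical to the target model while the number of target passes drops when the draft guesses well.

```python
from typing import Callable, List, Tuple

Token = int
NextFn = Callable[[List[Token]], Token]


def target_next(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for the large "target" model: a deterministic toy rule.
    return (sum(ctx) * 31 + len(ctx)) % 50


def draft_next(ctx: List[Token]) -> Token:
    # Hypothetical stand-in for the small "draft" model: it agrees with the
    # target except when the context length is a multiple of 4.
    guess = target_next(ctx)
    return (guess + 1) % 50 if len(ctx) % 4 == 0 else guess


def greedy_decode(next_fn: NextFn, prompt: List[Token], n_new: int) -> List[Token]:
    # Ordinary one-token-at-a-time decoding, used here as the reference.
    out = list(prompt)
    for _ in range(n_new):
        out.append(next_fn(out))
    return out


def speculative_decode(
    target: NextFn, draft: NextFn, prompt: List[Token], n_new: int, k: int = 4
) -> Tuple[List[Token], int]:
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1) The draft model proposes up to k tokens autoregressively.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))

        # 2) The target model checks every proposed position. A real runtime
        #    would do this in one batched pass; here it is a plain loop that
        #    we count as a single pass.
        target_passes += 1
        accepted: List[Token] = []
        for tok in proposal:
            expected = target(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                # First mismatch: keep the target's token and drop the rest.
                accepted.append(expected)
                break
        else:
            # Every draft token matched; the target's next token comes for free.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out, target_passes


if __name__ == "__main__":
    prompt = [1, 2, 3]
    reference = greedy_decode(target_next, prompt, 20)
    result, passes = speculative_decode(target_next, draft_next, prompt, 20)
    print("same output as target-only decoding:", result == reference)
    print("target passes:", passes, "for 20 new tokens")
```

In this toy setup the speedup depends entirely on how often the draft matches the target, which mirrors the README's point that the measured gain varies by task type.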