litert-community/gemma-4-E2B-it-litert-lm


marissaw committed on Apr 1

Commit cf2420e · verified · 1 Parent(s): 5d1dc65

Update README.md

Files changed (1)

README.md +1 -1

README.md CHANGED

@@ -35,7 +35,7 @@ Ready to integrate this into your product? Get started [here](https://ai.google.
 All benchmarks were taken using 1024 prefill tokens and 256 decode tokens with a context length of 2048 tokens via LiteRT-LM. The model can support up to 32k context length. The inference on CPU is accelerated via the LiteRT XNNPACK delegate with 4 threads. Time-to-first-token does not include load time. Benchmarks were run with caches enabled and initialized. During the first run, the latency and memory usage may differ. Model size is the size of the file on disk.
-CPU memory was measured using, rusage::ru_maxrss on Android, Linux and Raspberry Pi, task_vm_info::phys_footprint on iOS and MacBook and process_memory_counters::PrivateUsage on Windows.
+CPU memory was measured using, `rusage::ru_maxrss` on Android, Linux and Raspberry Pi, `task_vm_info::phys_footprint` on iOS and MacBook and `process_memory_counters::PrivateUsage` on Windows.
 We use the Gemma quantization scheme that employs a mixture of 2bit, 4bit and 8bit weights.
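On Android, Linux and Raspberry Pi, the `ru_maxrss` counter named in the README is the peak resident set size that `getrusage` reports for a process. The following is a minimal sketch, assuming a Unix host and Python's standard `resource` module (not part of LiteRT-LM), of reading that counter around a workload. It is not how the published benchmarks were collected. `run_workload` is a hypothetical placeholder for an inference run, not a LiteRT-LM API. The iOS/macOS (`phys_footprint`) and Windows (`PrivateUsage`) counters are not covered here.

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Return this process's peak resident set size in bytes.

    Reads rusage.ru_maxrss via getrusage(RUSAGE_SELF). Linux reports the
    value in kilobytes and macOS reports it in bytes, so normalize to bytes.
    """
    maxrss = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    if sys.platform == "darwin":
        return maxrss
    return maxrss * 1024


def run_workload() -> None:
    # HYPOTHETICAL stand-in for an inference run: allocate ~64 MiB and
    # write one byte per 4 KiB page so the pages actually become resident.
    buf = bytearray(64 * 1024 * 1024)
    for i in range(0, len(buf), 4096):
        buf[i] = 1


if __name__ == "__main__":
    before = peak_rss_bytes()
    run_workload()
    after = peak_rss_bytes()
    print(f"peak RSS before: {before / 2**20:.1f} MiB, after: {after / 2**20:.1f} MiB")
```

Because `ru_maxrss` is a high-water mark, it never decreases during a process's lifetime. To attribute a peak to a single run, measure it in a fresh process rather than after earlier work in the same one.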