LiteRT-LM

Model with QAT?

#30
by 4ntoine - opened
LiteRT Community (FKA TFLite) org

The blog post points to the mobile models collection, not to this repo: https://huggingface.co/collections/google/gemma-4-qat-mobile

Can we convert it to .litertlm?

LiteRT Community (FKA TFLite) org

The .litertlm models on this card already use the QAT that is discussed in the blog post. The most popular file, gemma-4-E2B-it.litertlm, uses a mixture of int2, int4 and int8 to keep it small, fast and efficient.
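To get a feel for why a mix of bit widths keeps the file small, here is a rough back-of-the-envelope sketch. The parameter count and the int2/int4/int8 split below are made-up placeholders, not the actual layout of gemma-4-E2B-it.litertlm, and the estimate ignores quantization scales and other metadata.

```python
def avg_bits_per_weight(mix: dict[int, float]) -> float:
    """Weighted average bit width for a {bits: share_of_weights} mix."""
    total = sum(mix.values())
    return sum(bits * share for bits, share in mix.items()) / total


def weight_bytes(num_params: int, mix: dict[int, float]) -> float:
    """Approximate storage for the weights alone (no scales, no metadata)."""
    return num_params * avg_bits_per_weight(mix) / 8


if __name__ == "__main__":
    # Hypothetical split: 25% int2, 50% int4, 25% int8 (NOT the real model's mix).
    example_mix = {2: 0.25, 4: 0.50, 8: 0.25}
    # Hypothetical parameter count, chosen only to make the arithmetic easy.
    example_params = 1_000_000_000
    print(f"average bits per weight: {avg_bits_per_weight(example_mix)}")
    print(f"approx weight size: {weight_bytes(example_params, example_mix) / 1e6:.1f} MB")
```

Swapping in a different split shows how quickly the size moves: pushing more weights to int2 lowers the average, while keeping more in int8 raises it.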

LiteRT Community (FKA TFLite) org

@marissaw Does this mean Gemma 4 E2B without the audio engine can run while consuming less than 1 GB?

https://huggingface.co/developerabu/gemma-4-e2b-text-only-litertlm

I unpacked the model and repacked only the text weights.

LiteRT Community (FKA TFLite) org

Thank you, @developerabu!

@marissaw Does this mean Gemma 4 E2B without the audio engine can run while consuming less than 1 GB?

The audio, vision and drafter models should all be loaded on-demand. When running the benchmarks for this model card, we ran in a text-only mode so none of the optional models should have been loaded. This means that the memory numbers in the model card should reflect what you are asking for.

The answer depends on whether you are running on CPU or GPU, which operating system you are using, which CPU/GPU vendor(s) your device has, and how you define memory usage. For example, the model running on GPU on an S26 Ultra uses only 676 MB as measured by rusage::ru_maxrss. However, other device setups and definitions of memory usage could push the figure above 1 GB. I'd recommend looking at the model card for more information.
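If you want to check the same kind of number on your own machine, here is a minimal sketch that reads ru_maxrss for the current process using Python's standard `resource` module (Unix only). It is not how the model card benchmarks were taken, just one way to read the same counter. Note that ru_maxrss is the peak resident set size of the process, so it will not match other definitions of memory usage, and its unit differs by platform.

```python
import resource
import sys


def peak_rss_bytes() -> int:
    """Peak resident set size of the current process, in bytes."""
    usage = resource.getrusage(resource.RUSAGE_SELF)
    # Linux reports ru_maxrss in kilobytes; macOS reports it in bytes.
    scale = 1 if sys.platform == "darwin" else 1024
    return usage.ru_maxrss * scale


def peak_rss_mib() -> float:
    """Same value expressed in MiB (1 MiB = 1024 * 1024 bytes)."""
    return peak_rss_bytes() / (1024 * 1024)


if __name__ == "__main__":
    # Run your workload (for example, model loading and inference) before this line.
    print(f"peak RSS: {peak_rss_mib():.1f} MiB")
```

Because the value is a high-water mark, calling it once at the end of the run is enough; it never goes down during the life of the process.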

LiteRT Community (FKA TFLite) org

Also, the memory usage depends on the context length you would like to use.
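As a rough illustration of why context length matters, the sketch below estimates the size of a plain key/value cache, which grows linearly with the number of tokens. The layer count, KV head count and head dimension are hypothetical placeholders, not Gemma 4 E2B's real configuration, and models that use sliding-window attention or share KV across layers will need less than this simple estimate.

```python
def kv_cache_bytes(
    context_len: int,
    num_layers: int,
    num_kv_heads: int,
    head_dim: int,
    bytes_per_value: int = 2,  # 2 bytes assumes fp16 cache entries
) -> int:
    """Estimate KV cache size for full attention in every layer."""
    # Leading factor of 2: one tensor for keys and one for values.
    return 2 * num_layers * num_kv_heads * head_dim * context_len * bytes_per_value


if __name__ == "__main__":
    # Hypothetical model shape, for illustration only.
    shape = dict(num_layers=30, num_kv_heads=1, head_dim=256)
    for tokens in (1024, 4096, 16384, 32768):
        mib = kv_cache_bytes(tokens, **shape) / (1024 * 1024)
        print(f"{tokens:>6} tokens: {mib:.0f} MiB of KV cache")
```

Doubling the context length doubles this part of the memory footprint, which is why a figure measured at a short context can understate what a long conversation will use.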
