---
license: gemma
library_name: transformers
pipeline_tag: image-text-to-text
extra_gated_button_content: Acknowledge license
base_model: google/gemma-3n-E4B-it
tags:
- automatic-speech-recognition
- automatic-speech-translation
- audio-text-to-text
- video-text-to-text
- matformer
---
> [!Note]
> This is a submodel derived from `google/gemma-3n-E4B-it`. It has been modified by slicing specific layers and resizing FFN dimensions. It is not the original model.
> To learn more about MatFormers, please review the [launch blog](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide) and generate your own submodels
> with the [MatFormer Lab](https://goo.gle/gemma3n-matformer-lab).
>
> - Skipped layers: []
> - FFN hidden dimensions: [2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8]
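
The FFN hidden dimensions list above is easier to read as runs of consecutive layers. The following is a minimal sketch that rebuilds the list from run lengths and summarizes it; the names `BASE`, `RUNS`, and `ffn_hidden_dims` are illustrative and do not come from the MatFormer Lab or from `transformers`.

```python
# Run-length view of the "FFN hidden dimensions" list above:
# (number of consecutive layers, multiplier applied to the 2_048 base).
# These tuples were read off the list by hand; the names are illustrative.
BASE = 2_048
RUNS = [(15, 4), (10, 8), (5, 4), (5, 8)]

# Expand the runs back into one FFN width per layer.
ffn_hidden_dims = [BASE * mult for count, mult in RUNS for _ in range(count)]

num_layers = len(ffn_hidden_dims)
narrow_layers = ffn_hidden_dims.count(BASE * 4)  # layers with an 8_192-wide FFN
wide_layers = ffn_hidden_dims.count(BASE * 8)    # layers with a 16_384-wide FFN

print(num_layers, narrow_layers, wide_layers)  # 35 20 15
print(sorted(set(ffn_hidden_dims)))            # [8192, 16384]
```

In other words, this submodel keeps all 35 layers (no skipped layers) and uses the narrower FFN width in 20 of them.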
> [!Note]
> This repository corresponds to the launch version of Gemma 3n E4B IT (Instruct), to be used with Hugging Face `transformers`,
> supporting text, audio, and vision (image and video) inputs.
>
> Gemma 3n models have multiple architecture innovations:
> * They are available in two sizes based on [effective parameters](https://ai.google.dev/gemma/docs/gemma-3n#parameters). While the raw parameter count of this model is 8B, the architecture design allows the model to be run with a memory footprint comparable to a traditional 4B model by offloading low-utilization matrices from the accelerator.
> * They use a MatFormer architecture that allows nesting sub-models within the E4B model. We provide one sub-model (an [E2B](https://huggingface.co/google/gemma-3n-E2B-it)), or you can access a spectrum of custom-sized models using the [Mix-and-Match method](https://goo.gle/gemma3n-matformer-lab).
>
> Learn more about these techniques in the [technical blog post](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide)
> and the [Gemma documentation](https://ai.google.dev/gemma/docs/gemma-3n).
# Gemma 3n model card
**Model Page**: [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n)
**Resources and Technical Documentation**:
- [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
- [Gemma on Kaggle](https://www.kaggle.com/models/google/gemma-3n)
- [Gemma on HuggingFace](https://huggingface.co/collections/google/gemma-3n-685065323f5984ef315c93f4)
- [Gemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3n)
**Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)\
**Authors**: Google DeepMind
## Model Information
Summary description and brief definition of inputs and outputs.
### Description
Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
Gemma 3n models are designed for efficient execution on low-resource devices.
They are capable of multimodal input, handling text, image, video, and audio
input, and generating text outputs, with open weights for pre-trained and
instruction-tuned variants. These models were trained with data in over 140
spoken languages.
Gemma 3n models use selective parameter activation technology to reduce resource
requirements. This technique allows the models to operate at an effective size
of 2B and 4B parameters, which is lower than the total number of parameters they
contain. For more information on Gemma 3n's efficient parameter management
technology, see the
[Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n#parameters)
page.
### Inputs and outputs
- **Input:**
    - Text string, such as a question, a prompt, or a document to be
      summarized
    - Images, normalized to 256x256, 512x512, or 768x768 resolution
      and encoded to 256 tokens each
    - Audio data encoded to 6.25 tokens per second from a single channel
    - Total input context of 32K tokens
- **Output:**
    - Generated text in response to the input, such as an answer to a
      question, analysis of image content, or a summary of a document
    - Total output length up to 32K tokens, subtracting the request
      input tokens
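
The figures above can be combined into a rough token budget. This is a back-of-the-envelope sketch, not an exact accounting: it assumes "32K" means 32,768 tokens, rounds audio tokens up, and ignores any special or chat-template tokens the processor adds. The function names are hypothetical helpers, not part of `transformers`.

```python
import math

CONTEXT_TOKENS = 32 * 1024       # assumption: "32K" is 32,768 tokens
TOKENS_PER_IMAGE = 256           # from the input description above
AUDIO_TOKENS_PER_SECOND = 6.25   # from the input description above


def estimate_input_tokens(text_tokens, num_images=0, audio_seconds=0.0):
    """Rough input size in tokens; ignores template and special tokens."""
    audio_tokens = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return text_tokens + num_images * TOKENS_PER_IMAGE + audio_tokens


def remaining_output_tokens(input_tokens):
    """Tokens left for generation once the input is subtracted."""
    return max(CONTEXT_TOKENS - input_tokens, 0)


# Example: a 500-token prompt, two images, and 30 seconds of audio.
used = estimate_input_tokens(text_tokens=500, num_images=2, audio_seconds=30)
print(used)                           # 1200
print(remaining_output_tokens(used))  # 31568
```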
### Usage
The snippets below show how to get started quickly with the model. First,
install the Transformers library. Gemma 3n is supported starting from
transformers 4.53.0.
```sh
$ pip install -U transformers
```
Then, copy the snippet from the section that is relevant for your use case.
#### Running with the `pipeline` API
You can initialize the model and processor for inference with `pipeline` as
follows.
```python
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="pranjal-pravesh/gemma-3n-E3B",
    device="cuda",
    torch_dtype=torch.bfloat16,
)
```
With instruction-tuned models, you need to use chat templates to process your
inputs first. Then, you can pass them to the pipeline.
```python
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
# Okay, let's take a look!
# Based on the image, the animal on the candy is a **turtle**.
# You can see the shell shape and the head and legs.
```
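
The message structure used above (a list of turns, each with a `role` and a list of typed `content` parts) can be wrapped in a small helper when you send many similar requests. This is only a sketch: `build_messages` is a hypothetical convenience function, not part of `transformers`, and it simply reproduces the dictionary layout shown in the example.

```python
def build_messages(question, image_url=None,
                   system_prompt="You are a helpful assistant."):
    """Build a chat message list in the layout used in the example above."""
    user_content = []
    if image_url is not None:
        # Image parts come before the text part, as in the example above.
        user_content.append({"type": "image", "url": image_url})
    user_content.append({"type": "text", "text": question})
    return [
        {"role": "system", "content": [{"type": "text", "text": system_prompt}]},
        {"role": "user", "content": user_content},
    ]


messages = build_messages(
    "What animal is on the candy?",
    image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
)
print([part["type"] for part in messages[1]["content"]])  # ['image', 'text']
```

The resulting `messages` list can then be passed to the pipeline in the same way as the hand-written one.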
#### Running the model on a single GPU
```python
from transformers import AutoProcessor, Gemma3nForConditionalGeneration
import torch

model_id = "pranjal-pravesh/gemma-3n-E3B"

# Load the model in bfloat16 and let `device_map="auto"` place it on the GPU.
model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id, device_map="auto", torch_dtype=torch.bfloat16
).eval()
processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Apply the chat template and tokenize text and image inputs in one step.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    # Keep only the newly generated tokens, dropping the prompt.
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
# **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
# focusing on a cluster of pink cosmos flowers and a busy bumblebee.
# It has a slightly soft, natural feel, likely captured in daylight.
```