---
license: gemma
library_name: transformers
pipeline_tag: image-text-to-text
extra_gated_button_content: Acknowledge license
base_model: google/gemma-3n-E4B-it
tags:
- automatic-speech-recognition
- automatic-speech-translation
- audio-text-to-text
- video-text-to-text
- matformer
---
> [!Note]
> This is a submodel derived from `google/gemma-3n-E4B-it`. It has been modified by slicing specific layers and resizing FFN dimensions. It is not the original model.
> To learn more about MatFormers, please review the [launch blog](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide) and generate your own submodels
> with the [MatFormer Lab](https://goo.gle/gemma3n-matformer-lab).

Skipped layers: []

FFN hidden dimensions: [2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8]
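
The per-layer list above is easier to read once consecutive layers with the same width are grouped. The sketch below does that with the values copied from the list; the names `ffn_hidden_dims` and `run_lengths` are illustrative only and are not part of any library or of the MatFormer Lab.

```python
from itertools import groupby

# FFN hidden dimensions per layer, copied from the configuration above.
ffn_hidden_dims = [
    2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4,
    2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4,
    2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4,
    2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8,
    2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8,
    2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4,
    2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8,
]


def run_lengths(dims):
    """Group consecutive equal widths into (width, layer_count) pairs."""
    return [(width, len(list(group))) for width, group in groupby(dims)]


if __name__ == "__main__":
    print(f"layers: {len(ffn_hidden_dims)}")
    for width, count in run_lengths(ffn_hidden_dims):
        print(f"{count:>2} consecutive layers with FFN width {width}")
```

With these values, 20 of the 35 layers use a width of 8,192 and the remaining 15 use 16,384.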

> [!Note]
> This repository corresponds to the launch version of Gemma 3n E4B IT (Instruct), to be used with Hugging Face `transformers`,
> supporting text, audio, and vision (image and video) inputs.
>
> Gemma 3n models have multiple architecture innovations:
> * They are available in two sizes based on [effective parameters](https://ai.google.dev/gemma/docs/gemma-3n#parameters). While the raw parameter count of this model is 8B, the architecture design allows the model to be run with a memory footprint comparable to a traditional 4B model by offloading low-utilization matrices from the accelerator.
> * They use a MatFormer architecture that allows nesting sub-models within the E4B model. We provide one sub-model (an [E2B](https://huggingface.co/google/gemma-3n-E2B-it)), or you can access a spectrum of custom-sized models using the [Mix-and-Match method](https://goo.gle/gemma3n-matformer-lab).
>
> Learn more about these techniques in the [technical blog post](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide)
> and the [Gemma documentation](https://ai.google.dev/gemma/docs/gemma-3n).

# Gemma 3n model card

**Model Page**: [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n)

**Resources and Technical Documentation**:

- [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
- [Gemma on Kaggle](https://www.kaggle.com/models/google/gemma-3n)
- [Gemma on HuggingFace](https://huggingface.co/collections/google/gemma-3n-685065323f5984ef315c93f4)
- [Gemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3n)

**Terms of Use**: [Terms](https://ai.google.dev/gemma/terms)\
**Authors**: Google DeepMind

## Model Information

Summary description and brief definition of inputs and outputs.

### Description

Gemma is a family of lightweight, state-of-the-art open models from Google,
built from the same research and technology used to create the Gemini models.
Gemma 3n models are designed for efficient execution on low-resource devices.
They are capable of multimodal input, handling text, image, video, and audio
input, and generating text outputs, with open weights for pre-trained and
instruction-tuned variants. These models were trained with data in over 140
spoken languages.

Gemma 3n models use selective parameter activation technology to reduce resource
requirements. This technique allows the models to operate at an effective size
of 2B or 4B parameters, which is lower than the total number of parameters they
contain. For more information on Gemma 3n's efficient parameter management
technology, see the
[Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n#parameters)
page.

### Inputs and outputs

- **Input:**
    - Text string, such as a question, a prompt, or a document to be
      summarized
    - Images, normalized to 256x256, 512x512, or 768x768 resolution
      and encoded to 256 tokens each
    - Audio data encoded to 6.25 tokens per second from a single channel
    - Total input context of 32K tokens
- **Output:**
    - Generated text in response to the input, such as an answer to a
      question, analysis of image content, or a summary of a document
    - Total output length up to 32K tokens, subtracting the request
      input tokens
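
The figures above allow a rough estimate of how much of the context a request consumes. The following is a minimal sketch, assuming that "32K" means 32,768 tokens and that audio token counts round up to a whole token; both are assumptions, and the function names are hypothetical helpers rather than part of any library.

```python
import math

IMAGE_TOKENS = 256                # tokens per image, from the list above
AUDIO_TOKENS_PER_SECOND = 6.25    # from the list above
CONTEXT_TOKENS = 32 * 1024        # assumption: "32K" is read as 32,768


def estimate_media_tokens(num_images, audio_seconds):
    """Estimate the tokens used by images and audio in one request."""
    audio_tokens = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return num_images * IMAGE_TOKENS + audio_tokens


def remaining_output_budget(num_images, audio_seconds, text_tokens):
    """Tokens left for generation after subtracting the request input."""
    used = estimate_media_tokens(num_images, audio_seconds) + text_tokens
    return max(CONTEXT_TOKENS - used, 0)


if __name__ == "__main__":
    # Two images, 30 seconds of audio, and a 100-token text prompt.
    print(estimate_media_tokens(2, 30))
    print(remaining_output_budget(2, 30, 100))
```

The actual count depends on the processor and chat template, which add special tokens, so treat the result as an approximation.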

### Usage

The snippets below show how to get started with the model quickly.
First, install the Transformers library. Gemma 3n is supported
starting from transformers 4.53.0.

```sh
$ pip install -U transformers
```

Then, copy the snippet from the section that is relevant for your use case.
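
Before running the snippets, it can help to confirm that the installed version meets the 4.53.0 minimum stated above. This is a small sketch using only the standard library; `parse_version` and `meets_minimum` are hypothetical helpers, and the parsing is deliberately simplified (pre-release suffixes such as `.dev0` or `rc1` are ignored).

```python
from importlib import metadata

MIN_TRANSFORMERS = "4.53.0"  # minimum version stated in this model card


def parse_version(version, width=3):
    """Return the leading numeric components, e.g. '4.53.0.dev0' -> (4, 53, 0)."""
    parts = []
    for piece in version.split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    parts = parts[:width]
    # Pad short versions such as '5.0' so tuples compare component by component.
    return tuple(parts + [0] * (width - len(parts)))


def meets_minimum(installed, minimum=MIN_TRANSFORMERS):
    """True if the installed version is at least the minimum version."""
    return parse_version(installed) >= parse_version(minimum)


if __name__ == "__main__":
    try:
        installed = metadata.version("transformers")
    except metadata.PackageNotFoundError:
        print("transformers is not installed")
    else:
        status = "OK" if meets_minimum(installed) else "too old for Gemma 3n"
        print(f"transformers {installed}: {status}")
```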

#### Running with the `pipeline` API

You can initialize the model and processor for inference with `pipeline` as
follows.

```python
from transformers import pipeline
import torch

pipe = pipeline(
    "image-text-to-text",
    model="pranjal-pravesh/gemma-3n-E3B",
    device="cuda",
    torch_dtype=torch.bfloat16,
)
```

With instruction-tuned models, you need to use chat templates to process your
inputs first. Then, you can pass them to the pipeline.

```python
messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    }
]

output = pipe(text=messages, max_new_tokens=200)
print(output[0]["generated_text"][-1]["content"])
# Okay, let's take a look!
# Based on the image, the animal on the candy is a **turtle**.
# You can see the shell shape and the head and legs.
```
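
When sending many prompts, the message structure shown above can be built by a small helper. The sketch below mirrors the dictionary layout from the example; `build_messages` is a hypothetical convenience function, not part of Transformers.

```python
def build_messages(question, image_url=None,
                   system_prompt="You are a helpful assistant."):
    """Build a chat message list in the layout used by the example above."""
    user_content = []
    if image_url is not None:
        # Image entries come before the text, as in the example above.
        user_content.append({"type": "image", "url": image_url})
    user_content.append({"type": "text", "text": question})

    messages = []
    if system_prompt:
        messages.append({
            "role": "system",
            "content": [{"type": "text", "text": system_prompt}],
        })
    messages.append({"role": "user", "content": user_content})
    return messages


if __name__ == "__main__":
    example = build_messages(
        "What animal is on the candy?",
        image_url="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG",
    )
    for message in example:
        print(message["role"], [part["type"] for part in message["content"]])
```

The returned list can be passed to the pipeline in the same way as the hand-written `messages`, for example `pipe(text=build_messages(...), max_new_tokens=200)`.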

#### Running the model on a single GPU

```python
from transformers import AutoProcessor, Gemma3nForConditionalGeneration
import torch

model_id = "pranjal-pravesh/gemma-3n-E3B"

# Load the model in bfloat16 and let device_map place it on the available GPU.
model = Gemma3nForConditionalGeneration.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
).eval()

processor = AutoProcessor.from_pretrained(model_id)

messages = [
    {
        "role": "system",
        "content": [{"type": "text", "text": "You are a helpful assistant."}]
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

# Apply the chat template and tokenize the text and image inputs together.
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

input_len = inputs["input_ids"].shape[-1]

with torch.inference_mode():
    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
    # Keep only the newly generated tokens by dropping the prompt tokens.
    generation = generation[0][input_len:]

decoded = processor.decode(generation, skip_special_tokens=True)
print(decoded)
# **Overall Impression:** The image is a close-up shot of a vibrant garden scene,
# focusing on a cluster of pink cosmos flowers and a busy bumblebee.
# It has a slightly soft, natural feel, likely captured in daylight.
```
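
The `input_len` slice in the example is needed because the generated sequence starts with the prompt tokens and the new tokens follow. The sketch below illustrates the same slicing on plain Python lists; the token IDs are made up for illustration, and `strip_prompt` is a hypothetical helper.

```python
def strip_prompt(sequence, prompt_len):
    """Return only the tokens that follow the first prompt_len prompt tokens."""
    return sequence[prompt_len:]


if __name__ == "__main__":
    prompt_ids = [2, 105, 2364, 107]   # made-up prompt token IDs
    new_ids = [818, 2471, 106]         # made-up generated token IDs
    full_sequence = prompt_ids + new_ids  # layout of the generated sequence

    print(strip_prompt(full_sequence, len(prompt_ids)))
```

Decoding the full sequence instead would repeat the system prompt and the question ahead of the model's answer.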