Instructions to use pranjal-pravesh/gemma-3n-E3B with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use pranjal-pravesh/gemma-3n-E3B with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="pranjal-pravesh/gemma-3n-E3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForMultimodalLM

processor = AutoProcessor.from_pretrained("pranjal-pravesh/gemma-3n-E3B")
model = AutoModelForMultimodalLM.from_pretrained("pranjal-pravesh/gemma-3n-E3B")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
	messages,
	add_generation_prompt=True,
	tokenize=True,
	return_dict=True,
	return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use pranjal-pravesh/gemma-3n-E3B with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "pranjal-pravesh/gemma-3n-E3B"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pranjal-pravesh/gemma-3n-E3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'
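
The same request can be sent from Python. The following is a minimal sketch using only the standard library, not an official client: `build_chat_payload` and `post_chat` are hypothetical helper names, and the base URL assumes the server started above is listening on `localhost:8000`.

```python
import json
import urllib.request

MODEL_ID = "pranjal-pravesh/gemma-3n-E3B"


def build_chat_payload(prompt: str, image_url: str, model: str = MODEL_ID) -> dict:
    """Build the same JSON body that the curl example sends."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": prompt},
                    {"type": "image_url", "image_url": {"url": image_url}},
                ],
            }
        ],
    }


def post_chat(payload: dict, base_url: str = "http://localhost:8000") -> dict:
    """POST the payload to /v1/chat/completions and return the parsed JSON reply."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

With a server running, `post_chat(build_chat_payload("Describe this image in one sentence.", image_url))` should return the parsed response. Assuming the usual OpenAI-style response layout, the generated text would be under `["choices"][0]["message"]["content"]`. For the SGLang server, pass `base_url="http://localhost:30000"`.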

Use Docker

docker model run hf.co/pranjal-pravesh/gemma-3n-E3B

SGLang

How to use pranjal-pravesh/gemma-3n-E3B with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "pranjal-pravesh/gemma-3n-E3B" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pranjal-pravesh/gemma-3n-E3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "pranjal-pravesh/gemma-3n-E3B" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "pranjal-pravesh/gemma-3n-E3B",
		"messages": [
			{
				"role": "user",
				"content": [
					{
						"type": "text",
						"text": "Describe this image in one sentence."
					},
					{
						"type": "image_url",
						"image_url": {
							"url": "https://cdn.britannica.com/61/93061-050-99147DCE/Statue-of-Liberty-Island-New-York-Bay.jpg"
						}
					}
				]
			}
		]
	}'

Docker Model Runner
How to use pranjal-pravesh/gemma-3n-E3B with Docker Model Runner:
```
docker model run hf.co/pranjal-pravesh/gemma-3n-E3B
```

	---
	license: gemma
	library_name: transformers
	pipeline_tag: image-text-to-text
	extra_gated_button_content: Acknowledge license
	base_model: google/gemma-3n-E4B-it
	tags:
	- automatic-speech-recognition
	- automatic-speech-translation
	- audio-text-to-text
	- video-text-to-text
	- matformer
	---

	> [!Note]
	> This is a submodel derived from `google/gemma-3n-E4B-it`. It has been modified by slicing specific layers and resizing FFN dimensions. It is not the original model.
	> To learn more about MatFormers, please review the [launch blog](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide) and generate your own submodels
	> with the [MatFormer Lab](https://goo.gle/gemma3n-matformer-lab).

	Skipped layers: []

	FFN hidden dimensions: [2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8]
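
The per-layer list above is easier to read when consecutive equal widths are grouped. The snippet below is only a reading aid: `summarize_runs` is a hypothetical helper, and treating each entry as the FFN width of one decoder layer, in layer order, is an assumption about how the list was exported.

```python
from itertools import groupby

# Copied verbatim from the list above; assumed to hold one FFN width per layer.
FFN_HIDDEN_DIMS = [2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 4, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8, 2_048 * 8]


def summarize_runs(dims):
    """Collapse consecutive equal widths into (first_layer, last_layer, width) runs."""
    runs = []
    start = 0
    for width, group in groupby(dims):
        count = len(list(group))
        runs.append((start, start + count - 1, width))
        start += count
    return runs


print(f"{len(FFN_HIDDEN_DIMS)} layers listed")
for first, last, width in summarize_runs(FFN_HIDDEN_DIMS):
    print(f"layers {first:>2}-{last:>2}: FFN hidden dim {width} ({width // 2_048} x 2048)")
```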


	> [!Note]
	> This repository corresponds to the launch version of Gemma 3n E4B IT (Instruct), to be used with Hugging Face `transformers`,
	> supporting text, audio, and vision (image and video) inputs.
	>
	> Gemma 3n models have multiple architecture innovations:
	> * They are available in two sizes based on [effective parameters](https://ai.google.dev/gemma/docs/gemma-3n#parameters). While the raw parameter count of this model is 8B, the architecture design allows the model to be run with a memory footprint comparable to a traditional 4B model by offloading low-utilization matrices from the accelerator.
	> * They use a MatFormer architecture that allows nesting sub-models within the E4B model. We provide one sub-model (an [E2B](https://huggingface.co/google/gemma-3n-E2B-it)), or you can access a spectrum of custom-sized models using the [Mix-and-Match method](https://goo.gle/gemma3n-matformer-lab).
	>
	> Learn more about these techniques in the [technical blog post](https://developers.googleblog.com/en/introducing-gemma-3n-developer-guide)
	> and the [Gemma documentation](https://ai.google.dev/gemma/docs/gemma-3n).

	# Gemma 3n model card

	Model Page: [Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n)

	Resources and Technical Documentation:

	- [Responsible Generative AI Toolkit](https://ai.google.dev/responsible)
	- [Gemma on Kaggle](https://www.kaggle.com/models/google/gemma-3n)
	- [Gemma on HuggingFace](https://huggingface.co/collections/google/gemma-3n-685065323f5984ef315c93f4)
	- [Gemma on Vertex Model Garden](https://console.cloud.google.com/vertex-ai/publishers/google/model-garden/gemma3n)

	Terms of Use: [Terms](https://ai.google.dev/gemma/terms)\
	Authors: Google DeepMind

	## Model Information

	Summary description and brief definition of inputs and outputs.

	### Description

	Gemma is a family of lightweight, state-of-the-art open models from Google,
	built from the same research and technology used to create the Gemini models.
	Gemma 3n models are designed for efficient execution on low-resource devices.
	They are capable of multimodal input, handling text, image, video, and audio
	input, and generating text outputs, with open weights for pre-trained and
	instruction-tuned variants. These models were trained with data in over 140
	spoken languages.

	Gemma 3n models use selective parameter activation technology to reduce resource
	requirements. This technique allows the models to operate at an effective size
	of 2B and 4B parameters, which is lower than the total number of parameters they
	contain. For more information on Gemma 3n's efficient parameter management
	technology, see the
	[Gemma 3n](https://ai.google.dev/gemma/docs/gemma-3n#parameters)
	page.

	### Inputs and outputs

	- Input:
	    - Text string, such as a question, a prompt, or a document to be
	      summarized
	    - Images, normalized to 256x256, 512x512, or 768x768 resolution
	      and encoded to 256 tokens each
	    - Audio data encoded to 6.25 tokens per second from a single channel
	    - Total input context of 32K tokens
	- Output:
	    - Generated text in response to the input, such as an answer to a
	      question, analysis of image content, or a summary of a document
	    - Total output length up to 32K tokens, subtracting the request
	      input tokens
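
The figures above allow a rough token budget to be worked out before sending a request. This is a back-of-the-envelope sketch, not the model's actual tokenizer accounting. The function names are hypothetical, "32K" is assumed to mean 32 * 1024 tokens, audio tokens are rounded up, and chat-template and special tokens are ignored.

```python
import math

# Figures quoted in the list above; 32K is assumed to mean 32 * 1024 tokens.
CONTEXT_TOKENS = 32 * 1024
TOKENS_PER_IMAGE = 256
AUDIO_TOKENS_PER_SECOND = 6.25


def estimate_input_tokens(text_tokens: int, num_images: int = 0, audio_seconds: float = 0.0) -> int:
    """Estimate prompt size: text tokens plus per-image and per-second audio costs."""
    audio_tokens = math.ceil(audio_seconds * AUDIO_TOKENS_PER_SECOND)
    return text_tokens + num_images * TOKENS_PER_IMAGE + audio_tokens


def remaining_output_budget(input_tokens: int) -> int:
    """Tokens left for generation once the request input is subtracted."""
    return max(CONTEXT_TOKENS - input_tokens, 0)


# 500 text tokens + 2 images (512) + 30 s of audio (187.5, rounded up to 188)
used = estimate_input_tokens(text_tokens=500, num_images=2, audio_seconds=30)
print(used, remaining_output_budget(used))  # 1200 31568
```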

	### Usage

	The snippets below show how to get started quickly with the model. First,
	install the Transformers library. Gemma 3n is supported starting from
	transformers 4.53.0.

	```sh
	$ pip install -U transformers
	```
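
To confirm that an existing environment meets the 4.53.0 minimum, the installed version string can be compared against it. The sketch below uses only the standard library. `release_tuple` and `supports_gemma3n` are hypothetical helpers, and pre-release suffixes such as `.dev0` are ignored, which is a simplification.

```python
import re

MIN_TRANSFORMERS = "4.53.0"


def release_tuple(version: str) -> tuple:
    """Return the first three numeric release components, e.g. '4.54.0.dev0' -> (4, 54, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    parts = [int(part) for part in match.group().split(".")]
    parts += [0] * (3 - len(parts))  # pad short versions such as '4.53'
    return tuple(parts[:3])


def supports_gemma3n(installed: str, minimum: str = MIN_TRANSFORMERS) -> bool:
    """True when the installed release is at least the minimum release."""
    return release_tuple(installed) >= release_tuple(minimum)
```

In practice this would be called as `supports_gemma3n(transformers.__version__)` after `import transformers`.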

	Then, copy the snippet from the section that is relevant for your use case.

	#### Running with the `pipeline` API

	You can initialize the model and processor for inference with `pipeline` as
	follows.

	```python
	from transformers import pipeline
	import torch

	pipe = pipeline(
	    "image-text-to-text",
	    model="pranjal-pravesh/gemma-3n-E3B",
	    device="cuda",
	    torch_dtype=torch.bfloat16,
	)
	```

	Instruction-tuned models expect inputs formatted with a chat template. Build
	the conversation as a list of messages, then pass it to the pipeline.

	```python
	messages = [
	    {
	        "role": "system",
	        "content": [{"type": "text", "text": "You are a helpful assistant."}]
	    },
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
	            {"type": "text", "text": "What animal is on the candy?"}
	        ]
	    }
	]

	output = pipe(text=messages, max_new_tokens=200)
	print(output[0]["generated_text"][-1]["content"])
	# Okay, let's take a look!
	# Based on the image, the animal on the candy is a turtle.
	# You can see the shell shape and the head and legs.
	```

	#### Running the model on a single GPU

	```python
	from transformers import AutoProcessor, Gemma3nForConditionalGeneration
	import torch

	model_id = "pranjal-pravesh/gemma-3n-E3B"

	# Load the weights in bfloat16 and let device_map place them on the available GPU.
	model = Gemma3nForConditionalGeneration.from_pretrained(
	    model_id,
	    device_map="auto",
	    torch_dtype=torch.bfloat16,
	).eval()

	processor = AutoProcessor.from_pretrained(model_id)

	messages = [
	    {
	        "role": "system",
	        "content": [{"type": "text", "text": "You are a helpful assistant."}]
	    },
	    {
	        "role": "user",
	        "content": [
	            {"type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"},
	            {"type": "text", "text": "Describe this image in detail."}
	        ]
	    }
	]

	inputs = processor.apply_chat_template(
	    messages,
	    add_generation_prompt=True,
	    tokenize=True,
	    return_dict=True,
	    return_tensors="pt",
	).to(model.device)

	input_len = inputs["input_ids"].shape[-1]

	# Generate without tracking gradients, then drop the prompt tokens from the output.
	with torch.inference_mode():
	    generation = model.generate(**inputs, max_new_tokens=100, do_sample=False)
	    generation = generation[0][input_len:]

	decoded = processor.decode(generation, skip_special_tokens=True)
	print(decoded)

	# Overall Impression: The image is a close-up shot of a vibrant garden scene,
	# focusing on a cluster of pink cosmos flowers and a busy bumblebee.
	# It has a slightly soft, natural feel, likely captured in daylight.
	```