---
license: gemma
license_link: LICENSE
library_name: onnx
pipeline_tag: sentence-similarity
base_model: google/embeddinggemma-300m
base_model_relation: quantized
tags:
  - sentence-similarity
  - sentence-transformers
  - feature-extraction
  - onnx
  - onnxruntime
  - gemma
  - embeddinggemma
extra_gated_heading: Access this EmbeddingGemma ONNX derivative
extra_gated_prompt: >-
  This repository contains a Model Derivative of Google's EmbeddingGemma-300M
  and is distributed under the Gemma Terms of Use
  (https://ai.google.dev/gemma/terms) and subject to the Gemma Prohibited Use
  Policy (https://ai.google.dev/gemma/prohibited_use_policy). By requesting
  access you agree to be bound by both documents, which are included verbatim
  in this repository as LICENSE and PROHIBITED_USE_POLICY.md.
extra_gated_button_content: Acknowledge Gemma Terms of Use
---

# EmbeddingGemma-300M — ONNX

**Modified** from [`google/embeddinggemma-300m`](https://huggingface.co/google/embeddinggemma-300m).
This is a Model Derivative under the Gemma Terms of Use. See `LICENSE`,
`NOTICE`, and `PROHIBITED_USE_POLICY.md` in this repository.

## What this is

ONNX (opset 17) export of EmbeddingGemma-300M with the full
`sentence-transformers` pipeline (transformer backbone, mean pooling, and
the two dense projection layers) baked into a single graph, so one forward
pass yields the **final 768-dim sentence embedding**. Intended for CPU / GPU
inference via ONNX Runtime in resource-constrained environments (e.g.
Jetson Orin, Kubernetes init containers) where pulling in `torch` +
`sentence-transformers` is undesirable.

### Files

| File | Precision | Size | Notes |
|---|---|---|---|
| `model.onnx` | fp32 | ~1.2 GB | Reference export, matches upstream accuracy. |
| `model_int8.onnx` | INT8 (dynamic, per-channel) | ~310 MB | ORT `quantize_dynamic` with `QInt8`, `per_channel=True`. Weights-only; activations stay fp32. |

Sanity check against the reference `SentenceTransformer.encode(...)`
implementation and an independently-served TEI endpoint of the upstream
model:

| Comparison | cosine |
|---|---|
| `model.onnx` (fp32) vs upstream `ST.encode` | **1.00000** |
| `model.onnx` (fp32) vs TEI-served reference | **1.00000** |
| `model_int8.onnx` vs upstream `ST.encode` | 0.984 |
| `model_int8.onnx` vs `model.onnx` | 0.984 |

The fp32 export matches the upstream model to five decimal places of cosine
similarity. The INT8 variant drifts by ~0.016 cosine; retrieval ranking is
preserved, but **do not mix fp32 and INT8 vectors in the same index**.
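
A comparison like the one in the table can be reproduced with a few lines of NumPy. The sketch below is an assumption about how such a check might look, not the script that produced the numbers above; `rowwise_cosine` is a hypothetical helper name, and the two arrays stand in for embeddings of the same texts from two different backends.

```python
import numpy as np


def rowwise_cosine(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Cosine similarity between matching rows of two [batch, dim] arrays."""
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError(f"expected two equal [batch, dim] arrays, got {a.shape} and {b.shape}")
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms
```

Passing the `sentence_embedding` outputs of `model.onnx` and `model_int8.onnx` for the same batch of texts gives one cosine value per text; the minimum over a representative sample is usually more informative than the mean.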

> **Note (2026-04-19):** an earlier upload of this repository was exported
> with an older `transformers` version than the one the upstream model
> was saved with, which caused the ONNX graph to silently diverge from
> the reference implementation. That artifact has been replaced. If you
> pulled `model.onnx` before 2026-04-19, re-pull — the SHA-256 of the
> current fp32 artifact is `e7a1688794…`. See `NOTICE` for the full
> revision history.

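To check which artifact is on disk, the file's SHA-256 can be compared against the digest prefix quoted in the note. This is a minimal sketch using only the standard library; `sha256_of` and `matches_prefix` are hypothetical helper names, and only the hex digits visible in the note are assumed to be known.

```python
import hashlib
from pathlib import Path


def sha256_of(path: str, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 of a file, read in chunks to bound memory use."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_prefix(path: str, expected_prefix: str) -> bool:
    """True if the file's SHA-256 starts with the given (possibly truncated) hex prefix."""
    return sha256_of(path).startswith(expected_prefix.lower())
```

For example, `matches_prefix("model.onnx", "e7a1688794")` tests against the ten digits shown above. A prefix match is weaker than a full-digest match, so prefer the complete hash if `NOTICE` lists it.
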
### I/O

| Name | Kind | Shape | Dtype |
|---|---|---|---|
| `input_ids` | input | `[batch, seq]` | int64 |
| `attention_mask` | input | `[batch, seq]` | int64 |
| `token_embeddings` | output | `[batch, seq, 768]` | float32 |
| `sentence_embedding` | output | `[batch, 768]` | float32 |

Max input length: 2048 tokens. The model's activations do **not** support
float16; keep fp32 (or bf16 on supported hardware), as noted by the upstream
model card.
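
When inputs come from something other than the Hugging Face tokenizer, it can help to validate them against the table before calling the session. The helper below is an illustrative sketch; `make_feed` and `MAX_TOKENS` are hypothetical names, and the checks simply mirror the shapes, dtypes, and length limit stated above.

```python
import numpy as np

MAX_TOKENS = 2048  # max input length stated in this model card


def make_feed(input_ids, attention_mask) -> dict:
    """Build the ONNX Runtime input dict, checking it against the I/O table."""
    ids = np.asarray(input_ids).astype(np.int64)
    mask = np.asarray(attention_mask).astype(np.int64)
    if ids.ndim != 2 or ids.shape != mask.shape:
        raise ValueError(f"expected matching [batch, seq] arrays, got {ids.shape} and {mask.shape}")
    if ids.shape[1] > MAX_TOKENS:
        raise ValueError(f"sequence length {ids.shape[1]} exceeds {MAX_TOKENS}")
    return {"input_ids": ids, "attention_mask": mask}
```

The returned dict can be passed directly as the second argument of `InferenceSession.run`.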

## Usage

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tok = AutoTokenizer.from_pretrained(".")  # or the HF repo id of this model
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
# or for the INT8 variant:
# sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])

texts = ["Which planet is known as the Red Planet?"]
enc = tok(texts, padding=True, truncation=True, max_length=2048, return_tensors="np")
outputs = sess.run(
    ["sentence_embedding"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)
emb = outputs[0]  # shape (1, 768)
# L2-normalize if you want cosine similarity; upstream model already outputs
# normalized vectors, but re-normalize after any MRL truncation.
```

For Matryoshka (MRL) truncation to 512 / 256 / 128 dims, slice
`sentence_embedding[:, :D]` and re-normalize.
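
The slice-and-renormalize step can be sketched as follows. `truncate_mrl` is a hypothetical helper name; it assumes `emb` is the `[batch, 768]` array returned by the session.

```python
import numpy as np


def truncate_mrl(emb: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` components of each row and L2-normalize the result."""
    if not 0 < dim <= emb.shape[-1]:
        raise ValueError(f"dim must be in 1..{emb.shape[-1]}, got {dim}")
    cut = emb[:, :dim].astype(np.float32)
    norms = np.linalg.norm(cut, axis=1, keepdims=True)
    # Guard against division by zero for an all-zero row (not expected in practice).
    return cut / np.maximum(norms, 1e-12)
```

For example, `truncate_mrl(emb, 256)` returns unit-length 256-dim vectors. Use the same `dim` for every vector stored in a given index.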

## How it was produced

See `export.py`. Summary:

```bash
huggingface-cli login  # upstream is gated
uv add optimum optimum-onnx "transformers>=4.57,<5" \
       "sentence-transformers>=5.1" torch onnx onnxruntime
uv run python export.py
# which runs:
#   python -m optimum.exporters.onnx \
#       --model google/embeddinggemma-300m \
#       --library sentence_transformers \
#       --opset 17 ./embeddinggemma-300m-onnx
# then onnxruntime.quantization.quantize_dynamic(
#       weight_type=QInt8, per_channel=True) -> model_int8.onnx
```
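
The quantization step in the comment above corresponds roughly to the Python call below. This is a sketch of that one step, not the contents of `export.py`; `quantize_to_int8` is a hypothetical wrapper, and the default file names are assumed from the Files table.

```python
def quantize_to_int8(src: str = "model.onnx", dst: str = "model_int8.onnx") -> None:
    """Write a dynamically quantized INT8 copy of `src` to `dst`."""
    # Imported lazily so that defining this helper does not require onnxruntime.
    from onnxruntime.quantization import QuantType, quantize_dynamic

    quantize_dynamic(
        model_input=src,
        model_output=dst,
        weight_type=QuantType.QInt8,  # signed 8-bit weights
        per_channel=True,             # quantize weights per channel, not per tensor
    )
```

Dynamic quantization needs no calibration data, since only the weights are stored as INT8.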

**Do not downgrade `transformers` below 4.57.** The model was saved with
`transformers==4.57.0.dev0`; older versions ship an earlier Gemma 3 Text
forward pass that has since changed. Exporting under a downgraded
`transformers` still passes the exporter's own `max diff` check (the
reference PyTorch model is running the same stale forward) but yields
ONNX outputs that disagree with any correctly-configured serving stack
by ~0.3 cosine. Always verify a fresh export against an independent
oracle before publishing — see the sanity-check table above.
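
One way to make the version requirement harder to miss is a guard at the top of the export script. The check below is an illustrative sketch, not part of `export.py`; `transformers_version_ok` is a hypothetical name, and it encodes the `>=4.57,<5` pin from the install command above.

```python
import re


def transformers_version_ok(version: str) -> bool:
    """True if `version` falls in the 4.57 to 4.x range pinned for the export."""
    match = re.match(r"(\d+)\.(\d+)", version)
    if match is None:
        raise ValueError(f"unrecognized version string: {version!r}")
    major, minor = int(match.group(1)), int(match.group(2))
    # Only major.minor is compared, so dev builds such as 4.57.0.dev0 pass.
    return major == 4 and minor >= 57
```

Calling it with `transformers.__version__` and aborting on `False` catches a downgraded environment before any export runs. It does not replace the comparison against an independent reference.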

The `--library sentence_transformers` flag makes Optimum trace the full
SBERT forward (backbone → pooling → dense → dense), which is why the graph
emits the final `sentence_embedding` directly, alongside the per-token
`token_embeddings` listed in the I/O table.

## License & redistribution

This model is distributed under the **Gemma Terms of Use**
(see `LICENSE`). You **must** comply with §3.1 (Distribution and
Redistribution) and the **Gemma Prohibited Use Policy** (see
`PROHIBITED_USE_POLICY.md`) for any further redistribution or use.

Required on redistribution (summarizing §3.1):
1. Propagate the Section 3.2 use restrictions as an enforceable term to
   your downstream users.
2. Provide all third-party recipients a copy of the Gemma Terms of Use.
3. Mark any files you further modify with a prominent modification notice.
4. Include a `NOTICE` text file with: *"Gemma is provided under and subject
   to the Gemma Terms of Use found at ai.google.dev/gemma/terms"*.

This repository ships all four artifacts (`LICENSE`, `NOTICE`,
`PROHIBITED_USE_POLICY.md`, and this `README.md` documenting the
modifications).

## Credits

- Upstream model: Google DeepMind — `google/embeddinggemma-300m`
  ([model card](https://huggingface.co/google/embeddinggemma-300m),
  [paper](https://arxiv.org/abs/2509.20354)).
- Conversion tooling: 🤗 Optimum + PyTorch ONNX exporter.