---
license: gemma
license_link: LICENSE
library_name: onnx
pipeline_tag: sentence-similarity
base_model: google/embeddinggemma-300m
base_model_relation: quantized
tags:
- sentence-similarity
- sentence-transformers
- feature-extraction
- onnx
- onnxruntime
- gemma
- embeddinggemma
extra_gated_heading: Access this EmbeddingGemma ONNX derivative
extra_gated_prompt: >-
  This repository contains a Model Derivative of Google's EmbeddingGemma-300M
  and is distributed under the Gemma Terms of Use
  (https://ai.google.dev/gemma/terms) and subject to the Gemma Prohibited Use
  Policy (https://ai.google.dev/gemma/prohibited_use_policy). By requesting
  access you agree to be bound by both documents, which are included verbatim
  in this repository as LICENSE and PROHIBITED_USE_POLICY.md.
extra_gated_button_content: Acknowledge Gemma Terms of Use
---

# EmbeddingGemma-300M (ONNX)

**Modified** from [`google/embeddinggemma-300m`](https://huggingface.co/google/embeddinggemma-300m).
This is a Model Derivative under the Gemma Terms of Use. See `LICENSE`,
`NOTICE`, and `PROHIBITED_USE_POLICY.md` in this repository.

## What this is

ONNX (opset 17) export of EmbeddingGemma-300M with the full
`sentence-transformers` pipeline (transformer backbone, mean pooling, and the
two dense projection layers) baked into a single graph, so one forward pass
yields the **final 768-dim sentence embedding**.

Intended for CPU / GPU inference via ONNX Runtime on resource-constrained
environments (e.g. Jetson Orin, Kubernetes init containers) where pulling in
`torch` + `sentence-transformers` is undesirable.

### Files

| File | Precision | Size | Notes |
|---|---|---|---|
| `model.onnx` | fp32 | ~1.2 GB | Reference export, matches upstream accuracy. |
| `model_int8.onnx` | INT8 (dynamic, per-channel) | ~310 MB | ORT `quantize_dynamic` with `QInt8`, `per_channel=True`. Weights-only; activations stay fp32. |

Sanity check against the reference `SentenceTransformer.encode(...)`
implementation and an independently-served TEI endpoint of the upstream model:

| Comparison | cosine |
|---|---|
| `model.onnx` (fp32) vs upstream `ST.encode` | **1.00000** |
| `model.onnx` (fp32) vs TEI-served reference | **1.00000** |
| `model_int8.onnx` vs upstream `ST.encode` | 0.984 |
| `model_int8.onnx` vs `model.onnx` | 0.984 |

The fp32 export is numerically equivalent to the upstream model. The INT8
variant drifts by ~0.016 cosine; retrieval ranking is preserved, but
**do not mix fp32 and INT8 vectors in the same index**.

> **Note (2026-04-19):** an earlier upload of this repository was exported
> with an older `transformers` version than the one the upstream model
> was saved with, which caused the ONNX graph to silently diverge from
> the reference implementation. That artifact has been replaced. If you
> pulled `model.onnx` before 2026-04-19, re-pull; the SHA-256 of the
> current fp32 artifact is `e7a1688794…`. See `NOTICE` for the full
> revision history.

### I/O

| Name | Kind | Shape | Dtype |
|---|---|---|---|
| `input_ids` | input | `[batch, seq]` | int64 |
| `attention_mask` | input | `[batch, seq]` | int64 |
| `token_embeddings` | output | `[batch, seq, 768]` | float32 |
| `sentence_embedding` | output | `[batch, 768]` | float32 |

Max input length: 2048 tokens. Activations are **not** valid in float16; keep
fp32 (or bf16 on supported hardware) as noted by the upstream model card.
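
A malformed feed is easier to diagnose before it reaches ONNX Runtime. The sketch below checks the two inputs against the I/O table (2-D, int64, matching shapes, at most 2048 tokens). `build_feed` and `MAX_SEQ_LEN` are hypothetical names used only for illustration; they are not part of this repository or of ONNX Runtime.

```python
import numpy as np

MAX_SEQ_LEN = 2048  # max input length stated in the I/O section


def build_feed(input_ids, attention_mask):
    """Validate inputs against the I/O table and return a feed dict.

    Hypothetical helper: the dict keys match the graph's input names,
    everything else here is an illustrative assumption.
    """
    ids = np.asarray(input_ids, dtype=np.int64)
    mask = np.asarray(attention_mask, dtype=np.int64)

    if ids.ndim != 2:
        raise ValueError(f"input_ids must be [batch, seq], got shape {ids.shape}")
    if mask.shape != ids.shape:
        raise ValueError(
            f"attention_mask shape {mask.shape} != input_ids shape {ids.shape}"
        )
    if ids.shape[1] > MAX_SEQ_LEN:
        raise ValueError(f"sequence length {ids.shape[1]} exceeds {MAX_SEQ_LEN}")

    return {"input_ids": ids, "attention_mask": mask}
```

The returned dict can be passed directly as the second argument of `InferenceSession.run`. The token ids used to exercise it can be arbitrary integers, since only shape and dtype are checked.
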
## Usage

```python
import onnxruntime as ort
from transformers import AutoTokenizer
import numpy as np

tok = AutoTokenizer.from_pretrained(".")  # or the HF repo id of this model
sess = ort.InferenceSession("model.onnx", providers=["CPUExecutionProvider"])
# or for the INT8 variant:
# sess = ort.InferenceSession("model_int8.onnx", providers=["CPUExecutionProvider"])

texts = ["Which planet is known as the Red Planet?"]
enc = tok(texts, padding=True, truncation=True, max_length=2048, return_tensors="np")

outputs = sess.run(
    ["sentence_embedding"],
    {"input_ids": enc["input_ids"].astype(np.int64),
     "attention_mask": enc["attention_mask"].astype(np.int64)},
)
emb = outputs[0]  # shape (1, 768)

# L2-normalize if you want cosine similarity; upstream model already outputs
# normalized vectors, but re-normalize after any MRL truncation.
```

For Matryoshka (MRL) truncation to 512 / 256 / 128 dims, slice
`sentence_embedding[:, :D]` and re-normalize.

## How it was produced

See `export.py`. Summary:

```bash
huggingface-cli login            # upstream is gated
uv add optimum optimum-onnx "transformers>=4.57,<5" \
       "sentence-transformers>=5.1" torch onnx onnxruntime
uv run python export.py
# which runs:
#   python -m optimum.exporters.onnx \
#       --model google/embeddinggemma-300m \
#       --library sentence_transformers \
#       --opset 17 ./embeddinggemma-300m-onnx
# then onnxruntime.quantization.quantize_dynamic(
#       weight_type=QInt8, per_channel=True) -> model_int8.onnx
```

**Do not downgrade `transformers` below 4.57.** The model was saved with
`transformers==4.57.0.dev0`; older versions ship an earlier Gemma 3 Text
forward pass that has since changed. Exporting under a downgraded
`transformers` still passes the exporter's own `max diff` check (the
reference PyTorch model is running the same stale forward) but yields ONNX
outputs that disagree with any correctly-configured serving stack by ~0.3
cosine.
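
One way to catch this kind of silent divergence is to compare the ONNX embeddings row by row against embeddings obtained independently, for example from a TEI endpoint. The sketch below is a minimal version of such a check and is not the procedure `export.py` uses. `rowwise_cosine`, `check_export`, and the `min_cosine=0.999` threshold are assumptions chosen for illustration.

```python
import numpy as np


def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two [batch, dim] arrays.

    Assumes no row is the zero vector.
    """
    a = np.asarray(a, dtype=np.float64)
    b = np.asarray(b, dtype=np.float64)
    if a.ndim != 2 or a.shape != b.shape:
        raise ValueError(f"expected matching [batch, dim] arrays, got {a.shape} and {b.shape}")
    dots = np.sum(a * b, axis=1)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return dots / norms


def check_export(onnx_emb, oracle_emb, min_cosine=0.999):
    """Return (passed, worst_cosine) for an export compared with an oracle.

    min_cosine is an illustrative threshold, not a value from this repository.
    """
    worst = float(rowwise_cosine(onnx_emb, oracle_emb).min())
    return worst >= min_cosine, worst
```

Under this threshold the INT8 figures from the sanity-check table (0.984) would be reported as a failure, so a separate, looser threshold would be needed when checking the quantized artifact.
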
Always verify a fresh export against an independent oracle before publishing;
see the sanity-check table above.

The `--library sentence_transformers` flag makes Optimum trace the full SBERT
forward (backbone → pooling → dense → dense), which is why a single
`sentence_embedding` output is emitted.

## License & redistribution

This model is distributed under the **Gemma Terms of Use** (see `LICENSE`).
You **must** comply with §3.1 (Distribution and Redistribution) and the
**Gemma Prohibited Use Policy** (see `PROHIBITED_USE_POLICY.md`) for any
further redistribution or use.

Required on redistribution (quoting §3.1):

1. Propagate the Section 3.2 use restrictions as an enforceable term to your
   downstream users.
2. Provide all third-party recipients a copy of the Gemma Terms of Use.
3. Mark any files you further modify with a prominent modification notice.
4. Include a `NOTICE` text file with: *"Gemma is provided under and subject to
   the Gemma Terms of Use found at ai.google.dev/gemma/terms"*.

This repository ships all four artifacts (`LICENSE`, `NOTICE`,
`PROHIBITED_USE_POLICY.md`, and this `README.md` documenting the
modifications).

## Credits

- Upstream model: Google DeepMind, `google/embeddinggemma-300m`
  ([model card](https://huggingface.co/google/embeddinggemma-300m),
  [paper](https://arxiv.org/abs/2509.20354)).
- Conversion tooling: 🤗 Optimum + PyTorch ONNX exporter.