SigLIP 2 B/16-256 — Core ML

Core ML conversion of google/siglip2-base-patch16-256, split into separate image and text encoders for on-device text→image retrieval. Built for Palmier Pro's footage search; usable by anything that wants SigLIP 2 on Apple silicon.

Files

File                          Contents
ImageEncoder.mlpackage.zip    Vision tower, 256×256 input, 8-bit palettized (per-grouped-channel)
TextEncoder.mlpackage.zip     Text tower, 64-token input, 8-bit palettized
tokenizer.zip                 Gemma SentencePiece tokenizer files (tokenizer.json, config)
manifest.json                 File names, sha256s, sizes, model dims
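Since manifest.json carries a sha256 for each file, downloads can be checked before unzipping. The following is a minimal sketch, not the project's own tooling: the manifest layout it reads (a "files" list of objects with "name" and "sha256" keys) and the function names are assumptions made for illustration, so adjust them to the actual manifest.

```python
import hashlib
import json
from pathlib import Path


def sha256_of(path, chunk_size=1 << 20):
    """Stream a file through SHA-256 and return the hex digest."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large model archives are never loaded whole.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_downloads(manifest_path, download_dir):
    """Return the names of files whose digest differs from the manifest.

    ASSUMPTION: the manifest looks like
    {"files": [{"name": "...", "sha256": "..."}]}; the real schema may differ.
    """
    manifest = json.loads(Path(manifest_path).read_text())
    mismatched = []
    for entry in manifest["files"]:
        actual = sha256_of(Path(download_dir) / entry["name"])
        if actual != entry["sha256"].lower():
            mismatched.append(entry["name"])
    return mismatched
```

An empty list from verify_downloads means every listed file matched its recorded digest.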

Both encoders emit L2-normalized 768-d embeddings (output name: embedding), so cosine similarity reduces to a plain dot product. Minimum deployment target: macOS 15.
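Because the embeddings are already unit length, ranking images against a text query needs nothing beyond dot products. A minimal sketch in plain Python, using toy 3-d vectors in place of the real 768-d outputs; the function names are illustrative and not part of this repository:

```python
def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def rank_images(text_embedding, image_embeddings, top_k=5):
    """Return up to top_k (image_index, score) pairs, best match first.

    Assumes every vector is already L2-normalized, as the encoders
    here are described to emit, so the dot product is the cosine similarity.
    """
    scored = [(idx, dot(text_embedding, emb))
              for idx, emb in enumerate(image_embeddings)]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy stand-ins for encoder outputs (unit-length 3-d vectors).
query = [1.0, 0.0, 0.0]
images = [
    [0.0, 1.0, 0.0],  # orthogonal to the query
    [1.0, 0.0, 0.0],  # identical to the query
    [0.6, 0.8, 0.0],  # partially aligned
]
ranking = rank_images(query, images, top_k=2)
```

For a large footage library the same scoring is usually done as one matrix-vector product, but the ordering is identical.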

Usage notes

  • Image preprocessing is a squash-resize to 256×256 (aspect ratio is not preserved, no center crop), with pixels scaled to [-1, 1]. The model's ImageType input already applies the scaling, so it should not be applied a second time.
  • Text must be tokenized with the bundled Gemma tokenizer and padded to 64 tokens with the pad token (ID 0); no attention mask is used. SigLIP was trained that way, and embeddings drift if the padding differs.
  • Conversion is parity-gated: for every release, embeddings must match the PyTorch reference at cosine similarity ≥ 0.99 on a fixture set. Conversion source: palmier-io/palmier-pro models/siglip2.
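Two of the notes above lend themselves to a short sketch: fixed-length padding of token IDs, and the cosine parity check. This is an illustration under stated assumptions, not the conversion source itself. The function names are hypothetical, truncating over-long input to 64 tokens is an assumption (the notes only describe padding), and requiring every fixture pair to clear the threshold is one reading of "on a fixture set".

```python
import math

CONTEXT_LENGTH = 64  # text encoder input length stated in this README
PAD_ID = 0           # pad token ID stated in this README


def pad_token_ids(token_ids, length=CONTEXT_LENGTH, pad_id=PAD_ID):
    """Right-pad token IDs to a fixed length; no attention mask is built.

    ASSUMPTION: over-long input is truncated to `length`.
    """
    clipped = list(token_ids)[:length]
    return clipped + [pad_id] * (length - len(clipped))


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms


def passes_parity(converted, reference, threshold=0.99):
    """True if every converted embedding is within the cosine threshold
    of its reference counterpart.

    ASSUMPTION: the gate is applied per fixture, not as an average.
    """
    return all(cosine(c, r) >= threshold for c, r in zip(converted, reference))


padded = pad_token_ids([5, 6, 7])
```

Here padded holds the three original IDs followed by 61 zeros, which is the shape the text encoder expects.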

Versioning

Files in this repo are immutable once published. Re-conversions are published as new versions, never overwrites.

License

Apache 2.0, same as the original weights by Google. This repository redistributes a converted form of those weights without modification to their values beyond 8-bit palettization.
