---
license: mit
tags:
- coreai
- clip
- apple-silicon
- on-device
---

# CLIP ViT-B/32: Core AI export (official recipe)

fp16 static export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32) via apple/coreai-models' official recipe (`models/clip/export.py`), with one change: text inputs are padded to the full 77-token context (`padding="max_length"`) so free-text queries work, instead of the recipe's 7-token example trace.

Runs out of the box with [CoreAIKit](https://github.com/john-rocky/coreai-kit)'s `ImageTextEncoder`:

```swift
let encoder = try await ImageTextEncoder()   // downloads this repo
let imageVec = try await encoder.encode(image: cgImage)
let textVec = try await encoder.encode(text: "red bike at the beach")
let score = ImageTextEncoder.cosineSimilarity(imageVec, textVec)
```

## Bundle layout

```
model/
├── clip-vit-base-patch32_float16_static.aimodel
└── tokenizer.json
```

## Graph contract

| | name | shape | dtype |
|---|---|---|---|
| input | `pixel_values` | [1, 3, 224, 224] | fp16 |
| input | `input_ids` | [3, 77] | int32 |
| input | `attention_mask` | [3, 77] | int32 |
| output | `image_embeds` | [1, 512] | fp16, L2-normalized |
| output | `text_embeds` | [3, 512] | fp16, L2-normalized |
| output | `logits_per_image` / `logits_per_text` | [1, 3] / [3, 1] | fp16 |

Preprocessing: 224×224 resize + CLIP mean/std normalization (handled by `ImageTextEncoder`).

## Performance

M4 Max: ~3.7 ms per image on the Neural Engine (fp16).

Requires macOS 27 beta / iOS 27 beta on a physical device; the CoreAI framework is not in the iOS Simulator SDK.

## License

Model weights: MIT (OpenAI CLIP); see the upstream repo.
Export recipe: BSD-3-Clause (apple/coreai-models).
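
The graph contract lists `image_embeds` and `text_embeds` as L2-normalized 512-dimensional vectors, so comparing them comes down to a dot product. The sketch below shows one way to score and rank several text embeddings against a single image embedding in plain Swift. It is illustrative only: the `[Float]` representation and the helper names `cosineSimilarity` and `rank` are assumptions, not CoreAIKit's actual types or API (the kit ships its own `ImageTextEncoder.cosineSimilarity`).

```swift
// Hypothetical helpers; embeddings are assumed to be plain [Float] arrays.

/// Cosine similarity between two equal-length vectors.
/// For already L2-normalized inputs this equals the dot product;
/// the norms are still computed so unnormalized inputs also work.
func cosineSimilarity(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    var dot: Float = 0
    var normA: Float = 0
    var normB: Float = 0
    for i in 0..<a.count {
        dot += a[i] * b[i]
        normA += a[i] * a[i]
        normB += b[i] * b[i]
    }
    let denominator = (normA * normB).squareRoot()
    // Return 0 for a zero vector instead of dividing by zero.
    return denominator > 0 ? dot / denominator : 0
}

/// Scores every text embedding against one image embedding and
/// returns (index, score) pairs sorted from best to worst match.
func rank(image: [Float], texts: [[Float]]) -> [(index: Int, score: Float)] {
    var scored: [(index: Int, score: Float)] = []
    for (i, text) in texts.enumerated() {
        scored.append((index: i, score: cosineSimilarity(image, text)))
    }
    return scored.sorted { $0.score > $1.score }
}
```

With three candidate captions, `rank(image:texts:)` would return three pairs, and the first pair's `index` identifies the caption closest to the image.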