---
license: mit
tags:
  - coreai
  - clip
  - apple-silicon
  - on-device
---

# CLIP ViT-B/32 — Core AI export (official recipe)

An fp16 static-shape export of [openai/clip-vit-base-patch32](https://huggingface.co/openai/clip-vit-base-patch32),
produced with the official apple/coreai-models recipe (`models/clip/export.py`). It makes one
change: text inputs are padded to the full 77-token context (`padding="max_length"`) so that
free-text queries work. The unmodified recipe traces the graph with a 7-token example.
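For callers that feed the graph directly rather than going through a wrapper, the padding step described above can be sketched as follows. This is a minimal illustration, not the recipe's or CoreAIKit's code: `padToContext` is a hypothetical helper, and the token ids (49406 for start-of-text, 49407 for end-of-text, which the CLIP tokenizer also uses as its pad token) are assumptions that should be checked against the bundled `tokenizer.json`.

```swift
// Hypothetical helper, not part of CoreAIKit or the export recipe.
// Assumed CLIP tokenizer ids: 49406 = <|startoftext|>, 49407 = <|endoftext|> (also the pad token).
let clipContextLength = 77
let clipEndOfTextID: Int32 = 49407

/// Pads (or truncates) token ids to a fixed length and builds the matching attention mask.
func padToContext(
    _ ids: [Int32],
    length: Int = clipContextLength,
    padID: Int32 = clipEndOfTextID
) -> (inputIDs: [Int32], attentionMask: [Int32]) {
    precondition(length > 0, "length must be positive")
    var kept = Array(ids.prefix(length))
    // If truncation cut off the end-of-text token, restore it in the last slot.
    if ids.count > length {
        kept[length - 1] = clipEndOfTextID
    }
    let padCount = length - kept.count
    let inputIDs = kept + Array(repeating: padID, count: padCount)
    // 1 marks a real token, 0 marks padding.
    let attentionMask = Array(repeating: Int32(1), count: kept.count)
        + Array(repeating: Int32(0), count: padCount)
    return (inputIDs, attentionMask)
}

// Example: a short, already tokenized query (the middle ids are placeholders).
let padded = padToContext([49406, 320, 1125, 49407])
print(padded.inputIDs.count, padded.attentionMask.prefix(6))
```

Padding every query to the same length is what lets a static-shape graph accept arbitrary text: the shape never changes, and the attention mask tells the model which positions carry real tokens.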

Runs out of the box with [CoreAIKit](https://github.com/john-rocky/coreai-kit)'s
`ImageTextEncoder`:

```swift
let encoder = try await ImageTextEncoder()   // downloads this repo
let imageVec = try await encoder.encode(image: cgImage)
let textVec  = try await encoder.encode(text: "red bike at the beach")
let score = ImageTextEncoder.cosineSimilarity(imageVec, textVec)
```

## Bundle layout

```
model/
├── clip-vit-base-patch32_float16_static.aimodel
└── tokenizer.json
```

## Graph contract

| role | name | shape | dtype |
|---|---|---|---|
| input | `pixel_values` | [1, 3, 224, 224] | fp16 |
| input | `input_ids` | [3, 77] | int32 |
| input | `attention_mask` | [3, 77] | int32 |
| output | `image_embeds` | [1, 512] | fp16, L2-normalized |
| output | `text_embeds` | [3, 512] | fp16, L2-normalized |
| output | `logits_per_image` / `logits_per_text` | [1, 3] / [3, 1] | fp16 |
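
Because both embedding outputs are L2-normalized, cosine similarity between an image and a text reduces to a plain dot product. The sketch below ranks the text rows against one image embedding. It assumes the fp16 outputs have already been copied into `[Float]` arrays; `dot` and `bestMatch` are hypothetical helper names, not CoreAIKit API.

```swift
/// Dot product; equals cosine similarity when both vectors are L2-normalized.
func dot(_ a: [Float], _ b: [Float]) -> Float {
    precondition(a.count == b.count, "vectors must have the same length")
    return zip(a, b).reduce(Float(0)) { $0 + $1.0 * $1.1 }
}

/// Returns the index and score of the text embedding most similar to the image embedding,
/// or nil when no text embeddings are given.
func bestMatch(image: [Float], texts: [[Float]]) -> (index: Int, score: Float)? {
    let scores = texts.map { dot(image, $0) }
    guard let best = scores.indices.max(by: { scores[$0] < scores[$1] }) else { return nil }
    return (best, scores[best])
}

// Toy 2-dimensional example; real embeddings have 512 dimensions.
let toyImage: [Float] = [1, 0]
let toyTexts: [[Float]] = [[0, 1], [1, 0], [0.6, 0.8]]
if let match = bestMatch(image: toyImage, texts: toyTexts) {
    print("best text row:", match.index, "score:", match.score)
}
```

In the upstream CLIP model the logits are these same similarities multiplied by a learned temperature, so ranking by dot product should give the same ordering as ranking by `logits_per_image`.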

Preprocessing: resize the image to 224×224, then normalize each channel with CLIP's mean
and std (both steps are handled by `ImageTextEncoder`).
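
For anyone bypassing `ImageTextEncoder`, the normalization step might look like the sketch below. It assumes the image has already been resized to 224×224 and decoded to interleaved 8-bit RGB, and that the graph expects planar channel-first data as the `[1, 3, 224, 224]` shape suggests. `normalizeForCLIP` is a hypothetical helper; the mean and std values are CLIP's published constants. The result is kept as `Float` for portability and would still need converting to fp16 before being passed as `pixel_values`.

```swift
// CLIP's published per-channel normalization constants (RGB order).
let clipMean: [Float] = [0.48145466, 0.4578275, 0.40821073]
let clipStd: [Float] = [0.26862954, 0.26130258, 0.27577711]

/// Converts interleaved 8-bit RGB pixels (already resized to width x height)
/// into planar channel-first floats normalized with CLIP's mean and std.
func normalizeForCLIP(rgb: [UInt8], width: Int = 224, height: Int = 224) -> [Float] {
    let pixelCount = width * height
    precondition(rgb.count == pixelCount * 3, "expected interleaved RGB bytes")
    var out = [Float](repeating: 0, count: pixelCount * 3)
    for p in 0..<pixelCount {
        for c in 0..<3 {
            // Scale to [0, 1], then apply (x - mean) / std for this channel.
            let value = Float(rgb[p * 3 + c]) / 255
            out[c * pixelCount + p] = (value - clipMean[c]) / clipStd[c]
        }
    }
    return out
}

// Example: a uniform mid-gray 224x224 image.
let gray = [UInt8](repeating: 128, count: 224 * 224 * 3)
let tensor = normalizeForCLIP(rgb: gray)
print(tensor.count)  // 3 * 224 * 224 values in channel-first order
```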

## Performance

M4 Max: ~3.7 ms per image on the Neural Engine (fp16). Requires macOS 27 beta or
iOS 27 beta. On iOS a physical device is needed, because the CoreAI framework is not in
the iOS Simulator SDK.

## License

Model weights: MIT (OpenAI CLIP); see the upstream repo. Export recipe:
BSD-3-Clause (apple/coreai-models).