FayssalJ and Claude Opus 4.5 committed
Commit 76fca23 · 1 Parent(s): ec0ee7c

Initial setup: Visual Search with Jina CLIP v2

- indexer/: Local indexing script using Jina CLIP v2
- hf-space/: HuggingFace Space app for search API
- CLAUDE.md: Project documentation

Architecture:
- Local model for indexing (free, no API costs)
- HF Space with ZeroGPU for search (free)
- Pinecone for vector storage (free tier)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

CLAUDE.md ADDED
@@ -0,0 +1,76 @@
+ # Visual Search Project
+
+ ## Overview
+ AI-powered visual product search for Shopify stores using Jina CLIP v2 embeddings.
+
+ ## Architecture
+
+ ```
+ ┌─────────────────────────────────────────────────────────────┐
+ │  INDEXING (Local, one-time)                                 │
+ │  Local Jina CLIP v2 → embeddings → Pinecone                 │
+ └─────────────────────────────────────────────────────────────┘
+
+ ┌─────────────────────────────────────────────────────────────┐
+ │  SEARCH (HuggingFace Space, free)                           │
+ │  User image → HF Space (Jina CLIP v2) → Pinecone → Results  │
+ └─────────────────────────────────────────────────────────────┘
+ ```
+
+ ## Components
+
+ | Component | Location | Purpose |
+ |-----------|----------|---------|
+ | `indexer/` | Local script | Index products to Pinecone |
+ | `hf-space/` | HuggingFace Space | Search API endpoint |
+ | `shopify/` | Theme integration | Frontend UI |
+
+ ## Tech Stack
+ - **Model**: Jina CLIP v2 (jinaai/jina-clip-v2)
+ - **Vector DB**: Pinecone (free tier)
+ - **Search API**: HuggingFace Spaces (ZeroGPU, free)
+ - **Frontend**: Shopify theme integration
+
+ ## Environment Variables
+
+ ### Indexer (.env)
+ ```
+ SHOPIFY_STORE=25c0da-4
+ SHOPIFY_ADMIN_TOKEN=shpat_xxxxx
+ PINECONE_API_KEY=xxxxx
+ PINECONE_HOST=xxxxx.pinecone.io
+ ```
+
+ ### HF Space (Secrets)
+ ```
+ PINECONE_API_KEY=xxxxx
+ PINECONE_HOST=xxxxx.pinecone.io
+ ```
+
+ ## Pinecone Index
+ - **Name**: products (or shopify-llm)
+ - **Dimensions**: 512
+ - **Metric**: cosine
+
+ ## Future Plans
+ - Sales pattern analysis using visual embeddings
+ - Cluster similar products → correlate with sales
+ - Predict new product performance
+
+ ## Commands
+
+ ```bash
+ # Index products (run locally)
+ cd indexer
+ pip install -r requirements.txt
+ python index.py --clear
+
+ # Deploy HF Space
+ cd hf-space
+ # Push to HuggingFace
+ ```
+
+ ## Related
+ - Theme repo: Kuwait-v6
+ - Store: https://25c0da-4.myshopify.com
+ - Store domain: https://alnasser.net
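
Review note: the Pinecone index settings in CLAUDE.md (512 dimensions, cosine metric) mean every vector sent to the index must have exactly 512 values. Below is a minimal sketch of one way to prepare such a vector, assuming the model returns a longer embedding that may be truncated. The helper names `prepare_vector` and `cosine_similarity` and the 1024-value stand-in input are illustrative assumptions, not part of this commit.

```python
import math


def prepare_vector(raw, dim=512):
    """Truncate an embedding to `dim` values, then L2-normalize it.

    Hypothetical helper; dim=512 mirrors the index settings in CLAUDE.md.
    """
    values = [float(x) for x in raw][:dim]
    norm = math.sqrt(sum(x * x for x in values))
    if norm == 0.0:
        raise ValueError("cannot normalize an all-zero embedding")
    return [x / norm for x in values]


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Stand-in for a model output (length 1024 is an assumption for this example).
raw = [math.sin(i) for i in range(1024)]
vec = prepare_vector(raw)
print(len(vec), round(math.sqrt(sum(x * x for x in vec)), 6))
```

Truncating before normalizing keeps the stored vector at unit length. Cosine similarity ignores vector scale, so rankings would not change either way, but unit vectors are easier to sanity-check.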
hf-space/README.md ADDED
@@ -0,0 +1,43 @@
+ ---
+ title: Visual Product Search
+ emoji: 🔍
+ colorFrom: blue
+ colorTo: purple
+ sdk: gradio
+ sdk_version: 4.44.0
+ app_file: app.py
+ pinned: false
+ license: mit
+ ---
+
+ # Visual Product Search API
+
+ AI-powered visual search using **Jina CLIP v2** embeddings.
+
+ ## Features
+ - Upload an image to find visually similar products
+ - Uses Jina CLIP v2 for state-of-the-art image embeddings
+ - Queries Pinecone vector database for similarity search
+
+ ## API Usage
+
+ ```python
+ from gradio_client import Client, handle_file
+
+ client = Client("YOUR_USERNAME/visual-search")
+ result = client.predict(
+     handle_file("path/to/image.jpg"),  # wrap local paths or URLs for upload
+     api_name="/search_simple"  # Blocks endpoints default to the handler's function name
+ )
+ print(result)
+ ```
+
+ ## Setup
+
+ Set these secrets in HuggingFace Space settings:
+ - `PINECONE_API_KEY`: Your Pinecone API key
+ - `PINECONE_HOST`: Your Pinecone index host (without https://)
+
+ ## Model
+
+ Uses [jinaai/jina-clip-v2](https://huggingface.co/jinaai/jina-clip-v2) - a multilingual multimodal embedding model.
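
Review note: the Space's button handler (`search_simple` in `app.py`) returns one text line per product in the form `rank. title (handle) - score`. A client that wants structured data could parse that text; the sketch below shows one possible approach. The function `parse_results` and the sample text are made up for illustration and are not part of this commit.

```python
import re

# One result line looks like: "1. Some Title (some-handle) - 0.91"
LINE_RE = re.compile(r"^(\d+)\.\s(.*)\s\((.*)\)\s-\s(-?\d+\.\d{2})$")


def parse_results(text):
    """Turn the Space's text output into a list of dicts (hypothetical helper)."""
    records = []
    for line in text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue  # skips messages such as "No similar products found"
        rank, title, handle, score = match.groups()
        records.append({
            "rank": int(rank),
            "title": title,
            "handle": handle,
            "score": float(score),
        })
    return records


# Made-up sample output, used only to exercise the parser.
sample = (
    "1. Red Sneakers (red-sneakers) - 0.91\n"
    "2. Canvas Shoe (Low) (canvas-shoe-low) - 0.84"
)
parsed = parse_results(sample)
print(parsed)
```

The greedy title group lets titles contain parentheses, since the handle is always the last parenthesized part of the line.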
hf-space/app.py ADDED
@@ -0,0 +1,163 @@
+ """
+ Visual Search API - HuggingFace Space
+
+ Provides image embedding endpoint using Jina CLIP v2.
+ Queries Pinecone for similar products.
+
+ Deploy to HuggingFace Spaces with ZeroGPU (free).
+ """
+
+ import os
+ import gradio as gr
+ import torch
+ import numpy as np
+ from PIL import Image
+
+ # Pinecone config from HF Secrets
+ PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
+ PINECONE_HOST = os.environ.get('PINECONE_HOST')
+
+ # Model (loaded on first use)
+ model = None
+
+
+ def load_model():
+     """Load Jina CLIP v2 model."""
+     global model
+     if model is None:
+         print("Loading Jina CLIP v2...")
+         from transformers import AutoModel
+         model = AutoModel.from_pretrained(
+             "jinaai/jina-clip-v2",
+             trust_remote_code=True
+         )
+         if torch.cuda.is_available():
+             model = model.cuda()
+         model.eval()
+         print("Model loaded!")
+     return model
+
+
+ def get_embedding(image: Image.Image) -> list:
+     """Generate 512-dim embedding for an image."""
+     m = load_model()
+
+     with torch.no_grad():
+         emb = m.encode_image(image)
+         if hasattr(emb, 'cpu'):
+             emb = emb.cpu().numpy()
+         emb = emb.flatten()
+         # Truncate first so the returned vector is still unit-length
+         if len(emb) > 512:
+             emb = emb[:512]
+         emb = emb / np.linalg.norm(emb)  # L2 normalize
+     return emb.tolist()
+
+
+ def query_pinecone(embedding: list, top_k: int = 12) -> list:
+     """Query Pinecone for similar products."""
+     if not PINECONE_API_KEY or not PINECONE_HOST:
+         return []
+
+     import requests
+
+     resp = requests.post(
+         f"https://{PINECONE_HOST}/query",
+         headers={
+             "Api-Key": PINECONE_API_KEY,
+             "Content-Type": "application/json"
+         },
+         json={
+             "vector": embedding,
+             "topK": top_k,
+             "includeMetadata": True
+         },
+         timeout=15
+     )
+
+     if resp.status_code != 200:
+         return []
+
+     matches = resp.json().get('matches', [])
+     return [
+         {
+             'handle': m.get('metadata', {}).get('handle', m.get('id')),
+             'title': m.get('metadata', {}).get('title', ''),
+             'score': m.get('score', 0),
+             'image_url': m.get('metadata', {}).get('image_url', '')
+         }
+         for m in matches
+     ]
+
+
+ def search(image: Image.Image) -> dict:
+     """
+     Main search function.
+     Returns embedding and similar products.
+     """
+     if image is None:
+         return {"error": "No image provided"}
+
+     # Get embedding
+     embedding = get_embedding(image)
+
+     # Query Pinecone
+     products = query_pinecone(embedding)
+
+     return {
+         "embedding": embedding,
+         "products": products
+     }
+
+
+ def search_simple(image: Image.Image) -> str:
+     """Simple search returning product handles."""
+     if image is None:
+         return "No image"
+
+     embedding = get_embedding(image)
+     products = query_pinecone(embedding)
+
+     if not products:
+         return "No similar products found"
+
+     return "\n".join([
+         f"{i+1}. {p['title']} ({p['handle']}) - {p['score']:.2f}"
+         for i, p in enumerate(products)
+     ])
+
+
+ # Gradio Interface
+ with gr.Blocks(title="Visual Search API") as demo:
+     gr.Markdown("# Visual Product Search")
+     gr.Markdown("Upload an image to find similar products.")
+
+     with gr.Row():
+         with gr.Column():
+             image_input = gr.Image(type="pil", label="Upload Image")
+             search_btn = gr.Button("Search", variant="primary")
+
+         with gr.Column():
+             output = gr.Textbox(label="Results", lines=15)
+
+     # With no api_name given, the endpoint is named after the function: /search_simple
+     search_btn.click(
+         fn=search_simple,
+         inputs=[image_input],
+         outputs=[output]
+     )
+
+     gr.Markdown("---")
+     gr.Markdown("### API Endpoint")
+     gr.Markdown("""
+     Use the `/search_simple` endpoint for programmatic access:
+
+     ```python
+     from gradio_client import Client, handle_file
+
+     client = Client("YOUR_SPACE_URL")
+     result = client.predict(handle_file(image_path), api_name="/search_simple")
+     ```
+     """)
+
+
+ if __name__ == "__main__":
+     demo.launch()
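
Review note: the response-mapping step inside `query_pinecone` can be tested without network access if it is pulled into a pure function. The sketch below is one possible refactor, not code from this commit. The name `format_matches` and the sample response are assumptions, and the field names simply follow what `app.py` already reads.

```python
def format_matches(response_json):
    """Map a Pinecone-style query response to the product dicts used by the app.

    Hypothetical helper mirroring the list comprehension in query_pinecone.
    """
    products = []
    for match in response_json.get("matches", []):
        metadata = match.get("metadata", {})
        products.append({
            # Fall back to the vector id when no handle was stored
            "handle": metadata.get("handle", match.get("id")),
            "title": metadata.get("title", ""),
            "score": match.get("score", 0),
            "image_url": metadata.get("image_url", ""),
        })
    return products


# Made-up response used only to exercise the function.
sample_response = {
    "matches": [
        {
            "id": "101",
            "score": 0.92,
            "metadata": {
                "handle": "red-sneakers",
                "title": "Red Sneakers",
                "image_url": "https://example.com/a.jpg",
            },
        },
        {"id": "102", "score": 0.75},  # no metadata at all
    ]
}
products = format_matches(sample_response)
print(products)
```

Splitting the mapping out this way would also let `query_pinecone` report HTTP failures separately from empty result sets, which the current code treats the same.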
hf-space/requirements.txt ADDED
@@ -0,0 +1,8 @@
+ torch
+ transformers
+ pillow
+ numpy
+ requests
+ einops
+ timm
+ gradio
indexer/.env.example ADDED
@@ -0,0 +1,7 @@
+ # Shopify Store
+ SHOPIFY_STORE=25c0da-4
+ SHOPIFY_ADMIN_TOKEN=shpat_xxxxx
+
+ # Pinecone
+ PINECONE_API_KEY=xxxxx
+ PINECONE_HOST=xxxxx.pinecone.io
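
Review note: `indexer/index.py` reads this file with a small hand-written parser (`load_env`) rather than a dotenv library. The sketch below restates that parsing logic as a pure function so its behavior on comments, blank lines, and quoted values is easy to check. The name `parse_env` and the sample text are illustrative assumptions.

```python
def parse_env(text):
    """Parse KEY=VALUE lines, skipping blanks and '#' comments.

    Hypothetical pure-function version of load_env() from indexer/index.py;
    it returns a dict instead of writing to os.environ.
    """
    env = {}
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith('#') and '=' in line:
            key, value = line.split('=', 1)  # split once so values may contain '='
            env[key.strip()] = value.strip().strip('"\'')
    return env


# Sample text modeled on .env.example; the quoted token is added for illustration.
sample = """
# Shopify Store
SHOPIFY_STORE=25c0da-4
SHOPIFY_ADMIN_TOKEN="shpat_xxxxx"

# Pinecone
PINECONE_HOST=xxxxx.pinecone.io
"""
config = parse_env(sample)
print(config)
```

This style of parser does not handle inline comments after a value or `export` prefixes, so values in `.env` should stay in the plain `KEY=VALUE` form shown in `.env.example`.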
indexer/.gitignore ADDED
@@ -0,0 +1,6 @@
+ .env
+ *.log
+ __pycache__/
+ *.pyc
+ venv/
+ .venv/
indexer/index.py ADDED
@@ -0,0 +1,287 @@
+ #!/usr/bin/env python3
+ """
+ Visual Search Product Indexer
+
+ Indexes Shopify products into Pinecone using local Jina CLIP v2 model.
+ Uses the SAME model as the HF Space search endpoint for compatible embeddings.
+
+ Usage:
+     python index.py              # Index all products
+     python index.py --limit 10   # Test with 10 products
+     python index.py --clear      # Clear index first
+     python index.py --dry-run    # Test without uploading
+ """
+
+ import os
+ import sys
+ import argparse
+ import time
+ from io import BytesIO
+ from pathlib import Path
+
+ try:
+     import torch
+     from PIL import Image
+     import requests
+     from tqdm import tqdm
+     from pinecone import Pinecone
+ except ImportError as e:
+     print(f"Missing package: {e}")
+     print("Run: pip install -r requirements.txt")
+     sys.exit(1)
+
+
+ def load_env():
+     """Load .env file."""
+     env_path = Path(__file__).parent / '.env'
+     if env_path.exists():
+         print(f"Loading {env_path}")
+         for line in env_path.read_text().splitlines():
+             line = line.strip()
+             if line and not line.startswith('#') and '=' in line:
+                 key, value = line.split('=', 1)
+                 os.environ[key.strip()] = value.strip().strip('"\'')
+
+
+ load_env()
+
+ # Config
+ SHOPIFY_STORE = os.environ.get('SHOPIFY_STORE', '25c0da-4')
+ SHOPIFY_ADMIN_TOKEN = os.environ.get('SHOPIFY_ADMIN_TOKEN')
+ PINECONE_API_KEY = os.environ.get('PINECONE_API_KEY')
+ PINECONE_HOST = os.environ.get('PINECONE_HOST')
+ API_VERSION = "2024-01"
+
+ # Model (loaded lazily)
+ model = None
+ device = None
+
+
+ def check_config():
+     """Validate environment variables."""
+     missing = []
+     if not SHOPIFY_ADMIN_TOKEN:
+         missing.append('SHOPIFY_ADMIN_TOKEN')
+     if not PINECONE_API_KEY:
+         missing.append('PINECONE_API_KEY')
+     if not PINECONE_HOST:
+         missing.append('PINECONE_HOST')
+
+     if missing:
+         print("Missing environment variables:")
+         for v in missing:
+             print(f" - {v}")
+         print("\nCopy .env.example to .env and fill in values")
+         sys.exit(1)
+
+
+ def load_model():
+     """Load Jina CLIP v2 model."""
+     global model, device
+
+     print("Loading Jina CLIP v2 model...")
+     print("(First run downloads ~2GB)")
+
+     from transformers import AutoModel
+
+     device = "cuda" if torch.cuda.is_available() else "cpu"
+     print(f"Using: {device.upper()}")
+
+     model = AutoModel.from_pretrained(
+         "jinaai/jina-clip-v2",
+         trust_remote_code=True
+     ).to(device).eval()
+
+     print("Model loaded!")
+
+
+ def get_pinecone():
+     """Connect to Pinecone."""
+     print("Connecting to Pinecone...")
+     pc = Pinecone(api_key=PINECONE_API_KEY)
+     index = pc.Index(host=f"https://{PINECONE_HOST}")
+     stats = index.describe_index_stats()
+     print(f"Connected! {stats.get('total_vector_count', 0)} vectors")
+     return index
+
+
+ def fetch_products(limit=None, tags=None):
+     """Fetch products from Shopify."""
+     print(f"Fetching products from {SHOPIFY_STORE}...")
+     if tags:
+         print(f" Tags filter: {tags}")
+
+     products = []
+     url = f"https://{SHOPIFY_STORE}.myshopify.com/admin/api/{API_VERSION}/products.json?limit=250&status=active&order=created_at%20desc"
+     headers = {"X-Shopify-Access-Token": SHOPIFY_ADMIN_TOKEN}
+
+     while url:
+         resp = requests.get(url, headers=headers, timeout=30)
+         resp.raise_for_status()
+         batch = resp.json().get('products', [])
+
+         # Filter by tags
+         if tags:
+             tag_list = [t.strip().lower() for t in tags.split(',')]
+             batch = [p for p in batch if any(
+                 t.lower() in [x.strip().lower() for x in p.get('tags', '').split(',')]
+                 for t in tag_list
+             )]
+
+         products.extend(batch)
+         print(f" {len(products)} products...", end='\r')
+
+         if limit and len(products) >= limit:
+             products = products[:limit]
+             break
+
+         # Pagination
+         url = None
+         link = resp.headers.get('Link', '')
+         if 'rel="next"' in link:
+             for part in link.split(','):
+                 if 'rel="next"' in part:
+                     url = part.split('<')[1].split('>')[0]
+
+     print(f"\nFetched {len(products)} products")
+     return products
+
+
+ def download_image(url):
+     """Download image as PIL."""
+     try:
+         url = url + ('&' if '?' in url else '?') + 'width=512'
+         resp = requests.get(url, timeout=15)
+         resp.raise_for_status()
+         return Image.open(BytesIO(resp.content)).convert('RGB')
+     except Exception:  # not a bare except, so Ctrl+C still interrupts the run
+         return None
+
+
+ def get_embedding(image):
+     """Generate embedding."""
+     global model
+     try:
+         with torch.no_grad():
+             emb = model.encode_image(image)
+             if hasattr(emb, 'cpu'):
+                 emb = emb.cpu().numpy()
+             emb = emb.flatten()
+             # Truncate first so the stored vector is still unit-length
+             if len(emb) > 512:
+                 emb = emb[:512]
+             emb = emb / (emb ** 2).sum() ** 0.5  # L2 normalize
+         return emb.tolist()
+     except Exception as e:
+         print(f"\nEmbedding error: {e}")
+         return None
+
+
+ def get_price(product):
+     """Extract price from variants."""
+     try:
+         return float(product.get('variants', [{}])[0].get('price', 0))
+     except (IndexError, TypeError, ValueError):
+         return 0.0
+
+
+ def main():
+     parser = argparse.ArgumentParser(description='Index products for visual search')
+     parser.add_argument('--limit', type=int, help='Limit products')
+     parser.add_argument('--tags', type=str, default='clothing,footwear', help='Filter by tags')
+     parser.add_argument('--batch-size', type=int, default=100, help='Pinecone batch size')
+     parser.add_argument('--clear', action='store_true', help='Clear index first')
+     parser.add_argument('--dry-run', action='store_true', help='No upload')
+     args = parser.parse_args()
+
+     print("=" * 50)
+     print(" Visual Search Indexer")
+     print("=" * 50)
+
+     check_config()
+     load_model()
+
+     index = None
+     if not args.dry_run:
+         index = get_pinecone()
+         if args.clear:
+             print("Clearing index...")
+             index.delete(delete_all=True)
+             time.sleep(2)
+
+     products = fetch_products(limit=args.limit, tags=args.tags)
+     if not products:
+         print("No products found!")
+         return
+
+     print(f"\nIndexing {len(products)} products...")
+
+     vectors = []
+     ok, skip, err = 0, 0, 0
+
+     for product in tqdm(products, desc="Processing"):
+         if not product.get('images'):
+             skip += 1
+             continue
+
+         try:
+             # Get default image
+             images = product['images']
+             img_data = next((i for i in images if i.get('position') == 1), images[0])
+             img_url = img_data['src']
+
+             # Download & embed
+             img = download_image(img_url)
+             if not img:
+                 err += 1
+                 continue
+
+             emb = get_embedding(img)
+             if not emb:
+                 err += 1
+                 continue
+
+             # Build vector with metadata for future analysis
+             tags = [t.strip() for t in product.get('tags', '').split(',') if t.strip()]
+
+             vectors.append({
+                 'id': str(product['id']),
+                 'values': emb,
+                 'metadata': {
+                     'product_id': product['id'],
+                     'handle': product['handle'],
+                     'title': product['title'],
+                     'vendor': product.get('vendor', ''),
+                     'product_type': product.get('product_type', ''),
+                     'tags': tags[:20],
+                     'price': get_price(product),
+                     'created_at': product.get('created_at', ''),
+                     'image_url': img_url
+                 }
+             })
+             ok += 1
+
+             # Batch upload (a dry run only discards the batch to free memory)
+             if len(vectors) >= args.batch_size:
+                 if not args.dry_run:
+                     index.upsert(vectors=vectors)
+                 vectors = []
+
+         except Exception as e:
+             err += 1
+             tqdm.write(f"Error on product {product.get('id')}: {e}")
+
+     # Final batch
+     if vectors and not args.dry_run:
+         index.upsert(vectors=vectors)
+
+     print("\n" + "=" * 50)
+     print(" Done!")
+     print("=" * 50)
+     print(f" Indexed: {ok}")
+     print(f" Skipped: {skip}")
+     print(f" Errors: {err}")
+     if args.dry_run:
+         print(" (dry run - nothing uploaded)")
+     print("=" * 50)
+
+
+ if __name__ == "__main__":
+     main()
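
Review note: `fetch_products` follows Shopify's cursor pagination by splitting the `Link` response header by hand. The sketch below isolates that step as a pure function that can be tested on its own. The name `next_page_url` and the sample header are assumptions made for illustration. `requests` also exposes parsed link headers as `Response.links`, which could replace the manual parsing.

```python
import re

# Matches entries of the form: <URL>; rel="name"
_LINK_RE = re.compile(r'<([^>]+)>\s*;\s*rel="([^"]+)"')


def next_page_url(link_header):
    """Return the rel="next" URL from a Link header, or None if there is none.

    Hypothetical helper; index.py does the equivalent with str.split.
    """
    for url, rel in _LINK_RE.findall(link_header or ""):
        if rel == "next":
            return url
    return None


# Made-up header in the shape used for cursor pagination.
sample = (
    '<https://example.myshopify.com/admin/api/2024-01/products.json'
    '?limit=250&page_info=abc>; rel="previous", '
    '<https://example.myshopify.com/admin/api/2024-01/products.json'
    '?limit=250&page_info=def>; rel="next"'
)
print(next_page_url(sample))
```

Matching on the `<...>; rel="..."` pairs avoids splitting the header on commas, so a URL that happens to contain a comma is still read correctly.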
indexer/requirements.txt ADDED
@@ -0,0 +1,8 @@
+ torch
+ transformers
+ pillow
+ pinecone-client
+ requests
+ tqdm
+ einops
+ timm