---
license: apache-2.0
base_model: Qwen/Qwen3-1.7B
tags:
- query-expansion
- search
- retrieval
- rag
- hybrid-search
- dspy
- gepa
language:
- en
pipeline_tag: text-generation
datasets:
- custom
---

# QMD Query Expansion 1.7B

A Qwen3-1.7B model fine-tuned for **query expansion** in hybrid search systems (RAG). Expands user queries into retrieval-optimized variations for both sparse (BM25) and dense (vector) search.

**Repository**: [github.com/Shopify/qmd](https://github.com/Shopify/qmd)

## What This Model Does

Given a search query, generates 7 expansions:

- **1 hyde**: A hypothetical document snippet (50-200 chars) that would answer the query
- **3 lex**: Keyword phrases (2-5 words) optimized for BM25/sparse search
- **3 vec**: Natural language sentences (15-30 words) for vector/dense search

This improves recall in hybrid retrieval systems by matching both exact keywords and semantic meaning.

## Prompt Format

**Critical**: Use this exact format. The model was trained on this specific template.

```
Expand this search query:
{query}
```

**Example Input**:

```
Expand this search query:
postgresql jsonb indexing
```

**Example Output**:

```
hyde: PostgreSQL JSONB supports GIN indexes for fast key lookups and containment queries with @> operator.
lex: postgresql jsonb gin index
lex: postgres json indexing strategies
lex: jsonb index optimization postgresql
vec: How do I create efficient GIN indexes on JSONB columns in PostgreSQL?
vec: Best practices for indexing JSON data in PostgreSQL databases.
vec: Performance comparison of GIN vs BTREE indexes for JSONB fields.
```

## Usage

### With vLLM (Recommended)

```bash
# Start server
vllm serve tobil/qmd-query-expansion-1.7B --port 8000

# Query
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "tobil/qmd-query-expansion-1.7B",
    "messages": [{"role": "user", "content": "Expand this search query:\npostgresql jsonb indexing"}],
    "temperature": 0.7,
    "max_tokens": 400
  }' | jq -r '.choices[0].message.content'
```

### With Transformers

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained("tobil/qmd-query-expansion-1.7B")
tokenizer = AutoTokenizer.from_pretrained("tobil/qmd-query-expansion-1.7B")

messages = [{"role": "user", "content": "Expand this search query:\nReact hooks tutorial"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(**inputs, max_new_tokens=400, temperature=0.7, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

### With llama.cpp (GGUF)

```bash
# Download GGUF (Q8_0 quantized, 2.1GB)
huggingface-cli download tobil/qmd-query-expansion-1.7B qmd-query-expansion-1.7B-Q8_0.gguf

# Run
./llama-cli -m qmd-query-expansion-1.7B-Q8_0.gguf \
  -p "Expand this search query:\nkubernetes vs docker" \
  --temp 0.7 -n 400
```

## Output Parsing

The model outputs in line format.
Parse with:

```python
import re

def parse_expansions(text: str) -> list[dict]:
    """Parse line-based expansion output into structured format."""
    expansions = []

    # Remove <think>...</think> blocks if present (Qwen3 feature)
    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)

    for line in text.strip().split('\n'):
        line = line.strip()
        # Accept "hyde:", "lex:" or "vec:" prefixes, case-insensitively
        match = re.match(r'^(hyde|lex|vec)\s*:\s*(.+)$', line, re.IGNORECASE)
        if match:
            expansions.append({
                "type": match.group(1).lower(),
                "value": match.group(2).strip()
            })

    return expansions

# Example
output = """hyde: PostgreSQL JSONB supports GIN indexes for fast queries.
lex: postgresql jsonb gin index
lex: postgres json indexing
lex: jsonb optimization
vec: How to create GIN indexes on JSONB columns?
vec: Best practices for PostgreSQL JSON indexing.
vec: JSONB vs JSON performance comparison."""

expansions = parse_expansions(output)
# [{"type": "hyde", "value": "PostgreSQL JSONB supports..."}, ...]
```

## Training Details

### Method: GEPA Distillation

1. **Teacher Model**: GPT-4o-mini with GEPA-optimized prompt
2. **Prompt Optimization**: DSPy's GEPA (Genetic-Pareto) optimizer automatically evolved the teacher prompt over 34 iterations to reach 87.7% on our scoring metric
3. **Distillation**: Generated 500+ high-quality training examples from the teacher
4. **Student Training**: SFT with LoRA on Qwen3-1.7B, 3 epochs

### Key Learnings

#### 1. Hyde-First Ordering Matters

Generating the hypothetical document (hyde) first provides context that improves lex and vec quality. The hyde acts as an "anchor" that grounds subsequent expansions.

```
✅ Good: hyde first, then lex uses hyde context
hyde: Kubernetes orchestrates containers at scale with auto-scaling...
lex: kubernetes container orchestration  # informed by hyde

❌ Bad: lex without context
lex: container management  # too generic
```

#### 2. Entity Preservation is Critical

Named entities (brands, products, technical terms) must appear in **every** lex expansion. Missing entities tank BM25 recall.
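
One way to enforce this rule is to check each lex line mechanically before it reaches the search index. The sketch below is illustrative only: `missing_entities` is a hypothetical helper, not part of QMD, and it assumes the caller already has the list of query entities (entity extraction is out of scope here). It uses case-insensitive substring matching, which is a deliberate simplification.

```python
def missing_entities(entities: list[str], lex_expansions: list[str]) -> dict[str, list[str]]:
    """Map each lex expansion to the query entities it fails to mention.

    Hypothetical helper for illustration; matching is a plain
    case-insensitive substring test, not a tokenizer-aware comparison.
    """
    report = {}
    for lex in lex_expansions:
        lowered = lex.lower()
        missing = [e for e in entities if e.lower() not in lowered]
        if missing:
            report[lex] = missing
    return report


# Entities are supplied by hand here (an assumption of this sketch).
entities = ["iPhone 15", "Samsung S24"]
good = [
    "iPhone 15 Samsung S24 comparison",
    "iPhone 15 vs Samsung S24 specs",
    "Samsung S24 iPhone 15 camera",
]
bad = ["smartphone comparison", "phone camera review"]

print(missing_entities(entities, good))  # {}
print(missing_entities(entities, bad))   # both lines are flagged for both entities
```

Expansions flagged by a check like this could be dropped or regenerated. The example that follows shows the same distinction by hand.
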
```
Query: "iPhone 15 vs Samsung S24"

✅ Good lex:
- "iPhone 15 Samsung S24 comparison"
- "iPhone 15 vs Samsung S24 specs"
- "Samsung S24 iPhone 15 camera"

❌ Bad lex:
- "smartphone comparison"  # missing entities!
- "phone camera review"    # missing entities!
```

#### 3. Simple Prompts Win for Small Models

The teacher used a complex DSPy signature format with structured sections, but the small model performed better with the simple training format:

```
✅ Use this (matches training):
"Expand this search query:\n{query}"

❌ Not this (DSPy signature format):
"## Inputs\n### query\n{query}\n## Generated Outputs..."
```

Complex prompts caused the small model to "leak" instruction fragments into outputs.

#### 4. Line Format > JSON for Small Models

Small models struggle with reliable JSON generation. Line-based format is more robust:

```
✅ Reliable:
hyde: Some text here
lex: keyword phrase
vec: A full sentence.

❌ Unreliable for 1.7B:
[{"type": "hyde", "value": "..."}, ...]
```

#### 5. GEPA Prompt Evolution

GEPA automatically discovered these improvements to the teacher prompt:

- Explicit examples for edge cases (ambiguous queries like "pin")
- Emphasis on entity preservation with concrete failure cases
- Factual grounding examples (Louvre hours, GPS navigation steps)
- Score targets ("aim for 78-84%") to calibrate quality

### Training Configuration

```yaml
base_model: Qwen/Qwen3-1.7B
method: SFT with LoRA
lora_r: 64
lora_alpha: 128
learning_rate: 2e-4
epochs: 3
batch_size: 4
gradient_accumulation: 4
warmup_ratio: 0.1
scheduler: cosine
```

### Metrics

| Metric | Value |
|--------|-------|
| Final Loss | 0.64 |
| Token Accuracy | 84.7% |
| Eval Score Range | 80-96% |
| Training Time | ~7 min (RTX 4090) |

## Scoring Rubric

Our evaluation metric scores expansions on:

1. **Structure** (7 items: 1 hyde, 3 lex, 3 vec)
2. **Entity Preservation** (all query entities in every lex)
3. **No Verbatim Echo** (lex shouldn't just repeat the query)
4. **Hyde Quality** (50-200 chars, informative)
5. **Vec Quality** (15-30 words, semantic variation)
6. **Hyde-Lex-Vec Coherence** (lex/vec should build on hyde)

## Limitations

- Trained on English queries only
- May hallucinate facts in hyde (use for retrieval, not as ground truth)
- Optimized for general knowledge queries; domain-specific queries may need domain-adapted models
- Qwen3's `<think>` tags sometimes appear (strip them in post-processing)

## Files

- `model.safetensors` - Model weights (4.1GB)
- `qmd-query-expansion-1.7B-Q8_0.gguf` - GGUF format for llama.cpp (2.1GB, Q8_0 quantized)
- `tokenizer.json` - Tokenizer

## Citation

```bibtex
@misc{qmd-query-expansion,
  title={QMD Query Expansion Model},
  author={Shopify},
  year={2025},
  url={https://github.com/Shopify/qmd}
}
```

## License

Apache 2.0