tobil committed on Feb 22

Commit c8ff036 (verified), 1 Parent(s): 7aef2c5

Update qmd-query-expansion-1.7B with latest SFT weights

Files changed (5)

README.md +6 -286
config.json +1 -1
generation_config.json +1 -1
model.safetensors +1 -1
tokenizer_config.json +16 -1

README.md CHANGED

@@ -1,293 +1,13 @@
----
-license: apache-2.0
-base_model: Qwen/Qwen3-1.7B
-tags:
-  - query-expansion
-  - search
-  - retrieval
-  - rag
-  - hybrid-search
-  - dspy
-  - gepa
-language:
-  - en
-pipeline_tag: text-generation
-datasets:
-  - custom
----
-# QMD Query Expansion 1.7B
-A Qwen3-1.7B model finetuned for **query expansion** in hybrid search systems (RAG). Expands user queries into retrieval-optimized variations for both sparse (BM25) and dense (vector) search.
-**Repository**: [github.com/tobi/qmd](https://github.com/tobi/qmd)
-## What This Model Does
-Given a search query, generates 7 expansions:
-- **1 hyde**: A hypothetical document snippet (50-200 chars) that would answer the query
-- **3 lex**: Keyword phrases (2-5 words) optimized for BM25/sparse search
-- **3 vec**: Natural language sentences (15-30 words) for vector/dense search
-This improves recall in hybrid retrieval systems by matching both exact keywords and semantic meaning.
 ## Prompt Format
-**Critical**: Use this exact format. The model was trained on this specific template.
-```
-Expand this search query:
-<query>
-```
-**Example Input**:
-```
-Expand this search query:
-postgresql jsonb indexing
-```
-**Example Output**:
-```
-hyde: PostgreSQL JSONB supports GIN indexes for fast key lookups and containment queries with @> operator.
-lex: postgresql jsonb gin index
-lex: postgres json indexing strategies
-lex: jsonb index optimization postgresql
-vec: How do I create efficient GIN indexes on JSONB columns in PostgreSQL?
-vec: Best practices for indexing JSON data in PostgreSQL databases.
-vec: Performance comparison of GIN vs BTREE indexes for JSONB fields.
-```
-## Usage
-### With vLLM (Recommended)
-```bash
-# Start server
-vllm serve tobil/qmd-query-expansion-1.7B --port 8000
-# Query
-curl -s http://localhost:8000/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "tobil/qmd-query-expansion-1.7B",
-    "messages": [{"role": "user", "content": "Expand this search query:\npostgresql jsonb indexing"}],
-    "temperature": 0.7,
-    "max_tokens": 400
-  }' | jq -r '.choices[0].message.content'
-```
-### With Transformers
-```python
-from transformers import AutoTokenizer, AutoModelForCausalLM
-model = AutoModelForCausalLM.from_pretrained("tobil/qmd-query-expansion-1.7B")
-tokenizer = AutoTokenizer.from_pretrained("tobil/qmd-query-expansion-1.7B")
-messages = [{"role": "user", "content": "Expand this search query:\nReact hooks tutorial"}]
-text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
-inputs = tokenizer(text, return_tensors="pt")
-outputs = model.generate(**inputs, max_new_tokens=400, temperature=0.7, do_sample=True)
-print(tokenizer.decode(outputs[0], skip_special_tokens=True))
-```
-### With llama.cpp (GGUF)
-```bash
-# Download GGUF (Q8_0 quantized, 2.1GB)
-huggingface-cli download tobil/qmd-query-expansion-1.7B qmd-query-expansion-1.7B-Q8_0.gguf
-# Run
-./llama-cli -m qmd-query-expansion-1.7B-Q8_0.gguf \
-  -p "Expand this search query:\nkubernetes vs docker" \
-  --temp 0.7 -n 400
-```
-## Output Parsing
-The model outputs in line format. Parse with:
-```python
-import re
-def parse_expansions(text: str) -> list[dict]:
-    """Parse line-based expansion output into structured format."""
-    expansions = []
-    # Remove thinking tags if present (Qwen3 feature)
-    text = re.sub(r'<think>.*?</think>', '', text, flags=re.DOTALL)
-    for line in text.strip().split('\n'):
-        line = line.strip()
-        match = re.match(r'^(hyde|lex|vec)\s*:\s*(.+)$', line, re.IGNORECASE)
-        if match:
-            expansions.append({
-                "type": match.group(1).lower(),
-                "value": match.group(2).strip()
-            })
-    return expansions
-# Example
-output = """hyde: PostgreSQL JSONB supports GIN indexes for fast queries.
-lex: postgresql jsonb gin index
-lex: postgres json indexing
-lex: jsonb optimization
-vec: How to create GIN indexes on JSONB columns?
-vec: Best practices for PostgreSQL JSON indexing.
-vec: JSONB vs JSON performance comparison."""
-expansions = parse_expansions(output)
-# [{"type": "hyde", "value": "PostgreSQL JSONB supports..."}, ...]
-```
-## Training Details
-### Method: GEPA Distillation
-1. **Teacher Model**: GPT-4o-mini with GEPA-optimized prompt
-2. **Prompt Optimization**: DSPy's GEPA (Grounded Example-based Prompt Adaptation) automatically evolved the teacher prompt over 34 iterations to reach 87.7% on our scoring metric
-3. **Distillation**: Generated 500+ high-quality training examples from teacher
-4. **Student Training**: SFT with LoRA on Qwen3-1.7B, 3 epochs
-### Key Learnings
-#### 1. Hyde-First Ordering Matters
-Generating the hypothetical document (hyde) first provides context that improves lex and vec quality. The hyde acts as an "anchor" that grounds subsequent expansions.
-```
-✅ Good: hyde first, then lex uses hyde context
-hyde: Kubernetes orchestrates containers at scale with auto-scaling...
-lex: kubernetes container orchestration  # informed by hyde
-❌ Bad: lex without context
-lex: container management  # too generic
-```
-#### 2. Entity Preservation is Critical
-Named entities (brands, products, technical terms) must appear in **every** lex expansion. Missing entities tanks BM25 recall.
-```
-Query: "iPhone 15 vs Samsung S24"
-✅ Good lex:
-- "iPhone 15 Samsung S24 comparison"
-- "iPhone 15 vs Samsung S24 specs"
-- "Samsung S24 iPhone 15 camera"
-❌ Bad lex:
-- "smartphone comparison"  # missing entities!
-- "phone camera review"    # missing entities!
-```
-#### 3. Simple Prompts Win for Small Models
-The teacher used a complex DSPy signature format with structured sections. But the small model performed better with the simple training format:
-```
-✅ Use this (matches training):
-"Expand this search query:\n{query}"
-❌ Not this (DSPy signature format):
-"## Inputs\n### query\n{query}\n## Generated Outputs..."
-```
-Complex prompts caused the small model to "leak" instruction fragments into outputs.
-#### 4. Line Format > JSON for Small Models
-Small models struggle with reliable JSON generation. Line-based format is more robust:
-```
-✅ Reliable:
-hyde: Some text here
-lex: keyword phrase
-vec: A full sentence.
-❌ Unreliable for 1.7B:
-[{"type": "hyde", "value": "..."}, ...]
-```
-#### 5. GEPA Prompt Evolution
-GEPA automatically discovered these improvements to the teacher prompt:
-- Explicit examples for edge cases (ambiguous queries like "pin")
-- Emphasis on entity preservation with concrete failure cases
-- Factual grounding examples (Louvre hours, GPS navigation steps)
-- Score targets ("aim for 78-84%") to calibrate quality
-### Training Configuration
-```yaml
-base_model: Qwen/Qwen3-1.7B
-method: SFT with LoRA
-lora_r: 64
-lora_alpha: 128
-learning_rate: 2e-4
-epochs: 3
-batch_size: 4
-gradient_accumulation: 4
-warmup_ratio: 0.1
-scheduler: cosine
-```
-### Metrics
-| Metric | Value |
-|--------|-------|
-| Final Loss | 0.64 |
-| Token Accuracy | 84.7% |
-| Eval Score Range | 80-96% |
-| Training Time | ~7 min (RTX 4090) |
-## Scoring Rubric
-Our evaluation metric scores expansions on:
-1. **Structure** (7 items: 1 hyde, 3 lex, 3 vec)
-2. **Entity Preservation** (all query entities in every lex)
-3. **No Verbatim Echo** (lex shouldn't just repeat the query)
-4. **Hyde Quality** (50-200 chars, informative)
-5. **Vec Quality** (15-30 words, semantic variation)
-6. **Hyde-Lex-Vec Coherence** (lex/vec should build on hyde)
-## Limitations
-- Trained on English queries only
-- May hallucinate facts in hyde (use for retrieval, not as ground truth)
-- Optimized for general knowledge queries; domain-specific queries may need domain-adapted models
-- Qwen3's `<think>` tags sometimes appear (strip them in post-processing)
-## Files
-### Safetensors (for transformers/vLLM)
-- `model.safetensors` - Full precision weights (4.1GB)
-### GGUF Quantizations (for llama.cpp/Ollama)
-| Quant | Size | BPW | Eval Score | Use Case |
-|-------|------|-----|------------|----------|
-| Q8_0 | 2.1GB | 8.5 | 87% | Max quality |
-| Q6_K | 1.6GB | 6.6 | 89% | Good balance |
-| Q5_K_M | 1.4GB | 5.7 | 89% | Recommended |
-| Q4_K_M | 1.2GB | 4.8 | 92% | **Best value** |
-| Q4_0 | 1.2GB | 4.5 | 95% | Smallest |
-**Results:** All quantizations perform excellently on this structured generation task. The eval scores show minimal quality degradation even at Q4_0 - the task (generating hyde/lex/vec expansions) is simple enough that aggressive quantization doesn't hurt. **Q4_K_M is recommended** for the best size/quality tradeoff.
-## Citation
-```bibtex
-@misc{qmd-query-expansion,
-  title={QMD Query Expansion Model},
-  author={Shopify},
-  year={2025},
-  url={https://github.com/tobi/qmd}
-}
-```
-## License
-Apache 2.0

+# QMD Query Expansion 1.7B (SFT)
+Updated Qwen3-1.7B model for query expansion using the production SFT pipeline.
 ## Prompt Format
+The model is trained on messages formatted with the Qwen3 chat template using:
+`/no_think Expand this search query: <query>`
+## Notes
+This checkpoint is the SFT-only trained version (GRPO is not part of the default pipeline).
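
For reference, the prompt format described in the updated README could be assembled as in the minimal sketch below. Only the prompt string itself comes from the README; the names `PROMPT_TEMPLATE` and `build_messages` are hypothetical, and the single space between the instruction and the query is an assumption based on how the line is displayed in this diff.

```python
# Hypothetical helper; only the prompt string is taken from the README.
PROMPT_TEMPLATE = "/no_think Expand this search query: {query}"


def build_messages(query: str) -> list[dict]:
    """Return a single-turn chat message list in the format the README describes."""
    query = query.strip()
    if not query:
        raise ValueError("query must be non-empty")
    return [{"role": "user", "content": PROMPT_TEMPLATE.format(query=query)}]


messages = build_messages("postgresql jsonb indexing")
print(messages[0]["content"])
# prints: /no_think Expand this search query: postgresql jsonb indexing

# The list can then be handed to a chat-template-aware API, for example
# tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
```

Keeping the template in one constant makes it easier to stay aligned with the training format if the README changes again.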

config.json CHANGED

@@ -56,7 +56,7 @@
   },
   "sliding_window": null,
   "tie_word_embeddings": true,
-  "transformers_version": "5.0.0",
+  "transformers_version": "5.2.0",
   "use_cache": true,
   "use_sliding_window": false,
   "vocab_size": 151936

generation_config.json CHANGED

@@ -9,5 +9,5 @@
   "temperature": 0.6,
   "top_k": 20,
   "top_p": 0.95,
-  "transformers_version": "5.0.0"
+  "transformers_version": "5.2.0"
 }
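
To illustrate what the three unchanged sampling values in this hunk control, here is a generic, dependency-free sketch of temperature, top-k, and top-p filtering over a list of logits. It is not the Transformers implementation: the function name `filter_distribution` is hypothetical, and the order in which the filters are applied is an assumption. Only the default values (0.6, 20, 0.95) come from `generation_config.json`.

```python
import math


def filter_distribution(logits, temperature=0.6, top_k=20, top_p=0.95):
    """Return {token_index: probability} after temperature, top-k and top-p filtering.

    Expects a non-empty list of logits.
    """
    # Temperature scaling: values below 1.0 sharpen the distribution.
    scaled = [x / temperature for x in logits]
    # Top-k: keep the indices of the k highest-scoring tokens.
    order = sorted(range(len(scaled)), key=lambda i: scaled[i], reverse=True)[:top_k]
    # Softmax over the survivors (subtract the max for numerical stability).
    peak = scaled[order[0]]
    weights = [math.exp(scaled[i] - peak) for i in order]
    total = sum(weights)
    probs = [w / total for w in weights]
    # Top-p: keep the smallest prefix whose cumulative probability reaches top_p.
    kept, cumulative = [], 0.0
    for idx, p in zip(order, probs):
        kept.append((idx, p))
        cumulative += p
        if cumulative >= top_p:
            break
    # Renormalize so the kept probabilities sum to 1.
    norm = sum(p for _, p in kept)
    return {idx: p / norm for idx, p in kept}


candidates = filter_distribution([2.0, 1.0, 0.0, -5.0])
print(sorted(candidates))  # indices of tokens that remain eligible for sampling
```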

model.safetensors CHANGED

@@ -1,3 +1,3 @@
 version https://git-lfs.github.com/spec/v1
-oid sha256:e520db129fa6880692fe68a22d475b222f95725691c9298e9d3246e9274a3a55
+oid sha256:79bb7eb18f29b8a8997c960ff0ba610e7fe5d17985bfea451192cdb034b8403b
 size 4063515640
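
Because the size is unchanged and only the `oid` differs, the hash is the only way to tell the two weight files apart. The sketch below shows one way a downloaded `model.safetensors` could be checked against the new pointer. The helper names `parse_lfs_pointer` and `matches_pointer` are hypothetical, and the code assumes the three-line pointer layout shown in this hunk.

```python
import hashlib
from pathlib import Path


def parse_lfs_pointer(text: str) -> dict:
    """Parse the 'key value' lines of a Git LFS pointer into a dict."""
    fields = dict(line.strip().split(" ", 1) for line in text.strip().splitlines())
    algo, digest = fields["oid"].split(":", 1)
    return {"algo": algo, "digest": digest, "size": int(fields["size"])}


def matches_pointer(path: Path, pointer: dict, chunk_size: int = 1 << 20) -> bool:
    """Check a local file's size and hash against a parsed pointer."""
    if path.stat().st_size != pointer["size"]:
        return False
    hasher = hashlib.new(pointer["algo"])
    with path.open("rb") as handle:
        # Read in chunks so a multi-gigabyte file is never held in memory at once.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest() == pointer["digest"]


pointer = parse_lfs_pointer("""version https://git-lfs.github.com/spec/v1
oid sha256:79bb7eb18f29b8a8997c960ff0ba610e7fe5d17985bfea451192cdb034b8403b
size 4063515640""")
print(pointer["algo"], pointer["size"])
# Usage, assuming the file was downloaded to the current directory:
# matches_pointer(Path("model.safetensors"), pointer)
```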

tokenizer_config.json CHANGED

@@ -5,10 +5,25 @@
   "clean_up_tokenization_spaces": false,
   "eos_token": "<|im_end|>",
   "errors": "replace",
+  "extra_special_tokens": [
+    "<|im_start|>",
+    "<|im_end|>",
+    "<|object_ref_start|>",
+    "<|object_ref_end|>",
+    "<|box_start|>",
+    "<|box_end|>",
+    "<|quad_start|>",
+    "<|quad_end|>",
+    "<|vision_start|>",
+    "<|vision_end|>",
+    "<|vision_pad|>",
+    "<|image_pad|>",
+    "<|video_pad|>"
+  ],
   "is_local": false,
   "model_max_length": 131072,
   "pad_token": "<|endoftext|>",
   "split_special_tokens": false,
   "tokenizer_class": "Qwen2Tokenizer",
   "unk_token": null
-}
+}