---
base_model: Qwen/Qwen3-Embedding-0.6B
library_name: sentence-transformers
pipeline_tag: feature-extraction
tags: [text-to-sql, schema-linking, retrieval, embedding, spider2, bird]
---

# embedding-0.6b-spider2.0-v2

Bi-encoder column retriever for text-to-SQL schema linking. Base model: Qwen3-Embedding-0.6B, fine-tuned with InfoNCE via LoRA (merged into the weights) for 1 epoch. On top of v1's data mix, v2 adds BigQuery/Snowflake analytics **domain-adaptation** rows: synthetic questions over public schemas, held out from eval, so there is no question leakage.

**Training data:** [`thanhdath/embedding-0.6b-spider2.0-v2-data`](https://huggingface.co/datasets/thanhdath/embedding-0.6b-spider2.0-v2-data), 42,965 rows: BIRD train + Spider train + Spider 2.0 BQ/SF synthetic (no SynSQL).

## Results vs v1 (column recall@K)

**Spider 2.0-547 (flat):** R@100 0.850 / R@300 0.939 / R@500 0.962 (v1: 0.812 / 0.916 / 0.946). With key-completion: R@500 **0.972** at ~226 columns.

**Spider 2.0-233q two-stage (top-50 tables→K):** R@300 0.911 / R@500 0.937 / R@800 0.963 (v1: 0.904 / 0.930 / 0.956).

**BIRD dev (flat):** R@50 0.962, R@200 1.000 (no regression).

## Usage

```bash
vllm serve thanhdath/embedding-0.6b-spider2.0-v2 --task embed --port 8001 --max-model-len 16384
```

Score each column as the dot product of `question_emb` and `column_desc_emb`, then take the top-K. For wide schemas, also add the PK/FK columns of the retrieved tables (key-completion) for +1.5 to 4.6 pp at near-zero cost.
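
The scoring and key-completion steps can be sketched as follows. This is a minimal illustration, not the pipeline used for the reported numbers: it assumes the embeddings have already been computed (for example, via the server's OpenAI-compatible embeddings endpoint) and uses tiny made-up vectors. The `(table, column)` identifier format, the `key_columns` mapping, and the helper names `top_k_columns` and `complete_keys` are all hypothetical.

```python
import heapq
from typing import Dict, List, Sequence, Tuple

Column = Tuple[str, str]  # (table, column); hypothetical identifier format


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Plain dot product, used as the relevance score."""
    return sum(x * y for x, y in zip(a, b))


def top_k_columns(
    question_emb: Sequence[float],
    column_embs: Dict[Column, Sequence[float]],
    k: int,
) -> List[Column]:
    """Return the k columns with the highest dot-product score, best first."""
    return heapq.nlargest(
        k, column_embs, key=lambda col: dot(question_emb, column_embs[col])
    )


def complete_keys(
    retrieved: List[Column], key_columns: Dict[str, List[str]]
) -> List[Column]:
    """Append the PK/FK columns of every table that has a retrieved column."""
    result = list(retrieved)
    seen = set(retrieved)
    # dict.fromkeys keeps tables in first-seen order and drops duplicates.
    for table in dict.fromkeys(t for t, _ in retrieved):
        for col in key_columns.get(table, []):
            if (table, col) not in seen:
                seen.add((table, col))
                result.append((table, col))
    return result


# Toy 2-d embeddings, invented for illustration only.
question_emb = [1.0, 0.0]
column_embs = {
    ("orders", "total_amount"): [0.9, 0.1],
    ("orders", "order_id"): [0.1, 0.9],
    ("orders", "customer_id"): [0.2, 0.8],
    ("customers", "name"): [0.7, 0.3],
    ("customers", "customer_id"): [0.0, 1.0],
    ("products", "sku"): [-0.5, 0.5],
}
# Assumed to come from schema metadata (primary and foreign keys per table).
key_columns = {
    "orders": ["order_id", "customer_id"],
    "customers": ["customer_id"],
    "products": ["sku"],
}

top = top_k_columns(question_emb, column_embs, k=2)
linked = complete_keys(top, key_columns)
print(linked)
```

In this toy run the join keys of `orders` and `customers` score poorly against the question, yet key-completion restores them because each table already has one retrieved column; `products` contributes nothing because none of its columns were retrieved.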