---
language:
- multilingual
license: cc-by-nc-4.0
library_name: sentence-transformers
tags:
- sentence-transformers
- cross-encoder
- reranker
- affiliation-matching
- scholarly-metadata
base_model: jinaai/jina-reranker-v2-base-multilingual
datasets:
- cometadata/triplet-loss-for-embedding-affiliations-sample-1
pipeline_tag: text-classification
---

# Jina Affiliation Reranker

A cross-encoder reranker fine-tuned for affiliation string matching. Given a pair of affiliation strings, it predicts how likely they are to refer to the same institution.

## Use Case

This model is designed for matching and disambiguating messy real-world affiliation strings against canonical institution records from the Research Organization Registry (ROR).

**Examples of what it handles:**

- Abbreviations: "MIT" ↔ "Massachusetts Institute of Technology"
- Word reordering: "University of Oxford" ↔ "Oxford University"
- Partial matches: "Dept. of Physics, Stanford" ↔ "Stanford University"
- International variants: "東京大学" ↔ "University of Tokyo"
- OCR noise: "Univ ersity of Cal ifornia" ↔ "University of California"

## Usage

```python
from sentence_transformers import CrossEncoder

model = CrossEncoder(
    "cometadata/jina-reranker-v2-multilingual-affiliations",
    trust_remote_code=True,
)

# Score affiliation pairs (higher = more likely the same institution)
pairs = [
    ["University of California, Berkeley", "UC Berkeley"],
    ["University of California, Berkeley", "Berkeley College"],
]
scores = model.predict(pairs)
# e.g. [0.82, 0.15]: the first pair matches, the second does not

# Rank candidates for an affiliation string
results = model.rank(
    "MIT, Cambridge, MA",
    [
        "Massachusetts Institute of Technology",
        "MIT University (India)",
        "University of Cambridge",
    ],
)
# Returns a list of dicts (corpus_id, score) sorted by descending score
```

## Training

**Base Model:** [jinaai/jina-reranker-v2-base-multilingual](https://huggingface.co/jinaai/jina-reranker-v2-base-multilingual)

**Dataset:**
[cometadata/triplet-loss-for-embedding-affiliations-sample-1](https://huggingface.co/datasets/cometadata/triplet-loss-for-embedding-affiliations-sample-1)

- ~8K triplets (anchor, positive, negative)
- 80% hard negatives (similar but different institutions)
- 20% easy negatives (clearly different institutions)

**Configuration:**

| Parameter | Value |
|-----------|-------|
| Epochs | 3 |
| Batch size | 16 |
| Learning rate | 2e-5 |
| Loss | BinaryCrossEntropyLoss |
| Validation split | 15% |

## Evaluation

Evaluated on 300 test cases across 10 difficulty tiers. The Δ column gives the change in accuracy in percentage points (pp):

| Tier | Cases | Base Model | Fine-tuned | Δ (pp) |
|------|-------|------------|------------|--------|
| Baseline | 30 | 100.0% | 100.0% | 0.0 |
| OCR/Noise | 30 | 100.0% | 100.0% | 0.0 |
| Abbreviations | 40 | 60.0% | 80.0% | +20.0 |
| Hierarchical | 35 | 71.4% | 77.1% | +5.7 |
| Medical/Hospital | 25 | 64.0% | 68.0% | +4.0 |
| Research Labs | 25 | 80.0% | 84.0% | +4.0 |
| International | 35 | 82.9% | 91.4% | +8.6 |
| Disambiguation | 31 | 45.2% | 51.6% | +6.5 |
| Negative Controls | 19 | 100.0% | 100.0% | 0.0 |
| Ultra-Hard | 30 | 93.3% | 96.7% | +3.3 |

**Overall:** 78.3% → 84.3% accuracy (+6.0 pp), MRR 0.873 → 0.913

## Model Details

- **Parameters:** 278M
- **Max sequence length:** 1024 tokens
- **Output:** Single relevance score (0-1)
- **Languages:** Multilingual (inherited from the base model)

## License

CC-BY-NC-4.0, inherited from the base model (non-commercial use only).

## Citation

```bibtex
@misc{jina-affiliation-reranker,
  title={Jina Affiliation Reranker},
  author={cometadata},
  year={2024},
  publisher={Hugging Face},
  url={https://huggingface.co/cometadata/jina-reranker-v2-multilingual-affiliations}
}
```
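
## Evaluation Metrics Sketch

The Evaluation section reports accuracy and MRR without spelling out how they are computed. The following is a minimal sketch of one common reading, assuming that "accuracy" means top-1 accuracy (the correct institution is ranked first) and that MRR is the mean of 1/rank of the correct institution. The helper names `reciprocal_rank` and `summarize` and the toy data are hypothetical and are not taken from the authors' evaluation code.

```python
from typing import Sequence


def reciprocal_rank(ranked_ids: Sequence[int], gold_id: int) -> float:
    """Return 1/rank of the gold candidate, or 0.0 if it is absent."""
    for position, candidate_id in enumerate(ranked_ids, start=1):
        if candidate_id == gold_id:
            return 1.0 / position
    return 0.0


def summarize(cases: Sequence[tuple[Sequence[int], int]]) -> tuple[float, float]:
    """Compute (top-1 accuracy, MRR) over (ranked_ids, gold_id) cases."""
    if not cases:
        raise ValueError("need at least one test case")
    rr = [reciprocal_rank(ranked, gold) for ranked, gold in cases]
    top1 = sum(1 for r in rr if r == 1.0) / len(rr)
    mrr = sum(rr) / len(rr)
    return top1, mrr


# Hypothetical toy data: each case is (candidate ids in ranked order, gold id).
toy_cases = [
    ([0, 2, 1], 0),  # gold ranked first  -> RR = 1
    ([1, 0, 2], 0),  # gold ranked second -> RR = 1/2
    ([2, 1, 0], 0),  # gold ranked third  -> RR = 1/3
    ([0, 1, 2], 0),  # gold ranked first  -> RR = 1
]
accuracy, mrr = summarize(toy_cases)
print(f"top-1 accuracy = {accuracy:.3f}, MRR = {mrr:.3f}")
# prints "top-1 accuracy = 0.500, MRR = 0.708"
```

To apply this to the model's output, the ranked ids for one query could be taken from `model.rank(...)` as `[r["corpus_id"] for r in results]`, with the gold id being the index of the correct institution in the candidate list.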