---
language:
- en
license: mit
library_name: transformers
tags:
- reranking
- information-retrieval
- listwise
- generative
- llama
- chain-of-thought
base_model: meta-llama/Llama-3.1-8B
datasets:
- abdoelsayed/DeAR-COT
pipeline_tag: text-generation
---

# DeAR-8B-Reranker-Listwise-v1

## Model Description

**DeAR-8B-Reranker-Listwise-v1** is an 8B-parameter listwise neural reranker that produces document rankings through text generation. Unlike pointwise models, which score each document independently, it reads all candidate documents in a single prompt and generates a ranking with Chain-of-Thought reasoning.

## Model Details

- **Model Type:** Listwise Reranker (Causal Language Model)
- **Base Model:** LLaMA-3.1-8B
- **Parameters:** 8 billion
- **Training Method:** Supervised Fine-tuning with Chain-of-Thought
- **Training Data:** [DeAR-COT Dataset](https://huggingface.co/datasets/abdoelsayed/DeAR-COT)
- **Training Framework:** LLaMA-Factory
- **Precision:** BFloat16

## Key Features

- ✅ **Listwise Ranking:** Considers inter-document dependencies
- ✅ **Chain-of-Thought:** Generates reasoning for ranking decisions
- ✅ **State-of-the-Art:** Best performance on NovelEval (92.01 NDCG@10; 90.97 averaged over NDCG@1, NDCG@5, and NDCG@10)
- ✅ **Flexible:** Handles variable numbers of documents
- ✅ **Interpretable:** Provides explanations for rankings

## Performance

| Benchmark | Metric | Score | vs. GPT-4 |
|-----------|--------|-------|-----------|
| TREC DL19 | NDCG@10 | 77.91 | +2.32 |
| TREC DL20 | NDCG@10 | 75.63 | +5.07 |
| NovelEval | Avg. of NDCG@1/5/10 | **90.97** | **+3.09** |
| BEIR (Avg) | NDCG@10 | 46.8 | +2.3 |

**Key Achievement:** Outperforms GPT-4 (RankGPT-4) on NovelEval by 3.09 points, averaged over NDCG@1, NDCG@5, and NDCG@10!
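
For readers unfamiliar with the metric reported above, the following is a minimal sketch of how NDCG@k can be computed for a single query. It is not the evaluation script behind the table: it assumes the linear-gain form of DCG and builds the ideal ordering only from the labels passed in, whereas official TREC tooling may use a different gain function and the full set of judged documents. The function names and the graded labels are hypothetical.

```python
import math
from typing import Sequence


def dcg_at_k(relevances: Sequence[float], k: int) -> float:
    """Discounted cumulative gain over the first k grades (linear gain, log2 discount)."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(ranked_relevances: Sequence[float], k: int = 10) -> float:
    """NDCG@k: DCG of the given order divided by the DCG of the best possible order."""
    ideal_order = sorted(ranked_relevances, reverse=True)
    ideal_dcg = dcg_at_k(ideal_order, k)
    # A query with no relevant passages gets a score of 0 instead of dividing by zero
    return dcg_at_k(ranked_relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical graded labels (0-3) for five passages, in the order a reranker returned them
perfect_order = [3, 2, 1, 0, 0]
top_two_swapped = [2, 3, 1, 0, 0]

print(f"perfect order:   {100 * ndcg_at_k(perfect_order):.2f}")
print(f"top two swapped: {100 * ndcg_at_k(top_two_swapped):.2f}")
```

Scores in the table are on a 0-100 scale, which is why the sketch multiplies by 100 before printing. Benchmark numbers are the mean of this per-query value over all queries in the test set.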

## Usage

### Quick Start

```python
import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load model
model_path = "abdoelsayed/dear-8b-reranker-listwise-v1"
tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

# Prepare input
query = "When did Thomas Edison invent the light bulb?"
documents = [
    "Lightning strike at Seoul National University",
    "Thomas Edison tried to invent a device for car but failed",
    "Coffee is good for diet",
    "KEPCO fixes light problems",
    "Thomas Edison invented the light bulb in 1879",
]

# Create listwise prompt
doc_list = "\n".join([f"[{i}] {doc}" for i, doc in enumerate(documents)])
prompt = f"""I will provide you with {len(documents)} passages, each indicated by a number identifier [].
Rank the passages based on their relevance to the search query: {query}.

{doc_list}

Search Query: {query}.
Rank the passages above based on their relevance to the search query.
Output the ranking as a list of numbers."""

# Generate ranking
inputs = tokenizer(prompt, return_tensors="pt", truncation=True, max_length=2048)
inputs = {k: v.to(model.device) for k, v in inputs.items()}

with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=50,
        do_sample=False,  # greedy decoding; a temperature setting has no effect when sampling is off
        pad_token_id=tokenizer.pad_token_id
    )

# Decode only the newly generated tokens, skipping the prompt
ranking_text = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
print(f"Ranking: {ranking_text}")
# Example output: [4] > [1] > [0] > [3] > [2]
```

### Complete Reranking Pipeline

```python
import re
from typing import List

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM


class ListwiseReranker:
    def __init__(self, model_path: str, device: str = "auto"):
        self.tokenizer = AutoTokenizer.from_pretrained(model_path, use_fast=True)
        self.model = AutoModelForCausalLM.from_pretrained(
            model_path,
            torch_dtype=torch.bfloat16,
            device_map=device,
            low_cpu_mem_usage=True
        )

        if self.tokenizer.pad_token is None:
            self.tokenizer.pad_token = self.tokenizer.eos_token

    def create_prompt(self, query: str, documents: List[str], max_doc_len: int = 300) -> str:
        """Create listwise ranking prompt."""
        # max_doc_len truncates each document by characters, not tokens
        doc_list = "\n".join([f"[{i}] {doc[:max_doc_len]}" for i, doc in enumerate(documents)])

        prompt = f"""I will provide you with {len(documents)} passages, each indicated by a number identifier [].
Rank the passages based on their relevance to the search query: {query}.

{doc_list}

Search Query: {query}.
Rank the passages above based on their relevance to the search query.
Output the ranking as a list of numbers."""
        return prompt

    def parse_ranking(self, output_text: str, num_docs: int) -> List[int]:
        """Parse model output to extract ranking."""
        ranked: List[int] = []
        seen = set()

        # Keep the first occurrence of each valid "[n]" identifier, in output order,
        # so a repeated identifier cannot push another document out of the result
        for match in re.findall(r'\[(\d+)\]', output_text):
            idx = int(match)
            if idx < num_docs and idx not in seen:
                seen.add(idx)
                ranked.append(idx)

        # Add missing documents at the end
        for i in range(num_docs):
            if i not in seen:
                ranked.append(i)

        return ranked

    def rerank(
        self,
        query: str,
        documents: List[str],
        max_new_tokens: int = 50,
        temperature: float = 0.0
    ) -> List[int]:
        """
        Rerank documents for a query.

        Args:
            query: Search query
            documents: List of document texts
            max_new_tokens: Max tokens to generate
            temperature: Sampling temperature; 0 (default) means greedy decoding

        Returns:
            List of document indices ranked by relevance
        """
        prompt = self.create_prompt(query, documents)

        inputs = self.tokenizer(
            prompt,
            return_tensors="pt",
            truncation=True,
            max_length=2048
        )
        inputs = {k: v.to(self.model.device) for k, v in inputs.items()}

        # Sample only when a positive temperature is requested; otherwise decode greedily
        gen_kwargs = {
            "max_new_tokens": max_new_tokens,
            "pad_token_id": self.tokenizer.pad_token_id,
            "do_sample": temperature > 0,
        }
        if temperature > 0:
            gen_kwargs["temperature"] = temperature

        with torch.no_grad():
            outputs = self.model.generate(**inputs, **gen_kwargs)

        # Decode only the newly generated tokens, skipping the prompt
        output_text = self.tokenizer.decode(
            outputs[0][inputs['input_ids'].shape[1]:],
            skip_special_tokens=True
        )

        ranking = self.parse_ranking(output_text, len(documents))
        return ranking


# Example usage
reranker = ListwiseReranker("abdoelsayed/dear-8b-reranker-listwise-v1")

query = "What are the health benefits of green tea?"
documents = [
    "Green tea is a popular beverage in Asian countries.",
    "Studies show green tea contains antioxidants that may reduce inflammation.",
    "Coffee is another caffeinated drink consumed worldwide.",
    "Green tea has been linked to improved brain function and fat loss.",
    "The weather today is sunny and warm.",
]

ranking = reranker.rerank(query, documents)
print(f"Ranked indices: {ranking}")
# Example output: [1, 3, 0, 2, 4]

# Display ranked documents
for rank, idx in enumerate(ranking, 1):
    print(f"{rank}. {documents[idx]}")
```

## Training Details

### Training Data

- **Dataset:** [DeAR-COT](https://huggingface.co/datasets/abdoelsayed/DeAR-COT)
- **Format:** Instruction-following with ranking outputs

### Training Configuration

```yaml
model_name: meta-llama/Llama-3.1-8B
task_type: sft
training_method: listwise_ranking
framework: LLaMA-Factory

hyperparameters:
  learning_rate: 1e-5
  batch_size: 4
  gradient_accumulation: 4
  epochs: 2
  max_length: 2048
  warmup_ratio: 0.1
  weight_decay: 0.01
  optimizer: adamw_torch
  lr_scheduler: cosine

distributed:
  method: torch.distributed.run
  num_gpus: 4
  deepspeed: zero2
```

### Hardware

- **GPUs:** 4x NVIDIA A100 (80GB)
- **Training Time:** ~30 hours
- **Framework:** LLaMA-Factory with DeepSpeed
- **Memory Usage:** ~70GB per GPU

### Prompt Format

**Training Format:**

```
I will provide you with {N} passages, each indicated by a number identifier [].
Rank the passages based on their relevance to the search query: {query}.

[0] {doc_0}
[1] {doc_1}
...
[N-1] {doc_N-1}

Search Query: {query}.
Rank the passages above based on their relevance to the search query.
Output the ranking as a list of numbers.

Answer: [most_relevant] > [second] > ... > [least_relevant]
```

## Evaluation Results

### TREC Deep Learning

| Method | DL19 (NDCG@10) | DL20 (NDCG@10) | Average |
|--------|----------------|----------------|---------|
| BM25 | 50.58 | 47.96 | 49.27 |
| RankGPT-4 | 75.59 | 70.56 | 73.08 |
| **DeAR-L-8B** | **77.91** | **75.63** | **76.77** |

### NovelEval-2306 (Novel Query Generalization)

| Method | NDCG@1 | NDCG@5 | NDCG@10 | Average |
|--------|--------|--------|---------|---------|
| BM25 | 33.33 | 45.96 | 55.77 | 45.02 |
| RankGPT-4 | 85.71 | 87.49 | 90.45 | 87.88 |
| **DeAR-L-8B** | **92.86** | **88.04** | **92.01** | **90.97** |

🏆 **3.09 points better than RankGPT-4 on the NovelEval average (Average column)!**

### BEIR Benchmark

| Dataset | NDCG@10 |
|---------|---------|
| MS MARCO | 70.2 |
| NQ | 54.1 |
| HotpotQA | 64.5 |
| FiQA | 49.3 |
| ArguAna | 62.1 |
| SciFact | 76.2 |
| TREC-COVID | 88.4 |
| NFCorpus | 40.6 |
| **Average (as reported)** | **46.8** |

The eight rows listed above average 63.2, so the reported 46.8 does not correspond to this subset alone.

### Efficiency Analysis

| Metric | Value |
|--------|-------|
| Inference Time (20 docs) | 11.16s |
| Throughput | ~1.8 docs/sec |
| GPU Memory (inference) | 22GB |
| Model Size (BF16) | 16GB |

**Comparison with Other Methods:**

- **2.2x faster** than RankGPT-4 (24.5s)
- **1.9x faster** than RankZephyr (21.6s)
- Comparable ranking quality at roughly half the inference time

## Advantages over Pointwise Models

| Aspect | Pointwise | Listwise (This Model) |
|--------|-----------|----------------------|
| Document Interaction | ❌ Independent | ✅ Considers relationships |
| Reasoning | ❌ None | ✅ Chain-of-Thought |
| Novel Queries | Good | ✅ **Excellent** (+3-5 NDCG@10) |
| Interpretability | ❌ Score only | ✅ Reasoning provided |
| Speed | ✅ Very Fast (2.2s) | Moderate (11.2s) |

## Model Architecture

```
Input: Listwise Prompt with Query + Multiple Documents
    ↓
LLaMA-3.1-8B Decoder
    ↓
Auto-regressive Generation
    ↓
Output: "[4] > [1] > [0] > [3] > [2]"
    ↓
Parse to Ranking: [4, 1, 0, 3, 2]
```

## When to Use This Model

**Best for:**

- ✅ Novel/complex queries requiring reasoning
- ✅ Tasks where interpretability
matters
- ✅ Small candidate sets (<100 documents)
- ✅ Research and analysis applications

**Consider pointwise models for:**

- ❌ Large-scale reranking (1000s of docs)
- ❌ Real-time, low-latency applications
- ❌ When reasoning is not needed

## Limitations

1. **Inference Speed:** Slower than pointwise models (~5x)
2. **Document Count:** Limited by context length (~20-50 docs optimal)
3. **Parsing Errors:** May occasionally generate malformed rankings
4. **Cost:** Higher computational cost for generation
5. **Language:** English only

## Bias and Ethical Considerations

- **Position Bias:** May favor documents in certain positions
- **Training Data Bias:** Inherits biases from CoT annotations
- **Reasoning Artifacts:** Generated explanations may contain hallucinations
- **Fairness:** Should be evaluated for fairness in your domain

## Related Models

**DeAR Listwise:**

- [DeAR-8B-Listwise-LoRA](https://huggingface.co/abdoelsayed/dear-8b-reranker-listwise-lora-v1) - LoRA adapter version

**DeAR Pointwise (8B):**

- [DeAR-8B-RankNet](https://huggingface.co/abdoelsayed/dear-8b-reranker-ranknet-v1)
- [DeAR-8B-CE](https://huggingface.co/abdoelsayed/dear-8b-reranker-ce-v1)

**Resources:**

- [DeAR-COT Dataset](https://huggingface.co/datasets/abdoelsayed/DeAR-COT)
- [Teacher Model](https://huggingface.co/abdoelsayed/llama2-13b-rankllama-teacher)

## Citation

```bibtex
@article{abdallah2025dear,
  title={DeAR: Dual-Stage Document Reranking with Reasoning Agents via LLM Distillation},
  author={Abdallah, Abdelrahman and Mozafari, Jamshid and Piryani, Bhawna and Jatowt, Adam},
  journal={arXiv preprint arXiv:2508.16998},
  year={2025}
}
```

## License

MIT License

## More Information

- **GitHub:** [DataScienceUIBK/DeAR-Reranking](https://github.com/DataScienceUIBK/DeAR-Reranking)
- **Paper:** [arXiv:2508.16998](https://arxiv.org/abs/2508.16998)
- **Collection:** [DeAR Models](https://huggingface.co/collections/abdoelsayed/dear-reranking)
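
## Example: Sliding-Window Reranking for Longer Candidate Lists

The Limitations section notes that a single prompt works best with roughly 20-50 documents. One common workaround for longer candidate lists, which this model card does not describe and which is shown here only as a sketch, is to rerank overlapping windows from the bottom of the first-stage list upward so that strong candidates move toward the top. The function name `sliding_window_rerank`, the `window_size` and `stride` defaults, and the `rerank_fn` callback are assumptions made for illustration; any callable shaped like `ListwiseReranker.rerank(query, documents)` should fit. A simple length-based stand-in replaces the model so the snippet runs without a GPU.

```python
from typing import Callable, List


def sliding_window_rerank(
    query: str,
    documents: List[str],
    rerank_fn: Callable[[str, List[str]], List[int]],
    window_size: int = 20,
    stride: int = 10,
) -> List[int]:
    """Return indices into `documents`, best first, using back-to-front windows."""
    if not 0 < stride <= window_size:
        raise ValueError("stride must be between 1 and window_size")

    order = list(range(len(documents)))  # start from the first-stage order
    end = len(order)
    while end > 0:
        start = max(0, end - window_size)
        window = order[start:end]
        # rerank_fn returns positions within the window, best first
        local = rerank_fn(query, [documents[i] for i in window])
        order[start:end] = [window[j] for j in local]
        if start == 0:
            break
        end -= stride  # slide toward the top, overlapping by window_size - stride
    return order


def longest_first(query: str, docs: List[str]) -> List[int]:
    """Hypothetical stand-in for a real reranker: prefers longer passages."""
    return sorted(range(len(docs)), key=lambda i: -len(docs[i]))


passages = ["a" * n for n in (3, 1, 4, 2, 5, 9, 7, 6)]
print(sliding_window_rerank("demo", passages, longest_first, window_size=4, stride=2))
```

With the pipeline class from the Usage section, the call would look like `sliding_window_rerank(query, documents, reranker.rerank)`. After one pass only about the top `window_size - stride` positions can be trusted to hold the best candidates; lower positions are ordered only within their own windows. Each window costs one generation call, so latency grows with the number of windows.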