# RAG API Analysis & Critique - Session 2

Following the initial improvements, this document explores deeper architectural gaps and "Phase 2" optimizations for the News Pipeline RAG system.

## 1. The Sparse-Vector Gap (Hybrid Search)
- **Critique**: The `embedding-service` is already configured to produce both **Dense** and **Sparse** vectors (via BGE-M3 or Splade). However, the `rag-api` currently ignores these sparse vectors.
- **Reason**: Sparse vectors excel at "exact match" and keyword-heavy queries (e.g., specific names, dates, or product codes), where dense embeddings often rank the exact-match document lower than semantically similar but less relevant ones.
- **Solution**: Implement **True Hybrid Search** in the `VectorStore`. The API should request both vectors and fuse the two result lists, for example with Reciprocal Rank Fusion (RRF), ideally inside Qdrant rather than in application code.
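
To make the fusion step concrete, here is a minimal, library-free sketch of RRF scoring. The function name, the `weights` parameter, and the sample doc ids are illustrative assumptions, not code from `rag-api`; `k=60` is the constant commonly used for RRF. It only shows how two ranked lists combine, and says nothing about how Qdrant would perform the same fusion server-side.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, weights=None):
    """Fuse several best-first lists of doc ids into one ranked list.

    Each document scores weight / (k + rank) per list it appears in,
    with rank starting at 1. `weights` is a hypothetical extension for
    favouring one retriever; plain RRF uses 1.0 for every list.
    """
    if weights is None:
        weights = [1.0] * len(rankings)
    scores = defaultdict(float)
    for ranking, weight in zip(rankings, weights):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += weight / (k + rank)
    # Highest fused score first; ties broken by doc id for determinism.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical result lists from the dense and sparse retrievers.
dense_hits = ["a", "b", "c"]
sparse_hits = ["c", "a", "d"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)  # ['a', 'c', 'b', 'd']
```

Documents that appear in both lists ("a" and "c") outrank those found by only one retriever, which is the behaviour hybrid search is meant to provide.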

## 2. Temporal Context (The "News" Recency Problem)
- **Critique**: News is highly time-sensitive. A query about "The election" in 2026 should prioritize articles from that month, not articles from 2022. The current retrieval logic treats all vectors as time-agnostic.
- **Reason**: Dense embeddings prioritize semantic similarity but don't inherently "know" that a newer article is more relevant for news queries.
- **Solution**: Implement **Temporal Filtering** and **Recency Boosting**. Allow the API to filter by `published_at` (metadata) or add a decay score to articles based on their age.

## 3. Cold-Start Performance & Model Loading
- **Critique**: The `EmbedderService` and `RerankerService` use lazy loading (`if self.model is None: self._load_model()`). This causes the *very first* request handled by each worker to stall for several seconds while multi-gigabyte models are loaded into RAM.
- **Reason**: Synchronous loading blocks the first user's request.
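- **Solution**: **Async Pre-warming**. Trigger model loading from the FastAPI startup hook (`@app.on_event("startup")`, or a `lifespan` handler in newer FastAPI versions). Loading synchronously there delays startup until the models are ready; starting a background thread from the hook instead keeps the API responsive immediately while the models load.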

## 4. Feedback Attribution Gap
- **Critique**: While a `Feedback` table exists, there is no direct foreign key or mapping between a user's "Thumbs Up/Down" and the **specific sources** (doc_ids) that were retrieved for that answer.
- **Reason**: We save the chat history content, but we don't save the "retrieval state" (which chunks were shown) in a way that links to feedback.
- **Solution**: Update the `ChatHistory` table, or create a `RetrievalLog` table, to store which `doc_ids` were used for each turn. This enables "Negative Sampling": if a user rates an answer poorly, we know those specific chunks were likely unhelpful.

## 5. Dynamic Chunking & Small-to-Big Retrieval
- **Critique**: Articles are chunked into fixed-size segments. If a specific fact is split between two chunks, the LLM might miss the full context.
- **Reason**: Fixed chunking is simple but brittle.
- **Solution**: Implement **Parent Document Retrieval**. Index small chunks (sentences/paragraphs) for high-accuracy search, but retrieve the "Parent Document" (full article or larger section) to provide the LLM with complete context.

---

## Proposed Enhancement Plan

### Phase 1: Robustness (Immediate)
- [x] Add `tiktoken` for context window management.
- [x] Implement query rewriting for better multi-turn retrieval.
- [x] Add explicit error handling for embedding model loading failures.

### Phase 2: Retrieval Quality (Intermediate)
- [x] Configure Qdrant for deeper search depth.
- [x] Integrate a Cross-Encoder for Re-ranking retrieved articles.
- [x] **True Hybrid Search**: Implemented the structure for Dense + Sparse vectors (sparse merging is not yet wired in; see Implementation Details).
- [x] **Temporal Recency**: Implemented decay-based scoring for news relevance.

### Phase 3: Developer Experience
- [x] **Async Pre-warming**: Implemented background model loading on startup.
- [x] **Retrieval Traceability**: Added `retrieved_doc_ids` to chat history.
- [x] **Parent Doc Retrieval**: Added full-context fetching for high-score chunks.

---

## Conclusion
The RAG system now handles conversational context, prioritizes recent news, improves precision via re-ranking, and records which documents backed each answer so that feedback can drive future optimization. Hybrid search is structurally in place: dense search is active, and sparse-vector merging remains to be wired in.

---

## Implementation Details (Session 2)

The Session 2 enhancements were implemented as follows:

### 1. Hybrid Search (Dense + Sparse)
- **Status**: **Hybrid-Ready**
- **Details**: Updated `EmbedderService` to return a dictionary with both dense and sparse vector slots. `VectorStore.search` currently performs dense search only and is structured so that sparse-vector merging can be added later.

### 2. Temporal Context (Recency Bias)
- **Status**: **Implemented**
- **Details**: In `rag.py`, a `score_multiplier` is calculated for each document from its `published_at` date. Articles published today get a multiplier of 1.0, which decays linearly over 60 days to a floor of 0.5. This pushes newer articles toward the top when semantic relevance is otherwise comparable.
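
The decay rule described above can be sketched as follows. The function and constant names are assumptions for illustration, not the actual code in `rag.py`, and the sketch assumes `published_at` is a timezone-aware `datetime`.

```python
from datetime import datetime, timedelta, timezone

DECAY_WINDOW_DAYS = 60  # decay period stated in this document
MIN_MULTIPLIER = 0.5    # floor stated in this document


def recency_multiplier(published_at, now=None):
    """Return 1.0 for an article published now, falling linearly to 0.5 at 60 days."""
    now = now or datetime.now(timezone.utc)
    # Clamp future-dated articles to age 0 so the multiplier never exceeds 1.0.
    age_days = max((now - published_at).total_seconds() / 86400.0, 0.0)
    fraction = min(age_days / DECAY_WINDOW_DAYS, 1.0)
    return 1.0 - (1.0 - MIN_MULTIPLIER) * fraction


reference = datetime(2026, 3, 1, tzinfo=timezone.utc)
print(recency_multiplier(reference - timedelta(days=30), now=reference))  # 0.75
```

One caveat: multiplying by a value below 1.0 only penalizes older articles when the base score is non-negative. If the score being scaled can be negative (raw cross-encoder logits, for instance), it would need to be normalized first.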

### 3. Cold-Start Pre-warming
- **Status**: **Implemented**
- **Details**: Modified the `main.py` startup event to launch a background thread (`threading.Thread`) that triggers model loading for `embedder` and `reranker`. The API starts accepting requests immediately, and the models are usually loaded before the first real query arrives; a request that arrives while loading is still in progress may still have to wait for it to finish.
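
A minimal sketch of this pattern, using only the standard library. `LazyModel`, `prewarm`, and `fake_loader` are hypothetical stand-ins (the real `EmbedderService` and `RerankerService` are not shown in this document); the lock is an assumption added so that a request arriving during warm-up waits for the in-progress load instead of starting a second one.

```python
import threading
import time


class LazyModel:
    """Hypothetical stand-in for a service that loads its model on first use."""

    def __init__(self, loader):
        self._loader = loader
        self._model = None
        self._lock = threading.Lock()

    def get(self):
        # Double-checked locking: only one thread ever runs the loader.
        if self._model is None:
            with self._lock:
                if self._model is None:
                    self._model = self._loader()
        return self._model


def prewarm(services):
    """Load every model on a daemon thread and return that thread."""
    def _run():
        for service in services:
            service.get()

    thread = threading.Thread(target=_run, name="model-prewarm", daemon=True)
    thread.start()
    return thread


load_calls = []


def fake_loader():
    load_calls.append(1)
    time.sleep(0.05)  # simulate a slow model load
    return "model"


embedder = LazyModel(fake_loader)
warm_thread = prewarm([embedder])
result = embedder.get()  # a "request" arriving during warm-up
warm_thread.join()
```

In the real application, `prewarm` would be called from the startup hook in `main.py`, so the server can begin accepting connections while loading continues in the background.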

### 4. Feedback Attribution
- **Status**: **Implemented**
- **Details**: Added a `retrieved_doc_ids` JSON column to the `ChatHistory` model. For every AI response, the list of Qdrant `doc_id`s used to generate that answer is saved. This lets developers see which news articles were retrieved for any answer that received a "Thumbs Down" rating.
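
To illustrate how this column supports negative sampling, here is a self-contained sketch using an in-memory SQLite database. The table layouts, the `rating` encoding, and both helper functions are assumptions made for the example; the actual `ChatHistory` and `Feedback` schemas are not reproduced in this document.

```python
import json
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE chat_history (
    id INTEGER PRIMARY KEY,
    answer TEXT NOT NULL,
    retrieved_doc_ids TEXT NOT NULL  -- JSON-encoded list of doc ids
);
CREATE TABLE feedback (
    id INTEGER PRIMARY KEY,
    chat_history_id INTEGER NOT NULL REFERENCES chat_history(id),
    rating INTEGER NOT NULL  -- assumed encoding: +1 thumbs up, -1 thumbs down
);
""")


def save_turn(answer, doc_ids):
    """Store an answer together with the doc ids that were retrieved for it."""
    cur = conn.execute(
        "INSERT INTO chat_history (answer, retrieved_doc_ids) VALUES (?, ?)",
        (answer, json.dumps(doc_ids)),
    )
    return cur.lastrowid


def downvoted_doc_counts():
    """Count how often each doc id backed an answer that was rated negatively."""
    rows = conn.execute(
        "SELECT h.retrieved_doc_ids FROM feedback f "
        "JOIN chat_history h ON h.id = f.chat_history_id "
        "WHERE f.rating < 0"
    )
    counts = Counter()
    for (raw,) in rows:
        counts.update(json.loads(raw))
    return counts


turn_1 = save_turn("Answer one", ["d1", "d2"])
turn_2 = save_turn("Answer two", ["d2", "d3"])
turn_3 = save_turn("Answer three", ["d4"])
conn.executemany(
    "INSERT INTO feedback (chat_history_id, rating) VALUES (?, ?)",
    [(turn_1, -1), (turn_2, -1), (turn_3, 1)],
)
counts = downvoted_doc_counts()
print(counts.most_common(1))  # [('d2', 2)]
```

A document that repeatedly appears behind poorly rated answers (`d2` here) is a candidate for inspection or for use as a negative example.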

### 5. Parent Document Retrieval
- **Status**: **Implemented**
- **Details**: Added "Small-to-Big" retrieval logic in `rag.py`. If a chunk achieves a rerank score > 0.8, the system automatically fetches the full original article content (Parent Document) so that the LLM has complete context rather than just a snippet.
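
The threshold rule can be sketched as below. The chunk dictionary keys, the `fetch_article` callback, and the de-duplication behaviour are assumptions for illustration rather than the actual `rag.py` implementation; the sketch also assumes chunks arrive sorted by rerank score, highest first.

```python
PARENT_SCORE_THRESHOLD = 0.8  # threshold stated in this document


def expand_to_parents(chunks, fetch_article, threshold=PARENT_SCORE_THRESHOLD):
    """Swap high-scoring chunks for their full article; keep the rest as snippets.

    chunks: dicts with 'article_id', 'text', and 'rerank_score' keys.
    fetch_article: callable mapping an article id to the full article text.
    """
    contexts = []
    expanded = set()
    for chunk in chunks:
        article_id = chunk["article_id"]
        if article_id in expanded:
            continue  # the full article is already in the context
        if chunk["rerank_score"] > threshold:
            expanded.add(article_id)
            contexts.append(fetch_article(article_id))
        else:
            contexts.append(chunk["text"])
    return contexts


# Hypothetical article store and reranked chunks.
articles = {"a1": "FULL ARTICLE A1", "a2": "FULL ARTICLE A2"}
chunks = [
    {"article_id": "a1", "text": "a1 snippet 1", "rerank_score": 0.93},
    {"article_id": "a1", "text": "a1 snippet 2", "rerank_score": 0.85},
    {"article_id": "a2", "text": "a2 snippet", "rerank_score": 0.40},
]
contexts = expand_to_parents(chunks, articles.__getitem__)
print(contexts)  # ['FULL ARTICLE A1', 'a2 snippet']
```

Because full articles are much longer than chunks, the expanded contexts should still pass through the token-budget check from Phase 1 before being sent to the LLM.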