# RAG API Design: Retrieval Architecture This document focuses specifically on the API layer designed to **retrieve** data from our existing, highly optimized data pipeline. Because the heavy lifting of processing, vectorization (BGE-M3 Dense + Sparse), and indexing is already handled by the Kafka and Qdrant workers, this API is designed purely for **scalable, high-performance retrieval and generation**. --- ## 🎯 1. Core API Philosophy The RAG API acts as the bridge between user queries (from the frontend) and the populated Qdrant vector database. 1. **Read-Only Operations:** This API does *not* write to Qdrant or ClickHouse. It assumes the databases are already hydrated by the Kafka workers. 2. **Symmetry with Ingestion:** The API must use the exact same BGE-M3 model for hashing user queries that the Embedding Service uses to hash news articles. 3. **Statelessness:** The API nodes hold no session state, allowing infinite horizontal scaling behind a Load Balancer. --- ## 🌐 2. Core API Endpoints ### 2.1 `POST /api/v1/search` (Hybrid Search Only) * **Purpose:** The fastest way to find relevant articles without generating an LLM response. Useful for standard "News Search" bars. * **Input Request:** ```json { "query": "Quantum computing breakthroughs in 2026", "limit": 10, "filters": { "source": ["TechCrunch", "Wired"], "date_range": { "start": "2026-01-01", "end": "2026-12-31" } } } ``` * **Internal Flow:** 1. Passes the `query` text through the BGE-M3 Tokenizer & Model (synchronously or via lightweight async executor). 2. Extracts the `Dense` vector (1024-dim) and `Sparse` lexical weights. 3. Queries Qdrant using a `Prefetch` query (combining Dense + Sparse scoring). 4. Extracts the Qdrant `payload` (article metadata) and returns it. * **Response:** A JSON list of articles sorted by relevance score. ### 2.2 `POST /api/v1/rag/ask` (Full RAG Flow) * **Purpose:** The endpoint for natural language Q&A. This hits Qdrant first, then sends the context to the LLM. * **Input Request:** ```json { "question": "What did Google recently announce regarding quantum processors?", "stream": true, // Critical for UX "top_k": 5 } ``` * **Internal Flow:** 1. **Retrieve:** Performs the exact same Hybrid Search as `/api/v1/search` to get the top 5 article chunks. 2. **Prompt Assembly:** Constructs a structured prompt template: `"Use the following news articles to answer the question...\n\nCONTEXT:\n[Article 1 Text...]\n[Article 2 Text...]\n\nQUESTION: What did Google recently announce..."` 3. **Generate:** Sends the assembled prompt to the LLM (OpenAI, local Llama-3, etc.). 4. **Stream:** Uses Server-Sent Events (SSE) to yield tokens to the frontend as they are generated. --- ## 🧠 3. Query Vectorization Pipeline (Symmetry) For Qdrant search to work perfectly, the API must emulate Step 4 of the *Data Flow Pipeline* exactly. ```python # RAG API Vectorization Logic def vectorize_query(query_text: str): # Uses the SAME FlagEmbedding configuration as the ingestor embeddings = model.encode( sentences=[query_text], batch_size=1, max_length=512, # Queries are shorter than articles return_dense=True, return_sparse=True, return_colbert_vecs=False ) return { "dense": embeddings['dense_vecs'][0].tolist(), "sparse": { "indices": list(embeddings['lexical_weights'][0].keys()), "values": list(embeddings['lexical_weights'][0].values()) } } ``` --- ## ⚡ 4. Scalability at the Retrieval Layer Since the Heavy ETL is done by the pipelines, the API's main bottleneck is **waiting** for Qdrant and the LLM. ### 4.1 Async FastAPI * The API is built purely on `async def` endpoints. * When the API queries Qdrant (`await qdrant_client.async_search(...)`), it yields the thread back to the event loop. * A single FastAPI container can handle thousands of concurrent searches while waiting for Qdrant to respond. ### 4.2 Semantic Query Caching (Redis) To save LLM compute and Qdrant load: * We implement Redis **Semantic Caching**. * If User A asks: *"What is Tesla's stock doing?"* and User B asks *"How is the Tesla stock performing?"*, the semantic cache recognizes the queries are identical in meaning (High Cosine Similarity) and instantly returns User A's cached LLM response to User B. ### 4.3 Streaming (SSE) for LLMs * Generating a 500-word RAG answer might take the LLM 3 seconds. Instead of a loading spinner for 3 seconds, the API uses `StreamingResponse`. The user sees the first word in 200ms, creating a "Real-Time" feel. --- ## 📊 5. Integration with Pipeline Analytics If the RAG API needs to answer questions like *"How many articles mentioned AI today?"*, it should NOT query Qdrant. Qdrant is a Vector Search engine, not an Analytics database. For structured analytics, the API connects directly to **ClickHouse** (which the Kafka `sink` worker hydrates), allowing real-time aggregations without disturbing the vector search performance.