# Kaggle Data Ingestion Guide (T4 x2 GPUs)

This guide provides the exact steps and Python code to run your data ingestion pipeline on Kaggle using **Docling** for extraction, **Markdown + Recursive Chunking**, **MedCPT** for embeddings, and pushing directly to your **Qdrant Cloud** cluster.

## Step 1: Kaggle Notebook Setup
1. Create a new notebook on Kaggle.
2. Go to **Settings** (right-side panel) -> **Accelerator** -> Select **GPU T4 x2**.
3. Turn on **Internet Access** in the settings.
4. Upload your medical documents to the Kaggle notebook by clicking **Add Data** -> **Upload**. Note that the extraction code in Step 4 only picks up `.pdf` files.

---

## Step 2: Install Required Libraries
*Run this in the first cell of your Kaggle notebook:*

```python
!pip install -q "docling" langchain langchain-community langchain-huggingface langchain-text-splitters qdrant-client sentence-transformers textstat
```

---

## Step 3: Import Libraries & Configure Environment
*Run this in the second cell. Replace the Qdrant API Key with your actual credential from your `.env` file.*

```python
import os
from pathlib import Path
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
import uuid

# Configuration
QDRANT_URL = "https://e4f37189-cb62-4a77-a55e-1c9d98082be7.eu-west-2-0.aws.cloud.qdrant.io:6333"
QDRANT_API_KEY = "YOUR_QDRANT_API_KEY" # Paste from your .env
COLLECTION_NAME = "medical_knowledge_base"

# Your uploaded dataset path on Kaggle (change this based on your dataset name)
DATA_DIR = "/kaggle/input/your-medical-dataset-name" 
```
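
Hard-coding the key in a notebook cell makes it easy to leak when the notebook is shared or published. A minimal sketch of an alternative, assuming the key has been stored as a Kaggle notebook secret under the label `QDRANT_API_KEY`. The label and the `load_secret` helper are illustrative and not part of the original pipeline:

```python
import os

def load_secret(name):
    """Return a secret from Kaggle Secrets, falling back to an environment variable."""
    try:
        # kaggle_secrets is only importable inside a Kaggle notebook session
        from kaggle_secrets import UserSecretsClient
    except ImportError:
        # Outside Kaggle: fall back to the process environment
        value = os.environ.get(name)
    else:
        value = UserSecretsClient().get_secret(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} is not set")
    return value

# Usage (assumes a secret with this label exists):
# QDRANT_API_KEY = load_secret("QDRANT_API_KEY")
```

With this in place, the `QDRANT_API_KEY` assignment in the configuration cell could call `load_secret` instead of holding the literal value.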

---

## Step 4: Extract Data using Docling
*Docling extracts text, tables, and document structure from PDFs and can export the result as Markdown, which the chunking step relies on.*

```python
def extract_documents(data_dir):
    converter = DocumentConverter()
    extracted_docs = []
    
    # Iterate through all PDFs in your Kaggle dataset
    for filepath in Path(data_dir).glob("**/*.pdf"):
        print(f"Extracting: {filepath.name}")
        result = converter.convert(str(filepath))
        
        # Export Docling result to Markdown format
        markdown_content = result.document.export_to_markdown()
        extracted_docs.append({
            "source": filepath.name,
            "content": markdown_content
        })
    return extracted_docs

print("Starting Document Extraction...")
docs = extract_documents(DATA_DIR)
print(f"Successfully extracted {len(docs)} documents.")
```

---

## Step 5: Advanced Semantic Chunking (Markdown + Recursive)
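*We first split each document by its Markdown headers, then split any section that is still too long into chunks of at most 512 characters with a 64-character overlap. Note that `RecursiveCharacterTextSplitter` measures length in characters by default, not tokens.*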

```python
def chunk_documents(docs):
    # 1. Split logically by Markdown headers
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    
    # 2. Strict character splitting to guarantee sizing
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    
    chunks = []
    for doc in docs:
        # Split by headers
        md_splits = markdown_splitter.split_text(doc["content"])
        
        # Further split chunks that are too large
        for md_split in md_splits:
            final_splits = text_splitter.split_text(md_split.page_content)
            for split in final_splits:
                chunks.append({
                    "chunk_id": str(uuid.uuid4()),
                    "source": doc["source"],
                    "text": split,
                    # Copy the header metadata so each chunk owns its own dict;
                    # sharing one dict would let later per-chunk fields overwrite each other
                    "metadata": dict(md_split.metadata)
                })
    return chunks

print("Chunking documents...")
chunks = chunk_documents(docs)
print(f"Created {len(chunks)} raw chunks.")
```
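
Because the splitter counts characters, it can help to check the resulting sizes before embedding. A small sketch, assuming the `chunks` list has the shape produced above (a list of dicts with a `text` key). The helper name `summarize_chunk_lengths` is hypothetical:

```python
from statistics import mean

def summarize_chunk_lengths(chunks, limit=512):
    """Report character-length statistics for a list of chunk dicts."""
    lengths = [len(chunk["text"]) for chunk in chunks]
    if not lengths:
        return {"count": 0, "min": 0, "max": 0, "mean": 0.0, "over_limit": 0}
    return {
        "count": len(lengths),
        "min": min(lengths),
        "max": max(lengths),
        "mean": round(mean(lengths), 1),
        # Number of chunks longer than the configured chunk_size
        "over_limit": sum(1 for n in lengths if n > limit),
    }

# Synthetic sample data, only to show the call
sample = [{"text": "a" * 100}, {"text": "b" * 512}, {"text": "c" * 600}]
print(summarize_chunk_lengths(sample))
```

In the notebook, calling `summarize_chunk_lengths(chunks)` right after chunking gives a quick view of how many very short or oversized chunks were produced.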

---

## Step 5.5: Validate & Score Chunk Coherence
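*Not all text extracted from PDFs is useful (e.g., garbled OCR, random numbers). We use `textstat` to compute a Flesch Reading Ease score for each chunk as a rough proxy for coherence. Chunks shorter than 50 characters or with a non-positive score are dropped, and the normalized score is attached to the rest. Dense medical prose can legitimately score low on this metric, so treat the threshold as a starting point rather than a fixed rule.*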
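
The cell for this step scores chunks with `textstat.flesch_reading_ease`. For reference, the Flesch Reading Ease score combines average sentence length with average syllables per word. The sketch below applies the standard formula to hand-picked counts. The counts are illustrative, and textstat's own syllable counting may give somewhat different values on real text. It shows why long, polysyllabic sentences can land at or below zero:

```python
def flesch_from_counts(words, sentences, syllables):
    """Standard Flesch Reading Ease formula computed from raw counts."""
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)

# One 20-word sentence averaging 1.5 syllables per word (plain prose)
plain = flesch_from_counts(words=20, sentences=1, syllables=30)
# One 20-word sentence averaging 2.5 syllables per word (dense terminology)
dense = flesch_from_counts(words=20, sentences=1, syllables=50)

print(round(plain, 3), round(dense, 3))
```

The second case falls below zero, so a chunk like it would be removed by a `raw_score > 0` filter even if the text is perfectly valid.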

```python
import textstat
import hashlib
import uuid

def score_and_filter_chunks(chunks):
    valid_chunks = []
    
    for chunk in chunks:
        text = chunk["text"]
        
        # 1. Reject chunks that are too small to have context
        if len(text.strip()) < 50:
            continue
            
        # 2. Calculate Coherence / Readability Score (Flesch Reading Ease)
        raw_score = textstat.flesch_reading_ease(text)
        
        # Keep only chunks with a positive score, and normalize it between 0.0 and 1.0
        if raw_score > 0:
            normalized_score = min(1.0, raw_score / 100.0)
            chunk["metadata"]["coherence_score"] = round(normalized_score, 4)
            
            # 3. Generate a deterministic ID from the text so that re-running ingestion
            #    overwrites the same point instead of adding a duplicate
            deterministic_id = hashlib.md5(text.encode('utf-8')).hexdigest()
            chunk["chunk_id"] = str(uuid.UUID(deterministic_id))
            
            valid_chunks.append(chunk)
            
    return valid_chunks

print("Validating chunks and calculating coherence scores...")
scored_chunks = score_and_filter_chunks(chunks)
print(f"Kept {len(scored_chunks)} chunks (filtered out {len(chunks) - len(scored_chunks)} short or low-scoring chunks).")

# Replace chunks with our scored and filtered list
chunks = scored_chunks
```
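
The deterministic ID above depends only on the chunk text, so identical passages from two different files map to the same Qdrant point and the later upsert overwrites the earlier one. If that is not wanted, one option is to include the source name in the hashed value. A sketch using `uuid.uuid5` from the standard library follows. The helper name, the namespace choice, and the file names are illustrative assumptions:

```python
import uuid

def make_chunk_id(source, text):
    """Derive a stable UUID from the source file name and the chunk text."""
    # NAMESPACE_URL is an arbitrary but fixed namespace; any constant UUID works
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}\n{text}"))

# Hypothetical file names, same passage in both
id_a = make_chunk_id("guideline_a.pdf", "Aspirin inhibits platelet aggregation.")
id_b = make_chunk_id("guideline_b.pdf", "Aspirin inhibits platelet aggregation.")

# Same inputs always give the same ID; different sources give different IDs
assert id_a == make_chunk_id("guideline_a.pdf", "Aspirin inhibits platelet aggregation.")
assert id_a != id_b
```

Whether cross-file deduplication or per-file points is preferable depends on how the `source` field is used at retrieval time.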

---

## Step 6: Initialize MedCPT Article Encoder (Using GPU)
*This step loads `ncbi/MedCPT-Article-Encoder` onto a Kaggle T4 GPU. MedCPT was trained on PubMed search logs, which makes it well suited to biomedical retrieval.*

```python
print("Loading MedCPT Article Encoder onto the GPU...")
# model_kwargs={'device': 'cuda'} places the model on a single GPU (cuda:0);
# the second T4 is not used by this call
embeddings_model = HuggingFaceEmbeddings(
    model_name="ncbi/MedCPT-Article-Encoder", 
    model_kwargs={'device': 'cuda'} 
)

# MedCPT outputs 768-dimensional vectors
VECTOR_SIZE = 768 
```
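
One caveat with the wrapper above: `HuggingFaceEmbeddings` loads the checkpoint through sentence-transformers, which usually falls back to mean pooling when a repository ships no sentence-transformers configuration, whereas the MedCPT model card takes the `[CLS]` token embedding. Whether that difference matters for your retrieval quality is worth checking. The sketch below follows the model card's approach with `transformers` directly. The helper names (`batched`, `load_encoder`, `embed_texts`) are hypothetical, and the heavy imports are kept inside the functions so the helpers can be defined without a GPU:

```python
def batched(items, size):
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]

def load_encoder(model_name="ncbi/MedCPT-Article-Encoder", device="cuda"):
    """Load tokenizer and model; requires the transformers and torch packages."""
    from transformers import AutoModel, AutoTokenizer
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    return tokenizer, model

def embed_texts(texts, tokenizer, model, device="cuda", batch_size=64):
    """Embed texts using the [CLS] token of the last hidden layer."""
    import torch
    vectors = []
    with torch.no_grad():
        for batch in batched(texts, batch_size):
            encoded = tokenizer(
                batch, truncation=True, padding=True,
                max_length=512, return_tensors="pt",
            ).to(device)
            # Position 0 of each sequence is the [CLS] token
            cls = model(**encoded).last_hidden_state[:, 0, :]
            vectors.extend(cls.cpu().tolist())
    return vectors
```

Vectors produced this way are not interchangeable with mean-pooled ones, so whichever pooling is chosen should be used consistently for both ingestion and querying.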

---

## Step 7: Push Embeddings to Qdrant Cloud
*This script embeds the chunks and pushes them over the internet directly to your Qdrant Cloud cluster.*

```python
# Initialize Qdrant Client connected to your Cloud cluster
client = QdrantClient(
    url=QDRANT_URL, 
    api_key=QDRANT_API_KEY
)

# Create the collection if it doesn't exist
if not client.collection_exists(COLLECTION_NAME):
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )
    print(f"Created new collection: {COLLECTION_NAME}")

print("Embedding and pushing to Qdrant in batches...")
BATCH_SIZE = 64  # Chunks embedded and upserted per loop iteration

for i in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[i:i + BATCH_SIZE]
    
    # Generate embeddings using MedCPT Article Encoder
    texts = [item["text"] for item in batch]
    batch_embeddings = embeddings_model.embed_documents(texts)
    
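    points = []
    for item, vector in zip(batch, batch_embeddings):
        # Keep header fields and the score under separate payload keys
        headers = {k: v for k, v in item["metadata"].items() if k != "coherence_score"}
        points.append(
            PointStruct(
                id=item["chunk_id"],
                vector=vector,
                payload={
                    "source": item["source"],
                    "text": item["text"],
                    "headers": headers,
                    "coherence_score": item["metadata"].get("coherence_score", 0)
                }
            )
        )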
    
    client.upsert(collection_name=COLLECTION_NAME, points=points)
    print(f"Pushed chunks {i} to {i + len(batch)} / {len(chunks)}...")

print("✅ Data Ingestion Pipeline Complete! Your vectors are now live in Qdrant Cloud.")
```
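
Since each batch travels over the network, a transient failure would otherwise abort the whole loop. A minimal sketch of a retry wrapper with exponential backoff, using only the standard library. `with_retries` and its parameters are hypothetical, and which exceptions are worth retrying depends on your client version:

```python
import time

def with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay * 2**n seconds and try again."""
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            # Re-raise once the final attempt has failed
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)

# Usage inside the batch loop (assumes `client`, `COLLECTION_NAME`, and `points` exist):
# with_retries(lambda: client.upsert(collection_name=COLLECTION_NAME, points=points))
```

Because the chunk IDs are deterministic, repeating an upsert for the same batch rewrites the same points rather than creating duplicates.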

---

## What to do AFTER Ingestion? (Merging into Local Directory)

Because Qdrant is hosted in the Cloud, **you do not need to download or merge any database files back into your local directory!** The vectors are available to any client that connects with your cluster URL and API key.

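However, you **must update your local backend project** to embed queries with the matching `MedCPT-Query-Encoder`. MedCPT uses separate encoders for queries and articles, and search only works properly when the pair is used together.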

1. **Update your `.env` file** in your local project to swap the embedding model:
   ```env
   # Change embedding model from qwen3-embedding to MedCPT Query Encoder
   OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
   GW_OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
   AKS_OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
   ```
   *(Note: these variable names refer to Ollama, but `ncbi/MedCPT-Query-Encoder` is a Hugging Face model ID. Either make the model available to Ollama locally, or configure your backend `retrieval.py` to load the Query Encoder with `HuggingFaceEmbeddings` instead of Ollama.)*

2. **Refactor the Retrieval Layer**: Once ingestion is complete, ask your AI assistant to update `src/retrieval.py` so that it searches Qdrant through `qdrant-client` with the new `MedCPT-Query-Encoder`, replacing the old PostgreSQL `pgvector` code.
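
As a rough illustration of what the refactored retrieval could look like, the sketch below embeds a question and queries the collection. It assumes a recent `qdrant-client` that provides `query_points`, and an `embed_query` callable (for example the `embed_query` method of a `HuggingFaceEmbeddings` instance configured with `ncbi/MedCPT-Query-Encoder`). The function name `search_knowledge_base` is hypothetical and not taken from your `src/retrieval.py`:

```python
def search_knowledge_base(client, embed_query, question,
                          collection_name="medical_knowledge_base", limit=5):
    """Embed the question and return the top matching chunks with their scores."""
    query_vector = embed_query(question)
    response = client.query_points(
        collection_name=collection_name,
        query=query_vector,
        limit=limit,
        with_payload=True,
    )
    # Each hit carries the payload written during ingestion
    return [
        {
            "score": point.score,
            "source": point.payload.get("source"),
            "text": point.payload.get("text"),
        }
        for point in response.points
    ]
```

Passing the client and the embedding function in as arguments keeps the function easy to test with stand-ins before it is wired to the real cluster.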