# Kaggle Data Ingestion Guide (T4 x2 GPUs)

This guide provides the exact steps and Python code to run your data ingestion pipeline on Kaggle, using **Docling** for extraction, **Markdown + Recursive Chunking**, and **MedCPT** for embeddings, then pushing directly to your **Qdrant Cloud** cluster.

## Step 1: Kaggle Notebook Setup

1. Create a new notebook on Kaggle.
2. Go to **Settings** (right-side panel) -> **Accelerator** -> Select **GPU T4 x2**.
3. Turn on **Internet Access** in the settings.
4. Upload your medical documents (PDFs, docs) to the Kaggle notebook by clicking **Add Data** -> **Upload**.

---

## Step 2: Install Required Libraries

*Run this in the first cell of your Kaggle notebook:*

```python
!pip install -q docling langchain langchain-text-splitters langchain-community langchain-huggingface qdrant-client sentence-transformers textstat
```

---

## Step 3: Import Libraries & Configure Environment

*Run this in the second cell. Replace the Qdrant API key placeholder with the actual credential from your `.env` file.*

```python
import os
import uuid
from pathlib import Path

from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct

# Configuration
QDRANT_URL = "https://e4f37189-cb62-4a77-a55e-1c9d98082be7.eu-west-2-0.aws.cloud.qdrant.io:6333"
QDRANT_API_KEY = "YOUR_QDRANT_API_KEY"  # Paste from your .env
COLLECTION_NAME = "medical_knowledge_base"

# Your uploaded dataset path on Kaggle (change this based on your dataset name)
DATA_DIR = "/kaggle/input/your-medical-dataset-name"
```

---

## Step 4: Extract Data using Docling

*Docling extracts text, tables, and document structure from PDFs. The loop below only picks up `.pdf` files, so any other formats uploaded in Step 1 are skipped.*

```python
def extract_documents(data_dir):
    converter = DocumentConverter()
    extracted_docs = []

    # Iterate through all PDFs in your Kaggle dataset
    for filepath in Path(data_dir).glob("**/*.pdf"):
        print(f"Extracting: {filepath.name}")
        result = converter.convert(str(filepath))

        # Export Docling result to Markdown format
        markdown_content = result.document.export_to_markdown()

        extracted_docs.append({
            "source": filepath.name,
            "content": markdown_content
        })

    return extracted_docs

print("Starting Document Extraction...")
docs = extract_documents(DATA_DIR)
print(f"Successfully extracted {len(docs)} documents.")
```

---

## Step 5: Advanced Semantic Chunking (Markdown + Recursive)

*We first split each document logically by its Markdown headers, then split any oversized section into pieces of at most 512 characters with a 64-character overlap. `RecursiveCharacterTextSplitter` measures length in characters by default, so these chunks stay well under MedCPT's 512-token input limit.*

```python
def chunk_documents(docs):
    # 1. Split logically by Markdown headers
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)

    # 2. Character-based splitting to cap the chunk size (512 characters, not tokens)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ".", " ", ""]
    )

    chunks = []
    for doc in docs:
        # Split by headers
        md_splits = markdown_splitter.split_text(doc["content"])

        # Further split chunks that are too large
        for md_split in md_splits:
            final_splits = text_splitter.split_text(md_split.page_content)

            for split in final_splits:
                chunks.append({
                    "chunk_id": str(uuid.uuid4()),
                    "source": doc["source"],
                    "text": split,
                    # Copy the header info so each chunk owns its own metadata dict
                    "metadata": dict(md_split.metadata)
                })

    return chunks

print("Chunking documents...")
chunks = chunk_documents(docs)
print(f"Created {len(chunks)} raw chunks.")
```

---

## Step 5.5: Validate & Score Chunk Coherence

*Not all text extracted from PDFs is useful (e.g., garbled OCR, random numbers). We use `textstat` to compute a readability score (Flesch Reading Ease) for each chunk as a rough proxy for coherence.
We filter out chunks that are too short or that score at or below zero, and attach the score to the ones we keep. Dense medical prose can legitimately score below zero, so consider relaxing the threshold if too many valid chunks are dropped.*

```python
import textstat
import hashlib
import uuid

def score_and_filter_chunks(chunks):
    valid_chunks = []

    for chunk in chunks:
        text = chunk["text"]

        # 1. Reject chunks that are too small to have context
        if len(text.strip()) < 50:
            continue

        # 2. Calculate a readability score (Flesch Reading Ease) as a coherence proxy
        raw_score = textstat.flesch_reading_ease(text)

        # Keep only chunks with a positive score, and normalize it between 0.0 and 1.0
        if raw_score > 0:
            normalized_score = min(1.0, raw_score / 100.0)
            # Build a new dict: chunks cut from the same section may share one
            # metadata object, and writing into it would overwrite their scores
            chunk["metadata"] = {
                **chunk["metadata"],
                "coherence_score": round(normalized_score, 4),
            }

            # 3. Derive a deterministic ID from the text so that re-runs and
            #    duplicate chunks map to the same Qdrant point
            deterministic_id = hashlib.md5(text.encode("utf-8")).hexdigest()
            chunk["chunk_id"] = str(uuid.UUID(deterministic_id))

            valid_chunks.append(chunk)

    return valid_chunks

print("Validating chunks and calculating coherence scores...")
scored_chunks = score_and_filter_chunks(chunks)
print(f"Kept {len(scored_chunks)} chunks (filtered out {len(chunks) - len(scored_chunks)}).")

# Replace chunks with our scored and filtered list
chunks = scored_chunks
```

---

## Step 6: Initialize MedCPT Article Encoder (Using GPU)

*This step loads the `ncbi/MedCPT-Article-Encoder` onto a Kaggle T4 GPU.
This model was trained on PubMed search logs and is meant for embedding biomedical articles; its companion Query Encoder handles search queries.*

```python
print("Loading MedCPT Article Encoder onto the GPU...")

# model_kwargs={'device': 'cuda'} places the model on a single GPU (cuda:0);
# the second T4 stays idle unless you add multi-GPU handling yourself.
# Note: if the checkpoint ships without a sentence-transformers config, the
# library falls back to mean pooling, whereas the MedCPT model card uses the
# [CLS] vector. Whichever pooling is used here must also be used for queries.
embeddings_model = HuggingFaceEmbeddings(
    model_name="ncbi/MedCPT-Article-Encoder",
    model_kwargs={'device': 'cuda'}
)

# MedCPT outputs 768 dimensional vectors
VECTOR_SIZE = 768
```

---

## Step 7: Push Embeddings to Qdrant Cloud

*This script embeds the chunks and pushes them over the internet directly to your Qdrant Cloud cluster.*

```python
# Initialize Qdrant Client connected to your Cloud cluster
client = QdrantClient(
    url=QDRANT_URL,
    api_key=QDRANT_API_KEY
)

# Create the collection if it doesn't exist
if not client.collection_exists(COLLECTION_NAME):
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )
    print(f"Created new collection: {COLLECTION_NAME}")

print("Embedding and pushing to Qdrant in batches...")

BATCH_SIZE = 64  # Chunks embedded and upserted per request

for i in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[i:i + BATCH_SIZE]

    # Generate embeddings using MedCPT Article Encoder
    texts = [item["text"] for item in batch]
    batch_embeddings = embeddings_model.embed_documents(texts)

    # Pair each chunk with its vector and build the Qdrant points
    points = [
        PointStruct(
            id=item["chunk_id"],
            vector=vector,
            payload={
                "source": item["source"],
                "text": item["text"],
                "headers": item["metadata"],
                "coherence_score": item["metadata"].get("coherence_score", 0)
            }
        )
        for item, vector in zip(batch, batch_embeddings)
    ]

    client.upsert(collection_name=COLLECTION_NAME, points=points)
    print(f"Pushed chunks {i} to {i + len(batch)} / {len(chunks)}...")

print("✅ Data Ingestion Pipeline Complete! Your vectors are now live in Qdrant Cloud.")
```

---

## What to do AFTER Ingestion? (Merging into Local Directory)

Because Qdrant is hosted in the cloud, **you do not need to download or merge any database files back into your local directory!** The vectors are immediately available to any client that has the cluster URL and API key.

However, you **must update your local backend project** to use the matching `MedCPT-Query-Encoder`, so that queries are embedded into the same vector space as the indexed articles.

1. **Update your `.env` file** in your local project to swap the embedding model:

   ```env
   # Change embedding model from qwen3-embedding to MedCPT Query Encoder
   OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
   GW_OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
   AKS_OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
   ```

   *(Note: You will also need to make this model available locally, either by pulling it through Ollama or Hugging Face, or by configuring your backend `retrieval.py` to use `HuggingFaceEmbeddings` instead of Ollama for the Query Encoder.)*

2. **Refactor the Retrieval Layer**: Once ingestion is complete, ask your AI assistant to update `src/retrieval.py` so that it queries Qdrant through `qdrant-client` with the new `MedCPT-Query-Encoder`, replacing the old PostgreSQL `pgvector` code.
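
As a rough sketch of what the refactored retrieval step might look like (an assumption for illustration, not the project's actual `src/retrieval.py`), the hypothetical helpers `make_query_encoder` and `search_chunks` below embed a question with `ncbi/MedCPT-Query-Encoder` through `HuggingFaceEmbeddings`, mirroring how the Article Encoder was loaded during ingestion so both sides use the same pooling. The search itself goes through `query_points`, which is available in recent `qdrant-client` releases. The collection name and vector size are taken from the ingestion notebook; everything else is illustrative.

```python
from typing import Any, Callable, Dict, List

COLLECTION_NAME = "medical_knowledge_base"  # same collection as the ingestion notebook
VECTOR_SIZE = 768                           # MedCPT embedding size used at ingestion


def make_query_encoder(device: str = "cpu") -> Callable[[str], List[float]]:
    """Hypothetical helper: load the MedCPT Query Encoder the same way the
    Article Encoder was loaded during ingestion."""
    # Imported lazily so the search helper below has no heavy dependencies.
    from langchain_huggingface import HuggingFaceEmbeddings

    model = HuggingFaceEmbeddings(
        model_name="ncbi/MedCPT-Query-Encoder",
        model_kwargs={"device": device},
    )
    return model.embed_query


def search_chunks(
    client: Any,
    encode_query: Callable[[str], List[float]],
    query: str,
    limit: int = 5,
) -> List[Dict[str, Any]]:
    """Hypothetical helper: embed the query and return the closest chunks."""
    vector = encode_query(query)
    if len(vector) != VECTOR_SIZE:
        raise ValueError(f"Expected {VECTOR_SIZE} dimensions, got {len(vector)}")

    # `client` is expected to be a qdrant_client.QdrantClient
    response = client.query_points(
        collection_name=COLLECTION_NAME,
        query=vector,
        limit=limit,
        with_payload=True,
    )

    # Keep only the payload fields written by the ingestion pipeline
    return [
        {
            "score": point.score,
            "source": point.payload.get("source"),
            "text": point.payload.get("text"),
        }
        for point in response.points
    ]
```

In a backend this could be wired up with `client = QdrantClient(url=QDRANT_URL, api_key=QDRANT_API_KEY)` and `encode = make_query_encoder()`, followed by `search_chunks(client, encode, "your question")`. Passing the encoder in as a plain function keeps the search logic easy to swap or test without loading the model.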