
Kaggle Data Ingestion Guide (T4 x2 GPUs)

This guide walks through the steps and Python code for running your data ingestion pipeline on Kaggle: Docling for extraction, Markdown-header plus recursive chunking, MedCPT for embeddings, and direct upload to your Qdrant Cloud cluster.

Step 1: Kaggle Notebook Setup

  1. Create a new notebook on Kaggle.
  2. Go to Settings (right-side panel) -> Accelerator -> Select GPU T4 x2.
  3. Turn on Internet Access in the settings.
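  4. Upload your medical documents to the Kaggle notebook by clicking Add Data -> Upload. The extraction code in Step 4 reads only .pdf files, so convert other formats to PDF first or extend the glob pattern there.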

Step 2: Install Required Libraries

Run this in the first cell of your Kaggle notebook:

!pip install -q "docling" langchain langchain-community langchain-huggingface qdrant-client sentence-transformers textstat

Step 3: Import Libraries & Configure Environment

Run this in the second cell. Replace the Qdrant API Key with your actual credential from your .env file.

import os
from pathlib import Path
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
import uuid

# Configuration
QDRANT_URL = "https://e4f37189-cb62-4a77-a55e-1c9d98082be7.eu-west-2-0.aws.cloud.qdrant.io:6333"
QDRANT_API_KEY = "YOUR_QDRANT_API_KEY" # Paste from your .env
COLLECTION_NAME = "medical_knowledge_base"

# Your uploaded dataset path on Kaggle (change this based on your dataset name)
DATA_DIR = "/kaggle/input/your-medical-dataset-name" 
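
Pasting the key into a notebook cell makes it easy to leak when the notebook is shared. A minimal sketch of reading it from an environment mapping instead is shown here; the helper name `resolve_api_key` and the variable name `QDRANT_API_KEY` are assumptions for illustration, and on Kaggle the value could come from the notebook's Secrets add-on rather than `os.environ`.

```python
import os

# Same placeholder string used in the configuration cell above.
PLACEHOLDER = "YOUR_QDRANT_API_KEY"


def resolve_api_key(env=None, var_name="QDRANT_API_KEY"):
    """Return the API key from an environment-like mapping, or raise.

    `var_name` is an assumed name; use whatever your .env file defines.
    """
    env = os.environ if env is None else env
    key = (env.get(var_name) or "").strip()
    # Fail early with a clear message instead of a later authentication error.
    if not key or key == PLACEHOLDER:
        raise ValueError(f"{var_name} is not set; add it before connecting to Qdrant")
    return key
```

With this in place, the configuration cell could use `QDRANT_API_KEY = resolve_api_key()` so the real key never appears in the notebook source.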

Step 4: Extract Data using Docling

Docling extracts text, tables, and document structure from PDFs and can export the result as Markdown, which is what the chunking step consumes.

def extract_documents(data_dir):
    converter = DocumentConverter()
    extracted_docs = []
    
    # Iterate through all PDFs in your Kaggle dataset
    for filepath in Path(data_dir).glob("**/*.pdf"):
        print(f"Extracting: {filepath.name}")
        result = converter.convert(str(filepath))
        
        # Export Docling result to Markdown format
        markdown_content = result.document.export_to_markdown()
        extracted_docs.append({
            "source": filepath.name,
            "content": markdown_content
        })
    return extracted_docs

print("Starting Document Extraction...")
docs = extract_documents(DATA_DIR)
print(f"Successfully extracted {len(docs)} documents.")
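
Because the loop above matches only `**/*.pdf`, any other file in the dataset is silently ignored. The sketch below, using only the standard library, counts files by extension so you can see what would be skipped before extraction starts. The function names are hypothetical, and the comparison is case-sensitive on purpose, since on Linux a file named `REPORT.PDF` is not matched by a `*.pdf` glob.

```python
from collections import Counter
from pathlib import Path


def inventory_by_suffix(data_dir):
    """Count regular files under data_dir, grouped by exact extension."""
    counts = Counter()
    for path in Path(data_dir).rglob("*"):
        if path.is_file():
            counts[path.suffix or "(no extension)"] += 1
    return counts


def skipped_by_pdf_glob(counts):
    """Extensions that a case-sensitive "**/*.pdf" glob would not pick up."""
    return {ext: n for ext, n in counts.items() if ext != ".pdf"}
```

Running `skipped_by_pdf_glob(inventory_by_suffix(DATA_DIR))` before Step 4 gives a quick list of formats that need converting or a wider glob.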

Step 5: Advanced Semantic Chunking (Markdown + Recursive)

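We first split each document by its Markdown headers, then cut the resulting sections into chunks of at most 512 characters with a 64-character overlap. RecursiveCharacterTextSplitter measures length in characters by default, not tokens, so these chunks stay comfortably below the encoder's 512-token input limit.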

def chunk_documents(docs):
    # 1. Split logically by Markdown headers
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    
    # 2. Strict character splitting to guarantee sizing
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    
    chunks = []
    for doc in docs:
        # Split by headers
        md_splits = markdown_splitter.split_text(doc["content"])
        
        # Further split chunks that are too large
        for md_split in md_splits:
            final_splits = text_splitter.split_text(md_split.page_content)
            for split in final_splits:
                chunks.append({
                    "chunk_id": str(uuid.uuid4()),
                    "source": doc["source"],
                    "text": split,
                    # Copy the header metadata so each chunk owns its own dict;
                    # Step 5.5 writes a per-chunk score into it.
                    "metadata": dict(md_split.metadata)
                })
    return chunks

print("Chunking documents...")
chunks = chunk_documents(docs)
print(f"Created {len(chunks)} raw chunks.")
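
To make the two stages concrete, here is a simplified standard-library sketch of the same idea: group text under its active headers, then cut each section into overlapping character windows. It is not LangChain's algorithm (the real recursive splitter prefers to break on the listed separators, and this version would misread `#` lines inside code fences as headers); the function names are hypothetical.

```python
import re

HEADER_RE = re.compile(r"^(#{1,3})\s+(.*)$")


def split_by_headers(markdown):
    """Group lines into sections, recording the active #, ##, ### titles."""
    sections, active, buf = [], {}, []

    def flush():
        text = "\n".join(buf).strip()
        if text:
            meta = {f"Header {lvl}": title for lvl, title in sorted(active.items())}
            sections.append({"metadata": meta, "text": text})
        buf.clear()

    for line in markdown.splitlines():
        match = HEADER_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # A new header invalidates any same-level or deeper headers.
            for stale in [lvl for lvl in active if lvl >= level]:
                del active[stale]
            active[level] = match.group(2).strip()
        else:
            buf.append(line)
    flush()
    return sections


def window(text, size=512, overlap=64):
    """Fixed-width character windows; each starts (size - overlap) after the last."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

The overlap means the last 64 characters of one window reappear at the start of the next, so a sentence cut at a boundary is still seen whole in at least one chunk most of the time.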

Step 5.5: Validate & Score Chunk Coherence

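Not all text extracted from PDFs is useful (e.g., garbled OCR, runs of numbers). We use textstat's Flesch Reading Ease as a rough proxy for coherence: chunks that are too short or that score at or below zero are dropped, and the normalized score is attached to the rest. Flesch Reading Ease measures readability rather than coherence, and dense medical prose can legitimately score low, so spot-check what the filter removes.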
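
For intuition about the score, the standard Flesch Reading Ease formula can be written out directly. This sketch takes word, sentence, and syllable counts as inputs; textstat does its own counting, so its results on real text will differ in detail. The example counts in the comments are made up for illustration.

```python
def flesch_reading_ease(words, sentences, syllables):
    """Standard Flesch Reading Ease from raw counts.

    Higher is easier to read. Long sentences and long words both lower the
    score, and there is no floor, so the result can go below zero.
    """
    return 206.835 - 1.015 * (words / sentences) - 84.6 * (syllables / words)


# Hypothetical counts: one short sentence of mostly one-syllable words.
plain = flesch_reading_ease(words=10, sentences=1, syllables=13)

# Hypothetical counts: one long sentence averaging three syllables per word,
# which is plausible for text full of drug and procedure names.
dense = flesch_reading_ease(words=20, sentences=1, syllables=60)
```

With these counts `plain` is strongly positive while `dense` is negative, which is why a `raw_score > 0` filter can discard valid but terminology-heavy chunks along with genuine OCR noise.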

import textstat
import hashlib
import uuid

def score_and_filter_chunks(chunks):
    valid_chunks = []
    
    for chunk in chunks:
        text = chunk["text"]
        
        # 1. Reject chunks that are too small to have context
        if len(text.strip()) < 50:
            continue
            
        # 2. Calculate Coherence / Readability Score (Flesch Reading Ease)
        raw_score = textstat.flesch_reading_ease(text)
        
        # Keep only chunks with a positive score, and normalize it between 0.0 and 1.0
        if raw_score > 0:
            normalized_score = min(1.0, raw_score / 100.0)
            chunk["metadata"]["coherence_score"] = round(normalized_score, 4)
            
            # 3. Generate a deterministic ID from the text so re-running ingestion
            #    overwrites the same point instead of creating a duplicate
            deterministic_id = hashlib.md5(text.encode('utf-8')).hexdigest()
            chunk["chunk_id"] = str(uuid.UUID(deterministic_id))
            
            valid_chunks.append(chunk)
            
    return valid_chunks

print("Validating chunks and calculating coherence scores...")
scored_chunks = score_and_filter_chunks(chunks)
print(f"Kept {len(scored_chunks)} highly coherent chunks (Filtered out {len(chunks) - len(scored_chunks)} bad chunks).")

# Replace chunks with our scored and filtered list
chunks = scored_chunks
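
The ID scheme above hashes only the chunk text, so identical text in two different files maps to the same ID, and the later upsert replaces the earlier point together with its `source` field. The sketch below reproduces the text-only ID and shows a hypothetical source-aware variant built on `uuid.uuid5`; whether you want cross-file deduplication or per-file points is a design choice, not something the guide prescribes.

```python
import hashlib
import uuid


def md5_chunk_id(text):
    """Text-only ID, as in the step above: same text always gives the same UUID."""
    return str(uuid.UUID(hashlib.md5(text.encode("utf-8")).hexdigest()))


def source_aware_chunk_id(source, text):
    """Hypothetical variant: the same text in different files gets different IDs."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}\n{text}"))
```

Both functions are deterministic, so re-running the pipeline on unchanged inputs keeps the collection size stable in either case.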

Step 6: Initialize MedCPT Article Encoder (Using GPU)

This step loads ncbi/MedCPT-Article-Encoder onto a Kaggle GPU. MedCPT is a biomedical retrieval model trained contrastively on PubMed search logs, with separate encoders for articles and for queries.

print("Loading MedCPT Article Encoder onto the GPU...")
# model_kwargs={'device': 'cuda'} places the model on the default CUDA device,
# so only one of the two T4s is used unless multi-GPU encoding is configured
embeddings_model = HuggingFaceEmbeddings(
    model_name="ncbi/MedCPT-Article-Encoder", 
    model_kwargs={'device': 'cuda'} 
)

# MedCPT outputs 768-dimensional vectors
VECTOR_SIZE = 768 
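
One detail worth checking when loading MedCPT through a sentence-transformers wrapper is the pooling strategy. The MedCPT model card's usage example takes the vector of the first ([CLS]) token, while sentence-transformers falls back to mean pooling over tokens when a checkpoint ships no sentence-transformers configuration. Whether that fallback applies here is an assumption to verify from the load-time log. The toy sketch below, in plain Python with made-up two-dimensional vectors, only illustrates that the two strategies produce different embeddings from the same token outputs.

```python
def cls_pool(token_vectors):
    """Use the first token's vector as the embedding for the whole text."""
    return list(token_vectors[0])


def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, skipping padding (mask value 0)."""
    kept = [vec for vec, keep in zip(token_vectors, attention_mask) if keep]
    dim = len(kept[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]


# Toy token outputs: two real tokens followed by one padding position.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
mask = [1, 1, 0]
```

Whichever strategy ends up in use, it needs to be the same at ingestion time and at query time, otherwise stored vectors and query vectors are not comparable.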

Step 7: Push Embeddings to Qdrant Cloud

This script embeds the chunks and pushes them over the internet directly to your Qdrant Cloud cluster.

# Initialize Qdrant Client connected to your Cloud cluster
client = QdrantClient(
    url=QDRANT_URL, 
    api_key=QDRANT_API_KEY
)

# Create the collection if it doesn't exist
if not client.collection_exists(COLLECTION_NAME):
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )
    print(f"Created new collection: {COLLECTION_NAME}")

print("Embedding and pushing to Qdrant in batches...")
BATCH_SIZE = 64 # Chunks embedded and upserted per request; lower it if requests time out

for i in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[i:i + BATCH_SIZE]
    
    # Generate embeddings using MedCPT Article Encoder
    texts = [item["text"] for item in batch]
    batch_embeddings = embeddings_model.embed_documents(texts)
    
    points = []
    for j, item in enumerate(batch):
        points.append(
            PointStruct(
                id=item["chunk_id"],
                vector=batch_embeddings[j],
                payload={
                    "source": item["source"],
                    "text": item["text"],
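                    # Header titles only; the score is stored in its own field below
                    "headers": {k: v for k, v in item["metadata"].items() if k != "coherence_score"},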
                    "coherence_score": item["metadata"].get("coherence_score", 0)
                }
            )
        )
    
    client.upsert(collection_name=COLLECTION_NAME, points=points)
    print(f"Pushed chunks {i} to {i + len(batch)} / {len(chunks)}...")

print("✅ Data Ingestion Pipeline Complete! Your vectors are now live in Qdrant Cloud.")
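
Since every batch travels over the network, a single transient failure would otherwise abort the whole loop. A minimal retry sketch with exponential backoff is shown below; `call_with_retries` is a hypothetical helper, and catching `Exception` is deliberately broad for illustration, so narrow it to the errors you actually see.

```python
import time


def call_with_retries(fn, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fn(); on failure wait base_delay, then 2x, 4x, ... and try again.

    The last failure is re-raised once `attempts` calls have been made.
    `sleep` is injectable so the waiting can be replaced in tests.
    """
    for attempt in range(1, attempts + 1):
        try:
            return fn()
        except Exception:
            if attempt == attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
```

Inside the batch loop the upsert could then be wrapped as `call_with_retries(lambda: client.upsert(collection_name=COLLECTION_NAME, points=points))`. Because the chunk IDs are deterministic, retrying a batch that partly succeeded does not create duplicates.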

What to do AFTER Ingestion? (Merging into Local Directory)

Because Qdrant is hosted in the cloud, you do not need to download or merge any database files into your local directory. The vectors are available to any client that connects with the cluster URL and API key.

However, you must update your local backend project to use the matching MedCPT-Query-Encoder so it can search properly.

  1. Update your .env file in your local project to swap the embedding model:

    # Change embedding model from qwen3-embedding to MedCPT Query Encoder
    OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
    GW_OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
    AKS_OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
    

    (Note: The model must also be available to your backend, either served locally through Ollama or loaded from Hugging Face, for example by configuring retrieval.py to use HuggingFaceEmbeddings for the Query Encoder instead of Ollama.)

  2. Refactor the Retrieval Layer: Once the ingestion is complete, inform your AI assistant so it can update src/retrieval.py to query Qdrant through qdrant-client with the new MedCPT-Query-Encoder, replacing the old PostgreSQL pgvector code.
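
To show what the retrieval layer will be asking Qdrant to do, here is a brute-force sketch of cosine ranking in plain Python. Qdrant performs this server-side with an approximate index, so this is only a mental model; the point layout (`vector` plus `payload`) mirrors the ingestion step, and the toy two-dimensional vectors stand in for the 768-dimensional ones. In the real system the query vector would come from the Query Encoder and the stored vectors from the Article Encoder.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_k(query_vector, points, k=3):
    """Return the k (score, payload) pairs most similar to the query, best first."""
    scored = [(cosine(query_vector, p["vector"]), p["payload"]) for p in points]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Toy stand-ins for stored points.
points = [
    {"vector": [1.0, 0.0], "payload": {"source": "a.pdf"}},
    {"vector": [0.0, 1.0], "payload": {"source": "b.pdf"}},
    {"vector": [1.0, 1.0], "payload": {"source": "c.pdf"}},
]
```

Calling `top_k([1.0, 0.1], points, k=2)` ranks the point from `a.pdf` first and the one from `c.pdf` second, since their directions are closest to the query's.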