Kaggle Data Ingestion Guide (T4 x2 GPUs)
This guide provides the exact steps and Python code to run your data ingestion pipeline on Kaggle using Docling for extraction, Markdown + Recursive Chunking, MedCPT for embeddings, and pushing directly to your Qdrant Cloud cluster.
Step 1: Kaggle Notebook Setup
- Create a new notebook on Kaggle.
- Go to Settings (right-side panel) -> Accelerator -> Select GPU T4 x2.
- Turn on Internet Access in the settings.
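- Upload your medical documents to the Kaggle notebook by clicking Add Data -> Upload. The extraction code in Step 4 only picks up files ending in .pdf, so convert other formats first or extend the file pattern.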
Step 2: Install Required Libraries
Run this in the first cell of your Kaggle notebook:
!pip install -q "docling" langchain langchain-community langchain-huggingface qdrant-client sentence-transformers textstat
Step 3: Import Libraries & Configure Environment
Run this in the second cell. Replace the Qdrant API Key with your actual credential from your .env file.
import os
from pathlib import Path
from docling.document_converter import DocumentConverter
from langchain_text_splitters import MarkdownHeaderTextSplitter, RecursiveCharacterTextSplitter
from langchain_huggingface import HuggingFaceEmbeddings
from qdrant_client import QdrantClient
from qdrant_client.models import VectorParams, Distance, PointStruct
import uuid
# Configuration
QDRANT_URL = "https://e4f37189-cb62-4a77-a55e-1c9d98082be7.eu-west-2-0.aws.cloud.qdrant.io:6333"
QDRANT_API_KEY = "YOUR_QDRANT_API_KEY" # Paste from your .env
COLLECTION_NAME = "medical_knowledge_base"
# Your uploaded dataset path on Kaggle (change this based on your dataset name)
DATA_DIR = "/kaggle/input/your-medical-dataset-name"
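Pasting the key into a cell makes it easy to leak when the notebook is shared or made public. One alternative is Kaggle's Secrets add-on. The sketch below assumes you stored the key under the label QDRANT_API_KEY (an assumed label) and falls back to an environment variable when `kaggle_secrets` cannot be used. `load_secret` is a hypothetical helper written for this guide, not a library function.

```python
import os


def load_secret(name: str) -> str:
    """Return a secret from Kaggle Secrets if possible, else from the environment."""
    try:
        # kaggle_secrets is only importable inside a Kaggle notebook session
        from kaggle_secrets import UserSecretsClient
        return UserSecretsClient().get_secret(name)
    except Exception:
        # Outside Kaggle (or if the secret is not attached), try an env variable
        value = os.environ.get(name)
        if not value:
            raise RuntimeError(f"Secret {name!r} not found in Kaggle Secrets or the environment")
        return value


# In the notebook you could then write:
# QDRANT_API_KEY = load_secret("QDRANT_API_KEY")
```

The same helper works unchanged on your local machine if the key is exported from your .env file.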
Step 4: Extract Data using Docling
Docling extracts text, tables, and document structure from PDFs and can export the result as Markdown, which is what the chunking step relies on.
def extract_documents(data_dir):
    converter = DocumentConverter()
    extracted_docs = []
    # Iterate through all PDFs in your Kaggle dataset
    for filepath in Path(data_dir).glob("**/*.pdf"):
        print(f"Extracting: {filepath.name}")
        result = converter.convert(str(filepath))
        # Export Docling result to Markdown format
        markdown_content = result.document.export_to_markdown()
        extracted_docs.append({
            "source": filepath.name,
            "content": markdown_content
        })
    return extracted_docs

print("Starting Document Extraction...")
docs = extract_documents(DATA_DIR)
print(f"Successfully extracted {len(docs)} documents.")
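The pattern `**/*.pdf` is case-sensitive on Kaggle's Linux filesystem and ignores every other file type, so a file named REPORT.PDF or notes.docx is silently skipped. A minimal sketch of a more tolerant file finder follows. `find_documents` is a hypothetical helper, and the default extension list is an assumption: check which formats your installed Docling version converts before widening it.

```python
from pathlib import Path
from typing import Iterable, List


def find_documents(data_dir: str, extensions: Iterable[str] = (".pdf", ".docx")) -> List[Path]:
    """Recursively collect files whose suffix matches, ignoring case."""
    wanted = {ext.lower() for ext in extensions}
    return sorted(
        path
        for path in Path(data_dir).rglob("*")
        if path.is_file() and path.suffix.lower() in wanted
    )
```

You could then loop over `find_documents(DATA_DIR)` in place of the `glob` call inside `extract_documents`.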
Step 5: Advanced Semantic Chunking (Markdown + Recursive)
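We first split each document logically by its Markdown headers, then split any section that is still too large into chunks of at most 512 characters with a 64-character overlap. Note that RecursiveCharacterTextSplitter measures chunk_size in characters by default, not tokens, so these chunks normally stay well inside the 512 token limit of the embedding model.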
def chunk_documents(docs):
    # 1. Split logically by Markdown headers
    headers_to_split_on = [
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
    markdown_splitter = MarkdownHeaderTextSplitter(headers_to_split_on=headers_to_split_on)
    # 2. Strict character splitting to guarantee sizing (sizes are in characters)
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=64,
        separators=["\n\n", "\n", ".", " ", ""]
    )
    chunks = []
    for doc in docs:
        # Split by headers
        md_splits = markdown_splitter.split_text(doc["content"])
        # Further split chunks that are too large
        for md_split in md_splits:
            final_splits = text_splitter.split_text(md_split.page_content)
            for split in final_splits:
                chunks.append({
                    "chunk_id": str(uuid.uuid4()),
                    "source": doc["source"],
                    "text": split,
                    # Copy the header metadata so each chunk owns its own dict
                    "metadata": dict(md_split.metadata)
                })
    return chunks

print("Chunking documents...")
chunks = chunk_documents(docs)
print(f"Created {len(chunks)} raw chunks.")
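Because the splitter above counts characters, it is worth confirming that no chunk actually exceeds the model's token budget before embedding. The sketch below is a hypothetical helper that takes any token-counting function; the whitespace counter is only a stand-in for illustration and will undercount compared with a real subword tokenizer.

```python
from typing import Callable, List, Tuple


def find_oversized(
    texts: List[str],
    count_tokens: Callable[[str], int],
    limit: int = 512,
) -> List[Tuple[int, int]]:
    """Return (index, token_count) for every text longer than `limit` tokens."""
    oversized = []
    for index, text in enumerate(texts):
        n_tokens = count_tokens(text)
        if n_tokens > limit:
            oversized.append((index, n_tokens))
    return oversized


def whitespace_tokens(text: str) -> int:
    """Illustrative stand-in tokenizer: counts whitespace-separated words."""
    return len(text.split())
```

In the notebook you would pass the real tokenizer instead, for example `lambda t: len(tokenizer.encode(t))` with `tokenizer = AutoTokenizer.from_pretrained("ncbi/MedCPT-Article-Encoder")` from the transformers library, and call `find_oversized([c["text"] for c in chunks], ...)`.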
Step 5.5: Validate & Score Chunk Coherence
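Not all text extracted from PDFs is useful (e.g., garbled OCR, random numbers). We use textstat to calculate a Flesch Reading Ease score for each chunk as a rough proxy for coherence. Chunks that are too short or that score zero or below are dropped, and the normalized score is attached to the rest. Flesch Reading Ease measures readability, not meaning, and dense medical prose with long sentences and polysyllabic terms can legitimately score very low, so treat the threshold as a heuristic and spot-check what gets filtered out.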
import textstat
import hashlib
import uuid

def score_and_filter_chunks(chunks):
    valid_chunks = []
    for chunk in chunks:
        text = chunk["text"]
        # 1. Reject chunks that are too small to have context
        if len(text.strip()) < 50:
            continue
        # 2. Calculate a readability score (Flesch Reading Ease) as a coherence proxy
        raw_score = textstat.flesch_reading_ease(text)
        # Keep only chunks with a positive score, and normalize it between 0.0 and 1.0
        if raw_score > 0:
            normalized_score = min(1.0, raw_score / 100.0)
            # Build a fresh metadata dict so chunks from the same section never share a score
            chunk["metadata"] = {**chunk["metadata"], "coherence_score": round(normalized_score, 4)}
            # 3. Derive a deterministic ID from the text so re-running the pipeline
            #    overwrites existing points instead of creating duplicates.
            #    Identical text from two different files maps to the same ID.
            deterministic_id = hashlib.md5(text.encode("utf-8")).hexdigest()
            chunk["chunk_id"] = str(uuid.UUID(hex=deterministic_id))
            valid_chunks.append(chunk)
    return valid_chunks

print("Validating chunks and calculating coherence scores...")
scored_chunks = score_and_filter_chunks(chunks)
print(f"Kept {len(scored_chunks)} chunks (filtered out {len(chunks) - len(scored_chunks)} short or low-scoring chunks).")
# Replace chunks with our scored and filtered list
chunks = scored_chunks
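To make the ID scheme concrete: the MD5 digest of a chunk is 32 hex characters, which is exactly the size of a UUID, so it can be reinterpreted as one. The sketch below reproduces that scheme in `text_id` and adds a variant, `source_text_id`, which is an assumption of this example and not part of the pipeline above. It mixes the file name into the ID so that identical text appearing in two files yields two separate points instead of one overwriting the other.

```python
import hashlib
import uuid


def text_id(text: str) -> str:
    """Same scheme as Step 5.5: the MD5 hex digest reinterpreted as a UUID."""
    digest = hashlib.md5(text.encode("utf-8")).hexdigest()
    return str(uuid.UUID(hex=digest))


def source_text_id(source: str, text: str) -> str:
    """Hypothetical variant: a name-based UUID over both source and text."""
    return str(uuid.uuid5(uuid.NAMESPACE_URL, f"{source}\n{text}"))
```

Which variant is right depends on whether you want cross-file duplicates collapsed (first function) or kept with their own source attribution (second function).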
Step 6: Initialize MedCPT Article Encoder (Using GPU)
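This step loads ncbi/MedCPT-Article-Encoder onto a Kaggle T4 GPU. MedCPT is a biomedical retrieval model trained on PubMed search logs. The Article Encoder embeds documents and is paired with a separate Query Encoder that you use at search time.
print("Loading MedCPT Article Encoder onto the GPU...")
# device='cuda' places the model on the first visible GPU; the second T4 stays
# idle unless you set up multi-GPU encoding separately
# Note: loaded this way, sentence-transformers is expected to apply its default
# mean pooling, whereas the MedCPT model card pools the [CLS] token. Verify this
# for your installed versions if retrieval quality matters.
embeddings_model = HuggingFaceEmbeddings(
    model_name="ncbi/MedCPT-Article-Encoder",
    model_kwargs={'device': 'cuda'}
)
# MedCPT outputs 768 dimensional vectors
VECTOR_SIZE = 768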
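If you want embeddings that follow the usage shown on the MedCPT model card, which takes the hidden state of the first ([CLS]) token, one option is to call the transformers library directly. The sketch below rests on that reading of the model card. `batched` and `embed_articles_cls` are hypothetical helpers, and the batch size of 32 is an arbitrary starting point, not a tuned value.

```python
from typing import Iterator, List


def batched(items: List[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


def embed_articles_cls(
    texts: List[str],
    model_name: str = "ncbi/MedCPT-Article-Encoder",
    batch_size: int = 32,
    device: str = "cuda",
) -> List[List[float]]:
    """Embed texts using the [CLS] token of the last hidden layer."""
    # Heavy imports stay local so `batched` is usable without a GPU stack
    import torch
    from transformers import AutoModel, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModel.from_pretrained(model_name).to(device).eval()
    vectors: List[List[float]] = []
    with torch.no_grad():
        for batch in batched(texts, batch_size):
            encoded = tokenizer(
                batch, truncation=True, padding=True,
                return_tensors="pt", max_length=512,
            ).to(device)
            # Position 0 of every sequence is the [CLS] token
            cls_vectors = model(**encoded).last_hidden_state[:, 0, :]
            vectors.extend(cls_vectors.cpu().tolist())
    return vectors
```

Whichever pooling you choose, use the same one for the Query Encoder at search time, since vectors produced with different pooling are not comparable.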
Step 7: Push Embeddings to Qdrant Cloud
This script embeds the chunks and pushes them over the internet directly to your Qdrant Cloud cluster.
# Initialize Qdrant Client connected to your Cloud cluster
client = QdrantClient(
    url=QDRANT_URL,
    api_key=QDRANT_API_KEY
)
# Create the collection if it doesn't exist
if not client.collection_exists(COLLECTION_NAME):
    client.create_collection(
        collection_name=COLLECTION_NAME,
        vectors_config=VectorParams(size=VECTOR_SIZE, distance=Distance.COSINE),
    )
    print(f"Created new collection: {COLLECTION_NAME}")

print("Embedding and pushing to Qdrant in batches...")
BATCH_SIZE = 64  # Chunks embedded and upserted per round trip
for i in range(0, len(chunks), BATCH_SIZE):
    batch = chunks[i:i + BATCH_SIZE]
    # Generate embeddings using MedCPT Article Encoder
    texts = [item["text"] for item in batch]
    batch_embeddings = embeddings_model.embed_documents(texts)
    points = []
    for j, item in enumerate(batch):
        points.append(
            PointStruct(
                id=item["chunk_id"],
                vector=batch_embeddings[j],
                payload={
                    "source": item["source"],
                    "text": item["text"],
                    "headers": item["metadata"],
                    "coherence_score": item["metadata"].get("coherence_score", 0)
                }
            )
        )
    client.upsert(collection_name=COLLECTION_NAME, points=points)
    print(f"Pushed chunks {i} to {i + len(batch)} / {len(chunks)}...")

print("✅ Data Ingestion Pipeline Complete! Your vectors are now live in Qdrant Cloud.")
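Pushing batches over the internet can fail on a transient network error, and a single failure would abort the whole loop. A minimal sketch of a retry wrapper with exponential backoff follows. `with_retries` is a hypothetical helper, and the attempt count and delays are illustrative defaults, not values recommended by Qdrant.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `operation`, retrying on any exception with doubling delays."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == attempts:
                raise  # Out of attempts: surface the last error
            sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

In the loop you could wrap the call as `with_retries(lambda: client.upsert(collection_name=COLLECTION_NAME, points=points))`. Because Step 5.5 derives point IDs from the chunk text, repeating an upsert overwrites the same points rather than adding duplicates.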
What to do AFTER Ingestion? (Merging into Local Directory)
Because Qdrant is hosted in the Cloud, you do not need to download or merge any database files back into your local directory. The vectors are available to any client that has the cluster URL and API key.
However, you must update your local backend project to use the matching MedCPT-Query-Encoder, so that search queries are embedded into the same vector space as the ingested chunks.
- Update your .env file in your local project to swap the embedding model:
    # Change embedding model from qwen3-embedding to MedCPT Query Encoder
    OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
    GW_OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
    AKS_OLLAMA_EMBEDDING_MODEL=ncbi/MedCPT-Query-Encoder
  (Note: You will also need to pull this model locally via Ollama or HuggingFace, or configure your backend retrieval.py to use HuggingFaceEmbeddings instead of Ollama for the Query Encoder.)
- Refactor the Retrieval Layer: Once the ingestion is complete, inform your AI assistant so it can update src/retrieval.py to search qdrant-client using the new MedCPT-Query-Encoder instead of the old PostgreSQL pgvector code.
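As a starting point for that refactor, the retrieval step can be sketched as a function that receives the Qdrant client and a query-embedding function as arguments. `search_chunks` is a hypothetical name, the payload keys match those written in Step 7, and the sketch assumes a qdrant-client version recent enough to provide `query_points`.

```python
from typing import Any, Callable, Dict, List


def search_chunks(
    client: Any,
    embed_query: Callable[[str], List[float]],
    question: str,
    collection_name: str = "medical_knowledge_base",
    limit: int = 5,
) -> List[Dict[str, Any]]:
    """Embed the question and return the best-matching chunks with scores."""
    vector = embed_query(question)
    response = client.query_points(
        collection_name=collection_name,
        query=vector,
        limit=limit,
        with_payload=True,
    )
    return [
        {
            "score": point.score,
            "source": point.payload.get("source"),
            "text": point.payload.get("text"),
        }
        for point in response.points
    ]
```

In the backend, `client` would be a `QdrantClient` built from your cluster URL and key, and `embed_query` could be the `embed_query` method of a `HuggingFaceEmbeddings` instance created with `model_name="ncbi/MedCPT-Query-Encoder"`. Passing both in as arguments keeps the function easy to test without a network connection.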