| title: MedSight AI Backend | |
| emoji: 🏥 | |
| colorFrom: blue | |
| colorTo: blue | |
| sdk: docker | |
| app_port: 7860 | |
| pinned: true | |
| license: apache-2.0 | |
| <div align="center"> | |
| # 🏥 MedSight AI | |
| ### Multimodal Medical Diagnostic Platform | |
| **AI-Powered Pulmonary Anomaly Detection Fusing Computer Vision, NLP, and Retrieval-Augmented Generation** | |
| [Live Demo](#deployment) · [Research Paper](#research-paper) · [API Docs](#api-reference) · [Architecture](#system-architecture) | |
| </div> | |
| --- | |
| ## Table of Contents | |
| - [Overview](#overview) | |
| - [Key Features](#key-features) | |
| - [System Architecture](#system-architecture) | |
| - [7-Stage Analysis Pipeline](#7-stage-analysis-pipeline) | |
| - [VRAM-Aware Model Registry](#vram-aware-model-registry) | |
| - [NLP Pipeline](#nlp-pipeline) | |
| - [3-Tier RAG Conversational Architecture](#3-tier-rag-conversational-architecture) | |
| - [Model Pipeline – VGG16 → VAE → ViT](#model-pipeline--vgg16--vae--vit) | |
| - [Fused Anomaly Score](#fused-anomaly-score) | |
| - [Interpretability – Clinical Attention Heatmaps](#interpretability--clinical-attention-heatmaps) | |
| - [Training & Experimental Results](#training--experimental-results) | |
| - [Ablation Study](#ablation-study--fusion-component-analysis) | |
| - [Tech Stack](#tech-stack) | |
| - [Project Structure](#project-structure) | |
| - [Getting Started](#getting-started) | |
| - [Configuration](#configuration) | |
| - [API Reference](#api-reference) | |
| - [Deployment](#deployment) | |
| - [Research Paper](#research-paper) | |
| - [Contributing](#contributing) | |
| - [License](#license) | |
| --- | |
| ## Overview | |
| **MedSight AI** is a full-stack multimodal medical diagnostic platform that performs automated pulmonary anomaly detection from chest X-ray images. The system fuses deep learning-based computer vision with clinical NLP and a retrieval-augmented generation (RAG) pipeline to deliver diagnostic reports, clinical Q&A, and explainable AI visualizations through a clinical dashboard. | |
| The platform is designed as a **clinical decision-support tool** (not a replacement for physicians) that assists radiologists and clinicians by: | |
| - Detecting pulmonary anomalies in chest X-rays using a novel **VGG16 → VAE → ViT** architecture (2.53M trainable parameters) | |
| - Extracting clinical entities from patient symptom descriptions via **scispaCy NER** and **zero-shot disease classification** | |
| - Generating patient-friendly diagnostic explanations through **Gemini 2.0 Flash**-powered conversational AI | |
| - Producing downloadable **PDF diagnostic reports** with heatmap visualizations | |
| > ⚠️ **Disclaimer:** MedSight AI is a research prototype for educational and clinical decision-support purposes. It is **not** FDA-approved and should not be used as the sole basis for medical diagnosis or treatment. | |
| --- | |
| ## Key Features | |
| | Feature | Description | | |
| |---|---| | |
| | **Anomaly Detection** | Novel VGG16 → VAE → ViT pipeline that detects anomalies via reconstruction error, KL divergence, and attention-based scoring | | |
| | **Heatmap Visualization** | Clinical Grad-CAM-style attention overlays showing regions of interest on X-rays | | |
| | **NLP Entity Extraction** | scispaCy-powered medical NER extracting diseases, symptoms, medications, and anatomical entities | | |
| | **Disease Classification** | Zero-shot classification using DistilBART-MNLI with rule-based fallbacks | | |
| | **Multimodal Fusion** | Image-text alignment scoring to correlate imaging findings with clinical narratives | | |
| | **AI Clinical Chat** | Gemini 2.0 Flash-powered RAG chatbot with session-aware context and intent detection | | |
| | **PDF Reports** | Auto-generated diagnostic reports with heatmaps, findings, and recommendations | | |
| | **Voice Input** | Whisper-powered speech-to-text for hands-free symptom entry | | |
| | **Authentication** | JWT + Google OAuth 2.0 with secure session management and brute-force protection | | |
| | **Patient Dashboard** | Comprehensive analysis history, risk tracking, and session management | | |
| --- | |
| ## System Architecture | |
| <p align="center"> | |
| <img src="docs/images/system_architecture.png" alt="MedSight AI System Architecture" width="800"/> | |
| </p> | |
| MedSight AI is deployed as a **production-grade web application** with a React/Next.js 14 frontend and an async FastAPI backend. The architecture cleanly separates vision, NLP, and conversational AI pipelines behind a unified REST API. | |
| ### 7-Stage Analysis Pipeline | |
| Every X-ray analysis request flows through a deterministic 7-stage orchestration pipeline (`backend/orchestration/pipeline.py`): | |
| ``` | |
| ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐ | |
| │ 1. Input │──▶│ 2. Vision│──▶│ 3. VRAM  │──▶│ 4. NLP   │──▶│ 5. Multi │──▶│ 6. Report│──▶│ 7. Status│ | |
| │ Validate │   │ Analysis │   │ Cleanup  │   │ Analysis │   │ Fusion   │   │ Gen      │   │ Return   │ | |
| └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘ | |
| Preprocess     VGG16→VAE      torch.cuda     scispaCy NER   BiomedVLP      BioGPT or      COMPLETE / | |
| 224×224 RGB    →ViT scorer    empty_cache    + DistilBART   alignment      Template       PARTIAL / | |
| LANCZOS        + heatmap      (GPU only)     zero-shot      scoring        fallback       FAILED | |
| ``` | |
| Each stage runs asynchronously with **independent error handling**: if vision fails, NLP still runs. The system returns `COMPLETE`, `PARTIAL`, or `FAILED` depending on which stages succeeded. | |
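The status rule can be illustrated with a minimal sketch. It assumes each stage is an async callable returning a dict and that a stage counts as failed when it raises; the names `run_stage`, `overall_status`, `analyze`, and the two stand-in stages are hypothetical and not taken from `backend/orchestration/pipeline.py`.

```python
import asyncio
from typing import Awaitable, Callable, Dict, Optional

Stage = Callable[[], Awaitable[dict]]


async def run_stage(name: str, stage: Stage) -> Optional[dict]:
    """Run one stage and return its result, or None if it raised."""
    try:
        return await stage()
    except Exception as exc:  # isolate the failure so later stages still run
        print(f"stage {name!r} failed: {exc}")
        return None


def overall_status(results: Dict[str, Optional[dict]]) -> str:
    """Map per-stage outcomes to COMPLETE, PARTIAL, or FAILED."""
    succeeded = sum(1 for r in results.values() if r is not None)
    if succeeded == len(results):
        return "COMPLETE"
    return "PARTIAL" if succeeded else "FAILED"


async def analyze(stages: Dict[str, Stage]) -> dict:
    results: Dict[str, Optional[dict]] = {}
    for name, stage in stages.items():  # dicts preserve insertion order
        results[name] = await run_stage(name, stage)
    return {"overall_status": overall_status(results), "results": results}


# Hypothetical stand-in stages: vision fails, NLP succeeds.
async def fake_vision() -> dict:
    raise RuntimeError("checkpoint missing")


async def fake_nlp() -> dict:
    return {"primary_diagnosis": "Pneumonia"}


demo = asyncio.run(analyze({"vision": fake_vision, "nlp": fake_nlp}))
print(demo["overall_status"])  # PARTIAL
```

Catching the exception inside `run_stage` keeps one failing stage from aborting the rest of the run.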
| ### VRAM-Aware Model Registry | |
| A custom `ModelRegistry` manages six ML models with **priority-based loading**, **LRU GPU eviction**, and **async initialization**. This enables deployment on consumer hardware with as little as **4 GB VRAM**: | |
| | Priority | Model | HuggingFace ID | RAM | Required | Purpose | | |
| |:---:|---|---|:---:|:---:|---| | |
| | 1 | VGG16+VAE+ViT | `hoshikrana/VAE_and_VIT_Anomaly_detection` | 50 MB | β | Anomaly detection | | |
| | 1 | MiniLM-L6-v2 | `sentence-transformers/all-MiniLM-L6-v2` | 100 MB | β | RAG embeddings | | |
| | 2 | scispaCy NER | `en_core_sci_sm` | 100 MB | β | Medical entity extraction | | |
| | 3 | Whisper Tiny | `openai/whisper-tiny` | 300 MB | β | Voice transcription | | |
| | 4 | BioGPT | `microsoft/biogpt` | 700 MB | β | Report generation | | |
| | 5 | DistilBART | `valhalla/distilbart-mnli-12-1` | 300 MB | β | Zero-shot classification | | |
| The registry supports **dynamic GPU↔CPU migration**: when a higher-priority model needs GPU memory, the least-recently-used GPU model is evicted to CPU automatically. | |
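As a rough illustration of LRU eviction under a VRAM budget, the sketch below tracks which models sit on the GPU and moves the least-recently-used ones to CPU when a new model does not fit. It is not the project's `ModelRegistry`: the class name `GpuLruRegistry`, its methods, and the 1000 MB budget are assumptions, and priority handling and async loading are left out.

```python
from collections import OrderedDict
from typing import Dict, List, Set


class GpuLruRegistry:
    """Bookkeeping only: no real tensors are moved in this sketch."""

    def __init__(self, vram_budget_mb: int) -> None:
        self.vram_budget_mb = vram_budget_mb
        self.sizes_mb: Dict[str, int] = {}
        self.on_gpu: "OrderedDict[str, None]" = OrderedDict()  # oldest first
        self.on_cpu: Set[str] = set()

    def register(self, name: str, size_mb: int) -> None:
        self.sizes_mb[name] = size_mb
        self.on_cpu.add(name)

    def gpu_usage_mb(self) -> int:
        return sum(self.sizes_mb[n] for n in self.on_gpu)

    def acquire(self, name: str) -> List[str]:
        """Put `name` on the GPU; return the models evicted to CPU."""
        size = self.sizes_mb[name]
        if size > self.vram_budget_mb:
            raise ValueError(f"{name} does not fit in the VRAM budget")
        if name in self.on_gpu:
            self.on_gpu.move_to_end(name)  # mark as most recently used
            return []
        evicted: List[str] = []
        while self.gpu_usage_mb() + size > self.vram_budget_mb:
            victim, _ = self.on_gpu.popitem(last=False)  # least recently used
            self.on_cpu.add(victim)
            evicted.append(victim)
        self.on_cpu.discard(name)
        self.on_gpu[name] = None
        return evicted


registry = GpuLruRegistry(vram_budget_mb=1000)  # hypothetical budget
registry.register("biogpt", 700)
registry.register("whisper", 300)
registry.register("distilbart", 300)
registry.acquire("biogpt")
registry.acquire("whisper")
registry.acquire("biogpt")  # touch: whisper is now least recently used
evicted = registry.acquire("distilbart")
print(evicted)  # ['whisper']
```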
| ### NLP Pipeline | |
| The NLP module processes clinical notes through three stages: | |
| 1. **Named Entity Recognition**: scispaCy (`en_core_sci_sm`) extracts diseases, symptoms, medications, and anatomical references from patient text | |
| 2. **Zero-Shot Classification**: DistilBART-MNLI classifies clinical text against 20 pulmonary conditions without task-specific fine-tuning (falls back to rule-based matching if the model isn't loaded) | |
| 3. **Multimodal Fusion**: Optional BiomedVLP image-text alignment scoring correlates imaging findings with clinical narratives, with a keyword-based fallback for constrained environments | |
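The rule-based fallback in step 2 might look roughly like the sketch below. The keyword table, the scoring rule (fraction of matched terms), and the function names are all hypothetical; the README does not list the project's actual rules or its 20 labels.

```python
import re
from typing import Callable, Dict, List, Optional, Tuple

Prediction = List[Tuple[str, float]]

# Hypothetical keyword table, for illustration only.
KEYWORDS: Dict[str, List[str]] = {
    "Pneumonia": ["fever", "productive cough", "chest pain"],
    "Pleural Effusion": ["shortness of breath", "dullness", "pleuritic pain"],
    "Pneumothorax": ["sudden chest pain", "shortness of breath"],
}


def rule_based_classify(text: str, keywords: Dict[str, List[str]] = KEYWORDS) -> Prediction:
    """Score each condition by the fraction of its keywords found in the text."""
    lowered = text.lower()
    scores: Prediction = []
    for disease, terms in keywords.items():
        hits = sum(
            1 for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", lowered)
        )
        if hits:
            scores.append((disease, hits / len(terms)))
    return sorted(scores, key=lambda pair: pair[1], reverse=True)


def classify(text: str, zero_shot: Optional[Callable[[str], Prediction]] = None) -> Prediction:
    """Use the zero-shot model when one is loaded, otherwise fall back to rules."""
    if zero_shot is not None:
        return zero_shot(text)
    return rule_based_classify(text)


predictions = classify("Fever and productive cough for three days")
print(predictions[0][0])  # Pneumonia
```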
| ### 3-Tier RAG Conversational Architecture | |
| The conversational module implements a **highly resilient 3-tier Retrieval-Augmented Generation** system that never fails silently: | |
| | Tier | Engine | Method | Latency | | |
| |:---:|---|---|:---:| | |
| | **Tier 1** | Gemini 2.0 Flash (Cloud) | Streaming SSE with dynamic system instructions | ~1.5s | | |
| | **Tier 2** | BioGPT (Local) | Beam search decoding (num_beams=4) | ~3s | | |
| | **Tier 3** | Heuristic Templates | Intent-detection rule engine with 8 intent categories | ~5ms | | |
| **Context construction** aggregates vision anomaly scores, NLP predictions, fusion similarity, patient session history, and retrieved PubMed abstracts (via MiniLM-L6-v2 + ChromaDB HNSW indexing). All tiers prohibit dosage recommendations and append medical disclaimers. | |
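A tiered fallback of this kind can be sketched as a loop over generators that falls through on an exception or an empty reply. The tier callables, the disclaimer text, and the single-intent template engine below are placeholders, not the project's Gemini or BioGPT clients.

```python
from typing import Callable, List, Tuple

# Hypothetical disclaimer text; the project's wording is not shown in this README.
DISCLAIMER = "This is not medical advice; consult a qualified clinician."


def template_answer(question: str) -> str:
    """Tier 3 stand-in: a tiny rule engine with one intent plus a default."""
    if "score" in question.lower():
        return "The anomaly score summarizes how far the image deviates from normal anatomy."
    return "I can explain the findings in your report."


def answer_with_fallback(
    question: str, tiers: List[Tuple[str, Callable[[str], str]]]
) -> Tuple[str, str]:
    """Try each tier in order; fall through on an exception or an empty reply."""
    for name, generate in tiers:
        try:
            reply = generate(question)
        except Exception as exc:
            print(f"tier {name!r} failed: {exc}")
            continue
        if reply and reply.strip():
            return name, f"{reply.strip()}\n\n{DISCLAIMER}"
    return "template", f"{template_answer(question)}\n\n{DISCLAIMER}"


def cloud_down(question: str) -> str:
    raise TimeoutError("cloud model unreachable")


def local_empty(question: str) -> str:
    return ""


tier, reply = answer_with_fallback(
    "What does my anomaly score mean?",
    [("cloud", cloud_down), ("local", local_empty)],
)
print(tier)  # template
```

Appending the disclaimer in one place, after a tier has produced text, ensures every tier's reply carries it.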
| --- | |
| ## Model Pipeline – VGG16 → VAE → ViT | |
| <p align="center"> | |
| <img src="docs/images/model_architecture.png" alt="Three-Stage Anomaly Detection Architecture" width="800"/> | |
| </p> | |
| The core anomaly detection system implements a novel **three-stage unsupervised architecture** with only **2.53M trainable parameters**. The VAE is trained exclusively on normal chest X-rays and learns the distribution of healthy pulmonary anatomy, so detection requires **zero pathology-specific labels** (the ViT scorer uses only binary normal/anomaly labels). | |
| ### Stage 1 – VGG16 Feature Extraction (0 trainable params) | |
| Pre-trained VGG16 (ImageNet) serves as a **frozen feature extractor**. Convolutional feature maps are globally average-pooled to produce a compact representation per image. | |
| ``` | |
| Input: 224×224×3 RGB (ImageNet-normalized) | |
| → VGG16.features (frozen) | |
| → AdaptiveAvgPool2d(1,1) | |
| → Flatten | |
| → Output: ℝ⁵¹² feature vector | |
| ``` | |
| **Why freeze?** (i) Deterministic features ensure stable VAE training; (ii) zero gradient storage saves VRAM; (iii) ImageNet features are a common and practical starting point for medical imaging, although the benefit of such transfer varies by task (Raghu et al., 2019). | |
| ### Stage 2 – Variational Autoencoder (1,318,656 params) | |
| The VAE learns a **smooth, continuous latent manifold** of normal pulmonary anatomy. During inference, pathological images produce higher reconstruction error and KL divergence because they fall outside the learned normal distribution. | |
| ``` | |
| Encoder: 512 → 512 → 384 → 256 → [μ, log σ²] (each with LayerNorm + GELU + Dropout 0.1) | |
| ↓ | |
| Reparameterization: z = μ + ε·σ (ε ~ N(0,1)) | |
| ↓ | |
| Decoder: 256 → 384 → 512 → 512 (symmetric architecture) | |
| ↓ | |
| Output: x̂ (reconstructed features) | |
| ``` | |
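The reparameterization step can be written in a few lines of plain Python. This is a minimal sketch on lists of floats, assuming the encoder outputs `mu` and `log_var` (log σ²) per latent dimension; a real implementation would use tensors, and the function name is illustrative.

```python
import math
import random
from typing import List


def reparameterize(mu: List[float], log_var: List[float], rng: random.Random) -> List[float]:
    """Sample z = mu + eps * sigma with eps ~ N(0, 1) and sigma = exp(0.5 * log_var)."""
    if len(mu) != len(log_var):
        raise ValueError("mu and log_var must have the same length")
    return [
        m + rng.gauss(0.0, 1.0) * math.exp(0.5 * lv)
        for m, lv in zip(mu, log_var)
    ]


rng = random.Random(0)
z = reparameterize([0.0] * 256, [0.0] * 256, rng)  # 256-dimensional latent sample
print(len(z))  # 256
```

Sampling `eps` separately and scaling it by σ keeps the randomness outside the path from μ and log σ² to `z`, which is what allows gradients to flow through the encoder in a tensor implementation.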
| **Loss function**: negative Evidence Lower Bound (ELBO), minimized during training: | |
| ``` | |
| L_VAE = L_recon + β · L_KL | |
| where: L_recon = MSE(x̂, x) | |
| L_KL = -½ Σ(1 + log σ² - μ² - σ²) | |
| β = 0.001 (β-VAE formulation to prevent posterior collapse) | |
| ``` | |
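For a single sample, the loss above can be evaluated as follows. This sketch assumes MSE is averaged over the feature dimension and the KL term is summed over latent dimensions; the README does not state how the project reduces either term over a batch, so treat those choices as assumptions.

```python
import math
from typing import List, Tuple


def vae_loss(
    x: List[float],
    x_hat: List[float],
    mu: List[float],
    log_var: List[float],
    beta: float = 0.001,
) -> Tuple[float, float, float]:
    """Return (total, recon, kl) for one sample."""
    # Reconstruction term: mean squared error between x_hat and x.
    recon = sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)
    # KL(q(z|x) || N(0, I)) in closed form, with sigma^2 = exp(log_var).
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, log_var))
    return recon + beta * kl, recon, kl


total, recon, kl = vae_loss(
    x=[0.0, 0.0], x_hat=[1.0, 1.0], mu=[1.0, 1.0], log_var=[0.0, 0.0]
)
print(recon, kl)  # 1.0 1.0
```

With `mu = 0` and `log_var = 0` the KL term is exactly zero, which is a quick sanity check on the sign convention.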
| ### Stage 3 – Vision Transformer Anomaly Scorer (1,209,729 params) | |
| The ViT operates on the **latent vector z** (not raw pixels), treating it as a sequence of patches for self-attention-based anomaly scoring. This is a key architectural decision: the ViT scores the quality of the latent representation rather than the image directly. | |
| ``` | |
| z ∈ ℝ²⁵⁶ → reshape to 8 patches of dim 32 | |
| → Linear projection to d_model = 128 | |
| → Prepend learnable [CLS] token | |
| → Add positional embeddings (9 tokens = 8 patches + CLS) | |
| → 6× Transformer Blocks (8-head attention, MLP dim 512, GELU, Dropout 0.1) | |
| → LayerNorm | |
| → [CLS] token → MLP head → Sigmoid → anomaly score ∈ [0, 1] | |
| ``` | |
| | Hyperparameter | Value | | |
| |---|:---:| | |
| | Latent dimension | 256 | | |
| | Patch dimension | 32 | | |
| | Number of patches | 8 | | |
| | Model dimension (d_model) | 128 | | |
| | Transformer depth | 6 layers | | |
| | Attention heads | 8 | | |
| | MLP dimension | 512 | | |
| | Dropout | 0.1 | | |
| | Output activation | Sigmoid → [0, 1] | | |
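The shape bookkeeping implied by these hyperparameters can be checked with a short sketch. It only covers the reshape of `z` into patches, the token count, and the per-head width; the helper names are illustrative and the attention layers themselves are omitted.

```python
from typing import List


def latent_to_patches(z: List[float], num_patches: int = 8, patch_dim: int = 32) -> List[List[float]]:
    """Split a flat latent vector into `num_patches` consecutive chunks of `patch_dim`."""
    if len(z) != num_patches * patch_dim:
        raise ValueError(f"expected {num_patches * patch_dim} values, got {len(z)}")
    return [z[i * patch_dim:(i + 1) * patch_dim] for i in range(num_patches)]


def sequence_length(num_patches: int) -> int:
    """Patch tokens plus the prepended [CLS] token."""
    return num_patches + 1


def head_dim(d_model: int = 128, heads: int = 8) -> int:
    """Width of each attention head; d_model must divide evenly."""
    if d_model % heads:
        raise ValueError("d_model must be divisible by the number of heads")
    return d_model // heads


patches = latent_to_patches([float(i) for i in range(256)])
print(len(patches), len(patches[0]), sequence_length(len(patches)), head_dim())
# 8 32 9 16
```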
| ### Fused Anomaly Score | |
| The final anomaly score fuses **three complementary signals** via weighted linear combination after normalizing each component using calibration statistics computed on the training set: | |
| ``` | |
| S_anomaly = w₁ · σ((e_recon - μ_recon) / σ_recon) | |
| + w₂ · σ((d_KL - μ_KL) / σ_KL) | |
| + w₃ · s_ViT | |
| where: w₁ = 0.4 (reconstruction error: feature-level deviations) | |
| w₂ = 0.2 (KL divergence: distributional shift) | |
| w₃ = 0.4 (ViT score: higher-order latent abnormalities) | |
| σ(·) = sigmoid normalization; μ_*, σ_* = calibration mean and standard deviation | |
| ``` | |
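The fused score can be computed directly from the formula. In the sketch below the `Calibration` container and its example values are hypothetical; the real calibration statistics come from the training set and are not listed in this README.

```python
import math
from dataclasses import dataclass
from typing import Tuple


def sigmoid(value: float) -> float:
    return 1.0 / (1.0 + math.exp(-value))


@dataclass(frozen=True)
class Calibration:
    """Mean and standard deviation of each raw signal (hypothetical container)."""
    recon_mean: float
    recon_std: float
    kl_mean: float
    kl_std: float


def fused_score(
    e_recon: float,
    d_kl: float,
    s_vit: float,
    cal: Calibration,
    weights: Tuple[float, float, float] = (0.4, 0.2, 0.4),
) -> float:
    """Weighted sum of sigmoid-normalized z-scores and the ViT score."""
    if cal.recon_std <= 0 or cal.kl_std <= 0:
        raise ValueError("calibration standard deviations must be positive")
    w1, w2, w3 = weights
    recon_term = sigmoid((e_recon - cal.recon_mean) / cal.recon_std)
    kl_term = sigmoid((d_kl - cal.kl_mean) / cal.kl_std)
    return w1 * recon_term + w2 * kl_term + w3 * s_vit


cal = Calibration(recon_mean=1.0, recon_std=0.5, kl_mean=2.0, kl_std=1.0)  # made-up values
baseline = fused_score(e_recon=1.0, d_kl=2.0, s_vit=0.5, cal=cal)
elevated = fused_score(e_recon=2.0, d_kl=2.0, s_vit=0.5, cal=cal)
```

Because the weights sum to 1 and every term lies in [0, 1], the fused score also stays in [0, 1]; a sample sitting exactly at the calibration means with a ViT score of 0.5 lands at 0.5.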
| The optimal threshold of **0.348** was determined by maximizing the Youden index on the validation set. | |
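Threshold selection by the Youden index (J = sensitivity + specificity − 1) can be sketched as an exhaustive search over the observed scores. The convention that a score at or above the threshold counts as anomalous is an assumption, and the toy scores and labels are made up for illustration.

```python
from typing import List, Tuple


def youden_threshold(scores: List[float], labels: List[int]) -> Tuple[float, float]:
    """Return (threshold, J) maximizing J; labels are 1 for anomaly, 0 for normal."""
    positives = sum(labels)
    negatives = len(labels) - positives
    if positives == 0 or negatives == 0:
        raise ValueError("need at least one normal and one anomalous sample")
    best_threshold, best_j = scores[0], -1.0
    for threshold in sorted(set(scores)):
        true_pos = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
        true_neg = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
        j = true_pos / positives + true_neg / negatives - 1.0
        if j > best_j:  # ties keep the lower threshold
            best_threshold, best_j = threshold, j
    return best_threshold, best_j


# Toy validation data, for illustration only.
toy_scores = [0.1, 0.2, 0.3, 0.6, 0.7, 0.9]
toy_labels = [0, 0, 1, 0, 1, 1]
threshold, j_stat = youden_threshold(toy_scores, toy_labels)
print(threshold)  # 0.3
```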
| ### Interpretability – Clinical Attention Heatmaps | |
| To provide **visual explainability**, the system extracts [CLS] token attention weights from the final ViT layer: | |
| 1. Average attention across all 8 heads → patch-level attention vector | |
| 2. Reshape into 2D grid and upsample to 384×384 via bicubic interpolation | |
| 3. Apply clinical colormap (black → dark red → orange → bright yellow) | |
| 4. Adaptive transparency mask ensures only anomalous regions glow over the X-ray | |
| 5. CLAHE enhancement on base radiograph maximizes anatomical contrast | |
| The result is a three-panel visualization: **Original X-ray** | **Attention Heatmap** | **Clinical Overlay** with anomaly score. | |
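The first two steps (head averaging and reshaping) can be sketched without any imaging library. The 2×4 grid layout for the 8 patches and the min-max normalization are assumptions made for illustration; upsampling, colormapping, and CLAHE are omitted.

```python
from typing import List, Tuple


def cls_attention_grid(
    attn_per_head: List[List[float]], grid: Tuple[int, int] = (2, 4)
) -> List[List[float]]:
    """Average [CLS]-to-patch attention over heads, min-max normalize, reshape to a grid."""
    heads = len(attn_per_head)
    patches = len(attn_per_head[0])
    rows, cols = grid
    if rows * cols != patches:
        raise ValueError("grid shape must match the number of patches")
    # Step 1: mean over heads gives one weight per patch.
    mean = [sum(head[p] for head in attn_per_head) / heads for p in range(patches)]
    # Normalize to [0, 1]; a flat map stays at 0 instead of dividing by zero.
    low, high = min(mean), max(mean)
    span = (high - low) or 1.0
    normalized = [(value - low) / span for value in mean]
    # Step 2: reshape the flat vector into rows of `cols` entries.
    return [normalized[r * cols:(r + 1) * cols] for r in range(rows)]


# Two toy heads with identical weights, for illustration only.
toy_attention = [[float(i) for i in range(8)], [float(i) for i in range(8)]]
heat = cls_attention_grid(toy_attention)
print(heat[0][0], heat[1][3])  # 0.0 1.0
```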
| --- | |
| ## Training & Experimental Results | |
| ### Dataset – COVID-19 Radiography Database | |
| | Class | Count | Usage | | |
| |---|:---:|---| | |
| | Normal | 10,192 | Training (unsupervised; the VAE sees only this class) | | |
| | COVID-19 | 3,616 | Evaluation only | | |
| | Lung Opacity | 6,012 | Evaluation only | | |
| | Viral Pneumonia | 1,345 | Evaluation only | | |
| | **Total** | **21,165** | – | | |
| **Preprocessing:** 224×224 Lanczos resize, RGB, ImageNet normalization. Training augmentations: random horizontal flip (p=0.5), rotation (±10°), color jitter (±0.2). | |
| ### Two-Phase Training Protocol | |
| **Phase 1 – VAE Training (50 epochs)** | |
| - AdamW optimizer (lr=1×10⁻⁴, weight_decay=1×10⁻⁵) | |
| - ReduceLROnPlateau scheduler (factor=0.5, patience=3) | |
| - Batch size 32, β=0.001, early stopping patience 10 | |
| - Resource-efficient: mixed-precision FP16, gradient accumulation (4 steps → effective batch 128) | |
| **Phase 2 – ViT Scorer Training (30 epochs)** | |
| - AdamW optimizer (lr=5×10⁻⁵, weight_decay=1×10⁻⁵) | |
| - Binary cross-entropy: Normal→0, Anomaly→1 | |
| - Only the ViT uses labels; the VAE remains **fully unsupervised** | |
| ### Results | |
| | Metric | Value | | |
| |---|:---:| | |
| | **AUROC** | **0.718** | | |
| | ViT Validation Accuracy | 98.6% | | |
| | VAE Final Reconstruction MSE | 0.0152 | | |
| | VAE β·KL Divergence | 6.97×10⁻⁴ | | |
| | True Positives (Anomaly→Anomaly) | 4,974 | | |
| | True Negatives (Normal→Normal) | 1,017 | | |
| | Sensitivity (Recall) | 64.7% | | |
| | Specificity | 66.5% | | |
| | Optimal Threshold | 0.348 | | |
| | Total Trainable Parameters | **2,528,385** | | |
| ### Ablation Study – Fusion Component Analysis | |
| | Configuration | AUROC | Notes | | |
| |---|:---:|---| | |
| | Reconstruction error only | 0.62 | MSE between VGG features and reconstruction | | |
| | KL divergence only | 0.68 | Strongest single signal | | |
| | ViT score only | 0.65 | Latent-space attention scoring | | |
| | Recon. + KL (w/o ViT) | 0.69 | Traditional VAE anomaly detection | | |
| | **Full fusion (0.4 / 0.2 / 0.4)** | **0.718** | **Best configuration** | | |
| Each component provides **complementary information**: reconstruction error captures feature-level deviations (the VAE reconstructs VGG16 features, not pixels), KL divergence captures distributional shift, and the ViT captures higher-order latent abnormalities via attention. | |
| ### Latent Space Validation (UMAP) | |
| UMAP projection of the 256-dimensional VAE latent space reveals **emergent clustering without supervision**: | |
| - **Normal** images cluster tightly: the VAE learned a compact representation of healthy anatomy | |
| - **Lung Opacity** forms a distinct separable cluster: the most detectable anomaly class | |
| - **Viral Pneumonia** partially overlaps with normal, explaining its harder detectability | |
| - **COVID-19** cases are sparse and widely distributed, reflecting heterogeneous radiographic presentations | |
| --- | |
| ## Tech Stack | |
| ### Backend | |
| | Component | Technology | | |
| |---|---| | |
| | Framework | FastAPI 0.110 (async, Pydantic v2) | | |
| | ML Runtime | PyTorch 2.2 + ONNX Runtime | | |
| | NLP | scispaCy, HuggingFace Transformers, BioGPT | | |
| | Embeddings | Sentence-Transformers (MiniLM-L6-v2) | | |
| | Vector DB | ChromaDB 0.4.24 | | |
| | Generative AI | Google Gemini 2.0 Flash | | |
| | Database | SQLAlchemy 2.0 (SQLite dev / PostgreSQL prod) | | |
| | Auth | JWT + Google OAuth 2.0 (Authlib) | | |
| | Task Scheduling | APScheduler | | |
| | PDF Generation | ReportLab | | |
| ### Frontend | |
| | Component | Technology | | |
| |---|---| | |
| | Framework | Next.js 14 (App Router) | | |
| | Styling | Tailwind CSS 3.4 | | |
| | Animations | Framer Motion 11 | | |
| | Charts | Recharts 2.12 | | |
| | Icons | Lucide React | | |
| | HTTP Client | Axios | | |
| | Deployment | Vercel | | |
| ### Infrastructure | |
| | Component | Technology | | |
| |---|---| | |
| | Containerization | Docker (Python 3.11-slim) | | |
| | Backend Hosting | HuggingFace Spaces (Docker SDK) | | |
| | Frontend Hosting | Vercel | | |
| | Model Distribution | HuggingFace Hub | | |
| | Object Storage | Cloudflare R2 (optional) | | |
| | Database (Prod) | Supabase PostgreSQL | | |
| --- | |
| ## Project Structure | |
| ``` | |
| MedSightAI/ | |
| ├── backend/ | |
| │   ├── api/v1/ | |
| │   │   ├── routers/                  # FastAPI route handlers | |
| │   │   │   ├── analyze.py            # X-ray upload & analysis | |
| │   │   │   ├── auth.py               # JWT + OAuth authentication | |
| │   │   │   ├── chat.py               # RAG-powered clinical Q&A | |
| │   │   │   ├── report.py             # PDF report generation | |
| │   │   │   └── users.py              # User profiles & session history | |
| │   │   └── schemas/                  # Pydantic v2 request/response models | |
| │   ├── core/ | |
| │   │   ├── config.py                 # Pydantic settings (env-driven) | |
| │   │   ├── security.py               # JWT, password hashing, API keys | |
| │   │   ├── middleware.py             # CORS, rate limiting, security headers | |
| │   │   └── exceptions.py             # Custom exception hierarchy | |
| │   ├── db/ | |
| │   │   ├── models/                   # SQLAlchemy ORM models | |
| │   │   ├── migrations/               # Alembic migration scripts | |
| │   │   └── session.py                # Async database session factory | |
| │   ├── ml/ | |
| │   │   ├── vision/ | |
| │   │   │   ├── pulmonary_anomaly.py  # VGG16→VAE→ViT detector | |
| │   │   │   ├── anomaly.py            # ONNX ConvAE fallback | |
| │   │   │   └── hf_download.py        # HuggingFace model auto-download | |
| │   │   ├── nlp/ | |
| │   │   │   ├── ner.py                # scispaCy medical NER | |
| │   │   │   ├── classifier.py         # Zero-shot disease classification | |
| │   │   │   └── whisper.py            # Voice-to-text transcription | |
| │   │   ├── rag/ | |
| │   │   │   ├── gemini_client.py      # Gemini 2.0 Flash integration | |
| │   │   │   ├── generator.py          # BioGPT report + chat generation | |
| │   │   │   ├── retriever.py          # ChromaDB vector retrieval | |
| │   │   │   └── vectorstore.py        # Embedding + indexing pipeline | |
| │   │   ├── fusion/ | |
| │   │   │   └── medclip.py            # Multimodal image-text alignment | |
| │   │   └── registry.py               # Model lifecycle manager | |
| │   ├── orchestration/ | |
| │   │   ├── pipeline.py               # 7-stage analysis orchestrator | |
| │   │   ├── queue.py                  # Async task queue | |
| │   │   ├── resilience.py             # Retry, circuit-breaker, fallbacks | |
| │   │   ├── scheduler.py              # Periodic cleanup tasks | |
| │   │   └── workers.py                # Background worker pool | |
| │   └── utils/ | |
| │       ├── pdf.py                    # Clinical PDF report builder | |
| │       ├── image.py                  # Image preprocessing utilities | |
| │       ├── audio.py                  # Audio format handling | |
| │       └── validators.py             # Input validation helpers | |
| ├── frontend/ | |
| │   ├── app/                          # Next.js App Router pages | |
| │   │   ├── (auth)/                   # Login / Registration pages | |
| │   │   ├── (dashboard)/              # Analysis dashboard | |
| │   │   ├── about/                    # About page | |
| │   │   └── profile/                  # User profile & history | |
| │   ├── components/ | |
| │   │   ├── analysis/                 # Upload panel, results viewer | |
| │   │   ├── chat/                     # AI chat interface | |
| │   │   ├── shared/                   # Navbar, layout components | |
| │   │   └── ui/                       # Reusable UI primitives | |
| │   └── lib/                          # API client, auth context, utilities | |
| ├── training/ | |
| │   ├── notebooks/                    # Jupyter training notebooks | |
| │   └── scripts/                      # Data preparation & training scripts | |
| ├── data/                             # Raw/processed data & uploads | |
| ├── models/                           # Cached model weights | |
| ├── results/                          # Training outputs & evaluation | |
| ├── Dockerfile                        # Production Docker image | |
| ├── requirements.txt                  # Python dependencies | |
| └── .env.example                      # Environment variable template | |
| ``` | |
| --- | |
| ## Getting Started | |
| ### Prerequisites | |
| - **Python** 3.10 or higher | |
| - **Node.js** 18+ and npm | |
| - **Git** and **Git LFS** (for model weights) | |
| - **(Optional)** CUDA 11.8+ compatible GPU for accelerated inference | |
| ### 1. Clone the Repository | |
| ```bash | |
| git clone https://github.com/hoshikrana/MedSightAI.git | |
| cd MedSightAI | |
| ``` | |
| ### 2. Backend Setup | |
| ```bash | |
| # Create virtual environment | |
| python -m venv venv | |
| # Activate (Windows) | |
| .\venv\Scripts\activate | |
| # Activate (macOS/Linux) | |
| source venv/bin/activate | |
| # Install PyTorch (GPU, CUDA 11.8) | |
| pip install torch==2.2.0+cu118 torchvision==0.17.0+cu118 --index-url https://download.pytorch.org/whl/cu118 | |
| # OR install PyTorch (CPU-only) | |
| pip install torch==2.2.0+cpu torchvision==0.17.0+cpu --index-url https://download.pytorch.org/whl/cpu | |
| # Install remaining dependencies | |
| pip install -r requirements.txt | |
| # Install scispaCy model | |
| pip install https://s3-us-west-2.amazonaws.com/ai2-s3-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz | |
| ``` | |
| ### 3. Environment Configuration | |
| ```bash | |
| # Copy the example environment file | |
| cp .env.example .env | |
| # Generate secure keys | |
| python -c "import secrets; print('SECRET_KEY=' + secrets.token_hex(32))" | |
| python -c "import secrets; print('JWT_SECRET_KEY=' + secrets.token_hex(32))" | |
| ``` | |
| Edit `.env` with your configuration. Required variables: | |
| - `SECRET_KEY`: Application secret (min 32 chars) | |
| - `JWT_SECRET_KEY`: JWT signing key (min 32 chars) | |
| - `GEMINI_API_KEY`: [Get free API key](https://aistudio.google.com/app/apikey) for AI chat | |
| - `GOOGLE_CLIENT_ID` / `GOOGLE_CLIENT_SECRET`: For OAuth (optional) | |
| ### 4. Frontend Setup | |
| ```bash | |
| cd frontend | |
| npm install | |
| ``` | |
| Create `frontend/.env.local`: | |
| ```env | |
| NEXT_PUBLIC_API_URL=http://localhost:8000 | |
| NEXT_PUBLIC_APP_NAME=MedSight AI | |
| ``` | |
| ### 5. Run the Application | |
| ```bash | |
| # Terminal 1: Backend (from project root) | |
| python -m uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000 | |
| # Terminal 2: Frontend | |
| cd frontend | |
| npm run dev | |
| ``` | |
| Open [http://localhost:3000](http://localhost:3000) in your browser. | |
| --- | |
| ## Configuration | |
| ### Environment Variables | |
| | Variable | Default | Description | | |
| |---|---|---| | |
| | `ENVIRONMENT` | `development` | `development` / `production` / `test` | | |
| | `SECRET_KEY` | *required* | Application secret key (≥32 chars) | | |
| | `DATABASE_URL` | `sqlite+aiosqlite:///./medsight.db` | Database connection string | | |
| | `GEMINI_API_KEY` | – | Google Gemini API key for AI chat | | |
| | `HF_TOKEN` | – | HuggingFace token for model downloads | | |
| | `ALLOWED_ORIGINS` | `http://localhost:3000` | Comma-separated exact frontend origins | | |
| | `ALLOWED_ORIGIN_REGEX` | `https://.*\.vercel\.app` | Regex for Vercel preview/production origins | | |
| | `TRUSTED_HOSTS` | `localhost,127.0.0.1,*.vercel.app,*.hf.space` | Hosts accepted by TrustedHostMiddleware | | |
| | `VISION_ANOMALY_BACKEND` | `auto` | `auto` / `onnx` / `pulmonary` | | |
| | `GPU_VRAM_BUDGET_MB` | `3500` | Max VRAM budget for model loading | | |
| | `MAX_UPLOAD_SIZE_MB` | `10` | Maximum upload file size | | |
| | `STORAGE_BACKEND` | `local` | `local` / `r2` (Cloudflare R2) | | |
| | `RATE_LIMIT_ANALYZE` | `10/hour` | Analysis endpoint rate limit | | |
| | `RATE_LIMIT_CHAT` | `50/hour` | Chat endpoint rate limit | | |
| See [`.env.example`](.env.example) for the complete list of configurable options. | |
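As an illustration of how a few of these variables and their defaults fit together, here is a stdlib-only sketch. The project itself uses Pydantic settings in `backend/core/config.py`; the `Settings` dataclass and `load_settings` function below are hypothetical stand-ins that cover only a subset of the table.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    environment: str
    secret_key: str
    database_url: str
    allowed_origins: List[str]
    gpu_vram_budget_mb: int
    max_upload_size_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read a subset of the documented variables, applying the documented defaults."""
    secret = env.get("SECRET_KEY", "")
    if len(secret) < 32:
        raise ValueError("SECRET_KEY must be at least 32 characters")
    origins = [
        origin.strip()
        for origin in env.get("ALLOWED_ORIGINS", "http://localhost:3000").split(",")
        if origin.strip()
    ]
    return Settings(
        environment=env.get("ENVIRONMENT", "development"),
        secret_key=secret,
        database_url=env.get("DATABASE_URL", "sqlite+aiosqlite:///./medsight.db"),
        allowed_origins=origins,
        gpu_vram_budget_mb=int(env.get("GPU_VRAM_BUDGET_MB", "3500")),
        max_upload_size_mb=int(env.get("MAX_UPLOAD_SIZE_MB", "10")),
    )


# Example with an explicit mapping instead of the real process environment.
settings = load_settings({
    "SECRET_KEY": "x" * 32,
    "ALLOWED_ORIGINS": "http://localhost:3000, https://example.vercel.app",
})
print(settings.allowed_origins)
```

Passing the mapping as an argument keeps the loader easy to test without touching the process environment.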
| ### Vision Backend Selection | |
| The `VISION_ANOMALY_BACKEND` setting controls which vision model is used: | |
| | Mode | Description | | |
| |---|---| | |
| | `auto` | Auto-detects available checkpoints (prefers `pulmonary`, then `onnx`) | | |
| | `pulmonary` | Uses the VGG16→VAE→ViT `.pth` checkpoint | | |
| | `onnx` | Uses the ConvAE ONNX model for lightweight CPU inference | | |
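The `auto` preference order can be sketched as a small selection function. The file names, the existence check, and the error raised when neither model is present are assumptions; the README does not describe how the project locates its checkpoints.

```python
import tempfile
from pathlib import Path

VALID_MODES = {"auto", "pulmonary", "onnx"}


def select_backend(mode: str, pulmonary_ckpt: Path, onnx_model: Path) -> str:
    """Resolve the configured mode to a concrete backend name."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown VISION_ANOMALY_BACKEND: {mode!r}")
    if mode != "auto":
        return mode
    if pulmonary_ckpt.is_file():  # preferred when the .pth checkpoint exists
        return "pulmonary"
    if onnx_model.is_file():
        return "onnx"
    raise FileNotFoundError("no vision checkpoint available")


with tempfile.TemporaryDirectory() as tmp:
    ckpt = Path(tmp) / "pulmonary.pth"   # hypothetical file names
    onnx = Path(tmp) / "convae.onnx"
    onnx.touch()
    only_onnx = select_backend("auto", ckpt, onnx)
    ckpt.touch()
    both_present = select_backend("auto", ckpt, onnx)

print(only_onnx, both_present)  # onnx pulmonary
```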
| --- | |
| ## API Reference | |
| ### Core Endpoints | |
| | Method | Endpoint | Description | Auth | | |
| |---|---|---|---| | |
| | `POST` | `/api/v1/analyze` | Upload X-ray image + symptoms for analysis | β | | |
| | `GET` | `/api/v1/analyze/status/{task_id}` | Poll analysis task status | β | | |
| | `GET` | `/api/v1/analyze/result/{session_id}` | Retrieve completed analysis results | β | | |
| | `POST` | `/api/v1/chat` | AI-powered clinical Q&A (streaming) | β | | |
| | `GET` | `/api/v1/report/{session_id}` | Generate & download PDF report | β | | |
| | `GET` | `/api/v1/health` | System health check | β | | |
| | `GET` | `/docs` | Interactive Swagger UI (dev only) | β | | |
| ### Authentication Endpoints | |
| | Method | Endpoint | Description | | |
| |---|---|---| | |
| | `POST` | `/api/v1/auth/register` | Email/password registration | | |
| | `POST` | `/api/v1/auth/login` | Email/password login → JWT tokens | | |
| | `POST` | `/api/v1/auth/refresh` | Refresh access token | | |
| | `GET` | `/api/v1/auth/google` | Initiate Google OAuth flow | | |
| | `GET` | `/api/v1/auth/google/callback` | Google OAuth callback | | |
| ### Analysis Response Schema | |
| ```json | |
| { | |
| "session_id": "uuid", | |
| "overall_status": "COMPLETE | PARTIAL | FAILED", | |
| "vision": { | |
| "anomaly_score": 72.5, | |
| "risk_level": "HIGH", | |
| "heatmap_base64": "data:image/png;base64,...", | |
| "top_regions": [{"x": 76, "y": 56, "width": 72, "height": 86, "confidence": 0.85}], | |
| "model_confidence": 0.82 | |
| }, | |
| "nlp": { | |
| "entities": {"diseases": [...], "symptoms": [...], "medications": [...]}, | |
| "primary_diagnosis": "Pneumonia", | |
| "diagnosis_confidence": 0.78, | |
| "differential": [{"disease": "Pleural Effusion", "confidence": 0.45}] | |
| }, | |
| "fusion": { | |
| "image_text_similarity": 0.72, | |
| "alignment": "moderate", | |
| "final_risk": "MEDIUM" | |
| }, | |
| "report_text": "## AI Diagnostic Report ...", | |
| "timings": { | |
| "preprocess_ms": 45, | |
| "vision_ms": 1200, | |
| "nlp_ms": 350, | |
| "fusion_ms": 120, | |
| "report_ms": 800, | |
| "total_ms": 2515 | |
| } | |
| } | |
| ``` | |
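A client consuming this schema should tolerate missing sections when `overall_status` is `PARTIAL`. The sketch below assumes a failed stage appears as `null` or is absent from the payload, which the README does not specify; `summarize_analysis` is a hypothetical helper.

```python
import json
from typing import Any, Dict


def summarize_analysis(payload: Dict[str, Any]) -> str:
    """Build a one-line summary, skipping sections that are null or missing."""
    status = payload.get("overall_status", "FAILED")
    vision = payload.get("vision") or {}
    nlp = payload.get("nlp") or {}
    fusion = payload.get("fusion") or {}
    parts = [f"status={status}"]
    if "anomaly_score" in vision:
        parts.append(f"anomaly_score={vision['anomaly_score']}")
    if "primary_diagnosis" in nlp:
        parts.append(f"diagnosis={nlp['primary_diagnosis']}")
    if "final_risk" in fusion:
        parts.append(f"risk={fusion['final_risk']}")
    return ", ".join(parts)


# Hypothetical PARTIAL response in which the vision stage failed.
sample = json.loads(
    '{"overall_status": "PARTIAL", "vision": null,'
    ' "nlp": {"primary_diagnosis": "Pneumonia", "diagnosis_confidence": 0.78}}'
)
summary = summarize_analysis(sample)
print(summary)  # status=PARTIAL, diagnosis=Pneumonia
```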
| --- | |
| ## Deployment | |
| ### Production Architecture | |
| | Service | Platform | Purpose | | |
| |---|---|---| | |
| | **Backend API** | HuggingFace Spaces (Docker SDK) | FastAPI + ML inference on port 7860 | | |
| | **Frontend** | Vercel | Next.js static + SSR | | |
| | **Database** | Supabase | Managed PostgreSQL | | |
| | **Models** | HuggingFace Hub | Model weight distribution | | |
| | **Storage** | Cloudflare R2 | Medical image storage (optional) | | |
| ### Docker Deployment | |
| ```bash | |
| # Build the production image | |
| docker build -t medsight-ai . | |
| # Run locally | |
| docker run -p 7860:7860 --env-file .env medsight-ai | |
| ``` | |
| The Dockerfile uses `python:3.11-slim`, installs CPU-only PyTorch (~800 MB smaller than the CUDA build), and runs Uvicorn with a single worker. Peak memory is approximately **4 GB** during inference. | |
| ### HuggingFace Spaces | |
| The backend is configured to deploy directly to HuggingFace Spaces via the Docker SDK. The HuggingFace metadata is in the `README.md` frontmatter. Models are auto-downloaded from `hoshikrana/VAE_and_VIT_Anomaly_detection` on startup. | |
| Required GitHub repository secrets for the deployment workflow: | |
| | Secret | Purpose | | |
| |---|---| | |
| | `HF_TOKEN` | Hugging Face write token for uploading the Space and runtime model downloads | | |
| | `HF_SPACE_ID` | Space repo id, for example `username/medsight-ai-backend` | | |
| | `HF_SPACE_URL` | Public backend URL, for example `https://username-medsight-ai-backend.hf.space` | | |
| | `VERCEL_TOKEN` | Vercel CLI token | | |
| | `VERCEL_ORG_ID` | Vercel team/user id | | |
| | `VERCEL_PROJECT_ID` | Vercel project id for the frontend | | |
| Set these Hugging Face Space runtime variables as secrets or variables: | |
| ```env | |
| ENVIRONMENT=production | |
| SECRET_KEY=<64-hex-or-long-random-secret> | |
| JWT_SECRET_KEY=<different-64-hex-or-long-random-secret> | |
| DATABASE_URL=<production-postgres-url-or-sqlite-for-demo-only> | |
| ALLOWED_ORIGINS=https://<your-vercel-domain> | |
| ALLOWED_ORIGIN_REGEX=https://.*\.vercel\.app | |
| TRUSTED_HOSTS=*.hf.space,localhost,127.0.0.1 | |
| FRONTEND_URL=https://<your-vercel-domain> | |
| BACKEND_URL=https://<your-hf-space-subdomain>.hf.space | |
| HF_TOKEN=<token-if-model-repo-is-private> | |
| ``` | |
| --- | |
| ## Research Paper | |
| This project is accompanied by a peer-reviewed research paper: | |
| > **"MedSight AI: A Multimodal Deep Learning Framework for Unsupervised Pulmonary Anomaly Detection with Retrieval-Augmented Clinical Decision Support"** | |
| > | |
| > Kasala Hoshik, V. Vineel Reddy, K. Chanikya | |
| > Lovely Professional University, Phagwara, Punjab, India | |
| > Research | May 2026 | |
| ### Key Research Contributions | |
| 1. **Novel three-stage architecture (VGG16 → VAE → ViT)**: Decomposes anomaly detection into feature extraction, distributional learning, and attention-based scoring with only 2.53M trainable parameters (vs. 86M in ViT-Base or 307M in DINOv2) | |
| 2. **Unsupervised paradigm shift**: Trained exclusively on normal radiographs, eliminating the need for expensive per-pathology annotation. Can detect novel/rare pathologies absent from training data | |
| 3. **Multi-signal interpretable scoring**: Fusion of reconstruction error, KL divergence, and ViT attention provides clinicians with three complementary perspectives on why an image was flagged | |
| 4. **UMAP-validated latent representations**: Emergent clustering in the VAE latent space demonstrates pathology-relevant structure without any supervised signal | |
| 5. **Production-grade multimodal system**: Complete clinical platform integrating vision, NLP, and 3-tier RAG conversational AI with graceful degradation when individual components fail | |
| 6. **Resource-constrained deployment**: Full pipeline operates within 4 GB VRAM, enabling deployment on consumer hardware and CPU-only environments | |
| ### Strengths Highlighted in the Paper | |
| - **Clinical viability**: AUROC of 0.718 suggests that unsupervised detection can support screening as a triage tool | |
| - **Extreme parameter efficiency**: 2.53M params vs. 86M (ViT-Base) or 307M (DINOv2) | |
| - **Interpretable multi-signal scoring**: Three complementary anomaly signals provide richer diagnostic information than single-metric approaches | |
| ### Future Directions | |
| - Perceptual loss (instead of MSE) for VAE reconstruction to better capture structural anomalies | |
| - Larger backbones (DINOv2 ViT-S/14 producing 384-d features) | |
| - Multi-scale latent analysis using hierarchical VAEs | |
| - Contrastive pre-training of the anomaly scorer | |
| - Domain-specific backbones (CheXNet) for improved viral pneumonia sensitivity | |
| --- | |
| ## Reproduce Training | |
| See [Training & Experimental Results](#training--experimental-results) above for full methodology and hyperparameters. | |
| ```bash | |
| # Prepare and preprocess the dataset | |
| python training/scripts/prepare_dataset.py | |
| # Train the VAE + ViT anomaly detector | |
| python training/scripts/train_anomaly.py | |
| # Or use the Jupyter notebook for interactive training | |
| jupyter notebook training/notebooks/covid\ \(1\).ipynb | |
| # Upload trained models to HuggingFace | |
| python training/scripts/upload_models.py | |
| ``` | |
| --- | |
| ## Contributing | |
| We welcome contributions! Please follow these steps: | |
| 1. Fork the repository | |
| 2. Create a feature branch (`git checkout -b feature/amazing-feature`) | |
| 3. Make your changes and ensure tests pass | |
| 4. Commit with descriptive messages (`git commit -m 'Add amazing feature'`) | |
| 5. Push to your branch (`git push origin feature/amazing-feature`) | |
| 6. Open a Pull Request | |
| ### Development Guidelines | |
| - **Backend:** Follow `ruff` and `black` formatting (see `pyproject.toml`) | |
| - **Frontend:** Follow ESLint + Prettier configuration | |
| - **Tests:** Add tests for new features (`pytest` for backend, `npm test` for frontend) | |
| - **Commits:** Use conventional commit messages | |
| ### Running Tests | |
| ```bash | |
| # Backend tests | |
| pytest backend/tests/ -v --tb=short | |
| # With specific markers | |
| pytest -m "unit" -v # Fast unit tests only | |
| pytest -m "integration" -v # Integration tests | |
| pytest -m "ml" -v # ML model tests | |
| # Frontend lint | |
| cd frontend && npm run lint | |
| ``` | |
| --- | |
| ## Acknowledgements | |
| - [COVID-19 Radiography Dataset](https://www.kaggle.com/datasets/tawsifurrahman/covid19-radiography-database): Training data | |
| - [scispaCy](https://allenai.github.io/scispacy/): Biomedical NLP models | |
| - [HuggingFace Transformers](https://huggingface.co/docs/transformers): Model hub and inference | |
| - [Google Gemini](https://ai.google.dev/): Generative AI for clinical chat | |
| - [FastAPI](https://fastapi.tiangolo.com/): High-performance async API framework | |
| - [Next.js](https://nextjs.org/): React framework for the frontend | |
| --- | |
| ## License | |
| This project is licensed under the **Apache License 2.0**; see the [LICENSE](LICENSE) file for details. | |
| --- | |
| <div align="center"> | |
| **Built with ❤️ for advancing medical AI research** | |
| *MedSight AI is a research project and should not be used for clinical diagnosis without physician oversight.* | |
| </div> | |