metadata
title: MedSight AI Backend
emoji: 🏥
colorFrom: blue
colorTo: blue
sdk: docker
app_port: 7860
pinned: true
license: apache-2.0

🏥 MedSight AI

Multimodal Medical Diagnostic Platform

AI-Powered Pulmonary Anomaly Detection Fusing Computer Vision, NLP, and Retrieval-Augmented Generation

Python 3.10+ PyTorch 2.2 FastAPI Next.js 14 License

Live Demo · Research Paper · API Docs · Architecture




Overview

MedSight AI is a full-stack multimodal medical diagnostic platform that performs automated pulmonary anomaly detection from chest X-ray images. The system fuses deep learning–based computer vision with clinical NLP and a retrieval-augmented generation (RAG) pipeline to deliver comprehensive diagnostic reports, clinical Q&A, and explainable AI visualizations, all through a modern clinical dashboard.

The platform is designed as a clinical decision-support tool (not a replacement for physicians) that assists radiologists and clinicians by:

  • Detecting pulmonary anomalies in chest X-rays using a novel VGG16 → VAE → ViT architecture (2.53M trainable parameters)
  • Extracting clinical entities from patient symptom descriptions via scispaCy NER and zero-shot disease classification
  • Generating patient-friendly diagnostic explanations through Gemini 2.0 Flash–powered conversational AI
  • Producing downloadable PDF diagnostic reports with heatmap visualizations

⚠️ Disclaimer: MedSight AI is a research prototype for educational and clinical decision-support purposes. It is not FDA-approved and should not be used as the sole basis for medical diagnosis or treatment.


Key Features

| Feature | Description |
| --- | --- |
| 🔬 Anomaly Detection | Novel VGG16 → VAE → ViT pipeline that detects anomalies via reconstruction error, KL divergence, and attention-based scoring |
| 🗺️ Heatmap Visualization | Clinical Grad-CAM–style attention overlays showing regions of interest on X-rays |
| 🧠 NLP Entity Extraction | scispaCy-powered medical NER extracting diseases, symptoms, medications, and anatomical entities |
| 🏷️ Disease Classification | Zero-shot classification using DistilBART-MNLI with rule-based fallbacks |
| 🔗 Multimodal Fusion | Image-text alignment scoring to correlate imaging findings with clinical narratives |
| 💬 AI Clinical Chat | Gemini 2.0 Flash–powered RAG chatbot with session-aware context and intent detection |
| 📄 PDF Reports | Auto-generated diagnostic reports with heatmaps, findings, and recommendations |
| 🎙️ Voice Input | Whisper-powered speech-to-text for hands-free symptom entry |
| 🔐 Authentication | JWT + Google OAuth 2.0 with secure session management and brute-force protection |
| 📊 Patient Dashboard | Comprehensive analysis history, risk tracking, and session management |

System Architecture

MedSight AI System Architecture

MedSight AI is deployed as a production-grade web application with a React/Next.js 14 frontend and an async FastAPI backend. The architecture cleanly separates vision, NLP, and conversational AI pipelines behind a unified REST API.

7-Stage Analysis Pipeline

Every X-ray analysis request flows through a deterministic 7-stage orchestration pipeline (backend/orchestration/pipeline.py):

┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐   ┌──────────┐
│ 1. Input │──▶│ 2. Vision│──▶│ 3. VRAM  │──▶│ 4. NLP   │──▶│ 5. Multi │──▶│ 6. Report│──▶│ 7. Status│
│ Validate │   │ Analysis │   │ Cleanup  │   │ Analysis │   │  Fusion  │   │   Gen    │   │  Return  │
└──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘   └──────────┘
  Preprocess     VGG16→VAE     torch.cuda     scispaCy NER   BiomedVLP      BioGPT or      COMPLETE /
  224×224 RGB    →ViT scorer   empty_cache    + DistilBART    alignment      Template       PARTIAL /
  LANCZOS        + heatmap     (GPU only)     zero-shot       scoring        fallback       FAILED

Each stage runs asynchronously with independent error handling: if vision fails, NLP still runs. The system returns COMPLETE, PARTIAL, or FAILED depending on which stages succeeded.
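The stage isolation described above can be illustrated with a minimal sketch. It is not the code in backend/orchestration/pipeline.py: the helper names (`run_stage`, `overall_status`, `analyze`), the two stand-in stages, and the exact rule that maps failures to COMPLETE / PARTIAL / FAILED are assumptions made for illustration.

```python
import asyncio
from typing import Any, Awaitable, Callable, Dict, List, Optional, Tuple

Stage = Callable[[], Awaitable[Any]]


async def run_stage(stage: Stage) -> Tuple[Any, Optional[str]]:
    """Await one stage; return (output, error) instead of raising."""
    try:
        return await stage(), None
    except Exception as exc:  # isolate the failure to this stage
        return None, f"{type(exc).__name__}: {exc}"


def overall_status(errors: Dict[str, Optional[str]]) -> str:
    """Assumed rule: COMPLETE if nothing failed, FAILED if everything failed."""
    failed = [name for name, err in errors.items() if err is not None]
    if not failed:
        return "COMPLETE"
    return "FAILED" if len(failed) == len(errors) else "PARTIAL"


async def analyze(stages: List[Tuple[str, Stage]]):
    outputs: Dict[str, Any] = {}
    errors: Dict[str, Optional[str]] = {}
    for name, stage in stages:  # fixed order keeps the run deterministic
        outputs[name], errors[name] = await run_stage(stage)
    return overall_status(errors), outputs, errors


# Hypothetical stand-in stages for the demo.
async def vision_stage() -> dict:
    raise RuntimeError("checkpoint missing")


async def nlp_stage() -> dict:
    return {"primary_diagnosis": "Pneumonia"}


status, outputs, errors = asyncio.run(
    analyze([("vision", vision_stage), ("nlp", nlp_stage)])
)
print(status, "|", errors["vision"])
```

Because every stage returns an (output, error) pair, a vision failure is recorded rather than raised, and the NLP stage still contributes its result to a PARTIAL response.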

VRAM-Aware Model Registry

A custom ModelRegistry manages six ML models with priority-based loading, LRU GPU eviction, and async initialization. This enables deployment on consumer hardware with as little as 4 GB VRAM:

| Priority | Model | HuggingFace ID | RAM Required | | Purpose |
| --- | --- | --- | --- | --- | --- |
| 1 | VGG16+VAE+ViT | hoshikrana/VAE_and_VIT_Anomaly_detection | 50 MB | ✅ | Anomaly detection |
| 1 | MiniLM-L6-v2 | sentence-transformers/all-MiniLM-L6-v2 | 100 MB | ✅ | RAG embeddings |
| 2 | scispaCy NER | en_core_sci_sm | 100 MB | ✅ | Medical entity extraction |
| 3 | Whisper Tiny | openai/whisper-tiny | 300 MB | ❌ | Voice transcription |
| 4 | BioGPT | microsoft/biogpt | 700 MB | ❌ | Report generation |
| 5 | DistilBART | valhalla/distilbart-mnli-12-1 | 300 MB | ❌ | Zero-shot classification |

The registry supports dynamic GPU↔CPU migration: when a higher-priority model needs GPU memory, the least-recently-used GPU model is evicted to CPU automatically.
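The eviction behaviour can be sketched with a toy registry. This is only an illustration of LRU eviction under a memory budget: the class `TinyRegistry`, its methods, and the 1000 MB budget are hypothetical, priorities and async loading are left out, and "moving" a model is just bookkeeping here rather than an actual device transfer.

```python
from collections import OrderedDict
from typing import Dict, List


class TinyRegistry:
    """Toy budgeted registry: evicts least-recently-used GPU models to CPU."""

    def __init__(self, vram_budget_mb: int) -> None:
        self.budget = vram_budget_mb
        self.sizes: Dict[str, int] = {}
        self.on_gpu: "OrderedDict[str, int]" = OrderedDict()  # oldest use first
        self.on_cpu: set = set()

    def register(self, name: str, size_mb: int) -> None:
        self.sizes[name] = size_mb
        self.on_cpu.add(name)  # models start on CPU in this sketch

    def use(self, name: str) -> List[str]:
        """Place `name` on the GPU; return the models evicted to make room."""
        size = self.sizes[name]
        if size > self.budget:
            raise ValueError(f"{name} does not fit in the VRAM budget")
        if name in self.on_gpu:
            self.on_gpu.move_to_end(name)  # mark as most recently used
            return []
        evicted: List[str] = []
        while sum(self.on_gpu.values()) + size > self.budget:
            victim, _ = self.on_gpu.popitem(last=False)  # least recently used
            self.on_cpu.add(victim)
            evicted.append(victim)
        self.on_cpu.discard(name)
        self.on_gpu[name] = size
        return evicted


reg = TinyRegistry(vram_budget_mb=1000)  # hypothetical budget
reg.register("whisper", 300)
reg.register("biogpt", 700)
reg.register("distilbart", 300)

reg.use("whisper")
reg.use("biogpt")      # GPU now holds 1000 MB
reg.use("whisper")     # touching whisper makes biogpt the LRU entry
evicted = reg.use("distilbart")
print(evicted, list(reg.on_gpu))
```

In the example, loading DistilBART pushes BioGPT back to CPU because Whisper was used more recently.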

NLP Pipeline

The NLP module processes clinical notes through three stages:

  1. Named Entity Recognition: scispaCy (en_core_sci_sm) extracts diseases, symptoms, medications, and anatomical references from patient text
  2. Zero-Shot Classification: DistilBART-MNLI classifies clinical text against 20 pulmonary conditions without task-specific fine-tuning (falls back to rule-based matching if the model isn't loaded)
  3. Multimodal Fusion: Optional BiomedVLP image-text alignment scoring correlates imaging findings with clinical narratives, with a keyword-based fallback for constrained environments

3-Tier RAG Conversational Architecture

The conversational module implements a highly resilient 3-tier Retrieval-Augmented Generation system that never fails silently:

| Tier | Engine | Method | Latency |
| --- | --- | --- | --- |
| Tier 1 | Gemini 2.0 Flash (Cloud) | Streaming SSE with dynamic system instructions | ~1.5s |
| Tier 2 | BioGPT (Local) | Beam search decoding (num_beams=4) | ~3s |
| Tier 3 | Heuristic Templates | Intent-detection rule engine with 8 intent categories | ~5ms |

Context construction aggregates: vision anomaly scores → NLP predictions → fusion similarity → patient session history → retrieved PubMed abstracts (via MiniLM-L6-v2 + ChromaDB HNSW indexing). All tiers prohibit dosage recommendations and append medical disclaimers.
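A tiered fallback of this kind might look like the sketch below. The three engines are hypothetical stand-ins (no Gemini or BioGPT call is made), and the function name `answer`, the tier labels, and the disclaimer wording are assumptions rather than the project's actual implementation.

```python
from typing import Callable, List, Tuple

Engine = Callable[[str], str]

# Assumed wording; the real disclaimer text is not given in this README.
DISCLAIMER = "This information is not a substitute for professional medical advice."


def answer(question: str, tiers: List[Tuple[str, Engine]]) -> Tuple[str, str]:
    """Try each tier in order; return (tier_name, reply_with_disclaimer)."""
    for name, engine in tiers:
        try:
            reply = engine(question).strip()
        except Exception:
            continue  # this tier is unavailable, fall through to the next
        if reply:
            return name, f"{reply}\n\n{DISCLAIMER}"
    raise RuntimeError("no tier produced an answer")


def cloud_llm(question: str) -> str:
    raise ConnectionError("API unreachable")  # simulated outage


def local_llm(question: str) -> str:
    return ""  # simulated empty generation


def template_engine(question: str) -> str:
    text = question.lower()
    if "dose" in text or "dosage" in text:
        return "I cannot provide dosage recommendations."
    return "Please discuss these findings with your clinician."


tier, reply = answer(
    "What dosage should I take?",
    [("tier1-cloud", cloud_llm), ("tier2-local", local_llm), ("tier3-templates", template_engine)],
)
print(tier)
print(reply)
```

Keeping the rule-based tier last means a reply is always produced as long as the template engine itself does not raise, which is the property the table above relies on.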


Model Pipeline: VGG16 → VAE → ViT

Three-Stage Anomaly Detection Architecture

The core anomaly detection system implements a novel three-stage unsupervised architecture with only 2.53M trainable parameters. The model is trained exclusively on normal chest X-rays and detects anomalies by learning the distribution of healthy pulmonary anatomy, requiring zero pathology-specific labels.

Stage 1: VGG16 Feature Extraction (0 trainable params)

Pre-trained VGG16 (ImageNet) serves as a frozen feature extractor. Convolutional feature maps are globally average-pooled to produce a compact representation per image.

Input: 224×224×3 RGB (ImageNet-normalized)
  → VGG16.features (frozen)
  → AdaptiveAvgPool2d(1,1)
  → Flatten
  → Output: ℝ⁵¹² feature vector

Why freeze? (i) Deterministic features ensure stable VAE training; (ii) zero gradient storage saves VRAM; (iii) ImageNet features transfer well to medical imaging (Raghu et al., 2019).

Stage 2: Variational Autoencoder (1,318,656 params)

The VAE learns a smooth, continuous latent manifold of normal pulmonary anatomy. During inference, pathological images produce higher reconstruction error and KL divergence because they fall outside the learned normal distribution.

Encoder: 512 → 512 → 384 → 256 → [μ, log σ²]  (each with LayerNorm + GELU + Dropout 0.1)
         ↓
     Reparameterization: z = μ + ε·σ    (ε ~ N(0,1))
         ↓
Decoder: 256 → 384 → 512 → 512          (symmetric architecture)
         ↓
     Output: x̂ (reconstructed features)

Loss function, based on the Evidence Lower Bound (ELBO):

L_VAE = L_recon + β · L_KL

where:  L_recon = MSE(x̂, x)
        L_KL   = -½ Σ(1 + log σ² - μ² - σ²)
        β      = 0.001  (β-VAE formulation to prevent posterior collapse)
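As a worked illustration of the loss above, here is a dependency-free sketch for a single sample. It assumes the MSE is averaged over feature dimensions and the KL term is summed over latent dimensions; the README does not state which reductions the training code uses, and the function name `vae_loss` is hypothetical.

```python
import math
from typing import Sequence, Tuple


def vae_loss(
    x: Sequence[float],
    x_hat: Sequence[float],
    mu: Sequence[float],
    logvar: Sequence[float],
    beta: float = 0.001,
) -> Tuple[float, float, float]:
    """Return (total, recon, kl) for one sample.

    recon = mean squared error between x_hat and x
    kl    = -1/2 * sum(1 + log(sigma^2) - mu^2 - sigma^2)
    """
    recon = sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)
    kl = -0.5 * sum(1.0 + lv - m * m - math.exp(lv) for m, lv in zip(mu, logvar))
    return recon + beta * kl, recon, kl


# A perfect reconstruction with a standard-normal posterior costs nothing.
print(vae_loss([1.0, 2.0], [1.0, 2.0], mu=[0.0, 0.0], logvar=[0.0, 0.0]))

# Shifting the posterior mean to 1 (unit variance) gives KL = 0.5 per dimension.
total, recon, kl = vae_loss([0.0], [0.0], mu=[1.0], logvar=[0.0])
print(total, recon, kl)
```

With β = 0.001 the KL term contributes only a small fraction of the total, so training is dominated by reconstruction quality.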

Stage 3: Vision Transformer Anomaly Scorer (1,209,729 params)

The ViT operates on the latent vector z (not raw pixels), treating it as a sequence of patches for self-attention-based anomaly scoring. This is a key architectural decision: the ViT scores the quality of the latent representation rather than the image directly.

z ∈ ℝ²⁵⁶ → reshape to 8 patches of dim 32
  → Linear projection to d_model = 128
  → Prepend learnable [CLS] token
  → Add positional embeddings (9 tokens = 8 patches + CLS)
  → 6× Transformer Blocks (8-head attention, MLP dim 512, GELU, Dropout 0.1)
  → LayerNorm
  → [CLS] token → MLP head → Sigmoid → anomaly score ∈ [0, 1]

| Hyperparameter | Value |
| --- | --- |
| Latent dimension | 256 |
| Patch dimension | 32 |
| Number of patches | 8 |
| Model dimension (d_model) | 128 |
| Transformer depth | 6 layers |
| Attention heads | 8 |
| MLP dimension | 512 |
| Dropout | 0.1 |
| Output activation | Sigmoid → [0, 1] |
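The token arithmetic in the table can be checked with a small sketch. It assumes the 256-dimensional latent is split into contiguous chunks; the README does not say how the reshape is ordered, and `to_patches` is a hypothetical helper.

```python
from typing import List, Sequence


def to_patches(z: Sequence[float], patch_dim: int = 32) -> List[List[float]]:
    """Split a latent vector into contiguous patches of length patch_dim."""
    if len(z) % patch_dim != 0:
        raise ValueError("latent size must be a multiple of patch_dim")
    return [list(z[i:i + patch_dim]) for i in range(0, len(z), patch_dim)]


z = [float(i) for i in range(256)]   # stand-in latent vector
patches = to_patches(z)
num_tokens = len(patches) + 1        # +1 for the learnable [CLS] token

print(len(patches), len(patches[0]), num_tokens)
```

Each of the 8 patches is then projected from 32 to d_model = 128 dimensions before the [CLS] token and positional embeddings are added.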

Fused Anomaly Score

The final anomaly score fuses three complementary signals via weighted linear combination after normalizing each component using calibration statistics computed on the training set:

S_anomaly = w₁ · σ((e_recon - μ_recon) / σ_recon)
          + w₂ · σ((d_KL - μ_KL) / σ_KL)
          + w₃ · s_ViT

where:  w₁ = 0.4  (reconstruction error: pixel-level deviations)
        w₂ = 0.2  (KL divergence: distributional shift)
        w₃ = 0.4  (ViT score: higher-order latent abnormalities)
        σ(·) = sigmoid normalization
        μ_recon, σ_recon, μ_KL, σ_KL = calibration mean and standard deviation of each component

The optimal threshold of 0.348 was determined by maximizing the Youden index on the validation set.
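The fusion and threshold selection can be sketched as follows. The weights come from the formula above; the calibration statistics, the toy scores, and the function names `fused_score` and `youden_threshold` are illustrative assumptions, not values or code from the project.

```python
import math
from typing import Dict, List, Sequence, Tuple


def sigmoid(t: float) -> float:
    return 1.0 / (1.0 + math.exp(-t))


def fused_score(
    e_recon: float,
    d_kl: float,
    s_vit: float,
    stats: Dict[str, float],
    weights: Tuple[float, float, float] = (0.4, 0.2, 0.4),
) -> float:
    """Weighted sum of sigmoid-normalized z-scores and the ViT score."""
    w1, w2, w3 = weights
    z_recon = (e_recon - stats["mu_recon"]) / stats["sigma_recon"]
    z_kl = (d_kl - stats["mu_kl"]) / stats["sigma_kl"]
    return w1 * sigmoid(z_recon) + w2 * sigmoid(z_kl) + w3 * s_vit


def youden_threshold(scores: Sequence[float], labels: Sequence[int]) -> Tuple[float, float]:
    """Pick the threshold maximizing J = sensitivity + specificity - 1."""
    pos = sum(labels)
    neg = len(labels) - pos
    best_t, best_j = 0.0, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= t)
        tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < t)
        j = tp / pos + tn / neg - 1.0
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j


# Hypothetical calibration statistics; an "average" sample maps to 0.5 per term.
stats = {"mu_recon": 0.015, "sigma_recon": 0.005, "mu_kl": 0.7, "sigma_kl": 0.2}
score = fused_score(e_recon=0.015, d_kl=0.7, s_vit=0.5, stats=stats)

toy_scores: List[float] = [0.1, 0.2, 0.6, 0.8]
toy_labels: List[int] = [0, 0, 1, 1]
threshold, j = youden_threshold(toy_scores, toy_labels)
print(round(score, 3), threshold, j)
```

Because every term lies in [0, 1] and the weights sum to 1, the fused score also stays in [0, 1], which is what makes a single fixed threshold meaningful.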

Interpretability: Clinical Attention Heatmaps

To provide visual explainability, the system extracts [CLS] token attention weights from the final ViT layer:

  1. Average attention across all 8 heads → patch-level attention vector
  2. Reshape into 2D grid and upsample to 384×384 via bicubic interpolation
  3. Apply clinical colormap (black → dark red → orange → bright yellow)
  4. Adaptive transparency mask ensures only anomalous regions glow over the X-ray
  5. CLAHE enhancement on base radiograph maximizes anatomical contrast

The result is a three-panel visualization: Original X-ray | Attention Heatmap | Clinical Overlay with anomaly score.
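Step 1 of the list above can be sketched without any ML dependencies. The input layout (one row of [CLS]-to-token weights per head, with the [CLS] entry first) and the min-max rescaling are assumptions for illustration; upsampling, colormapping, and CLAHE are not shown, and `cls_patch_attention` is a hypothetical name.

```python
from typing import List, Sequence


def cls_patch_attention(attn_heads: Sequence[Sequence[float]]) -> List[float]:
    """Average [CLS] attention over heads and rescale patch weights to [0, 1].

    attn_heads[h][j] is the attention from [CLS] to token j in head h,
    where token 0 is [CLS] itself and tokens 1.. are the patches.
    """
    n_heads = len(attn_heads)
    n_tokens = len(attn_heads[0])
    mean = [sum(head[j] for head in attn_heads) / n_heads for j in range(n_tokens)]
    patches = mean[1:]  # drop the CLS-to-CLS entry
    lo, hi = min(patches), max(patches)
    if hi == lo:
        return [0.0] * len(patches)  # flat attention carries no spatial signal
    return [(p - lo) / (hi - lo) for p in patches]


# Two heads, three tokens ([CLS] plus two patches) keep the example readable.
heads = [
    [0.2, 0.3, 0.5],
    [0.4, 0.1, 0.5],
]
weights = cls_patch_attention(heads)
print(weights)
```

With the real model this would be called on 8 heads and 9 tokens, yielding 8 patch weights for the later reshape and upsampling steps.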


Training & Experimental Results

Dataset: COVID-19 Radiography Database

| Class | Count | Usage |
| --- | --- | --- |
| Normal | 10,192 | Training (unsupervised; the model sees only this class) |
| COVID-19 | 3,616 | Evaluation only |
| Lung Opacity | 6,012 | Evaluation only |
| Viral Pneumonia | 1,345 | Evaluation only |
| Total | 21,165 | - |

Preprocessing: 224×224 Lanczos resize, RGB, ImageNet normalization. Training augmentations: random horizontal flip (p=0.5), rotation (±10°), color jitter (±0.2).

Two-Phase Training Protocol

Phase 1: VAE Training (50 epochs)

  • AdamW optimizer (lr=1×10⁻⁴, weight_decay=1×10⁻⁵)
  • ReduceLROnPlateau scheduler (factor=0.5, patience=3)
  • Batch size 32, β=0.001, early stopping patience 10
  • Resource-efficient: mixed-precision FP16, gradient accumulation (4 steps → effective batch 128)

Phase 2: ViT Scorer Training (30 epochs)

  • AdamW optimizer (lr=5×10⁻⁵, weight_decay=1×10⁻⁵)
  • Binary cross-entropy: Normal→0, Anomaly→1
  • Only the ViT uses labels; the VAE remains fully unsupervised

Results

| Metric | Value |
| --- | --- |
| AUROC | 0.718 |
| ViT Validation Accuracy | 98.6% |
| VAE Final Reconstruction MSE | 0.0152 |
| VAE β·KL Divergence | 6.97×10⁻⁴ |
| True Positives (Anomaly→Anomaly) | 4,974 |
| True Negatives (Normal→Normal) | 1,017 |
| Sensitivity (Recall) | 64.7% |
| Specificity | 66.5% |
| Optimal Threshold | 0.348 |
| Total Trainable Parameters | 2,528,385 |

Ablation Study: Fusion Component Analysis

| Configuration | AUROC | Notes |
| --- | --- | --- |
| Reconstruction error only | 0.62 | MSE between VGG features and reconstruction |
| KL divergence only | 0.68 | Strongest single signal |
| ViT score only | 0.65 | Latent-space attention scoring |
| Recon. + KL (w/o ViT) | 0.69 | Traditional VAE anomaly detection |
| Full fusion (0.4 / 0.2 / 0.4) | 0.718 | Best configuration |

Each component provides complementary information: reconstruction error captures pixel-level deviations, KL divergence captures distributional shift, and the ViT captures higher-order latent abnormalities via attention.

Latent Space Validation (UMAP)

UMAP projection of the 256-dimensional VAE latent space reveals emergent clustering without supervision:

  • Normal images cluster tightly: the VAE learned a compact representation of healthy anatomy
  • Lung Opacity forms a distinct separable cluster: the most detectable anomaly class
  • Viral Pneumonia partially overlaps with normal, explaining its harder detectability
  • COVID-19 cases are sparse and widely distributed, reflecting heterogeneous radiographic presentations

Tech Stack

Backend

| Component | Technology |
| --- | --- |
| Framework | FastAPI 0.110 (async, Pydantic v2) |
| ML Runtime | PyTorch 2.2 + ONNX Runtime |
| NLP | scispaCy, HuggingFace Transformers, BioGPT |
| Embeddings | Sentence-Transformers (MiniLM-L6-v2) |
| Vector DB | ChromaDB 0.4.24 |
| Generative AI | Google Gemini 2.0 Flash |
| Database | SQLAlchemy 2.0 (SQLite dev / PostgreSQL prod) |
| Auth | JWT + Google OAuth 2.0 (Authlib) |
| Task Scheduling | APScheduler |
| PDF Generation | ReportLab |

Frontend

| Component | Technology |
| --- | --- |
| Framework | Next.js 14 (App Router) |
| Styling | Tailwind CSS 3.4 |
| Animations | Framer Motion 11 |
| Charts | Recharts 2.12 |
| Icons | Lucide React |
| HTTP Client | Axios |
| Deployment | Vercel |

Infrastructure

| Component | Technology |
| --- | --- |
| Containerization | Docker (Python 3.11-slim) |
| Backend Hosting | HuggingFace Spaces (Docker SDK) |
| Frontend Hosting | Vercel |
| Model Distribution | HuggingFace Hub |
| Object Storage | Cloudflare R2 (optional) |
| Database (Prod) | Supabase PostgreSQL |

Project Structure

MedSightAI/
├── backend/
│   ├── api/v1/
│   │   ├── routers/          # FastAPI route handlers
│   │   │   ├── analyze.py    # X-ray upload & analysis
│   │   │   ├── auth.py       # JWT + OAuth authentication
│   │   │   ├── chat.py       # RAG-powered clinical Q&A
│   │   │   ├── report.py     # PDF report generation
│   │   │   └── users.py      # User profiles & session history
│   │   └── schemas/          # Pydantic v2 request/response models
│   ├── core/
│   │   ├── config.py         # Pydantic settings (env-driven)
│   │   ├── security.py       # JWT, password hashing, API keys
│   │   ├── middleware.py     # CORS, rate limiting, security headers
│   │   └── exceptions.py     # Custom exception hierarchy
│   ├── db/
│   │   ├── models/           # SQLAlchemy ORM models
│   │   ├── migrations/       # Alembic migration scripts
│   │   └── session.py        # Async database session factory
│   ├── ml/
│   │   ├── vision/
│   │   │   ├── pulmonary_anomaly.py  # VGG16→VAE→ViT detector
│   │   │   ├── anomaly.py    # ONNX ConvAE fallback
│   │   │   └── hf_download.py # HuggingFace model auto-download
│   │   ├── nlp/
│   │   │   ├── ner.py        # scispaCy medical NER
│   │   │   ├── classifier.py # Zero-shot disease classification
│   │   │   └── whisper.py    # Voice-to-text transcription
│   │   ├── rag/
│   │   │   ├── gemini_client.py  # Gemini 2.0 Flash integration
│   │   │   ├── generator.py  # BioGPT report + chat generation
│   │   │   ├── retriever.py  # ChromaDB vector retrieval
│   │   │   └── vectorstore.py # Embedding + indexing pipeline
│   │   ├── fusion/
│   │   │   └── medclip.py    # Multimodal image-text alignment
│   │   └── registry.py       # Model lifecycle manager
│   ├── orchestration/
│   │   ├── pipeline.py       # 7-stage analysis orchestrator
│   │   ├── queue.py          # Async task queue
│   │   ├── resilience.py     # Retry, circuit-breaker, fallbacks
│   │   ├── scheduler.py      # Periodic cleanup tasks
│   │   └── workers.py        # Background worker pool
│   └── utils/
│       ├── pdf.py            # Clinical PDF report builder
│       ├── image.py          # Image preprocessing utilities
│       ├── audio.py          # Audio format handling
│       └── validators.py     # Input validation helpers
├── frontend/
│   ├── app/                  # Next.js App Router pages
│   │   ├── (auth)/           # Login / Registration pages
│   │   ├── (dashboard)/      # Analysis dashboard
│   │   ├── about/            # About page
│   │   └── profile/          # User profile & history
│   ├── components/
│   │   ├── analysis/         # Upload panel, results viewer
│   │   ├── chat/             # AI chat interface
│   │   ├── shared/           # Navbar, layout components
│   │   └── ui/               # Reusable UI primitives
│   └── lib/                  # API client, auth context, utilities
├── training/
│   ├── notebooks/            # Jupyter training notebooks
│   └── scripts/              # Data preparation & training scripts
├── data/                     # Raw/processed data & uploads
├── models/                   # Cached model weights
├── results/                  # Training outputs & evaluation
├── Dockerfile                # Production Docker image
├── requirements.txt          # Python dependencies
└── .env.example              # Environment variable template

Getting Started

Prerequisites

  • Python 3.10 or higher
  • Node.js 18+ and npm
  • Git and Git LFS (for model weights)
  • (Optional) CUDA 11.8+ compatible GPU for accelerated inference

1. Clone the Repository

git clone https://github.com/hoshikrana/MedSightAI.git
cd MedSightAI

2. Backend Setup

# Create virtual environment
python -m venv venv

# Activate (Windows)
.\venv\Scripts\activate

# Activate (macOS/Linux)
source venv/bin/activate

# Install PyTorch (GPU, CUDA 11.8)
pip install torch==2.2.0+cu118 torchvision==0.17.0+cu118 --index-url https://download.pytorch.org/whl/cu118

# OR install PyTorch (CPU-only)
pip install torch==2.2.0+cpu torchvision==0.17.0+cpu --index-url https://download.pytorch.org/whl/cpu

# Install remaining dependencies
pip install -r requirements.txt

# Install scispaCy model
pip install https://s3-us-west-2.amazonaws.com/ai2-s3-scispacy/releases/v0.5.1/en_core_sci_sm-0.5.1.tar.gz

3. Environment Configuration

# Copy the example environment file
cp .env.example .env

# Generate secure keys
python -c "import secrets; print('SECRET_KEY=' + secrets.token_hex(32))"
python -c "import secrets; print('JWT_SECRET_KEY=' + secrets.token_hex(32))"

Edit .env with your configuration. Required variables:

  • SECRET_KEY: Application secret (min 32 chars)
  • JWT_SECRET_KEY: JWT signing key (min 32 chars)
  • GEMINI_API_KEY: Google Gemini API key for AI chat (a free key is sufficient)
  • GOOGLE_CLIENT_ID / GOOGLE_CLIENT_SECRET: For OAuth (optional)

4. Frontend Setup

cd frontend
npm install

Create frontend/.env.local:

NEXT_PUBLIC_API_URL=http://localhost:8000
NEXT_PUBLIC_APP_NAME=MedSight AI

5. Run the Application

# Terminal 1: Backend (from project root)
python -m uvicorn backend.main:app --reload --host 0.0.0.0 --port 8000

# Terminal 2: Frontend
cd frontend
npm run dev

Open http://localhost:3000 in your browser.


Configuration

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| ENVIRONMENT | development | development / production / test |
| SECRET_KEY | required | Application secret key (≥32 chars) |
| DATABASE_URL | sqlite+aiosqlite:///./medsight.db | Database connection string |
| GEMINI_API_KEY | - | Google Gemini API key for AI chat |
| HF_TOKEN | - | HuggingFace token for model downloads |
| ALLOWED_ORIGINS | http://localhost:3000 | Comma-separated exact frontend origins |
| ALLOWED_ORIGIN_REGEX | `https://.*\.vercel\.app` | Regex for Vercel preview/production origins |
| TRUSTED_HOSTS | `localhost,127.0.0.1,*.vercel.app,*.hf.space` | Hosts accepted by TrustedHostMiddleware |
| VISION_ANOMALY_BACKEND | auto | auto / onnx / pulmonary |
| GPU_VRAM_BUDGET_MB | 3500 | Max VRAM budget for model loading |
| MAX_UPLOAD_SIZE_MB | 10 | Maximum upload file size |
| STORAGE_BACKEND | local | local / r2 (Cloudflare R2) |
| RATE_LIMIT_ANALYZE | 10/hour | Analysis endpoint rate limit |
| RATE_LIMIT_CHAT | 50/hour | Chat endpoint rate limit |

See .env.example for the complete list of configurable options.
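To show how a few of these variables and defaults fit together, here is a standard-library sketch. The project itself loads settings through Pydantic (backend/core/config.py); the `Settings` dataclass and `load_settings` function below are hypothetical and cover only a subset of the variables.

```python
import os
from dataclasses import dataclass
from typing import List, Mapping


@dataclass(frozen=True)
class Settings:
    environment: str
    secret_key: str
    allowed_origins: List[str]
    vision_backend: str
    gpu_vram_budget_mb: int
    max_upload_size_mb: int


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read settings from a mapping, applying the documented defaults."""
    secret = env.get("SECRET_KEY", "")
    if len(secret) < 32:
        raise ValueError("SECRET_KEY must be at least 32 characters")
    backend = env.get("VISION_ANOMALY_BACKEND", "auto")
    if backend not in {"auto", "onnx", "pulmonary"}:
        raise ValueError(f"unknown VISION_ANOMALY_BACKEND: {backend}")
    raw_origins = env.get("ALLOWED_ORIGINS", "http://localhost:3000")
    origins = [o.strip() for o in raw_origins.split(",") if o.strip()]
    return Settings(
        environment=env.get("ENVIRONMENT", "development"),
        secret_key=secret,
        allowed_origins=origins,
        vision_backend=backend,
        gpu_vram_budget_mb=int(env.get("GPU_VRAM_BUDGET_MB", "3500")),
        max_upload_size_mb=int(env.get("MAX_UPLOAD_SIZE_MB", "10")),
    )


# Passing a plain dict keeps the demo independent of the real environment.
settings = load_settings({
    "SECRET_KEY": "a" * 64,
    "ALLOWED_ORIGINS": "https://app.example.com, https://staging.example.com",
})
print(settings.environment, settings.allowed_origins, settings.gpu_vram_budget_mb)
```

Failing fast on a short SECRET_KEY or an unknown backend name surfaces misconfiguration at startup instead of at the first request.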

Vision Backend Selection

The VISION_ANOMALY_BACKEND setting controls which vision model is used:

| Mode | Description |
| --- | --- |
| auto | Auto-detects available checkpoints (prefers pulmonary → onnx) |
| pulmonary | Uses the VGG16→VAE→ViT .pth checkpoint |
| onnx | Uses the ConvAE ONNX model for lightweight CPU inference |

API Reference

Core Endpoints

| Method | Endpoint | Description | Auth |
| --- | --- | --- | --- |
| POST | /api/v1/analyze | Upload X-ray image + symptoms for analysis | ✅ |
| GET | /api/v1/analyze/status/{task_id} | Poll analysis task status | ✅ |
| GET | /api/v1/analyze/result/{session_id} | Retrieve completed analysis results | ✅ |
| POST | /api/v1/chat | AI-powered clinical Q&A (streaming) | ✅ |
| GET | /api/v1/report/{session_id} | Generate & download PDF report | ✅ |
| GET | /api/v1/health | System health check | ❌ |
| GET | /docs | Interactive Swagger UI (dev only) | ❌ |

Authentication Endpoints

| Method | Endpoint | Description |
| --- | --- | --- |
| POST | /api/v1/auth/register | Email/password registration |
| POST | /api/v1/auth/login | Email/password login → JWT tokens |
| POST | /api/v1/auth/refresh | Refresh access token |
| GET | /api/v1/auth/google | Initiate Google OAuth flow |
| GET | /api/v1/auth/google/callback | Google OAuth callback |

Analysis Response Schema

{
  "session_id": "uuid",
  "overall_status": "COMPLETE | PARTIAL | FAILED",
  "vision": {
    "anomaly_score": 72.5,
    "risk_level": "HIGH",
    "heatmap_base64": "data:image/png;base64,...",
    "top_regions": [{"x": 76, "y": 56, "width": 72, "height": 86, "confidence": 0.85}],
    "model_confidence": 0.82
  },
  "nlp": {
    "entities": {"diseases": [...], "symptoms": [...], "medications": [...]},
    "primary_diagnosis": "Pneumonia",
    "diagnosis_confidence": 0.78,
    "differential": [{"disease": "Pleural Effusion", "confidence": 0.45}]
  },
  "fusion": {
    "image_text_similarity": 0.72,
    "alignment": "moderate",
    "final_risk": "MEDIUM"
  },
  "report_text": "## AI Diagnostic Report ...",
  "timings": {
    "preprocess_ms": 45,
    "vision_ms": 1200,
    "nlp_ms": 350,
    "fusion_ms": 120,
    "report_ms": 800,
    "total_ms": 2515
  }
}
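A client consuming this response has to cope with PARTIAL results. The sketch below assumes that a failed stage appears as a missing or null section, which the schema above does not specify; the `summarize` helper and the sample payload are hypothetical.

```python
from typing import Any, Dict


def summarize(result: Dict[str, Any]) -> str:
    """Build a one-line summary, skipping sections that are missing or null."""
    parts = [f"status={result.get('overall_status', 'FAILED')}"]
    vision = result.get("vision") or {}
    if "anomaly_score" in vision:
        parts.append(f"anomaly={vision['anomaly_score']} ({vision.get('risk_level', 'n/a')})")
    nlp = result.get("nlp") or {}
    if "primary_diagnosis" in nlp:
        parts.append(f"dx={nlp['primary_diagnosis']}")
    timings = result.get("timings") or {}
    stage_ms = sum(v for k, v in timings.items() if k != "total_ms")
    parts.append(f"stages={stage_ms}ms")
    return ", ".join(parts)


# Hypothetical PARTIAL payload: the vision stage failed, NLP succeeded.
partial = {
    "overall_status": "PARTIAL",
    "vision": None,
    "nlp": {"primary_diagnosis": "Pneumonia", "diagnosis_confidence": 0.78},
    "timings": {"nlp_ms": 350, "report_ms": 800, "total_ms": 1150},
}
summary = summarize(partial)
print(summary)
```

Reading every section through `.get(...) or {}` keeps the client from raising on a degraded response.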

Deployment

Production Architecture

| Service | Platform | Purpose |
| --- | --- | --- |
| Backend API | HuggingFace Spaces (Docker SDK) | FastAPI + ML inference on port 7860 |
| Frontend | Vercel | Next.js static + SSR |
| Database | Supabase | Managed PostgreSQL |
| Models | HuggingFace Hub | Model weight distribution |
| Storage | Cloudflare R2 | Medical image storage (optional) |

Docker Deployment

# Build the production image
docker build -t medsight-ai .

# Run locally
docker run -p 7860:7860 --env-file .env medsight-ai

The Dockerfile uses python:3.11-slim, installs CPU-only PyTorch (~800MB smaller than CUDA), and runs Uvicorn with a single worker. Peak memory is approximately 4GB during inference.

HuggingFace Spaces

The backend is configured to deploy directly to HuggingFace Spaces via the Docker SDK. The HuggingFace metadata is in the README.md frontmatter. Models are auto-downloaded from hoshikrana/VAE_and_VIT_Anomaly_detection on startup.

Required GitHub repository secrets for the deployment workflow:

| Secret | Purpose |
| --- | --- |
| HF_TOKEN | Hugging Face write token for uploading the Space and runtime model downloads |
| HF_SPACE_ID | Space repo id, for example username/medsight-ai-backend |
| HF_SPACE_URL | Public backend URL, for example https://username-medsight-ai-backend.hf.space |
| VERCEL_TOKEN | Vercel CLI token |
| VERCEL_ORG_ID | Vercel team/user id |
| VERCEL_PROJECT_ID | Vercel project id for the frontend |

Set these Hugging Face Space runtime variables as secrets or variables:

ENVIRONMENT=production
SECRET_KEY=<64-hex-or-long-random-secret>
JWT_SECRET_KEY=<different-64-hex-or-long-random-secret>
DATABASE_URL=<production-postgres-url-or-sqlite-for-demo-only>
ALLOWED_ORIGINS=https://<your-vercel-domain>
ALLOWED_ORIGIN_REGEX=https://.*\.vercel\.app
TRUSTED_HOSTS=*.hf.space,localhost,127.0.0.1
FRONTEND_URL=https://<your-vercel-domain>
BACKEND_URL=https://<your-hf-space-subdomain>.hf.space
HF_TOKEN=<token-if-model-repo-is-private>

Research Paper

This project is accompanied by a peer-reviewed research paper:

"MedSight AI: A Multimodal Deep Learning Framework for Unsupervised Pulmonary Anomaly Detection with Retrieval-Augmented Clinical Decision Support"

Kasala Hoshik, V. Vineel Reddy, K. Chanikya
Lovely Professional University, Phagwara, Punjab, India
Research | May 2026

Key Research Contributions

  1. Novel three-stage architecture (VGG16 → VAE → ViT): Decomposes anomaly detection into feature extraction, distributional learning, and attention-based scoring with only 2.53M trainable parameters (vs. 86M in ViT-Base or 307M in DINOv2)
  2. Unsupervised paradigm shift: Trained exclusively on normal radiographs, eliminating the need for expensive per-pathology annotation. Can detect novel/rare pathologies absent from training data
  3. Multi-signal interpretable scoring: Fusion of reconstruction error, KL divergence, and ViT attention provides clinicians with three complementary perspectives on why an image was flagged
  4. UMAP-validated latent representations: Emergent clustering in the VAE latent space demonstrates pathology-relevant structure without any supervised signal
  5. Production-grade multimodal system: Complete clinical platform integrating vision, NLP, and 3-tier RAG conversational AI with graceful degradation when individual components fail
  6. Resource-constrained deployment: Full pipeline operates within 4 GB VRAM, enabling deployment on consumer hardware and CPU-only environments

Strengths Highlighted in the Paper

  • Clinical viability: AUROC of 0.718 demonstrates unsupervised detection can provide clinically useful screening as a triage tool
  • Extreme parameter efficiency: 2.53M params vs. 86M (ViT-Base) or 307M (DINOv2)
  • Interpretable multi-signal scoring: Three complementary anomaly signals provide richer diagnostic information than single-metric approaches

Future Directions

  • Perceptual loss (instead of MSE) for VAE reconstruction to better capture structural anomalies
  • Larger backbones (DINOv2 ViT-S/14 producing 384-d features)
  • Multi-scale latent analysis using hierarchical VAEs
  • Contrastive pre-training of the anomaly scorer
  • Domain-specific backbones (CheXNet) for improved viral pneumonia sensitivity

Reproduce Training

See Training & Experimental Results above for full methodology and hyperparameters.

# Prepare and preprocess the dataset
python training/scripts/prepare_dataset.py

# Train the VAE + ViT anomaly detector
python training/scripts/train_anomaly.py

# Or use the Jupyter notebook for interactive training
jupyter notebook training/notebooks/covid\ \(1\).ipynb

# Upload trained models to HuggingFace
python training/scripts/upload_models.py

Contributing

We welcome contributions! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/amazing-feature)
  3. Make your changes and ensure tests pass
  4. Commit with descriptive messages (git commit -m 'Add amazing feature')
  5. Push to your branch (git push origin feature/amazing-feature)
  6. Open a Pull Request

Development Guidelines

  • Backend: Follow ruff and black formatting (see pyproject.toml)
  • Frontend: Follow ESLint + Prettier configuration
  • Tests: Add tests for new features (pytest for backend, npm test for frontend)
  • Commits: Use conventional commit messages

Running Tests

# Backend tests
pytest backend/tests/ -v --tb=short

# With specific markers
pytest -m "unit" -v        # Fast unit tests only
pytest -m "integration" -v # Integration tests
pytest -m "ml" -v          # ML model tests

# Frontend lint
cd frontend && npm run lint

Acknowledgements


License

This project is licensed under the Apache License 2.0; see the LICENSE file for details.


Built with ❤️ for advancing medical AI research

MedSight AI is a research project and should not be used for clinical diagnosis without physician oversight.