# BioFlow: AI-Powered Drug-Target Interaction Platform
## Technical Report - January 2026

---

## Table of Contents
1. [Executive Summary](#executive-summary)
2. [System Architecture](#system-architecture)
3. [Core Technologies](#core-technologies)
4. [Pipeline Implementation](#pipeline-implementation)
5. [Model Training & Results](#model-training--results)
6. [Qdrant Vector Database Integration](#qdrant-vector-database-integration)
7. [FastAPI Backend](#fastapi-backend)
8. [Frontend Application](#frontend-application)
9. [Langflow Integration](#langflow-integration)
10. [Current Status](#current-status)
11. [Future Roadmap](#future-roadmap)

---

## Executive Summary

**BioFlow** is an end-to-end AI-powered drug discovery platform designed for predicting Drug-Target Interactions (DTI). The system combines deep learning models (DeepPurpose), vector similarity search (Qdrant), and a modern React-based frontend to enable researchers to:

- Train and evaluate DTI prediction models on benchmark datasets
- Perform similarity search across drug-target embedding space
- Visualize molecular structures in 2D and 3D
- Build visual pipelines using Langflow for no-code experimentation

### Key Achievements
- ✅ Trained models on **3 benchmark datasets** (KIBA, DAVIS, BindingDB_Kd)
- ✅ Best Concordance Index (CI): **0.805** on BindingDB_Kd
- ✅ Indexed **23,531 drug-target pairs** in Qdrant vector database
- ✅ Real-time similarity search via FastAPI backend
- ✅ Interactive 2D/3D molecular visualization
- ✅ Langflow pipeline for no-code DTI prediction

---

## System Architecture

```
┌─────────────────────────────────────────────────────────────────────────────┐
│                              BioFlow Architecture                            │
├─────────────────────────────────────────────────────────────────────────────┤
│                                                                             │
│   ┌──────────────┐     ┌──────────────┐     ┌──────────────────────────┐   │
│   │   Frontend   │────▶│  FastAPI     │────▶│      Qdrant              │   │
│   │   (Next.js)  │     │  Backend     │     │   Vector Database        │   │
│   │   Port 3000  │     │  Port 8001   │     │   Port 6333              │   │
│   └──────────────┘     └──────────────┘     └──────────────────────────┘   │
│          │                    │                        ▲                    │
│          │                    │                        │                    │
│          ▼                    ▼                        │                    │
│   ┌──────────────┐     ┌──────────────┐     ┌─────────┴──────────────┐     │
│   │  3Dmol.js    │     │ DeepPurpose  │     │   Ingestion Pipeline   │     │
│   │  Smiles-     │     │   Model      │     │   (ingest_qdrant.py)   │     │
│   │  Drawer      │     │   (PyTorch)  │     └────────────────────────┘     │
│   └──────────────┘     └──────────────┘                                    │
│                               │                                             │
│                               ▼                                             │
│                        ┌──────────────┐     ┌────────────────────────┐     │
│                        │   Langflow   │────▶│   Visual Pipeline      │     │
│                        │   Port 7860  │     │   (DTI Orchestrator)   │     │
│                        └──────────────┘     └────────────────────────┘     │
│                                                                             │
└─────────────────────────────────────────────────────────────────────────────┘
```

---

## Core Technologies

### Backend Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| ML Framework | **DeepPurpose** (PyTorch) | Drug-Target Interaction prediction |
| Vector DB | **Qdrant** | Similarity search on embeddings |
| API Server | **FastAPI** + Uvicorn | REST API for frontend |
| Data Source | **TDC (Therapeutics Data Commons)** | Benchmark DTI datasets |
| Pipeline UI | **Langflow** | No-code visual pipeline builder |

### Frontend Stack
| Component | Technology | Purpose |
|-----------|------------|---------|
| Framework | **Next.js 16** (App Router) | React server components |
| UI Library | **Shadcn/UI** + Radix | Component system |
| Styling | **Tailwind CSS** | Utility-first CSS |
| 3D Viz | **3Dmol.js** | Protein structure viewer |
| 2D Viz | **smiles-drawer** | Molecule structure rendering |
| Package Manager | **pnpm** | Fast, disk-efficient |

---

## Pipeline Implementation

### 1. Training Pipeline (`deeppurpose002.py`)

The training script is a comprehensive CLI tool that:

```bash
python deeppurpose002.py --dataset KIBA --epochs 10 --drug_enc Morgan --target_enc CNN
```

**Features:**
- Automatic GPU detection (CUDA support)
- Multiple dataset support: DAVIS, KIBA, BindingDB_Kd, BindingDB_Ki, BindingDB_IC50
- Label transformation: `paffinity_nm` (converts nM to -log10 scale)
- Comprehensive metrics: MSE, RMSE, MAE, Pearson, Spearman, Concordance Index
- Automatic visualization generation (scatter plots, residuals, sorted curves)

**Encoding Configuration:**
```python
MODEL_CONFIG = {
    "drug_encoding": "Morgan",      # Morgan fingerprints (1024-bit)
    "target_encoding": "CNN",       # CNN for protein sequences
    "cls_hidden_dims": [1024, 1024, 512],
    "hidden_dim_drug": 128,
    "hidden_dim_protein": 128,
}
```

### 2. Ingestion Pipeline (`ingest_qdrant.py`)

Converts trained model embeddings into searchable vectors:

```
[1/6] Load Model (model.pt + config.pkl)
[2/6] Load Dataset from TDC (KIBA test split)
[3/6] Generate Embeddings (no shuffle to preserve order)
[4/6] Compute PCA projections (drug, target, combined)
[5/6] Connect to Qdrant (localhost:6333)
[6/6] Upload points with payloads
```

**Vector Schema:**
```python
vectors_config = {
    "drug": VectorParams(size=128, distance=Distance.COSINE),
    "target": VectorParams(size=128, distance=Distance.COSINE),
}
```

**Payload Structure:**
```json
{
  "smiles": "CC(=O)OC1=CC=CC=C1C(=O)O",
  "target_seq": "MKTAYIAK...",
  "label_true": 7.2,
  "pca_drug": [0.12, -0.34, 0.56],
  "pca_target": [-0.21, 0.78, 0.11],
  "pca_combined": [0.45, -0.12, 0.67],
  "affinity_class": "high"  // high: >7, medium: 5-7, low: <5
}
```

---

## Model Training & Results

### Benchmark Performance

| Dataset | Samples | CI | Pearson | MSE | Training Time |
|---------|---------|-----|---------|-----|---------------|
| **BindingDB_Kd** | 42,227 | **0.805** | 0.768 | 0.667 | 1h 49m |
| **KIBA** | 117,656 | 0.703 | 0.522 | 0.0008 | 3h 42m |
| **DAVIS** | 25,772 | 0.786 | 0.545 | 0.468 | 9m |

### Best Model Configuration
- **Selected Run:** `20260125_104915_KIBA`
- **Hardware:** NVIDIA GeForce RTX 3070 Laptop GPU
- **Epochs:** 10
- **Batch Size:** 256
- **Learning Rate:** 1e-4
- **Split:** 80/10/10 (train/val/test)

### Metrics Explanation
- **Concordance Index (CI):** Probability that predictions preserve true ordering (0.5 = random, 1.0 = perfect)
- **Pearson Correlation:** Linear correlation between true and predicted values
- **MSE:** Mean Squared Error (lower is better)

---

## Qdrant Vector Database Integration

### Collection: `bio_discovery`

**Statistics:**
- Total Vectors: **23,531** drug-target pairs
- Vector Dimensions: 128 (drug) + 128 (target)
- Distance Metric: Cosine Similarity
- Pre-computed PCA: 3D projections for visualization

### Search Capabilities

**1. Drug Similarity Search**
```python
# Input: SMILES string
query = "CC(=O)Nc1ccc(O)cc1"  # Acetaminophen
# Output: Top-K similar drugs by Morgan fingerprint embedding
```

**2. Target Similarity Search**
```python
# Input: Protein sequence
query = "MKTAYIAKQRQISFVKSHFSRQLE..."
# Output: Top-K similar targets by CNN embedding
```

**3. Text Search (Fallback)**
```python
# Input: Partial SMILES or keyword
# Output: Substring matches in payload
```

### API Endpoints

| Endpoint | Method | Description |
|----------|--------|-------------|
| `/api/search` | POST | Vector similarity search |
| `/api/points` | GET | Get points for 3D visualization |
| `/api/stats` | GET | Collection statistics |
| `/health` | GET | Service health check |

---

## FastAPI Backend

### Server: `server/api.py`

**Startup Sequence:**
```
[STARTUP] Loading DeepPurpose model...
[STARTUP] Model loaded from runs\20260125_104915_KIBA\model.pt
[STARTUP] Using device: cuda
[STARTUP] Connecting to Qdrant...
[STARTUP] Connected. Collections: ['bio_discovery']
[STARTUP] Ready!
```

**Key Features:**
1. **Model Caching:** Model loaded once at startup (not per-request)
2. **Device Override:** Fixes DeepPurpose's global device variable for GPU inference
3. **CORS Enabled:** Allows frontend on port 3000
4. **Error Handling:** Fallback to text search if encoding fails

### Direct Encoding (No data_process)

The API uses direct encoding to avoid DeepPurpose's `data_process` overhead:

```python
# Drug encoding (Morgan fingerprints)
from DeepPurpose.utils import smiles2morgan
morgan_fp = smiles2morgan(smiles, radius=2, nBits=1024)
vector = model.model.model_drug(torch.tensor([morgan_fp]))

# Target encoding (CNN)
from DeepPurpose.utils import trans_protein
target_encoding = trans_protein(sequence)
vector = model.model.model_protein(torch.tensor([target_encoding]))
```

---

## Frontend Application

### Page Structure

```
ui/app/
├── page.tsx                 # Landing page
├── layout.tsx               # Root layout (ThemeProvider)
└── dashboard/
    ├── page.tsx             # Dashboard home
    ├── discovery/           # Drug discovery search
    │   └── page.tsx
    ├── explorer/            # Data exploration
    │   ├── page.tsx
    │   ├── chart.tsx
    │   └── components.tsx
    ├── molecules-2d/        # 2D molecule viewer
    │   ├── page.tsx
    │   └── _components/
    │       └── Smiles2DViewer.tsx
    ├── molecules-3d/        # 3D molecule viewer
    │   ├── page.tsx
    │   └── _components/
    │       └── Molecule3DViewer.tsx
    └── proteins-3d/         # 3D protein viewer
        ├── page.tsx
        └── _components/
            └── ProteinViewer.tsx
```

### Key Components

**1. Discovery Page**
- Input: SMILES or protein sequence
- Search types: Similarity, Text
- Results: Ranked list with affinity scores

**2. Molecules 2D Viewer**
- Renders molecules using `smiles-drawer`
- Supports common molecules (Caffeine, Aspirin, etc.)
- Copy SMILES functionality

**3. Proteins 3D Viewer**
- Uses `3Dmol.js` for WebGL rendering
- Fetches PDB files from RCSB
- Multiple representation styles (cartoon, surface, stick)

**4. Explorer**
- 3D scatter plot of embedding space
- Color-coded by affinity class
- Interactive point selection

---

## Langflow Integration

### Purpose

Langflow provides a **no-code visual interface** for building DTI prediction pipelines. It allows researchers without coding experience to:

1. Create drug-target interaction workflows
2. Chain API calls visually
3. Filter results based on affinity thresholds
4. Export predictions

### Pipeline Configuration

File: `langflow/bioflow_dti_pipeline.json`

```
┌─────────────────┐     ┌─────────────────┐     ┌─────────────────┐
│  Drug Input     │────▶│   DeepPurpose   │────▶│    Qdrant       │
│  (SMILES)       │     │   Encoder       │     │  Vector Store   │
└─────────────────┘     └─────────────────┘     └─────────────────┘
                                                        │
┌─────────────────┐                                     ▼
│  Target Input   │────▶                        ┌─────────────────┐
│  (Protein Seq)  │                             │  Affinity       │
└─────────────────┘                             │  Filter (>0.8)  │
                                                └─────────────────┘
                                                        │
                                                        ▼
                                                ┌─────────────────┐
                                                │  DTI Results    │
                                                │  Output         │
                                                └─────────────────┘
```

### Running Langflow

```bash
# Start Langflow server
.\.venv\Scripts\langflow run --host 0.0.0.0 --port 7860

# Or use the dedicated venv
.\langflow_venv\Scripts\langflow run --host 0.0.0.0 --port 7860
```

**Note:** Langflow requires a separate virtual environment due to dependency conflicts with DeepPurpose.

---

## Current Status

### ✅ Completed Features

| Feature | Status | Notes |
|---------|--------|-------|
| DeepPurpose Training Pipeline | ✅ Done | 3 datasets trained |
| Qdrant Ingestion | ✅ Done | 23,531 vectors indexed |
| FastAPI Backend | ✅ Done | Running on port 8001 |
| Vector Search API | ✅ Done | Drug/Target similarity |
| Next.js Frontend | ✅ Done | 6 pages implemented |
| 2D Molecule Viewer | ✅ Done | smiles-drawer integration |
| 3D Protein Viewer | ✅ Done | 3Dmol.js integration |
| Dark Mode | ✅ Done | next-themes provider |
| Langflow Pipeline | ✅ Done | JSON config ready |

### 🚧 Partially Complete

| Feature | Status | Notes |
|---------|--------|-------|
| 3D Molecule Viewer | 🚧 WIP | SDF fetching needs work |
| Explorer Visualization | 🚧 WIP | Chart rendering issues |
| Data Page | 🚧 WIP | API stats integration |

### ❌ Not Yet Implemented

| Feature | Priority | Description |
|---------|----------|-------------|
| OpenBioMed Integration | High | Multi-modal foundation model |
| User Authentication | Medium | Login/session management |
| Batch Predictions | Medium | Upload CSV for bulk inference |
| Model Fine-tuning UI | Low | Retrain on custom data |
| Export Results | Low | CSV/JSON download |

---

## Future Roadmap

### Phase 1: OpenBioMed Integration (High Priority)

[OpenBioMed](https://github.com/PharMolix/OpenBioMed) is a multi-modal foundation model for biomedicine that would significantly enhance BioFlow's capabilities:

**Planned Features:**
1. **Molecule-Text Alignment**
   - Search drugs using natural language descriptions
   - Example: "Find molecules similar to aspirin that reduce inflammation"

2. **Protein-Text Alignment**
   - Describe targets in plain English
   - Example: "Kinase involved in cancer cell proliferation"

3. **Cross-Modal Retrieval**
   - Find drugs for a given text description
   - Find targets for a given drug structure

4. **Enhanced Embeddings**
   - Replace Morgan/CNN with transformer-based encoders
   - Better generalization to novel compounds

**Implementation Plan:**
```python
# Replace current encoding
# FROM: Morgan fingerprints + CNN
# TO: OpenBioMed's BioMedGPT encoder

from openbiomedgpt import BioMedGPTEncoder
encoder = BioMedGPTEncoder.load_pretrained("biomedgpt-base")

# Multi-modal embedding
drug_embedding = encoder.encode_molecule(smiles)
target_embedding = encoder.encode_protein(sequence)
text_embedding = encoder.encode_text("kinase inhibitor")
```

### Phase 2: Advanced Search & Filtering

1. **Faceted Search**
   - Filter by molecular weight, logP, TPSA
   - Filter by target family (kinases, GPCRs, etc.)

2. **ADMET Predictions**
   - Absorption, Distribution, Metabolism, Excretion, Toxicity
   - Integrate with ADMETlab 2.0

3. **Structure-Activity Relationship (SAR)**
   - Identify key structural features
   - Scaffold hopping suggestions

### Phase 3: Collaboration Features

1. **Project Workspaces**
   - Save searches and results
   - Share with team members

2. **Annotation System**
   - Tag molecules with notes
   - Track experimental validation

3. **Integration with Lab Notebooks**
   - Export to ELN systems
   - Import experimental data

---

## Running the System

### Quick Start

```powershell
# 1. Start Qdrant (Docker)
docker run -p 6333:6333 -p 6334:6334 qdrant/qdrant

# 2. Start Backend API
.\.venv\Scripts\python -m uvicorn server.api:app --host 0.0.0.0 --port 8001

# 3. Start Frontend
cd ui
pnpm dev

# 4. (Optional) Start Langflow
.\langflow_venv\Scripts\langflow run --host 0.0.0.0 --port 7860
```

### Service Ports
| Service | Port | URL |
|---------|------|-----|
| Frontend | 3000 | http://localhost:3000 |
| Backend API | 8001 | http://localhost:8001 |
| Qdrant | 6333 | http://localhost:6333 |
| Langflow | 7860 | http://localhost:7860 |

### Health Check
```powershell
# Check all services
$qdrant = netstat -ano | Select-String ":6333.*LISTENING"
$api = netstat -ano | Select-String ":8001.*LISTENING"
$ui = netstat -ano | Select-String ":3000.*LISTENING"

Write-Host "Qdrant: $($qdrant -ne $null)"
Write-Host "API: $($api -ne $null)"
Write-Host "UI: $($ui -ne $null)"
```

---

## File Structure

```
lacoste001/
├── config.py                    # Shared configuration
├── deeppurpose002.py            # Training pipeline
├── ingest_qdrant.py             # Vector ingestion
├── runs/                        # Model checkpoints & results
│   ├── 20260125_080409_BindingDB_Kd/
│   ├── 20260125_104915_KIBA/    # Best model ★
│   └── 20260126_160009_DAVIS/
├── server/
│   └── api.py                   # FastAPI backend
├── langflow/
│   └── bioflow_dti_pipeline.json
├── data/
│   ├── davis.tab
│   └── kiba.tab
└── ui/                          # Next.js frontend
    ├── app/
    │   ├── layout.tsx
    │   ├── page.tsx
    │   └── dashboard/
    │       ├── discovery/
    │       ├── explorer/
    │       ├── molecules-2d/
    │       ├── molecules-3d/
    │       └── proteins-3d/
    ├── components/
    └── lib/
```

---

## Conclusion

BioFlow demonstrates a complete pipeline for AI-powered drug discovery, from model training to interactive visualization. The system successfully:

1. **Trains** DTI prediction models achieving CI > 0.80
2. **Indexes** embeddings for fast similarity search
3. **Serves** predictions via REST API
4. **Visualizes** molecules and proteins in the browser
5. **Enables** no-code experimentation via Langflow

The next major milestone is **OpenBioMed integration**, which will unlock multi-modal search and dramatically improve the user experience for drug discovery researchers.

---

*Report generated: January 26, 2026*  
*Repository: github.com/hamzasammoud11-dotcom/lacoste001*  
*Branch: core-progress*