# AI Handover Document - BioFlow Project

## 1. Project Context & Strategic Vision

### Project Name
**BioFlow** (part of the OpenBioMed ecosystem)

### Mission Statement
Build a **fully open-source** AI-powered biological discovery platform that unifies fragmented R&D data (text, sequences, molecules, structures) into an intelligent exploration and design engine.

### Target Use Case (UC4)
**"Multimodal Biological Design & Discovery Intelligence"**
- Ingest and index multimodal biological data
- Enable cross-modal similarity search (text ↔ molecule ↔ protein)
- Suggest "close but diverse" variants for design exploration
- Provide scientific evidence linking and traceability

### ⚠️ Critical Constraint
**This project must be 100% open-source.** InstaDeep models (ProtBFN, AbBFN2, Laila, DeepChain, InstaNovo) are proprietary and **cannot** be used. All models must come from open repositories (HuggingFace, GitHub, etc.).

---

## 2. Where Does OBM (OpenBioMed) Stand?

### OBM's Role in the Architecture

OBM is the **multimodal embedding backbone** - it is ONE tool among several in the BioFlow platform:

```
┌────────────────────────────────────────────────────────────────────────┐
│                            BioFlow Platform                            │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                      OBM (Embedding Layer)                       │  │
│  │  ┌────────────┐  ┌──────────────┐  ┌──────────────┐              │  │
│  │  │ PubMedBERT │  │  ChemBERTa   │  │    ESM-2     │              │  │
│  │  │   (Text)   │  │ (Molecules)  │  │  (Proteins)  │              │  │
│  │  └──────┬─────┘  └───────┬──────┘  └───────┬──────┘              │  │
│  │         └────────────────┼─────────────────┘                     │  │
│  │                          ▼                                       │  │
│  │             Unified 768-dim Embeddings                           │  │
│  └──────────────────────────┬───────────────────────────────────────┘  │
│                             │                                          │
│  ┌──────────────────────────▼───────────────────────────────────────┐  │
│  │                      Qdrant (Vector Memory)                      │  │
│  │  • HNSW indexing for fast similarity search                      │  │
│  │  • Payload storage (metadata, source, tags)                      │  │
│  │  • Filtered retrieval by modality/source                         │  │
│  └──────────────────────────┬───────────────────────────────────────┘  │
│                             │                                          │
│  ┌──────────────────────────▼───────────────────────────────────────┐  │
│  │                          Agent Pipeline                          │  │
│  │   ┌─────────┐  ┌──────────────┐  ┌───────────┐  ┌───────────┐    │  │
│  │   │  Miner  │  │  Generator   │  │ Validator │  │  Ranker   │    │  │
│  │   │ (Lit.)  │  │ (DeepPurpose │  │ (Toxicity │  │ (MMR/Div) │    │  │
│  │   └─────────┘  │  MolT5)      │  │  KG)      │  └───────────┘    │  │
│  │                └──────────────┘  └───────────┘                   │  │
│  └──────────────────────────────────────────────────────────────────┘  │
│                                                                        │
│  ┌──────────────────────────────────────────────────────────────────┐  │
│  │                            Next.js UI                            │  │
│  │  • Search interface (text/SMILES/sequence)                       │  │
│  │  • 3D visualization (embedding space)                            │  │
│  │  • Evidence linking & traceability                               │  │
│  └──────────────────────────────────────────────────────────────────┘  │
└────────────────────────────────────────────────────────────────────────┘
```

### What OBM Provides

| Component | Model | Source | Purpose |
|-----------|-------|--------|---------|
| TextEncoder | PubMedBERT | `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext` | Encode biomedical abstracts |
| MoleculeEncoder | ChemBERTa | `seyonec/ChemBERTa-zinc-base-v1` | Encode SMILES molecules |
| ProteinEncoder | ESM-2 | `facebook/esm2_t12_35M_UR50D` | Encode protein sequences |

### What OBM Does NOT Provide
- **Generation**: Molecule/protein design (use DeepPurpose, MolT5, ESMFold)
- **Validation**: Toxicity/ADMET prediction (use external KGs or predictors)
- **Orchestration**: Workflow management (use BioFlow pipeline)

---

## 3. Full Architecture & Components

### Layer 1: Data Ingestion

| Data Source | Type | Format |
|-------------|------|--------|
| PubMed | Text | Abstracts (JSON/XML) |
| UniProt | Protein | FASTA sequences |
| ChEMBL | Molecule | SMILES strings |
| BioImage Archive | Image | (Future: CLIP embeddings) |

### Layer 2: Embedding (OBM)
- **bioflow/plugins/obm_encoder.py**: Central encoder class
- **bioflow/plugins/encoders/**: Modality-specific encoders
  - `text_encoder.py` → PubMedBERT
  - `molecule_encoder.py` → ChemBERTa / RDKit fingerprints
  - `protein_encoder.py` → ESM-2
- **bioflow/obm_wrapper.py**: High-level API (`encode_text`, `encode_smiles`, `encode_protein`)

### Layer 3: Vector Storage (Qdrant)
- **bioflow/qdrant_manager.py**: Low-level Qdrant operations
- **bioflow/api/qdrant_service.py**: API-level service with caching
- Collections: `molecules`, `proteins`, `texts` (or unified `bioflow_memory`)

### Layer 4: Agent Pipeline
- **bioflow/pipeline.py**: Workflow orchestration
  - `MinerAgent`: Literature retrieval
  - `ValidatorAgent`: Safety/toxicity checks
  - `RankerAgent`: MMR-based diversification
- **bioflow/api/deeppurpose_api.py**: DTI prediction endpoints

### Layer 5: API Server
- **bioflow/api/server.py**: FastAPI application
- Endpoints: `/health`, `/api/molecules`, `/api/proteins`, `/api/search`, `/api/points`, `/api/collections`

### Layer 6: Frontend
- **ui/**: Next.js 16 application
- Pages: Discovery, Explorer, Molecules, Proteins
- Mock fallbacks in `ui/app/api/_mock/`

---

## 4. Current Implementation Status

### ✅ Implemented

| Component | Status | Notes |
|-----------|--------|-------|
| OBMEncoder | ✅ Working | PubMedBERT + ChemBERTa + ESM-2 |
| Qdrant Integration | ✅ Working | Local storage at `./qdrant_data` |
| FastAPI Server | ✅ Working | Port 8000 |
| Next.js UI | ✅ Working | Port 3000 |
| DeepPurpose Integration | ⚠️ Optional | Requires `DeepPurpose` package |
| Mock Fallbacks | ✅ Implemented | For offline/demo mode |

### ❌ Not Yet Implemented

| Feature | Priority | Notes |
|---------|----------|-------|
| Image Embeddings | Medium | Need BioMedCLIP or similar |
| MMR Diversification | High | In RankerAgent (basic) |
| Evidence Linking | High | Need to add source tracking |
| PubMed Ingestion | High | Need data pipeline |
| UniProt Ingestion | High | Need data pipeline |
| ChEMBL Ingestion | Medium | Need data pipeline |
| 3D Visualization | Medium | `/api/points` endpoint ready |

---

## 5. How to Run

### Prerequisites
```bash
# Python 3.9+
pip install torch transformers qdrant-client fastapi uvicorn

# Optional (for full functionality)
pip install rdkit-pypi DeepPurpose

# Node.js 18+
npm install -g pnpm
```

### Start Backend
```bash
cd c:\Users\ramit\OneDrive\Bureau\Github\OpenBioMed
python -m uvicorn bioflow.api.server:app --host 0.0.0.0 --port 8000
```

### Start Frontend
```bash
cd ui
pnpm install
pnpm dev
```

---

## 6. Open-Source Models Used

| Task | Model | License | HuggingFace Path |
|------|-------|---------|------------------|
| Text Embedding | PubMedBERT | MIT | `microsoft/BiomedNLP-PubMedBERT-base-uncased-abstract-fulltext` |
| Text Embedding | SciBERT | Apache 2.0 | `allenai/scibert_scivocab_uncased` |
| Molecule Embedding | ChemBERTa | MIT | `seyonec/ChemBERTa-zinc-base-v1` |
| Protein Embedding | ESM-2 | MIT | `facebook/esm2_t12_35M_UR50D` |
| DTI Prediction | DeepPurpose | BSD | [GitHub](https://github.com/kexinhuang12345/DeepPurpose) |
| Molecule Generation | MolT5 | Apache 2.0 | `laituan245/molt5-base` |
| Protein Folding | ESMFold | MIT | `facebook/esmfold_v1` |

**NO PROPRIETARY MODELS FROM INSTADEEP ARE USED.**

---

## 7. Next Steps for AI Assistant

1. **Data Ingestion Pipelines**: Create scripts to ingest PubMed, UniProt, ChEMBL data
2. **Evidence Linking**: Add source tracking to all search results
3. **MMR Diversification**: Implement proper Maximal Marginal Relevance in RankerAgent
4. **Image Support**: Integrate BioMedCLIP for bioimaging data
5. **Evaluation Metrics**: Implement Recall@10, MRR, nDCG for retrieval quality
6. **Batch Processing**: Optimize for large-scale data ingestion
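
As a starting point for the MMR diversification step listed above, here is a minimal sketch of greedy Maximal Marginal Relevance over embedding vectors. It assumes cosine similarity and plain Python lists; the names `cosine`, `mmr_rank`, and `lambda_` are hypothetical and do not reflect the existing `RankerAgent` interface, which is not shown in this document.

```python
import math
from typing import List, Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; 0.0 when either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mmr_rank(query: Vector, candidates: Sequence[Vector], k: int,
             lambda_: float = 0.7) -> List[int]:
    """Greedily pick up to k candidate indices by Maximal Marginal Relevance.

    score(i) = lambda_ * sim(query, i)
               - (1 - lambda_) * max(sim(i, j) for j already selected)
    """
    relevance = [cosine(query, c) for c in candidates]
    selected: List[int] = []
    remaining = list(range(len(candidates)))

    def score(i: int) -> float:
        # Redundancy is the highest similarity to anything already chosen
        # (0.0 while nothing has been chosen yet).
        redundancy = max(
            (cosine(candidates[i], candidates[j]) for j in selected),
            default=0.0,
        )
        return lambda_ * relevance[i] - (1.0 - lambda_) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


if __name__ == "__main__":
    query = [1.0, 0.0]
    candidates = [
        [0.90, 0.436],   # 0: most relevant
        [0.89, 0.456],   # 1: near-duplicate of 0
        [0.80, -0.600],  # 2: slightly less relevant, different direction
    ]
    print(mmr_rank(query, candidates, k=2, lambda_=0.5))  # [0, 2]
    print(mmr_rank(query, candidates, k=2, lambda_=1.0))  # [0, 1]
```

With `lambda_=1.0` the ranking reduces to pure relevance, so the near-duplicate is kept; lowering `lambda_` trades relevance for diversity, which matches the "close but diverse" goal in the use case. In practice the candidate vectors would likely come from a Qdrant search result rather than hard-coded lists.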