---
license: mit
tags:
- biology
- protein-classification
- microalgae
- genomics
- nanoGPT
- metagenomics
- tara-oceans
pipeline_tag: text-classification
---

# algaGPT

Causal language model for binary classification of microalgal vs. contaminant protein sequences.

## Model Description

- **Architecture:** nanoGPT (Andrej Karpathy)
- **Task:** Binary classification of microalgal protein sequences via next-token prediction
- **Mode:** TI-inclusive (full-length sequences)
- **Training data:** ~58.6M protein sequences (1:1 algal:contaminant ratio)
- **Algal sources:** 166 microalgal genomes across 10 phyla
- **Contaminant sources:** Bacterial, archaeal, and fungal sequences from NCBI nr

## Performance

| Metric | Score |
|--------|-------|
| Recall | >99% |
| Speed vs. BLASTp | ~10,701x faster |

## Usage

**Input:** Protein sequence (amino acid string)

**Output:** Classification tag (algal/contaminant) via next-token prediction

## Applications

algaGPT was used as the primary proteome extraction tool in the ELF-NET study (Nelson et al., forthcoming), where it purified algal protein sequences from 2,044 TARA Oceans metagenome assemblies, yielding 221.9 million sequences for downstream domain-environment coupling analysis.

## Authors

David R. Nelson, Ashish Kumar Jaiswal, Noha Samir Ismail, Alexandra Mystikou, Kourosh Salehi-Ashtiani

Green Genomics Lab, New York University Abu Dhabi

## Citation

```bibtex
@article{la4sr2025,
  title={Pan-microalgal dark proteome mapping via interpretable deep learning and synthetic chimeras},
  author={Nelson, David R. and Jaiswal, Ashish Kumar and Ismail, Noha Samir and Mystikou, Alexandra and Salehi-Ashtiani, Kourosh},
  journal={Patterns},
  volume={6},
  pages={101373},
  year={2025},
  publisher={Cell Press},
  doi={10.1016/j.patter.2025.101373}
}
```

## Contact

Kourosh Salehi-Ashtiani (ksa3@nyu.edu)
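
## Example: reading a classification from next-token logits

The Usage section describes the output as a classification tag produced by next-token prediction. As a minimal sketch of that decision step only, the snippet below assumes the model's logits for the token following the sequence are already available as a mapping from token to score. The tag strings `<algal>` and `<contaminant>` and the function `classify_from_logits` are hypothetical names chosen for illustration. They are not taken from the algaGPT code or tokenizer, and loading the model is not shown.

```python
import math
from typing import Dict, Tuple

# Hypothetical tag tokens: the actual tag vocabulary used by algaGPT
# is not specified in this model card.
ALGAL_TAG = "<algal>"
CONTAMINANT_TAG = "<contaminant>"


def classify_from_logits(next_token_logits: Dict[str, float]) -> Tuple[str, float]:
    """Return the more likely tag and its probability.

    The probability is renormalized over the two tag tokens only,
    ignoring every other token in the vocabulary.
    """
    algal_logit = next_token_logits[ALGAL_TAG]
    contaminant_logit = next_token_logits[CONTAMINANT_TAG]

    # Two-way softmax, computed stably by subtracting the larger logit.
    largest = max(algal_logit, contaminant_logit)
    exp_algal = math.exp(algal_logit - largest)
    exp_contaminant = math.exp(contaminant_logit - largest)
    p_algal = exp_algal / (exp_algal + exp_contaminant)

    if p_algal >= 0.5:
        return "algal", p_algal
    return "contaminant", 1.0 - p_algal
```

Calling `classify_from_logits({"<algal>": 2.0, "<contaminant>": 0.0})` returns the label `"algal"` with a probability of about 0.88. Restricting the softmax to the two tag tokens is a design choice of this sketch: it keeps the score interpretable as a binary probability even when other tokens carry some of the model's probability mass.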