ESMC Protein Function

An Evolutionary-scale Model (ESM) for protein function prediction from amino acid sequences using the Gene Ontology (GO). Based on the ESM Cambrian Transformer architecture, pre-trained on UniRef, MGnify, and the Joint Genome Institute's database and fine-tuned on the AmiGO Boost protein function dataset, this protein language model predicts the GO subgraph for a particular protein sequence - giving you insight into the molecular function, biological process, and location of the activity inside the cell.

Key Features

  • Sequence-to-function prediction — Predicts Molecular Function, Biological Process, and Cellular Component ontologies directly from raw amino acid sequences, eliminating the need for homology searches, structural data, or multiple sequence alignments.

  • Hierarchy-aware GO subgraph reconstruction — Outputs a full GO directed acyclic graph (DAG) ensuring predictions respect the ontology structure rather than treating each term as an independent binary label.

  • Efficient inference at scale — Supports weight quantization and quantization-aware training (QAT), enabling memory-efficient, high-throughput screening of large sequence datasets without accuracy loss.
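To make the hierarchy-aware idea concrete, here is a minimal sketch of one common way to keep term scores consistent with an ontology: every ancestor's score is raised to at least the highest score among its descendants. This is an illustration only and not necessarily how the model enforces consistency internally. The function name, the placeholder term IDs, and the scores are all hypothetical.

```python
def propagate_to_ancestors(scores, parents):
    """Raise each ancestor's score to at least the max of its descendants' scores."""
    consistent = dict(scores)

    for term, score in scores.items():
        stack = list(parents.get(term, []))
        seen = set()

        # Walk upward through every ancestor of this term.
        while stack:
            parent = stack.pop()

            if parent in seen:
                continue

            seen.add(parent)

            consistent[parent] = max(consistent.get(parent, 0.0), score)

            stack.extend(parents.get(parent, []))

    return consistent


# Hypothetical three-term chain: leaf is_a mid, mid is_a root.
parents = {"T:leaf": ["T:mid"], "T:mid": ["T:root"], "T:root": []}

# Hypothetical raw scores that violate the hierarchy (child > parent).
raw_scores = {"T:leaf": 0.9, "T:mid": 0.4}

consistent_scores = propagate_to_ancestors(raw_scores, parents)

print(consistent_scores)
```

After propagation no term scores higher than any of its ancestors, so thresholding the scores always yields a connected subgraph that reaches the root.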

What are GO terms?

"The Gene Ontology (GO) is a concept hierarchy that describes the biological function of genes and gene products at different levels of abstraction (Ashburner et al., 2000). It is a good model to describe the multi-faceted nature of protein function."

"GO is a directed acyclic graph. The nodes in this graph are functional descriptors (terms or classes) connected by relational ties between them (is_a, part_of, etc.). For example, terms 'protein binding activity' and 'binding activity' are related by an is_a relationship; however, the edge in the graph is often reversed to point from binding towards protein binding. This graph contains three subgraphs (subontologies): Molecular Function (MF), Biological Process (BP), and Cellular Component (CC), defined by their root nodes. Biologically, each subgraph represent a different aspect of the protein's function: what it does on a molecular level (MF), which biological processes it participates in (BP) and where in the cell it is located (CC)."

From CAFA 5 Protein Function Prediction
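The is_a relationship and the reversed edge direction described above can be sketched in a few lines of plain Python. The three GO IDs below are real terms (protein binding, binding, and the molecular_function root), but the rest of the ontology is omitted and the helper functions are hypothetical names used for illustration.

```python
from collections import defaultdict


# Toy is_a edges stored child -> parents.
IS_A = {
    "GO:0005515": ["GO:0005488"],  # protein binding is_a binding
    "GO:0005488": ["GO:0003674"],  # binding is_a molecular_function
    "GO:0003674": [],  # root of the Molecular Function subontology
}


def ancestors(term, is_a):
    """Return every term reachable from `term` by following is_a edges."""
    seen = set()
    stack = list(is_a.get(term, []))

    while stack:
        parent = stack.pop()

        if parent not in seen:
            seen.add(parent)
            stack.extend(is_a.get(parent, []))

    return seen


def reverse_edges(is_a):
    """Flip child -> parent edges into parent -> children edges."""
    children = defaultdict(list)

    for child, parents in is_a.items():
        for parent in parents:
            children[parent].append(child)

    return dict(children)


protein_binding_ancestors = ancestors("GO:0005515", IS_A)
children_of = reverse_edges(IS_A)

print(sorted(protein_binding_ancestors))
print(children_of)
```

Following the is_a edges from a specific term collects its more general ancestors, while the reversed mapping points from general terms such as binding towards specific ones such as protein binding.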

Code Repository

The code repository can be found at https://github.com/andrewdalpino/ESMC-Protein-Function.

V1 Pretrained Models

The following pretrained models are available on HuggingFace Hub and require the esmc-protein-function library version 1.x.x for inference. All V1 models have been optimized with quantization-aware post-training.

| Name | Embedding Dimensions | Encoder Layers | Context Length | Total Parameters |
|---|---|---|---|---|
| andrewdalpino/ESMC-Protein-Function-V1-300M | 960 | 30 | 2048 | 397M |
| andrewdalpino/ESMC-Protein-Function-V1-600M | 1152 | 36 | 2048 | 661M |

V0 Pretrained Models

The following pretrained models are available on HuggingFace Hub and require the esmc_function_classifier library version 0.1.x for inference.

| Name | Embedding Dimensions | Encoder Layers | Context Length | Total Parameters |
|---|---|---|---|---|
| andrewdalpino/ESMC-Protein-Function-V0-300M | 960 | 30 | 2048 | 361M |
| andrewdalpino/ESMC-Protein-Function-V0-300M-QAT | 960 | 30 | 2048 | 361M |
| andrewdalpino/ESMC-Protein-Function-V0-600M | 1152 | 36 | 2048 | 644M |
| andrewdalpino/ESMC-Protein-Function-V0-600M-QAT | 1152 | 36 | 2048 | 644M |

Examples

Install Library

First, install the esmc-protein-function package using pip. I recommend using a virtual environment such as venv to keep dependencies compartmentalized.

pip install esmc-protein-function

Load Model Weights

Then, we'll load the model weights from HuggingFace Hub by calling the from_pretrained() method.

from esmc_protein_function.model import ESMCProteinFunction


model_name = "andrewdalpino/ESMC-Protein-Function-V1-300M"

device = "cpu"  # Can be "cpu", "cuda", "mps", etc.

model = ESMCProteinFunction.from_pretrained(model_name)

model = model.to(device)

Optimize Model Weights

For faster inference and lower memory usage, you can cast the weights of the model to a lower bit depth than the default float32.

import torch


# Cast weights to float16.
model = model.to(dtype=torch.float16)

# Or quantize the weights to Int8.
model.quantize_weights(group_size=64)
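As a rough guide to what each option saves, the sketch below estimates weight memory from the V1 parameter counts listed above. It is back-of-the-envelope arithmetic only: it ignores activations, buffers, and the per-group scale factors that grouped Int8 quantization stores, so real usage will be somewhat higher. The function name is hypothetical.

```python
# Bytes used to store a single weight at each precision.
BYTES_PER_WEIGHT = {"float32": 4, "float16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype):
    """Approximate memory needed for the weights alone, in gigabytes."""
    return num_params * BYTES_PER_WEIGHT[dtype] / 1e9


# Total parameter counts of the V1 models.
models = {"V1-300M": 397_000_000, "V1-600M": 661_000_000}

for name, num_params in models.items():
    for dtype in BYTES_PER_WEIGHT:
        print(f"{name} {dtype}: {weight_memory_gb(num_params, dtype):.2f} GB")
```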

Instantiate ESM Tokenizer

We'll also need the ESM tokenizer from the esm library to tokenize the amino acid sequences.

from esm.tokenization import EsmSequenceTokenizer


tokenizer = EsmSequenceTokenizer()

Predict GO Terms

In this example we'll predict the Gene Ontology terms for all aspects of a protein sequence. The return values are dicts with GO terms as keys and predicted probabilities as values.

import torch


sequence = "MPPKGHKKTADGDFRPVNSAGNTIQAKQKYSIDDLLYPKSTIKNLAKETLPDDAIISKDALTAIQRAATLFVSYMASHGNASAEAGGRKKIT"

top_p = 0.5

out = tokenizer(sequence, max_length=2048, truncation=True)

x = torch.tensor(out["input_ids"], dtype=torch.int32)

# Add the batch dimension for a single sequence.
x = x.unsqueeze(0)

x = x.to(device)

mf_terms, bp_terms, cc_terms = model.predict_all_terms(x, top_p=top_p)

print(mf_terms[0])
print(bp_terms[0])
print(cc_terms[0])

You can also query individual aspects of the Gene Ontology using the predict_mf_terms(), predict_bp_terms(), and predict_cc_terms() methods like in the example below.

mf_terms = model.predict_mf_terms(x, top_p=top_p)
bp_terms = model.predict_bp_terms(x, top_p=top_p)
cc_terms = model.predict_cc_terms(x, top_p=top_p)
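Since the predictions come back as dicts of GO terms and probabilities, you can post-process them with plain Python. Below is a minimal sketch that ranks the terms for one sequence and keeps the most confident ones. The helper name and the probabilities are hypothetical, and it assumes the values are plain Python floats.

```python
def top_terms(term_probs, k=5, min_prob=0.0):
    """Return up to k (term, probability) pairs, highest probability first."""
    ranked = sorted(term_probs.items(), key=lambda item: item[1], reverse=True)

    return [(term, prob) for term, prob in ranked if prob >= min_prob][:k]


# Hypothetical predictions for a single sequence.
example_terms = {
    "GO:0005515": 0.91,
    "GO:0005488": 0.97,
    "GO:0003674": 0.99,
    "GO:0003677": 0.55,
}

best_two = top_terms(example_terms, k=2)
confident = top_terms(example_terms, min_prob=0.9)

print(best_two)
print(confident)
```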

Predict GO Subgraphs

You can also output the GO networkx subgraph for a given sequence. You'll need an up-to-date Gene Ontology database that you can import using the obonet package. The database can be downloaded at https://geneontology.org/docs/download-ontology/.

pip install obonet

Then, load the GO DAG and call the predict_all_subgraphs() method like in the example below.

import networkx as nx

import obonet


go_db_path = "./dataset/go-basic.obo"

graph = obonet.read_obo(go_db_path)

model.load_gene_ontology(graph)

mf_results, bp_results, cc_results = model.predict_all_subgraphs(x, top_p=top_p)

mf_subgraphs, mf_terms = mf_results
bp_subgraphs, bp_terms = bp_results
cc_subgraphs, cc_terms = cc_results

# Render the subgraphs ...

See the predict_subgraphs.py file in the repository for an example of how to render the GO DAGs using Plotly.

You can also output the subgraphs for individual aspects of the GO using the predict_mf_subgraphs(), predict_bp_subgraphs(), and predict_cc_subgraphs() methods like in the example below.

mf_subgraphs, mf_terms = model.predict_mf_subgraphs(x, top_p=top_p)
bp_subgraphs, bp_terms = model.predict_bp_subgraphs(x, top_p=top_p)
cc_subgraphs, cc_terms = model.predict_cc_subgraphs(x, top_p=top_p)

References

  • T. Hayes, et al. Simulating 500 million years of evolution with a language model, 2024.
  • M. Ashburner, et al. Gene Ontology: tool for the unification of biology, 2000.
  • Z. Lin, et al. A Structured Self-attentive Sentence Embedding, ICLR 2017.