| --- |
| license: mit |
| library_name: bpnet |
| tags: |
| - bpnet |
| - dna |
| - genomics |
| - transcription-factor-binding |
| - chip-seq |
| - encode |
| - encode-bpnet-atlas |
| - hg38 |
| - qc-unvalidated |
| - ZSCAN30 |
| --- |
| |
| # ENCODE BPNet Atlas |
|
|
| As part of the ENCODE 4 Project, we trained BPNet models on 2,339 ENCODE |
| transcription factor ChIP-seq experiments spanning 788 targets across |
| 175 biosamples. Here, we provide all models for open-source use. |
|
|
| For more information about the models, see: |
|
|
| - Main ENCODE 4 Paper |
| - A unified lexicon of predictive DNA sequence motifs from ENCODE transcription |
| factor binding and chromatin accessibility assays (Deshpande et al., Zenodo 2025) |
| - Base-resolution models of transcription-factor binding reveal soft motif syntax |
| (Avsec et al., Nat Genet 2021) |
|
|
| ## BPNet model: ZSCAN30 ChIP-seq in HepG2 (ENCSR795CKZ) |
|
|
| - Model: BPNet |
| - Assay: TF ChIP-seq |
| - Target: ZSCAN30 |
| - Experiment: [ENCSR795CKZ](https://www.encodeproject.org/experiments/ENCSR795CKZ/) |
| - Model annotation: [ENCSR128AHG](https://www.encodeproject.org/annotations/ENCSR128AHG/) |
| - Biosample: HepG2 (Full name: Homo sapiens HepG2 genetically modified (insertion) using CRISPR targeting H. sapiens ZSCAN30) |
| - Cell slim(s): cancer cell, epithelial cell |
| - Organ slim(s): epithelium, exocrine gland, endocrine gland, liver |
| - Developmental slim(s): endoderm |
| - System slim(s): digestive system, endocrine system, exocrine system |
| - Assembly: hg38 |
|
|
| ## QC |
|
|
| - Status: unvalidated |
| - Notes: Found potential direct motif (counts); |
|
|
| ## Directory structure |
|
|
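| Models were trained with 5-fold cross-validation. Each `fold_*/` directory contains the trained BPNet model in two formats (paths shown for `fold_0`); `config.json` records the training and architecture parameters: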
|
|
| - `fold_0/model.h5` — BPNet model in .h5 (Keras) format |
| - `fold_0/saved_model/` — BPNet model in TensorFlow SavedModel format (a directory; load directly) |
| - `config.json` — training / architecture parameters |
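The layout above can be walked programmatically. The following is a minimal sketch, assuming the repository has been downloaded to a local directory; the helper name `find_fold_models` is hypothetical and not part of the BPNet package. It only checks which of the two model formats exists in each `fold_*/` directory and does not load anything.

```python
from pathlib import Path


def find_fold_models(root):
    """Map each fold directory name to the model paths found inside it.

    Hypothetical helper: `root` is assumed to be a local copy of this repository.
    A missing format is reported as None rather than raising.
    """
    folds = {}
    for fold_dir in sorted(Path(root).glob("fold_*")):
        if not fold_dir.is_dir():
            continue
        h5_path = fold_dir / "model.h5"
        saved_model_dir = fold_dir / "saved_model"
        folds[fold_dir.name] = {
            "h5": h5_path if h5_path.is_file() else None,
            "saved_model": saved_model_dir if saved_model_dir.is_dir() else None,
        }
    return folds
```

Calling `find_fold_models(".")` from the repository root returns one entry per fold, which can then be passed to either loading route described under Instructions.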
|
|
| ## Instructions |
|
|
| BPNet takes a one-hot DNA sequence plus control (bias) inputs and predicts |
| stranded profile logits and total logcounts. The control inputs come from the |
| matched WCE/Input DNA control and **can be passed as zeros**. |
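As an illustration of the input format described above, here is a minimal sketch of one-hot encoding DNA strings and building all-zero control inputs. The helper names `one_hot_encode` and `zero_controls` are hypothetical, and leaving `N` (or any other non-ACGT base) as an all-zero row is an assumption, not something this card specifies. The shapes follow the ones stated in the loading examples: 2114 bp input, 1000 bp output.

```python
import numpy as np

INPUT_LEN = 2114   # input sequence length used by these models
OUTPUT_LEN = 1000  # length of the predicted profile


def one_hot_encode(seqs, seq_len=INPUT_LEN):
    """Encode DNA strings as float32 of shape (N, seq_len, 4), channels [A, C, G, T]."""
    index = {"A": 0, "C": 1, "G": 2, "T": 3}
    encoded = np.zeros((len(seqs), seq_len, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        if len(seq) != seq_len:
            raise ValueError(f"sequence {i} has length {len(seq)}, expected {seq_len}")
        for j, base in enumerate(seq.upper()):
            k = index.get(base)
            if k is not None:  # assumption: ambiguous bases such as N stay all-zero
                encoded[i, j, k] = 1.0
    return encoded


def zero_controls(n, output_len=OUTPUT_LEN):
    """Return all-zero (profile_bias_input, counts_bias_input) arrays for n sequences."""
    profile_bias = np.zeros((n, output_len, 2), dtype=np.float32)
    counts_bias = np.zeros((n, 2), dtype=np.float32)
    return profile_bias, counts_bias
```

With these helpers, `sequence = one_hot_encode(list_of_2114bp_strings)` and `profile_bias_input, counts_bias_input = zero_controls(len(sequence))` produce arrays with the shapes the loading examples expect.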
|
|
| ### 1. Loading the SavedModel and making predictions |
|
|
```python
import numpy as np
import tensorflow as tf
from scipy.special import logsumexp

model = tf.saved_model.load("fold_0/saved_model")

# Inputs (`sequence` must be supplied by the caller):
#   sequence:           (N, 2114, 4) one-hot [A,C,G,T]
#   profile_bias_input: (N, 1000, 2) per-base profile bias from WCE/Input control, or zeros
#   counts_bias_input:  (N, 2) log2 total counts from WCE/Input control, or zeros
N = sequence.shape[0]
profile_bias_input = np.zeros((N, 1000, 2))  # zeros here; use real control values if available
counts_bias_input = np.zeros((N, 2))

predictions = model.signatures["serving_default"](**{
    "sequence": tf.constant(sequence, dtype=tf.float32),
    "profile_bias_input_0": tf.constant(profile_bias_input, dtype=tf.float32),
    "counts_bias_input_0": tf.constant(counts_bias_input, dtype=tf.float32)})
# predictions["profile_predictions"]: (N, 1000, 2) logits (strands NOT independent)
# predictions["logcounts_predictions"]: (N, 1) total logcount

output_len = 1000

def vectorized_prediction_to_profile(predictions):
    """Convert profile logits and logcounts into per-strand predicted count profiles."""
    logits_arr = np.asarray(predictions["profile_predictions"])
    counts_arr = np.asarray(predictions["logcounts_predictions"])
    n = len(logits_arr)
    # Softmax jointly over all positions and both strands
    pred_profile_logits = np.reshape(logits_arr, [-1, 1, output_len * 2])
    probVals_array = np.exp(pred_profile_logits - logsumexp(
        pred_profile_logits, axis=2).reshape([n, 1, 1]))
    # Scale the probabilities by the predicted total counts
    profile_predictions = np.multiply(
        np.exp(counts_arr).reshape([n, 1, 1]), probVals_array)
    profile_predictions = np.reshape(profile_predictions, [n, output_len, 2])
    plus = profile_predictions[:, :, 0]
    minus = profile_predictions[:, :, 1]
    return plus, minus, counts_arr

plus, minus, logcounts = vectorized_prediction_to_profile(predictions)
```
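The note that the strands are not independent means the softmax is taken jointly over all 1000 positions and both strands, so the plus and minus profiles together (not each on its own) sum to the predicted total count. The sketch below reproduces that conversion with NumPy only, on random stand-in arrays instead of real model outputs; the function name `joint_profile` is hypothetical.

```python
import numpy as np


def joint_profile(logits, logcounts):
    """Joint softmax over positions and strands, scaled by exp(logcounts).

    logits:    (N, L, 2) profile logits
    logcounts: (N, 1) natural-log total counts
    Returns (plus, minus), each of shape (N, L).
    """
    n, length, strands = logits.shape
    flat = logits.reshape(n, length * strands)
    flat = flat - flat.max(axis=1, keepdims=True)  # subtract the max for numerical stability
    probs = np.exp(flat)
    probs = probs / probs.sum(axis=1, keepdims=True)
    profile = (np.exp(logcounts) * probs).reshape(n, length, strands)
    return profile[:, :, 0], profile[:, :, 1]


# Random stand-ins for model outputs (not real predictions)
rng = np.random.default_rng(0)
logits = rng.normal(size=(3, 1000, 2))
logcounts = rng.uniform(2.0, 6.0, size=(3, 1))

plus, minus = joint_profile(logits, logcounts)
total = plus.sum(axis=1) + minus.sum(axis=1)
# Both strands together recover the predicted total counts
assert np.allclose(total, np.exp(logcounts[:, 0]))
```

A per-strand softmax would instead force each strand to sum to the total separately, which is not what the conversion above does.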
|
|
| ### 2. Loading the .h5 (Keras) and making predictions |
|
|
```python
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as kb
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import CustomObjectScope
from bpnet.model.custommodel import CustomModel

def get_model(model_path):
    # Register the custom objects the saved model refers to
    with CustomObjectScope({"kb": kb, "tf": tf, "CustomModel": CustomModel}):
        return load_model(model_path)

model = get_model("fold_0/model.h5")
N = sequence.shape[0]
predictions = model.predict([
    sequence,                # (N, 2114, 4)
    np.zeros((N, 1000, 2)),  # profile_bias_input (or real WCE/Input control values)
    np.zeros((N, 2))])       # counts_bias_input (or real control log2 counts)
# predictions[0]: (N, 1000, 2) logits; predictions[1]: (N, 1) logcounts
# vectorized_prediction_to_profile() from section 1 expects a dict, so wrap the outputs:
plus, minus, logcounts = vectorized_prediction_to_profile({
    "profile_predictions": predictions[0],
    "logcounts_predictions": predictions[1]})
```
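Because each of the five folds yields its own predictions, you may want to combine them. This card does not prescribe how; the sketch below simply takes the arithmetic mean of the per-fold outputs of `vectorized_prediction_to_profile()`, which is one possible choice and an assumption on our part. The helper name `average_folds` is hypothetical.

```python
import numpy as np


def average_folds(per_fold):
    """Average predictions across folds.

    per_fold: list of (plus, minus, logcounts) tuples, one per fold, with
              plus/minus of shape (N, L) and logcounts of shape (N, 1).
    Returns the element-wise mean of each component across folds.
    """
    plus = np.mean([np.asarray(p) for p, _, _ in per_fold], axis=0)
    minus = np.mean([np.asarray(m) for _, m, _ in per_fold], axis=0)
    logcounts = np.mean([np.asarray(c) for _, _, c in per_fold], axis=0)
    return plus, minus, logcounts
```

Note that averaging logcounts corresponds to a geometric mean of the counts, so the averaged profiles will not in general sum exactly to `exp` of the averaged logcounts.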
|
|
| ## Docker image to load and use the models |
|
|
| `kundajelab/bpnet-atlas` (placeholder — image forthcoming). |
|
|
| ## Code |
|
|
| - Code: https://github.com/kundajelab/bpnet/ |
| - Toolbox & downstream analysis: https://github.com/kundajelab/bpnet/wiki |
|
|
| ## License & citation |
|
|
| Released under the MIT license. |
|
|
| The models are derived from ENCODE data (unrestricted use under the ENCODE |
| data-use policy). Please cite the ENCODE Project Consortium and the model |
| software: BPNet (Avsec et al., Nat Genet 2021). |
|
|