---
license: mit
library_name: bpnet
tags:
- bpnet
- dna
- genomics
- transcription-factor-binding
- chip-seq
- encode
- encode-bpnet-atlas
- hg38
- qc-unvalidated
- ZSCAN30
---
# ENCODE BPNet Atlas
As part of the ENCODE 4 Project, we trained BPNet models on 2,339 ENCODE
transcription factor ChIP-seq experiments spanning 788 targets across
175 biosamples. Here, we provide all models for open-source use.
For more information about the models, see:
- Main ENCODE 4 Paper
- A unified lexicon of predictive DNA sequence motifs from ENCODE transcription
factor binding and chromatin accessibility assays (Deshpande et al., Zenodo 2025)
- Base-resolution models of transcription-factor binding reveal soft motif syntax
(Avsec et al., Nat Genet 2021)
## BPNet model: ZSCAN30 ChIP-seq in HepG2 (ENCSR795CKZ)
- Model: BPNet
- Assay: TF ChIP-seq
- Target: ZSCAN30
- Experiment: [ENCSR795CKZ](https://www.encodeproject.org/experiments/ENCSR795CKZ/)
- Model annotation: [ENCSR128AHG](https://www.encodeproject.org/annotations/ENCSR128AHG/)
- Biosample: HepG2 (Full name: Homo sapiens HepG2 genetically modified (insertion) using CRISPR targeting H. sapiens ZSCAN30)
- Cell slim(s): cancer cell, epithelial cell
- Organ slim(s): epithelium, exocrine gland, endocrine gland, liver
- Developmental slim(s): endoderm
- System slim(s): digestive system, endocrine system, exocrine system
- Assembly: hg38
## QC
- Status: unvalidated
- Notes: Found potential direct motif (counts)
## Directory structure
Models were trained with 5-fold cross-validation. Each `fold_*/` directory contains the trained BPNet model in two formats (shown here for `fold_0`):
- `fold_0/model.h5` — BPNet model in .h5 (Keras) format
- `fold_0/saved_model/` — BPNet model in TensorFlow SavedModel format (a directory; load directly)
- `config.json` — training / architecture parameters
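Because there is one model per fold, a common way to get a single prediction is to average the per-fold outputs. The sketch below assumes that averaging across folds is appropriate for your use case (this card does not prescribe it). It uses synthetic arrays as stand-ins for real per-fold outputs, and the helper name `average_fold_predictions` is hypothetical.

```python
import numpy as np

def average_fold_predictions(fold_profiles, fold_logcounts):
    """Average per-fold outputs (an assumed ensembling choice, not from this card).

    fold_profiles:  list of (N, 1000, 2) arrays of predicted count profiles
    fold_logcounts: list of (N, 1) arrays of predicted total logcounts
    """
    mean_profile = np.mean(np.stack(fold_profiles, axis=0), axis=0)
    mean_logcounts = np.mean(np.stack(fold_logcounts, axis=0), axis=0)
    return mean_profile, mean_logcounts

# Synthetic stand-ins for five folds; real values would come from the models.
rng = np.random.default_rng(0)
fold_profiles = [rng.random((3, 1000, 2)) for _ in range(5)]
fold_logcounts = [rng.random((3, 1)) for _ in range(5)]

mean_profile, mean_logcounts = average_fold_predictions(fold_profiles, fold_logcounts)
```

Note that averaging logcounts gives a geometric mean of the predicted totals. Averaging `exp(logcounts)` instead would give an arithmetic mean, so choose whichever matches your downstream analysis.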
## Instructions
BPNet takes a one-hot DNA sequence plus control (bias) inputs and predicts
stranded profile logits and total logcounts. The control inputs come from the
matched WCE/Input DNA control and **can be passed as zeros**.
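As a minimal sketch of preparing these inputs, the snippet below one-hot encodes DNA strings in the `[A, C, G, T]` column order and builds all-zero control inputs. The helper names `one_hot_encode` and `zero_bias_inputs` are hypothetical, the shapes are taken from the loading examples in this card, and leaving ambiguous bases such as `N` as all-zero rows is an assumption rather than something this card specifies.

```python
import numpy as np

# Column order follows the [A, C, G, T] convention used in this card.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}

def one_hot_encode(seqs, input_len=2114):
    """Encode equal-length DNA strings as an (N, input_len, 4) float32 array.

    Bases other than A/C/G/T (e.g. N) are left as all-zero rows (an assumption).
    """
    out = np.zeros((len(seqs), input_len, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        if len(seq) != input_len:
            raise ValueError(
                f"sequence {i} has length {len(seq)}, expected {input_len}")
        for j, base in enumerate(seq.upper()):
            k = BASE_TO_INDEX.get(base)
            if k is not None:
                out[i, j, k] = 1.0
    return out

def zero_bias_inputs(n, output_len=1000):
    """Return all-zero control inputs with the shapes listed in this card."""
    profile_bias_input = np.zeros((n, output_len, 2), dtype=np.float32)
    counts_bias_input = np.zeros((n, 2), dtype=np.float32)
    return profile_bias_input, counts_bias_input

# Toy 2114-bp sequence: 528 repeats of "ACGT" (2112 bases) plus "AN".
sequence = one_hot_encode(["ACGT" * 528 + "AN"])
profile_bias_input, counts_bias_input = zero_bias_inputs(len(sequence))
```

The resulting `sequence`, `profile_bias_input`, and `counts_bias_input` arrays have the shapes expected by the loading examples in the following sections.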
### 1. Loading the SavedModel and making predictions
```python
import numpy as np
import tensorflow as tf
from scipy.special import logsumexp

model = tf.saved_model.load("fold_0/saved_model")

# Inputs (NumPy arrays that you supply):
# sequence: (N, 2114, 4) one-hot [A,C,G,T]
# profile_bias_input: (N, 1000, 2) per-base profile bias from WCE/Input control, or zeros
# counts_bias_input: (N, 2) log2 total counts from WCE/Input control, or zeros
predictions = model.signatures["serving_default"](**{
    "sequence": sequence.astype("float32"),
    "profile_bias_input_0": profile_bias_input.astype("float32"),
    "counts_bias_input_0": counts_bias_input.astype("float32")})
# predictions["profile_predictions"]: (N, 1000, 2) logits (strands NOT independent)
# predictions["logcounts_predictions"]: (N, 1) total logcount

output_len = 1000

def vectorized_prediction_to_profile(predictions):
    """Convert profile logits and logcounts into per-strand predicted counts."""
    logits_arr = np.asarray(predictions["profile_predictions"])
    counts_arr = np.asarray(predictions["logcounts_predictions"])
    # Softmax jointly over all positions and both strands
    pred_profile_logits = np.reshape(logits_arr, [-1, 1, output_len * 2])
    probVals_array = np.exp(pred_profile_logits - logsumexp(
        pred_profile_logits, axis=2).reshape([len(logits_arr), 1, 1]))
    # Scale the probabilities by the predicted total counts
    profile_predictions = np.multiply(
        np.exp(counts_arr).reshape([len(counts_arr), 1, 1]), probVals_array)
    profile_predictions = np.reshape(
        profile_predictions, [len(counts_arr), output_len, 2])
    plus = profile_predictions[:, :, 0]
    minus = profile_predictions[:, :, 1]
    return plus, minus, counts_arr

plus, minus, logcounts = vectorized_prediction_to_profile(predictions)
```
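The conversion above applies a single softmax across all positions and both strands, then scales the result by `exp(logcounts)`. The NumPy-only sketch below reproduces that arithmetic on synthetic logits (random stand-ins, not model output) to illustrate that the plus and minus profiles together sum to the predicted total count for each example.

```python
import numpy as np

rng = np.random.default_rng(0)
n, output_len = 4, 1000

# Synthetic stand-ins for the model outputs (not real predictions).
logits = rng.normal(size=(n, output_len, 2))
logcounts = rng.normal(loc=5.0, size=(n, 1))

# Numerically stable softmax over positions and strands together,
# so the 2 * output_len probabilities of each example sum to 1.
flat = logits.reshape(n, -1)
flat = flat - flat.max(axis=1, keepdims=True)
probs = np.exp(flat) / np.exp(flat).sum(axis=1, keepdims=True)
probs = probs.reshape(n, output_len, 2)

# Expected counts per position and strand.
profile = np.exp(logcounts)[:, :, None] * probs
plus, minus = profile[..., 0], profile[..., 1]

# Both strands together recover the predicted total count.
total = plus.sum(axis=1) + minus.sum(axis=1)
assert np.allclose(total, np.exp(logcounts[:, 0]))
```

Because the softmax is shared across strands, neither strand's profile sums to the total on its own; only their combination does.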
### 2. Loading the .h5 (Keras) and making predictions
```python
import numpy as np
import tensorflow as tf
import tensorflow.keras.backend as kb
from tensorflow.keras.models import load_model
from tensorflow.keras.utils import CustomObjectScope
from bpnet.model.custommodel import CustomModel

def get_model(model_path):
    # Register the custom objects the saved model refers to
    with CustomObjectScope({"kb": kb, "tf": tf, "CustomModel": CustomModel}):
        return load_model(model_path)

model = get_model("fold_0/model.h5")

N = sequence.shape[0]
predictions = model.predict([
    sequence,                 # (N, 2114, 4)
    np.zeros((N, 1000, 2)),   # profile_bias_input (or real WCE/Input control values)
    np.zeros((N, 2))])        # counts_bias_input (or real control log2 counts)
# predictions[0]: (N, 1000, 2) logits; predictions[1]: (N, 1) logcounts

# Reuse vectorized_prediction_to_profile() from section 1; it expects a dict
plus, minus, logcounts = vectorized_prediction_to_profile({
    "profile_predictions": predictions[0],
    "logcounts_predictions": predictions[1]})
```
## Docker image to load and use the models
`kundajelab/bpnet-atlas` (placeholder — image forthcoming).
## Code
- Code: https://github.com/kundajelab/bpnet/
- Toolbox & downstream analysis: https://github.com/kundajelab/bpnet/wiki
## License & citation
Released under the MIT license.
The models are derived from ENCODE data (unrestricted use under the ENCODE
data-use policy). Please cite the ENCODE Project Consortium and the model
software: BPNet (Avsec et al., Nat Genet 2021).