	---
	license: mit
	library_name: bpnet
	tags:
	- bpnet
	- dna
	- genomics
	- transcription-factor-binding
	- chip-seq
	- encode
	- encode-bpnet-atlas
	- hg38
	- qc-unvalidated
	- ZSCAN30
	---

	# ENCODE BPNet Atlas

	As part of the ENCODE 4 Project, we trained BPNet models on 2,339 ENCODE
	transcription factor ChIP-seq experiments spanning 788 targets across
	175 biosamples. Here, we provide all models for open-source use.

	For more information about the models, see:

	- Main ENCODE 4 Paper
	- A unified lexicon of predictive DNA sequence motifs from ENCODE transcription
	factor binding and chromatin accessibility assays (Deshpande et al., Zenodo 2025)
	- Base-resolution models of transcription-factor binding reveal soft motif syntax
	(Avsec et al., Nat Genet 2021)

	## BPNet model: ZSCAN30 ChIP-seq in HepG2 (ENCSR795CKZ)

	- Model: BPNet
	- Assay: TF ChIP-seq
	- Target: ZSCAN30
	- Experiment: [ENCSR795CKZ](https://www.encodeproject.org/experiments/ENCSR795CKZ/)
	- Model annotation: [ENCSR128AHG](https://www.encodeproject.org/annotations/ENCSR128AHG/)
	- Biosample: HepG2 (Full name: Homo sapiens HepG2 genetically modified (insertion) using CRISPR targeting H. sapiens ZSCAN30)
	- Cell slim(s): cancer cell, epithelial cell
	- Organ slim(s): epithelium, exocrine gland, endocrine gland, liver
	- Developmental slim(s): endoderm
	- System slim(s): digestive system, endocrine system, exocrine system
	- Assembly: hg38

	## QC

	- Status: unvalidated
	- Notes: Found potential direct motif (counts)

	## Directory structure

	Models were trained with 5-fold cross-validation. Each `fold_*/` directory contains the trained BPNet model in two formats; paths are shown for `fold_0`:

	- `fold_0/model.h5` — BPNet model in .h5 (Keras) format
	- `fold_0/saved_model/` — BPNet model in TensorFlow SavedModel format (a directory; load directly)
	- `config.json` — training / architecture parameters
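
To check which artifacts are present after downloading the repository, the layout above can be walked with a few lines of Python. This is a minimal sketch: `find_fold_models` is a hypothetical helper, not part of the BPNet code base, and it assumes the fold directories sit directly under the directory you pass in.

```python
from pathlib import Path


def find_fold_models(repo_dir):
    """Map each fold directory name to its model paths (None if a file is absent)."""
    folds = {}
    for fold_dir in sorted(Path(repo_dir).glob("fold_*")):
        if not fold_dir.is_dir():
            continue
        h5_path = fold_dir / "model.h5"
        saved_model_path = fold_dir / "saved_model"
        folds[fold_dir.name] = {
            "h5": h5_path if h5_path.is_file() else None,
            "saved_model": saved_model_path if saved_model_path.is_dir() else None,
        }
    return folds


# Assumes the current directory is the downloaded repository root.
folds = find_fold_models(".")
```

Each value in `folds` holds the two paths used by the loading examples in the Instructions section.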

	## Instructions

	BPNet takes a one-hot-encoded DNA sequence plus control (bias) inputs and
	predicts stranded profile logits and the total log count. The control inputs are
	derived from the matched whole-cell extract (WCE)/Input DNA control; they can
	also be passed as all zeros.
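
The snippets below expect `sequence`, `profile_bias_input`, and `counts_bias_input` to exist already. One way to build them is sketched here, using the shapes quoted in the code comments of this card (2114 bp input, 1000 bp output). `one_hot_encode` and `zero_bias_inputs` are hypothetical helpers, and leaving `N` (or any non-ACGT base) as an all-zero row is an assumption rather than documented BPNet behavior.

```python
import numpy as np

INPUT_LEN = 2114   # input sequence length, taken from the shapes in this card
OUTPUT_LEN = 1000  # predicted profile length, taken from the shapes in this card

# Column order A, C, G, T matches the one-hot layout described in this card.
BASE_TO_INDEX = {"A": 0, "C": 1, "G": 2, "T": 3}


def one_hot_encode(seqs, length=INPUT_LEN):
    """Encode DNA strings as a float32 array of shape (N, length, 4)."""
    encoded = np.zeros((len(seqs), length, 4), dtype=np.float32)
    for i, seq in enumerate(seqs):
        if len(seq) != length:
            raise ValueError(f"sequence {i} has length {len(seq)}, expected {length}")
        for j, base in enumerate(seq.upper()):
            idx = BASE_TO_INDEX.get(base)
            if idx is not None:  # assumption: non-ACGT bases stay all-zero
                encoded[i, j, idx] = 1.0
    return encoded


def zero_bias_inputs(n, output_len=OUTPUT_LEN):
    """Return all-zero control inputs for n sequences."""
    profile_bias_input = np.zeros((n, output_len, 2), dtype=np.float32)
    counts_bias_input = np.zeros((n, 2), dtype=np.float32)
    return profile_bias_input, counts_bias_input


# Toy 2114 bp sequence; real inputs would be extracted from the hg38 genome.
sequence = one_hot_encode(["ACGT" * 528 + "AN"])
profile_bias_input, counts_bias_input = zero_bias_inputs(len(sequence))
```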

	### 1. Loading the SavedModel and making predictions

	```python
	import numpy as np
	import tensorflow as tf
	from scipy.special import logsumexp

	model = tf.saved_model.load("fold_0/saved_model")
	# Inputs are NumPy arrays supplied by the caller:
	# sequence: (N, 2114, 4) one-hot [A,C,G,T]
	# profile_bias_input: (N, 1000, 2) per-base profile bias from WCE/Input control, or zeros
	# counts_bias_input: (N, 2) log2 total counts from WCE/Input control, or zeros
	# Signature functions expect tensors, so wrap the NumPy inputs in tf.constant.
	predictions = model.signatures["serving_default"](**{
	    "sequence": tf.constant(sequence.astype("float32")),
	    "profile_bias_input_0": tf.constant(profile_bias_input.astype("float32")),
	    "counts_bias_input_0": tf.constant(counts_bias_input.astype("float32"))})
	# predictions["profile_predictions"]: (N, 1000, 2) logits (strands NOT independent)
	# predictions["logcounts_predictions"]: (N, 1) total logcount

	output_len = 1000

	def vectorized_prediction_to_profile(predictions):
	    """Convert profile logits and logcounts into per-strand count profiles."""
	    logits_arr = np.asarray(predictions["profile_predictions"])
	    counts_arr = np.asarray(predictions["logcounts_predictions"])
	    n = len(logits_arr)
	    # Softmax jointly over all positions and both strands (2 * output_len values).
	    pred_profile_logits = np.reshape(logits_arr, [n, 1, output_len * 2])
	    probVals_array = np.exp(pred_profile_logits - logsumexp(
	        pred_profile_logits, axis=2).reshape([n, 1, 1]))
	    # Scale the probabilities by the predicted total counts.
	    profile_predictions = np.multiply(
	        np.exp(counts_arr).reshape([n, 1, 1]), probVals_array)
	    profile_predictions = np.reshape(profile_predictions, [n, output_len, 2])
	    plus = profile_predictions[:, :, 0]
	    minus = profile_predictions[:, :, 1]
	    return plus, minus, counts_arr

	plus, minus, logcounts = vectorized_prediction_to_profile(predictions)
	```
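
The conversion above applies a single softmax across all positions and both strands, then scales the result by `exp(logcounts)`, so the two strand profiles together sum to the predicted total count. The toy sketch below reproduces that arithmetic with NumPy only. `logits_to_profile`, `toy_logits`, and `toy_logcounts` are illustrative names with made-up values, not model outputs.

```python
import numpy as np


def logits_to_profile(logits, logcounts):
    """Joint softmax over (positions, strands), scaled by exp(logcounts)."""
    n = logits.shape[0]
    flat = logits.reshape(n, -1)
    flat = flat - flat.max(axis=1, keepdims=True)  # subtract the max for stability
    weights = np.exp(flat)
    probs = weights / weights.sum(axis=1, keepdims=True)
    profile = probs.reshape(logits.shape) * np.exp(logcounts).reshape(n, 1, 1)
    return profile[:, :, 0], profile[:, :, 1]


# One example, 4 positions, 2 strands: a single plus-strand position gets 9x weight.
toy_logits = np.zeros((1, 4, 2))
toy_logits[0, 1, 0] = np.log(9.0)
toy_logcounts = np.array([[np.log(32.0)]])  # predicted total of 32 reads

plus, minus = logits_to_profile(toy_logits, toy_logcounts)
# plus[0]  is [2, 18, 2, 2]
# minus[0] is [2, 2, 2, 2]
```

Because the softmax is shared, raising one strand's logits lowers the predicted counts on the other strand, which is why the strands are described as not independent.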

	### 2. Loading the .h5 (Keras) and making predictions

	```python
	import numpy as np
	import tensorflow as tf
	import tensorflow.keras.backend as kb
	from tensorflow.keras.models import load_model
	from tensorflow.keras.utils import CustomObjectScope
	from bpnet.model.custommodel import CustomModel

	def get_model(model_path):
	    # The custom objects are needed to deserialize the BPNet model from .h5.
	    with CustomObjectScope({"kb": kb, "tf": tf, "CustomModel": CustomModel}):
	        return load_model(model_path)

	model = get_model("fold_0/model.h5")
	# sequence: (N, 2114, 4) one-hot NumPy array supplied by the caller
	N = sequence.shape[0]
	predictions = model.predict([
	    sequence,  # (N, 2114, 4)
	    np.zeros((N, 1000, 2)),  # profile_bias_input (or real WCE/Input control values)
	    np.zeros((N, 2))])  # counts_bias_input (or real control log2 counts)
	# predictions[0]: (N, 1000, 2) logits; predictions[1]: (N, 1) logcounts
	# To convert, reuse vectorized_prediction_to_profile() from section 1, which expects a dict:
	# vectorized_prediction_to_profile({"profile_predictions": predictions[0],
	#                                   "logcounts_predictions": predictions[1]})
	```

	## Docker image to load and use the models

	`kundajelab/bpnet-atlas` (placeholder — image forthcoming).

	## Code

	- Code: https://github.com/kundajelab/bpnet/
	- Toolbox & downstream analysis: https://github.com/kundajelab/bpnet/wiki

	## License & citation

	Released under the MIT license.

	The models are derived from ENCODE data (unrestricted use under the ENCODE
	data-use policy). Please cite the ENCODE Project Consortium and the model
	software: BPNet (Avsec et al., Nat Genet 2021).