ChIPr
Three variants of regression models based on deep neural networks, random forest, and gradient boosting to predict chromatin interaction strength.
Getting started
1- Download the source code
2- Unzip the ChIPr-main.zip file downloaded
3- Prepare the inputs of ChIP-seq data and genomic distance
a- The following lines will download RAD21 ChIP-seq data (bigwig file), converts it to wig file, and divides the data to 23 chromosomes
rad21_folder <- '/Users/s206442/Documents/rad21_bigwig/'
link_rad21 <- 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM935nnn/GSM935332/suppl/GSM935332%5Fhg19%5FwgEncodeSydhTfbsGm12878Rad21IggrabSig%2EbigWig'
fname2 = 'GM12878_rad21.wig'
read_chip_seq(rad21_folder,link_rad21,fname2)
b- Similar following lines will do the same for H3K27ac and H3K27me3 ChIP-seq data
To run the script, go to the code folder in the terminal, and type
Rscript prepare_chipseq_inputs.R
It will take probably 1 day to finish.
To run it of cluster, prepare a bash file with the above command and run it on the cluster
The output folder is determined in the line
op_folder <- '/Users/s206442/Documents/GM_inputs/'
It will look as shown in the picture after running the script
4- Calculate the GC content of the peaks
a- This script needs installing bedtools, see [https://anaconda.org/bioconda/bedtools]
b- Download the hg19 fasta file from: [https://cloud.biohpc.swmed.edu/index.php/s/L654FZNsq2xkwqn]
c- Put the "genome.fa" file in a folder and write it in the line:
fasta_dir = '/Users/s206442/Documents/hg19/'
d- The output folder that will contain the files output from this script is written in the line:
peaks_bed_folder <- '/Users/s206442/Documents/peaks_bed_files/'
e- Run the script by typing the command below. It finishes in few minutes
Rscript get_GC_content.R
5- Calculate the CTCF motif orientation flag
a- This script needs installing bedtools as the previous step
b- This line gives the location of "CTCF2.bed" file downloaded with the program. Please adjust the path according to the location on your computer. This file is extracted from "https://compbio.mit.edu/encode-motifs/" [CTCF_known1 motif]
fname_ctcf <- '/Users/s206442/Documents/hg19/CTCF2.bed'
c- This line gives the path to the output folder
peaks_bed_folder <- '/Users/s206442/Documents/peaks_bed_files/'
d- Run the script by typing the command below. It finishes in few minutes
Rscript Get_CTCF_orientation_flag.R
6- Prepare the final input training files
a- This line gives the path to the outputs calculated in step 2 above, and it will also be the path of the outputs of this step
op_folder <- '/Users/s206442/Documents/GM_inputs/'
b- This line gives the path of the outputs of steps 3 and 4 above
peaks_bed_folder <- '/Users/s206442/Documents/peaks_bed_files/'
c- To run the script, please type on terminal
Rscript prepare_final_input_files.R
7- Train the deep neural network (DNN) model
a- Give the input files path in the line
path = '/Users/s206442/Documents/GM_inputs'
b- Give the name of the model output of the training process
name = 'gm12878_model.h5'
c- Run the script by typing the following command on the terminal
python train_DNN.py
8- Train the Random forest and gradient boosting models
a- Run the script by typing the following command
python train_random_forest_and_gradient_boosting.py
Testing the model on K562 data
1- Prepare K562 ChIP-seq inputs
Rscript prepare_chipseq_inputs_K562.R
2- Get GC content for K562 peaks
Rscript get_GC_content_K562.R
3- Get CTCF orientation flag for K562
Rscript get_CTCF_orientation_flag_K562.R
4- Get final K562 inputs
Rscript get_final_inputs_K562.R
GM12878_RAD21_DNN_trained_model
, you can use it and update the name of the model in the script below if you want) and test on K562 chromosomes (the folder K562_inputs.zip
needed here is also in the files above in the repository if you do not want to prepare the files by the steps above)
5- Load the trained DNN model (there is also a pretrained model in the folder from numpy import genfromtxt
def read_test_data(path,chr_num):
fname = f'{path}/reduced_inputs_{chr_num}_gc_orient.csv'
inputs = genfromtxt(fname, delimiter='\t')
fname = f'{path}/reduced_outputs_{chr_num}_gc_orient.csv'
outputs = genfromtxt(fname, delimiter='\t')
I = inputs
O = outputs
return I,O
from keras.models import load_model
import scipy
path = '/work/pathology/s206442/dbet_data/test_chipr/K562/K562_inputs/'
I,O = read_test_data(path,'chr20')
name = 'gm12878_model.h5'
NN_model = load_model(name)
P = NN_model.predict(I)
A = scipy.stats.pearsonr(O.flatten(), P.flatten())
A
Using ChIPr to predict GM12878 Hi-C interactions
-
User can download sample data for GM12878 cell line and also GB-ChIPr trained model on IMR90 Hi-C data from: 10.5281/zenodo.10471911
-
An example on how to predict the interactions and Pearson correlation results can be found in this repository:
./Validation_Results/get_results-HiC_GM_cell_line_Gradient_Boosting.ipynb