Skip to content
Snippets Groups Projects

ChIPr

Three variants of regression models based on deep neural networks, random forest, and gradient boosting to predict chromatin interaction strength.

Getting started

1- Download the source code

2- Unzip the ChIPr-main.zip file downloaded

3- Prepare the inputs of ChIP-seq data and genomic distance

a- The following lines will download RAD21 ChIP-seq data (bigwig file), converts it to wig file, and divides the data to 23 chromosomes

rad21_folder <- '/Users/s206442/Documents/rad21_bigwig/'
link_rad21 <- 'https://ftp.ncbi.nlm.nih.gov/geo/samples/GSM935nnn/GSM935332/suppl/GSM935332%5Fhg19%5FwgEncodeSydhTfbsGm12878Rad21IggrabSig%2EbigWig'
fname2 = 'GM12878_rad21.wig'
read_chip_seq(rad21_folder,link_rad21,fname2)

b- Similar following lines will do the same for H3K27ac and H3K27me3 ChIP-seq data

To run the script, go to the code folder in the terminal, and type

Rscript prepare_chipseq_inputs.R

It will take probably 1 day to finish.

To run it of cluster, prepare a bash file with the above command and run it on the cluster

The output folder is determined in the line

op_folder <- '/Users/s206442/Documents/GM_inputs/'

It will look as shown in the picture after running the script

ALT

4- Calculate the GC content of the peaks

a- This script needs installing bedtools, see [https://anaconda.org/bioconda/bedtools]

b- Download the hg19 fasta file from: [https://cloud.biohpc.swmed.edu/index.php/s/L654FZNsq2xkwqn]

c- Put the "genome.fa" file in a folder and write it in the line:

fasta_dir = '/Users/s206442/Documents/hg19/'

d- The output folder that will contain the files output from this script is written in the line:

peaks_bed_folder <- '/Users/s206442/Documents/peaks_bed_files/'

e- Run the script by typing the command below. It finishes in few minutes

Rscript get_GC_content.R

5- Calculate the CTCF motif orientation flag

a- This script needs installing bedtools as the previous step

b- This line gives the location of "CTCF2.bed" file downloaded with the program. Please adjust the path according to the location on your computer. This file is extracted from "https://compbio.mit.edu/encode-motifs/" [CTCF_known1 motif]

fname_ctcf <- '/Users/s206442/Documents/hg19/CTCF2.bed'

c- This line gives the path to the output folder

peaks_bed_folder <- '/Users/s206442/Documents/peaks_bed_files/'

d- Run the script by typing the command below. It finishes in few minutes

Rscript Get_CTCF_orientation_flag.R

6- Prepare the final input training files

a- This line gives the path to the outputs calculated in step 2 above, and it will also be the path of the outputs of this step

op_folder <- '/Users/s206442/Documents/GM_inputs/'

b- This line gives the path of the outputs of steps 3 and 4 above

peaks_bed_folder <- '/Users/s206442/Documents/peaks_bed_files/'

c- To run the script, please type on terminal

Rscript prepare_final_input_files.R

7- Train the deep neural network (DNN) model

a- Give the input files path in the line

path = '/Users/s206442/Documents/GM_inputs'

b- Give the name of the model output of the training process

name = 'gm12878_model.h5'

c- Run the script by typing the following command on the terminal

python train_DNN.py

8- Train the Random forest and gradient boosting models

a- Run the script by typing the following command

python train_random_forest_and_gradient_boosting.py

Testing the model on K562 data

1- Prepare K562 ChIP-seq inputs

Rscript prepare_chipseq_inputs_K562.R

2- Get GC content for K562 peaks

Rscript get_GC_content_K562.R

3- Get CTCF orientation flag for K562

Rscript get_CTCF_orientation_flag_K562.R

4- Get final K562 inputs

Rscript get_final_inputs_K562.R

5- Load the trained DNN model (there is also a pretrained model in the folder GM12878_RAD21_DNN_trained_model, you can use it and update the name of the model in the script below if you want) and test on K562 chromosomes (the folder K562_inputs.zip needed here is also in the files above in the repository if you do not want to prepare the files by the steps above)

from numpy import genfromtxt
def read_test_data(path,chr_num):
    fname = f'{path}/reduced_inputs_{chr_num}_gc_orient.csv'
    inputs = genfromtxt(fname, delimiter='\t')
    fname = f'{path}/reduced_outputs_{chr_num}_gc_orient.csv'
    outputs = genfromtxt(fname, delimiter='\t')
    I = inputs
    O = outputs
    return I,O

from keras.models import load_model
import scipy
path = '/work/pathology/s206442/dbet_data/test_chipr/K562/K562_inputs/'
I,O = read_test_data(path,'chr20')
name =  'gm12878_model.h5'
NN_model = load_model(name)
P = NN_model.predict(I)
A = scipy.stats.pearsonr(O.flatten(), P.flatten())
A

Using ChIPr to predict GM12878 Hi-C interactions

  1. User can download sample data for GM12878 cell line and also GB-ChIPr trained model on IMR90 Hi-C data from: 10.5281/zenodo.10471911

  2. An example on how to predict the interactions and Pearson correlation results can be found in this repository: ./Validation_Results/get_results-HiC_GM_cell_line_Gradient_Boosting.ipynb

For any questions, please contact:

ahmed.abbaselmahdi@utsouthwestern.edu