Christopher Bennett authored
a47fb6a2

TF-gene network construction

The goal is to construct mouse cell-type-specific TF-gene networks. Network construction has two parts: i) enhancer-gene mapping and ii) TF-enhancer mapping. The mouse ENCODE dataset is used to construct the network.

Details of Main Programs

  • get_encode_data.py: Download data (ChIP-seq, RNA-seq) from the ENCODE Portal using metadata.tsv.

EXAMPLE :

get_encode_data.py -a 'mm10' -o '/path/to/encode' -b 'tissue' -t 'P24w'
  • get_transcript.py: Get transcripts and the TSS of each transcript from a GENCODE annotation file for a list of biotypes.

EXAMPLE :

get_transcript.py -a gencode.vM4.annotation.gtf -b protein_coding
  • enhancer_boundries.py: Create a universe of merged peaks that do not overlap a defined region.

EXAMPLE :

enhancer_boundries.py  -t gencode.vM4.annotation_capped_sites.gff -r 2500 -s mm10_chrom.sizes -d 150 -p '/path/to/peaks/*'
  • merge_transcript_expression.py: Make matrix of log2(TPM) of transcripts for all samples.

EXAMPLE :

merge_transcript_expression.py  -e '/path/to/expression/*tsv'
  • merge_enhancer_expression.py: Make matrix of log2(TPM) of enhancers for all samples.

EXAMPLE :

merge_enhancer_expression.py  -e 'merged_enhancers.bed' -t H3K27ac -a '/path/to/alignment/*bam'
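
As an illustration of the TSS step described for get_transcript.py, here is a minimal sketch of how a TSS can be derived from GTF transcript records. It is not the repository's implementation. The function names `parse_attributes` and `transcript_tss` are hypothetical, and the sketch assumes the biotype is stored under the GENCODE attribute key `transcript_type`.

```python
def parse_attributes(field):
    """Parse the GTF attribute column into a dict (first value wins for repeated keys)."""
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs.setdefault(key, value.strip('"'))
    return attrs


def transcript_tss(gtf_lines, biotypes):
    """Yield (chrom, tss, strand, transcript_id) for transcripts of the given biotypes.

    GTF coordinates are 1-based and inclusive, so the TSS is the start
    coordinate for '+' transcripts and the end coordinate for '-' transcripts.
    """
    for line in gtf_lines:
        if line.startswith("#"):
            continue  # skip header/comment lines
        cols = line.rstrip("\n").split("\t")
        if len(cols) < 9 or cols[2] != "transcript":
            continue  # keep only transcript-level records
        attrs = parse_attributes(cols[8])
        # Assumption: GENCODE naming, biotype under 'transcript_type'.
        if attrs.get("transcript_type") not in biotypes:
            continue
        start, end, strand = int(cols[3]), int(cols[4]), cols[6]
        tss = start if strand == "+" else end
        yield cols[0], tss, strand, attrs.get("transcript_id")
```

For example, `list(transcript_tss(open("gencode.vM4.annotation.gtf"), {"protein_coding"}))` would collect one TSS per protein-coding transcript.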

Setup

Run the command below to clone the repository and its submodules.

git clone --recursive https://git.biohpc.swmed.edu/viren.amin/trenco.git

TF binding site (TFBS) prediction

We use FIMO to score motifs against a given sequence. To incorporate the cell-type-specific histone modification signal and sequence conservation, we create a priors.wig file that encodes our prior belief about where TFBSs occur.

Note: When averaging over all BAM files, first remove the problematic ones, for instance samples whose signal over the enhancers shows poor Pearson correlation with the other samples.
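
One way to screen for such problematic samples is sketched below. This is an illustrative approach, not the procedure used in the repository. The function names, the use of the median pairwise correlation, and the `min_median_r` cutoff of 0.8 are all assumptions. The input is assumed to be a per-sample list of enhancer signal values in a shared enhancer order.

```python
from math import sqrt
from statistics import median


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0  # a constant signal carries no correlation information
    return cov / sqrt(vx * vy)


def flag_outlier_samples(signal, min_median_r=0.8):
    """Return samples whose median correlation with all other samples is low.

    signal: dict mapping sample name -> list of per-enhancer values.
    min_median_r: hypothetical cutoff; tune it for the data at hand.
    """
    names = list(signal)
    flagged = []
    for name in names:
        others = [pearson(signal[name], signal[o]) for o in names if o != name]
        # The median is used so that one bad sample does not drag down
        # the score of every good sample.
        if others and median(others) < min_median_r:
            flagged.append(name)
    return flagged
```

Flagged samples can then be left out of the list of BAM files before the average signal is computed.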

  • FIMO usage

    fimo -o <output_dir> --psp <priors.wig> <motifs> <sequences>

    --psp supplies position-specific priors. To create position-specific priors, use create-priors.

  • create-priors usage

    create-priors <fasta> <wiggle>

    The wiggle file contains the tag/read counts. To create a FASTA file from a BED file, use the bedtools getfasta utility.

  • getfasta usage

    bedtools getfasta -fi <fasta> -bed <bed> > fasta_out

  • Genomes are pre-installed on BioHPC. To access them, do module load iGenomes.... and echo $iGENOMES.....

  • To access the UCSC tools on BioHPC, do module load UCSC....

Here are the steps used to find motif occurrences in the provided FASTA sequences:

  1. Compute the read count at each base pair over all replicates
bedtools coverage -a BED_FILE.bed -b BAM_FILES.bam -d > out_sum_depth_each_bp.bed
  2. Generate a bedgraph file with the average signal over the replicates
awk 'BEGIN {OFS="\t"}; {print $1, $2+$4-1, $2+$4, $5/2}' out_sum_depth_each_bp.bed > avg_enh_cov.bg
  3. Convert the bedgraph to wig
perl ../bedgraphToWig.pl --bedgraph avg_enh_cov.bg --wig genome_coverage.wig --step 20 --compact
  4. Generate a smoothed version of the wig file
create-priors --oc heart_P0_priors5 --alpha 1 --beta 130000 --num-bins 20 --parse-genomic-coord enhancers_sort_sequence.fa heart_P0_genome_coverage_enhancers.wig
  5. Run FIMO with position-specific priors
fimo -oc cisBP_sample --bgfile shuffled_enh_background_file --parse-genomic-coord --psp priors.wig --prior-dist priors.dist M1628_1.02.meme ../enhancers_sort_sequence.fa
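
The awk command that averages the signal can be hard to read, so here is a minimal Python sketch of the same arithmetic. The function name `depth_to_bedgraph` is hypothetical. The sketch assumes that the last two columns of each `bedtools coverage -d` row are the 1-based position within the interval and the depth, and that the depth is the sum over replicates (as the file name out_sum_depth_each_bp.bed suggests). The awk command divides by 2, which corresponds to `n_replicates=2` here.

```python
def depth_to_bedgraph(depth_lines, n_replicates):
    """Convert `bedtools coverage -d` rows into averaged bedGraph rows.

    Each input row holds the BED interval (chrom, 0-based start, end, ...),
    followed by the 1-based position within the interval and the depth.
    Yields (chrom, start, end, mean_depth) with one row per base pair.
    """
    for line in depth_lines:
        cols = line.rstrip("\n").split("\t")
        chrom, start = cols[0], int(cols[1])
        pos, depth = int(cols[-2]), int(cols[-1])
        base_start = start + pos - 1  # 0-based start of this single base
        yield chrom, base_start, base_start + 1, depth / n_replicates


def write_bedgraph(rows, handle):
    """Write bedGraph rows as tab-separated text."""
    for chrom, start, end, value in rows:
        handle.write(f"{chrom}\t{start}\t{end}\t{value}\n")
```

Taking the replicate count as a parameter, instead of hard-coding the divisor, keeps the averaging correct when a problematic BAM file is dropped.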

Mouse ENCODE Data

Processed data is located at: /project/bioinformatics/shared/tfgnetwork/encode/mouse/

There are two directories:

  • ChIP-seq - Contains processed ChIP-seq files for 8 histone marks across all 9 tissues at different developmental timepoints. The metadata.tsv file in this directory provides metadata for all samples.

  • RNA-seq - Contains TPM quantifications for all 9 tissues at different developmental timepoints. The metadata.tsv file in this directory provides metadata for all samples.

Note: For ChIP-seq, only the files relevant to the analysis were downloaded.

Repository Data

Intermediate files for the repository are located at: /project/bioinformatics/shared/tfgnetwork/repo_data

The repository data contains motifs (CisBP motifs), mouse enhancers, the enhancer signal matrix across all samples, the gene expression signal matrix across all samples, TAD positions, and enhancer-gene networks for all 66 contexts.