Commit 661f570d authored by Christopher Bennett

Update README.md for v1.0.0

parent b7ecead1
# TF-gene network construction
The goal is to construct mouse cell-type-specific TF-gene networks. There are two parts to the network construction: i) enhancer-gene mapping and ii) TF-enhancer mapping. The mouse ENCODE dataset is used to construct the network.
### Details of Main Programs
- get_encode_data.py: Download data (ChIP-seq,RNA-seq) from ENCODE Portal using metadata.tsv.
EXAMPLE :
```
get_encode_data.py -a 'mm10' -o '/path/to/encode' -b 'tissue' -t 'P24w'
```
- get_transcript.py: Get transcripts and TSS per transcript using Gencode annotations file for a list of biotypes.
EXAMPLE :
```
get_transcript.py -a gencode.vM4.annotation.gtf -b protein_coding
```
- enhancer_boundries.py: Create a universe of merged peaks that do not overlap the defined regions around TSSs.
EXAMPLE :
```
enhancer_boundries.py -t gencode.vM4.annotation_capped_sites.gff -r 2500 -s mm10_chrom.sizes -d 150 -p '/path/to/peaks/*'
```
- merge_transcript_expression.py: Make matrix of log2(TPM) of transcripts for all samples.
EXAMPLE :
```
merge_transcript_expression.py -e '/path/to/expression/*tsv'
```
- merge_enhancer_expression.py: Make matrix of log2(TPM) of enhancers for all samples.
EXAMPLE :
```
merge_enhancer_expression.py -e 'merged_enhancers.bed' -t H3K27ac -a '/path/to/alignment/*bam'
```
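As a rough illustration of what a log2(TPM) matrix looks like, the sketch below merges per-sample quantification tables into one matrix. It is not the code in `merge_transcript_expression.py`: the column names `transcript_id` and `TPM`, the pseudocount of 1, and the treatment of missing transcripts as TPM 0 are all assumptions made for this example.

```python
import csv
import io
import math


def read_tpm(handle, id_col="transcript_id", tpm_col="TPM"):
    """Read one tab-separated quantification table into {transcript_id: TPM}.

    The column names are assumptions for this sketch.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row[id_col]: float(row[tpm_col]) for row in reader}


def merge_log2_tpm(samples, pseudocount=1.0):
    """Return (sample names, {transcript_id: [log2(TPM + pseudocount), ...]}).

    `samples` maps a sample name to its {transcript_id: TPM} dict.
    Transcripts missing from a sample are treated as TPM 0 (an assumption).
    """
    names = sorted(samples)
    ids = sorted(set().union(*(samples[n] for n in names)))
    matrix = {
        tid: [math.log2(samples[n].get(tid, 0.0) + pseudocount) for n in names]
        for tid in ids
    }
    return names, matrix


# Tiny in-memory tables standing in for '/path/to/expression/*tsv'.
heart = io.StringIO("transcript_id\tTPM\nT1\t3.0\nT2\t0.0\n")
liver = io.StringIO("transcript_id\tTPM\nT1\t7.0\nT3\t1.0\n")
names, matrix = merge_log2_tpm(
    {"heart": read_tpm(heart), "liver": read_tpm(liver)}
)
```

A pseudocount keeps zero-TPM transcripts finite after the log transform; whether the actual scripts use one is not stated here.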
### Setup
Run the command below to clone the repository and its submodules.
```
git clone --recursive https://git.biohpc.swmed.edu/viren.amin/trenco.git
```
### TF binding site (TFBS) prediction
We use FIMO to score motifs against a given sequence. To incorporate the cell-type-specific histone modification signal and sequence conservation, we create a priors.wig file that encodes our prior belief about TFBS locations.
Note: When averaging over BAM files, first remove any that are problematic, for instance replicates whose signal over the enhancers correlates poorly (Pearson correlation) with the others.
- [FIMO](http://meme-suite.org/doc/fimo.html) usage
`fimo -o <output_dir> --psp <priors.wig> <motifs> <sequences> `
`--psp` supplies position-specific priors. To create position-specific priors, use `create-priors`.
- [create-priors](http://meme-suite.org/doc/create-priors.html) usage
`create-priors <fasta> <wiggle>`
The wiggle file contains the tag/read counts.
To create a FASTA file from a BED file, use the `bedtools getfasta` utility.
- [getfasta](http://bedtools.readthedocs.io/en/latest/content/tools/getfasta.html) usage
`bedtools getfasta -fi <fasta> -bed <bed> > fasta_out`
- Genomes are pre-installed on BioHPC. To access them, run `module load iGenomes....` and `echo $iGENOMES.....`
- To access the UCSC tools on BioHPC, run `module load UCSC....`
Here are the steps used to obtain motif occurrences in the provided FASTA sequences
1. Create per-base-pair read counts over all the replicates
```
bedtools coverage -a BED_FILE.bed -b BAM_FILES.bam -d > out_sum_depth_each_bp.bed
```
2. Generate a bedgraph file with the average signal over the replicates
```
# $5/2 averages the summed depth, assuming two replicates; adjust the divisor otherwise
awk 'BEGIN {OFS="\t"}; {print $1, $2+$4-1, $2+$4, $5/2}' out_sum_depth_each_bp.bed > avg_enh_cov.bg
```
# Topologically Associated Domain (TAD) aware Regulatory Network Construction (TReNCo)
## About
TReNCo is designed to be an all-in-one transcription factor (TF) gene-by-gene regulatory network builder. TReNCo uses common assays (ChIP-seq and RNA-seq) and applies TAD boundaries as a hard cutoff, instead of a distance-based cutoff, to efficiently create context-specific TF-gene regulatory networks. There are three parts to the network construction: i) enhancer-gene mapping, ii) TF-enhancer mapping, and iii) TAD enhancer-gene weighting.
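To make the idea of a TAD boundary as a hard cutoff concrete, here is a minimal sketch that pairs an enhancer only with TSSs located in the same TAD, regardless of linear distance. It is an illustration under stated assumptions, not TReNCo's implementation: the `Interval` type, the function names, the use of the enhancer midpoint, and the toy coordinates are all hypothetical.

```python
from collections import namedtuple

# Hypothetical container for a named genomic interval (0-based, half-open).
Interval = namedtuple("Interval", "chrom start end name")


def tad_of(chrom, pos, tads):
    """Return the TAD interval containing `pos`, or None if there is none."""
    for tad in tads:
        if tad.chrom == chrom and tad.start <= pos < tad.end:
            return tad
    return None


def pair_within_tads(enhancers, tss_sites, tads):
    """Pair each enhancer with every TSS that lies in the same TAD."""
    pairs = []
    for enh in enhancers:
        midpoint = (enh.start + enh.end) // 2
        enh_tad = tad_of(enh.chrom, midpoint, tads)
        if enh_tad is None:
            continue  # enhancer outside any annotated TAD: no candidates
        for tss in tss_sites:
            if tad_of(tss.chrom, tss.start, tads) == enh_tad:
                pairs.append((enh.name, tss.name))
    return pairs


tads = [Interval("chr1", 0, 1000, "tad1"), Interval("chr1", 1000, 2000, "tad2")]
enhancers = [Interval("chr1", 900, 960, "enhA")]
tss_sites = [
    Interval("chr1", 100, 101, "geneX"),    # far away, same TAD as enhA
    Interval("chr1", 1010, 1011, "geneY"),  # close by, but across the boundary
]
pairs = pair_within_tads(enhancers, tss_sites, tads)
```

In this toy case a distance-based rule would favor `geneY`, while the TAD cutoff keeps only `geneX`.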
## Release
May 1, 2021 - v1.0.0 - Initial release of core TReNCo code
## Installation
### Requirements
- python >= 3.7
- conda >= 4.7
- bcftools >= 1.4
- samtools >= 1.4
- bedops >= 2.4
- bedtools >= 2.29
- meme >= 5.0
### Basic Steps on Linux
Clone the TReNCo repo locally and add it to your PATH
(assuming a .bashrc file and the repo located in the ~/ home directory on Linux)
```
git clone https://git.biohpc.swmed.edu/BICF/Software/trenco.git
echo 'export PATH=~/trenco:$PATH' >> ~/.bashrc
echo 'export PYTHONPATH=~/trenco_modules:$PYTHONPATH' >> ~/.bashrc
```
Create and activate the TReNCo conda environment
```
conda env create -f trenco_env.yml
conda activate trenco_env
```
## Usage
```
trenco --design [DESIGN FILE (txt)] --alignment [ALIGNMENT FILES (bam)] --expression [EXPRESSION FILES (tsv)] --peaks [PEAK FILES (bed)] -g [GENOME VERSION] [OPTIONS]
```
TAD aware Regulatory Network Construction (TReNCo)
### Main Arguments
```
-h, --help |show this help message and exit.
------------------------------------------|----------------------------------------------------------
|
--design DESIGN |Design file containing link information to samples.
------------------------------------------|----------------------------------------------------------
|
--alignment ALIGNMENT [ALIGNMENT ...] |Full path to ChIP alignment files in bam format
------------------------------------------|----------------------------------------------------------
|
--expression EXPRESSION [EXPRESSION ...]|Full path to transcript expression table in tsv format
------------------------------------------|----------------------------------------------------------
|
--enhMarks TARGET |Mark for enhancers: Default H3K27ac
------------------------------------------|----------------------------------------------------------
|
--tadBED TADBED |TAD file: Default - mm10 TAD in common_files
------------------------------------------|----------------------------------------------------------
|
-p, --peaks PEAKS [PEAKS ...] |The full path to peak files in bed format
------------------------------------------|----------------------------------------------------------
|
-g, --genome-version GVERS |Version of genome to use. Default is newest
|
```
### Optional Arguments:
```
-r, --region REGION |The number of base pairs to exclude around TSS
| (Default: 2500)
------------------------------------------|----------------------------------------------------------
|
-q, --promoter-range PROMOTER_RANGE |Range of numbers before TSS and after TSS to consider as Promoter
| (Default: 1000-200)
------------------------------------------|----------------------------------------------------------
|
-d, --distance DISTANCE |The number of base pairs to merge between adjacent peaks
| (Default: 150)
------------------------------------------|----------------------------------------------------------
|
--annotation-file ANNOTFNAME |Gencode annotations file in gtf format
| (overwrites --annotation-version and --organism)
------------------------------------------|----------------------------------------------------------
|
--annotation-version ANNOTATIONS |The Gencode annotation version to use. (Default: vM4)
| (WARNING: All entries are indexed to this version)
------------------------------------------|----------------------------------------------------------
|
--organism ORGANISM |Organism gencode to download (Default: mouse)
------------------------------------------|----------------------------------------------------------
|
-b, --biotypes BIOTYPES [BIOTYPES ...] |The biotypes to get transcript TSS. (default: protein)
------------------------------------------|----------------------------------------------------------
|
--meme-db MEME_DB |MEME database to use (Default: cis-bp)
------------------------------------------|----------------------------------------------------------
|
--db DB |Motif database name if different from species
| (ex JASPER CORE 2014 for jasper)
------------------------------------------|----------------------------------------------------------
|
--bed BED |ChIP and Promoter bed file for getting motifs
| (ex enh.bed,promoter.bed)
```
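The `-d/--distance` and `-r/--region` options describe two interval operations: merging adjacent peaks and excluding a window around each TSS. The sketch below illustrates both on plain tuples. It is an assumption-laden illustration, not TReNCo's code: the function names, the "gap at most `distance`" merge rule, the symmetric exclusion window, and the order of the two operations are choices made for this example only.

```python
def merge_peaks(peaks, distance=150):
    """Merge (chrom, start, end) peaks whose gap is at most `distance` bp."""
    merged = []
    for chrom, start, end in sorted(peaks):
        last = merged[-1] if merged else None
        if last and last[0] == chrom and start - last[2] <= distance:
            last[2] = max(last[2], end)  # extend the previous merged peak
        else:
            merged.append([chrom, start, end])
    return [tuple(p) for p in merged]


def exclude_near_tss(peaks, tss_sites, region=2500):
    """Drop peaks overlapping [tss - region, tss + region) on the same chromosome."""
    kept = []
    for chrom, start, end in peaks:
        near = any(
            c == chrom and start < pos + region and end > pos - region
            for c, pos in tss_sites
        )
        if not near:
            kept.append((chrom, start, end))
    return kept


peaks = [
    ("chr1", 100, 200),
    ("chr1", 300, 400),    # 100 bp after the first peak: merged with it
    ("chr1", 1000, 1100),  # 600 bp after the merged peak: kept separate
    ("chr1", 9000, 9100),
]
merged = merge_peaks(peaks, distance=150)
enhancer_candidates = exclude_near_tss(merged, [("chr1", 2000)], region=2500)
```

With the defaults shown in the help text (150 and 2500), only the peak at 9000-9100 survives here, because the other two merged peaks fall inside the window around the TSS at position 2000.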
3. Convert bedgraph to wig
```
perl ../bedgraphToWig.pl --bedgraph avg_enh_cov.bg --wig genome_coverage.wig --step 20 --compact
```
4. Generate smoothed version of wig file
```
create-priors --oc heart_P0_priors5 --alpha 1 --beta 130000 --num-bins 20 --parse-genomic-coord enhancers_sort_sequence.fa heart_P0_genome_coverage_enhancers.wig
```
5. Run FIMO with position specific priors
```
fimo -oc cisBP_sample --bgfile shuffled_enh_background_file --parse-genomic-coord --psp priors.wig --prior-dist priors.dist M1628_1.02.meme ../enhancers_sort_sequence.fa
```
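For readers who prefer Python to awk, step 2 can be sketched as follows. The sketch assumes the per-base table from step 1 has five tab-separated columns (chrom, start, end, 1-based offset within the interval, depth), which is what the column arithmetic in the awk command implies for a three-column BED input. The function name and the `n_replicates` parameter are hypothetical; the awk command hard-codes a divisor of 2.

```python
import io


def per_base_to_bedgraph(lines, n_replicates):
    """Yield bedgraph rows (chrom, start, end, mean depth) from per-base rows.

    Each input row is: chrom, interval start, interval end, 1-based offset, depth.
    """
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 5:
            continue  # skip blank or malformed rows
        chrom = fields[0]
        start = int(fields[1])
        offset = int(fields[3])
        depth = float(fields[4])
        pos = start + offset - 1  # 0-based position of this base
        yield chrom, pos, pos + 1, depth / n_replicates


# Two per-base rows from a summed coverage table of two replicates.
table = io.StringIO("chr1\t100\t103\t1\t4\nchr1\t100\t103\t2\t6\n")
rows = list(per_base_to_bedgraph(table, n_replicates=2))
```

Passing the replicate count explicitly avoids silently halving the signal when a different number of BAM files went into step 1.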
### Mouse ENCODE Data
Processed data is located at: `/project/bioinformatics/shared/tfgnetwork/encode/mouse/`
There are two directories:
- `ChIP-seq` - Contains processed ChIP-seq files for 8 histone marks and all 9 tissues at different developmental timepoints. The `metadata.tsv` file within the directory provides metadata for all the samples.
- `RNA-seq` - Contains TPM quantifications for all 9 tissues at different developmental timepoints. The `metadata.tsv` file within the directory provides metadata for all the samples.
Note: For ChIP-seq, only relevant files were downloaded for the analysis.
### Repository Data
Intermediary files for the repository are located at:
`/project/bioinformatics/shared/tfgnetwork/repo_data`
Repository data contains motifs (CisBP motifs), mouse enhancers, the enhancer signal matrix across all samples, the gene expression signal matrix across all samples, TAD positions, and enhancer-gene networks for all 66 contexts.