Total Functional Score of Enhancer Elements Identifies Lineage-Specific Enhancers that Drive Differentiation of Pancreatic Cells
This directory contains the scripts that use TFSEE to identify TFs maintaining the multipotency of endodermal stem cells during differentiation into pancreatic lineages.
Dependencies for TFSEE
This code requires Python 2.7+ to run.
The Python scripts require the following Python packages:
- biopython-1.70
- pandas-0.20.1
- numpy-1.12.1
- scikit-learn-0.18.1
- matplotlib-2.0.2
- seaborn-0.8.1
- scipy-0.19.0
Install the dependencies.
pip install -r requirements.txt
Pipeline Description
Pre-processing Steps
- De novo identification of enhancers using GRO-seq and groHMM or ChIP-seq (H3K4me1 and H3K27ac)
- Normalize Enhancer Expression using GRO-seq: For each cell line, quantify the GRO-seq reads, RPKM, that fall within a 1 kb region around the center of the overlap for paired enhancer transcripts or from the 5′ end of unpaired enhancer transcripts
- Normalize Enhancer Expression using ChIP-seq: For each cell line, quantify the ChIP-seq reads, RPKM, from H3K4me1, H3K27ac, and input for each enhancer within the universe of GRO-seq-defined enhancers
- Motif Predictions: De novo motif analyses on a 1 kb region of expressed enhancers for each cell line using MEME and matched to known motifs using TOMTOM and JASPAR
- Normalize Transcription Factor Expression using RNA-seq: For each cell line, quantify the RNA-seq reads, FPKM, for each transcription factor that is a binding target for the motifs
- RNA-seq analysis: RNA-seq_star.sh
- FPKM processing RNA-seq: rnaseq_processing.sh
- Calculate TFSEE score to determine cell-type-specific enhancer activity, generating:
- unsupervised hierarchical clustering
- tSNE representation
- boxplot representations
- rank order TF plots
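The RPKM quantification used in the enhancer normalization steps above can be sketched as follows. This is a minimal illustration of the standard RPKM formula (reads per kilobase of region per million mapped reads), not the code in this repository; the function names `rpkm` and `centered_window` and the example numbers are hypothetical.

```python
def centered_window(center, width=1000):
    """Return (start, end) of a window of `width` bp centered on `center`.

    Hypothetical helper: the start is clamped at 0 so windows near the
    beginning of a chromosome stay valid.
    """
    half = width // 2
    return max(0, center - half), center + half


def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and library size must be positive")
    kilobases = region_length_bp / 1e3
    millions = total_mapped_reads / 1e6
    return read_count / kilobases / millions


# Made-up example: 50 reads in a 1 kb enhancer window,
# from a library with 20 million mapped reads.
start, end = centered_window(10500)
example_rpkm = rpkm(50, end - start, 20000000)
```

Because the window is fixed at 1 kb, the length term is constant across enhancers, so the values differ from counts per million only by that constant factor.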
Data Source
All data are available from NCBI’s Gene Expression Omnibus [@url:https://www.ncbi.nlm.nih.gov/geo/] or EMBL-EBI’s ArrayExpress [@url:http://www.ebi.ac.uk/arrayexpress/] repositories using the accession numbers listed:
Assay | Accessions |
---|---|
GRO-seq | GSM1316306, GSM1316313, GSM1316320, GSM1316327, GSM1316334 |
H3K4me3 ChIP-seq | ERR208008, ERR208014, ERR207998, ERR20798, ERR207999 |
H3K4me1 ChIP-seq | GSM1316302, GSM1316303, GSM1316309, GSM1316316, GSM1316317, GSM1316310, GSM1316323, GSM1316324, GSM1316330, GSM1316331 |
H3K27ac ChIP-seq | GSM1316300, GSM1316301, GSM1316307, GSM1316308, GSM1316314, GSM1316315, GSM1316321, GSM1316322, GSM1316328, GSM1316329 |
Input ChIP-seq | ERR208001, ERR208012, ERR207984, ERR208011, ERR207986, GSM1316304, GSM1316305, GSM1316311, GSM1316312, GSM1316318, GSM1316319, GSM1316325, GSM1316326, GSM1316332, GSM1316333 |
RNA-seq | ERR266333, ERR266335, ERR266337, ERR266338, ERR266341, ERR266342, ERR266344, ERR266346, ERR266349, ERR266351 |
Main Scripts
- Scripts that compute TFSEE to identify cognate transcription factors are under 'analysis'
- Applicable to either enhancer method:
- Get H3K4me3 peaks: h3k4me3_processing.sh
- Get H3K27ac peaks: h3k27ac_processing.sh
- Get H3K4me1 peaks: h3k4me1_processing.sh
- Exclude regions based on H3K4me3 and promoters: excluded_regions_processing.sh
- RNA-seq analysis: RNA-seq_star.sh
- FPKM processing RNA-seq: rnaseq_processing.sh
- TFSEE using GRO-seq:
- Tune GroHMM: tune-hmm.sh
- Call Transcripts: call-transcripts.sh
- Make universe of Enhancers: groseq_processing.sh
- GRO_seq_TFSEE:
- TFSEE pre-processing: tfsee_processing.sh
- TFSEE score integration: matrix_analysis.py
- Rank order TF clusters: rank_order.py
- TFSEE using histone modifications ChIP-seq:
- Make universe of Enhancers: histone_centered_processing.sh
- Histone_TFSEE:
- TFSEE pre-processing: tfsee_processing.sh
- TFSEE score integration: matrix_analysis.py
- Rank order TF clusters: rank_order.py
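As a rough orientation for the score integration and rank ordering steps, the sketch below shows one way the three normalized inputs could be combined. It assumes that the score for a cell type and TF is the enhancer activity summed over enhancers weighted by motif strength, then multiplied by that TF's expression. This is an assumption made for illustration; matrix_analysis.py and rank_order.py remain the authoritative implementation and may scale or transform the inputs differently. The names `tfsee_scores` and `rank_tfs` and all toy values are hypothetical.

```python
import numpy as np


def tfsee_scores(enhancer_activity, motif_strength, tf_expression):
    """Combine three matrices into a (cell types x TFs) score matrix.

    enhancer_activity: cell types x enhancers
    motif_strength:    enhancers x TFs
    tf_expression:     cell types x TFs
    Assumed combination rule (see lead-in): (activity . motifs) * expression.
    """
    activity = np.asarray(enhancer_activity, dtype=float)
    motifs = np.asarray(motif_strength, dtype=float)
    expression = np.asarray(tf_expression, dtype=float)
    if activity.shape[1] != motifs.shape[0]:
        raise ValueError("enhancer dimensions do not match")
    if expression.shape != (activity.shape[0], motifs.shape[1]):
        raise ValueError("expression must be cell types x TFs")
    # Matrix product sums over enhancers; the final step is element-wise.
    return activity.dot(motifs) * expression


def rank_tfs(scores, tf_names, cell_index):
    """Return TF names for one cell type, highest score first."""
    order = np.argsort(scores[cell_index])[::-1]
    return [tf_names[i] for i in order]


# Toy data: 2 cell types, 2 enhancers, 3 TFs (all values made up).
activity = [[1, 0], [0, 2]]
motifs = [[1, 0, 0.5], [0, 1, 0.5]]
expression = [[2, 1, 1], [1, 3, 2]]
tf_names = ["TF_A", "TF_B", "TF_C"]

scores = tfsee_scores(activity, motifs, expression)
ranked_cell0 = rank_tfs(scores, tf_names, 0)
```

The resulting score matrix is the kind of input that clustering, tSNE, and rank-order plots would operate on.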