Forked from Venkat Malladi / TFSEE
2 commits ahead of the upstream repository.

Total Functional Score of Enhancer Elements Identifies Lineage-Specific Enhancers that Drive Differentiation of Pancreatic Cells

This directory contains the scripts that use TFSEE to identify the TFs that maintain multipotency of endodermal stem cells during differentiation into pancreatic lineages.

Dependencies for TFSEE

This code requires Python 2.7+ to run.

The Python scripts require the following Python packages:

  • biopython-1.70
  • pandas-0.20.1
  • numpy-1.12.1
  • scikit-learn-0.18.1
  • matplotlib-2.0.2
  • seaborn-0.8.1
  • scipy-0.19.0

Install the dependencies.

pip install -r requirements.txt
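After installation it can be useful to confirm that the environment matches the pinned versions. The following is a minimal sketch, not part of the repository: the helper names `parse_freeze` and `find_mismatches` are hypothetical, and the version numbers are copied from the dependency list above. It assumes the input text has the `name==version` shape printed by `pip freeze`.

```python
# Pinned versions copied from the dependency list in this README.
PINNED = {
    "biopython": "1.70",
    "pandas": "0.20.1",
    "numpy": "1.12.1",
    "scikit-learn": "0.18.1",
    "matplotlib": "2.0.2",
    "seaborn": "0.8.1",
    "scipy": "0.19.0",
}


def parse_freeze(text):
    """Parse 'name==version' lines (the shape of `pip freeze` output) into a dict."""
    installed = {}
    for line in text.splitlines():
        line = line.strip()
        if "==" not in line:
            continue  # skip blank lines, comments, and editable installs
        name, version = line.split("==", 1)
        installed[name.lower()] = version
    return installed


def find_mismatches(installed, pinned=PINNED):
    """Return {package: (expected, found)}; found is None when the package is missing."""
    return {
        name: (expected, installed.get(name))
        for name, expected in pinned.items()
        if installed.get(name) != expected
    }
```

For example, passing the captured output of `pip freeze` through `parse_freeze` and then `find_mismatches` yields an empty dict when every pinned package is present at the listed version.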

Pipeline Description

Pre-processing Steps
  1. De novo identification of enhancers using GRO-seq and groHMM or ChIP-seq (H3K4me1 and H3K27ac)

  2. Normalize Enhancer Expression using GRO-seq: For each cell line, quantify the GRO-seq reads (as RPKM) that fall within a 1 kb region around the center of the overlap for paired enhancer transcripts, or from the 5′ end of unpaired enhancer transcripts

  3. Normalize Enhancer Expression using ChIP-seq: For each cell line, quantify the ChIP-seq reads (as RPKM) from H3K4me1, H3K27ac, and input for each enhancer within the universe of GRO-seq-defined enhancers

  4. Motif Predictions: Run de novo motif analyses with MEME on a 1 kb region of expressed enhancers for each cell line, then match the discovered motifs to known motifs using TOMTOM and JASPAR

  5. Normalize Transcription Factor Expression using RNA-seq: For each cell line, quantify the RNA-seq reads (as FPKM) for each transcription factor that is a binding target for the motifs

  • RNA-seq analysis: RNA-seq_star.sh
  • FPKM processing RNA-seq: rnaseq_processing.sh
  6. Calculate the TFSEE score to determine cell-type-specific enhancer activity, generating:
  • unsupervised hierarchical clustering
  • tSNE representation
  • boxplot representations
  • rank order TF plots
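The RPKM normalization in steps 2 and 3 can be sketched as follows. This is an illustration using the standard RPKM definition (reads per kilobase of region per million mapped reads), not the code used by the pipeline scripts. The function names `centered_window` and `rpkm` are hypothetical, and the 0-based, half-open coordinate convention is an assumption.

```python
def centered_window(center, width=1000):
    """Return (start, end) for a window of `width` bp centered on `center`.

    Coordinates are assumed to be 0-based and half-open; the start is
    clamped at 0 so windows near a chromosome start are simply shorter.
    """
    half = width // 2
    return max(0, center - half), center + half


def rpkm(read_count, region_length_bp, total_mapped_reads):
    """Reads per kilobase of region per million mapped reads."""
    if region_length_bp <= 0 or total_mapped_reads <= 0:
        raise ValueError("region length and total mapped reads must be positive")
    return read_count * 1e9 / (float(region_length_bp) * total_mapped_reads)


# Example: 50 reads in a 1 kb enhancer window, library of 10 million mapped reads.
start, end = centered_window(5000)
value = rpkm(50, end - start, 10000000)
```

FPKM for RNA-seq (step 5) follows the same scaling but counts fragments rather than individual reads.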

Data Source

All data are available from NCBI’s Gene Expression Omnibus [@url:https://www.ncbi.nlm.nih.gov/geo/] or EMBL-EBI’s ArrayExpress [@url:http://www.ebi.ac.uk/arrayexpress/] repositories under the accession numbers listed below:

Assay | Accessions
--- | ---
GRO-seq | GSM1316306, GSM1316313, GSM1316320, GSM1316327, GSM1316334
H3K4me3 ChIP-seq | ERR208008, ERR208014, ERR207998, ERR20798, ERR207999
H3K4me1 ChIP-seq | GSM1316302, GSM1316303, GSM1316309, GSM1316316, GSM1316317, GSM1316310, GSM1316323, GSM1316324, GSM1316330, GSM1316331
H3K27ac ChIP-seq | GSM1316300, GSM1316301, GSM1316307, GSM1316308, GSM1316314, GSM1316315, GSM1316321, GSM1316322, GSM1316328, GSM1316329
Input ChIP-seq | ERR208001, ERR208012, ERR207984, ERR208011, ERR207986, GSM1316304, GSM1316305, GSM1316311, GSM1316312, GSM1316318, GSM1316319, GSM1316325, GSM1316326, GSM1316332, GSM1316333
RNA-seq | ERR266333, ERR266335, ERR266337, ERR266338, ERR266341, ERR266342, ERR266344, ERR266346, ERR266349, ERR266351
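Because the accessions come from two repositories, it can help to sort them by source before downloading. The sketch below is an illustration only: the function name `group_accessions` is hypothetical, and the patterns assume that GEO sample accessions are `GSM` followed by digits and that EMBL-EBI run accessions are `ERR` followed by at least six digits. Anything that matches neither pattern is set aside for manual checking rather than guessed at.

```python
import re
from collections import defaultdict

# Assumed accession shapes (not taken from the repository's scripts).
PATTERNS = {
    "GEO": re.compile(r"^GSM\d+$"),
    "EMBL-EBI": re.compile(r"^ERR\d{6,}$"),
}


def group_accessions(accessions):
    """Group accession strings by repository; unmatched ones go to 'unrecognized'."""
    groups = defaultdict(list)
    for accession in accessions:
        for repository, pattern in PATTERNS.items():
            if pattern.match(accession):
                groups[repository].append(accession)
                break
        else:
            # No pattern matched: keep the accession visible for manual review.
            groups["unrecognized"].append(accession)
    return dict(groups)


groups = group_accessions(["GSM1316306", "ERR208008", "ERR20798"])
```

Under these assumed patterns, the five-digit entry `ERR20798` from the H3K4me3 row lands in the `unrecognized` group and may be worth verifying against the repository.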

Main Scripts

  • The scripts that compute TFSEE to identify cognate transcription factors are under 'analysis'
  • Applicable to either enhancer method:
    • Get H3K4me3 peaks: h3k4me3_processing.sh
    • Get H3K27ac peaks: h3k27ac_processing.sh
    • Get H3K4me1 peaks: h3k4me1_processing.sh
    • Exclude regions based on H3K4me3 and promoters: excluded_regions_processing.sh
    • RNA-seq analysis: RNA-seq_star.sh
    • FPKM processing RNA-seq: rnaseq_processing.sh
  • TFSEE using GRO-seq:
    • Tune GroHMM: tune-hmm.sh
    • Call Transcripts: call-transcripts.sh
    • Make universe of Enhancers: groseq_processing.sh
    • GRO_seq_TFSEE:
      • TFSEE pre-processing: tfsee_processing.sh
      • TFSEE score integration: matrix_analysis.py
      • Rank order TF clusters: rank_order.py
  • TFSEE using histone modifications ChIP-seq:
    • Make universe of Enhancers: histone_centered_processing.sh
    • Histone_TFSEE:
      • TFSEE pre-processing: tfsee_processing.sh
      • TFSEE score integration: matrix_analysis.py
      • Rank order TF clusters: rank_order.py
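This README does not spell out how the score integration step combines its inputs, so the following is only an illustrative assumption, not a description of matrix_analysis.py. It assumes three normalized matrices: enhancer activity (cell types × enhancers), motif predictions (enhancers × TFs), and TF expression (cell types × TFs). It combines them by multiplying activity with motif predictions and then weighting each entry by TF expression. The function names `matmul` and `tfsee_sketch` are hypothetical, and plain lists are used to keep the sketch dependency-free.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    columns = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in columns] for row in a]


def tfsee_sketch(activity, motif, expression):
    """Assumed integration: (activity x motif), weighted element-wise by expression.

    activity:   cell types x enhancers
    motif:      enhancers  x TFs
    expression: cell types x TFs
    Returns a cell types x TFs score matrix.
    """
    combined = matmul(activity, motif)
    return [
        [score * expr for score, expr in zip(score_row, expr_row)]
        for score_row, expr_row in zip(combined, expression)
    ]


# Toy input: 2 cell types, 2 enhancers, 2 TFs.
activity = [[1.0, 2.0], [0.0, 1.0]]
motif = [[1.0, 0.0], [0.5, 1.0]]
expression = [[3.0, 0.0], [2.0, 4.0]]
scores = tfsee_sketch(activity, motif, expression)
```

In this toy input, a TF with zero expression in a cell type receives a zero score there regardless of enhancer activity, which is the intuition behind weighting by expression. Consult matrix_analysis.py for the actual normalization and integration used by the pipeline.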