# **ChIP-seq Manual**
## Version 1.1.2
## June 21, 2020

# BICF ChIP-seq Pipeline

|*master*|*dev*|
|:-:|:-:|
|[![pipeline status](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/badges/master/pipeline.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/commits/master)|[![pipeline status](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/badges/dev/pipeline.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/commits/dev)|
|[![coverage report](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/badges/master/coverage.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/commits/master)|[![coverage report](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/badges/dev/coverage.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/commits/dev)|

[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A50.31.0-brightgreen)](https://www.nextflow.io/)
[![Astrocyte](https://img.shields.io/badge/astrocyte-%E2%89%A50.3.1-blue)](https://astrocyte-test.biohpc.swmed.edu/static/docs/index.html)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.2648844.svg)](https://doi.org/10.5281/zenodo.2648844)


## Introduction
BICF ChIP-seq is a best-practice bioinformatics analysis pipeline used for ChIP-seq (chromatin immunoprecipitation sequencing) data analysis at the [BICF](http://www.utsouthwestern.edu/labs/bioinformatics/) in the [UT Southwestern Department of Bioinformatics](http://www.utsouthwestern.edu/departments/bioinformatics/).

The pipeline uses [Nextflow](https://www.nextflow.io), a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads, and performs extensive quality control on the results.

This pipeline is primarily used with a SLURM cluster on the [BioHPC Cluster](https://portal.biohpc.swmed.edu/content/). However, the pipeline should be able to run on any system that supports Nextflow.

Additionally, the pipeline is designed to work with [Astrocyte Workflow System](https://astrocyte.biohpc.swmed.edu/static/docs/index.html) using a simple web interface.

The current version of the software and the issue tracker are at
https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis

To download the current version of the software:
```bash
$ git clone git@git.biohpc.swmed.edu:BICF/Astrocyte/chipseq_analysis.git
```

## Input files
##### 1) Fastq Files
  + You will need the full path to the files for the Bash Script

##### 2) Design File
  + The Design file is a tab-delimited file with 8 columns for Single-End and 9 columns for Paired-End. Letters, numbers, and underscores can be used in the names. However, the names can only begin with a letter. Columns must be as follows:
      1. sample_id: a short, unique, and concise name used to label output files; will be used as a control_id if it is the control sample
      2. experiment_id: biosample_treatment_factor; same name given for all replicates of treatment. Will be used for the consensus header.
      3. biosample: symbol for tissue type or cell line
      4. factor: symbol for antibody target
      5. treatment: symbol of treatment applied
      6. replicate: a number, usually from 1-3 (e.g., 1)
      7. control_id: sample_id name that is the control for this sample
      8. fastq_read1: name of fastq file 1 for SE or PE data
      9. fastq_read2: name of fastq file 2 for PE data


  + See [HERE](test_data/design_ENCSR729LGA_PE.txt) for an example design file, paired-end
  + See [HERE](test_data/design_ENCSR238SGC_SE.txt) for an example design file, single-end
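
The column rules above can be checked before a run. The following is a minimal Python sketch, not part of the pipeline. It assumes the design file has a header row holding the column names in the listed order, that the naming rule applies to sample_id, experiment_id, and control_id, and that a control sample lists its own sample_id as its control_id. The function name `check_design` and the sample values are hypothetical.

```python
import csv
import io
import re

# Column names in the order listed above; fastq_read2 is paired-end only.
SE_COLUMNS = [
    "sample_id", "experiment_id", "biosample", "factor",
    "treatment", "replicate", "control_id", "fastq_read1",
]
PE_COLUMNS = SE_COLUMNS + ["fastq_read2"]

# Letters, numbers, and underscores; the first character must be a letter.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9_]*")

# Assumption: the naming rule applies to these three columns.
NAME_COLUMNS = ("sample_id", "experiment_id", "control_id")


def check_design(handle, paired_end=False):
    """Return a list of problems found in a tab-delimited design file.

    Assumption: the first row is a header with the column names.
    """
    expected = PE_COLUMNS if paired_end else SE_COLUMNS
    rows = list(csv.reader(handle, delimiter="\t"))
    if not rows or rows[0] != expected:
        return ["header should be: " + ", ".join(expected)]

    problems = []
    records = []
    for number, fields in enumerate(rows[1:], start=2):
        if len(fields) != len(expected):
            problems.append(f"line {number}: expected {len(expected)} columns")
            continue
        record = dict(zip(expected, fields))
        records.append((number, record))
        for column in NAME_COLUMNS:
            if not NAME_PATTERN.fullmatch(record[column]):
                problems.append(f"line {number}: invalid {column} '{record[column]}'")
        if not record["replicate"].isdigit():
            problems.append(f"line {number}: replicate must be a number")

    # Every control_id must name a sample_id that appears in the file.
    sample_ids = {record["sample_id"] for _, record in records}
    for number, record in records:
        if record["control_id"] not in sample_ids:
            problems.append(f"line {number}: unknown control_id '{record['control_id']}'")
    return problems


# Hypothetical single-end design with one sample and its control.
example = "\n".join([
    "\t".join(SE_COLUMNS),
    "\t".join(["A_1", "K562_None_H3K27ac", "K562", "H3K27ac", "None", "1", "A_input", "A_1.fastq.gz"]),
    "\t".join(["A_input", "K562_None_Input", "K562", "Input", "None", "1", "A_input", "A_input.fastq.gz"]),
])
print(check_design(io.StringIO(example)))
```

An empty list means no problems were found. For a file on disk, pass an open file handle instead of the `io.StringIO` wrapper, and set `paired_end=True` for a 9-column design.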

##### 3) Bash Script
  + You will need to create a bash script to run the ChIP-seq pipeline on [BioHPC](https://portal.biohpc.swmed.edu/content/)
  + This pipeline has been optimized for the correct partition
  + See [HERE](docs/CHIPseq.sh) for an example bash script
  + The parameters that must be specified are:
      - --reads '/path/to/files/name.fastq.gz'
      - --designFile '/path/to/file/design.txt'
      - --genome 'GRCm38', 'GRCh38', or 'GRCh37' (if you need to use another genome contact the [BICF](mailto:BICF@UTSouthwestern.edu))
      - --pairedEnd 'true' or 'false' (where 'true' is PE and 'false' is SE; default 'false')
      - --skipDiff 'true' or 'false' (where 'true' skips differential peak calling and 'false' performs it; default 'false')
      - --skipMotif 'true' or 'false' (where 'true' skips motif calling and 'false' performs it; default 'false')
      - --skipPlotProfile 'true' or 'false' (where 'true' skips the metagene plot for TSS and 'false' creates it; default 'false')
      - --outDir (optional) path and folder name of the output data, example: /home2/s000000/Desktop/Chipseq_output (if not specified will be under workflow/output/)
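
To show how these parameters fit together on one command line, here is a minimal Python sketch that assembles a `nextflow run` invocation. It is an illustration only, not the contents of docs/CHIPseq.sh: the entry-point path `workflow/main.nf` and the function name `build_command` are assumptions, and any cluster-specific setup (module loading, SLURM options) is left out.

```python
import shlex

# Genomes listed above as supported values for --genome.
SUPPORTED_GENOMES = ("GRCm38", "GRCh38", "GRCh37")


def build_command(reads, design_file, genome, paired_end=False,
                  skip_diff=False, skip_motif=False, skip_plot_profile=False,
                  out_dir=None, script="workflow/main.nf"):
    """Assemble a `nextflow run` argument list from the parameters above.

    `script` is a hypothetical path to the pipeline entry point.
    """
    if genome not in SUPPORTED_GENOMES:
        raise ValueError(f"unsupported genome: {genome}")

    def flag(value):
        # The options are written as the strings 'true' and 'false'.
        return "true" if value else "false"

    command = [
        "nextflow", "run", script,
        "--reads", reads,
        "--designFile", design_file,
        "--genome", genome,
        "--pairedEnd", flag(paired_end),
        "--skipDiff", flag(skip_diff),
        "--skipMotif", flag(skip_motif),
        "--skipPlotProfile", flag(skip_plot_profile),
    ]
    # --outDir is optional; when omitted the output goes under workflow/output/.
    if out_dir is not None:
        command += ["--outDir", out_dir]
    return command


command = build_command(
    reads="/path/to/files/name.fastq.gz",
    design_file="/path/to/file/design.txt",
    genome="GRCh38",
    paired_end=True,
)
# Quote each argument so the line can be pasted into a bash script.
print(" ".join(shlex.quote(argument) for argument in command))
```

Building the command as a list keeps each value as a single argument, so paths containing spaces survive quoting when the line is copied into a bash script.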

## Pipeline
  + There are 12 steps to the pipeline
    1. Check input files
    2. Trim adaptors with TrimGalore!
    3. Align trimmed reads with bwa, then sort and convert to BAM with samtools
    4. Mark duplicates with Sambamba, and filter reads with samtools
    5. Quality metrics with deepTools
    6. Calculate cross-correlation using PhantomPeakQualTools
    7. Call peaks with MACS
    8. Calculate consensus peaks
    9. Annotate all peaks using ChIPseeker
    10. Calculate Differential Binding Activity with DiffBind (If more than 1 rep in more than 1 experiment)
    11. Use MEME-ChIP to find motifs in original peaks
    12. Plot enrichment of signal around TSS

See [FLOWCHART](docs/flowchart.pdf)

## Output Files
Folder | File | Description
--- | --- | ---
design | N/A | Inputs used for analysis; can ignore
trimReads | *_trimming_report.txt | report detailing how many reads were trimmed
trimReads | *_trimmed.fq.gz | trimmed fastq files used for analysis
alignReads | *.srt.bam.flagstat.qc | QC metrics from the mapping process
alignReads | *.srt.bam | sorted bam file
filterReads | *.dup.qc | QC metrics from marking duplicate reads (sambamba)
filterReads | *.filt.nodup.bam | filtered bam file with duplicate reads removed
filterReads | *.filt.nodup.bam.bai | indexed filtered bam file
filterReads | *.filt.nodup.flagstat.qc | QC metrics of filtered bam file (mapping stats, samtools)
filterReads | *.filt.nodup.pbc.qc | QC metrics of library complexity
convertReads | *.filt.nodup.bedse.gz | bed alignment in BEDPE format
convertReads | *.filt.nodup.tagAlign.gz | bed alignment in BEDPE format, same as bedse unless samples are paired-end
multiqcReport | multiqc_report.html | Quality control report of NRF, PBC1, PBC2, NSC, and RSC. Also contains software versions and references to cite.
experimentQC | coverage.pdf | plot to assess the sequencing depth of a given sample
experimentQC | *_fingerprint.pdf | plot to determine if the antibody-treatment enriched sufficiently
experimentQC | heatmeap_SpearmanCorr.pdf | plot of Spearman correlation between samples
experimentQC | heatmeap_PearsonCorr.pdf | plot of Pearson correlation between samples
experimentQC | sample_mbs.npz | array of multiple BAM summaries
crossReads | *.cc.plot.pdf | Plot of cross-correlation to assess signal-to-noise ratios
crossReads | *.cc.qc | cross-correlation metrics. File [HEADER](docs/xcor_header.txt)
callPeaksMACS | pooled/*pooled.fc_signal.bw | bigwig data file; raw fold enrichment of sample/control
callPeaksMACS | pooled/*pooled_peaks.xls | tab-delimited table of peaks (can be opened in Excel)
callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control signal adjusted for pvalue significance
callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
peakAnnotation | *.chipseeker_annotation.tsv | annotated narrowPeaks file
peakAnnotation | *.chipseeker_pie.pdf | pie graph of where narrow annotated peaks occur
peakAnnotation | *.chipseeker_upsetplot.pdf | upsetplot showing the count of overlaps of the genes with different annotated locations
motifSearch | *_memechip/index.html | interactive HTML link of MEME output
motifSearch | sorted-*.replicated.narrowPeak | Top 600 peaks sorted by p-value; input for motifSearch
motifSearch | *_memechip/combined.meme | MEME identified motifs
diffPeaks | heatmap.pdf | Use only for replicated samples; heatmap of relationship of peak location and peak intensity
diffPeaks | normcount_peaksets.txt | Use only for replicated samples; peak set values of each sample
diffPeaks | pca.pdf | Use only for replicated samples; PCA of peak location and peak intensity
diffPeaks | *_diffbind.bed | Use only for replicated samples; bed file of peak locations between replicates
diffPeaks | *_diffbind.csv | Use only for replicated samples; CSV file of peaks between replicates
plotProfile | plotProfile.png | Plot profile of the TSS region
plotProfile | computeMatrix.gz | Compute Matrix from deeptools to create custom plots other than plotProfile
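
The motifSearch input is described above as the top 600 peaks sorted by p-value. To illustrate what that selection means, here is a minimal Python sketch; it is not the pipeline's own code. It relies on the ENCODE narrowPeak layout linked in the table, where column 8 holds the p-value as -log10, so larger values are more significant. The function name `top_peaks` and the example records are hypothetical.

```python
def top_peaks(lines, limit=600):
    """Return up to `limit` narrowPeak records, most significant first.

    Column 8 of the narrowPeak format holds -log10(p-value),
    so sorting in descending order puts the strongest peaks first.
    """
    records = []
    for line in lines:
        line = line.rstrip("\n")
        # Skip blank lines and optional track/browser header lines.
        if not line or line.startswith(("#", "track", "browser")):
            continue
        records.append(line.split("\t"))
    records.sort(key=lambda fields: float(fields[7]), reverse=True)
    return records[:limit]


# Made-up records: chrom, start, end, name, score, strand,
# signalValue, pValue, qValue, peak.
example = [
    "chr1\t100\t200\tpeak_a\t500\t.\t5.0\t10.0\t8.0\t50",
    "chr1\t300\t400\tpeak_b\t900\t.\t9.0\t30.0\t25.0\t40",
    "chr2\t100\t200\tpeak_c\t100\t.\t2.0\t3.0\t1.5\t60",
]
for fields in top_peaks(example, limit=2):
    print(fields[3], fields[7])
```

With a real file, pass the open file handle as `lines`; the default `limit=600` matches the number quoted in the table.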

## Common Quality Control Metrics
  + These are the files that should be reviewed before continuing with the ChIP-seq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
    1. multiqcReport/multiqc_report.html: follow the ChIP-seq standards [HERE](https://www.encodeproject.org/chip-seq/).
    2. experimentQC/*_fingerprint.pdf: make sure the plot looks as expected for your antibody/input. See [HERE](https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html) for more details.
    3. crossReads/*cc.plot.pdf: make sure your sample data has the correct signal intensity and location. See [HERE](https://ccg.epfl.ch//var/sib_april15/cases/landt12/strand_correlation.html) for more details.
    4. crossReads/*.cc.qc: Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See [HERE](https://genome.ucsc.edu/encode/qualityMetrics.html) for more details.
    5. experimentQC/coverage.pdf, experimentQC/heatmeap_SpearmanCorr.pdf, experimentQC/heatmeap_PearsonCorr.pdf: See [HERE](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html) for more details.
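
The NSC and RSC thresholds in item 4 can be expressed as a small check. This is a minimal Python sketch, assuming each line of a *.cc.qc file is tab-delimited with NSC in column 9 and RSC in column 10, as stated above. The function name `check_cross_correlation` and the example values are made up for illustration.

```python
def check_cross_correlation(qc_line, is_input=False):
    """Report whether NSC and RSC from one *.cc.qc line meet the thresholds.

    Experiments should have NSC > 1.1 and RSC > 0.8;
    inputs should have NSC < 1.1 and RSC < 0.8.
    """
    fields = qc_line.rstrip("\n").split("\t")
    nsc = float(fields[8])  # column 9
    rsc = float(fields[9])  # column 10
    if is_input:
        return {"NSC": nsc < 1.1, "RSC": rsc < 0.8}
    return {"NSC": nsc > 1.1, "RSC": rsc > 0.8}


# Made-up line with 11 tab-delimited fields; only columns 9 and 10 are read.
example = "sample.tagAlign.gz\t1000000\t200\t0.5\t50\t0.4\t1500\t0.3\t1.35\t1.2\t1"
print(check_cross_correlation(example))
print(check_cross_correlation(example, is_input=True))
```

A `False` entry marks a metric that misses its threshold and should prompt a closer look at the corresponding sample.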


## Common Errors
If you find an error, please let the [BICF](mailto:BICF@UTSouthwestern.edu) know and we will add it here.

## Citation
Please cite individual programs and versions used [HERE](docs/references.md), and the pipeline doi:[10.5281/zenodo.2648844](https://doi.org/10.5281/zenodo.2648844). Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).

## Programs and Versions
  + python/3.6.1-2-anaconda [website](https://www.anaconda.com/download/#linux) [citation](docs/references.md)
  + trimgalore/0.4.1 [website](https://github.com/FelixKrueger/TrimGalore) [citation](docs/references.md)
  + cutadapt/1.9.1 [website](https://cutadapt.readthedocs.io/en/stable/index.html) [citation](docs/references.md)
  + bwa/intel/0.7.12 [website](http://bio-bwa.sourceforge.net/) [citation](docs/references.md)
  + samtools/1.6 [website](http://samtools.sourceforge.net/) [citation](docs/references.md)
  + sambamba/0.6.6 [website](http://lomereiter.github.io/sambamba/) [citation](docs/references.md)
  + bedtools/2.26.0 [website](https://bedtools.readthedocs.io/en/latest/) [citation](docs/references.md)
  + deeptools/2.5.0.1 [website](https://deeptools.readthedocs.io/en/develop/) [citation](docs/references.md)
  + phantompeakqualtools/1.2 [website](https://github.com/kundajelab/phantompeakqualtools) [citation](docs/references.md)
  + macs/2.1.0-20151222 [website](http://liulab.dfci.harvard.edu/MACS/) [citation](docs/references.md)
  + UCSC_userApps/v317 [website](https://genome.ucsc.edu/util.html) [citation](docs/references.md)
  + R/3.4.1 [website](https://www.r-project.org/) [citation](docs/references.md)
  + SPP/1.14
  + meme/4.11.1-gcc-openmpi [website](http://meme-suite.org/doc/install.html?man_type=web) [citation](docs/references.md)
  + ChIPseeker [website](https://bioconductor.org/packages/release/bioc/html/ChIPseeker.html) [citation](docs/references.md)
  + DiffBind [website](https://bioconductor.org/packages/release/bioc/html/DiffBind.html) [citation](docs/references.md)



## Credits
This example workflow is derived from original scripts kindly contributed by the Bioinformatics Core Facility ([BICF](https://www.utsouthwestern.edu/labs/bioinformatics/)), in the [Department of Bioinformatics](https://www.utsouthwestern.edu/departments/bioinformatics/).