Commit b39fcb4c authored by Holly Ruess
added manual

parent 21f00e65
# **CHIPseq Manual**
## Version 1.0.0
## January 2, 2019
# BICF ChIP-seq Pipeline
......@@ -15,3 +19,136 @@ The pipeline uses [Nextflow](, a bioinformatics workflow
This pipeline is primarily used with a SLURM cluster on the [BioHPC Cluster]( However, the pipeline should be able to run on any system that Nextflow supports.
Additionally, the pipeline is designed to work with [Astrocyte Workflow System]( using a simple web interface.
Current version of the software and issue reports are at
To download the current version of the software
$ git clone
## Input files
##### 1) Fastq Files
+ You will need the full path to the files for the Bash Script
##### 2) Design File
+ The design file is a tab-delimited file with 8 columns for single-end data and 9 columns for paired-end data. Letters, numbers, and underscores can be used in the names. However, the names can only begin with a letter. Columns must be as follows:
1. sample_id          a short, unique, and concise name used to label output files; will be used as a control_id if it is the control sample
2. experiment_id    biosample_treatment_factor
3. biosample          symbol for tissue type or cell line
4. factor                 symbol for antibody target
5. treatment           symbol of treatment applied
6. replicate             a number, usually from 1-3 (e.g. 1)
7. control_id          sample_id name that is the control for this sample
8. fastq_read1        name of fastq file 1 for SE or PE data
9. fastq_read2        name of fastq file 2 for PE data
+ See [HERE]( for an example design file, paired-end
+ See [HERE]( for an example design file, single-end
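Before submitting a job, the design file rules above can be checked with a short script. The following is a minimal sketch, not part of the pipeline. It assumes the first row of the design file is a header holding the column names in the order listed, and it assumes the naming rule applies to `sample_id`, `experiment_id`, and `control_id`. The function name `check_design` is hypothetical.

```python
import csv
import io
import re

# Column order as listed above; fastq_read2 is present only for paired-end data.
COLUMNS = ["sample_id", "experiment_id", "biosample", "factor", "treatment",
           "replicate", "control_id", "fastq_read1", "fastq_read2"]

# Names may use letters, numbers, and underscores, and must begin with a letter.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_design(text, paired_end=False):
    """Return a list of problems found in tab-delimited design file text."""
    expected = COLUMNS if paired_end else COLUMNS[:8]
    rows = [row for row in csv.reader(io.StringIO(text), delimiter="\t") if row]
    # Assumption: the first non-empty row is a header with the column names.
    if not rows or rows[0] != expected:
        return ["header must be: " + ", ".join(expected)]

    problems = []
    records = []
    for row_no, row in enumerate(rows[1:], start=2):
        if len(row) != len(expected):
            problems.append(
                f"row {row_no}: expected {len(expected)} columns, found {len(row)}")
            continue
        records.append((row_no, dict(zip(expected, row))))

    sample_ids = {rec["sample_id"] for _, rec in records}
    for row_no, rec in records:
        for field in ("sample_id", "experiment_id", "control_id"):
            if not NAME_RE.match(rec[field]):
                problems.append(f"row {row_no}: invalid {field} '{rec[field]}'")
        if not rec["replicate"].isdigit():
            problems.append(f"row {row_no}: replicate must be a number")
        # control_id must name a sample_id that exists in the design file.
        if rec["control_id"] not in sample_ids:
            problems.append(
                f"row {row_no}: control_id '{rec['control_id']}' is not a sample_id")
    return problems
```

To use it, read the design file as text and pass it in, for example `check_design(open("design.txt").read(), paired_end=True)`. An empty list means no problems were found by these checks.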
##### 3) Bash Script
+ You will need to create a bash script to run the CHIPseq pipeline on [BioHPC](
+ This pipeline has been optimized for the correct partition
+ See [HERE]( for an example bash script
+ The parameters that must be specified are:
- --reads '/path/to/files/name.fastq.gz'
- --designFile '/path/to/file/design.txt'
- --genome 'GRCm38', 'GRCh38', or 'GRCh37' (if you need to use another genome contact the [BICF](
- --pairedEnd 'true' or 'false' (where 'true' is PE and 'false' is SE; default 'false')
- --outDir (optional) path and folder name of the output data, example: /home2/s000000/Desktop/Chipseq_output
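The parameters above can also be assembled and checked programmatically before a job is submitted. This is a minimal sketch under stated assumptions: the helper name `build_command` is hypothetical, the workflow location is passed in by the caller rather than assumed, and only the parameters listed above are handled.

```python
import shlex

# Genomes listed above; other genomes require contacting the BICF.
GENOMES = {"GRCm38", "GRCh38", "GRCh37"}


def build_command(workflow, reads, design_file, genome, paired_end=False, out_dir=None):
    """Return the nextflow command as a list of arguments."""
    if genome not in GENOMES:
        raise ValueError(f"unsupported genome: {genome}")
    args = [
        "nextflow", "run", workflow,
        "--reads", reads,
        "--designFile", design_file,
        "--genome", genome,
        # 'true' is paired-end, 'false' is single-end (the default).
        "--pairedEnd", "true" if paired_end else "false",
    ]
    if out_dir is not None:  # --outDir is optional
        args += ["--outDir", out_dir]
    return args


def as_shell_line(args):
    """Quote each argument so wildcards such as *fastq.gz reach nextflow unexpanded."""
    return " ".join(shlex.quote(arg) for arg in args)
```

Quoting matters here because the `--reads` value usually contains a wildcard that the shell would otherwise expand before Nextflow sees it.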
## Pipeline
+ There are 11 steps to the pipeline
1. Check input files
2. Trim raw reads with Trim Galore
3. Align trimmed reads with bwa, then sort and convert to bam with samtools
4. Mark duplicates with sambamba, and filter reads with samtools
5. Calculate quality metrics with deepTools
6. Calculate cross-correlation using phantompeakqualtools
7. Call peaks with MACS
8. Calculate consensus peaks
9. Annotate Peaks
10. Calculate Differential Binding Activity
11. Motif Search Peaks
See [FLOWCHART](https://git.biohpc.swmed.ed/bchen4/chipseq_analysis/raw/master/docs/flowchar.pdf)
## Output Files
Folder | File | Description
--- | --- | ---
design | N/A | Inputs used for analysis; can ignore
trimReads | *_trimming_report.txt | report detailing how many reads were trimmed
trimReads | *_trimmed.fq.gz | trimmed fastq files used for analysis
alignReads | *.srt.bam.flagstat.qc | QC metrics from the mapping process
alignReads | *.srt.bam | sorted bam file
filterReads | *.dup.qc | QC metrics from marking duplicate reads (sambamba)
filterReads | *.filt.nodup.bam | filtered bam file with duplicate reads removed
filterReads | *.filt.nodup.bam.bai | indexed filtered bam file
filterReads | *.filt.nodup.flagstat.qc | QC metrics of filtered bam file (mapping stats, samtools)
filterReads | *.filt.nodup.pbc.qc | QC metrics of library complexity
convertReads | *.filt.nodup.bedse.gz | bed alignment in BEDPE format
convertReads | *.filt.nodup.tagAlign.gz | bed alignment in BEDPE format, same as bedse unless samples are paired-end
experimentQC | coverage.pdf | plot to assess the sequencing depth of a given sample
experimentQC | *_fingerprint.pdf | plot to determine if the antibody-treatment enriched sufficiently
experimentQC | heatmeap_SpearmanCorr.pdf | plot of Spearman correlation between samples
experimentQC | heatmeap_PearsonCorr.pdf | plot of Pearson correlation between samples
experimentQC | sample_mbs.npz | array of multiple BAM summaries
crossReads | * | plot of cross-correlation to assess signal-to-noise ratios
crossReads | * | cross-correlation metrics. File [HEADER](https://git.biohpc.swmed.ed/bchen4/chipseq_analysis/raw/master/docs/xcor_header.txt)
callPeaksMACS | * | bigwig data file; raw fold enrichment of sample/control
callPeaksMACS | * | bigwig data file; sample/control signal adjusted for pvalue significance
callPeaksMACS | *_peaks.narrowPeak | peaks file; see [HERE]( for ENCODE narrowPeak header format
consensusPeaks | design_annotatePeaks.tsv | design file; can ignore
consensusPeaks | design_diffPeaks.csv | design file; can ignore
consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | unique_experiments.csv | design file; can ignore
peakAnnotation | *.chipseeker_annotation.csv | annotated narrowPeaks file
peakAnnotation | *.chipseeker_pie.pdf | pie graph of where narrow annotated peaks occur
peakAnnotation | *.chipseeker_upsetplot.pdf | upsetplot showing the count of overlaps of the genes with different annotated locations
motifSearch | *_memechip/index.html | interactive HTML link of MEME output
motifSearch | sorted-*.replicated.narrowPeak | Top 600 peaks sorted by p-value; input for motifSearch
motifSearch | *_memechip/ | MEME identified motifs
diffPeaks | heatmap.pdf | Use only for replicated samples; heatmap of relationship of peak location and peak intensity
diffPeaks | normcount_peaksets.txt | Use only for replicated samples; peak set values of each sample
diffPeaks | pca.pdf | Use only for replicated samples; PCA of peak location and peak intensity
diffPeaks | *_diffbind.bed | Use only for replicated samples; bed file of peak locations between replicates
diffPeaks | *_diffbind.csv | Use only for replicated samples; CSV file of peaks between replicates
## Common Quality Control Metrics
+ These are the files that should be reviewed before continuing with the CHIPseq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
1. filterReads/*.filt.nodup.pbc.qc: follow the ChIP-seq standards [HERE](; NRF>0.9, PBC1>0.9, and PBC2>10
2. experimentQC/*_fingerprint.pdf: make sure the plot's information is correct for your antibody/input. See [HERE]( for more details.
3. crossReads/* make sure your sample data has the correct signal intensity and location. See [HERE]( for more details.
4. crossReads/* Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See [HERE]( for more details.
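The numeric thresholds above can be written as two small checks. This is a minimal sketch that takes the metric values as arguments, since the layout of the `.pbc.qc` file is not described here. The function names are hypothetical.

```python
def passes_library_complexity(nrf, pbc1, pbc2):
    """Thresholds from item 1: NRF > 0.9, PBC1 > 0.9, and PBC2 > 10."""
    return nrf > 0.9 and pbc1 > 0.9 and pbc2 > 10


def passes_cross_correlation(nsc, rsc, is_input=False):
    """Thresholds from item 4.

    Experiment samples should have NSC > 1.1 and RSC > 0.8;
    input (control) samples should have NSC < 1.1 and RSC < 0.8.
    """
    if is_input:
        return nsc < 1.1 and rsc < 0.8
    return nsc > 1.1 and rsc > 0.8
```

A sample that fails either check is a candidate for the re-evaluation described above. The fingerprint and cross-correlation plots still need to be inspected by eye.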
## Common Errors
If you find an error, please let the [BICF]( know and we will add it here.
## Programs and Versions
+ python/3.6.1-2-anaconda [website]( [citation](
+ trimgalore/0.4.1 [website]( [citation](
+ cutadapt/1.9.1 [website]( [citation](
+ bwa/intel/0.7.12 [website]( [citation](
+ samtools/1.6 [website]( [citation](
+ sambamba/0.6.6 [website]( [citation](
+ bedtools/2.26.0 [website]( [citation](
+ deeptools/ [website]( [citation](
+ phantompeakqualtools/1.2 [website]( [citation](
+ macs/2.1.0-20151222 [website]( [citation](
+ UCSC_userApps/v317 [website]( [citation](
+ R/3.3.2-gccmkl [website]( [citation](
+ meme/4.11.1-gcc-openmpi [website]( [citation](
+ ChIPseeker [website]( [citation](
+ DiffBind [website]( [citation](
## Credits
This example workflow is derived from original scripts kindly contributed by the Bioinformatic Core Facility ([BICF](, in the [Department of Bioinformatics](
## Citation
Please cite individual programs and versions used [HERE]( Also, please look out for our pipeline to be published in the future [HERE](
#!/bin/bash
#SBATCH --job-name=CHIPseq
#SBATCH --partition=super
#SBATCH --output=CHIPseq.%j.out
#SBATCH --error=CHIPseq.%j.err
module load nextflow/0.31.0
module add python/3.6.1-2-anaconda
nextflow run workflow/ \
--reads '/path/to/*fastq.gz' \
--designFile '/path/to/design.txt' \
--genome 'GRCm38' \
--pairedEnd 'true'
Anaconda (Anaconda Software Distribution,
trimgalore/0.4.1 (
Martin, M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17(1):10-12. DOI:
Li H., and R. Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25: 1754-60.
Li H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. 2009. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25: 2078-9.
Tarasov, A., A. J. Vilella, E. Cuppen, I. J. Nijman, and P. Prins. 2015. Sambamba: fast processing of NGS alignment formats. Bioinformatics 31(12): 2032-2034. doi:10.1093/bioinformatics/btv098.
Quinlan, A. R., and I. M. Hall. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6): 841-842. doi:10.1093/bioinformatics/btq033
Ramírez, F., D. P. Ryan, B. Grüning, V. Bhardwaj, F. Kilpert, A. S. Richter, S. Heyne, F. Dündar, and T. Manke. 2016. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Research 44: W160-165. doi: 10.1093/nar/gkw257.
Landt S. G., G. K. Marinov, A. Kundaje, et al. 2012. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22(9): 1813-31. doi: 10.1101/gr.136184.111.
Kharchenko P. V., M. Y. Tolstorukov, and P. J. Park. 2008. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 26(12): 1351-1359.
Zhang Y., T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum, R. M. Myers, M. Brown, W. Li, and X. S. Liu. 2008. Model-based Analysis of ChIP-Seq (MACS). Genome Biol 9: R137.
Kent W. J., A. S. Zweig, G. Barber, A. S. Hinrichs, and D. Karolchik. 2010. BigWig and BigBed: enabling browsing of large distributed data sets. Bioinformatics 26(17): 2204-2207.
R Core Team 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL
Bailey T. L., M. Bodén, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Ren, W. W. Li, and W. S. Noble. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Research 37: W202-W208.
Machanick P., and T. L. Bailey. 2011. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27(12): 1696-1697.
R ChIPseeker:
Yu G., L. Wang, and Q. He. 2015. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31(14): 2382-2383. doi: 10.1093/bioinformatics/btv145.
R DiffBind:
Stark R., and G. Brown. 2011. DiffBind: differential binding analysis of ChIP-Seq peak data.
Ross-Innes C. S., R. Stark, A. E. Teschendorff, K. A. Holmes, H. R. Ali, M. J. Dunning, G. D. Brown, O. Gojis, I. O. Ellis, A. R. Green, S. Ali, S. Chin, C. Palmieri, C. Caldas, and J. S. Carroll. 2012. Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature 481: 389-393.
See for more details
COL1: Filename: tagAlign/BAM filename
COL2: numReads: effective sequencing depth i.e. total number of mapped reads in input file
COL3: estFragLen: comma separated strand cross-correlation peak(s) in decreasing order of correlation.
The top 3 local maxima locations that are within 90% of the maximum cross-correlation value are output.
In almost all cases, the top (first) value in the list represents the predominant fragment length.
If you want to keep only the top value simply run
sed -r 's/,[^\t]+//g' <outFile> > <newOutFile>
COL4: corr_estFragLen: comma separated strand cross-correlation value(s) in decreasing order (COL3 follows the same order)
COL5: phantomPeak: Read length/phantom peak strand shift
COL6: corr_phantomPeak: Correlation value at phantom peak
COL7: argmin_corr: strand shift at which cross-correlation is lowest
COL8: min_corr: minimum value of cross-correlation
COL9: Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8
COL10: Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8)
COL11: QualityTag: Quality tag based on thresholded RSC (codes: -2:veryLow,-1:Low,0:Medium,1:High,2:veryHigh)
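
The column definitions above can be applied to one line of a cross-correlation metrics file as follows. This is a minimal sketch that assumes the file is tab-delimited with exactly the 11 columns listed and no header row. The function name `parse_xcor_line` is hypothetical. It keeps only the top (first) value of the comma separated columns, which has the same effect as the sed command shown above, and recomputes NSC and RSC from COL4, COL6, and COL8.

```python
def first_value(field):
    """Keep only the top (first) entry of a comma separated field."""
    return field.split(",")[0]


def parse_xcor_line(line):
    """Parse one tab-delimited line of cross-correlation metrics."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 11:
        raise ValueError(f"expected 11 columns, found {len(fields)}")

    corr_est_frag_len = float(first_value(fields[3]))  # COL4, top value
    corr_phantom_peak = float(fields[5])               # COL6
    min_corr = float(fields[7])                        # COL8

    return {
        "filename": fields[0],                              # COL1
        "num_reads": int(fields[1]),                        # COL2
        "est_frag_len": int(first_value(fields[2])),        # COL3, top value
        # NSC = COL4 / COL8
        "nsc": corr_est_frag_len / min_corr,
        # RSC = (COL4 - COL8) / (COL6 - COL8)
        "rsc": (corr_est_frag_len - min_corr) / (corr_phantom_peak - min_corr),
        "quality_tag": int(fields[10]),                     # COL11
    }
```

The recomputed `nsc` and `rsc` values should agree with COL9 and COL10 of the same line, up to rounding, which gives a quick consistency check on the file.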