Merge branch '8-update-documentation' into 'master'

Resolve "Update documentation" Closes #8 See merge request BICF/Astrocyte/chipseq_analysis!35

Commit 304683ef authored 5 years ago by Venkat Malladi
--- a/README.md
+++ b/README.md
+# **CHIPseq Manual**
+## Version 1.0.0
+## May 2, 2019
 # BICF ChIP-seq Pipeline
 [![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/badges/master/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/commits/master)
@@ -9,10 +13,144 @@
 ## Introduction
-BICF ChIPseq is a bioinformatics best-practice analysis pipeline used for ChIP-seq (chromatin immunoprecipitation sequencing) data analysis at [BICF](http://www.utsouthwestern.edu/labs/bioinformatics/) at [UT Southwestern Dept. of Bioinformatics](http://www.utsouthwestern.edu/departments/bioinformatics/).
+BICF ChIPseq is a bioinformatics best-practice analysis pipeline used for ChIP-seq (chromatin immunoprecipitation sequencing) data analysis at [BICF](http://www.utsouthwestern.edu/labs/bioinformatics/) at [UT Southwestern Department of Bioinformatics](http://www.utsouthwestern.edu/departments/bioinformatics/).
 The pipeline uses [Nextflow](https://www.nextflow.io), a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.
-This pipeline is primarily used with a SLURM cluster on the [BioHPC Cluster](https://biohpc.swmed.edu/). However, the pipeline should be able to run on any system that Nextflow supports.
+This pipeline is primarily used with a SLURM cluster on the [BioHPC Cluster](https://portal.biohpc.swmed.edu/content/). However, the pipeline should be able to run on any system that supports Nextflow.
 Additionally, the pipeline is designed to work with [Astrocyte Workflow System](https://astrocyte-test.biohpc.swmed.edu/static/docs/index.html) using a simple web interface.
+The current version of the software and its issue tracker are at
+https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis
+To download the current version of the software:
+```bash
+$ git clone git@git.biohpc.swmed.edu:BICF/Astrocyte/chipseq_analysis.git
+```
+## Input files
+##### 1) Fastq Files
+  + You will need the full path to the files for the Bash Script
+##### 2) Design File
+  + The Design file is a tab-delimited file with 8 columns for Single-End and 9 columns for Paired-End.  Letters, numbers, and underscores can be used in the names. However, the names can only begin with a letter. Columns must be as follows:
+      1. sample_id&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a short, unique, and concise name used to label output files; will be used as a control_id if it is the control sample
+      2. experiment_id&nbsp;&nbsp;&nbsp;&nbsp;biosample_treatment_factor; same name given for all replicates of treatment. Will be used for the consensus header.
+      3. biosample&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;symbol for tissue type or cell line
+      4. factor&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;symbol for antibody target
+      5. treatment&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;symbol of treatment applied
+      6. replicate&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a number, usually from 1-3 (e.g. 1)
+      7. control_id&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sample_id name that is the control for this sample
+      8. fastq_read1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;name of fastq file 1 for SE or PE data
+      9. fastq_read2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;name of fastq file 2 for PE data
+  + See [HERE](test_data/design_ENCSR729LGA_PE.txt) for an example design file, paired-end
+  + See [HERE](test_data/design_ENCSR238SGC_SE.txt) for an example design file, single-end
+##### 3) Bash Script
+  + You will need to create a bash script to run the CHIPseq pipeline on [BioHPC](https://portal.biohpc.swmed.edu/content/)
+  + This pipeline has been optimized for the correct partition
+  + See [HERE](docs/CHIPseq.sh) for an example bash script
+  + The parameters that must be specified are:
+      - --reads '/path/to/files/name.fastq.gz'
+      - --designFile '/path/to/file/design.txt'
+      - --genome 'GRCm38', 'GRCh38', or 'GRCh37' (if you need to use another genome contact the [BICF](mailto:BICF@UTSouthwestern.edu))
+      - --pairedEnd 'true' or 'false' (where 'true' is PE and 'false' is SE; default 'false')
+      - --outDir (optional) path and folder name of the output data, example: /home2/s000000/Desktop/Chipseq_output (if not specified, output will be under workflow/output/)
+## Pipeline
+  + There are 11 steps to the pipeline
+    1. Check input files
+    2. Trim adaptors with TrimGalore!
+    3. Align trimmed reads with bwa, and sort/convert to bam with samtools
+    4. Mark duplicates with Sambamba, and filter reads with samtools
+    5. Quality metrics with deepTools
+    6. Calculate cross-correlation using PhantomPeakQualTools
+    7. Call peaks with MACS
+    8. Calculate consensus peaks
+    9. Annotate all peaks using ChIPseeker
+    10. Calculate Differential Binding Activity with DiffBind (If more than 1 rep in more than 1 experiment)
+    11. Use MEME-ChIP to find motifs in original peaks
+See [FLOWCHART](docs/flowchart.pdf)
+## Output Files
+Folder | File | Description
+--- | --- | ---
+design | N/A | Inputs used for analysis; can ignore
+trimReads | *_trimming_report.txt | report detailing how many reads were trimmed
+trimReads | *_trimmed.fq.gz | trimmed fastq files used for analysis
+alignReads | *.srt.bam.flagstat.qc | QC metrics from the mapping process
+alignReads | *.srt.bam | sorted bam file
+filterReads | *.dup.qc | QC metrics from marking duplicate reads (sambamba)
+filterReads | *.filt.nodup.bam | filtered bam file with duplicate reads removed
+filterReads | *.filt.nodup.bam.bai | indexed filtered bam file
+filterReads | *.filt.nodup.flagstat.qc | QC metrics of filtered bam file (mapping stats, samtools)
+filterReads | *.filt.nodup.pbc.qc | QC metrics of library complexity
+convertReads | *.filt.nodup.bedse.gz | bed alignment in BEDPE format
+convertReads | *.filt.nodup.tagAlign.gz | bed alignment in BEDPE format, same as bedse unless samples are paired-end
+multiqcReport | multiqc_report.html | Quality control report of NRF, PBC1, PBC2, NSC, and RSC. Also contains software versions and references to cite.
+experimentQC | coverage.pdf | plot to assess the sequencing depth of a given sample
+experimentQC | *_fingerprint.pdf | plot to determine if the antibody-treatment enriched sufficiently
+experimentQC | heatmeap_SpearmanCorr.pdf | plot of Spearman correlation between samples
+experimentQC | heatmeap_PearsonCorr.pdf | plot of Pearson correlation between samples
+experimentQC | sample_mbs.npz | array of multiple BAM summaries
+crossReads | *.cc.plot.pdf | Plot of cross-correlation to assess signal-to-noise ratios
+crossReads | *.cc.qc | cross-correlation metrics. File [HEADER](docs/xcor_header.txt)
+callPeaksMACS | pooled/*pooled.fc_signal.bw | bigwig data file; raw fold enrichment of sample/control
+callPeaksMACS | pooled/*pooled_peaks.xls | Excel file of peaks
+callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control signal adjusted for pvalue significance
+callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
+consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
+consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
+peakAnnotation | *.chipseeker_annotation.tsv | annotated narrowPeaks file
+peakAnnotation | *.chipseeker_pie.pdf | pie graph of where narrow annotated peaks occur
+peakAnnotation | *.chipseeker_upsetplot.pdf | upsetplot showing the count of overlaps of the genes with different annotated location
+motifSearch | *_memechip/index.html | interactive HTML link of MEME output
+motifSearch | sorted-*.replicated.narrowPeak | Top 600 peaks sorted by p-value; input for motifSearch
+motifSearch | *_memechip/combined.meme | MEME identified motifs
+diffPeaks | heatmap.pdf | Use only for replicated samples; heatmap of relationship of peak location and peak intensity
+diffPeaks | normcount_peaksets.txt | Use only for replicated samples; peak set values of each sample
+diffPeaks | pca.pdf | Use only for replicated samples; PCA of peak location and peak intensity
+diffPeaks | *_diffbind.bed | Use only for replicated samples; bed file of peak locations between replicates
+diffPeaks | *_diffbind.csv | Use only for replicated samples; CSV file of peaks between replicates
+## Common Quality Control Metrics
+  + This is the list of files that should be reviewed before continuing with the CHIPseq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
+    1. multiqcReport/multiqc_report.html: follow the ChIP-seq standards [HERE](https://www.encodeproject.org/chip-seq/).
+    2. experimentQC/*_fingerprint.pdf: make sure the plot's information is correct for your antibody/input. See [HERE](https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html) for more details.
+    3. crossReads/*cc.plot.pdf: make sure your sample data has the correct signal intensity and location.  See [HERE](https://ccg.epfl.ch//var/sib_april15/cases/landt12/strand_correlation.html) for more details.
+    4. crossReads/*.cc.qc: Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See [HERE](https://genome.ucsc.edu/encode/qualityMetrics.html) for more details.
+    5. experimentQC/coverage.pdf, experimentQC/heatmeap_SpearmanCorr.pdf, experimentQC/heatmeap_PearsonCorr.pdf: See [HERE](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html) for more details.
+## Common Errors
+If you find an error, please let the [BICF](mailto:BICF@UTSouthwestern.edu) know and we will add it here.
+## Citation
+Please cite individual programs and versions used [HERE](docs/references.txt). Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596). Also, please look out for our pipeline to be published in the future [HERE](https://zenodo.org/).
+## Programs and Versions
+  + python/3.6.1-2-anaconda [website](https://www.anaconda.com/download/#linux) [citation](docs/references.txt)
+  + trimgalore/0.4.1 [website](https://github.com/FelixKrueger/TrimGalore) [citation](docs/references.txt)
+  + cutadapt/1.9.1 [website](https://cutadapt.readthedocs.io/en/stable/index.html) [citation](docs/references.txt)
+  + bwa/intel/0.7.12 [website](http://bio-bwa.sourceforge.net/) [citation](docs/references.txt)
+  + samtools/1.6 [website](http://samtools.sourceforge.net/) [citation](docs/references.txt)
+  + sambamba/0.6.6 [website](http://lomereiter.github.io/sambamba/) [citation](docs/references.txt)
+  + bedtools/2.26.0 [website](https://bedtools.readthedocs.io/en/latest/) [citation](docs/references.txt)
+  + deeptools/2.5.0.1 [website](https://deeptools.readthedocs.io/en/develop/) [citation](docs/references.txt)
+  + phantompeakqualtools/1.2 [website](https://github.com/kundajelab/phantompeakqualtools) [citation](docs/references.txt)
+  + macs/2.1.0-20151222 [website](http://liulab.dfci.harvard.edu/MACS/) [citation](docs/references.txt)
+  + UCSC_userApps/v317 [website](https://genome.ucsc.edu/util.html) [citation](docs/references.txt)
+  + R/3.4.1 [website](https://www.r-project.org/) [citation](docs/references.txt)
+  + SPP/1.14
+  + meme/4.11.1-gcc-openmpi [website](http://meme-suite.org/doc/install.html?man_type=web) [citation](docs/references.txt)
+  + ChIPseeker [website](https://bioconductor.org/packages/release/bioc/html/ChIPseeker.html) [citation](docs/references.txt)
+  + DiffBind [website](https://bioconductor.org/packages/release/bioc/html/DiffBind.html) [citation](docs/references.txt)
+## Credits
+This example workflow is derived from original scripts kindly contributed by the Bioinformatic Core Facility ([BICF](https://www.utsouthwestern.edu/labs/bioinformatics/)), in the [Department of Bioinformatics](https://www.utsouthwestern.edu/departments/bioinformatics/).
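
Note on the `motifSearch` row of the Output Files table above: it describes `sorted-*.replicated.narrowPeak` as the top 600 peaks sorted by p-value. The following is a minimal Python sketch of that selection, not the pipeline's own `motif_search.py`. It assumes the standard ENCODE narrowPeak layout linked in the table, where column 8 holds -log10(p-value), so larger values are more significant. The function name `top_peaks` and the sample records are hypothetical.

```python
def top_peaks(narrowpeak_lines, count=600):
    """Return the `count` most significant peaks from narrowPeak lines.

    Assumes tab-separated ENCODE narrowPeak records, where the eighth
    column (index 7) is -log10(p-value).
    """
    peaks = [line.rstrip("\n").split("\t") for line in narrowpeak_lines if line.strip()]
    # Larger -log10(p-value) means a smaller p-value, so sort descending.
    peaks.sort(key=lambda fields: float(fields[7]), reverse=True)
    return peaks[:count]


# Made-up records for illustration only.
example = [
    "chr1\t100\t200\tpeak_a\t0\t.\t3.1\t5.0\t2.0\t50",
    "chr1\t300\t400\tpeak_b\t0\t.\t8.4\t20.0\t15.0\t40",
    "chr2\t500\t600\tpeak_c\t0\t.\t4.7\t10.0\t7.0\t60",
]
for fields in top_peaks(example, count=2):
    print(fields[3], fields[7])
```
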
--- a/astrocyte_pkg.yml
+++ b/astrocyte_pkg.yml
@@ -9,7 +9,7 @@
 # A unique identifier for the workflow package, text/underscores only
 name: 'chipseq_analysis_bicf'
 # Who wrote this?
-author: 'Beibei Chen and Venkat Malladi'
+author: 'Holly Ruess, Spencer D. Barnes, Beibei Chen and Venkat Malladi'
 # A contact email address for questions
 email: 'bicf@utsouthwestern.edu'
 # A more informative title for the workflow package

--- a/docs/CHIPseq.sh
+++ b/docs/CHIPseq.sh
+#!/bin/bash
+#SBATCH --job-name=CHIPseq
+#SBATCH --partition=super
+#SBATCH --output=CHIPseq.%j.out
+#SBATCH --error=CHIPseq.%j.err
+module load nextflow/0.31.0
+module add  python/3.6.1-2-anaconda
+nextflow run workflow/main.nf \
+--reads '/path/to/*fastq.gz' \
+--designFile '/path/to/design.txt' \
+--genome 'GRCm38' \
+--pairedEnd 'true'
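
Note on the example script above: the same `nextflow run` invocation can be assembled programmatically, which makes it easy to reject an unsupported genome before a job is submitted. This is only a sketch under stated assumptions. The helper `build_nextflow_command` is hypothetical and not part of the pipeline. The parameter names and the three genome values are taken from the README text in this merge request.

```python
import shlex

# Genome values listed in the README; other genomes require contacting BICF.
SUPPORTED_GENOMES = ("GRCm38", "GRCh38", "GRCh37")


def build_nextflow_command(reads, design_file, genome, paired_end=False, out_dir=None):
    """Assemble the argument list for running workflow/main.nf (hypothetical helper)."""
    if genome not in SUPPORTED_GENOMES:
        raise ValueError(f"unsupported genome {genome!r}; expected one of {SUPPORTED_GENOMES}")
    command = [
        "nextflow", "run", "workflow/main.nf",
        "--reads", reads,
        "--designFile", design_file,
        "--genome", genome,
        "--pairedEnd", "true" if paired_end else "false",
    ]
    if out_dir is not None:
        # --outDir is optional in the README.
        command += ["--outDir", out_dir]
    return command


command = build_nextflow_command(
    "/path/to/*fastq.gz", "/path/to/design.txt", "GRCm38", paired_end=True
)
# shlex.join quotes the glob so the shell passes it to Nextflow unexpanded.
print(shlex.join(command))
```
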
--- a/docs/design_example.txt
+++ b/docs/design_example.txt
@@ -2,4 +2,4 @@ sample_id	experiment_id	biosample	factor	treatment	replicate	control_id	fastq_re
 A1	A	tissueA	H3K27AC	None	1	B1	A1.fastq.gz
 A2	A	tissueA	H3K27AC	None	2	B2	A2.fastq.gz
 B1	B	tissueB	Input	None	1	B1	B1.fastq.gz
-B2	A	tissueB	Input	None	2	B2	B2.fastq.gz
+B2	B	tissueB	Input	None	2	B2	B2.fastq.gz
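
Note on the fix above (B2 moved from experiment A to experiment B): a small check can catch this kind of design-file mistake before the pipeline runs. The sketch below is not the pipeline's own input check. It assumes the single-end column names listed in the README, assumes the "names begin with a letter" rule applies to `sample_id` and `experiment_id`, and assumes that all rows sharing an `experiment_id` should agree on biosample, factor, and treatment. The helpers `check_design` and `make_design` are hypothetical.

```python
import csv
import io
import re

SE_COLUMNS = ["sample_id", "experiment_id", "biosample", "factor",
              "treatment", "replicate", "control_id", "fastq_read1"]
PE_COLUMNS = SE_COLUMNS + ["fastq_read2"]
# Letters, numbers, and underscores; must begin with a letter.
NAME_PATTERN = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_design(text, paired_end=False):
    """Return a list of problems found in tab-delimited design file text."""
    expected = PE_COLUMNS if paired_end else SE_COLUMNS
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != expected:
        return [f"expected columns {expected}, found {reader.fieldnames}"]
    rows = list(reader)
    sample_ids = {row["sample_id"] for row in rows}
    problems = []
    first_seen = {}
    for row in rows:
        sample = row["sample_id"]
        for column in ("sample_id", "experiment_id"):
            if not NAME_PATTERN.match(row[column] or ""):
                problems.append(f"{sample}: invalid {column} {row[column]!r}")
        if row["control_id"] not in sample_ids:
            problems.append(f"{sample}: control_id {row['control_id']!r} is not a sample_id")
        # Assumption: replicates of one experiment share these three values.
        key = (row["biosample"], row["factor"], row["treatment"])
        if first_seen.setdefault(row["experiment_id"], key) != key:
            problems.append(f"{sample}: experiment_id {row['experiment_id']!r} mixes sample types")
    return problems


def make_design(b2_experiment):
    """Rebuild the four-row example, varying only B2's experiment_id."""
    rows = [
        SE_COLUMNS,
        ["A1", "A", "tissueA", "H3K27AC", "None", "1", "B1", "A1.fastq.gz"],
        ["A2", "A", "tissueA", "H3K27AC", "None", "2", "B2", "A2.fastq.gz"],
        ["B1", "B", "tissueB", "Input", "None", "1", "B1", "B1.fastq.gz"],
        ["B2", b2_experiment, "tissueB", "Input", "None", "2", "B2", "B2.fastq.gz"],
    ]
    return "\n".join("\t".join(row) for row in rows) + "\n"


print(check_design(make_design("B")))  # corrected row
print(check_design(make_design("A")))  # row as it was before this change
```
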
--- a/docs/flowchart.pdf
+++ b/docs/flowchart.pdf
--- a/docs/index.md
+++ b/docs/index.md
@@ -5,55 +5,121 @@
 The pipeline uses [Nextflow](https://www.nextflow.io), a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.
-### Pipeline Steps
+Report issues to the Bioinformatic Core Facility [BICF](mailto:BICF@UTSouthwestern.edu)
-  1) Trim adaptors TrimGalore!
+### Pipeline Steps
-  2) Align with BWA
+  + There are 11 steps to the pipeline
-  3) Filter reads with Sambamba  S
+    1. Check input files
-  4) Quality control with DeepTools
+    2. Trim adaptors with TrimGalore!
-  5) Calculate Cross-correlation using SPP and PhantomPeakQualTools
+    3. Align trimmed reads with bwa, and sort/convert to bam with samtools
-  6) Signal profiling using MACS2
+    4. Mark duplicates with Sambamba, and filter reads with samtools
-  7) Call consenus peaks
+    5. Quality metrics with deepTools
-  8) Annotate all peaks using ChipSeeker
+    6. Calculate cross-correlation using PhantomPeakQualTools
-  9) Use MEME-ChIP to find motifs in original peaks
+    7. Call peaks with MACS
-  10) Find differential expressed peaks using DiffBind (If more than 1 experiment)
+    8. Calculate consensus peaks
+    9. Annotate all peaks using ChIPseeker
+    10. Calculate Differential Binding Activity with DiffBind (If more than 1 rep in more than 1 experiment)
+    11. Use MEME-ChIP to find motifs in original peaks
 ## Workflow Parameters
+    1. One or more input FASTQ files from a ChIP-seq experiment and a design file with the link between the same file name and sample id (required) - Choose all ChIP-seq fastq files for analysis.
-    reads - Choose all ChIP-seq fastq files for analysis.
+    2. In single-end sequencing, the sequencer reads a fragment from only one end to the other, generating the sequence of base pairs. In paired-end sequencing, it starts at one end, finishes this direction at the specified read length, and then starts another round of reading from the opposite end of the fragment. (Paired-end: True, Single-end: False) (required)
-    pairedEnd - Choose True/False if data is paired-end
+    3. A design file listing sample id, fastq files, corresponding control id and additional information about the sample. 
-    design - Choose the file with the experiment design information. TSV format
    genome - Choose a genomic reference (genome).
-    skipDiff - Choose True/False if data if you want to run Differential Peaks
+    4. Reference species and genome used for alignment and subsequent analysis. (required)
-    skipMotif - Choose True/False if data if you want to run Motif Calling
+    5. Run differential peak analysis (required). Must have at least 2 replicates per experiment and at least 2 experiments.
+    6. Run motif calling (required). Top 600 peaks sorted by p-value.
+    7. Ensure configuration for Astrocyte. (required; always true)
 ## Design file
- The following columns are necessary, must be named as in template. An design file template can be downloaded [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/docs/design_example.txt)
+  + The Design file is a tab-delimited file with 8 columns for Single-End and 9 columns for Paired-End.  Letters, numbers, and underscores can be used in the names. However, the names can only begin with a letter. Columns must be as follows:
+      1. sample_id&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a short, unique, and concise name used to label output files; will be used as a control_id if it is the control sample
-    sample_id
+      2. experiment_id&nbsp;&nbsp;&nbsp;&nbsp;biosample_treatment_factor; same name given for all replicates of treatment. Will be used for the consensus header.
-        The id of the sample. This will be the name used in output files, please make sure it is concise and informative.
+      3. biosample&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;symbol for tissue type or cell line
-    experiment_id
+      4. factor&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;symbol for antibody target
-        The id of the experiment. Used for grouping replicates.
+      5. treatment&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;symbol of treatment applied
-    biosample
+      6. replicate&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a number, usually from 1-3 (e.g. 1)
-        The name of the biological sample.
+      7. control_id&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sample_id name that is the control for this sample
-    factor
+      8. fastq_read1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;name of fastq file 1 for SE or PE data
-        Factor of the experiment.
+      9. fastq_read2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;name of fastq file 2 for PE data
-    treatment
-        Treatment used in experiment.
-    replicate
+  + See [HERE](test_data/design_ENCSR729LGA_PE.txt) for an example design file, paired-end
-        Replicate number.
+  + See [HERE](test_data/design_ENCSR238SGC_SE.txt) for an example design file, single-end
-    control_id
-	    The sample_id of the control used for this sample.
+## Output Files
-    fastq_read1
-      File name of fastq file, if paired-end this is read1.
+Folder | File | Description
-    fastq_read2
+--- | --- | ---
-      File name of read2 (for paired-end), not needed for single-end data.
+design | N/A | Inputs used for analysis; can ignore
+trimReads | *_trimming_report.txt | report detailing how many reads were trimmed
+trimReads | *_trimmed.fq.gz | trimmed fastq files used for analysis
+alignReads | *.srt.bam.flagstat.qc | QC metrics from the mapping process
+alignReads | *.srt.bam | sorted bam file
+filterReads | *.dup.qc | QC metrics from marking duplicate reads (sambamba)
+filterReads | *.filt.nodup.bam | filtered bam file with duplicate reads removed
+filterReads | *.filt.nodup.bam.bai | indexed filtered bam file
+filterReads | *.filt.nodup.flagstat.qc | QC metrics of filtered bam file (mapping stats, samtools)
+filterReads | *.filt.nodup.pbc.qc | QC metrics of library complexity
+convertReads | *.filt.nodup.bedse.gz | bed alignment in BEDPE format
+convertReads | *.filt.nodup.tagAlign.gz | bed alignment in BEDPE format, same as bedse unless samples are paired-end
+multiqcReport | multiqc_report.html | Quality control report of NRF, PBC1, PBC2, NSC, and RSC. Also contains software versions and references to cite.
+experimentQC | coverage.pdf | plot to assess the sequencing depth of a given sample
+experimentQC | *_fingerprint.pdf | plot to determine if the antibody-treatment enriched sufficiently
+experimentQC | heatmeap_SpearmanCorr.pdf | plot of Spearman correlation between samples
+experimentQC | heatmeap_PearsonCorr.pdf | plot of Pearson correlation between samples
+experimentQC | sample_mbs.npz | array of multiple BAM summaries
+crossReads | *.cc.plot.pdf | Plot of cross-correlation to assess signal-to-noise ratios
+crossReads | *.cc.qc | cross-correlation metrics. File [HEADER](docs/xcor_header.txt)
+callPeaksMACS | pooled/*pooled.fc_signal.bw | bigwig data file; raw fold enrichment of sample/control
+callPeaksMACS | pooled/*pooled_peaks.xls | Excel file of peaks
+callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control signal adjusted for pvalue significance
+callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
+consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
+consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
+peakAnnotation | *.chipseeker_annotation.tsv | annotated narrowPeaks file
+peakAnnotation | *.chipseeker_pie.pdf | pie graph of where narrow annotated peaks occur
+peakAnnotation | *.chipseeker_upsetplot.pdf | upsetplot showing the count of overlaps of the genes with different annotated location
+motifSearch | *_memechip/index.html | interactive HTML link of MEME output
+motifSearch | sorted-*.replicated.narrowPeak | Top 600 peaks sorted by p-value; input for motifSearch
+motifSearch | *_memechip/combined.meme | MEME identified motifs
+diffPeaks | heatmap.pdf | Use only for replicated samples; heatmap of relationship of peak location and peak intensity
+diffPeaks | normcount_peaksets.txt | Use only for replicated samples; peak set values of each sample
+diffPeaks | pca.pdf | Use only for replicated samples; PCA of peak location and peak intensity
+diffPeaks | *_diffbind.bed | Use only for replicated samples; bed file of peak locations between replicates
+diffPeaks | *_diffbind.csv | Use only for replicated samples; CSV file of peaks between replicates
+## Common Quality Control Metrics
+  + This is the list of files that should be reviewed before continuing with the CHIPseq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
+    1. multiqcReport/multiqc_report.html: follow the ChIP-seq standards [HERE](https://www.encodeproject.org/chip-seq/).
+    2. experimentQC/*_fingerprint.pdf: make sure the plot's information is correct for your antibody/input. See [HERE](https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html) for more details.
+    3. crossReads/*cc.plot.pdf: make sure your sample data has the correct signal intensity and location.  See [HERE](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/06_QC_cross_correlation.html) for more details.
+    4. crossReads/*.cc.qc: Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See [HERE](https://genome.ucsc.edu/encode/qualityMetrics.html) for more details.
+    5. experimentQC/coverage.pdf, experimentQC/heatmeap_SpearmanCorr.pdf, experimentQC/heatmeap_PearsonCorr.pdf: See [HERE](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html) for more details.
 ### Credits
-This worklow is was developed jointly with the [Bioinformatic Core Facility (BICF), Department of Bioinformatics](http://www.utsouthwestern.edu/labs/bioinformatics/)
+This example workflow is derived from original scripts kindly contributed by the Bioinformatic Core Facility ([BICF](https://www.utsouthwestern.edu/labs/bioinformatics/)), in the [Department of Bioinformatics](https://www.utsouthwestern.edu/departments/bioinformatics/).
+Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).
+### References
-Please cite in publications: Pipeline was developed by BICF from funding provided by **Cancer Prevention and Research Institute of Texas (RP150596)**.
+  + python/3.6.1-2-anaconda [website](https://www.anaconda.com/download/#linux) [citation](docs/references.txt)
+  + trimgalore/0.4.1 [website](https://github.com/FelixKrueger/TrimGalore) [citation](docs/references.txt)
+  + cutadapt/1.9.1 [website](https://cutadapt.readthedocs.io/en/stable/index.html) [citation](docs/references.txt)
+  + bwa/intel/0.7.12 [website](http://bio-bwa.sourceforge.net/) [citation](docs/references.txt)
+  + samtools/1.6 [website](http://samtools.sourceforge.net/) [citation](docs/references.txt)
+  + sambamba/0.6.6 [website](http://lomereiter.github.io/sambamba/) [citation](docs/references.txt)
+  + bedtools/2.26.0 [website](https://bedtools.readthedocs.io/en/latest/) [citation](docs/references.txt)
+  + deeptools/2.5.0.1 [website](https://deeptools.readthedocs.io/en/develop/) [citation](docs/references.txt)
+  + phantompeakqualtools/1.2 [website](https://github.com/kundajelab/phantompeakqualtools) [citation](docs/references.txt)
+  + macs/2.1.0-20151222 [website](http://liulab.dfci.harvard.edu/MACS/) [citation](docs/references.txt)
+  + UCSC_userApps/v317 [website](https://genome.ucsc.edu/util.html) [citation](docs/references.txt)
+  + R/3.3.2-gccmkl [website](https://www.r-project.org/) [citation](docs/references.txt)
+  + meme/4.11.1-gcc-openmpi [website](http://meme-suite.org/doc/install.html?man_type=web) [citation](docs/references.txt)
+  + ChIPseeker [website](https://bioconductor.org/packages/release/bioc/html/ChIPseeker.html) [citation](docs/references.txt)
+  + DiffBind [website](https://bioconductor.org/packages/release/bioc/html/DiffBind.html) [citation](docs/references.txt)
--- a/docs/references.txt
+++ b/docs/references.txt
+python/3.6.1-2-anaconda:
+Anaconda (Anaconda Software Distribution, https://anaconda.com)
+trimgalore/0.4.1:
+trimgalore/0.4.1 (https://github.com/FelixKrueger/TrimGalore)
+cutadapt/1.9.1:
+Martin, M. 2011. Cutadapt removes adapter sequences from high-throughput sequencing reads. EMBnet.journal 17(1):10-12. DOI: http://dx.doi.org/10.14806/ej.17.1.200
+bwa/intel/0.7.12:
+Li H., and R. Durbin. 2009. Fast and accurate short read alignment with Burrows-Wheeler Transform. Bioinformatics 25: 1754-60. 
+samtools/1.6:
+Li H., B. Handsaker, A. Wysoker, T. Fennell, J. Ruan, N. Homer, G. Marth, G. Abecasis, R. Durbin, and 1000 Genome Project Data Processing Subgroup. 2009. The Sequence alignment/map (SAM) format and SAMtools. Bioinformatics 25: 2078-9.
+sambamba/0.6.6:
+Tarasov, A., A. J. Vilella, E. Cuppen, I. J. Nijman, and P. Prins. 2015. Sambamba: fast processing of NGS alignment formats. Bioinformatics 31(12): 2032-2034. doi:10.1093/bioinformatics/btv098.
+bedtools/2.26.0:
+Quinlan, A. R., and I. M. Hall. 2010. BEDTools: a flexible suite of utilities for comparing genomic features. Bioinformatics 26(6): 841-842. doi:10.1093/bioinformatics/btq033
+deeptools/2.5.0.1:
+Ramírez, F., D. P. Ryan, B. Grüning, V. Bhardwaj, F. Kilpert, A. S. Richter, S. Heyne, F. Dündar, and T. Manke. 2016. deepTools2: a next generation web server for deep-sequencing data analysis. Nucleic Acids Research 44: W160-165. doi: 10.1093/nar/gkw257.
+phantompeakqualtools/1.2:
+Landt S. G., G. K. Marinov, A. Kundaje, et al. 2012. ChIP-seq guidelines and practices of the ENCODE and modENCODE consortia. Genome Res 22(9): 1813-31. doi: 10.1101/gr.136184.111.
+Kharchenko P. V., M. Y. Tolstorukov, and P. J. Park. 2008. Design and analysis of ChIP-seq experiments for DNA-binding proteins. Nat Biotechnol 26(12): 1351-1359.
+macs/2.1.0-20151222:
+Zhang Y., T. Liu, C. A. Meyer, J. Eeckhoute, D. S. Johnson, B. E. Bernstein, C. Nusbaum, R. M. Myers, M. Brown, W. Li, and X. S. Liu. 2008. Model-based Analysis of ChIP-Seq (MACS). Genome Biol 9: R137.
+UCSC_userApps/v317:
+Kent W. J., A. S. Zweig, G. Barber, A. S. Hinrichs, and D. Karolchik. 2010. BigWig and BigBed: enabling browsing of large distributed data sets. Bioinformatics 26(17): 2204-2207.
+R/3.3.2-gccmkl:
+R Core Team 2014. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria. URL http://www.R-project.org/.
+meme/4.11.1-gcc-openmpi:
+Bailey T. L., M. Bodén, F. A. Buske, M. Frith, C. E. Grant, L. Clementi, J. Ren, W. W. Li, and W. S. Noble. 2009. MEME SUITE: tools for motif discovery and searching. Nucleic Acids Research 37: W202-W208.
+Machanick P., and T. L. Bailey. 2011. MEME-ChIP: motif analysis of large DNA datasets. Bioinformatics 27(12): 1696-1697.
+R ChIPseeker:
+Yu G., L. Wang, and Q. He. 2015. ChIPseeker: an R/Bioconductor package for ChIP peak annotation, comparison and visualization. Bioinformatics 31(14): 2382-2383. doi: 10.1093/bioinformatics/btv145.
+R DiffBind:
+Stark R., and G. Brown. 2011. DiffBind: differential binding analysis of ChIP-Seq peak data. http://bioconductor.org/packages/release/bioc/vignettes/DiffBind/inst/doc/DiffBind.pdf.
+Ross-Innes C. S., R. Stark, A. E. Teschendorff, K. A. Holmes, H. R. Ali, M. J. Dunning,  G. D. Brown, O. Gojis, I. O. Ellis, A. R. Green, S. Ali, S. Chin, C. Palmieri, C. Caldas, and J. S. Carroll. 2012. Differential oestrogen receptor binding is associated with clinical outcome in breast cancer. Nature 481: 389-393. http://www.nature.com/nature/journal/v481/n7381/full/nature10730.html.
--- a/docs/xcor_header.txt
+++ b/docs/xcor_header.txt
+See https://github.com/crazyhottommy/phantompeakqualtools for more details
+COL1: Filename: tagAlign/BAM filename
+COL2: numReads: effective sequencing depth i.e. total number of mapped reads in input file
+COL3: estFragLen: comma separated strand cross-correlation peak(s) in decreasing order of correlation.
+	  The top 3 local maxima locations that are within 90% of the maximum cross-correlation value are output.
+	  In almost all cases, the top (first) value in the list represents the predominant fragment length.
+	  If you want to keep only the top value simply run
+	  sed -r 's/,[^\t]+//g' <outFile> > <newOutFile>
+COL4: corr_estFragLen: comma separated strand cross-correlation value(s) in decreasing order (COL3 follows the same order)
+COL5: phantomPeak: Read length/phantom peak strand shift
+COL6: corr_phantomPeak: Correlation value at phantom peak
+COL7: argmin_corr: strand shift at which cross-correlation is lowest
+COL8: min_corr: minimum value of cross-correlation
+COL9: Normalized strand cross-correlation coefficient (NSC) = COL4 / COL8
+COL10: Relative strand cross-correlation coefficient (RSC) = (COL4 - COL8) / (COL6 - COL8)
+COL11: QualityTag: Quality tag based on thresholded RSC (codes: -2:veryLow,-1:Low,0:Medium,1:High,2:veryHigh)
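
Note on the header above: NSC and RSC can be recomputed from COL4, COL6, and COL8 and compared against the thresholds quoted in the README (NSC > 1.1 and RSC > 0.8 for an experiment, below those values for an input). The Python sketch below illustrates this under stated assumptions: `*.cc.qc` lines are taken to be tab-separated in the column order listed above, and the first value of COL4 is used when it holds a comma-separated list. The function names and the sample line are made up for illustration and are not real pipeline output.

```python
def cross_correlation_scores(cc_qc_line):
    """Recompute (NSC, RSC) from one tab-separated *.cc.qc line."""
    fields = cc_qc_line.rstrip("\n").split("\t")
    # COL4 may list several values; the first belongs to the top fragment length.
    corr_est_frag_len = float(fields[3].split(",")[0])
    corr_phantom_peak = float(fields[5])  # COL6
    min_corr = float(fields[7])           # COL8
    nsc = corr_est_frag_len / min_corr
    rsc = (corr_est_frag_len - min_corr) / (corr_phantom_peak - min_corr)
    return nsc, rsc


def meets_readme_thresholds(nsc, rsc, is_input=False):
    """Apply the README guidance: experiments above 1.1/0.8, inputs below."""
    if is_input:
        return nsc < 1.1 and rsc < 0.8
    return nsc > 1.1 and rsc > 0.8


# Made-up values, for illustration only.
line = "sample.tagAlign.gz\t1000000\t150,200\t0.30,0.28\t50\t0.25\t1500\t0.20\t1.5\t2.0\t2"
nsc, rsc = cross_correlation_scores(line)
print(round(nsc, 3), round(rsc, 3), meets_readme_thresholds(nsc, rsc))
```
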
--- a/workflow/main.nf
+++ b/workflow/main.nf
@@ -521,7 +521,8 @@ process motifSearch {
  script:
  """
-  module load R/3.3.2-gccmkl
+  module load meme/4.11.1-gcc-openmpi
+  module load bedtools/2.26.0
  python3 $baseDir/scripts/motif_search.py -d $designMotifSearch -g $fasta -p $topPeakCount
  """
 }
@@ -556,8 +557,7 @@ process diffPeaks {
  """
  module load python/3.6.1-2-anaconda
-  module load meme/4.11.1-gcc-openmpi
+  module load R/3.3.2-gccmkl
-  module load bedtools/2.26.0
  Rscript $baseDir/scripts/diff_peaks.R $designDiffPeaks
  """
 }

--- a/workflow/tests/test_diff_peaks.py
+++ b/workflow/tests/test_diff_peaks.py
@@ -44,7 +44,7 @@ def test_diffbind_singleend_multiple_rep():
        assert os.path.exists(os.path.join(test_output_path, 'ENCSR238SGC_vs_ENCSR272GNQ_diffbind.bed'))
        diffbind_file = test_output_path + 'ENCSR238SGC_vs_ENCSR272GNQ_diffbind.csv'
    assert os.path.exists(diffbind_file)
-    assert utils.count_lines(diffbind_file) == 197471
+    assert utils.count_lines(diffbind_file) == 201217
 @pytest.mark.paireddiff