Commit 08afbefe authored by Venkat Malladi's avatar Venkat Malladi
Browse files

Merge branch 'master' into 26-negative-fraglength

parents 177d794f 60b877d9
......@@ -97,3 +97,16 @@ ENV/
# mypy
.mypy_cache/
# Mac OS
.DS_Store
# nextflow analysis folders/files
pipeline_trace*.txt*
.nextflow*.log*
report*.html*
timeline*.html*
/workflow/output/*
/work/*
/test_data/*
/.nextflow/*
before_script:
- module add python/3.6.1-2-anaconda
- pip install --user pytest-pythonpath pytest-cov
- pip install --user pytest-pythonpath==0.7.1 pytest-cov==2.5.1
- module load nextflow/0.31.0
- ln -s /work/BICF/s163035/chipseq/*fastq.gz test_data/
- ln -s /project/shared/bicf_workflow_ref/workflow_testdata/chipseq/*fastq.gz test_data/
stages:
- unit
- astrocyte
- single
- multiple
- skip
user_configuration:
stage: unit
......@@ -28,32 +29,46 @@ astrocyte:
single_end_mouse:
stage: single
only:
- master
script:
- nextflow run workflow/main.nf -resume
- nextflow run workflow/main.nf --astrocyte true -resume
- pytest -m singleend
artifacts:
expire_in: 2 days
paired_end_human:
stage: single
only:
- branches
except:
- master
script:
- nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_ENCSR729LGA_PE.txt" --genome 'GRCh38' --pairedEnd true -resume
- nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_ENCSR729LGA_PE.txt" --genome 'GRCh38' --pairedEnd true --astrocyte false -resume
- pytest -m pairedend
artifacts:
expire_in: 2 days
single_end_diff:
stage: multiple
only:
- branches
except:
- master
script:
- nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_diff_SE.txt" --genome 'GRCm38' -resume
- nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_diff_SE.txt" --genome 'GRCm38' --astrocyte false -resume
- pytest -m singleend
- pytest -m singlediff
artifacts:
expire_in: 2 days
paired_end_diff:
only:
- master
stage: multiple
script:
- nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_diff_PE.txt" --genome 'GRCh38' --pairedEnd true -resume
- nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_diff_PE.txt" --genome 'GRCh38' --pairedEnd true --astrocyte false -resume
- pytest -m pairedend
- pytest -m paireddiff
artifacts:
expire_in: 2 days
single_end_skip:
stage: skip
only:
- master
script:
- nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_diff_SE.txt" --genome 'GRCm38' --skipDiff true --skipMotif true --skipPlotProfile true --astrocyte false -resume
- pytest -m singleskip_true
Please fill in the appropriate checklist below (delete whatever is not relevant).
These are the most common things requested on pull requests (PRs).
## PR checklist
- [ ] This comment contains a description of changes (with reason)
- [ ] If you've fixed a bug or added code that should be tested, add tests!
- [ ] Documentation in `docs` is updated
- [ ] `CHANGELOG.md` is updated
- [ ] `README.md` is updated
- [ ] `LICENSE.md` is updated with new contributors
# Changelog
All notable changes to this project will be documented in this file.
## [Unreleased]
- Fix references.md link in citation of README.md
- Add Nextflow to references.md
- Fix pool_and_psuedoreplicate.py to run single experiment
- Add test data for test_pool_and_pseudoreplicate
- Add PlotProfile Option
- Add Python version to MultiQC
- Add and Update tests
- Use GTF files instead of TxDb and org libraries in Annotate Peaks
- Make gtf and geneName files as param inputs
- Fix xcor to increase file size for --random-source
- Fix skip diff test for paired-end data
## [publish_1.0.6 ] - 2019-05-31
### Added
- MIT LICENSE
- Check for correct genome input
- Check for correct output for motif serach
### Fixed
- Path to reference fasta file
## [publish_1.0.5 ] - 2019-05-16
### Fixed
- Retitled documentation for skipDiff and skipMotif to be more clear
- Add missing python module for motif search
- Updated links for phantompeaktools and references
- Fix callPeaks for single control experiments
## [publish_1.0.3 ] - 2019-04-23
### Fixed
- Variable name for design file for Astrocyte
- File path for design file to work for Astrocyte
## [publish_1.0.0 ] - 2019-04-23
Initial release of pipeline
MIT License
Copyright (c) 2019, University of Texas Southwestern Medical Center.
All rights reserved.
Contributors: Spencer D. Barnes, Holly Ruess, Jeremy A. Mathews, Beibei Chen, Venkat S. Malladi
Department: Bioinformatic Core Facility, Department of Bioinformatics
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
# **CHIPseq Manual**
## Version 1.0.6
## May 31, 2019
# BICF ChIP-seq Pipeline
[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/badges/master/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/commits/master)
......@@ -5,13 +9,150 @@
[![Nextflow](https://img.shields.io/badge/nextflow-%E2%89%A50.24.0-brightgreen.svg
)](https://www.nextflow.io/)
[![Astrocyte](https://img.shields.io/badge/astrocyte-%E2%89%A50.1.0-blue.svg)](https://astrocyte-test.biohpc.swmed.edu/static/docs/index.html)
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.2648845.svg)](https://doi.org/10.5281/zenodo.2648845)
## Introduction
BICF ChIPseq is a bioinformatics best-practice analysis pipeline used for ChIP-seq (chromatin immunoprecipitation sequencing) data analysis at [BICF](http://www.utsouthwestern.edu/labs/bioinformatics/) at [UT Southwestern Dept. of Bioinformatics](http://www.utsouthwestern.edu/departments/bioinformatics/).
BICF ChIPseq is a bioinformatics best-practice analysis pipeline used for ChIP-seq (chromatin immunoprecipitation sequencing) data analysis at [BICF](http://www.utsouthwestern.edu/labs/bioinformatics/) at [UT Southwestern Department of Bioinformatics](http://www.utsouthwestern.edu/departments/bioinformatics/).
The pipeline uses [Nextflow](https://www.nextflow.io), a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.
This pipeline is primarily used with a SLURM cluster on the [BioHPC Cluster](https://biohpc.swmed.edu/). However, the pipeline should be able to run on any system that Nextflow supports.
This pipeline is primarily used with a SLURM cluster on the [BioHPC Cluster](https://portal.biohpc.swmed.edu/content/). However, the pipeline should be able to run on any system that supports Nextflow.
Additionally, the pipeline is designed to work with [Astrocyte Workflow System](https://astrocyte.biohpc.swmed.edu/static/docs/index.html) using a simple web interface.
Current version of the software and issue reports are at
https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis
To download the current version of the software
```bash
$ git clone git@git.biohpc.swmed.edu:BICF/Astrocyte/chipseq_analysis.git
```
## Input files
##### 1) Fastq Files
+ You will need the full path to the files for the Bash Scipt
##### 2) Design File
+ The Design file is a tab-delimited file with 8 columns for Single-End and 9 columns for Paired-End. Letter, numbers, and underlines can be used in the names. However, the names can only begin with a letter. Columns must be as follows:
1. sample_id          a short, unique, and concise name used to label output files; will be used as a control_id if it is the control sample
2. experiment_id    biosample_treatment_factor; same name given for all replicates of treatment. Will be used for the consensus header.
3. biosample          symbol for tissue type or cell line
4. factor                 symbol for antibody target
5. treatment           symbol of treatment applied
6. replicate             a number, usually from 1-3 (i.e. 1)
7. control_id          sample_id name that is the control for this sample
8. fastq_read1        name of fastq file 1 for SE or PC data
9. fastq_read2        name of fastq file 2 for PE data
+ See [HERE](test_data/design_ENCSR729LGA_PE.txt) for an example design file, paired-end
+ See [HERE](test_data/design_ENCSR238SGC_SE.txt) for an example design file, single-end
##### 3) Bash Script
+ You will need to create a bash script to run the CHIPseq pipeline on [BioHPC](https://portal.biohpc.swmed.edu/content/)
+ This pipeline has been optimized for the correct partition
+ See [HERE](docs/CHIPseq.sh) for an example bash script
+ The parameters that must be specified are:
- --reads '/path/to/files/name.fastq.gz'
- --designFile '/path/to/file/design.txt',
- --genome 'GRCm38', 'GRCh38', or 'GRCh37' (if you need to use another genome contact the [BICF](mailto:BICF@UTSouthwestern.edu))
- --pairedEnd 'true' or 'false' (where 'true' is PE and 'false' is SE; default 'false')
- --outDir (optional) path and folder name of the output data, example: /home2/s000000/Desktop/Chipseq_output (if not specficied will be under workflow/output/)
## Pipeline
+ There are 11 steps to the pipeline
1. Check input files
2. Trim adaptors TrimGalore!
3. Aligned trimmed reads with bwa, and sorts/converts to bam with samtools
4. Mark duplicates with Sambamba, and filter reads with samtools
5. Quality metrics with deep tools
6. Calculate cross-correlation using PhantomPeakQualTools
7. Call peaks with MACS
8. Calculate consensus peaks
9. Annotate all peaks using ChipSeeker
10. Calculate Differential Binding Activity with DiffBind (If more than 1 rep in more than 1 experiment)
11. Use MEME-ChIP to find motifs in original peaks
See [FLOWCHART](docs/flowchart.pdf)
## Output Files
Folder | File | Description
--- | --- | ---
design | N/A | Inputs used for analysis; can ignore
trimReads | *_trimming_report.txt | report detailing how many reads were trimmed
trimReads | *_trimmed.fq.gz | trimmed fastq files used for analysis
alignReads | *.srt.bam.flagstat.qc | QC metrics from the mapping process
alignReads | *.srt.bam | sorted bam file
filterReads | *.dup.qc | QC metrics of find duplicate reads (sambamba)
filterReads | *.filt.nodup.bam | filtered bam file with duplicate reads removed
filterReads | *.filt.nodup.bam.bai | indexed filtered bam file
filterReads | *.filt.nodup.flagstat.qc | QC metrics of filtered bam file (mapping stats, samtools)
filterReads | *.filt.nodup.pbc.qc | QC metrics of library complexity
convertReads | *.filt.nodup.bedse.gz | bed alignment in BEDPE format
convertReads | *.filt.nodup.tagAlign.gz | bed alignent in BEDPE format, same as bedse unless samples are paired-end
multiqcReport | multiqc_report.html | Quality control report of NRF, PBC1, PBC2, NSC, and RSC. Also contains software versions and references to cite.
experimentQC | coverage.pdf | plot to assess the sequencing depth of a given sample
experimentQC | *_fingerprint.pdf | plot to determine if the antibody-treatment enriched sufficiently
experimentQC | heatmeap_SpearmanCorr.pdf | plot of Spearman correlation between samples
experimentQC | heatmeap_PearsonCorr.pdf | plot of Pearson correlation between samples
experimentQC | sample_mbs.npz | array of multiple BAM summaries
crossReads | *.cc.plot.pdf | Plot of cross-correlation to assess signal-to-noise ratios
crossReads | *.cc.qc | cross-correlation metrics. File [HEADER](docs/xcor_header.txt)
callPeaksMACS | pooled/*pooled.fc_signal.bw | bigwig data file; raw fold enrichment of sample/control
callPeaksMACS | pooled/*pooled_peaks.xls | Excel file of peaks
callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control signal adjusted for pvalue significance
callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
peakAnnotation | *.chipseeker_annotation.tsv | annotated narrowPeaks file
peakAnnotation | *.chipseeker_pie.pdf | pie graph of where narrow annotated peaks occur
peakAnnotation | *.chipseeker_upsetplot.pdf | upsetplot showing the count of overlaps of the genes with different annotated location
motifSearch | *_memechip/index.html | interactive HTML link of MEME output
motifSearch | sorted-*.replicated.narrowPeak | Top 600 peaks sorted by p-value; input for motifSearch
motifSearch | *_memechip/combined.meme | MEME identified motifs
diffPeaks | heatmap.pdf | Use only for replicated samples; heatmap of relationship of peak location and peak intensity
diffPeaks | normcount_peaksets.txt | Use only for replicated samples; peak set values of each sample
diffPeaks | pca.pdf | Use only for replicated samples; PCA of peak location and peak intensity
diffPeaks | *_diffbind.bed | Use only for replicated samples; bed file of peak locations between replicates
diffPeaks | *_diffbind.csv | Use only for replicated samples; CSV file of peaks between replicates
plotProfile | plotProfile.png | Plot profile of the TSS region
plotProfile | computeMatrix.gz | Compute Matrix from deeptools to create custom plots other than plotProfile
## Common Quality Control Metrics
+ These are the list of files that should be reviewed before continuing on with the CHIPseq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
1. multiqcReport/multiqc_report.html: follow the ChiP-seq standards [HERE](https://www.encodeproject.org/chip-seq/);
2. experimentQC/*_fingerprint.pdf: make sure the plots information is correct for your antibody/input. See [HERE](https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html) for more details.
3. crossReads/*cc.plot.pdf: make sure your sample data has the correct signal intensity and location. See [HERE](hhttps://ccg.epfl.ch//var/sib_april15/cases/landt12/strand_correlation.html) for more details.
4. crossReads/*.cc.qc: Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See [HERE](https://genome.ucsc.edu/encode/qualityMetrics.html) for more details.
5. experimentQC/coverage.pdf, experimentQC/heatmeap_SpearmanCorr.pdf, experimentQC/heatmeap_PearsonCorr.pdf: See [HERE](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html) for more details.
## Common Errors
If you find an error, please let the [BICF](mailto:BICF@UTSouthwestern.edu) know and we will add it here.
## Citation
Please cite individual programs and versions used [HERE](docs/references.md), and the pipeline doi:[10.5281/zenodo.2648844](https://doi.org/10.5281/zenodo.2648844). Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).
## Programs and Versions
+ python/3.6.1-2-anaconda [website](https://www.anaconda.com/download/#linux) [citation](docs/references.md)
+ trimgalore/0.4.1 [website](https://github.com/FelixKrueger/TrimGalore) [citation](docs/references.md)
+ cutadapt/1.9.1 [website](https://cutadapt.readthedocs.io/en/stable/index.html) [citation](docs/references.md)
+ bwa/intel/0.7.12 [website](http://bio-bwa.sourceforge.net/) [citation](docs/references.md)
+ samtools/1.6 [website](http://samtools.sourceforge.net/) [citation](docs/references.md)
+ sambamba/0.6.6 [website](http://lomereiter.github.io/sambamba/) [citation](docs/references.md)
+ bedtools/2.26.0 [website](https://bedtools.readthedocs.io/en/latest/) [citation](docs/references.md)
+ deeptools/2.5.0.1 [website](https://deeptools.readthedocs.io/en/develop/) [citation](docs/references.md)
+ phantompeakqualtools/1.2 [website](https://github.com/kundajelab/phantompeakqualtools) [citation](docs/references.md)
+ macs/2.1.0-20151222 [website](http://liulab.dfci.harvard.edu/MACS/) [citation](docs/references.md)
+ UCSC_userApps/v317 [website](https://genome.ucsc.edu/util.html) [citation](docs/references.md)
+ R/3.4.1 [website](https://www.r-project.org/) [citation](docs/references.md)
+ SPP/1.14
+ meme/4.11.1-gcc-openmpi [website](http://meme-suite.org/doc/install.html?man_type=web) [citation](docs/references.md)
+ ChIPseeker [website](https://bioconductor.org/packages/release/bioc/html/ChIPseeker.html) [citation](docs/references.md)
+ DiffBind [website](https://bioconductor.org/packages/release/bioc/html/DiffBind.html) [citation](docs/references.md)
Additionally, the pipeline is designed to work with [Astrocyte Workflow System](https://astrocyte-test.biohpc.swmed.edu/static/docs/index.html) using a simple web interface.
## Credits
This example worklow is derived from original scripts kindly contributed by the Bioinformatic Core Facility ([BICF](https://www.utsouthwestern.edu/labs/bioinformatics/)), in the [Department of Bioinformatics](https://www.utsouthwestern.edu/departments/bioinformatics/).
......@@ -9,7 +9,7 @@
# A unique identifier for the workflow package, text/underscores only
name: 'chipseq_analysis_bicf'
# Who wrote this?
author: 'Beibei Chen and Venkat Malladi'
author: 'Holly Ruess, Spencer D. Barnes, Beibei Chen and Venkat Malladi'
# A contact email address for questions
email: 'bicf@utsouthwestern.edu'
# A more informative title for the workflow package
......@@ -51,6 +51,7 @@ workflow_modules:
- 'UCSC_userApps/v317'
- 'R/3.3.2-gccmkl'
- 'meme/4.11.1-gcc-openmpi'
- 'pandoc/2.7'
# A list of parameters used by the workflow, defining how to present them,
......@@ -93,29 +94,25 @@ workflow_parameters:
description: |
One or more input FASTQ files from a ChIP-seq expereiment and a design
file with the link bewetwen the same file name and sample id
regex: ".*(fastq|fq)*"
regex: ".*(fastq|fq)*gz"
min: 2
- id: pairedEnd
type: select
required: true
choices:
- [ 'true', 'True']
- [ 'false', 'False']
- [ 'true', 'true']
- [ 'false', 'false']
description: |
In single-end sequencing, the sequencer reads a fragment from only one
end to the other, generating the sequence of base pairs. In paired-end
reading it starts at one read, finishes this direction at the specified
read length, and then starts another round of reading from the opposite
end of the fragment.
If Paired-end: True, if Single-end: False.
- id: design
- id: designFile
type: file
required: true
description: |
A design file listing sample id, fastq files, corresponding control id
and additional information about the sample.
regex: ".*tsv"
regex: ".*txt"
- id: genome
type: select
......@@ -127,13 +124,41 @@ workflow_parameters:
description: |
Reference species and genome used for alignment and subsequent analysis.
- id: skipDiff
type: select
required: true
choices:
- [ 'true', 'true']
- [ 'false', 'false']
description: |
Skip differential peak analysis
- id: skipMotif
type: select
required: true
choices:
- [ 'true', 'true']
- [ 'false', 'false']
description: |
Skip motif calling
- id: skipPlotProfile
type: select
required: true
choices:
- [ 'true', 'true']
- [ 'false', 'false']
description: |
Skip Plot Profile Analysis
- id: astrocyte
type: string
type: select
choices:
- [ 'true', 'true' ]
required: true
default: 'true'
regex: "true"
description: |
Ensure configuraton for astrocyte
Ensure configuraton for astrocyte.
# -----------------------------------------------------------------------------
......@@ -154,8 +179,4 @@ vizapp_cran_packages:
# List of any Bioconductor packages, not provided by the modules,
# that must be made available to the vizapp
vizapp_bioc_packages:
- none
vizapp_github_packages:
- js229/Vennerable
vizapp_bioc_packages: []
#!/bin/bash
#SBATCH --job-name=CHIPseq
#SBATCH --partition=super
#SBATCH --output=CHIPseq.%j.out
#SBATCH --error=CHIPseq.%j.err
module load nextflow/0.31.0
module add python/3.6.1-2-anaconda
nextflow run workflow/main.nf \
--reads '/path/to/*fastq.gz' \
--designFile '/path/to/design.txt' \
--genome 'GRCm38' \
--pairedEnd 'true'
SampleID,Tissue,Factor,Condition,Replicate,Peaks,bamReads,bamControl,ControlID,PeakCaller
A_1,A,H3K27AC,A,1,A_1.broadPeak,A_1.bam,A_1_input.bam,A_1_input,bed
A_2,A,H3K27AC,A,2,A_2.broadPeak,A_2.bam,A_2_input.bam,A_2_input,bed
B_1,B,H3K27AC,B,1,B_1.broadPeak,B_1.bam,B_1_input.bam,B_1_input,bed
B_2,B,H3K27AC,B,2,B_2.broadPeak,B_2.bam,B_2_input.bam,B_2_input,bed
C_1,C,H3K27AC,C,1,C_1.broadPeak,C_1.bam,C_1_input.bam,C_1_input,bed
C_2,C,H3K27AC,C,2,C_2.broadPeak,C_2.bam,C_2_input.bam,C_2_input,bed
sample_id experiment_id biosample factor treatment replicate control_id fastq_read1
A1 A tissueA H3K27AC None 1 B1 A1.fastq.gz
A2 A tissueA H3K27AC None 2 B2 A2.fastq.gz
B1 B tissueB Input None 1 B1 B1.fastq.gz
B2 B tissueB Input None 2 B2 B2.fastq.gz
# Astrocyte ChIPseq analysis Workflow Package
# BICF ChIP-seq Analysis Workflow
## Introduction
**ChIP-seq Analysis** is a bioinformatics best-practice analysis pipeline used for chromatin immunoprecipitation (ChIP-seq) data analysis.
The pipeline uses [Nextflow](https://www.nextflow.io), a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.
### Pipeline Steps
Report issues to the Bioinformatic Core Facility [BICF](mailto:BICF@UTSouthwestern.edu)
1) Trim adaptors TrimGalore!
2) Align with BWA
3) Filter reads with Sambamba S
4) Quality control with DeepTools
5) Calculate Cross-correlation using SPP and PhantomPeakQualTools
6) Signal profiling using MACS2
7) Call consenus peaks
8) Annotate all peaks using ChipSeeker
9) Use MEME-ChIP to find motifs in original peaks
10) Find differential expressed peaks using DiffBind (If more than 1 experiment)
### Pipeline Steps
+ There are 11 steps to the pipeline
1. Check input files
2. Trim adaptors TrimGalore!
3. Aligned trimmed reads with bwa, and sorts/converts to bam with samtools
4. Mark duplicates with Sambamba, and filter reads with samtools
5. Quality metrics with deep tools
6. Calculate cross-correlation using PhantomPeakQualTools
7. Call peaks with MACS
8. Calculate consensus peaks
9. Annotate all peaks using ChipSeeker
10. Calculate Differential Binding Activity with DiffBind (If more than 1 rep in more than 1 experiment)
11. Use MEME-ChIP to find motifs in original peaks
## Workflow Parameters
reads - Choose all ChIP-seq fastq files for analysis.
pairedEnd - Choose True/False if data is paired-end
design - Choose the file with the experiment design information. TSV format
1. One or more input FASTQ files from a ChIP-seq expereiment and a design file with the link bewetwen the same file name and sample id (required) - Choose all ChIP-seq fastq files for analysis.
2. In single-end sequencing, the sequencer reads a fragment from only one end to the other, generating the sequence of base pairs. In paired-end reading it starts at one read, finishes this direction at the specified read length, and then starts another round of reading from the opposite end of the fragment. (Paired-end: True, Single-end: False) (required)
3. A design file listing sample id, fastq files, corresponding control id and additional information about the sample.
genome - Choose a genomic reference (genome).
4. Reference species and genome used for alignment and subsequent analysis. (required)
5. Run differential peak analysis (required). Must have at least 2 replicates per experiment and at least 2 experiments.
6. Run motif calling (required). Top 600 peaks sorted by p-value.
7. Ensure configuration for astrocyte. (required; always true)
## Design file
The following columns are necessary, must be named as in template. An design file template can be downloaded [HERE](https://git.biohpc.swmed.edu/bchen4/chipseq_analysis/raw/master/docs/design_example.csv)
SampleID
The id of the sample. This will be the header in output files, please make sure it is concise
Tissue
Tissue of the sample
Factor
Factor of the experiment
Condition
This is the group that will be used for pairwise differential expression analysis
Replicate
Replicate id
Peaks
The file name of the peak file for this sample
bamReads
The file name of the IP BAM for this sample
bamControl
The file name of the control BAM for this sample
ContorlID
The id of the control sample
PeakCaller
The peak caller used
+ The Design file is a tab-delimited file with 8 columns for Single-End and 9 columns for Paired-End. Letter, numbers, and underlines can be used in the names. However, the names can only begin with a letter. Columns must be as follows:
1. sample_id&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a short, unique, and concise name used to label output files; will be used as a control_id if it is the control sample
2. experiment_id&nbsp;&nbsp;&nbsp;&nbsp;biosample_treatment_factor; same name given for all replicates of treatment. Will be used for the consensus header.
3. biosample&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;symbol for tissue type or cell line
4. factor&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;symbol for antibody target
5. treatment&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;symbol of treatment applied
6. replicate&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;a number, usually from 1-3 (i.e. 1)
7. control_id&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;sample_id name that is the control for this sample
8. fastq_read1&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;name of fastq file 1 for SE or PC data
9. fastq_read2&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;name of fastq file 2 for PE data
+ See [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/test_data/design_ENCSR729LGA_PE.txt) for an example design file, paired-end
+ See [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/test_data/design_ENCSR238SGC_SE.txt) for an example design file, single-end
## Output Files
Folder | File | Description
--- | --- | ---
design | N/A | Inputs used for analysis; can ignore
trimReads | *_trimming_report.txt | report detailing how many reads were trimmed
trimReads | *_trimmed.fq.gz | trimmed fastq files used for analysis
alignReads | *.srt.bam.flagstat.qc | QC metrics from the mapping process
alignReads | *.srt.bam | sorted bam file
filterReads | *.dup.qc | QC metrics of find duplicate reads (sambamba)
filterReads | *.filt.nodup.bam | filtered bam file with duplicate reads removed
filterReads | *.filt.nodup.bam.bai | indexed filtered bam file
filterReads | *.filt.nodup.flagstat.qc | QC metrics of filtered bam file (mapping stats, samtools)
filterReads | *.filt.nodup.pbc.qc | QC metrics of library complexity
convertReads | *.filt.nodup.bedse.gz | bed alignment in BEDPE format
convertReads | *.filt.nodup.tagAlign.gz | bed alignent in BEDPE format, same as bedse unless samples are paired-end
multiqcReport | multiqc_report.html | Quality control report of NRF, PBC1, PBC2, NSC, and RSC. Also contains software versions and references to cite.
experimentQC | coverage.pdf | plot to assess the sequencing depth of a given sample
experimentQC | *_fingerprint.pdf | plot to determine if the antibody-treatment enriched sufficiently
experimentQC | heatmeap_SpearmanCorr.pdf | plot of Spearman correlation between samples
experimentQC | heatmeap_PearsonCorr.pdf | plot of Pearson correlation between samples
experimentQC | sample_mbs.npz | array of multiple BAM summaries
crossReads | *.cc.plot.pdf | Plot of cross-correlation to assess signal-to-noise ratios
crossReads | *.cc.qc | cross-correlation metrics. File [HEADER](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/docs/xcor_header.txt)
callPeaksMACS | pooled/*pooled.fc_signal.bw | bigwig data file; raw fold enrichment of sample/control
callPeaksMACS | pooled/*pooled_peaks.xls | Excel file of peaks
callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control signal adjusted for pvalue significance
callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
peakAnnotation | *.chipseeker_annotation.tsv | annotated narrowPeaks file
peakAnnotation | *.chipseeker_pie.pdf | pie graph of where narrow annotated peaks occur
peakAnnotation | *.chipseeker_upsetplot.pdf | upsetplot showing the count of overlaps of the genes with different annotated location
motifSearch | *_memechip/index.html | interactive HTML link of MEME output
motifSearch | sorted-*.replicated.narrowPeak | Top 600 peaks sorted by p-value; input for motifSearch
motifSearch | *_memechip/combined.meme | MEME identified motifs
diffPeaks | heatmap.pdf | Use only for replicated samples; heatmap of relationship of peak location and peak intensity
diffPeaks | normcount_peaksets.txt | Use only for replicated samples; peak set values of each sample
diffPeaks | pca.pdf | Use only for replicated samples; PCA of peak location and peak intensity
diffPeaks | *_diffbind.bed | Use only for replicated samples; bed file of peak locations between replicates
diffPeaks | *_diffbind.csv | Use only for replicated samples; CSV file of peaks between replicates
plotProfile | plotProfile.png | Plot profile of the TSS region
plotProfile | computeMatrix.gz | Compute Matrix from deeptools to create custom plots other than plotProfile
## Common Quality Control Metrics
+ These are the list of files that should be reviewed before continuing on with the CHIPseq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
1. multiqcReport/multiqc_report.html: follow the ChiP-seq standards [HERE](https://www.encodeproject.org/chip-seq/);
2. experimentQC/*_fingerprint.pdf: make sure the plots information is correct for your antibody/input. See [HERE](https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html) for more details.
3. crossReads/*cc.plot.pdf: make sure your sample data has the correct signal intensity and location. See [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/docs/phantompeaks.md) for more details.
4. crossReads/*.cc.qc: Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See [HERE](https://genome.ucsc.edu/encode/qualityMetrics.html) for more details.
5. experimentQC/coverage.pdf, experimentQC/heatmeap_SpearmanCorr.pdf, experimentQC/heatmeap_PearsonCorr.pdf: See [HERE](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html) for more details.
### Credits
This example worklow is derived from original scripts kindly contributed by the Bioinformatic Core Facility (BICF), Department of Bioinformatics
This example worklow is derived from original scripts kindly contributed by the Bioinformatic Core Facility ([BICF](https://www.utsouthwestern.edu/labs/bioinformatics/)), in the [Department of Bioinformatics](https://www.utsouthwestern.edu/departments/bioinformatics/).
Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).
### References
* ChipSeeker: http://bioconductor.org/packages/release/bioc/html/ChIPseeker.html
* DiffBind: http://bioconductor.org/packages/release/bioc/html/DiffBind.html
* Deeptools: https://deeptools.github.io/
* MEME-ChIP: http://meme-suite.org/doc/meme-chip.html
+ python/3.6.1-2-anaconda [website](https://www.anaconda.com/download/#linux) [citation](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/docs/references.md)
+ trimgalore/0.4.1 [website](https://github.com/FelixKrueger/TrimGalore) [citation](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/docs/references.md)
+ cutadapt/1.9.1 [website](https://cutadapt.readthedocs.io/en/stable/index.html) [citation](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/docs/references.md)
+ bwa/intel/0.7.12 [website](http://bio-bwa.sourceforge.net/) [citation](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/docs/references.md)
+ samtools/1.6 [website](http://samtools.sourceforge.net/) [citation](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/docs/references.md)
+ sambamba/0.6.6 [website](http://lomereiter.github.io/sambamba/) [citation](https://git.biohpc.swmed.edu/BICF/Astrocyte/chipseq_analysis/blob/master/docs/references.md)