+ The Design file is a tab-delimited file with 8 columns for Single-End and 9 columns for Paired-End. Letters, numbers, and underscores can be used in the names. However, the names can only begin with a letter. Columns must be as follows:
1. sample_id a short, unique, and concise name used to label output files; will be used as a control_id if it is the control sample
2. experiment_id biosample_treatment_factor; same name given for all replicates of treatment. Will be used for the consensus header.
3. biosample symbol for tissue type or cell line
4. factor symbol for antibody target
5. treatment symbol of treatment applied
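The design file rules can be checked before a run is launched. The following is a minimal sketch, not part of the pipeline: `check_design` is a hypothetical helper, the column names are the nine listed in the Design file section, and it assumes the first row is a header and that the naming rule applies to `sample_id`, `experiment_id`, and `control_id`.

```python
import csv
import io
import re

# Column names as listed in the Design file section; fastq_read2 is paired-end only.
SE_COLUMNS = ["sample_id", "experiment_id", "biosample", "factor",
              "treatment", "replicate", "control_id", "fastq_read1"]
PE_COLUMNS = SE_COLUMNS + ["fastq_read2"]

# Letters, numbers, and underscores only; the first character must be a letter.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def check_design(text, paired_end):
    """Return a list of problems found in tab-delimited design text (hypothetical helper)."""
    expected = PE_COLUMNS if paired_end else SE_COLUMNS
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    if reader.fieldnames != expected:
        return [f"expected columns {expected}, found {reader.fieldnames}"]
    problems = []
    seen = set()
    for line_no, row in enumerate(reader, start=2):
        sample = row["sample_id"]
        if sample in seen:
            problems.append(f"line {line_no}: duplicate sample_id {sample!r}")
        seen.add(sample)
        # Assumption: the naming rule covers these three identifier columns.
        for col in ("sample_id", "experiment_id", "control_id"):
            if not NAME_RE.match(row[col] or ""):
                problems.append(f"line {line_no}: invalid {col} {row[col]!r}")
    return problems
```

An empty list means no problems were found; each entry otherwise names the offending line and column.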
callPeaksMACS | *.fc_signal.bw | bigwig data file; raw fold enrichment of sample/control
callPeaksMACS | pooled/*pooled.fc_signal.bw | bigwig data file; raw fold enrichment of sample/control
callPeaksMACS | *.pvalue_signal.bw | bigwig data file; sample/control signal adjusted for pvalue significance
callPeaksMACS | pooled/*pooled_peaks.xls | Excel file of peaks
callPeaksMACS | *_peaks.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control signal adjusted for pvalue significance
consensusPeaks | design_annotatePeaks.tsv | design file; can ignore
callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
consensusPeaks | design_diffPeaks.csv | design file; can ignore
consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | unique_experiments.csv | design file; can ignore
peakAnnotation | *.chipseeker_pie.pdf | pie graph of where narrow annotated peaks occur
peakAnnotation | *.chipseeker_upsetplot.pdf | upsetplot showing the count of overlaps of the genes with different annotated location
motifSearch | *_memechip/index.html | interactive HTML link of MEME output
...
@@ -119,10 +118,11 @@ diffPeaks | *_diffbind.csv | Use only for replicated samples; CSV file of peaks
## Common Quality Control Metrics
+ This is the list of files that should be reviewed before continuing with the ChIP-seq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
1. filterReads/*.filt.nodup.pbc.qc: follow the ChIP-seq standards [HERE](https://www.encodeproject.org/chip-seq/); NRF > 0.9, PBC1 > 0.9, and PBC2 > 10
1. multiqcReport/multiqc_report.html: follow the ChIP-seq standards [HERE](https://www.encodeproject.org/chip-seq/).
2. experimentQC/*_fingerprint.pdf: make sure the plot information is correct for your antibody/input. See [HERE](https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html) for more details.
3. crossReads/*.filt.nodup.tagAlign.15.tagAlign.gz.cc.plot.pdf: make sure your sample data has the correct signal intensity and location. See [HERE](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/06_QC_cross_correlation.html) for more details.
3. crossReads/*cc.plot.pdf: make sure your sample data has the correct signal intensity and location. See [HERE](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/06_QC_cross_correlation.html) for more details.
4. crossReads/*.filt.nodup.tagAlign.15.tagAlign.gz.cc.qc: Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See [HERE](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/06_QC_cross_correlation.html) for more details.
4. crossReads/*.cc.qc: Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See [HERE](https://genome.ucsc.edu/encode/qualityMetrics.html) for more details.
5. experimentQC/coverage.pdf, experimentQC/heatmeap_SpearmanCorr.pdf, experimentQC/heatmeap_PearsonCorr.pdf: See [HERE](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html) for more details.
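The library-complexity thresholds quoted for filterReads/*.filt.nodup.pbc.qc (NRF > 0.9, PBC1 > 0.9, PBC2 > 10) can be applied in a few lines. This is only a sketch: both function names are hypothetical, and the file layout (one tab-delimited line whose last three fields are NRF, PBC1, and PBC2) is an assumption that should be checked against a real file.

```python
def parse_pbc_qc(line):
    """Read NRF, PBC1, PBC2 from one line (assumed layout: last three tab-delimited fields)."""
    fields = line.strip().split("\t")
    nrf, pbc1, pbc2 = (float(value) for value in fields[-3:])
    return nrf, pbc1, pbc2


def library_complexity_ok(nrf, pbc1, pbc2):
    """True when all three values exceed the thresholds quoted in this section."""
    return nrf > 0.9 and pbc1 > 0.9 and pbc2 > 10
```

A sample that fails here is a candidate for the pause-and-re-evaluate step described at the top of this section.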
## Common Errors
...
@@ -140,15 +140,19 @@ If you find an error, please let the [BICF](mailto:BICF@UTSouthwestern.edu) know
This example workflow is derived from original scripts kindly contributed by the Bioinformatic Core Facility ([BICF](https://www.utsouthwestern.edu/labs/bioinformatics/)), in the [Department of Bioinformatics](https://www.utsouthwestern.edu/departments/bioinformatics/).
Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).
## Citation
Please cite individual programs and versions used [HERE](docs/references.txt). Also, please look out for our pipeline to be published in the future [HERE](https://zenodo.org/).
The pipeline uses [Nextflow](https://www.nextflow.io), a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads and performs extensive quality-control on the results.
### Pipeline Steps
1) Trim adaptors with TrimGalore!
2) Align with BWA
3) Filter reads with Sambamba S
4) Quality control with DeepTools
5) Calculate Cross-correlation using SPP and PhantomPeakQualTools
6) Signal profiling using MACS2
7) Call consensus peaks
8) Annotate all peaks using ChipSeeker
9) Use MEME-ChIP to find motifs in original peaks
10) Find differentially expressed peaks using DiffBind (If more than 1 experiment)
Report issues to the Bioinformatic Core Facility [BICF](mailto:BICF@UTSouthwestern.edu)
### Pipeline Steps
+ There are 11 steps to the pipeline
1. Check input files
2. Trim adaptors with TrimGalore!
3. Align trimmed reads with bwa, and sort/convert to bam with samtools
4. Mark duplicates with Sambamba, and filter reads with samtools
5. Quality metrics with deepTools
6. Calculate cross-correlation using PhantomPeakQualTools
7. Call peaks with MACS
8. Calculate consensus peaks
9. Annotate all peaks using ChipSeeker
10. Calculate Differential Binding Activity with DiffBind (If more than 1 rep in more than 1 experiment)
11. Use MEME-ChIP to find motifs in original peaks
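The DiffBind step is conditional on the design ("more than 1 rep in more than 1 experiment"). One possible reading of that condition is sketched below; `can_run_diffbind` is a hypothetical helper, not the pipeline's own check, and it assumes each design row is a dict with `experiment_id` and `replicate` keys and that control samples have already been excluded.

```python
from collections import defaultdict


def can_run_diffbind(rows):
    """True when at least two experiments each have at least two replicates."""
    replicates = defaultdict(set)
    for row in rows:
        replicates[row["experiment_id"]].add(row["replicate"])
    # Keep only experiments with more than one distinct replicate.
    replicated = [exp for exp, reps in replicates.items() if len(reps) > 1]
    return len(replicated) > 1
```

A stricter reading would require every experiment to be replicated; the list comprehension is the place to change that.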
## Workflow Parameters
reads - Choose all ChIP-seq fastq files for analysis.
pairedEnd - Choose True/False if data is paired-end
design - Choose the file with the experiment design information. TSV format
genome - Choose a genomic reference (genome).
1. One or more input FASTQ files from a ChIP-seq experiment and a design file that links each file name to its sample id (required) - Choose all ChIP-seq fastq files for analysis.
2. In single-end sequencing, the sequencer reads a fragment from one end only. In paired-end sequencing, it reads from one end up to the specified read length, then starts another round of reading from the opposite end of the fragment. (Paired-end: True, Single-end: False) (required)
3. A design file listing sample id, fastq files, corresponding control id and additional information about the sample.
4. Reference species and genome used for alignment and subsequent analysis. (required)
5. Run differential peak analysis (required). Must have at least 2 replicates per experiment and at least 2 experiments.
6. Run motif calling (required). Top 600 peaks sorted by p-value.
7. Ensure configuration for astrocyte. (required; always true)
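Motif calling is described above as using the top 600 peaks sorted by p-value. As a rough illustration of that selection (not the pipeline's own code), the sketch below assumes a standard narrowPeak file, where column 8 holds the -log10 p-value, so larger values are more significant; `top_peaks` is a hypothetical helper.

```python
def top_peaks(narrowpeak_lines, n=600):
    """Return the n most significant peaks, each as a list of tab-delimited fields."""
    peaks = [line.rstrip("\n").split("\t")
             for line in narrowpeak_lines if line.strip()]
    # Column 8 (index 7) is the -log10 p-value in narrowPeak, so sort descending.
    peaks.sort(key=lambda fields: float(fields[7]), reverse=True)
    return peaks[:n]
```

Passing an open file handle works as well as a list of lines, since the function only iterates over its input.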
## Design file
The following columns are required and must be named as in the template. A design file template can be downloaded [HERE](https://git.biohpc.swmed.edu/bchen4/chipseq_analysis/raw/master/docs/design_example.csv)
SampleID
The id of the sample. This will be the header in output files, please make sure it is concise
Tissue
Tissue of the sample
Factor
Factor of the experiment
Condition
This is the group that will be used for pairwise differential expression analysis
Replicate
Replicate id
Peaks
The file name of the peak file for this sample
bamReads
The file name of the IP BAM for this sample
bamControl
The file name of the control BAM for this sample
ContorlID
The id of the control sample
PeakCaller
The peak caller used
+ The Design file is a tab-delimited file with 8 columns for Single-End and 9 columns for Paired-End. Letters, numbers, and underscores can be used in the names. However, the names can only begin with a letter. Columns must be as follows:
1. sample_id a short, unique, and concise name used to label output files; will be used as a control_id if it is the control sample
2. experiment_id biosample_treatment_factor; same name given for all replicates of treatment. Will be used for the consensus header.
3. biosample symbol for tissue type or cell line
4. factor symbol for antibody target
5. treatment symbol of treatment applied
6. replicate a number, usually from 1-3 (e.g. 1)
7. control_id sample_id name that is the control for this sample
8. fastq_read1 name of fastq file 1 for SE or PE data
9. fastq_read2 name of fastq file 2 for PE data
+ See [HERE](test_data/design_ENCSR729LGA_PE.txt) for an example design file, paired-end
+ See [HERE](test_data/design_ENCSR238SGC_SE.txt) for an example design file, single-end
## Output Files
Folder | File | Description
--- | --- | ---
design | N/A | Inputs used for analysis; can ignore
trimReads | *_trimming_report.txt | report detailing how many reads were trimmed
trimReads | *_trimmed.fq.gz | trimmed fastq files used for analysis
alignReads | *.srt.bam.flagstat.qc | QC metrics from the mapping process
callPeaksMACS | pooled/*pooled.fc_signal.bw | bigwig data file; raw fold enrichment of sample/control
callPeaksMACS | pooled/*pooled_peaks.xls | Excel file of peaks
callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control signal adjusted for pvalue significance
callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
diffPeaks | heatmap.pdf | Use only for replicated samples; heatmap of relationship of peak location and peak intensity
diffPeaks | normcount_peaksets.txt | Use only for replicated samples; peak set values of each sample
diffPeaks | pca.pdf | Use only for replicated samples; PCA of peak location and peak intensity
diffPeaks | *_diffbind.bed | Use only for replicated samples; bed file of peak locations between replicates
diffPeaks | *_diffbind.csv | Use only for replicated samples; CSV file of peaks between replicates
## Common Quality Control Metrics
+ This is the list of files that should be reviewed before continuing with the ChIP-seq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
1. multiqcReport/multiqc_report.html: follow the ChIP-seq standards [HERE](https://www.encodeproject.org/chip-seq/).
2. experimentQC/*_fingerprint.pdf: make sure the plot information is correct for your antibody/input. See [HERE](https://deeptools.readthedocs.io/en/develop/content/tools/plotFingerprint.html) for more details.
3. crossReads/*cc.plot.pdf: make sure your sample data has the correct signal intensity and location. See [HERE](https://hbctraining.github.io/Intro-to-ChIPseq/lessons/06_QC_cross_correlation.html) for more details.
4. crossReads/*.cc.qc: Column 9 (NSC) should be > 1.1 for experiment and < 1.1 for input. Column 10 (RSC) should be > 0.8 for experiment and < 0.8 for input. See [HERE](https://genome.ucsc.edu/encode/qualityMetrics.html) for more details.
5. experimentQC/coverage.pdf, experimentQC/heatmeap_SpearmanCorr.pdf, experimentQC/heatmeap_PearsonCorr.pdf: See [HERE](https://deeptools.readthedocs.io/en/develop/content/list_of_tools.html) for more details.
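The cross-correlation thresholds for crossReads/*.cc.qc (NSC in column 9, RSC in column 10) lend themselves to a quick scripted check. The sketch below is illustrative only: `check_cross_correlation` is a hypothetical helper, and it assumes the file is a single tab-delimited line with 1-based column numbering as described in this section.

```python
def check_cross_correlation(qc_line, is_input=False):
    """Return (NSC, RSC, passed) for one tab-delimited cc.qc line."""
    fields = qc_line.rstrip("\n").split("\t")
    nsc = float(fields[8])  # column 9 (1-based): NSC
    rsc = float(fields[9])  # column 10 (1-based): RSC
    if is_input:
        # Input/control samples are expected to fall below both thresholds.
        passed = nsc < 1.1 and rsc < 0.8
    else:
        # Experiment samples are expected to exceed both thresholds.
        passed = nsc > 1.1 and rsc > 0.8
    return nsc, rsc, passed
```

Set `is_input=True` for control samples so that the inverted thresholds are applied.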
### Credits
This example workflow is derived from original scripts kindly contributed by the Bioinformatic Core Facility (BICF), Department of Bioinformatics.
This example workflow is derived from original scripts kindly contributed by the Bioinformatic Core Facility ([BICF](https://www.utsouthwestern.edu/labs/bioinformatics/)), in the [Department of Bioinformatics](https://www.utsouthwestern.edu/departments/bioinformatics/).
Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).