Astrocyte ATAC-seq analysis Workflow Package
Introduction
BICF ATAC-seq is a bioinformatics best-practice pipeline for analyzing ATAC-seq (Assay for Transposase-Accessible Chromatin using sequencing) data, used at the BICF in the UT Southwestern Department of Bioinformatics.
The pipeline uses Nextflow, a bioinformatics workflow tool. It pre-processes raw data from FastQ inputs, aligns the reads, and performs extensive quality control on the results.
This pipeline is primarily used with a SLURM cluster on the BioHPC Cluster. However, the pipeline should be able to run on any system that supports Nextflow.
Additionally, the pipeline is designed to work with Astrocyte Workflow System using a simple web interface.
The current version of the software and issue reports are at https://git.biohpc.swmed.edu/BICF/Astrocyte/atacseq_analysis
To download the current (working, not tagged) version of the software:
$ git clone git@git.biohpc.swmed.edu:BICF/Astrocyte/atacseq_analysis.git
Input files
1) Fastq Files
- You will need the full path to the files for the Bash script
2) Design File
- The design file is a tab-delimited file with 4 columns for single-end data and 5 columns for paired-end data. Letters, numbers, and underscores can be used in the names; however, the names must begin with a letter. Columns must be as follows:
- sample_id - The id of the sample. This will be the header in output files, so please keep it concise.
- experiment_id - The same name is given to all replicates of a treatment. It will be used for the consensus header.
- replicate - Replicate number
- fastq_read1 - Name of fastq file 1 for SE or PE data
- fastq_read2 - Name of fastq file 2 for PE data
- See HERE for an example design file, paired-end
- See HERE for an example design file, single-end
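The design file rules above can be checked before a run with a short script. The following is a minimal sketch, not part of the pipeline: it assumes the file has a header row carrying the column names listed above, that the naming rule applies to `sample_id` and `experiment_id`, and that `replicate` is a whole number. The function name `validate_design` and the sample rows are hypothetical.

```python
import csv
import io
import re

# Column layouts as described in the text above.
SE_COLUMNS = ["sample_id", "experiment_id", "replicate", "fastq_read1"]
PE_COLUMNS = SE_COLUMNS + ["fastq_read2"]

# Names may contain letters, numbers, and underscores, and must begin with a letter.
NAME_RE = re.compile(r"^[A-Za-z][A-Za-z0-9_]*$")


def validate_design(handle):
    """Return a list of problems found in a tab-delimited design file."""
    reader = csv.reader(handle, delimiter="\t")
    header = next(reader, None)
    # Assumption: the first row is a header with exactly these column names.
    if header not in (SE_COLUMNS, PE_COLUMNS):
        return ["header must list the 4 single-end or 5 paired-end columns"]
    problems = []
    for line_no, row in enumerate(reader, start=2):
        if not row:
            continue  # ignore blank lines
        if len(row) != len(header):
            problems.append(f"line {line_no}: expected {len(header)} columns, found {len(row)}")
            continue
        record = dict(zip(header, row))
        for column in ("sample_id", "experiment_id"):
            if not NAME_RE.match(record[column]):
                problems.append(f"line {line_no}: invalid {column} '{record[column]}'")
        # Assumption: replicate numbers are non-negative integers.
        if not record["replicate"].isdigit():
            problems.append(f"line {line_no}: replicate '{record['replicate']}' is not a number")
    return problems


# Hypothetical single-end design file, used only to exercise the checker.
example = (
    "sample_id\texperiment_id\treplicate\tfastq_read1\n"
    "A_1\texpA\t1\tA_1.fastq.gz\n"
    "A_2\texpA\t2\tA_2.fastq.gz\n"
)
print(validate_design(io.StringIO(example)))
```

For a file on disk, the same function can be called with `open(path, newline="")` in place of the `io.StringIO` object.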
Pipeline
- There are 9 steps to the pipeline:
- Check input files
- Trim adaptors with TrimGalore!
- Map reads with BWA, filter with SamTools, and sort with Sambamba
- Mark duplicates with Sambamba, filter reads with SamTools, calculate the percentage of reads in mitochondria, and calculate library complexity with SamTools and bedtools
- Calculate cross-correlation using PhantomPeakQualTools
- Call peaks with MACS2 from overlaps of pooled replicates
- Call consensus peaks and optionally remove blacklist peaks
- QC metrics
- MultiQC report
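To illustrate what "percentage of reads in mitochondria" means in the fourth step, here is a minimal sketch that computes it from the text output of `samtools idxstats` (tab-separated reference name, length, mapped reads, unmapped reads). This is an assumption about one way to obtain the number, not necessarily how the pipeline computes it; the function name `percent_mito` and the default mitochondrial contig name `chrM` are hypothetical and depend on the reference genome.

```python
def percent_mito(idxstats_text, mito_name="chrM"):
    """Percentage of mapped reads that fall on the mitochondrial contig."""
    mapped_total = 0
    mapped_mito = 0
    for line in idxstats_text.splitlines():
        if not line.strip():
            continue  # skip blank lines
        # idxstats columns: reference name, length, mapped, unmapped
        name, _length, mapped, _unmapped = line.split("\t")
        mapped_total += int(mapped)
        if name == mito_name:
            mapped_mito += int(mapped)
    if mapped_total == 0:
        return 0.0  # avoid division by zero on an empty alignment
    return 100.0 * mapped_mito / mapped_total


# Made-up counts, for illustration only.
example = "chr1\t1000\t90\t0\nchrM\t100\t10\t0\n*\t0\t0\t5\n"
print(percent_mito(example))
```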
Output Files
Folder | File | Description |
---|---|---|
design | N/A | Inputs used for analysis; can ignore |
trimReads | *_trimming_report.txt | report detailing how many reads were trimmed |
trimReads | *_trimmed.fq.gz | trimmed fastq files used for analysis |
alignReads | *.flagstat.qc | QC metrics from the mapping process |
alignReads | *.bam | sorted bam file |
filterReads | *.dedup.qc | QC metrics from marking duplicate reads (sambamba) |
filterReads | *.dedup.bam | filtered bam file with duplicate reads removed |
filterReads | *.dedup.bam.bai | indexed filtered bam file |
filterReads | *.dedup.flagstat.qc | QC metrics of filtered bam file (mapping stats, samtools) |
filterReads | *.dedup.pbc.qc | QC metrics of library complexity |
filterReads | *.pctmito.tsv | QC percentage of reads in mitochondria |
convertReads | *.filt.nodup.bedse.gz | bed alignment in BEDPE format |
convertReads | *.tagAlign.gz | bed alignment in BEDPE or BEDSE format |
crossReads | *.cc.plot.pdf | Plot of cross-correlation to assess signal-to-noise ratios |
crossReads | *.cc.qc | cross-correlation metrics. File HEADER |
callPeaksMACS | pooled/*pooled.fc_signal.bw | bigwig data file; raw fold enrichment of sample/control |
callPeaksMACS | pooled/*pooled_peaks.xls | Excel file of peaks |
callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control signal adjusted for pvalue significance |
callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see HERE for ENCODE narrowPeak header format |
consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates) |
consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates) |
consensusPeaks | *.replicated_noblacklist.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates) with blacklist regions removed |
experimentQC | coverage.pdf | plot to assess the sequencing depth of a given sample |
experimentQC | heatmeap_SpearmanCorr.pdf | plot of Spearman correlation between samples |
experimentQC | heatmeap_PearsonCorr.pdf | plot of Pearson correlation between samples |
experimentQC | sample_mbs.npz | array of multiple BAM summaries |
experimentQC | *.FRiPscore.tsv | File containing FRiP score |
experimentQC | *.TSSenrichment.tsv | File containing TSS enrichment |
experimentQC | *_large_tss-enrich.pdf | TSS Enrichment heatmap and metagene plot |
experimentQC | *_tss-enrich.pdf | TSS Enrichment metagene plot |
experimentQC | *.fragment_length_linear.pdf | Paired-end only, fragment/insert size densities, linear |
experimentQC | *.fragment_length_linear.pdf | Paired-end only, log10 fragment/insert size densities |
experimentQC | *.fragment_length_count.txt | Paired-end only, count and fragment length, raw data |
multiqcReport | multiqc_report.html | Quality control report of percent mitochondria, NRF, PBC1, PBC2, NSC, and RSC. Also contains software versions and references to cite. |
Common Quality Control Metrics
- This is the list of files that should be reviewed before continuing with the ATAC-seq experiment. If your experiment fails any of these metrics, you should pause and re-evaluate whether the data should remain in the study.
- multiqcReport/multiqc_report.html: follow the ATAC-seq standards HERE;
- crossReads/*cc.plot.pdf: make sure your sample data has the correct signal intensity and location. See HERE for more details.
- filterReads/sample/*.pbc.qc: column 6 (NRF) > 0.9, column 7 (PBC1) > 0.9, and column 8 (PBC2) > 3.
- experimentQC/coverage.pdf, experimentQC/heatmeap_SpearmanCorr.pdf, experimentQC/heatmeap_PearsonCorr.pdf: See HERE for more details.
- experimentQC/: Common Quality controls for ATAC-seq: FRiP score, TSS enrichment, Fragment/Insert length densities (paired-end only)
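The library-complexity thresholds listed above can be applied to a `*.pbc.qc` file with a few lines of Python. This is a minimal sketch under stated assumptions: the file is tab-delimited, the data line holds NRF, PBC1, and PBC2 in columns 6, 7, and 8 (1-indexed, as stated above), and the function name `check_pbc_qc` is hypothetical. The example values are made up and the first five fields are placeholders.

```python
def check_pbc_qc(line, nrf_col=6, pbc1_col=7, pbc2_col=8):
    """Compare one tab-delimited data line of a *.pbc.qc file to the thresholds."""
    fields = line.rstrip("\n").split("\t")
    # Column numbers are 1-indexed to match the text above.
    nrf = float(fields[nrf_col - 1])
    pbc1 = float(fields[pbc1_col - 1])
    pbc2 = float(fields[pbc2_col - 1])
    return {
        "NRF": nrf > 0.9,
        "PBC1": pbc1 > 0.9,
        "PBC2": pbc2 > 3.0,
    }


# Made-up values; c1..c5 stand in for the columns that precede NRF.
example_line = "c1\tc2\tc3\tc4\tc5\t0.95\t0.97\t36.8\n"
for metric, passed in check_pbc_qc(example_line).items():
    print(metric, "pass" if passed else "fail")
```

If the column layout of your files differs, pass the column numbers explicitly rather than relying on the defaults.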
Common Errors
If you find an error, please let the BICF know and we will add it here.
Citation
Please cite individual programs and versions used HERE, and the pipeline doi: coming soon. Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).
Credits
This example workflow is derived from original scripts kindly contributed by the Bioinformatic Core Facility (BICF), in the Department of Bioinformatics.