Commit 298c748b authored by Venkat Malladi's avatar Venkat Malladi

Merge branch '31-BlacklistRegions' into 'master'

Resolve "Blacklist of regions"

Closes #31

See merge request !25
parents ec163e1a 09a1b99d
Pipeline #7124 passed with stages
in 452 minutes and 11 seconds
......@@ -30,5 +30,5 @@ paired_end_mouse:
- branches
- master
script:
- NXF_OPTS="-Dleveldb.mmap=false" nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_ENCSR451NAE_PE.txt" --genome 'GRCm38' --pairedEnd true
- NXF_OPTS="-Dleveldb.mmap=false" nextflow run workflow/main.nf --designFile "$CI_PROJECT_DIR/test_data/design_ENCSR451NAE_PE.txt" --genome 'GRCm38' --pairedEnd true --blacklist true
- pytest -m pairedend_mouse
......@@ -23,6 +23,7 @@ All notable changes to this project will be documented in this file.
- Added FRiP score (per sample: reads in replicated peaks/total reads per)
- Added TSS enrichment
- Added Fragment/Insert Length size
- Removal of blacklist regions
## [publish_1.0.0 ] - 2019-12-03
Initial release of pipeline
......
......@@ -43,14 +43,14 @@ $ git clone git@git.biohpc.swmed.edu:BICF/Astrocyte/atacseq_analysis.git
## Pipeline
+ There are ??? steps to the pipeline
+ There are 9 steps to the pipeline
1. Check input files
2. Trim adaptors with TrimGalore!
3. Map reads with BWA, filter with SamTools, and sort with Sambamba
4. Mark duplicates with Sambamba, Filter reads with SamTools, calculate percentage of reads in mitochondria, and calculate library complexity with SamTools and bedtools
5. Calculate cross-correlation using PhantomPeakQualTools
6. Call peaks with MACS2 from overlaps of pooled replicates
7. Call consensus peaks
7. Call consensus peaks and optional removal of blacklist peaks
8. QC metrics
9. MultiQC report
......@@ -79,6 +79,7 @@ callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control sig
callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated_noblacklist.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates) with blacklist regions removed
experimentQC | coverage.pdf | plot to assess the sequencing depth of a given sample
experimentQC | heatmeap_SpearmanCorr.pdf | plot of Spearman correlation between samples
experimentQC | heatmeap_PearsonCorr.pdf | plot of Pearson correlation between samples
......
......@@ -110,6 +110,19 @@ workflow_parameters:
read length, and then starts another round of reading from the opposite
end of the fragment.
- id: blacklist
type: select
required: true
choices:
- [ 'true', 'True']
- [ 'false', 'False']
description: |
The Blacklisted Regions aim to identify a comprehensive set of regions
that have anomalous, unstructured, high signal/read counts in next gen
sequencing experiments independent of cell line and type of experiment.
If you would like these regions excluded from replicated peaks, select
True. (recommended)
- id: design
type: file
required: true
......
......@@ -43,14 +43,14 @@ $ git clone git@git.biohpc.swmed.edu:BICF/Astrocyte/atacseq_analysis.git
## Pipeline
+ There are ??? steps to the pipeline
+ There are 9 steps to the pipeline
1. Check input files
2. Trim adaptors with TrimGalore!
3. Map reads with BWA, filter with SamTools, and sort with Sambamba
4. Mark duplicates with Sambamba, Filter reads with SamTools, calculate percentage of reads in mitochondria, and calculate library complexity with SamTools and bedtools
5. Calculate cross-correlation using PhantomPeakQualTools
6. Call peaks with MACS2 from overlaps of pooled replicates
7. Call consensus peaks
7. Call consensus peaks and optional removal of blacklist peaks
8. QC metrics
9. MultiQC report
......@@ -79,6 +79,7 @@ callPeaksMACS | pooled/*.pvalue_signal.bw | bigwig data file; sample/control sig
callPeaksMACS | pooled/*_pooled.narrowPeak | peaks file; see [HERE](https://genome.ucsc.edu/FAQ/FAQformat.html#format12) for ENCODE narrowPeak header format
consensusPeaks | *.rejected.narrowPeak | peaks not supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates)
consensusPeaks | *.replicated_noblacklist.narrowPeak | peaks supported by multiple testing (replicates and pseudo-replicates) with blacklist regions removed
experimentQC | coverage.pdf | plot to assess the sequencing depth of a given sample
experimentQC | heatmeap_SpearmanCorr.pdf | plot of Spearman correlation between samples
experimentQC | heatmeap_PearsonCorr.pdf | plot of Pearson correlation between samples
......
......@@ -18,11 +18,11 @@ process {
queue = '256GB,256GBv1'
}
withName: filterReads {
module = ['python/3.6.1-2-anaconda', 'samtools/1.4.1', 'sambamba/0.6.6', 'bedtools/2.26.0']
module = ['python/3.6.1-2-anaconda', 'samtools/1.4.1', 'sambamba/0.6.6', 'bedtools/2.25.0']
queue = '128GB,256GB,256GBv1'
}
withName: convertReads {
module = ['python/3.6.1-2-anaconda', 'samtools/1.4.1', 'bedtools/2.26.0']
module = ['python/3.6.1-2-anaconda', 'samtools/1.4.1', 'bedtools/2.25.0']
queue = '128GB,256GB,256GBv1'
}
withName: crossReads {
......@@ -38,15 +38,15 @@ process {
executor = 'local'
}
withName: callPeaksMACS {
module = ['python/3.6.1-2-anaconda', 'macs/2.1.0-20151222', 'phantompeakqualtools/1.2', 'UCSC_userApps/v317', 'bedtools/2.26.0']
module = ['python/3.6.1-2-anaconda', 'macs/2.1.0-20151222', 'phantompeakqualtools/1.2', 'UCSC_userApps/v317', 'bedtools/2.25.0']
queue = '128GB,256GB,256GBv1'
}
withName: consensusPeaks {
module = ['python/3.6.1-2-anaconda', 'bedtools/2.26.0']
module = ['python/3.6.1-2-anaconda', 'bedtools/2.25.0']
executor = 'local'
}
withName: experimentQC {
module = ['python/3.6.1-2-anaconda', 'deeptools/2.5.0.1', 'samtools/1.4.1', 'bedtools/2.26.0', 'singularity/2.6.1']
module = ['python/3.6.1-2-anaconda', 'deeptools/2.5.0.1', 'samtools/1.4.1', 'bedtools/2.25.0', 'singularity/2.6.1']
queue = '128GB,256GB,256GBv1'
}
withName: multiqcReport {
......@@ -64,12 +64,14 @@ params {
genomesize = 'hs'
chromsizes = '/project/shared/bicf_workflow_ref/GRCh38/genomefile.txt'
gtffile = '/project/shared/bicf_workflow_ref/GRCh38/gencode.v25.chr_patch_hapl_scaff.annotation.gtf'
blacklistfile = '/project/shared/bicf_workflow_ref/GRCh38/ENCFF356LFX.bed'
}
'GRCm38' {
bwa = '/project/shared/bicf_workflow_ref/GRCm38'
genomesize = 'mm'
chromsizes = '/project/shared/bicf_workflow_ref/GRCm38/genomefile.txt'
gtffile = '/project/shared/bicf_workflow_ref/GRCm38/gencode.vM20.annotation.gtf'
blacklistfile = '/project/shared/bicf_workflow_ref/GRCm38/ENCFF999QPV.bed'
}
}
}
......
......@@ -6,6 +6,7 @@
// Define Input variables
params.reads = "$baseDir/../test_data/*.fastq.gz"
params.pairedEnd = false
params.blacklist = false
params.designFile = "$baseDir/../test_data/design_ENCSR265ZXX_SE.txt"
params.genome = 'GRCh38'
params.genomes = []
......@@ -13,6 +14,7 @@ params.bwaIndex = params.genome ? params.genomes[ params.genome ].bwa ?: false :
params.genomeSize = params.genome ? params.genomes[ params.genome ].genomesize ?: false : false
params.chromSizes = params.genome ? params.genomes[ params.genome ].chromsizes ?: false : false
params.gtfFile = params.genome ? params.genomes[ params.genome ].gtffile ?: false : false
params.blacklistFile = params.genome ? params.genomes[ params.genome ].blacklistfile ?: false : false
params.astrocyte = false
params.outDir= "${baseDir}/output"
params.references = "${baseDir}/../docs/references.md"
......@@ -36,6 +38,7 @@ readsList = Channel
// Define regular variables
pairedEnd = params.pairedEnd
blacklist = params.blacklist
designFile = params.designFile
genomeSize = params.genomeSize
chromSizes = params.chromSizes
......@@ -43,6 +46,7 @@ outDir = params.outDir
references = params.references
multiqc = params.multiqc
gtfFile = params.gtfFile
blacklistFile = params.blacklistFile
process checkDesignFile {
......@@ -503,21 +507,38 @@ process consensusPeaks {
output:
file '*.replicated.*' into consensusPeaks
file '*.rejected.*' into rejectedPeaks
file '*.replicated_noblacklist*' optional true into blPeaks
file("design_exQC.tsv") into designExperimentQC
file('version_*.txt') into consensusPeaksVersions
script:
if (params.astrocyte == true) {
"""
module load python/3.6.1-2-anaconda
module load bedtools/2.26.0
python3 ${baseDir}/scripts/overlap_peaks.py -d ${peaksDesign} -f ${preDiffDesign}
"""
if (blacklist) {
"""
module load python/3.6.1-2-anaconda
module load bedtools/2.26.0
python3 ${baseDir}/scripts/overlap_peaks.py -d ${peaksDesign} -f ${preDiffDesign} -b ${blacklistFile}
"""
}
else {
"""
module load python/3.6.1-2-anaconda
module load bedtools/2.26.0
python3 ${baseDir}/scripts/overlap_peaks.py -d ${peaksDesign} -f ${preDiffDesign}
"""
}
}
else {
"""
python3 ${baseDir}/scripts/overlap_peaks.py -d ${peaksDesign} -f ${preDiffDesign}
"""
if (blacklist) {
"""
python3 ${baseDir}/scripts/overlap_peaks.py -d ${peaksDesign} -f ${preDiffDesign} -b ${blacklistFile}
"""
}
else {
"""
python3 ${baseDir}/scripts/overlap_peaks.py -d ${peaksDesign} -f ${preDiffDesign}
"""
}
}
}
......
......@@ -38,6 +38,11 @@ def get_args():
help="The design file of with bam files (tsv format).",
required=True)
parser.add_argument('-b', '--blacklist',
help="Bed file of blacklisted regions to remove",
required=False,
default="None")
args = parser.parse_args()
return args
......@@ -173,10 +178,28 @@ def overlap(experiment, design):
return os.path.abspath(overlapping_peaks_fn)
def blacklist_peaks(experiment, blacklist):
'''Remove blacklist peaks from replicated peaks'''
logger.info("Removing blacklist regions from replicated peaks")
narrowpead_input = "%s.replicated.narrowPeak" % (experiment)
# Assign output
blpo_filename = "%s.replicated_noblacklist.narrowPeak" % (experiment)
# Remove
steps = [
"bedtools intersect -v -a %s -b %s" % (narrowpead_input, blacklist),
r"""awk 'BEGIN{OFS="\t"} {if ($5>1000) $5=1000; print $0}'""",
]
out, err = utils.run_pipe(steps, blpo_filename)
def main():
args = get_args()
design = args.design
files = args.files
blacklist = args.blacklist
# Create a file handler
handler = logging.FileHandler('consensus_peaks.log')
......@@ -207,6 +230,10 @@ def main():
design_diff.to_csv("design_exQC.tsv", header=True, sep='\t', index=False)
# Remove blacklist regions; if blacklist = True
if os.path.exists(blacklist):
for experiment, df_experiment in design_peaks_df.groupby('experiment_id'):
bl_peaks = blacklist_peaks(experiment, blacklist)
if __name__ == '__main__':
main()
......@@ -44,3 +44,6 @@ def test_overlap_peaks_pairedend_mouse():
assert os.path.exists(os.path.join(test_output_path, 'ENCSR451NAE.rejected.narrowPeak'))
peak_file = test_output_path + 'ENCSR451NAE.replicated.narrowPeak'
assert utils.count_lines(peak_file) >= 78039
assert os.path.exists(os.path.join(test_output_path, 'ENCSR451NAE.replicated_noblacklist.narrowPeak'))
bl_peak_file = test_output_path + 'ENCSR451NAE.replicated_noblacklist.narrowPeak'
assert utils.count_lines(bl_peak_file) >= 77532
Markdown is supported
0% or .
You are about to add 0 people to the discussion. Proceed with caution.
Finish editing this message first!
Please register or to comment