Commit ab0033d0 authored by Gervaise Henry

Merge branch '5-MultiQC' into 'develop'

Resolve "Add MultiQC"

Closes #5

See merge request !47
parents 87ca7c11 fb55f7c1
2 merge requests: !53 Develop, !47 Resolve "Add MultiQC"
Pipeline #4205 passed with stages in 24 minutes and 3 seconds
......@@ -27,7 +27,7 @@ To Run:
* path to the fastq location
* only R1 and R2 are necessary, but I2 can be included
* only fastqs listed in the designFile (see below) are used; fastqs not listed are ignored
* eg: **--fastq '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/v3s2r100k/\*.fastq.gz'**
* eg: **--fastq '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/hu.v3s2r100k/\*.fastq.gz'**
* **--designFile**
* path to design file (csv format) location
* column 1 = "Sample"
......@@ -35,7 +35,7 @@ To Run:
* column 3 = "fastq_R2"
* can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
* can be downloaded [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/blob/master/docs/design.csv)
* eg: **--designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/v3s2r100k/design.csv'**
* eg: **--designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/hu.v3s2r100k/design.csv'**
* **--genome**
* reference genome
* requires workflow/conf/biohpc.config to work
......@@ -44,8 +44,8 @@ To Run:
* *'GRCh38-1.2.0'* = Human GRCh38 release 84
* *'hg19-3.0.0'* = Human GRCh37 (hg19) release 87
* *'hg19-1.2.0'* = Human GRCh37 (hg19) release 84
* *'mm10-3.0.0'* = Human GRCm38 (mm10) release 93
* *'mm10-3.0.0'* = Human GRCm38 (mm10) release 84
* *'mm10-3.0.0'* = Mouse GRCm38 (mm10) release 93
* *'mm10-1.2.0'* = Mouse GRCm38 (mm10) release 84
* *'hg19_and_mm10-3.0.0'* = Human GRCh37 (hg19) + Mouse GRCm38 (mm10) release 93
* *'hg19_and_mm10-1.2.0'* = Human GRCh37 (hg19) + Mouse GRCm38 (mm10) release 84
* *'ercc92-1.2.0'* = ERCC.92 Spike-In
......@@ -92,7 +92,7 @@ To Run:
* eg: **--outDir 'test'**
* FULL EXAMPLE:
**nextflow main.nf --fastq '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/v3s2r100k/\*.fastq.gz' --designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/v3s2r100k/design.csv' --genome 'GRCh38-3.0.0' --kitVersion 'three' --version '3.0.2' --outDir 'test'**
**nextflow run workflow/main.nf --fastq '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/hu.v3s2r100k/\*.fastq.gz' --designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/hu.v3s2r100k/design.csv' --genome 'GRCh38-3.0.0' --kitVersion 'three' --version '3.0.2' --outDir 'test'**
* Design example:
......
### References
1. **python**:
* Anaconda (Anaconda Software Distribution, [https://anaconda.com](https://anaconda.com))
2. **cellranger**:
* Cellranger mkfastq [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq)
3. **MultiQC**:
* Ewels P., Magnusson M., Lundin S. and Käller M. 2016. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32(19): 3047–3048. doi:[10.1093/bioinformatics/btw354](https://dx.doi.org/10.1093/bioinformatics/btw354)
workflow/conf/bicf_logo.png

24.3 KiB

......@@ -18,6 +18,14 @@ process {
module = ['cellranger/3.0.2']
queue = '128GB,256GB,256GBv1,384GB'
}
withLabel: versions {
module = ['python/3.6.1-2-anaconda','pandoc/2.7','multiqc/1.7']
executor = 'local'
}
withLabel: multiqc {
module = ['multiqc/1.7']
executor = 'local'
}
}
params {
......
# Custom Logo
custom_logo: 'bicf_logo.png'
custom_logo_url: 'https://www.utsouthwestern.edu/labs/bioinformatics/'
custom_logo_title: 'Bioinformatics Core Facility'

report_header_info:
  - Contact E-mail: 'bicf@utsouthwestern.edu'
  - Application Type: 'cellranger_count'
  - Department: 'Bioinformatics Core Facility, Department of Bioinformatics'

# Title to use for the report.
title: BICF CellRanger Count Analysis Report

report_comment: >
  This report has been generated by the <a href="https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count"
  target="_blank">BICF/cellranger_count</a> pipeline.

custom_data:
  metrics_summary:
    file_format: 'tsv'
    id: 'metrics_summary'
    contents: 'Estimated Number of Cells Mean Reads per Cell Median Genes per Cell Number of Reads Valid Barcodes Sequencing Saturation Q30 Bases in Barcode Q30 Bases in RNA Read Q30 Bases in UMI Reads Mapped to Genome Reads Mapped Confidently to Genome Reads Mapped Confidently to Intergenic Regions Reads Mapped Confidently to Intronic Regions Reads Mapped Confidently to Exonic Regions Reads Mapped Confidently to Transcriptome Reads Mapped Antisense to Gene Fraction Reads in Cells Total Genes Detected Median UMI Counts per Cell'
    section_name: 'Metrics Summary'
    plot_type: 'generalstats'

sp:
  metrics_summary:
    fn: 'metrics_summary_mqc.tsv'

table_columns_placement:
  metrics_summary:
    Estimated Number of Cells: 1
    Mean Reads per Cell: 2
    Median Genes per Cell: 3
    Number of Reads: 4
    Sequencing Saturation: 5
    Reads Mapped Confidently to Genome: 6
    Reads Mapped Confidently to Transcriptome: 7
    Fraction Reads in Cells: 8
    Total Genes Detected: 9
    Median UMI Counts per Cell: 10
    Valid Barcodes: 1100
    Reads Mapped Antisense to Gene: 1200

table_columns_visible:
  metrics_summary:
    Q30 Bases in Barcode: False
    Q30 Bases in RNA Read: False
    Q30 Bases in UMI: False
    Reads Mapped to Genome: False
    Reads Mapped Confidently to Intergenic Regions: False
    Reads Mapped Confidently to Intronic Regions: False
    Reads Mapped Confidently to Exonic Regions: False

thousandsSep_format: ''

report_section_order:
  software_versions:
    order: -1100
  software_references:
    order: -1200
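
The placement and visibility settings in this config jointly decide which metrics columns appear in the table and in what order. The sketch below illustrates that interaction in plain Python. It is not MultiQC code: `visible_column_order` is a hypothetical helper, and the default placement of 1000 for unlisted columns is an assumption made for this example.

```python
# Assumed default placement for columns that are not listed explicitly.
DEFAULT_PLACEMENT = 1000


def visible_column_order(columns, placement, visible):
    """Return the shown columns, sorted left to right by placement value."""
    # Columns are shown unless they are explicitly marked as not visible.
    shown = [name for name in columns if visible.get(name, True)]
    # Lower placement values sort first; unlisted columns use the default.
    return sorted(shown, key=lambda name: placement.get(name, DEFAULT_PLACEMENT))


columns = ["Valid Barcodes", "Q30 Bases in UMI",
           "Number of Reads", "Estimated Number of Cells"]
placement = {"Estimated Number of Cells": 1,
             "Number of Reads": 4,
             "Valid Barcodes": 1100}
visible = {"Q30 Bases in UMI": False}

print(visible_column_order(columns, placement, visible))
```

Under these assumptions, a placement above the default (such as 1100 for "Valid Barcodes") pushes a column to the right of any unlisted column.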
......@@ -14,6 +14,8 @@ params.kitVersion = 'three'
params.version = '3.0.2'
params.astrocyte = false
params.outDir = "$baseDir/output"
params.multiqcConf = "$baseDir/conf/multiqc_config.yaml"
params.references = "$baseDir/../docs/references.md"
// Assign variables if astrocyte
if (params.astrocyte) {
......@@ -54,23 +56,23 @@ forceCells = params.forceCells
chemistryParam = params.chemistryParam
version = params.version
outDir = params.outDir
multiqcConf = params.multiqcConf
references = params.references
process checkDesignFile {
tag "$name"
publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'
module 'python/3.6.1-2-anaconda'
input:
file designLocation
file fastqList
output:
file("design.checked.csv") into designPaths
script:
"""
hostname
ulimit -a
......@@ -78,6 +80,7 @@ process checkDesignFile {
"""
}
// Parse design file
samples = designPaths
.splitCsv (sep: ',', header: true)
......@@ -105,6 +108,7 @@ forceCells302 = forceCells
chemistryParam301 = chemistryParam
chemistryParam302 = chemistryParam
process count211 {
queue '128GB,256GB,256GBv1,384GB'
tag "$sample"
......@@ -112,15 +116,14 @@ process count211 {
module 'cellranger/2.1.1'
input:
set sample, file("${sample}_S1_L00?_R1_001.fastq.gz"), file("${sample}_S1_L00?_R2_001.fastq.gz") from samples211
file ref from refLocation211.first()
expectCells211
forceCells211
output:
file("**/outs/**") into outPaths211
file("*_metrics_summary.tsv") into metricsSummary211
when:
version == '2.1.1'
......@@ -132,6 +135,7 @@ process count211 {
ulimit -a
bash "$baseDir/scripts/filename_check.sh" -r "$ref"
cellranger count --id="$sample" --transcriptome="./$ref" --fastqs=. --sample="$sample" --expect-cells=$expectCells211
sed -E 's/("([^"]*)")?,/\\2\t/g' ${sample}/outs/metrics_summary.csv | tr -d "," | sed "s/^/${sample}\t/" > ${sample}_metrics_summary.tsv
"""
} else {
"""
......@@ -139,10 +143,12 @@ process count211 {
ulimit -a
bash "$baseDir/scripts/filename_check.sh" -r "$ref"
cellranger count --id="$sample" --transcriptome="./$ref" --fastqs=. --sample="$sample" --force-cells=$forceCells211
sed -E 's/("([^"]*)")?,/\\2\t/g' ${sample}/outs/metrics_summary.csv | tr -d "," | sed "s/^/${sample}\t/" > ${sample}_metrics_summary.tsv
"""
}
}
process count301 {
queue '128GB,256GB,256GBv1,384GB'
tag "$sample"
......@@ -150,7 +156,6 @@ process count301 {
module 'cellranger/3.0.1'
input:
set sample, file("${sample}_S1_L00?_R1_001.fastq.gz"), file("${sample}_S1_L00?_R2_001.fastq.gz") from samples301
file ref from refLocation301.first()
expectCells301
......@@ -158,8 +163,8 @@ process count301 {
chemistryParam301
output:
file("**/outs/**") into outPaths301
file("*_metrics_summary.tsv") into metricsSummary301
when:
version == '3.0.1'
......@@ -167,21 +172,24 @@ process count301 {
script:
if (forceCells301 == 0){
"""
hostname
ulimit -a
bash "$baseDir/scripts/filename_check.sh" -r "$ref"
cellranger count --id="$sample" --transcriptome="./$ref" --fastqs=. --sample="$sample" --expect-cells=$expectCells301 --chemistry="$chemistryParam301"
sed -E 's/("([^"]*)")?,/\\2\t/g' ${sample}/outs/metrics_summary.csv | tr -d "," | sed "s/^/${sample}\t/" > ${sample}_metrics_summary.tsv
"""
} else {
"""
hostname
ulimit -a
bash "$baseDir/scripts/filename_check.sh" -r "$ref"
cellranger count --id="$sample" --transcriptome="./$ref" --fastqs=. --sample="$sample" --force-cells=$forceCells301 --chemistry="$chemistryParam301"
sed -E 's/("([^"]*)")?,/\\2\t/g' ${sample}/outs/metrics_summary.csv | tr -d "," | sed "s/^/${sample}\t/" > ${sample}_metrics_summary.tsv
"""
}
}
process count302 {
queue '128GB,256GB,256GBv1,384GB'
tag "$sample"
......@@ -189,7 +197,6 @@ process count302 {
module 'cellranger/3.0.2'
input:
set sample, file("${sample}_S?_L001_R1_001.fastq.gz"), file("${sample}_S?_L001_R2_001.fastq.gz") from samples302
file ref from refLocation302.first()
expectCells302
......@@ -197,8 +204,8 @@ process count302 {
chemistryParam302
output:
file("**/outs/**") into outPaths302
file("*_metrics_summary.tsv") into metricsSummary302
when:
version == '3.0.2'
......@@ -210,6 +217,7 @@ process count302 {
ulimit -a
bash "$baseDir/scripts/filename_check.sh" -r "$ref"
cellranger count --id="$sample" --transcriptome="./$ref" --fastqs=. --sample="$sample" --expect-cells=$expectCells302 --chemistry="$chemistryParam302"
sed -E 's/("([^"]*)")?,/\\2\t/g' ${sample}/outs/metrics_summary.csv | tr -d "," | sed "s/^/${sample}\t/" > ${sample}_metrics_summary.tsv
"""
} else {
"""
......@@ -217,6 +225,56 @@ process count302 {
ulimit -a
bash "$baseDir/scripts/filename_check.sh" -r "$ref"
cellranger count --id="$sample" --transcriptome="./$ref" --fastqs=. --sample="$sample" --force-cells=$forceCells302 --chemistry="$chemistryParam302"
sed -E 's/("([^"]*)")?,/\\2\t/g' ${sample}/outs/metrics_summary.csv | tr -d "," | sed "s/^/${sample}\t/" > ${sample}_metrics_summary.tsv
"""
}
}
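
The `sed`/`tr` step added to each count process turns `metrics_summary.csv` into a TSV in which every row, header included, is prefixed with the sample name and thousands separators are removed. The following is a minimal Python sketch of that conversion, not the pipeline's own code: `metrics_csv_to_tsv` is a hypothetical helper and the sample data is invented for illustration. Unlike the regex, which only unquotes a field when a comma follows it, the `csv` module also unquotes a quoted final field.

```python
import csv
import io


def metrics_csv_to_tsv(csv_text, sample):
    """Convert metrics CSV text into sample-prefixed, tab-separated text."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        # Drop thousands separators such as the comma in "1,234".
        cleaned = [field.replace(",", "") for field in fields]
        # Prefix every row, including the header, with the sample name.
        rows.append("\t".join([sample] + cleaned))
    return "\n".join(rows) + "\n"


example = 'Estimated Number of Cells,Mean Reads per Cell\n"1,234","56,789"\n'
print(metrics_csv_to_tsv(example, "sampleA"), end="")
```

Because the header row also receives the sample prefix, its first column has to be relabelled when the per-sample files are merged.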
process versions {
tag "$name"
publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'
module 'python/3.6.1-2-anaconda:pandoc/2.7:multiqc/1.7'
input:
output:
file("*.yaml") into yamlPaths
script:
"""
hostname
ulimit -a
echo $workflow.nextflow.version > version_nextflow.txt
echo $version > version_cellranger.txt
multiqc --version | sed 's/^multiqc, version //' > version_multiqc.txt
python3 "$baseDir/scripts/generate_versions.py" -f version_*.txt -o versions
python3 "$baseDir/scripts/generate_references.py" -r "$references" -o references
"""
}
metricsSummary = metricsSummary211.mix(metricsSummary301, metricsSummary302)
// Generate MultiQC Report
process multiqc {
tag "$name"
queue 'super'
publishDir "$outDir/${task.process}/$name", mode: 'copy'
module 'multiqc/1.7'
input:
file ('*') from metricsSummary.collect()
file yamlPaths
output:
file "multiqc_report.html" into mqcPaths
script:
"""
hostname
ulimit -a
awk 'FNR==1 && NR!=1{next;}{print}' *.tsv > metrics_summary_mqc.tsv
sed -i '1s/^.*\tE/Sample\tE/' metrics_summary_mqc.tsv
multiqc -c $multiqcConf .
"""
}
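
The `awk` and `sed` lines in the `multiqc` process concatenate the per-sample TSV files, keep only the first header row, and relabel that header's first column as `Sample`. A Python sketch of the same merge is shown here for illustration; `merge_metrics` is a hypothetical helper that works on in-memory strings rather than on files.

```python
def merge_metrics(tsv_texts):
    """Merge per-sample TSV texts, keeping a single header row."""
    merged = []
    for text in tsv_texts:
        lines = text.splitlines()
        if not lines:
            continue
        header, body = lines[0], lines[1:]
        if not merged:
            # Keep the first header only, relabelling its first column,
            # which holds a sample name rather than a column title.
            columns = header.split("\t")
            columns[0] = "Sample"
            merged.append("\t".join(columns))
        # Headers of later files are skipped; data rows are always kept.
        merged.extend(body)
    return "\n".join(merged) + "\n"


first = "s1\tEstimated Number of Cells\tNumber of Reads\ns1\t10\t100\n"
second = "s2\tEstimated Number of Cells\tNumber of Reads\ns2\t20\t200\n"
print(merge_metrics([first, second]), end="")
```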
#
# * --------------------------------------------------------------------------
# * Licensed under MIT (https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/LICENSE.md)
# * --------------------------------------------------------------------------
#
'''Make header for HTML of references.'''

import argparse
import subprocess
import shlex
import logging

EPILOG = '''
For more details:
        %(prog)s --help
'''

# SETTINGS
logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())
logger.propagate = False
logger.setLevel(logging.INFO)


def get_args():
    '''Define arguments.'''
    parser = argparse.ArgumentParser(
        description=__doc__, epilog=EPILOG,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('-r', '--reference',
                        help="The reference file (markdown format).",
                        required=True)
    parser.add_argument('-o', '--output',
                        help="The out file name.",
                        default='references')
    args = parser.parse_args()
    return args


def main():
    '''Write the MultiQC section header, then append the references as HTML.'''
    args = get_args()
    reference = args.reference
    output = args.output
    out_filename = output + '_mqc.yaml'

    # Header for HTML; the file is closed before pandoc appends to it
    with open(out_filename, "w") as out_file:
        print('''
id: 'software_references'
section_name: 'Software References'
description: 'This section describes references for the tools used.'
plot_type: 'html'
data: |
''', file=out_file)

    # Turn Markdown into HTML, indented so it sits inside the "data: |" block
    references_html = 'bash -c "pandoc -p {} | sed \'s/^/ /\' >> {}"'
    references_html = references_html.format(reference, out_filename)
    subprocess.check_call(shlex.split(references_html))


if __name__ == '__main__':
    main()
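
The script above writes a YAML header and then relies on `pandoc` piped through `sed` to indent the HTML under `data: |`. As an illustration of why that indentation matters, the sketch below builds the same kind of document in memory with `textwrap.indent`. It is an alternative under stated assumptions, not the author's method: `build_references_yaml` is a hypothetical helper, the four-space indent is chosen for the example, and the HTML is passed in as a string instead of being produced by `pandoc`.

```python
import textwrap

# Section header copied from the script; the body is appended below it.
HEADER_LINES = [
    "id: 'software_references'",
    "section_name: 'Software References'",
    "description: 'This section describes references for the tools used.'",
    "plot_type: 'html'",
    "data: |",
]
HEADER = "\n".join(HEADER_LINES) + "\n"


def build_references_yaml(html):
    """Return the header followed by the HTML as an indented block scalar."""
    # Every non-blank line must be indented to belong to the `data: |` block.
    body = textwrap.indent(html, "    ")
    return HEADER + body


print(build_references_yaml("<h3>References</h3>\n<p>MultiQC</p>\n"), end="")
```

If the HTML lines were not indented, a YAML parser would treat them as new top-level content rather than as the value of `data`.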
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

'''Make YAML of software versions.'''

from __future__ import print_function
from collections import OrderedDict
import re
import logging
import argparse

EPILOG = '''
For more details:
        %(prog)s --help
'''

# SETTINGS
logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())
logger.propagate = False
logger.setLevel(logging.INFO)

# Software name -> [version file, regex capturing the version string]
SOFTWARE_REGEX = {
    'Nextflow': ['version_nextflow.txt', r"(\S+)"],
    'Cellranger Count': ['version_cellranger.txt', r"(\S+)"],
    'MultiQC': ['version_multiqc.txt', r"(\S+)"],
}


def get_args():
    '''Define arguments.'''
    parser = argparse.ArgumentParser(
        description=__doc__, epilog=EPILOG,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('-f', '--files',
                        help="The version files.",
                        required=True,
                        nargs='*')
    parser.add_argument('-o', '--output',
                        help="The out file name.",
                        required=True)
    args = parser.parse_args()
    return args


def check_files(files):
    '''Check that every version file has a matching regex entry.'''
    logger.info("Running file check.")
    software_files = [entry[0] for entry in SOFTWARE_REGEX.values()]
    extra_files = set(files) - set(software_files)
    if len(extra_files) > 0:
        logger.error('Missing regex: %s', list(extra_files))
        raise Exception("Missing regex: %s" % list(extra_files))


def main():
    '''Collect the versions and write them as a MultiQC HTML section.'''
    args = get_args()
    files = args.files
    output = args.output
    out_filename = output + '_mqc.yaml'

    # Default every entry to N/A until a version is found
    results = OrderedDict()
    results['Nextflow'] = '<span style="color:#999999;\">N/A</span>'
    results['Cellranger Count'] = '<span style="color:#999999;\">N/A</span>'
    results['MultiQC'] = '<span style="color:#999999;\">N/A</span>'

    # Check for version files:
    check_files(files)

    # Search each file using its regex
    for k, v in SOFTWARE_REGEX.items():
        with open(v[0]) as x:
            versions = x.read()
            match = re.search(v[1], versions)
            if match:
                results[k] = "v{}".format(match.group(1))

    # Dump to YAML, using a single file handle for the whole section
    with open(out_filename, "w") as out_file:
        print('''
id: 'software_versions'
section_name: 'Software Versions'
section_href: 'https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/'
plot_type: 'html'
description: 'are collected at run time from the software output.'
data: |
 <dl class="dl-horizontal">
''', file=out_file)
        for k, v in results.items():
            print(" <dt>{}</dt><dd>{}</dd>".format(k, v), file=out_file)
        print(" </dl>", file=out_file)


if __name__ == '__main__':
    main()
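
The version lookup in this script follows a simple pattern: start every tool at a placeholder, then overwrite it with the first regex match found in that tool's captured output. The sketch below shows that pattern on in-memory strings so it can be tried without version files. `extract_versions`, `VERSION_PATTERNS` and the plain `N/A` placeholder are hypothetical names introduced for illustration, and the version strings are example values.

```python
import re
from collections import OrderedDict

NOT_AVAILABLE = "N/A"

# Display name -> regex capturing the version (mirrors the (\S+) patterns).
VERSION_PATTERNS = OrderedDict([
    ("Nextflow", r"(\S+)"),
    ("MultiQC", r"(\S+)"),
])


def extract_versions(raw_outputs):
    """Map each tool to 'v<version>', or to N/A when nothing was captured."""
    results = OrderedDict((name, NOT_AVAILABLE) for name in VERSION_PATTERNS)
    for name, pattern in VERSION_PATTERNS.items():
        # (\S+) captures the first run of non-whitespace characters.
        match = re.search(pattern, raw_outputs.get(name, ""))
        if match:
            results[name] = "v" + match.group(1)
    return results


print(dict(extract_versions({"Nextflow": "19.04.1\n"})))
```

A tool with empty or missing output keeps its placeholder, so the report still lists it.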