Commit 7003dff5 authored by Gervaise Henry

Merge branch 'develop' into 'master'

Develop

See merge request !8
parents c08056cb f448eb64
2 merge requests: !60 Master, !8 Develop
Pipeline #3500 passed with stages
in 2 minutes and 10 seconds
......@@ -3,15 +3,48 @@
|[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/badges/master/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/commits/master)|[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/badges/develop/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/commits/develop)|
10x Genomics scRNA-Seq (cellranger) mkfastq Pipeline
========================================
==================================================
Introduction
------------
This pipeline is a wrapper for the cellranger mkfastq tool from 10x Genomics. It takes bcl files from sequencing of 10x Genomics Single Cell Gene Expression libraries, and deconvolutes the reads by the samples' barcodes.
This pipeline is a wrapper for the cellranger mkfastq tool from 10x Genomics (which uses Illumina's bcl2fastq). It demultiplexes samples from 10x Genomics Single Cell Gene Expression libraries into fastq files.
FastQC is run on the resulting fastq files, and the FastQC and bcl2fastq reports are collated with the MultiQC tool.
The pipeline uses Nextflow, a bioinformatics workflow tool.
This pipeline is primarily used with a SLURM cluster on the BioHPC Cluster. However, the pipeline should be able to run on any system that Nextflow supports.
Additionally, the pipeline is designed to work with Astrocyte Workflow System using a simple web interface.
To Run:
-------
* Available parameters:
* **--name**
* run name, puts outputs in a directory with this name
* eg: **--name 'test'**
* **--bcl**
* Base call files from the sequencing of a 10x single-cell experiment, tarballed (*.tar) and optionally gzipped (*.tar.gz); decompression of gzipped tarballs supports pigz parallelization.
* There can be multiple basecall files, but they all will be demultiplexed by the same design file.
* eg: **--bcl '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.tar.gz'**
* **--designFile**
* path to design file (csv format) location
* column 1 = "Lane" (number of lanes to demultiplex, */** for all lanes)
* column 2 = "Sample" (sample name)
* column 3 = "Index" (10x sample index barcode, eg SI-GA-A1)
* can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
* eg: **--designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.csv'**
* **--outDir**
* optional output directory for run
* eg: **--outDir 'test'**
* FULL EXAMPLE:
**nextflow run workflow/main.nf --name 'test' --bcl '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.tar.gz' --designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.csv' --outDir 'test'**
* Design example:
| Lane | Sample | Index |
|------|-------------|-----------|
| * | test_sample | SI-P03-C9 |
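The design file layout described above (three columns named "Lane", "Sample", and "Index") can be checked with a few lines of Python. This is only a minimal sketch: the function name `check_design` and its error messages are hypothetical, and it is not the check that the pipeline's checkDesignFile process performs, which is not shown in this commit.

```python
import csv
import io

# Column names taken from the README's design file description.
EXPECTED_HEADER = ["Lane", "Sample", "Index"]


def check_design(handle):
    """Return the design rows as dicts, raising ValueError on a bad layout."""
    reader = csv.reader(handle)
    header = [col.strip() for col in next(reader, [])]
    if header != EXPECTED_HEADER:
        raise ValueError("expected header %s, found %s" % (EXPECTED_HEADER, header))
    rows = []
    for line_number, row in enumerate(reader, start=2):
        if not any(cell.strip() for cell in row):
            continue  # skip blank lines
        if len(row) != len(EXPECTED_HEADER):
            raise ValueError("line %d: expected 3 columns, found %d"
                             % (line_number, len(row)))
        rows.append(dict(zip(EXPECTED_HEADER, (cell.strip() for cell in row))))
    return rows


# The README's design example, written as csv text.
example = io.StringIO("Lane,Sample,Index\n*,test_sample,SI-P03-C9\n")
design_rows = check_design(example)
print(design_rows)
```

Repeated "Sample" values pass this check unchanged, which matches the README's note that a sample may appear on several rows.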
......@@ -81,14 +81,14 @@ workflow_parameters:
type: files
required: true
description: |
One or more input Tarball BCL files from a sequencing of 10x single-cell experiment.
regex: ".*tar.gz"
One or more input tarballs (optionally gzipped) of basecall (BCL) files from the sequencing of a 10x single-cell experiment (can be .tar or .tar.gz).
regex: ".*tar*"
min: 1
- id: designFile
type: file
required: true
regex: ".*csv"
regex: "*.csv"
description: |
A design file listing lane, sample, corresponding index.
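The two `regex` values changed in this hunk look more like shell globs than regular expressions. Which regex engine Astrocyte applies to these fields is not stated here, so the sketch below uses Python's `re` module purely as an illustration; the `stricter_*` patterns are hypothetical alternatives, not values taken from the repository.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


new_bcl = ".*tar*"     # value introduced in this commit
new_design = "*.csv"   # value introduced in this commit

# Hypothetical stricter patterns (assumptions for illustration only).
stricter_bcl = r".*\.tar(\.gz)?$"
stricter_design = r".*\.csv$"

# A leading "*" has nothing to repeat, so Python rejects the pattern.
print("'*.csv' compiles:", compiles(new_design))

# ".*tar*" means "anything, then 'ta', then zero or more 'r'",
# so it also accepts names that merely contain "ta".
for name in ["run.tar.gz", "run.tar", "data.txt"]:
    print(name,
          "| committed pattern:", bool(re.match(new_bcl, name)),
          "| stricter pattern:", bool(re.match(stricter_bcl, name)))
```

If the target engine behaves like Python's, the committed design file pattern would fail to compile and the BCL pattern would accept more file names than intended.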
......
File moved
......@@ -4,12 +4,35 @@
Introduction
------------
This pipeline is a wrapper for the cellranger count tool from 10x Genomics. It takes fastq files from 10x Genomics Single Cell Gene Expression libraries, performs alignment, filtering, barcode counting, and UMI counting. It uses the Chromium cellular barcodes to generate gene-barcode matrices, determine clusters, and perform gene expression analysis.
This pipeline is a wrapper for the cellranger mkfastq tool from 10x Genomics (which uses Illumina's bcl2fastq). It demultiplexes samples from 10x Genomics Single Cell Gene Expression libraries into fastq files.
The pipeline uses Nextflow, a bioinformatics workflow tool.
FastQC is run on the resulting fastq files, and the FastQC and bcl2fastq reports are collated with the MultiQC tool.
The pipeline uses Nextflow, a bioinformatics workflow tool.
To Run:
-------
* Workflow parameters:
* **bcl**
* Base call files from the sequencing of a 10x single-cell experiment, tarballed (*.tar) and optionally gzipped (*.tar.gz); decompression of gzipped tarballs supports pigz parallelization.
* There can be multiple basecall files, but they all will be demultiplexed by the same design file.
* REQUIRED
* **design file**
* A design file listing the lane, sample, and corresponding sample barcode. There can be multiple rows with the same sample name if there are multiple fastq files for that sample.
* REQUIRED
* column 1 = "Lane" (number of lanes to demultiplex, */** for all lanes)
* column 2 = "Sample" (sample name)
* column 3 = "Index" (10x sample index barcode, eg SI-GA-A1)
* an example can be downloaded [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/docs/design.csv)
* Design example:
| Lane | Sample | Index |
|------|-------------|-----------|
| * | test_sample | SI-P03-C9 |
Credits
......
......@@ -3,17 +3,29 @@ process {
queue='super'
// Process specific configuration
$checkDesignFile {
withLabel:checkDesignFile {
module = ['python/3.6.1-2-anaconda']
executor = 'local'
}
$untarBCL {
withLabel:untarBCL {
module = ['pigz/2.4']
queue = 'super'
}
$mkfastq {
withLabel:mkfastq {
module = ['cellranger/3.0.2', 'bcl2fastq/2.19.1']
queue = 'super'
queue = '128GB,256GB,256GBv1,384GB'
}
withLabel:fastqc {
module = ['fastqc/0.11.5', 'parallel']
queue = 'super'
}
withLabel:versions {
module = ['python/3.6.1-2-anaconda']
executor = 'local'
}
withLabel:multiqc {
module = ['multiqc/1.7']
queue = 'super'
}
}
......
......@@ -4,6 +4,7 @@
// Note - $baseDir is the location of this workflow file main.nf
// Define Input variables
params.name = "small"
params.bcl = "$baseDir/../test_data/*.tar.gz"
params.designFile = "$baseDir/../test_data/design.csv"
params.outDir = "$baseDir/output"
......@@ -12,14 +13,15 @@ params.outDir = "$baseDir/output"
tarList = Channel.fromPath( params.bcl )
// Define regular variables
name = params.name
designLocation = Channel
.fromPath(params.designFile)
.ifEmpty { exit 1, "design file not found: ${params.designFile}" }
outDir = params.outDir
process checkDesignFile {
publishDir "$outDir/${task.process}", mode: 'copy'
tag "$name"
publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'
input:
......@@ -42,7 +44,6 @@ process checkDesignFile {
process untarBCL {
tag "$tar"
publishDir "$outDir/${task.process}", mode: 'copy'
input:
......@@ -58,15 +59,18 @@ process untarBCL {
"""
hostname
ulimit -a
module load pigz/2.4
tar -xvf $tar -I pigz
name=`echo ${tar} | rev | cut -f1 -d '.' | rev`;
if [ "\${name}" == "gz" ];
then module load pigz/2.4;
tar -xvf $tar -I pigz;
else tar -xvf ${tar};
fi;
"""
}
process mkfastq {
tag "${bcl.baseName}"
queue '128GB,256GB,256GBv1,384GB'
publishDir "$outDir/${task.process}", mode: 'copy'
input:
......@@ -76,7 +80,10 @@ process mkfastq {
output:
file("**/outs/fastq_path/**/*") into fastqPaths
file("**/outs/fastq_path/**/*") into mkfastqPaths
file("**/outs/**/*.fastq.gz") into fastqPaths
file("**/outs/fastq_path/Stats/Stats.json") into bqcPaths
file("version*.txt") into versionPaths_mkfastq
script:
......@@ -85,6 +92,83 @@ process mkfastq {
ulimit -a
module load cellranger/3.0.2
module load bcl2fastq/2.19.1
cellranger mkfastq --nopreflight --id="${bcl.baseName}" --run=$bcl --csv=$designPaths -r \$SLURM_CPUS_ON_NODE -p \$SLURM_CPUS_ON_NODE -w \$SLURM_CPUS_ON_NODE
sh $baseDir/scripts/versions_mkfastq.sh
cellranger mkfastq --id="${bcl.baseName}" --run=$bcl --csv=$designPaths -r \$SLURM_CPUS_ON_NODE -p \$SLURM_CPUS_ON_NODE -w \$SLURM_CPUS_ON_NODE
"""
}
process fastqc {
tag "$name"
queue 'super'
publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'
input:
file fastqPaths
output:
file("*fastqc.*") into fqcPaths
file("version*.txt") into versionPaths_fastqc
script:
"""
hostname
ulimit -a
module load fastqc/0.11.5
module load parallel
sh $baseDir/scripts/fastqc.sh
"""
}
process versions {
tag "$name"
publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'
input:
file versionPaths_mkfastq
file versionPaths_fastqc
output:
file("*.yaml") into yamlPaths
script:
"""
hostname
ulimit -a
module load python/3.6.1-2-anaconda
echo $workflow.nextflow.version > version_nextflow.txt
python3 $baseDir/scripts/generate_versions.py -f version_*.txt -o versions
"""
}
process multiqc {
tag "$name"
queue 'super'
publishDir "$outDir/${task.process}/$name", mode: 'copy'
input:
file bqcPaths
file fqcPaths
file yamlPaths
output:
file("*") into mqcPaths
script:
"""
hostname
ulimit -a
module load multiqc/1.7
multiqc . -c $baseDir/scripts/.multiqc_config.yaml
"""
}
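The untarBCL process above picks its tar invocation from the archive's final extension (the `rev | cut -f1 -d '.' | rev` idiom), using pigz only for `.gz` files. The same decision can be sketched in Python as follows; the function name `untar_command` is hypothetical and the sketch only builds the command, it does not run it.

```python
from pathlib import Path


def untar_command(tar_path):
    """Build the tar command for an archive, using pigz only for .gz files."""
    # Path.suffix is the final extension, like the rev | cut | rev idiom.
    if Path(tar_path).suffix == ".gz":
        return ["tar", "-xvf", tar_path, "-I", "pigz"]
    return ["tar", "-xvf", tar_path]


print(untar_command("cellranger-tiny-bcl-simple-1_2_0.tar.gz"))
print(untar_command("cellranger-tiny-bcl-simple-1_2_0.tar"))
```

Branching on the last extension alone keeps plain `.tar` inputs working without pigz, which is the behaviour this commit adds.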
top_modules:
- 'Software Versions'
module_order:
- bcl2fastq
- fastqc
#!/bin/bash
find . -name '*.fastq.gz' | awk '{printf("fastqc \"%s\"\n", $0)}' | parallel -j 25 --verbose
#find . -name '*fastqc.*' | xargs -I '{}' mv '{}' ./
fastqc --version |& grep 'FastQC v' | sed -n -e 's/^FastQC v//p' > version_fastqc.txt
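The script above builds one `fastqc` command per `*.fastq.gz` file and hands the list to GNU parallel with 25 jobs. A rough Python equivalent, assuming `fastqc` is on the PATH, might look like the sketch below; the function names are hypothetical and nothing is executed until `run_fastqc` is called.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def fastqc_commands(root):
    """Return one fastqc command per *.fastq.gz file found under root."""
    return [["fastqc", str(path)]
            for path in sorted(Path(root).rglob("*.fastq.gz"))]


def run_fastqc(root, jobs=25):
    """Run the commands with up to `jobs` at a time; return the exit codes."""
    with ThreadPoolExecutor(max_workers=jobs) as pool:
        return list(pool.map(lambda cmd: subprocess.run(cmd).returncode,
                             fastqc_commands(root)))
```

Threads are sufficient here because each worker only waits on an external `fastqc` process.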
#!/usr/bin/env python3
# -*- coding: utf-8 -*-

'''Make YAML of software versions.'''

from __future__ import print_function
from collections import OrderedDict
import re
import logging
import argparse
import numpy as np

EPILOG = '''
For more details:
    %(prog)s --help
'''

# SETTINGS
logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())
logger.propagate = False
logger.setLevel(logging.INFO)

# Maps each software name to [version file, regex capturing the version]
SOFTWARE_REGEX = {
    'Nextflow': ['version_nextflow.txt', r"(\S+)"],
    'cellranger mkfastq': ['version_cellranger.mkfastq.txt', r"(\S+)"],
    'bcl2fastq': ['version_bcl2fastq.txt', r"(\S+)"],
    'fastqc': ['version_fastqc.txt', r"(\S+)"],
}


def get_args():
    '''Define arguments.'''
    parser = argparse.ArgumentParser(
        description=__doc__, epilog=EPILOG,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('-f', '--files',
                        help="The version files.",
                        required=True,
                        nargs='*')
    parser.add_argument('-o', '--output',
                        help="The out file name.",
                        required=True)
    args = parser.parse_args()
    return args


def check_files(files):
    '''Check that every version file passed in has a matching regex.'''
    logger.info("Running file check.")
    software_files = np.array(list(SOFTWARE_REGEX.values()))[:, 0]
    extra_files = set(files) - set(software_files)
    if len(extra_files) > 0:
        logger.error('Missing regex: %s', list(extra_files))
        raise Exception("Missing regex: %s" % list(extra_files))


def main():
    args = get_args()
    files = args.files
    output = args.output
    out_filename = output + '_mqc.yaml'

    # Default every entry to N/A so software without a version file still appears
    results = OrderedDict()
    results['Nextflow'] = '<span style="color:#999999;\">N/A</span>'
    results['cellranger mkfastq'] = '<span style="color:#999999;\">N/A</span>'
    results['bcl2fastq'] = '<span style="color:#999999;\">N/A</span>'
    results['fastqc'] = '<span style="color:#999999;\">N/A</span>'

    # Check for version files:
    check_files(files)

    # Search each provided file using its regex; a missing file keeps its N/A default
    for k, v in SOFTWARE_REGEX.items():
        if v[0] not in files:
            continue
        with open(v[0]) as x:
            versions = x.read()
        match = re.search(v[1], versions)
        if match:
            results[k] = "v{}".format(match.group(1))

    # Dump to YAML, writing through a single file handle that is closed on exit
    with open(out_filename, "w") as out_file:
        print(
            '''
id: 'Software Versions'
section_name: 'Software Versions'
section_href: 'https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/'
plot_type: 'html'
description: 'are collected at run time from the software output.'
data: |
    <dl class="dl-horizontal">
''', file=out_file)
        for k, v in results.items():
            print("        <dt>{}</dt><dd>{}</dd>".format(k, v), file=out_file)
        print("    </dl>", file=out_file)


if __name__ == '__main__':
    main()
#!/bin/bash
cellranger mkfastq --version | grep 'cellranger mkfastq ' | sed 's/.*(\(.*\))/\1/' > version_cellranger.mkfastq.txt
bcl2fastq --version |& grep 'bcl2fastq v' | sed -n -e 's/^bcl2fastq v//p' > version_bcl2fastq.txt
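The two `grep | sed` pipelines above reduce each tool's version banner to a bare version number. The Python sketch below mirrors that extraction so the patterns are easier to reason about; the sample banner strings are assumptions about what the tools print, not output captured from this pipeline, and the function names are hypothetical.

```python
import re


def cellranger_version(text):
    """Mirror: grep 'cellranger mkfastq ' | sed 's/.*(\\(.*\\))/\\1/'."""
    for line in text.splitlines():
        if "cellranger mkfastq " in line:
            # Keep only what sits inside the last pair of parentheses.
            # Unlike the shell pipeline, only the first matching line is used.
            return re.sub(r".*\((.*)\)", r"\1", line, count=1)
    return None


def bcl2fastq_version(text):
    """Mirror: grep 'bcl2fastq v' | sed -n -e 's/^bcl2fastq v//p'."""
    prefix = "bcl2fastq v"
    for line in text.splitlines():
        if line.startswith(prefix):
            return line[len(prefix):]
    return None


# Assumed banner formats, for illustration only.
print(cellranger_version("cellranger mkfastq (3.0.2)\nCopyright notice"))
print(bcl2fastq_version("BCL to FASTQ file converter\nbcl2fastq v2.19.1.403\n"))
```

Both helpers return `None` when no line matches, whereas the shell versions would write an empty version file.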