Commit 33e60071 authored by Gervaise Henry

Merge branch 'develop' into 'master'

Develop

See merge request !41
parents 4c37ee84 2a53502c
Pipeline #4506 passed with stages
in 6 minutes and 26 seconds
before_script:
- module load astrocyte
- module load python/3.6.1-2-anaconda
- pip install --user pytest-pythonpath==0.7.1 pytest-cov==2.5.1
- module load nextflow/0.31.1_Ignite
- mkdir -p test_data/simple1
- mkdir -p test_data/simple2
......@@ -14,14 +15,44 @@ stages:
astrocyte_check:
stage: astrocyte
script:
- astrocyte_cli check ../cellranger_mkfastq
- astrocyte_cli check ../cellranger_mkfastq
artifacts:
expire_in: 2 days
retry:
max: 1
when:
- always
simple_1FC:
stage: simple
except:
- tags
script:
- nextflow run workflow/main.nf --bcl "test_data/simple1/*.tar.gz" --designFile "test_data/simple1/cellranger-tiny-bcl-simple-1_2_0.csv"
- nextflow run workflow/main.nf --bcl "test_data/simple1/*.tar.gz" --designFile "test_data/simple1/cellranger-tiny-bcl-simple-1_2_0.csv"
- pytest -m simple1
artifacts:
name: "$CI_JOB_NAME"
when: always
paths:
- .nextflow.log
expire_in: 2 days
retry:
max: 1
when:
- always
simple_2FC:
stage: simple
script:
- nextflow run workflow/main.nf --bcl "test_data/simple2/*.tar.gz" --designFile "test_data/simple2/cellranger-tiny-bcl-simple-1_2_0.csv"
- nextflow run workflow/main.nf --bcl "test_data/simple2/*.tar.gz" --designFile "test_data/simple2/cellranger-tiny-bcl-simple-1_2_0.csv"
- pytest -m simple2
artifacts:
name: "$CI_JOB_NAME"
when: always
paths:
- .nextflow.log
expire_in: 2 days
retry:
max: 1
when:
- always
# v1.2.0 (in development)
**User Facing**
* Add references of tools to MultiQC report
* Add BICF details to multiqc report
* Create cellranger_count design file (if only 1 flowcell is inputted)
**Background**
* Add DOI (develop branch)
* Add changelog as link to astrocyte docs (master branch)
* Update example design file link in astrocyte docs (master branch)
* Check tarballed bcl directory for spaces and exit if it contains one...cellranger mkfastq cannot handle spaces (develop branch)
* Move untar (including space check) to bash script
* Add Jeremy Mathews to author list
* Apply style guide
* Add pytests for outputs
*Known Bugs*
* cellranger mkfastq will not accept spaces in path for run param even if quoted, issue raised on 10XGenomics/cellranger github issue [#31](https://github.com/10XGenomics/cellranger/issues/31)
* note: 10x doesn't check github issues, emailed instead
* note: pipeline checks for spaces and exits prematurely if found
* If multiple flowcell (tar'd) files are inputted, there will be multiple fastqs with the same name; resolving that name conflict is currently not tractable
* note: if multiple bcl files are detected then cellranger_count design file is not created
# v1.1.4
### User Facing
**User Facing**
* Fix design file not visible in Astrocyte
* Fix handling of multiple flowcells in 1 submission
### Background
**Background**
* Move multiqc config to conf folder
* Add CI test for multiple flowcells
* Add changelog
* Quote design/tarball/$baseDir path in processes in case of spaces
### *Known Bugs*
*Known Bugs*
* cellranger mkfastq will not accept spaces in path for run param even if quoted, issue raised on 10XGenomics/cellranger github issue [#31](https://github.com/10XGenomics/cellranger/issues/31)
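The space check mentioned in the changelog entries above lives in a bash script that is not shown in this diff. As a rough illustration only, the same idea can be sketched in Python: list the members of a tarball and report any whose path contains a space. The function names `members_with_spaces` and `build_demo_tar` are hypothetical and are not part of the pipeline.

```python
import io
import tarfile


def members_with_spaces(tar_path):
    """Return the names of archive members whose path contains a space."""
    # Mode "r:*" lets tarfile auto-detect plain .tar versus .tar.gz input.
    with tarfile.open(tar_path, mode="r:*") as archive:
        return [name for name in archive.getnames() if " " in name]


def build_demo_tar(path, names):
    """Write a tiny gzip-compressed tarball of empty files (demo helper only)."""
    with tarfile.open(path, mode="w:gz") as archive:
        for name in names:
            info = tarfile.TarInfo(name=name)
            info.size = 0
            archive.addfile(info, io.BytesIO(b""))
```

A caller could exit early when `members_with_spaces(...)` returns a non-empty list, which is the behaviour the changelog describes for the pipeline.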
......@@ -2,6 +2,8 @@
|:-:|:-:|
|[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/badges/master/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/commits/master)|[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/badges/develop/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/commits/develop)|
[![DOI](https://zenodo.org/badge/DOI/10.5281/zenodo.2652611.svg)](https://doi.org/10.5281/zenodo.2652611)
10x Genomics scRNA-Seq (cellranger) mkfastq Pipeline
==================================================
......@@ -23,28 +25,39 @@ To Run:
* Available parameters:
* **--name**
* run name, puts outputs in a directory with this name
* eg: **--name 'test'**
* run name, puts outputs in a directory with this name
* eg: **--name 'test'**
* **--bcl**
* Base call files (tarballed [*.tar] +/- gzipped [*.tar.gz]) from sequencing of a 10x single-cell experiment; supports pigz parallelization.
* There can be multiple basecall files, but they all will be demultiplexed by the same design file.
* eg: **--bcl '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.tar.gz'**
* Base call files (tarballed [*.tar] +/- gzipped [*.tar.gz]) from sequencing of a 10x single-cell experiment; supports pigz parallelization.
* There can be multiple basecall files, but they all will be demultiplexed by the same design file.
* eg: **--bcl '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple1/cellranger-tiny-bcl-1_2_0.tar.gz'**
* **--designFile**
* path to design file (csv format) location
* column 1 = "Lane" (number of lanes to demultiplex, */** for all lanes)
* column 2 = "Sample" (sample name)
* column 3 = "Index" (10x sample index barcode, eg SI-GA-A1)
* can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
* eg: **--designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.csv'**
* **--outDir**
* optional output directory for run
* eg: **--outDir 'test'**
* FULL EXAMPLE:
**nextflow run workflow/main.nf --name 'test' --bcl '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.tar.gz' --designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.csv' --outDir 'test'**
* path to design file (csv format) location
* column 1 = "Lane" (number of lanes to demultiplex, */** for all lanes)
* column 2 = "Sample" (sample name)
* column 3 = "Index" (10x sample index barcode, eg SI-GA-A1)
* can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
* can be downloaded [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/blob/master/docs/design.csv)
* eg: **--designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.csv'**
* **--outDir**
* optional output directory for run
* eg: **--outDir 'test'**
* FULL EXAMPLE:
```
nextflow run workflow/main.nf --name 'test' --bcl '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple1/cellranger-tiny-bcl-1_2_0.tar.gz' --designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple1/cellranger-tiny-bcl-simple-1_2_0.csv' --outDir 'test'
```
* Design example:
| Lane | Sample | Index |
|------|-------------|-----------|
| * | test_sample | SI-P03-C9 |
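
The three-column layout described above can be checked before a run. The following is a minimal sketch, assuming the header row must be exactly `Lane,Sample,Index`; it is not the repository's `check_design.py`, and the names `EXPECTED_HEADERS` and `read_design` are hypothetical.

```python
import csv
import io

EXPECTED_HEADERS = ["Lane", "Sample", "Index"]  # column order from the README


def read_design(handle):
    """Parse a design CSV and return its rows as dictionaries.

    Raises ValueError if the header row differs from EXPECTED_HEADERS.
    """
    reader = csv.DictReader(handle)
    if reader.fieldnames != EXPECTED_HEADERS:
        raise ValueError("unexpected design headers: %r" % (reader.fieldnames,))
    return list(reader)


# The design example from the table above, written as CSV text.
example = io.StringIO("Lane,Sample,Index\n*,test_sample,SI-P03-C9\n")
rows = read_design(example)
print(rows[0]["Sample"])  # prints: test_sample
```

Repeated "Sample" values are accepted by this sketch, which matches the README's note that a sample may appear on several rows.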
[**CHANGELOG**](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/blob/develop/CHANGELOG.md)
Credits
-------
This workflow was developed jointly with the [Bioinformatic Core Facility (BICF), Department of Bioinformatics](http://www.utsouthwestern.edu/labs/bioinformatics/)
Please cite in publications: Pipeline was developed by BICF from funding provided by **Cancer Prevention and Research Institute of Texas (RP150596)**.
......@@ -9,7 +9,7 @@
# A unique identifier for the workflow package, text/underscores only
name: 'cellranger_mkfastq'
# Who wrote this?
author: 'Gervaise H. Henry, Venkat Malladi, and Jon Gesell'
author: 'Gervaise H. Henry, Jon Gesell, Jeremy Mathews, and Venkat Malladi'
# A contact email address for questions
email: 'bicf@utsouthwestern.edu'
# A more informative title for the workflow package
......@@ -85,13 +85,13 @@ workflow_parameters:
required: true
description: |
One or more input tarball (+/- gzip) basecall files (bcl) from sequencing of a 10x single-cell experiment (can be .tar or .tar.gz).
regex: ".*tar*"
regex: ".*\\.tar*"
min: 1
- id: designFile
type: file
required: true
regex: ".*csv"
regex: ".*\\.csv"
description: |
A design file listing lane, sample, corresponding index.
......
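The effect of escaping the dot in the two `regex` fields above can be illustrated with Python's `re` module. This is only a sketch: how Astrocyte itself applies these patterns is not shown in the diff, so matching from the start of the file name with `re.match` is an assumption, and the helper `accepts` is hypothetical. In a double-quoted YAML string, `\\.` decodes to the two characters `\.`, which are written here as raw strings.

```python
import re

OLD_TAR = r".*tar*"    # before the change
NEW_TAR = r".*\.tar*"  # after the change
OLD_CSV = r".*csv"     # before the change
NEW_CSV = r".*\.csv"   # after the change


def accepts(pattern, filename):
    """Return True if the pattern matches from the start of the file name."""
    # Assumption: prefix matching; the real tool may anchor differently.
    return re.match(pattern, filename) is not None


for name in ["run.tar.gz", "guitar", "design.csv", "designcsv"]:
    print(name,
          accepts(OLD_TAR, name), accepts(NEW_TAR, name),
          accepts(OLD_CSV, name), accepts(NEW_CSV, name))
```

Under that assumption the old patterns accept names such as `guitar` and `designcsv`, while the new ones require a literal dot before the extension. Note that `tar*` means "ta" followed by any number of "r", so the new pattern would still accept a name like `notes.table`.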
......@@ -15,16 +15,16 @@ To Run:
* Workflow parameters:
* **bcl**
* Base call files (tarballed [*.tar] +/- gzipped [*.tar.gz]) from sequencing of a 10x single-cell experiment; supports pigz parallelization.
* There can be multiple basecall files, but they all will be demultiplexed by the same design file.
* REQUIRED
* Base call files (tarballed [*.tar] +/- gzipped [*.tar.gz]) from sequencing of a 10x single-cell experiment; supports pigz parallelization.
* There can be multiple basecall files, but they all will be demultiplexed by the same design file.
* REQUIRED
* **design file**
* A design file listing lane, sample, corresponding sample barcode. There can be multiple rows with the same sample name, if there are multiple fastq's for that sample.
* REQUIRED
* column 1 = "Lane" (number of lanes to demultiplex, */** for all lanes)
* column 2 = "Sample" (sample name)
* column 3 = "Index" (10x sample index barcode, eg SI-GA-A1)
* eg: can be downloaded [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/docs/design.csv)
* A design file listing lane, sample, corresponding sample barcode. There can be multiple rows with the same sample name, if there are multiple fastq's for that sample.
* REQUIRED
* column 1 = "Lane" (number of lanes to demultiplex, */** for all lanes)
* column 2 = "Sample" (sample name)
* column 3 = "Index" (10x sample index barcode, eg SI-GA-A1)
* eg: can be downloaded [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/blob/master/docs/design.csv)
* Design example:
......@@ -35,6 +35,8 @@ To Run:
[**CHANGELOG**](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/blob/master/CHANGELOG.md)
Credits
-------
This workflow was developed jointly with the [Bioinformatic Core Facility (BICF), Department of Bioinformatics](http://www.utsouthwestern.edu/labs/bioinformatics/)
......
### References
1. **python**:
* Anaconda (Anaconda Software Distribution, [https://anaconda.com](https://anaconda.com))
2. **pigz**:
* Parallel implementation of gzip [https://zlib.net/pigz/](https://zlib.net/pigz/)
3. **bcl2fastq**:
* Illumina's bcl2fastq [https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html](https://support.illumina.com/sequencing/sequencing_software/bcl2fastq-conversion-software.html)
4. **cellranger**:
* 10x Genomics cellranger mkfastq [https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq](https://support.10xgenomics.com/single-cell-gene-expression/software/pipelines/latest/using/mkfastq)
5. **fastqc**:
* fastqc [https://www.bioinformatics.babraham.ac.uk/projects/fastqc/](https://www.bioinformatics.babraham.ac.uk/projects/fastqc/)
6. **MultiQC**:
* Ewels P., Magnusson M., Lundin S. and Käller M. 2016. MultiQC: Summarize analysis results for multiple tools and samples in a single report. Bioinformatics 32(19): 3047–3048. doi:[10.1093/bioinformatics/btw354](https://dx.doi.org/10.1093/bioinformatics/btw354)
7. **Nextflow**:
* Di Tommaso P., Chatzou M., Floden E. W., Barja P. P., Palumbo E., and Notredame C. 2017. Nextflow enables reproducible computational workflows. Nature biotechnology 35(4): 316. doi:[10.1038/nbt.3820](https://doi.org/10.1038/nbt.3820)
......@@ -25,7 +25,7 @@ process {
}
withLabel:multiqc {
module = ['multiqc/1.7']
executor = 'super'
executor = 'local'
}
}
......
# Custom Logo
custom_logo: 'bicf_logo.png'
custom_logo_url: 'https://www.utsouthwestern.edu/labs/bioinformatics/'
custom_logo_title: 'Bioinformatics Core Facility'
report_header_info:
- Contact E-mail: 'bicf@utsouthwestern.edu'
- Application Type: 'cellranger_mkfastq'
- Department: 'Bioinformatic Core Facility, Department of Bioinformatics'
# Title to use for the report.
title: BICF CellRanger MKfastq Analysis Report
report_comment: >
This report has been generated by the <a href="https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq"
target="_blank">BICF/cellranger_mkfastq</a> pipeline.
module_order:
- bcl2fastq
- fastqc:
......@@ -19,3 +37,9 @@ module_order:
path_filters:
- '*_R2_*fastqc.zip'
- custom_content
report_section_order:
software_versions:
order: -1100
software_references:
order: -1200
......@@ -3,14 +3,23 @@
// Path to an input file, or a pattern for multiple inputs
// Note - $baseDir is the location of this workflow file main.nf
// Define Input variables
params.name = "run"
params.bcl = "$baseDir/../test_data/*.tar.gz"
params.designFile = "$baseDir/../test_data/design.csv"
params.outDir = "$baseDir/output"
params.bcl = "${baseDir}/../test_data/*.tar.gz"
params.designFile = "${baseDir}/../test_data/design.csv"
params.outDir = "${baseDir}/output"
params.multiqcConf = "${baseDir}/conf/multiqc_config.yaml"
params.references = "${baseDir}/../docs/references.md"
// Define List of Files
tarList = Channel.fromPath( params.bcl )
tarList = Channel
.fromPath( params.bcl )
bclCount = Channel
.fromPath( params.bcl )
.count()
// Define regular variables
name = params.name
......@@ -18,153 +27,176 @@ designLocation = Channel
.fromPath(params.designFile)
.ifEmpty { exit 1, "design file not found: ${params.designFile}" }
outDir = params.outDir
multiqcConf = params.multiqcConf
references = params.references
process checkDesignFile {
tag "$name"
publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'
tag "${name}"
publishDir "${outDir}/misc/${task.process}/${name}", mode: 'copy'
module 'python/3.6.1-2-anaconda'
input:
file designLocation
file designLocation
output:
file("design.checked.csv") into designPaths
file("design.checked.csv") into designPaths
file("design.checked.csv") into designCount
script:
"""
hostname
ulimit -a
python3 ${baseDir}/scripts/check_design.py -d ${designLocation}
"""
"""
hostname
ulimit -a
python3 "$baseDir/scripts/check_design.py" -d "$designLocation"
"""
}
process untarBCL {
tag "$tar"
publishDir "$outDir/${task.process}", mode: 'copy'
tag "${tar}"
publishDir "${outDir}/${task.process}", mode: 'copy'
module 'pigz/2.4'
input:
file tar from tarList
file tar from tarList
output:
file("*") into bclPaths mode flatten
file("*") into bclPaths mode flatten
script:
"""
hostname
ulimit -a
bash ${baseDir}/scripts/untarBCL.sh -t ${tar}
"""
"""
hostname
ulimit -a
name=`echo ${tar} | rev | cut -f1 -d '.' | rev`;
if [ "\${name}" == "gz" ];
then tar -xvf "$tar" -I pigz;
else tar -xvf "$tar";
fi;
"""
}
process mkfastq {
tag "${bcl.baseName}"
queue '128GB,256GB,256GBv1,384GB'
publishDir "$outDir/${task.process}", mode: 'copy', pattern: "{*/outs/**/*.fastq.gz}"
publishDir "${outDir}/${task.process}", mode: 'copy', pattern: "{*/outs/**/*.fastq.gz}"
module 'cellranger/3.0.2:bcl2fastq/2.19.1'
input:
each bcl from bclPaths.collect()
file design from designPaths
each bcl from bclPaths.collect()
file design from designPaths
output:
file("**/outs/**/*.fastq.gz") into fastqPaths
file("**/outs/fastq_path/Stats/Stats.json") into bqcPaths
val "${bcl.baseName}" into bclName
file("**/outs/**/*.fastq.gz") into fastqPaths
file("**/outs/**/*.fastq.gz") into cellrangerCount
file("**/outs/fastq_path/Stats/Stats.json") into bqcPaths
val "${bcl.baseName}" into bclName
script:
"""
hostname
ulimit -a
cellranger mkfastq --id=${bcl.baseName} --run=${bcl} --csv=${design} -r \$SLURM_CPUS_ON_NODE -p \$SLURM_CPUS_ON_NODE -w \$SLURM_CPUS_ON_NODE
"""
}
if (bclCount.value == 1) {
process countDesign {
tag "${name}"
publishDir "${outDir}/misc/${task.process}/${name}", mode: 'copy'
input:
file fastqs from cellrangerCount.collect()
file design from designCount
output:
file("Cellranger_Count_Design.csv") into CountDesign
script:
"""
bash ${baseDir}/scripts/countDesign.sh
"""
}
"""
hostname
ulimit -a
cellranger mkfastq --id="${bcl.baseName}" --run="$bcl" --csv=$design -r \$SLURM_CPUS_ON_NODE -p \$SLURM_CPUS_ON_NODE -w \$SLURM_CPUS_ON_NODE
"""
}
process fastqc {
tag "$bclName"
tag "${bclName}"
queue 'super'
publishDir "$outDir/misc/${task.process}/$name/$bclName", mode: 'copy', pattern: "{*fastqc.zip}"
publishDir "${outDir}/misc/${task.process}/${name}/${bclName}", mode: 'copy', pattern: "{*fastqc.zip}"
module 'fastqc/0.11.5:parallel'
input:
file fastqPaths
val bclName
file fastqPaths
val bclName
output:
file("*fastqc.zip") into fqcPaths
file("*fastqc.zip") into fqcPaths
script:
"""
hostname
ulimit -a
find *.fastq.gz -exec mv {} ${bclName}.{} \\;
bash ${baseDir}/scripts/fastqc.sh
"""
"""
hostname
ulimit -a
find *.fastq.gz -exec mv {} $bclName.{} \\;
bash "$baseDir/scripts/fastqc.sh"
"""
}
process versions {
tag "$name"
publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'
module 'python/3.6.1-2-anaconda:cellranger/3.0.2:bcl2fastq/2.19.1:fastqc/0.11.5'
tag "${name}"
publishDir "${outDir}/misc/${task.process}/${name}", mode: 'copy'
module 'python/3.6.1-2-anaconda:cellranger/3.0.2:bcl2fastq/2.19.1:fastqc/0.11.5:pandoc/2.7'
input:
output:
file("*.yaml") into yamlPaths
file("*.yaml") into yamlPaths
script:
"""
hostname
ulimit -a
echo ${workflow.nextflow.version} > version_nextflow.txt
bash ${baseDir}/scripts/versions_mkfastq.sh
bash ${baseDir}/scripts/versions_fastqc.sh
python3 ${baseDir}/scripts/generate_versions.py -f version_*.txt -o versions
python3 ${baseDir}/scripts/generate_references.py -r ${references} -o references
"""
"""
hostname
ulimit -a
echo $workflow.nextflow.version > version_nextflow.txt
bash "$baseDir/scripts/versions_mkfastq.sh"
bash "$baseDir/scripts/versions_fastqc.sh"
python3 "$baseDir/scripts/generate_versions.py" -f version_*.txt -o versions
"""
}
process multiqc {
tag "$name"
tag "${name}"
queue 'super'
publishDir "$outDir/${task.process}/$name", mode: 'copy', pattern: "{multiqc*}"
publishDir "${outDir}/${task.process}/${name}", mode: 'copy', pattern: "{multiqc*}"
module 'multiqc/1.7'
input:
file bqc name "bqc/?/*" from bqcPaths.collect()
file fqc name "fqc/*" from fqcPaths.collect()
file yamlPaths
file bqc name "bqc/?/*" from bqcPaths.collect()
file fqc name "fqc/*" from fqcPaths.collect()
file yamlPaths
output:
file("*") into mqcPaths
file("multiqc_report.html") into mqcPaths
script:
"""
hostname
ulimit -a
multiqc -c ${multiqcConf} .
"""
"""
hostname
ulimit -a
multiqc . -c "$baseDir/conf/multiqc_config.yaml"
"""
}
......@@ -35,7 +35,7 @@ def get_args():
def check_design_headers(design):
'''Check if design file conforms to sequencing type.'''
'''Check if design file has correct headers.'''
# Default headers
design_template = [
......
#!/bin/bash
#countDesign.sh
fastqs=$(ls *.fastq.gz)
design=$(ls *.csv)
sample=$(cat ${design} | tail -n +2 | cut -d ',' -f2)
for i in ${fastqs};
do
if [[ ${i} == *_S0_* ]]; then
continue
elif [[ ${i} == *_I* ]]; then
continue
else
good=(${good[@]} ${i})
fi
done
echo "Sample,fastq_R1,fastq_R2" > Cellranger_Count_Design.csv;
echo "${sample},${good[0]},${good[1]}" >> Cellranger_Count_Design.csv;
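For readers less familiar with bash globbing, the filtering in countDesign.sh above can be restated as a short Python sketch. It skips names matching `*_S0_*` (presumably the undetermined reads) and `*_I*` (presumably index reads), then writes the first two remaining names into one row. The function names and the example file names are illustrative assumptions, and `sorted()` only approximates the order `ls` would produce.

```python
import fnmatch


def keep_read_fastqs(fastqs):
    """Return the FASTQ names that the shell loop would keep, in sorted order."""
    kept = []
    for name in sorted(fastqs):
        # Mirrors the two `continue` branches: skip *_S0_* and *_I* names.
        if fnmatch.fnmatchcase(name, "*_S0_*") or fnmatch.fnmatchcase(name, "*_I*"):
            continue
        kept.append(name)
    return kept


def count_design_row(sample, fastqs):
    """Build the single data row written after the header line."""
    good = keep_read_fastqs(fastqs)
    return "{},{},{}".format(sample, good[0], good[1])


# Illustrative file names only; real output names depend on the run.
demo_fastqs = [
    "Undetermined_S0_L001_R1_001.fastq.gz",
    "test_sample_S1_L001_I1_001.fastq.gz",
    "test_sample_S1_L001_R1_001.fastq.gz",
    "test_sample_S1_L001_R2_001.fastq.gz",
]
print("Sample,fastq_R1,fastq_R2")
print(count_design_row("test_sample", demo_fastqs))
```

As in the shell version, only the first two kept files are used, so this approach appears to assume a single sample with one R1/R2 pair. A sample name that itself contains `_I` would also be filtered out.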
#!/bin/bash
find . -name '*.fastq.gz' | awk '{printf("fastqc \"%s\"\n", $0)}' | parallel -j `grep -c ^processor /proc/cpuinfo` --verbose
find . -name '*.fastq.gz' | awk '{printf("fastqc \"%s\"\n", $0)}' | parallel -j $(grep -c ^processor /proc/cpuinfo) --verbose
#find . -name '*fastqc.*' | xargs -I '{}' mv '{}' ./
#for i in `ls *.fastq.gz`;
#do echo "fastqc ${i}";
......
#!/usr/bin/env python
#
# * --------------------------------------------------------------------------
# * Licensed under MIT (https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/LICENSE.md)
# * --------------------------------------------------------------------------
#
'''Make header for HTML of references.'''
import argparse
import subprocess
import shlex
import logging
EPILOG = '''
For more details:
%(prog)s --help
'''
# SETTINGS
logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())
logger.propagate = False
logger.setLevel(logging.INFO)
def get_args():
'''Define arguments.'''
parser = argparse.ArgumentParser(
description=__doc__, epilog=EPILOG,
formatter_class=argparse.RawDescriptionHelpFormatter)
parser.add_argument('-r', '--reference',
help="The reference file (markdown format).",
required=True)
parser.add_argument('-o', '--output',
help="The out file name.",
default='references')
args = parser.parse_args()
return args
def main():