Merge branch 'develop' into 'master'

Develop

See merge request !8

Commit 7003dff5 authored 5 years ago by Gervaise Henry
--- a/README.md
+++ b/README.md
@@ -3,15 +3,48 @@
 |[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/badges/master/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/commits/master)|[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/badges/develop/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/commits/develop)|

 10x Genomics scRNA-Seq (cellranger) mkfastq Pipeline
-========================================
+==================================================

 Introduction
 ------------

-This pipeline is a wrapper for the cellranger mkfastq tool from 10x Genomics. It takes bcl files from sequencing of 10x Genomics Single Cell Gene Expression libraries, and deconvolutles the reads by the samples' barcodes.
+This pipeline is a wrapper for the cellranger mkfastq tool from 10x Genomics (which uses Illumina's bcl2fastq). It demultiplexes samples from 10x Genomics Single Cell Gene Expression libraries into fastq files.
+
+FastQC is run on the resulting fastq files, and the FastQC and bcl2fastq reports are collated with the MultiQC tool.

 The pipeline uses Nextflow, a bioinformatics workflow tool.

 This pipeline is primarily used with a SLURM cluster on the BioHPC Cluster. However, the pipeline should be able to run on any system that Nextflow supports.

 Additionally, the pipeline is designed to work with Astrocyte Workflow System using a simple web interface.
+
+To Run:
+-------
+
+* Available parameters:
+  * **--name**
+        * run name, puts outputs in a directory with this name
+        * eg: **--name 'test'**
+  * **--bcl**
+        * Base call files from the sequencing of a 10x single-cell experiment, tarballed [*.tar] and optionally gzipped [*.tar.gz]; supports pigz parallelization.
+        * There can be multiple basecall files, but they all will be demultiplexed by the same design file.
+        * eg: **--bcl '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.tar.gz'**
+  * **--designFile**
+        * path to design file (csv format) location
+        * column 1 = "Lane" (lane number to demultiplex, * for all lanes)
+        * column 2 = "Sample" (sample name)
+        * column 3 = "Index" (10x sample index barcode, eg SI-GA-A1)
+        * can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
+        * eg: **--designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.csv'**
+  * **--outDir**
+        * optional output directory for run
+        * eg: **--outDir 'test'**
+* FULL EXAMPLE:
+
+**nextflow run workflow/main.nf --name 'test' --bcl '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.tar.gz' --designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_mkfastq/simple/cellranger-tiny-bcl-simple-1_2_0.csv' --outDir 'test'**
+
+* Design example:
+
+| Lane | Sample      | Index     |
+|------|-------------|-----------|
+| *    | test_sample | SI-P03-C9 |
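The design-file layout documented in this README (columns Lane, Sample, Index) can be checked before a run is launched. The following is a minimal sketch only; it is not the pipeline's own checkDesignFile script, which this diff does not show. The helper name `validate_design` and the rules it enforces (exact header names, no empty fields) are assumptions made for illustration.

```python
import csv
import io

# Header names taken from the README's description of the design file.
REQUIRED_COLUMNS = ["Lane", "Sample", "Index"]


def validate_design(csv_text):
    """Return a list of (lane, sample, index) tuples or raise ValueError.

    Hypothetical helper: it only checks the header and that no field is empty.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    header = reader.fieldnames or []
    if header != REQUIRED_COLUMNS:
        raise ValueError("expected header %s, got %s" % (REQUIRED_COLUMNS, header))

    rows = []
    # Data rows start on line 2 because line 1 is the header.
    for line_number, row in enumerate(reader, start=2):
        values = [(row[column] or "").strip() for column in REQUIRED_COLUMNS]
        if not all(values):
            raise ValueError("empty field on line %d" % line_number)
        rows.append(tuple(values))
    return rows


if __name__ == "__main__":
    example = "Lane,Sample,Index\n*,test_sample,SI-P03-C9\n"
    print(validate_design(example))
```

Repeated "Sample" values are deliberately allowed here, since the README states that a sample may appear on several rows.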
--- a/astrocyte_pkg.yml
+++ b/astrocyte_pkg.yml
@@ -81,14 +81,14 @@ workflow_parameters:
    type: files
    required: true
    description: |
-      One or more input Tarball BCL files from a sequencing of 10x single-cell expereiment .
-    regex: ".*tar.gz"
+      One or more input tarball (optionally gzipped) basecall (BCL) files from the sequencing of a 10x single-cell experiment (can be .tar or .tar.gz).
+    regex: ".*\\.tar(\\.gz)?"
    min: 1

  - id: designFile
    type: file
    required: true
-    regex: ".*csv"
+    regex: ".*\\.csv"
    description: |
      A design file listing lane, sample, corresponding index.
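The `regex` fields in this file are file-name filters, so it matters whether a pattern is written as a regular expression or as a shell glob. The sketch below compares the two styles using Python's `re` module. That Astrocyte evaluates these fields with Python-compatible regex semantics is an assumption, and the helper names `is_valid_regex` and `accepts` are hypothetical.

```python
import re


def is_valid_regex(pattern):
    """Return True if Python's re module can compile the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


def accepts(pattern, filename):
    """Return True if the whole filename matches the pattern."""
    return re.fullmatch(pattern, filename) is not None


if __name__ == "__main__":
    # A glob such as "*.csv" gives the leading "*" nothing to repeat.
    print(is_valid_regex("*.csv"))
    print(is_valid_regex(r".*\.csv"))
    # In ".*tar*" the final "*" applies to "r", so any name ending in "ta" passes.
    print(accepts(r".*tar*", "metadata"))
    # Escaped dots and an optional ".gz" group accept only .tar and .tar.gz names.
    print(accepts(r".*\.tar(\.gz)?", "run.tar.gz"))
    print(accepts(r".*\.tar(\.gz)?", "metadata"))
```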


--- a/test_data/design.csv
+++ b/test_data/design.csv
--- a/docs/index.md
+++ b/docs/index.md
@@ -4,12 +4,35 @@
 Introduction
 ------------

-This pipeline is a wrapper for the cellranger count tool from 10x Genomics. It takes fastq files from 10x Genomics Single Cell Gene Expression libraries, performs alignment, filtering, barcode counting, and UMI counting. It uses the Chromium cellular barcodes to generate gene-barcode matrices, determine clusters, and perform gene expression analysis.
+This pipeline is a wrapper for the cellranger mkfastq tool from 10x Genomics (which uses Illumina's bcl2fastq). It demultiplexes samples from 10x Genomics Single Cell Gene Expression libraries into fastq files.

-The pipeline uses Nextflow, a bioinformatics workflow tool.
+FastQC is run on the resulting fastq files, and the FastQC and bcl2fastq reports are collated with the MultiQC tool.

+The pipeline uses Nextflow, a bioinformatics workflow tool.

+To Run:
+-------

+* Workflow parameters:
+  * **bcl**
+        * Base call files from the sequencing of a 10x single-cell experiment, tarballed [*.tar] and optionally gzipped [*.tar.gz]; supports pigz parallelization.
+        * There can be multiple basecall files, but they all will be demultiplexed by the same design file.
+        * REQUIRED
+  * **design file**
+        * A design file listing lane, sample, and corresponding sample barcode. There can be multiple rows with the same sample name if there are multiple fastq files for that sample.
+        * REQUIRED
+        * column 1 = "Lane" (lane number to demultiplex, * for all lanes)
+        * column 2 = "Sample" (sample name)
+        * column 3 = "Index" (10x sample index barcode, eg SI-GA-A1)
+        * an example design file can be downloaded [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/docs/design.csv)
+
+
+* Design example:
+
+    | Lane | Sample      | Index     |
+    |------|-------------|-----------|
+    | *    | test_sample | SI-P03-C9 |
+    


 Credits

--- a/test_data/.gitkeep
+++ b/test_data/.gitkeep
--- a/workflow/conf/biohpc.config
+++ b/workflow/conf/biohpc.config
@@ -3,17 +3,29 @@ process {
  queue='super'

  // Process specific configuration
-  $checkDesignFile {
+  withName:checkDesignFile {
    module = ['python/3.6.1-2-anaconda']
    executor = 'local'
  }
-  $untarBCL {
+  withName:untarBCL {
    module = ['pigz/2.4']
    queue = 'super'
  }
-  $mkfastq {
+  withName:mkfastq {
    module = ['cellranger/3.0.2', 'bcl2fastq/2.19.1']
-    queue = 'super'
+    queue = '128GB,256GB,256GBv1,384GB'
+  }
+  withName:fastqc {
+    module = ['fastqc/0.11.5', 'parallel']
+    queue = 'super'
+  }
+  withName:versions {
+    module = ['python/3.6.1-2-anaconda']
+    executor = 'local'
+  }
+  withName:multiqc {
+    module = ['multiqc/1.7']
+    queue = 'super'
  }
 }


--- a/workflow/main.nf
+++ b/workflow/main.nf
@@ -4,6 +4,7 @@
 // Note - $baseDir is the location of this workflow file main.nf

 // Define Input variables
+params.name = "small"
 params.bcl = "$baseDir/../test_data/*.tar.gz"
 params.designFile = "$baseDir/../test_data/design.csv"
 params.outDir = "$baseDir/output"
@@ -12,14 +13,15 @@ params.outDir = "$baseDir/output"
 tarList = Channel.fromPath( params.bcl )

 // Define regular variables
+name = params.name
 designLocation = Channel
  .fromPath(params.designFile)
  .ifEmpty { exit 1, "design file not found: ${params.designFile}" }
 outDir = params.outDir

 process checkDesignFile {
-
-  publishDir "$outDir/${task.process}", mode: 'copy'
+  tag "$name"
+  publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'

  input:

@@ -42,7 +44,6 @@ process checkDesignFile {

 process untarBCL {
  tag "$tar"
-
  publishDir "$outDir/${task.process}", mode: 'copy'

  input:
@@ -58,15 +59,18 @@ process untarBCL {
  """
  hostname
  ulimit -a
-  module load pigz/2.4
-  tar -xvf $tar -I pigz
+  name=`echo ${tar} | rev | cut -f1 -d '.' | rev`;
+  if [ "\${name}" == "gz" ];
+  then   module load pigz/2.4;
+  tar -xvf $tar -I pigz;
+  else tar -xvf ${tar};
+  fi;
  """
 }

-
 process mkfastq {
  tag "${bcl.baseName}"
-
+  queue '128GB,256GB,256GBv1,384GB'
  publishDir "$outDir/${task.process}", mode: 'copy'

  input:
@@ -76,7 +80,10 @@ process mkfastq {

  output:

-  file("**/outs/fastq_path/**/*") into fastqPaths
+  file("**/outs/fastq_path/**/*") into mkfastqPaths
+  file("**/outs/**/*.fastq.gz") into fastqPaths
+  file("**/outs/fastq_path/Stats/Stats.json") into bqcPaths
+  file("version*.txt") into versionPaths_mkfastq

  script:

@@ -85,6 +92,83 @@ process mkfastq {
  ulimit -a
  module load cellranger/3.0.2
  module load bcl2fastq/2.19.1
-  cellranger mkfastq --nopreflight --id="${bcl.baseName}" --run=$bcl --csv=$designPaths -r \$SLURM_CPUS_ON_NODE  -p \$SLURM_CPUS_ON_NODE  -w \$SLURM_CPUS_ON_NODE 
+  bash $baseDir/scripts/versions_mkfastq.sh
+  cellranger mkfastq --id="${bcl.baseName}" --run=$bcl --csv=$designPaths -r \$SLURM_CPUS_ON_NODE  -p \$SLURM_CPUS_ON_NODE  -w \$SLURM_CPUS_ON_NODE 
+  """
+}
+
+
+process fastqc {
+  tag "$name"
+  queue 'super'
+  publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'
+
+  input:
+  file fastqPaths
+
+  output:
+
+  file("*fastqc.*") into fqcPaths
+  file("version*.txt") into versionPaths_fastqc
+
+  script:
+
+  """
+  hostname
+  ulimit -a
+  module load fastqc/0.11.5
+  module load parallel
+  bash $baseDir/scripts/fastqc.sh
+  """
+}
+
+
+process versions {
+  tag "$name"
+  publishDir "$outDir/misc/${task.process}/$name", mode: 'copy'
+
+  input:
+
+  file versionPaths_mkfastq
+  file versionPaths_fastqc
+
+  output:
+
+  file("*.yaml") into yamlPaths
+
+  script:
+
+  """
+  hostname
+  ulimit -a
+  module load python/3.6.1-2-anaconda
+  echo $workflow.nextflow.version > version_nextflow.txt
+  python3 $baseDir/scripts/generate_versions.py -f version_*.txt -o versions
+  """
+}
+
+
+process multiqc {
+  tag "$name"
+  queue 'super'
+  publishDir "$outDir/${task.process}/$name", mode: 'copy'
+
+  input:
+
+  file bqcPaths
+  file fqcPaths
+  file yamlPaths
+
+  output:
+
+  file("*") into mqcPaths
+
+  script:
+
+  """
+  hostname
+  ulimit -a
+  module load multiqc/1.7
+  multiqc . -c $baseDir/scripts/.multiqc_config.yaml 
  """
 }
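The channels added in this file wire the processes into a small dependency graph: mkfastq feeds fastqc, both feed versions, and all three feed multiqc. The sketch below models that graph in Python to make the implied execution order explicit. It is an illustration only, since Nextflow resolves this ordering itself from the channel declarations. The edges into mkfastq are assumed from its use of `$bcl` and `$designPaths`, whose producing processes are only partly visible in this diff, and `execution_order` is a hypothetical helper.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Map each process to the processes whose output channels it consumes.
DEPENDENCIES = {
    "checkDesignFile": set(),
    "untarBCL": set(),
    # Assumed from the mkfastq command line ($designPaths and $bcl).
    "mkfastq": {"checkDesignFile", "untarBCL"},
    # fastqPaths
    "fastqc": {"mkfastq"},
    # versionPaths_mkfastq, versionPaths_fastqc
    "versions": {"mkfastq", "fastqc"},
    # bqcPaths, fqcPaths, yamlPaths
    "multiqc": {"mkfastq", "fastqc", "versions"},
}


def execution_order(dependencies):
    """Return one valid order in which every process follows its inputs."""
    return list(TopologicalSorter(dependencies).static_order())


if __name__ == "__main__":
    print(" -> ".join(execution_order(DEPENDENCIES)))
```

The relative order of checkDesignFile and untarBCL is not fixed, because neither depends on the other.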
--- a/workflow/scripts/.multiqc_config.yaml
+++ b/workflow/scripts/.multiqc_config.yaml
+top_modules:
+    - 'Software Versions'
+
+module_order:
+    - bcl2fastq
+    - fastqc
--- a/workflow/scripts/fastqc.sh
+++ b/workflow/scripts/fastqc.sh
+#!/bin/bash
+
+find . -name '*.fastq.gz' | awk '{printf("fastqc \"%s\"\n", $0)}' | parallel -j 25 --verbose
+#find . -name '*fastqc.*' | xargs -I '{}' mv '{}' ./
+
+fastqc --version |& grep 'FastQC v' | sed -n -e 's/^FastQC v//p' > version_fastqc.txt  
--- a/workflow/scripts/generate_versions.py
+++ b/workflow/scripts/generate_versions.py
+#!/usr/bin/env python3
+# -*- coding: utf-8 -*-
+
+'''Make YAML of software versions.'''
+
+from __future__ import print_function
+from collections import OrderedDict
+import re
+import logging
+import argparse
+import numpy as np
+
+EPILOG = '''
+For more details:
+        %(prog)s --help
+'''
+
+# SETTINGS
+
+logger = logging.getLogger(__name__)
+logger.addHandler(logging.NullHandler())
+logger.propagate = False
+logger.setLevel(logging.INFO)
+
+SOFTWARE_REGEX = {
+    'Nextflow': ['version_nextflow.txt', r"(\S+)"],
+    'cellranger mkfastq': ['version_cellranger.mkfastq.txt', r"(\S+)"],
+    'bcl2fastq': ['version_bcl2fastq.txt', r"(\S+)"],
+    'fastqc': ['version_fastqc.txt', r"(\S+)"],
+}
+
+
+def get_args():
+    '''Define arguments.'''
+
+    parser = argparse.ArgumentParser(
+        description=__doc__, epilog=EPILOG,
+        formatter_class=argparse.RawDescriptionHelpFormatter)
+
+    parser.add_argument('-f', '--files',
+                        help="The version files.",
+                        required=True,
+                        nargs='*')
+
+    parser.add_argument('-o', '--output',
+                        help="The out file name.",
+                        required=True)
+
+    args = parser.parse_args()
+    return args
+
+
+def check_files(files):
+    '''Check if version files are found.'''
+
+    logger.info("Running file check.")
+
+    software_files = np.array(list(SOFTWARE_REGEX.values()))[:, 0]
+
+    extra_files = set(files) - set(software_files)
+
+    if len(extra_files) > 0:
+        logger.error('Missing regex: %s', list(extra_files))
+        raise Exception("Missing regex: %s" % list(extra_files))
+
+
+def main():
+    args = get_args()
+    files = args.files
+    output = args.output
+
+    out_filename = output + '_mqc.yaml'
+
+    results = OrderedDict()
+    results['Nextflow'] = '<span style="color:#999999;\">N/A</span>'
+    results['cellranger mkfastq'] = '<span style="color:#999999;\">N/A</span>'
+    results['bcl2fastq'] = '<span style="color:#999999;\">N/A</span>'
+    results['fastqc'] = '<span style="color:#999999;\">N/A</span>'
+
+    # Check for version files:
+    check_files(files)
+
+    # Search each provided file using its regex; files not provided keep the N/A default
+    for k, v in SOFTWARE_REGEX.items():
+        if v[0] not in files:
+            continue
+        with open(v[0]) as x:
+            versions = x.read()
+            match = re.search(v[1], versions)
+            if match:
+                results[k] = "v{}".format(match.group(1))
+
+    # Dump to YAML through a single file handle that is closed on exit
+    with open(out_filename, "w") as out_file:
+        print(
+            '''
+        id: 'Software Versions'
+        section_name: 'Software Versions'
+        section_href: 'https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_mkfastq/'
+        plot_type: 'html'
+        description: 'are collected at run time from the software output.'
+        data: |
+            <dl class="dl-horizontal">
+        ''', file=out_file)
+
+        for k, v in results.items():
+            print("            <dt>{}</dt><dd>{}</dd>".format(k, v), file=out_file)
+        print("            </dl>", file=out_file)
+
+
+if __name__ == '__main__':
+    main()
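The lookup performed in `main()` can be exercised without touching the filesystem. The following is a minimal sketch under one assumption: file contents are supplied as an in-memory dict instead of being read from disk. `collect_versions` and `NOT_AVAILABLE` are hypothetical names that do not appear in the script, and in this sketch a file absent from the dict simply keeps the placeholder.

```python
import re
from collections import OrderedDict

# Plain-text placeholder; the script itself wraps "N/A" in a grey <span>.
NOT_AVAILABLE = "N/A"

# Same file names and pattern as SOFTWARE_REGEX in generate_versions.py.
SOFTWARE_REGEX = OrderedDict([
    ("Nextflow", ["version_nextflow.txt", r"(\S+)"]),
    ("cellranger mkfastq", ["version_cellranger.mkfastq.txt", r"(\S+)"]),
    ("bcl2fastq", ["version_bcl2fastq.txt", r"(\S+)"]),
    ("fastqc", ["version_fastqc.txt", r"(\S+)"]),
])


def collect_versions(file_contents):
    """Map software name to 'v<version>' or the placeholder.

    file_contents maps a version file name to its text (stand-in for open()).
    """
    results = OrderedDict()
    for software, (file_name, pattern) in SOFTWARE_REGEX.items():
        match = re.search(pattern, file_contents.get(file_name, ""))
        results[software] = "v{}".format(match.group(1)) if match else NOT_AVAILABLE
    return results


if __name__ == "__main__":
    sample = {"version_fastqc.txt": "0.11.5\n", "version_nextflow.txt": "19.09.0\n"}
    for name, version in collect_versions(sample).items():
        print(name, version)
```

The version numbers in `sample` are illustrative values, not output captured from a real run.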
--- a/workflow/scripts/versions_mkfastq.sh
+++ b/workflow/scripts/versions_mkfastq.sh
+#!/bin/bash
+
+cellranger mkfastq --version | grep 'cellranger mkfastq ' | sed 's/.*(\(.*\))/\1/' > version_cellranger.mkfastq.txt
+bcl2fastq --version |& grep 'bcl2fastq v' | sed -n -e 's/^bcl2fastq v//p' > version_bcl2fastq.txt
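The two pipelines in this script (and the similar one in fastqc.sh) reduce a tool's `--version` output to a bare version string with grep and sed. The sketch below mirrors both text transformations in Python so they can be checked against sample strings. The sample outputs used here are assumptions inferred from the grep and sed patterns, not captured tool output, and both function names are hypothetical.

```python
import re


def version_in_parentheses(line):
    """Mirror sed 's/.*(\\(.*\\))/\\1/': keep what is inside the last parentheses."""
    return re.sub(r".*\((.*)\)", r"\1", line)


def strip_prefix_lines(text, prefix):
    """Mirror grep 'PREFIX' piped to sed -n 's/^PREFIX//p'.

    Keeps only lines that start with the prefix and removes the prefix.
    """
    return [line[len(prefix):] for line in text.splitlines() if line.startswith(prefix)]


if __name__ == "__main__":
    # Assumed output shapes, for illustration only.
    print(version_in_parentheses("cellranger mkfastq (3.0.2)"))
    print(strip_prefix_lines("BCL to FASTQ file converter\nbcl2fastq v2.19.1\n", "bcl2fastq v"))
    print(strip_prefix_lines("FastQC v0.11.5\n", "FastQC v"))
```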