Merge branch 'develop' into '10-count.features'

Develop

See merge request !35

Commit f74b0d59 authored 6 years ago by Gervaise Henry
--- a/README.md
+++ b/README.md
+|*master*|*develop*|
+|:-:|:-:|
+|[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/badges/master/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/commits/master)|[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/badges/develop/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/commits/develop)|
+
 10x Genomics scRNA-Seq (cellranger) count Pipeline
-========================================
+==================================================

 Introduction
 ------------
@@ -19,13 +23,14 @@ To Run:
  * **--fastq**
        * path to the fastq location
        * R1 and R2 only necessary but can include I2
+        * only fastqs listed in the designFile (see below) are used; any others are ignored
        * eg: **--fastq '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/v3s2r100k/\*.fastq.gz'**
  * **--designFile**
        * path to design file (csv format) location
        * column 1 = "Sample"
        * column 2 = "fastq_R1"
        * column 3 = "fastq_R2"
-        * can have repeated "Sample" if there are multiole fastq R1/R2 pairs for the samples
+        * can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
        * eg: **--designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/v3s2r100k/design.csv'**
    * **--genome**
        * reference genome
@@ -44,7 +49,7 @@ To Run:
        * eg: **--genome 'GRCh38-3.0.0'**
    * **--genomeLocationFull**
        * path to a custom genome
-        * if --genomeLocationFull is used --genome is not necessary and is overwritten
+        * if --genomeLocationFull is used --genome is not necessary and is ignored
        * eg. **--genomeLocationFull '/project/apps_database/cellranger/refdata-cellranger-GRCh38-3.0.0'**
    * **--expectCells**
        * expected number of cells to be detected
@@ -57,11 +62,11 @@ To Run:
    * **--forceCells**
        * forces filtering of the top number of cells matching this parameter
        * 0-10000
-        * if --forceCells is used then --expectedCells is not necessary and is overwritten
+        * if --forceCells is used then --expectCells is not necessary and is ignored
        * eg: **--forceCells 10000**
    * **--kitVersion**
        * the library chemistry version number for the 10x Genomics Gene Expression kit
-        * setting to auto will attempt to autodetect from the detected cycle strategy in the fastq's
+        * setting to auto will attempt to autodetect from the detected sequencing strategy in the fastq's
        * version numbers are spelled out
        * --kitversion is only used if --version (cellranger version) is > 2
        * --version (cellranger version) 2.1.1 can only read --kitVersion of two (2)
@@ -69,7 +74,7 @@ To Run:
            * *'auto'*
            * *'three'*
            * *'two'*
-        * eg: **--kitVersion 'three'**'
+        * eg: **--kitVersion 'three'**
    * **--version**
        * cellranger version
        * --version (cellranger version) 2.1.1 can only read --kitVersion of two (2)
@@ -91,4 +96,4 @@ To Run:
 |---------|------------------------------------|------------------------------------|
 | sample1 | pbmc_1k_v2_S1_L001_R1_001.fastq.gz | pbmc_1k_v2_S1_L001_R2_001.fastq.gz |
 | sample2 | pbmc_1k_v2_S2_L001_R1_001.fastq.gz | pbmc_1k_v2_S2_L001_R2_001.fastq.gz |
-| sample2 | pbmc_1k_v2_S2_L002_R1_001.fastq.gz | pbmc_1k_v2_S2_L002_R2_001.fastq.gz |
\ No newline at end of file
+| sample2 | pbmc_1k_v2_S2_L002_R1_001.fastq.gz | pbmc_1k_v2_S2_L002_R2_001.fastq.gz |
--- a/astrocyte_pkg.yml
+++ b/astrocyte_pkg.yml
@@ -150,16 +150,6 @@ workflow_parameters:
    description: |
      10x cellranger version.

-  - id: feature
-    type: select
-    default: 'no'
-    choices:
-      - [ 'no', 'No']
-      - [ 'yes', 'Yes']
-    required: true
-    description: |
-      Additional features to count (only used in cellranger version 3+, ignored otherwise).
-
  - id: astrocyte
    type: select
    choices:

--- a/docs/index.md
+++ b/docs/index.md
 10x Genomics scRNA-Seq (cellranger) count Pipeline
-========================================
+==================================================

 Introduction
 ------------
@@ -24,7 +24,7 @@ To Run:
        * column 3 = "fastq_R2"
        * can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
        * eg: can be downloaded [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/blob/8db3e25c13cb1463c2a50e510159c72380ae5826/docs/design.csv)
-    * **genome**
+  * **genome**
        * Reference species and genome used for alignment and subsequent analysis.
        * name of available 10x Genomics premade reference genomes:
            * *'GRCh38-3.0.0'* = Human GRCh38 release 93
@@ -36,30 +36,39 @@ To Run:
            * *'hg19_and_mm10-3.0.0'* = Human GRCh37 (hg19) + Mouse GRCm38 (mm10) release 93
            * *'hg19_and_mm10-1.2.0'* = Human GRCh37 (hg19) + Mouse GRCm38 (mm10) release 84
            * *'ercc92-1.2.0'* = ERCC.92 Spike-In
-    * **expect cells**
+  * **expect cells**
        * Expected number of recovered cells.
        * guides cellranger in its cutoff for background/low quality cells
        * as a guide it doesn't have to be exact
        * 0-10000
        * if --expectCells is used then --forceCells is not necessary
        * only used if force cells is not entered or set to 0
-    * **force cells**
+   * **force cells**
        * Force pipeline to use this number of cells, bypassing the cell detection algorithm. Use this if the number of cells estimated by Cell Ranger is not consistent with the barcode rank plot. A value of 0 ignores this option. Any value other than 0 overrides expect-cells.
        * 0-10000
        * if force cells is used then expected cells is not necessary and is ignored
-    * **chemistry version**
+  * **chemistry version**
        * 10x single cell gene expression chemistry version (only used in cellranger version 3.x).
        * setting to auto will attempt to autodetect from the detected cycle strategy in the fastq's
        * chemistry version is only used if cellranger version is > 2.x
        * cellranger version 2.1.1 can only read chemistry version less than or equal to two (2)
-    * **cellranger version**
+   * **cellranger version**
        * 10x cellranger version.
        * cellranger version 2.1.1 can only read chemistry version less than or equal to two (2)

 * Design example:

-| Sample  | fastq_R1                           | fastq_R2                           |
-|---------|------------------------------------|------------------------------------|
-| sample1 | pbmc_1k_v2_S1_L001_R1_001.fastq.gz | pbmc_1k_v2_S1_L001_R2_001.fastq.gz |
-| sample2 | pbmc_1k_v2_S2_L001_R1_001.fastq.gz | pbmc_1k_v2_S2_L001_R2_001.fastq.gz |
-| sample2 | pbmc_1k_v2_S2_L002_R1_001.fastq.gz | pbmc_1k_v2_S2_L002_R2_001.fastq.gz |
+    | Sample  | fastq_R1                           | fastq_R2                           |
+    |---------|------------------------------------|------------------------------------|
+    | sample1 | pbmc_1k_v2_S1_L001_R1_001.fastq.gz | pbmc_1k_v2_S1_L001_R2_001.fastq.gz |
+    | sample2 | pbmc_1k_v2_S2_L001_R1_001.fastq.gz | pbmc_1k_v2_S2_L001_R2_001.fastq.gz |
+    | sample2 | pbmc_1k_v2_S2_L002_R1_001.fastq.gz | pbmc_1k_v2_S2_L002_R2_001.fastq.gz |
+    
+
+
+Credits
+-------
+This workflow was developed jointly with the [Bioinformatic Core Facility (BICF), Department of Bioinformatics](http://www.utsouthwestern.edu/labs/bioinformatics/).
+
+
+Please cite in publications: Pipeline was developed by BICF from funding provided by **Cancer Prevention and Research Institute of Texas (RP150596)**.
--- a/workflow/conf/biohpc.config
+++ b/workflow/conf/biohpc.config
@@ -2,20 +2,19 @@ process {
  executor = 'slurm'
  queue='super'

-  // Process specific configuration
-  $checkDesignFile {
+  withLabel: checkDesignFile {
    module = ['python/3.6.1-2-anaconda']
    executor = 'local'
  }
-  $count211 {
+  withLabel: count211 {
    module = ['cellranger/2.1.1']
    queue = '128GB,256GB,256GBv1,384GB'
  }
-  $count301 {
+  withLabel: count301 {
    module = ['cellranger/3.0.1']
    queue = '128GB,256GB,256GBv1,384GB'
  }
-  $count302 {
+  withLabel: count302 {
    module = ['cellranger/3.0.2']
    queue = '128GB,256GB,256GBv1,384GB'
  }

--- a/workflow/main.nf
+++ b/workflow/main.nf
@@ -105,7 +105,7 @@ chemistryParam302 = chemistryParam

 process count211 {
  queue '128GB,256GB,256GBv1,384GB'
-  tag "count211-$sample"
+  tag "$sample"

  publishDir "$outDir/${task.process}", mode: 'copy'

@@ -143,7 +143,7 @@ process count211 {

 process count301 {
  queue '128GB,256GB,256GBv1,384GB'
-  tag "count301-$sample"
+  tag "$sample"

  publishDir "$outDir/${task.process}", mode: 'copy'

@@ -182,13 +182,13 @@ process count301 {

 process count302 {
  queue '128GB,256GB,256GBv1,384GB'
-  tag "count302-$sample"
+  tag "$sample"

  publishDir "$outDir/${task.process}", mode: 'copy'

  input:

-  set sample, file("${sample}_S1_L00?_R1_001.fastq.gz"), file("${sample}_S1_L00?_R2_001.fastq.gz") from samples302
+  set sample, file("${sample}_S?_L001_R1_001.fastq.gz"), file("${sample}_S?_L001_R2_001.fastq.gz") from samples302
  file ref from refLocation302.first()
  expectCells302
  forceCells302
@@ -217,4 +217,4 @@ process count302 {
    cellranger count --id="$sample" --transcriptome="./$ref" --fastqs=. --sample="$sample" --force-cells=$forceCells302 --chemistry="$chemistryParam302"
    """
  }
-}
\ No newline at end of file
+}
--- a/workflow/main.test.nf
+++ b/workflow/main.test.nf
-#!/usr/bin/env nextflow
-
-// Path to an input file, or a pattern for multiple inputs
-// Note - $baseDir is the location of this workflow file main.nf
-
-// Define Input variables
-params.fastq = "$baseDir/../test_data/*.fastq.gz"
-params.designFile = "$baseDir/../test_data/design.csv"
-params.genome = 'GRCh38-3.0.0'
-params.genomes = []
-params.genomeLocation = params.genome ? params.genomes[ params.genome ].loc ?: false : false
-params.expectCells = 10000
-params.forceCells = 0
-params.kitVersion = '3'
-params.chemistry = []
-params.chemistryParam = params.kitVersion ? params.chemistry[ params.kitVersion ].param ?: false : false
-params.version = '3.0.2'
-params.feature = 'yes'
-params.outDir = "$baseDir/output"
-
-// Define regular variables
-designLocation = Channel
-  .fromPath(params.designFile)
-  .ifEmpty { exit 1, "design file not found: ${params.designFile}" }
-fastqList = Channel
-  .fromPath(params.fastq)
-  .flatten()
-  .map { file -> [ file.getFileName().toString(), file.toString() ].join("\t") }
-  .collectFile(name: 'fileList.tsv', newLine: true)
-refLocation = Channel
-  .fromPath(params.genomeLocation+params.genome)
-  .ifEmpty { exit 1, "referene not found: ${params.genome}" }
-expectCells = params.expectCells
-forceCells = params.forceCells
-chemistryParam = params.chemistryParam
-version = params.version
-feature = params.feature
-featurechk = feature
-outDir = params.outDir
-
-process checkDesignFile {
-
-  publishDir "$outDir/${task.process}", mode: 'copy'
-
-  input:
-
-  file designLocation
-  file fastqList
-  featurechk
-
-  output:
-
-  file("*.checked.csv") into designPaths
-
-  script:
-
-  """
-  python3 $baseDir/scripts/check_design.test.py -d $designLocation -f $fastqList -t "$featurechk"
-  """
-}
-
-// Parse design file
-samples = designPaths
-  .splitCsv (sep: ',', header: true)
-  .map { row -> [ row.Sample, file(row.fastq_R1), file(row.fastq_R2) ] }
-  .groupTuple()
-  //.subscribe { println it }
-
-// Duplicate variables
-samples.into {
-  samples211
-  samples301
-  samples302
-}
-refLocation.into {
-  refLocation211
-  refLocation301
-  refLocation302
-}
-expectCells211 = expectCells
-expectCells301 = expectCells
-expectCells302 = expectCells
-forceCells211 = forceCells
-forceCells301 = forceCells
-forceCells302 = forceCells
-chemistryParam301 = chemistryParam
-chemistryParam302 = chemistryParam
-feature301 = feature
-feature302 = feature
--- a/workflow/scripts/check_design.test.py
+++ b/workflow/scripts/check_design.test.py
-#!/usr/bin/env python3
-
-'''Check if design file is correctly formatted and matches files list.'''
-
-import argparse
-import logging
-import pandas as pd
-
-EPILOG = '''
-For more details:
-        %(prog)s --help
-'''
-
-# SETTINGS
-
-logger = logging.getLogger(__name__)
-logger.addHandler(logging.NullHandler())
-logger.propagate = False
-logger.setLevel(logging.INFO)
-
-
-def get_args():
-    '''Define arguments.'''
-
-    parser = argparse.ArgumentParser(
-        description=__doc__, epilog=EPILOG,
-        formatter_class=argparse.RawDescriptionHelpFormatter)
-
-    parser.add_argument('-d', '--design',
-                        help="The design file to run QC (tsv format).",
-                        required=True )
-
-    parser.add_argument('-f', '--fastq',
-                        help="File with list of fastq files (tsv format).",
-                        required=True )
-
-    parser.add_argument('-t', '--feature',
-                        help="Additional features to count?",
-                        required=True )
-
-    args = parser.parse_args()
-    return args
-
-
-def check_design_headers_n(design):
-    '''Check if design file conforms to sequencing type.'''
-
-    # Default headers
-    design_template = [
-        'Sample',
-	    'fastq_R1',
-	    'fastq_R2']
-
-    design_headers = list(design.columns.values)
-
-    # Check if headers
-    logger.info("Running header check.")
-
-    missing_headers = set(design_template) - set(design_headers)
-
-    if len(missing_headers) > 0:
-        logger.error('Missing column headers: %s', list(missing_headers))
-        raise Exception("Missing column headers: %s" % list(missing_headers))
-    
-    return design
-
-def check_design_headers_y(design):
-    '''Check if design file conforms to sequencing type.'''
-
-    # Default headers
-    design_template = [
-        'Sample',
-	    'fastq_R1',
-	    'fastq_R2',
-	    'library_type']
-
-    design_headers = list(design.columns.values)
-
-    # Check if headers
-    logger.info("Running header check.")
-
-    missing_headers = set(design_template) - set(design_headers)
-
-    if len(missing_headers) > 0:
-        logger.error('Missing column headers: %s', list(missing_headers))
-        raise Exception("Missing column headers: %s" % list(missing_headers))
-    
-    return design
-
-def check_files(design, fastq):
-    '''Check if design file has the files found.'''
-
-    logger.info("Running file check.")
-
-    files = list(design['fastq_R1']) + list(design['fastq_R2'])
-
-    files_found = fastq['name']
-
-    missing_files = set(files) - set(files_found)
-
-    if len(missing_files) > 0:
-        logger.error('Missing files from design file: %s', list(missing_files))
-        raise Exception("Missing files from design file: %s" %
-            list(missing_files))
-    else:
-        file_dict = fastq.set_index('name').T.to_dict()
-    
-    design['fastq_R1'] = design['fastq_R1'].apply(lambda x: file_dict[x]['path'])
-    design['fastq_R2'] = design['fastq_R2'].apply(lambda x: file_dict[x]['path'])
-
-    return design
-
-
-def main():
-    args = get_args()
-    design = args.design
-
-    # Create a file handler
-    handler = logging.FileHandler('design.log')
-    logger.addHandler(handler)
-
-    # Read files as dataframes
-    design_df = pd.read_csv(args.design, sep=',')
-    fastq_df = pd.read_csv(args.fastq, sep='\t', names=['name', 'path'])
-
-    # Check design file
-    if args.feature == 'no':
-    	new_design_df = check_design_headers_n(design_df)
-    else:
-    	new_design_df = check_design_headers_y(design_df)
-	#new_design_df[['sample']].to_csv('library.checked.csv', header=True, sep=',', index=False)
-
-    check_files(design_df, fastq_df)
-    new_design_df.drop('library_type', 1).to_csv('design.checked.csv', header=True, sep=',', index=False)
-
-
-
-if __name__ == '__main__':
-    main()
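
Review note: the two header checks removed above (`check_design_headers_n` and `check_design_headers_y`) differ only in whether `library_type` is required. If a feature-aware check is reintroduced later, a single parameterized function may be easier to maintain. The sketch below is a hypothetical illustration, not code from this repository: the name `missing_headers` and the `with_features` flag are assumptions, and it uses the standard-library `csv` module instead of pandas so it stays self-contained.

```python
import csv
import io

# Columns the design file needs, as documented in the README above.
REQUIRED = ["Sample", "fastq_R1", "fastq_R2"]


def missing_headers(header, with_features=False):
    """Return the sorted required column names that are absent from header.

    with_features is a hypothetical flag standing in for the removed
    'feature' parameter; when True, 'library_type' is also required.
    """
    required = REQUIRED + (["library_type"] if with_features else [])
    return sorted(set(required) - set(header))


# Illustrative in-memory design file (made-up file names).
design_text = "Sample,fastq_R1,fastq_R2\nsample1,a_R1.fastq.gz,a_R2.fastq.gz\n"
header = next(csv.reader(io.StringIO(design_text)))

print(missing_headers(header))
print(missing_headers(header, with_features=True))
```

A caller could raise an exception when the returned list is non-empty, as the removed functions did.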