Commit f74b0d59 authored by Gervaise Henry

Merge branch 'develop' into '10-count.features'

Develop

See merge request !35
parents fa36fc6f d02c3e8b
Pipeline #3387 passed with stages
in 20 minutes and 58 seconds
|*master*|*develop*|
|:-:|:-:|
|[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/badges/master/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/commits/master)|[![Build Status](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/badges/develop/build.svg)](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/commits/develop)|
10x Genomics scRNA-Seq (cellranger) count Pipeline
========================================
==================================================
Introduction
------------
......@@ -19,13 +23,14 @@ To Run:
* **--fastq**
* path to the fastq location
* R1 and R2 only necessary but can include I2
* only fastq's in designFile (see below) are used, not present will be ignored
* eg: **--fastq '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/v3s2r100k/\*.fastq.gz'**
* **--designFile**
* path to design file (csv format) location
* column 1 = "Sample"
* column 2 = "fastq_R1"
* column 3 = "fastq_R2"
* can have repeated "Sample" if there are multiole fastq R1/R2 pairs for the samples
* can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
* eg: **--designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/v3s2r100k/design.csv'**
* **--genome**
* reference genome
......@@ -44,7 +49,7 @@ To Run:
* eg: **--genome 'GRCh38-3.0.0'**
* **--genomeLocationFull**
* path to a custom genome
* if --genomeLocationFull is used --genome is not necessary and is overwritten
* if --genomeLocationFull is used --genome is not necessary and is ignored
* eg. **--genomeLocationFull '/project/apps_database/cellranger/refdata-cellranger-GRCh38-3.0.0'**
* **--expectCells**
* expected number of cells to be detected
......@@ -57,11 +62,11 @@ To Run:
* **--forceCells**
* forces filtering of the top number of cells matching this parameter
* 0-10000
* if --forceCells is used then --expectedCells is not necessary and is overwritten
* if --forceCells is used then --expectCells is not necessary and is ignored
* eg: **--forceCells 10000**
* **--kitVersion**
* the library chemistry version number for the 10x Genomics Gene Expression kit
* setting to auto will attempt to autodetect from the detected cycle strategy in the fastq's
* setting to auto will attempt to autodetect from the detected sequencing strategy in the fastq's
* version numbers are spelled out
* --kitVersion is only used if --version (cellranger version) is > 2
* --version (cellranger version) 2.1.1 can only read --kitVersion of two (2)
......@@ -69,7 +74,7 @@ To Run:
* *'auto'*
* *'three'*
* *'two'*
* eg: **--kitVersion 'three'**'
* eg: **--kitVersion 'three'**
* **--version**
* cellranger version
* --version (cellranger version) 2.1.1 can only read --kitVersion of two (2)
......@@ -91,4 +96,4 @@ To Run:
|---------|------------------------------------|------------------------------------|
| sample1 | pbmc_1k_v2_S1_L001_R1_001.fastq.gz | pbmc_1k_v2_S1_L001_R2_001.fastq.gz |
| sample2 | pbmc_1k_v2_S2_L001_R1_001.fastq.gz | pbmc_1k_v2_S2_L001_R2_001.fastq.gz |
| sample2 | pbmc_1k_v2_S2_L002_R1_001.fastq.gz | pbmc_1k_v2_S2_L002_R2_001.fastq.gz |
\ No newline at end of file
| sample2 | pbmc_1k_v2_S2_L002_R1_001.fastq.gz | pbmc_1k_v2_S2_L002_R2_001.fastq.gz |
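To illustrate how repeated `Sample` rows in the design file are meant to be read, here is a minimal Python sketch that parses a design file shaped like the table above and collects the R1/R2 files for each sample. It is not part of the pipeline: the helper name `group_design` is hypothetical, and only the three column names are taken from the README.

```python
import csv
import io
from collections import OrderedDict


def group_design(handle):
    """Group fastq_R1/fastq_R2 entries by Sample (hypothetical helper)."""
    grouped = OrderedDict()
    for row in csv.DictReader(handle):
        # One ([R1 files], [R2 files]) pair of lists per sample name.
        r1, r2 = grouped.setdefault(row["Sample"], ([], []))
        r1.append(row["fastq_R1"])
        r2.append(row["fastq_R2"])
    return grouped


# Same rows as the design example, written as csv text.
DESIGN = """Sample,fastq_R1,fastq_R2
sample1,pbmc_1k_v2_S1_L001_R1_001.fastq.gz,pbmc_1k_v2_S1_L001_R2_001.fastq.gz
sample2,pbmc_1k_v2_S2_L001_R1_001.fastq.gz,pbmc_1k_v2_S2_L001_R2_001.fastq.gz
sample2,pbmc_1k_v2_S2_L002_R1_001.fastq.gz,pbmc_1k_v2_S2_L002_R2_001.fastq.gz
"""

groups = group_design(io.StringIO(DESIGN))
for sample, (r1_files, r2_files) in groups.items():
    print(sample, len(r1_files), "pair(s)")
```

Here `sample2` ends up with two R1/R2 pairs, which is the case the "repeated Sample" note describes.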
......@@ -150,16 +150,6 @@ workflow_parameters:
description: |
10x cellranger version.
- id: feature
type: select
default: 'no'
choices:
- [ 'no', 'No']
- [ 'yes', 'Yes']
required: true
description: |
Additional features to count (only used in cellranger version 3+, ignored otherwise).
- id: astrocyte
type: select
choices:
......
10x Genomics scRNA-Seq (cellranger) count Pipeline
========================================
==================================================
Introduction
------------
......@@ -24,7 +24,7 @@ To Run:
* column 3 = "fastq_R2"
* can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
* eg: can be downloaded [HERE](https://git.biohpc.swmed.edu/BICF/Astrocyte/cellranger_count/blob/8db3e25c13cb1463c2a50e510159c72380ae5826/docs/design.csv)
* **genome**
* **genome**
* Reference species and genome used for alignment and subsequent analysis.
* name of available 10x Genomics premade reference genomes:
* *'GRCh38-3.0.0'* = Human GRCh38 release 93
......@@ -36,30 +36,39 @@ To Run:
* *'hg19_and_mm10-3.0.0'* = Human GRCh37 (hg19) + Mouse GRCm38 (mm10) release 93
* *'hg19_and_mm10-1.2.0'* = Human GRCh37 (hg19) + Mouse GRCm38 (mm10) release 84
* *'ercc92-1.2.0'* = ERCC.92 Spike-In
* **expect cells**
* **expect cells**
* Expected number of recovered cells.
* guides cellranger in its cutoff for background/low quality cells
* as a guide it doesn't have to be exact
* 0-10000
* if --expextedCells is used then --forceCells is not necessary
* only used if force cells is not entered or set to 0
* **force cells**
* **force cells**
* Force pipeline to use this number of cells, bypassing the cell detection algorithm. Use this if the number of cells estimated by Cell Ranger is not consistent with the barcode rank plot. A value of 0 ignores this option. Any value other than 0 overrides expect-cells.
* 0-10000
* if force cells is used then expected cells is not necessary and is ignored
* **chemistry version**
* **chemistry version**
* 10x single cell gene expression chemistry version (only used in cellranger version 3.x).
* setting to auto will attempt to autodetect from the detected cycle strategy in the fastq's
* chemistry version is only used if cellranger version is > 2.x
* cellranger version 2.1.1 can only read chemistry version less than or equal to two (2)
* **cellranger version**
* **cellranger version**
* 10x cellranger version.
* cellranger version 2.1.1 can only read chemistry version less than or equal to two (2)
* Design example:
| Sample | fastq_R1 | fastq_R2 |
|---------|------------------------------------|------------------------------------|
| sample1 | pbmc_1k_v2_S1_L001_R1_001.fastq.gz | pbmc_1k_v2_S1_L001_R2_001.fastq.gz |
| sample2 | pbmc_1k_v2_S2_L001_R1_001.fastq.gz | pbmc_1k_v2_S2_L001_R2_001.fastq.gz |
| sample2 | pbmc_1k_v2_S2_L002_R1_001.fastq.gz | pbmc_1k_v2_S2_L002_R2_001.fastq.gz |
| Sample | fastq_R1 | fastq_R2 |
|---------|------------------------------------|------------------------------------|
| sample1 | pbmc_1k_v2_S1_L001_R1_001.fastq.gz | pbmc_1k_v2_S1_L001_R2_001.fastq.gz |
| sample2 | pbmc_1k_v2_S2_L001_R1_001.fastq.gz | pbmc_1k_v2_S2_L001_R2_001.fastq.gz |
| sample2 | pbmc_1k_v2_S2_L002_R1_001.fastq.gz | pbmc_1k_v2_S2_L002_R2_001.fastq.gz |
Credits
-------
This workflow was developed jointly with the [Bioinformatic Core Facility (BICF), Department of Bioinformatics](http://www.utsouthwestern.edu/labs/bioinformatics/)
Please cite in publications: Pipeline was developed by BICF from funding provided by **Cancer Prevention and Research Institute of Texas (RP150596)**.
......@@ -2,20 +2,19 @@ process {
executor = 'slurm'
queue='super'
// Process specific configuration
$checkDesignFile {
withLabel: checkDesignFile {
module = ['python/3.6.1-2-anaconda']
executor = 'local'
}
$count211 {
withLabel: count211 {
module = ['cellranger/2.1.1']
queue = '128GB,256GB,256GBv1,384GB'
}
$count301 {
withLabel: count301 {
module = ['cellranger/3.0.1']
queue = '128GB,256GB,256GBv1,384GB'
}
$count302 {
withLabel: count302 {
module = ['cellranger/3.0.2']
queue = '128GB,256GB,256GBv1,384GB'
}
......
......@@ -105,7 +105,7 @@ chemistryParam302 = chemistryParam
process count211 {
queue '128GB,256GB,256GBv1,384GB'
tag "count211-$sample"
tag "$sample"
publishDir "$outDir/${task.process}", mode: 'copy'
......@@ -143,7 +143,7 @@ process count211 {
process count301 {
queue '128GB,256GB,256GBv1,384GB'
tag "count301-$sample"
tag "$sample"
publishDir "$outDir/${task.process}", mode: 'copy'
......@@ -182,13 +182,13 @@ process count301 {
process count302 {
queue '128GB,256GB,256GBv1,384GB'
tag "count302-$sample"
tag "$sample"
publishDir "$outDir/${task.process}", mode: 'copy'
input:
set sample, file("${sample}_S1_L00?_R1_001.fastq.gz"), file("${sample}_S1_L00?_R2_001.fastq.gz") from samples302
set sample, file("${sample}_S?_L001_R1_001.fastq.gz"), file("${sample}_S?_L001_R2_001.fastq.gz") from samples302
file ref from refLocation302.first()
expectCells302
forceCells302
......@@ -217,4 +217,4 @@ process count302 {
cellranger count --id="$sample" --transcriptome="./$ref" --fastqs=. --sample="$sample" --force-cells=$forceCells302 --chemistry="$chemistryParam302"
"""
}
}
\ No newline at end of file
}
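The `cellranger count` call in `count302` above passes `--force-cells`, and the README states that a non-zero force value takes precedence over the expected cell count while chemistry is only used for cellranger 3.x. A minimal Python sketch of that selection logic follows. The helper `build_count_args` and its defaults are hypothetical and are not how the Nextflow script itself is written; the flags are the `cellranger count` options already named in this commit plus `--expect-cells`.

```python
def build_count_args(sample, ref, version, expect_cells=10000,
                     force_cells=0, chemistry="auto"):
    """Assemble a cellranger count argument list (hypothetical helper)."""
    args = [
        "cellranger", "count",
        "--id=%s" % sample,
        "--transcriptome=%s" % ref,
        "--fastqs=.",
        "--sample=%s" % sample,
    ]
    # A non-zero force_cells takes precedence over expect_cells.
    if force_cells:
        args.append("--force-cells=%d" % force_cells)
    else:
        args.append("--expect-cells=%d" % expect_cells)
    # Assumption taken from the README: chemistry is only passed for 3.x.
    if int(version.split(".")[0]) >= 3:
        args.append("--chemistry=%s" % chemistry)
    return args


forced = build_count_args("sample1", "./ref", "3.0.2",
                          force_cells=500, chemistry="SC3Pv3")
legacy = build_count_args("sample1", "./ref", "2.1.1")
print(" ".join(forced))
print(" ".join(legacy))
```

Keeping the choice in one place makes it harder for a run to pass both cell-count options at once.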
#!/usr/bin/env nextflow
// Path to an input file, or a pattern for multiple inputs
// Note - $baseDir is the location of this workflow file main.nf
// Define Input variables
params.fastq = "$baseDir/../test_data/*.fastq.gz"
params.designFile = "$baseDir/../test_data/design.csv"
params.genome = 'GRCh38-3.0.0'
params.genomes = []
params.genomeLocation = params.genome ? params.genomes[ params.genome ].loc ?: false : false
params.expectCells = 10000
params.forceCells = 0
params.kitVersion = '3'
params.chemistry = []
params.chemistryParam = params.kitVersion ? params.chemistry[ params.kitVersion ].param ?: false : false
params.version = '3.0.2'
params.feature = 'yes'
params.outDir = "$baseDir/output"
// Define regular variables
designLocation = Channel
.fromPath(params.designFile)
.ifEmpty { exit 1, "design file not found: ${params.designFile}" }
fastqList = Channel
.fromPath(params.fastq)
.flatten()
.map { file -> [ file.getFileName().toString(), file.toString() ].join("\t") }
.collectFile(name: 'fileList.tsv', newLine: true)
refLocation = Channel
.fromPath(params.genomeLocation+params.genome)
.ifEmpty { exit 1, "reference not found: ${params.genome}" }
expectCells = params.expectCells
forceCells = params.forceCells
chemistryParam = params.chemistryParam
version = params.version
feature = params.feature
featurechk = feature
outDir = params.outDir
process checkDesignFile {
publishDir "$outDir/${task.process}", mode: 'copy'
input:
file designLocation
file fastqList
featurechk
output:
file("*.checked.csv") into designPaths
script:
"""
python3 $baseDir/scripts/check_design.test.py -d $designLocation -f $fastqList -t "$featurechk"
"""
}
// Parse design file
samples = designPaths
.splitCsv (sep: ',', header: true)
.map { row -> [ row.Sample, file(row.fastq_R1), file(row.fastq_R2) ] }
.groupTuple()
//.subscribe { println it }
// Duplicate variables
samples.into {
samples211
samples301
samples302
}
refLocation.into {
refLocation211
refLocation301
refLocation302
}
expectCells211 = expectCells
expectCells301 = expectCells
expectCells302 = expectCells
forceCells211 = forceCells
forceCells301 = forceCells
forceCells302 = forceCells
chemistryParam301 = chemistryParam
chemistryParam302 = chemistryParam
feature301 = feature
feature302 = feature
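The `fastqList` channel in the script above maps every fastq to a "file name, tab, full path" line and collects the lines into `fileList.tsv`, which `checkDesignFile` then reads. As a rough illustration of that file's shape, here is a small Python sketch; the helper `file_list_lines` and the example paths are made up for the illustration.

```python
import os


def file_list_lines(paths):
    """Return 'name<TAB>path' lines, one per fastq (hypothetical helper)."""
    return ["\t".join([os.path.basename(p), p]) for p in paths]


# Hypothetical input paths, standing in for whatever --fastq matches.
paths = [
    "/data/run1/pbmc_1k_v2_S1_L001_R1_001.fastq.gz",
    "/data/run1/pbmc_1k_v2_S1_L001_R2_001.fastq.gz",
]
lines = file_list_lines(paths)
print("\n".join(lines))
```

The first column is what the design file's `fastq_R1` and `fastq_R2` entries are matched against; the second is the path substituted in once the match succeeds.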
#!/usr/bin/env python3

'''Check if design file is correctly formatted and matches files list.'''

import argparse
import logging
import pandas as pd

EPILOG = '''
For more details:
        %(prog)s --help
'''

# SETTINGS
logger = logging.getLogger(__name__)
logger.addHandler(logging.NullHandler())
logger.propagate = False
logger.setLevel(logging.INFO)


def get_args():
    '''Define arguments.'''
    parser = argparse.ArgumentParser(
        description=__doc__, epilog=EPILOG,
        formatter_class=argparse.RawDescriptionHelpFormatter)
    parser.add_argument('-d', '--design',
                        help="The design file to run QC (csv format).",
                        required=True)
    parser.add_argument('-f', '--fastq',
                        help="File with list of fastq files (tsv format).",
                        required=True)
    parser.add_argument('-t', '--feature',
                        help="Additional features to count?",
                        required=True)
    args = parser.parse_args()
    return args


def check_design_headers_n(design):
    '''Check design file headers when no additional features are counted.'''
    # Default headers
    design_template = [
        'Sample',
        'fastq_R1',
        'fastq_R2']
    design_headers = list(design.columns.values)
    # Check if headers
    logger.info("Running header check.")
    missing_headers = set(design_template) - set(design_headers)
    if len(missing_headers) > 0:
        logger.error('Missing column headers: %s', list(missing_headers))
        raise Exception("Missing column headers: %s" % list(missing_headers))
    return design


def check_design_headers_y(design):
    '''Check design file headers when additional features are counted.'''
    # Default headers, plus the library type column
    design_template = [
        'Sample',
        'fastq_R1',
        'fastq_R2',
        'library_type']
    design_headers = list(design.columns.values)
    # Check if headers
    logger.info("Running header check.")
    missing_headers = set(design_template) - set(design_headers)
    if len(missing_headers) > 0:
        logger.error('Missing column headers: %s', list(missing_headers))
        raise Exception("Missing column headers: %s" % list(missing_headers))
    return design


def check_files(design, fastq):
    '''Check if design file has the files found.'''
    logger.info("Running file check.")
    files = list(design['fastq_R1']) + list(design['fastq_R2'])
    files_found = fastq['name']
    missing_files = set(files) - set(files_found)
    if len(missing_files) > 0:
        logger.error('Missing files from design file: %s', list(missing_files))
        raise Exception("Missing files from design file: %s" %
                        list(missing_files))
    else:
        # Replace file names with full paths (modifies design in place)
        file_dict = fastq.set_index('name').T.to_dict()
        design['fastq_R1'] = design['fastq_R1'].apply(lambda x: file_dict[x]['path'])
        design['fastq_R2'] = design['fastq_R2'].apply(lambda x: file_dict[x]['path'])
    return design


def main():
    args = get_args()
    # Create a file handler
    handler = logging.FileHandler('design.log')
    logger.addHandler(handler)
    # Read files as dataframes
    design_df = pd.read_csv(args.design, sep=',')
    fastq_df = pd.read_csv(args.fastq, sep='\t', names=['name', 'path'])
    # Check design file
    if args.feature == 'no':
        new_design_df = check_design_headers_n(design_df)
    else:
        new_design_df = check_design_headers_y(design_df)
        #new_design_df[['sample']].to_csv('library.checked.csv', header=True, sep=',', index=False)
    check_files(design_df, fastq_df)
    # library_type may be absent when feature is 'no', so ignore a missing column
    new_design_df.drop('library_type', axis=1, errors='ignore').to_csv('design.checked.csv', header=True, sep=',', index=False)


if __name__ == '__main__':
    main()
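For readers who want to see the file check without pandas, the following standard-library sketch mirrors what `check_files` does: it compares the fastq names in the design against the names that were found, fails if any are missing, and otherwise swaps each name for its path. The function `resolve_files` and the sample data are hypothetical and are not part of the script.

```python
def resolve_files(design_rows, name_to_path):
    """Replace fastq names with paths, or raise if any are missing."""
    wanted = {row[col] for row in design_rows
              for col in ("fastq_R1", "fastq_R2")}
    missing = wanted - set(name_to_path)
    if missing:
        raise ValueError("Missing files from design file: %s" % sorted(missing))
    # Return copies so the caller's rows are left untouched.
    return [
        dict(row,
             fastq_R1=name_to_path[row["fastq_R1"]],
             fastq_R2=name_to_path[row["fastq_R2"]])
        for row in design_rows
    ]


# Hypothetical design rows and name-to-path lookup.
rows = [{"Sample": "s1",
         "fastq_R1": "a_R1.fastq.gz",
         "fastq_R2": "a_R2.fastq.gz"}]
found = {"a_R1.fastq.gz": "/tmp/a_R1.fastq.gz",
         "a_R2.fastq.gz": "/tmp/a_R2.fastq.gz"}
resolved = resolve_files(rows, found)
print(resolved[0]["fastq_R1"])
```

Unlike the pandas version, this sketch returns new rows rather than modifying the design in place.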