10x Genomics scRNA-Seq (cellranger) count Pipeline
Introduction
This pipeline is a wrapper for the cellranger count tool from 10x Genomics. It takes fastq files from 10x Genomics Single Cell Gene Expression libraries, performs alignment, filtering, barcode counting, and UMI counting. It uses the Chromium cellular barcodes to generate gene-barcode matrices, determine clusters, and perform gene expression analysis.
The pipeline uses Nextflow, a bioinformatics workflow tool.
This pipeline is primarily used with a SLURM cluster on the BioHPC Cluster. However, the pipeline should be able to run on any system that Nextflow supports.
Additionally, the pipeline is designed to work with the Astrocyte Workflow System through a simple web interface.
Cloud Compatibility
This pipeline is also capable of being run on AWS.
NOTE: This pipeline has been reverted to a non-containerized version so that it works on Astrocyte. Use the tag `containerized` for the last working containerized version, which is compatible with AWS.
To run on AWS:
- Build an AWS Batch queue and environment, either manually or with aws-cloudformation
- Edit one of the AWS configs in workflow/config/
  - Replace workDir with the S3 bucket generated
  - Change region if different
  - Change queue to the AWS Batch queue generated
- The user must have awscli configured with appropriate authentication (with `aws configure` and access keys) in the environment in which Nextflow will be run
- Add `-profile` with the name of the AWS config that was customized
  - eg. `nextflow run workflow/main.nf -profile aws_ondemand`
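The three values edited in the AWS config can be illustrated with a small Python helper that renders a Nextflow config fragment. This is only a sketch: the bucket, region, and queue values are placeholders, and the actual configs shipped in workflow/config/ may be laid out differently. `workDir`, `aws.region`, `process.executor`, and `process.queue` are standard Nextflow configuration settings; `render_aws_config` is a hypothetical name.

```python
def render_aws_config(bucket, region, queue):
    """Return a Nextflow config fragment for running on AWS Batch.

    All three arguments are user-supplied placeholders; the layout of the
    real configs in workflow/config/ is not assumed here.
    """
    if not bucket.startswith("s3://"):
        raise ValueError("workDir must be an S3 URI, e.g. s3://my-bucket/work")
    lines = [
        f"workDir = '{bucket}'",          # S3 bucket used as the work directory
        f"aws.region = '{region}'",       # region of the Batch environment
        "process.executor = 'awsbatch'",  # run processes through AWS Batch
        f"process.queue = '{queue}'",     # name of the Batch job queue
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical values for illustration only.
    print(render_aws_config("s3://example-bucket/work", "us-east-2", "example-queue"))
```

The printed fragment can be compared against the chosen config to check that all three values were updated.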
To Run:
- Available parameters:
  - `-profile`
    - what environments to run on, available: `biohpc`, `local`, `cluster`, `aws`, `ondemand`, `spot`
    - eg: -profile biohpc,cluster to run on BioHPC in cluster mode
    - eg: -profile aws,ondemand to run on AWS on an on-demand queue
  - `--fastq`
    - path to the fastq location
    - only R1 and R2 are necessary, but I2 can be included
    - only fastq files listed in the designFile (see below) are used; files not listed are ignored
    - eg: --fastq '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/hu.v3s2r10k/*.fastq.gz'
  - `--designFile`
    - path to design file (csv format) location
    - column 1 = "Sample"
    - column 2 = "fastq_R1"
    - column 3 = "fastq_R2"
    - can have repeated "Sample" if there are multiple fastq R1/R2 pairs for the samples
    - can be downloaded HERE
    - eg: --designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/hu.v3s2r10k/design.csv'
  - `--genome`
    - reference genome
    - requires workflow/conf/biohpc.config to work
    - name of available 10x Genomics premade reference genomes:
      - 'GRCh38-3.0.0' = Human GRCh38 release 93
      - 'GRCh38-1.2.0' = Human GRCh38 release 84
      - 'hg19-3.0.0' = Human GRCh37 (hg19) release 87
      - 'hg19-1.2.0' = Human GRCh37 (hg19) release 84
      - 'mm10-3.0.0' = Mouse GRCm38 (mm10) release 93
      - 'mm10-1.2.0' = Mouse GRCm38 (mm10) release 84
      - 'GRCh38_and_mm10-3.1.0' = Human GRCh38 + Mouse GRCm38 (mm10) release 93
      - 'hg19_and_mm10-3.0.0' = Human GRCh37 (hg19) + Mouse GRCm38 (mm10) release 93
      - 'hg19_and_mm10-1.2.0' = Human GRCh37 (hg19) + Mouse GRCm38 (mm10) release 84
      - 'ercc92-1.2.0' = ERCC.92 Spike-In
    - if --genome is used then --genomeLocationFull is not necessary
    - eg: --genome 'GRCh38-3.0.0'
  - `--genomeLocationFull`
    - path to a custom genome
    - if --genomeLocationFull is used --genome is not necessary and is ignored
    - eg: --genomeLocationFull '/project/apps_database/cellranger/refdata-cellranger-GRCh38-3.0.0'
  - `--expectCells`
    - expected number of cells to be detected
    - guides cellranger in its cutoff for background/low quality cells
    - as a guide it doesn't have to be exact
    - 0-10000
    - if --expectCells is used then --forceCells is not necessary
    - only used if --forceCells is not entered or set to 0
    - eg: --expectCells 10000
  - `--forceCells`
    - forces filtering of the top number of cells matching this parameter
    - 0-10000
    - if --forceCells is used then --expectCells is not necessary and is ignored
    - eg: --forceCells 10000
  - `--kitVersion`
    - the library chemistry version number for the 10x Genomics Gene Expression kit
    - setting to auto will attempt to autodetect from the detected sequencing strategy in the fastq files
    - version numbers are spelled out
    - --kitVersion is only used if --version (cellranger version) is > 2
    - --version (cellranger version) 2.1.1 can only read --kitVersion of 3GEXv2
    - options:
      - 'auto'
      - '3GEXv3'
      - '3GEXv2'
      - '5GEX'
    - eg: --kitVersion '3GEXv3'
  - `--version`
    - cellranger version
    - --version (cellranger version) 2.1.1 can only read --kitVersion of 3GEXv2
    - options:
      - '3.1.0'
      - '3.0.2'
      - '2.1.1'
    - eg: --version '3.1.0'
  - `--vizFiles`
    - create objects which can be used for downstream visualization and analysis of each sample's outputs, currently creates:
      - Seurat R-objects
    - true/false
    - eg: --vizFiles true
  - `--outDir`
    - optional output directory for run
    - eg: --outDir 'test'
- FULL EXAMPLE:
nextflow run workflow/main.nf -profile biohpc,cluster --fastq '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/hu.v3s2r10k/*.fastq.gz' --designFile '/project/shared/bicf_workflow_ref/workflow_testdata/cellranger/cellranger_count/hu.v3s2r10k/design.csv' --genome 'GRCh38-3.0.0' --kitVersion '3GEXv3' --version '3.1.0' --vizFiles true --outDir 'test'
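For scripted runs, a command like the full example can be assembled from a dictionary of parameters. The sketch below is a hypothetical helper, not part of the pipeline: `build_command` and its defaults are illustrative names. It mirrors the precedence rules described in the parameter list by dropping `--genome` when `--genomeLocationFull` is given and dropping `--expectCells` when `--forceCells` is non-zero; the pipeline applies these rules itself, so the helper only keeps the command line unambiguous.

```python
import shlex


def build_command(params, profile="biohpc,cluster", script="workflow/main.nf"):
    """Assemble a `nextflow run` command line from a dict of pipeline parameters."""
    params = dict(params)  # work on a copy so the caller's dict is untouched
    # A custom genome path overrides a named premade genome.
    if params.get("genomeLocationFull"):
        params.pop("genome", None)
    # forceCells overrides expectCells unless it is missing or 0.
    if params.get("forceCells"):
        params.pop("expectCells", None)
    else:
        params.pop("forceCells", None)
    args = ["nextflow", "run", script, "-profile", profile]
    for name, value in params.items():
        if isinstance(value, bool):
            value = "true" if value else "false"  # Nextflow-style booleans
        args += [f"--{name}", str(value)]
    # Quote each argument so globs such as *.fastq.gz reach Nextflow unexpanded.
    return " ".join(shlex.quote(arg) for arg in args)


if __name__ == "__main__":
    print(build_command({
        "fastq": "/path/to/fastqs/*.fastq.gz",   # placeholder path
        "designFile": "/path/to/design.csv",     # placeholder path
        "genome": "GRCh38-3.0.0",
        "kitVersion": "3GEXv3",
        "version": "3.1.0",
        "vizFiles": True,
        "outDir": "test",
    }))
```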
- Design example:
Sample | fastq_R1 | fastq_R2 |
---|---|---|
sample1 | pbmc_1k_v2_S1_L001_R1_001.fastq.gz | pbmc_1k_v2_S1_L001_R2_001.fastq.gz |
sample2 | pbmc_1k_v2_S2_L001_R1_001.fastq.gz | pbmc_1k_v2_S2_L001_R2_001.fastq.gz |
sample2 | pbmc_1k_v2_S2_L002_R1_001.fastq.gz | pbmc_1k_v2_S2_L002_R2_001.fastq.gz |
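A design file like the one above can be generated from a list of fastq file names. The following Python sketch is illustrative only: it assumes Illumina-style names of the form `<prefix>_S<n>_L<lane>_R<read>_001.fastq.gz`, and the function names and the optional prefix-to-sample mapping are hypothetical. Files that do not match the pattern (for example index reads) and lanes missing either R1 or R2 are skipped.

```python
import csv
import io
import re

# Assumed naming scheme: <prefix>_S<n>_L<lane>_R<1|2>_001.fastq.gz
FASTQ_NAME = re.compile(
    r"^(?P<prefix>.+_S\d+)_L(?P<lane>\d+)_R(?P<read>[12])_001\.fastq\.gz$"
)


def design_rows(fastq_names, sample_names=None):
    """Pair R1/R2 files per sample and lane; return (Sample, fastq_R1, fastq_R2) rows."""
    sample_names = sample_names or {}
    pairs = {}
    for name in fastq_names:
        match = FASTQ_NAME.match(name)
        if match is None:
            continue  # not an R1/R2 file under the assumed naming scheme
        key = (match["prefix"], match["lane"])
        pairs.setdefault(key, {})[match["read"]] = name
    rows = []
    for (prefix, _lane), reads in sorted(pairs.items()):
        if "1" in reads and "2" in reads:  # keep complete pairs only
            rows.append((sample_names.get(prefix, prefix), reads["1"], reads["2"]))
    return rows


def write_design(rows, handle):
    """Write the rows as CSV with the header the design file expects."""
    writer = csv.writer(handle, lineterminator="\n")
    writer.writerow(["Sample", "fastq_R1", "fastq_R2"])
    writer.writerows(rows)


if __name__ == "__main__":
    names = [
        "pbmc_1k_v2_S1_L001_R1_001.fastq.gz",
        "pbmc_1k_v2_S1_L001_R2_001.fastq.gz",
        "pbmc_1k_v2_S2_L001_R1_001.fastq.gz",
        "pbmc_1k_v2_S2_L001_R2_001.fastq.gz",
    ]
    labels = {"pbmc_1k_v2_S1": "sample1", "pbmc_1k_v2_S2": "sample2"}
    buffer = io.StringIO()
    write_design(design_rows(names, labels), buffer)
    print(buffer.getvalue())
```

A sample sequenced on several lanes produces one row per lane with the same "Sample" value, matching the repeated "Sample" entries the design file allows.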
Credits
This workflow was developed jointly with the Bioinformatics Core Facility (BICF), Department of Bioinformatics.
Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).