Gervaise Henry authored c8a7ab57
RNA-Seq Analytic Pipeline for GUDMAP/RBK
Introduction
This pipeline was created to be a standard mRNA-sequencing analysis pipeline that integrates with the GUDMAP and RBK consortium data-hub. It is designed to run on the HPC cluster (BioHPC) at UT Southwestern Medical Center, in conjunction with the standard nextflow profile (config `biohpc.config`).
Cloud Compatibility:
This pipeline is also capable of being run on AWS. To do so:
- Build an AWS Batch queue and environment, either manually or with aws-cloudformation
- Edit one of the aws configs in workflow/config/
  - Replace workDir with the S3 bucket generated
  - Change region if different
  - Change queue to the AWS Batch queue generated
- The user must have awscli configured with appropriate authentication (with `aws configure` and access keys) in the environment in which nextflow will be run
- Add `-profile` with the name of the aws config which was customized
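As a rough sketch of the last two steps, the snippet below checks that awscli is on the PATH and then prints (rather than runs) a launch command using one of the AWS profiles. The entry script path `workflow/rna-seq.nf` and the replicate RID `Q-Y5F6` are assumptions made for illustration, and the remaining run parameters are left out for brevity.

```shell
#!/bin/sh
# Assumption: the pipeline entry script lives at workflow/rna-seq.nf.
PIPELINE="workflow/rna-seq.nf"
# Name of the aws config profile that was customized.
NF_PROFILE="aws_ondemand"
# Placeholder replicate RID, used only for illustration.
REP_RID="Q-Y5F6"

# Check that awscli is installed; credentials themselves are set up with `aws configure`.
if command -v aws >/dev/null 2>&1; then
  AWSCLI_STATUS="found"
else
  AWSCLI_STATUS="missing"
  echo "awscli not found: install it and run 'aws configure' first" >&2
fi

# Compose the launch command and print it instead of running it.
RUN_CMD="nextflow run $PIPELINE --repRID $REP_RID -profile $NF_PROFILE"
echo "$RUN_CMD"
```

Printing the command first makes it easy to confirm the profile name before any AWS Batch jobs are submitted.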
To Run:
- Available parameters:
  - `--deriva` active credential.json file from deriva-auth
  - `--bdbag` active cookies.txt file from deriva-auth
  - `--repRID` mRNA-seq replicate RID
  - `--source` consortium server source
    - dev = dev.gudmap.org (default, does not contain all data)
    - staging = staging.gudmap.org (does not contain all data)
    - production = www.gudmap.org (does contain all data)
  - `--refMoVersion` mouse reference version (optional)
  - `--refHuVersion` human reference version (optional)
  - `--refERCCVersion` ERCC reference version (optional)
  - `-profile` config profile to use (optional):
    - default = processes on BioHPC cluster
    - biohpc = process on BioHPC cluster
    - biohpc_max = process on high power BioHPC cluster nodes (>= 128GB nodes), for resource testing
    - aws_ondemand = AWS Batch on-demand instance requests
    - aws_spot = AWS Batch spot instance requests
  - NOTES:
    - once deriva-auth is run and authenticated, the two files above are saved in `~/.deriva/` (see official documents from deriva on the lifetime of the credentials)
    - reference version consists of Genome Reference Consortium version, patch release and GENCODE annotation release # (leaving the params blank will use the default version tied to the pipeline version)
      - current mouse 38.p6.vM22 = GRCm38.p6 with GENCODE annotation release M22
      - current human 38.p12.v31 = GRCh38.p12 with GENCODE annotation release 31
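Putting the parameters together, a launch might look like the sketch below. It assumes the entry script is `workflow/rna-seq.nf` and uses `Q-Y5F6` as a placeholder replicate RID; neither is stated in this section. The credential file names and the mouse reference version come from the list above. The command is printed rather than executed so it can be inspected first.

```shell
#!/bin/sh
# Assumption: entry script path; adjust to the actual location in the repository.
PIPELINE="workflow/rna-seq.nf"
# Files written by deriva-auth once authenticated (see NOTES).
DERIVA_DIR="$HOME/.deriva"
CREDENTIAL="$DERIVA_DIR/credential.json"
COOKIES="$DERIVA_DIR/cookies.txt"
# Placeholder replicate RID.
REP_RID="Q-Y5F6"

# Warn early if deriva-auth has not been run yet.
MISSING=0
for f in "$CREDENTIAL" "$COOKIES"; do
  if [ ! -f "$f" ]; then
    echo "missing $f: run deriva-auth and authenticate first" >&2
    MISSING=$((MISSING + 1))
  fi
done

# Compose the command: production source, pinned mouse reference, BioHPC profile.
RUN_CMD="nextflow run $PIPELINE \
--deriva $CREDENTIAL --bdbag $COOKIES \
--repRID $REP_RID --source production \
--refMoVersion 38.p6.vM22 -profile biohpc"
echo "$RUN_CMD"
```

Leaving out `--refMoVersion` would fall back to the default reference version tied to the pipeline version, as described in the notes.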
- Optional input overrides
  - `--refSource` source for pulling references
    - biohpc = source references from BICF_Core gudmap reference local location (workflow must be run on BioHPC system)
    - datahub = source references from GUDMAP/RBK reference_table location (currently uses dev.gudmap.org)
  - `--inputBagForce` utilizes a local replicate inputBag instead of downloading from the data-hub (still requires accurate repRID input)
    - eg: `--inputBagForce test_data/bag/Replicate_Q-Y5F6.zip` (must be the expected bag structure)
  - `--fastqsForce` utilizes local fastq's instead of downloading from the data-hub (still requires accurate repRID input)
    - eg: `--fastqsForce 'test_data/fastq/small/Q-Y5F6_1M.R{1,2}.fastq.gz'` (note the quotes around fastq's, which must be named in the correct standard [*.R1.fastq.gz and/or *.R2.fastq.gz] and in the correct order)
  - `--speciesForce` forces the species to be "Mus musculus" or "Homo sapiens"; it bypasses the ambiguous species error
    - eg: `--speciesForce 'Mus musculus'`
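The quoting note for `--fastqsForce` matters because an unquoted `R{1,2}` pattern may be expanded by the shell (bash performs brace expansion) before nextflow ever sees it. The sketch below uses a hypothetical helper, `count_args`, to show that the quoted form is passed through as a single argument, then prints a launch command combining two of the overrides. The entry script path `workflow/rna-seq.nf` is an assumption, `Q-Y5F6` is inferred from the example file names, and the other run parameters are omitted for brevity.

```shell
#!/bin/sh
# Hypothetical helper: reports how many arguments it received.
count_args() { echo "$#"; }

# Quoted: the pattern is passed through untouched as one argument.
QUOTED_COUNT=$(count_args 'test_data/fastq/small/Q-Y5F6_1M.R{1,2}.fastq.gz')
echo "quoted pattern -> $QUOTED_COUNT argument(s)"

# Assumption: entry script path.
PIPELINE="workflow/rna-seq.nf"
# Local fastq's plus a forced species; repRID must still be accurate.
RUN_CMD="nextflow run $PIPELINE --repRID Q-Y5F6 \
--fastqsForce 'test_data/fastq/small/Q-Y5F6_1M.R{1,2}.fastq.gz' \
--speciesForce 'Mus musculus'"
echo "$RUN_CMD"
```

The species name is quoted for a similar reason: it contains a space and would otherwise be split into two arguments.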
- Tracking parameters (Tracking Site):
  - `--ci` boolean (default = false)
  - `--dev` boolean (default = false)