

RNA-Seq Analytic Pipeline for GUDMAP/RBK

Introduction
This pipeline was created as a standard mRNA-sequencing analysis pipeline that integrates with the GUDMAP and RBK consortium data-hub. It is designed to run on the HPC cluster (BioHPC) at UT Southwestern Medical Center, in conjunction with the standard Nextflow profile config, biohpc.config.


Cloud Compatibility:
This pipeline is also capable of being run on AWS. To do so:

Build an AWS Batch queue and environment, either manually or with aws-cloudformation

Edit one of the aws configs in workflow/config/

Replace workDir with the S3 bucket generated
Change region if different
Change queue to the aws batch queue generated


The user must have awscli configured with appropriate authentication (using aws configure and access keys) in the environment in which Nextflow will be run
Add -profile with the name of the AWS config that was customized
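
The last two steps might look like the sketch below. It assumes the customized config is the one behind the aws_ondemand profile, borrows the file paths and replicate RID from the full example in this README as placeholders, and only prints the command so it can be reviewed before launching. The aws configure step is interactive, so it is left commented out.

```shell
#!/bin/sh
# One-time, interactive step: store access keys and default region for awscli.
# aws configure

# Profile to use; aws_ondemand and aws_spot are the AWS profiles listed in this README.
PROFILE="aws_ondemand"

# Paths and replicate RID are placeholders borrowed from the full example.
CMD="nextflow run workflow/rna-seq.nf -profile ${PROFILE} --deriva ./data/credential.json --bdbag ./data/cookies.txt --repRID Q-Y5JA"

# Print the command for review; launch it with: eval "$CMD"
echo "$CMD"
```

Switching PROFILE to aws_spot would request spot instances instead, provided that config was the one edited.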


To Run:


Available parameters:


--deriva active credential.json file from deriva-auth


--bdbag active cookies.txt file from deriva-auth


--repRID mRNA-seq replicate RID

--source consortium server source


dev = dev.gudmap.org (default, does not contain all data)

staging = staging.gudmap.org (does not contain all data)

production = www.gudmap.org (contains all data)


--refMoVersion mouse reference version (optional, default = 38.p6.vM22)


--refHuVersion human reference version (optional, default = 38.p12.v31)


--refERCCVersion ERCC reference version (optional, default = 92)


--upload option to upload outputs back to the data-hub (optional, default = false)


true = upload outputs to the data-hub

false = do NOT upload outputs to the data-hub


-profile config profile to use (optional):

default = processes on BioHPC cluster

biohpc = processes on BioHPC cluster

biohpc_max = processes on high-power BioHPC cluster nodes (>= 128GB nodes), for resource testing

aws_ondemand = AWS Batch on-demand instance requests

aws_spot = AWS Batch spot instance requests


--email email address(es) to send failure notification (comma separated) (optional):

e.g: --email 'Venkat.Malladi@utsouthwestern.edu,Gervaise.Henry@UTSouthwestern.edu'


NOTES:

Once deriva-auth is run and authenticated, the two files above are saved in ~/.deriva/ (see the official deriva documentation for the lifetime of the credentials)
The reference version consists of the Genome Reference Consortium version, patch release, and GENCODE annotation release number (leaving the params blank will use the default version tied to the pipeline version)


current mouse 38.p6.vM22 = GRCm38.p6 with GENCODE annotation release M22

current human 38.p12.v31 = GRCh38.p12 with GENCODE annotation release 31
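
To make the version format in the notes above concrete, here is a small sketch that splits a reference version string into its three dot-separated parts. The function name parse_ref_version is hypothetical and is not part of the pipeline; it only illustrates how the string is read.

```shell
#!/bin/sh
# Split a reference version such as 38.p6.vM22 into its three dot-separated parts.
# parse_ref_version is an illustrative helper, not a pipeline function.
parse_ref_version() {
  version="$1"
  build="${version%%.*}"      # Genome Reference Consortium build, e.g. 38
  rest="${version#*.}"        # remainder after the first dot, e.g. p6.vM22
  patch="${rest%%.*}"         # patch release, e.g. p6
  gencode="${rest#*.}"        # annotation field, e.g. vM22
  gencode="${gencode#v}"      # GENCODE release without the leading v, e.g. M22
  echo "build=${build} patch=${patch} gencode=${gencode}"
}

parse_ref_version "38.p6.vM22"   # prints: build=38 patch=p6 gencode=M22
parse_ref_version "38.p12.v31"   # prints: build=38 patch=p12 gencode=31
```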


Optional input overrides


--refSource source for pulling references


biohpc = source references from BICF_Core gudmap reference local location (workflow must be run on BioHPC system)

datahub = source references from GUDMAP/RBK reference_table location (currently uses dev.gudmap.org)


--inputBagForce utilizes a local replicate inputBag instead of downloading from the data-hub (still requires accurate repRID input)

eg: --inputBagForce test_data/bag/Replicate_Q-Y5F6.zip (must be the expected bag structure)


--fastqsForce utilizes local fastq's instead of downloading from the data-hub (still requires accurate repRID input)

eg: --fastqsForce 'test_data/fastq/small/Q-Y5F6_1M.R{1,2}.fastq.gz' (note the quotes around fastq's which must be named in the correct standard [*.R1.fastq.gz and/or *.R2.fastq.gz] and in the correct order)


--speciesForce forces the species to be "Mus musculus" or "Homo sapiens"; this bypasses the ambiguous species error

eg: --speciesForce 'Mus musculus'
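
Before passing local files to --fastqsForce, it may help to confirm they follow the naming standard described above. The check below is a sketch only: is_standard_fastq_name is a hypothetical helper, the third file name is an invented counter-example, and the pipeline itself may validate names differently.

```shell
#!/bin/sh
# Return 0 when a file name ends in .R1.fastq.gz or .R2.fastq.gz, 1 otherwise.
# is_standard_fastq_name is an illustrative helper, not a pipeline function.
is_standard_fastq_name() {
  case "$1" in
    *.R1.fastq.gz|*.R2.fastq.gz) return 0 ;;
    *) return 1 ;;
  esac
}

# The last name is a made-up example of a file that does not match the standard.
for f in Q-Y5F6_1M.R1.fastq.gz Q-Y5F6_1M.R2.fastq.gz Q-Y5F6_1M_1.fq.gz; do
  if is_standard_fastq_name "$f"; then
    echo "ok:  $f"
  else
    echo "bad: $f"
  fi
done
```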


Tracking parameters (Tracking Site):


--ci boolean (default = false)

--dev boolean (default = true)


FULL EXAMPLE:

nextflow run workflow/rna-seq.nf --deriva ./data/credential.json --bdbag ./data/cookies.txt --repRID Q-Y5JA
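
A variation on the example above, as a sketch: it first checks that the two deriva-auth files exist in ~/.deriva/ and then prints a command that also sets --source and --upload. The DERIVA_DIR variable and the have_deriva_files helper are assumptions made for this illustration; the command is printed rather than executed so it can be reviewed first.

```shell
#!/bin/sh
# Directory where deriva-auth saves its files; overridable for this sketch only.
DERIVA_DIR="${DERIVA_DIR:-$HOME/.deriva}"

# Succeed only when both authentication files are present in the given directory.
# have_deriva_files is an illustrative helper, not a pipeline function.
have_deriva_files() {
  [ -f "$1/credential.json" ] && [ -f "$1/cookies.txt" ]
}

if have_deriva_files "$DERIVA_DIR"; then
  # Print the command for review instead of launching it.
  echo "nextflow run workflow/rna-seq.nf --deriva $DERIVA_DIR/credential.json --bdbag $DERIVA_DIR/cookies.txt --repRID Q-Y5JA --source production --upload false"
else
  echo "credential.json or cookies.txt missing in $DERIVA_DIR; run deriva-auth first" >&2
fi
```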


To run a set of replicates from study RID:
Run in repo root dir:


sh workflow/scripts/splitStudy.sh [studyRID]
It will run in parallel in batches of 25 replicate RIDs, with 30-second delays between launches.

NOTE: Nextflow "local" processes for all replicates will run on the node/machine the bash script is launched from... consider running the study script on the BioHPC's SLURM cluster (use sbatch).
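
One way to follow the note above is to wrap the study script in a SLURM job. The sketch below writes a small job script and prints the submit command. The job name, partition, time limit, output file name, and study RID are all placeholders chosen for illustration, not values from this README; adjust them to the cluster in use.

```shell
#!/bin/sh
# Write a small SLURM job script that launches the study script from the repo root.
# The partition, time limit, and study RID below are placeholders.
JOB_SCRIPT="run_study.sbatch"
STUDY_RID="STUDY_RID_HERE"

cat > "$JOB_SCRIPT" <<EOF
#!/bin/bash
#SBATCH --job-name=rnaseq_study
#SBATCH --partition=PARTITION_NAME
#SBATCH --nodes=1
#SBATCH --time=2-00:00:00
#SBATCH --output=rnaseq_study.%j.out

sh workflow/scripts/splitStudy.sh ${STUDY_RID}
EOF

echo "wrote $JOB_SCRIPT; submit with: sbatch $JOB_SCRIPT"
```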


CHANGELOG


Credits
This workflow was developed by the Bioinformatics Core Facility (BICF), Department of Bioinformatics

PI
Venkat S. Malladi

Faculty Associate & Director

Bioinformatics Core Facility

UT Southwestern Medical Center

orcid.org/0000-0002-0144-0564

venkat.malladi@utsouthwestern.edu

Developers
Gervaise H. Henry

Computational Biologist

Department of Urology

UT Southwestern Medical Center

orcid.org/0000-0001-7772-9578

gervaise.henry@utsouthwestern.edu
Jonathan Gesell

Computational Biologist

Bioinformatics Core Facility

UT Southwestern Medical Center

orcid.org/0000-0001-5902-3299

johnathan.gesell@utsouthwestern.edu
Jeremy A. Mathews

Computational Intern

Bioinformatics Core Facility

UT Southwestern Medical Center

orcid.org/0000-0002-2931-1430

jeremy.mathews@utsouthwestern.edu
Please cite in publications: Pipeline was developed by BICF from funding provided by Cancer Prevention and Research Institute of Texas (RP150596).


Pipeline Directed Acyclic Graph