Astrocyte RNASeq Workflow Package

Workflow SOP

This SOP describes the analysis pipeline for RNA sequencing data. The pipeline includes (1) quality control, (2) variant calling analysis, (3) identification of fusion genes, and (4) statistical analyses of gene expression and isoform expression. The R data produced by the statistical analysis can be visualized using a custom R Shiny service.

For each file, this workflow:

1) Trims the ends of sequences that contain remaining adapter or have quality scores < 25, removes any sequence shorter than 35 bp after trimming, then generates a file recording how many sequences were trimmed.
2) Aligns the trimmed fastq files to the selected reference genome using HISAT2 (Kim et al. 2015) or STAR (Dobin et al. 2012).
3) Marks duplicates using Sambamba.
4) Counts features (genes, transcripts and exons) using featureCounts (Liao et al. 2014) and StringTie (Pertea et al. 2015) with the Gencode feature table (Harrow et al. 2012).
5) Performs basic pairwise differential expression analysis using edgeR (Robinson et al. 2010) and DESeq.
6) Calculates transcript abundances using ballgown (Frazee et al. 2014).
7) Identifies gene fusions or fused transcripts using STAR-Fusion.
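
The trimming rule in step 1 can be illustrated with a minimal Python sketch. This is not the trimmer the workflow runs, and adapter removal is left out. It only shows stripping bases with quality below 25 from both read ends, discarding reads shorter than 35 bp, and tallying counts for a report. Phred+33 quality encoding is assumed, and the function names are hypothetical.

```python
MIN_QUALITY = 25  # bases below this Phred score are trimmed from the read ends
MIN_LENGTH = 35   # reads shorter than this after trimming are discarded


def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into Phred scores (Phred+33 assumed)."""
    return [ord(ch) - offset for ch in quality_string]


def trim_read(sequence, quality_string):
    """Strip low-quality bases from both ends; return None if the read is too short."""
    scores = phred_scores(quality_string)
    start, end = 0, len(scores)
    while start < end and scores[start] < MIN_QUALITY:
        start += 1
    while end > start and scores[end - 1] < MIN_QUALITY:
        end -= 1
    if end - start < MIN_LENGTH:
        return None
    return sequence[start:end], quality_string[start:end]


def trim_reads(reads):
    """Trim (sequence, quality) pairs and tally counts for a trimming report."""
    kept = []
    report = {"total": 0, "trimmed": 0, "discarded": 0}
    for sequence, quality_string in reads:
        report["total"] += 1
        result = trim_read(sequence, quality_string)
        if result is None:
            report["discarded"] += 1
            continue
        if len(result[0]) < len(sequence):
            report["trimmed"] += 1
        kept.append(result)
    return kept, report
```

A production trimmer typically uses a running-sum quality algorithm rather than this simple end-stripping, so counts from this sketch will not necessarily match the workflow's trimming report.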

Workflow Parameters

fastq - Choose one or more RNASeq read files to process.
genome - Choose a genomic reference (genome).
stranded - If a stranded library was prepared, select the appropriate protocol to ensure strand-specific analysis.
markdups - By default, duplicates are removed; you can skip this step.
pairs - Indicate whether this is paired-end or single-end sequencing data.
dea - Perform differential expression analysis; skip this if there are < 2 sample groups.
geneset - Select a set of known genesets for pathway analysis
fusion - Perform Gene Fusion Identification
design - This file matches the fastq files to data about the sample.

The following columns are required, must be named as in the template, and can be in any order:

    This ID should match the name in the fastq file, e.g. for S0001.R1.fastq.gz the sample ID is S0001
    Note: SampleID shouldn't start with a number, e.g. 10C should be changed to S10C
    This ID can be the identifier of the researcher or clinician
    Used in order to link samples from the same patient
    This is the group that will be used for pairwise differential expression analysis
    Name of the fastq file R1
    Name of the fastq file R2
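
The two SampleID rules above (the ID matches the start of the fastq file name, and it does not begin with a number) can be checked before submitting a design file. The sketch below is illustrative only. It assumes the sample ID is everything before the first "." in the file name, as in the S0001.R1.fastq.gz example, and all function names are hypothetical.

```python
import os


def sample_id_from_fastq(fastq_name):
    """Return the text before the first '.', e.g. S0001.R1.fastq.gz -> S0001."""
    return os.path.basename(fastq_name).split(".")[0]


def normalize_sample_id(sample_id, prefix="S"):
    """Prefix IDs that start with a digit, e.g. 10C -> S10C."""
    if sample_id[:1].isdigit():
        return prefix + sample_id
    return sample_id


def check_sample(sample_id, fastq_r1):
    """Return a list of human-readable problems for one design-file row."""
    problems = []
    if sample_id[:1].isdigit():
        problems.append(
            f"{sample_id} starts with a number; use {normalize_sample_id(sample_id)}"
        )
    expected = sample_id_from_fastq(fastq_r1)
    if sample_id != expected:
        problems.append(f"{sample_id} does not match fastq name {fastq_r1}")
    return problems
```

For example, check_sample("S0001", "S0001.R1.fastq.gz") returns an empty list, while check_sample("10C", "S10C.R1.fastq.gz") reports both the leading number and the mismatch with the file name.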

There are some optional columns that might help with the analysis: Tissue, Gender, CultureDate, SequenceRun, Organism, CellPopulation, Treatment, GeneticFeature (WT or KO), Race, Ethnicity, Age
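
The rule behind the dea parameter (skip differential expression when there are fewer than 2 sample groups) can also be checked from the design file. The sketch below assumes a tab-delimited design file and a group column named SampleGroup. Both are assumptions made for illustration, so substitute the delimiter and column name used in the template. The helper names read_design and can_run_dea are hypothetical.

```python
import csv
import io

GROUP_COLUMN = "SampleGroup"  # assumed name; use the column name from the template


def read_design(handle, delimiter="\t"):
    """Read a delimited design file into a list of row dictionaries."""
    return list(csv.DictReader(handle, delimiter=delimiter))


def can_run_dea(rows, group_column=GROUP_COLUMN):
    """Return True when the design contains at least two distinct sample groups."""
    groups = {row[group_column] for row in rows if row.get(group_column)}
    return len(groups) >= 2


# Example with an in-memory design file containing two groups.
example = "SampleID\tSampleGroup\nS0001\tcontrol\nS0002\ttreated\n"
rows = read_design(io.StringIO(example))
print(can_run_dea(rows))  # True
```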

Test Data


This example workflow is derived from original scripts kindly contributed by the Bioinformatic Core Facility (BICF), Department of Bioinformatics.




