Commit 216212b1 authored by Brandi Cantarel

first try again

Copyright © 2016. The University of Texas Southwestern Medical Center
5323 Harry Hines Boulevard Dallas, Texas, 75390 Telephone 214-648-3111
# Astrocyte Example Workflow Package
This is an example workflow package for the BioHPC Astrocyte workflow engine. Astrocyte is a system allowing workflows to be run easily from the web, in a push-button manner, taking advantage of the BioHPC compute cluster. Astrocyte allows users to access this workflow package using a simple web interface, created automatically from the definitions in this package.
This workflow package provides:
1) A sample ChIP-Seq data analysis workflow, which uses BWA and MACS to call peaks from one or more ChIP-Seq FASTQ input files. The workflow is implemented in the *Nextflow* workflow language.
2) A sample *R Shiny* visualization app, which provides a web-based tool for visualizing results.
3) Meta-data describing the workflow, its inputs, outputs, etc.
4) User-focused documentation, in markdown format, that will be displayed to users in the Astrocyte web interface.
5) Developer-focused documentation, in this file.
## Workflow Package Layout
Workflow packages for Astrocyte are Git repositories, and have a common layout which must be followed so that Astrocyte understands how to present them to users.
### Meta-Data
* `astrocyte_pkg.yml` - A file which contains the metadata describing the workflow in a human- and machine-readable text format called *YAML*. This includes information about the workflow package such as its name, synopsis, input parameters, outputs, etc.
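
As a rough illustration of reading this metadata from a script, the sketch below extracts the package `name` and checks it. The function names `pkg_name` and `valid_pkg_name` are hypothetical, and the rule used (ASCII letters and underscores only) is an assumption based on the "text/underscores only" comment in the metadata file; Astrocyte's own validation may differ.

```bash
#!/usr/bin/env bash
# Sketch: read the top-level `name:` field from astrocyte_pkg.yml and check it.
# Assumption: "text/underscores only" is read here as ASCII letters and
# underscores; Astrocyte's real validation rule may differ.

# Print the value of the first top-level `name:` key, without quotes.
pkg_name() {
    sed -n "s/^name:[[:space:]]*['\"]\{0,1\}\([^'\"[:space:]]*\)['\"]\{0,1\}[[:space:]]*\$/\1/p" "$1" | head -n 1
}

# Succeed only if the name is non-empty and made of letters and underscores.
valid_pkg_name() {
    case "$1" in
        ''|*[!A-Za-z_]*) return 1 ;;
        *) return 0 ;;
    esac
}

# Example: valid_pkg_name "$(pkg_name astrocyte_pkg.yml)" && echo "name ok"
```

This is a line-based reader rather than a YAML parser, so it only handles a `name:` key written on a single line at the top level.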
### The Workflow
* `workflow/main.nf` - A *Nextflow* workflow file, which will be run by Astrocyte using parameters provided by the user.
* `workflow/scripts` - A directory for any scripts (e.g. bash, python, ruby scripts) that the `main.nf` workflow will call. This might be empty if the workflow is implemented entirely in Nextflow. You should *not* include large pieces of software here. Workflows should be designed to use *modules* available on the BioHPC cluster. The modules a workflow needs will be defined in the `astrocyte_pkg.yml` metadata file.
* `workflow/lib` - A directory for any Nextflow/Groovy libraries that might be included by workflows using advanced features. Usually empty.
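
To show how the module list in the metadata relates to the cluster environment, here is a minimal sketch that prints the entries under `workflow_modules:`. The function name `list_workflow_modules` is hypothetical, and the sketch assumes the list is written one `- 'name/version'` item per line.

```bash
#!/usr/bin/env bash
# Sketch: list the entries of `workflow_modules:` in astrocyte_pkg.yml.
# Assumption: the list is written one "- name/version" item per line;
# this is a line-based reader, not a YAML parser.

list_workflow_modules() {
    awk -v q="'" '
        /^workflow_modules:/ { inlist = 1; next }      # start of the list
        inlist && /^[[:space:]]*#/ { next }            # skip comment lines
        inlist && /^[[:space:]]*-/ {                   # one list item
            sub(/^[[:space:]]*-[[:space:]]*/, "")      # drop the leading dash
            gsub("[\"" q "]", "")                      # drop quote characters
            sub(/[[:space:]]+$/, "")                   # drop trailing blanks
            print
            next
        }
        inlist && /^[^[:space:]]/ { inlist = 0 }       # next top-level key
    ' "$1"
}

# Example (on a system with environment modules):
#   for m in $(list_workflow_modules astrocyte_pkg.yml); do module load "$m"; done
```

Loading the same versioned modules by hand is one way to reproduce, outside Astrocyte, the environment a workflow script expects.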
### The Visualization App *(Optional)*
* `vizapp/` - A directory that will contain an *R Shiny* visualization app, if required. The visualization app will be made available to the user via the Astrocyte web interface. At minimum the directory requires the standard Shiny `ui.R` and `server.R` files. The exact Shiny app structure is not prescribed. Any R packages required by the Shiny app will be listed in the `astrocyte_pkg.yml` metadata.
The visualization app will have access to any final output that was published to the `$baseDir` location in the Nextflow workflow. This path will be accessible as `Sys.getenv('baseDir')`.
### User Documentation
* `docs/index.md` - The first page of user documentation, in *markdown* format. Astrocyte will display this documentation to users of the workflow package.
* `docs/...` - Any other documentation files. *Markdown* `.md` files will be rendered for display on the web. Any images used in the documentation should also be placed here.
### Developer Documentation
* `README.md` - Documentation for developers of the workflow giving a brief overview and any important notes that are not for workflow users.
* `LICENSE.md` *(Optional)* - The license applied to the workflow package.
* `CHANGES.md` - A brief summary of changes made through time to the workflow.
### Testing
* `test_data/` - Every workflow package should include a minimal set of test data that allows the workflow to be run, testing its features. The `test_data/` directory is a location for test data files. Test data should be kept as small as possible. If large datasets (over 20MB total) are unavoidable, provide a `fetch_test_data.sh` bash script which obtains the data from an external source.
* `test_data/fetch_test_data.sh` *(Optional)* - A bash script that fetches large test data from an external source, placing it into the `test_data/` directory.
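
The layout above can be spot-checked with a few lines of shell before running the Astrocyte tools. The sketch below is only an illustration: the function name `check_layout` is hypothetical, and it assumes that every entry not marked *(Optional)* above is required. `astrocyte_cli check` remains the authoritative test.

```bash
#!/usr/bin/env bash
# Sketch: report which expected files are missing from a workflow package.
# Assumption: the entries not marked "(Optional)" in this README are treated
# as required; Astrocyte itself may apply different rules.

check_layout() {
    local pkg="$1" missing=0 path
    for path in astrocyte_pkg.yml workflow/main.nf docs/index.md \
                README.md CHANGES.md test_data; do
        if [ ! -e "$pkg/$path" ]; then
            echo "missing: $path"
            missing=1
        fi
    done
    # If a vizapp/ directory exists it needs at least ui.R and server.R.
    if [ -d "$pkg/vizapp" ]; then
        for path in vizapp/ui.R vizapp/server.R; do
            if [ ! -f "$pkg/$path" ]; then
                echo "missing: $path"
                missing=1
            fi
        done
    fi
    return "$missing"
}

# Example: check_layout astrocyte_example && echo "layout looks complete"
```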
## Testing/Running the Workflow with the Astrocyte CLI
Workflows will usually be run from the Astrocyte web interface, by importing the workflow package repository and making it available to users. During development you can use the Astrocyte CLI scripts to check, test, and run your workflow against non-test data.
To check the structure and syntax of the workflow package in the directory `astrocyte_example`:
```bash
$ astrocyte_cli check astrocyte_example
```
To launch the workflow's defined tests, against the included test data:
```bash
$ astrocyte_cli test astrocyte_example
```
To run the workflow using specific data and parameters (a working directory will be created):
```bash
$ astrocyte_cli run astrocyte_example --parameter1 "value1" --parameter2 "value2"...
```
To run the Shiny visualization app against `test_data`:
```bash
$ astrocyte_cli shinytest astrocyte_example
```
To run the Shiny visualization app against output from `astrocyte_cli run`, which will be in the work directory created by `run`:
```bash
$ astrocyte_cli shiny astrocyte_example
```
To generate the user-facing documentation for the workflow and display it in a web browser:
```bash
$ astrocyte_cli docs astrocyte_example
```
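
During development the commands above are often run back to back. A minimal sketch of such a loop follows; the function name `dev_cycle` is hypothetical, and it assumes that `astrocyte_cli` exits with a non-zero status when a step fails, which this README does not state.

```bash
#!/usr/bin/env bash
# Sketch: run the check, test and docs steps in order for one package,
# stopping at the first step that fails.
# Assumption: astrocyte_cli exits non-zero on failure.

dev_cycle() {
    local pkg="$1" step
    for step in check test docs; do
        echo "== astrocyte_cli $step $pkg"
        astrocyte_cli "$step" "$pkg" || {
            echo "step '$step' failed" >&2
            return 1
        }
    done
}

# Example: dev_cycle astrocyte_example
```

Only the subcommands documented above are used, and `run` is left out because its parameters are specific to each workflow.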
## The Example ChIP-Seq Workflow ##
## The Example Shiny Viz App ##
## Provided Test Data ##
#
# metadata for the BICF RNASeq astrocyte workflow package
#
# -----------------------------------------------------------------------------
# BASIC INFORMATION
# -----------------------------------------------------------------------------
# A unique identifier for the workflow package, text/underscores only
name: 'rnaseq_bicf'
# Who wrote this?
author: 'Brandi Cantarel'
# A contact email address for questions
email: 'biohpc-help@utsouthwestern.edu'
# A more informative title for the workflow package
title: 'BICF RNASeq Analysis Workflow'
# A summary of the workflow package in plain text
description: |
  This is a workflow package for the BioHPC/BICF RNASeq workflow system.
  It implements a simple RNASeq analysis workflow using TrimGalore, HiSAT, FeatureCounts,
  StringTie and statistical analysis using EdgeR and Ballgown, plus a simple R Shiny
  visualization application.
# -----------------------------------------------------------------------------
# DOCUMENTATION
# -----------------------------------------------------------------------------
# A list of documentation files in .md format that should be viewable from the
# web interface. These files are in the 'docs' subdirectory. The first file
# listed will be used as a documentation index and is index.md by convention
documentation_files:
- 'index.md'
# -----------------------------------------------------------------------------
# NEXTFLOW WORKFLOW CONFIGURATION
# -----------------------------------------------------------------------------
# Remember - The workflow file is always named 'workflow/main.nf'
# The workflow must publish all final output into $baseDir
# A list of cluster environment modules that this workflow requires to run.
# Specify versioned module names to ensure reproducibility.
workflow_modules:
- 'trimgalore/0.4.1'
- 'cutadapt/1.9.1'
- 'hisat2/2.0.1-beta-intel'
- 'samtools/intel/1.2'
- 'picard/1.127'
- 'subread/1.5.0-intel'
- 'stringtie/1.1.2-intel'
# A list of parameters used by the workflow, defining how to present them,
# options etc in the web interface. For each parameter:
#
# REQUIRED INFORMATION
# id: The name of the parameter in the NEXTFLOW workflow
# type: The type of the parameter, one of:
# string - A free-format string
# integer - An integer
# real - A real number
# file - A single file from user data
# files - One or more files from user data
# select - A selection from a list of values
# required: true/false, must the parameter be entered/chosen?
# description: A user friendly description of the meaning of the parameter
#
# OPTIONAL INFORMATION
# default: A default value for the parameter (optional)
# min: Minimum value/characters/files for number/string/files types
# max: Maximum value/characters/files for number/string/files types
# regex: A regular expression that describes valid entries / filenames
#
# SELECT TYPE
# choices: A set of choices presented to the user for the parameter.
# Each choice is a pair of value and description, e.g.
#
# choices:
# - [ 'myval1', 'The first option']
# - [ 'myval2', 'The second option']
#
# NOTE - All parameters are passed to NEXTFLOW as strings... but they
# are validated by astrocyte using the information provided above
workflow_parameters:
  - id: datadir
    type: files
    required: true
    description: |
      Directory containing one or more paired-end FASTQ input files from an RNASeq experiment, plus a design file that links each sample name to its sample group
    # regex: ".*(fastq|fq).gz"
    min: 1
  - id: genome
    type: select
    choices:
      - [ '/project/BICF/s166458/refdata/GRCh38', 'Human GRCh38']
      - [ '/project/BICF/s166458/refdata/GRCm38', 'Mouse GRCm38']
    required: true
    description: |
      Reference genome for alignment
# -----------------------------------------------------------------------------
# SHINY APP CONFIGURATION
# -----------------------------------------------------------------------------
# Remember - The vizapp is always 'vizapp/server.R' 'vizapp/ui.R'
# The workflow must publish all final output into $baseDir
# Name of the R module that the vizapp will run against
vizapp_r_module: 'R/3.2.1-Intel'
# List of any CRAN packages, not provided by the modules, that must be made
# available to the vizapp
vizapp_cran_packages:
- sqldf
- shiny
- Vennerable
- DT
- ggplot2
- gplots
- gtools
- RColorBrewer
# List of any Bioconductor packages, not provided by the modules, that must be made
# available to the vizapp
vizapp_bioc_packages:
- qusage
- ballgown
- edgeR
- DESeq2
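# Assign a plot color to each row of n, based on its first two columns
# and a cutoff of +/-2 on each.
bg.colors <- function(n) {
  colorvec <- vector(mode="character", length=nrow(n))
  # seq_len() avoids iterating over c(1, 0) when n has no rows
  for (i in seq_len(nrow(n))) {
    # Column 1 at or above the upper cutoff
    if (n[i,1] >= 2) {
      if (n[i,2] >= 2) {
        colorvec[i] <- "purple"
      } else {
        colorvec[i] <- "cyan"
      }
    }
    # Column 1 at or below the lower cutoff
    if (n[i,1] <= -2) {
      if (n[i,2] >= 2) {
        colorvec[i] <- "red"
      } else {
        colorvec[i] <- "salmon"
      }
    }
    # Any row with either column inside [-2, 2] is reset to grey
    if (n[i,2] >= -2 && n[i,2] <= 2) {
      colorvec[i] <- "grey"
    }
    if (n[i,1] >= -2 && n[i,1] <= 2) {
      colorvec[i] <- "grey"
    }
  }
  colorvec
}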
options("repos"="http://cran.rstudio.com/")
update.packages()
install.packages("sqldf",dep=TRUE)
install.packages("gmp",dep=TRUE)
install.packages(c('gplots','lattice','latticeExtra','vegan','labdsv','cluster','ggplot2'))
install.packages("Vennerable", repos="http://R-Forge.R-project.org",type='source')
source("http://bioconductor.org/biocLite.R")
biocLite(c('graph', 'RBGL', 'RColorBrewer', 'reshape', 'gtools',"edgeR", "DESeq2","qusage","ballgown"))
library(shiny)
library(Vennerable)
library(qusage)
library(DT)
library(ggplot2)
library(ballgown)
library(sqldf)
shinyServer(function(input, output, session) {
data.dir <- 'data_files'
rda.dir <- 'rdaFiles'
symsyn <- read.table(file='symbol2synonym.txt',header=FALSE)
names(symsyn) <- c('symbol','synonyms')
source('functions.R', local = TRUE)
source('tools/intro.R', local = TRUE)
source('tools/gc.R', local = TRUE)
source('tools/qc.R', local = TRUE)
source('tools/altsplice.R', local = TRUE)
source('tools/dea.R', local = TRUE)
source('tools/gsea.R', local = TRUE)
#source('tools/fastqc.R', local = TRUE)
source('tools/gc_ui.R', local = TRUE)
source('tools/qc_ui.R', local = TRUE)
source('tools/dea_ui.R', local = TRUE)
source('tools/gsea_ui.R', local = TRUE)
#source('tools/fastqc_ui.R', local = TRUE)
source('tools/altsplice_ui.R', local = TRUE)
})
output$dir.splice <- renderUI({
datadir <- dir('/Data/bcantarel/trxomics/rdaFiles')
selectInput("dset.splice", "Choose Comparison", choices=datadir,selected='2012_sjia_monocyte.rda')
})
get.bgdata <- function(var) {
rdafile <- paste("/Data/bcantarel/trxomics/rdaFiles",var$dset.splice,sep='/')
genetrx <- read.table(file='/Data/bcantarel/trxomics/refdata/gene2trx.txt',sep='\t',header=TRUE)
geneid <- as.character(unique(genetrx[genetrx$SYMBOL %in% var$symsearch,]$g_id))
if (!is.null(var$enssearch) && var$enssearch != '') {
geneid <- as.character(unique(genetrx[genetrx$ENSEMBL %in% var$enssearch,]$g_id))
}
load(rdafile)
genebg <- subset(bg,paste0("gene_id=='",geneid,"'"))
transcript_gene_table = indexes(bg)$t2g
rownames(transcript_gene_table) <- transcriptNames(bg)
fpkm <- texpr(bg,meas='FPKM')
keep <- transcript_gene_table[transcript_gene_table$g_id == geneid,]$t_id
cttbl <- fpkm[keep,]
grps <- pData(bg)$group
trxnames <- transcript_gene_table[transcript_gene_table$g_id == geneid,]
dtm <- melt(t(cttbl))
names(dtm) <- c('sample','transcript','value')
dtm$grps <- rep(grps,length(keep))
if (length(keep) < 2) {
dtm$transcript <- rep(keep,length(keep))
}
dtm$grptrx <- paste(dtm$grps,dtm$transcript,sep='.')
test <- stattest(gown=genebg, pData=pData(bg), feature='transcript',covariate='group', libadjust=FALSE,getFC=TRUE)
if (length(keep) > 1) {
agg = collapseTranscripts(gene=geneid, gown=bg, k=var$kct, method='kmeans')
test <- stattest(gowntable=agg$tab, pData=pData(bg), feature='transcript_cluster',covariate='group', libadjust=FALSE,getFC=TRUE)
}
return(list(stattbl=test,gid=geneid,obj=genebg,cttbl=dtm,tname=trxnames))
}
find_sym <- function (sym) {
if (!(toupper(sym) %in% toupper(symsyn$symbol))) {
syns <- symsyn[grep(sym,symsyn$synonyms,ignore.case=TRUE),]$symbol
synlist <- paste(as.character(syns),collapse=',')
if (length(syns) > 1) {
paste("Please Use Official Gene Symbols",synlist,sep=':')
}else {"Please Use Official Gene Symbols"}
}else {NULL}
}
getgeneid <- eventReactive(input$altButton,{
if (input$symsearch == '' & input$enssearch != '') {
validate (find_sym(input$symsearch))
}
get.bgdata(input)
})
output$plot.cluster <- renderPlot({
par(oma=c(4,4,1,1))
gid <- getgeneid()$gid
bg <- getgeneid()$obj
tname <- getgeneid()$tname
if (nrow(tname) > 1) {
plotLatentTranscripts(gene=gid, gown=bg, k=input$kct, method='kmeans', returncluster=FALSE)
}
})
output$gene.stat <- DT::renderDataTable({
getgeneid()$stattbl
},escape=FALSE)
output$trx.name <- DT::renderDataTable({
getgeneid()$tname
},escape=FALSE)
output$plot.means <- renderPlot({
par(oma=c(4,4,1,1))
par(cex.main=0.75)
gid <- getgeneid()$gid
bg <- getgeneid()$obj
tname <- getgeneid()$tname
if (nrow(tname) > 1) {
plotMeans(gid, bg, groupvar='group', meas='cov', colorby='transcript')
}
}, height = 900, width = 900)
output$trx.gene <- renderPlot({
countTable <- getgeneid()$cttbl
par(oma=c(4,4,1,1))
p <- ggplot(countTable,aes(x=grptrx,y=log2(value+1))) +
geom_boxplot(aes(fill = factor(grptrx))) +
geom_jitter(height = 0) +
theme(legend.position="left",axis.text.x=element_text(angle=45,hjust=1, vjust=1),legend.key.height=unit(0.5,"line"),legend.text=element_text(size=8),legend.title=element_blank()) +
ylab("Relative Abundance (FPKM)") + xlab("")
print(p)
})
output$ui_altsplice <- renderUI({
fluidPage(
includeCSS("www/style.css"),
sidebarLayout(
sidebarPanel(
uiOutput("dir.splice"),
textInput("symsearch", "Search By Gene Symbol", 'IL1B'),
textInput("enssearch", "Search By ENS ID",''),
numericInput("kct",label = "Number of Clusters",value = 2),
actionButton("altButton", "GO")
),
mainPanel(
tabsetPanel(
tabPanel("Gene Compare",
dataTableOutput('gene.stat'),
dataTableOutput('trx.name'),
plotOutput("plot.cluster"),
plotOutput("trx.gene"),
plotOutput("plot.means")
)
)
)
)
)
})
# UI-elements for DEA
output$dir.dea <- renderUI({
datadir <- dir(data.dir)
selectInput("dset.dea", "Choose Dataset", choices=datadir,selected='E-GEOD-60424')
})
output$pick.dea <- renderUI({
flist <- list.files(paste(data.dir,input$dset.dea,'dea',sep='/'),pattern="\\.txt$")
selectInput("file1", "Choose Comparison Group 1", choices=flist)
})
output$pick.dea.comp <- renderUI({
flist <- list.files(paste(data.dir,input$dset.dea,'dea',sep='/'),pattern="\\.txt$")
selectInput("file2", "Choose Comparison Group 2", choices=flist)
})
get.data <- function(var) {
f1 <- paste(data.dir,var$dset.dea,'dea',var$file1,sep='/')
f2 <- paste(data.dir,var$dset.dea,'dea',var$file2,sep='/')
comp1 <- read.table(f1,header=TRUE,sep='\t')
comp2 <- read.table(f2,header=TRUE,sep='\t')
comp1.filt <- na.omit(comp1[abs(comp1$logFC) >= var$fc.thresh & comp1$rawP <= var$pval.thresh,])
comp2.filt <- na.omit(comp2[abs(comp2$logFC) >= var$fc.thresh & comp2$rawP <= var$pval.thresh,])
if (var$adjust == 'FDR') {
comp1.filt <- na.omit(comp1[abs(comp1$logFC) >= var$fc.thresh & comp1$fdr <= var$pval.thresh,])
comp2.filt <- na.omit(comp2[abs(comp2$logFC) >= var$fc.thresh & comp2$fdr <= var$pval.thresh,])
}
if (var$adjust == 'BONF') {
comp1.filt <- na.omit(comp1[abs(comp1$logFC) >= var$fc.thresh & comp1$bonf <= var$pval.thresh,])
comp2.filt <- na.omit(comp2[abs(comp2$logFC) >= var$fc.thresh & comp2$bonf <= var$pval.thresh,])
}
genelist <- list(Comparison1=comp1.filt$symbol,
Comparison2=comp2.filt$symbol)
return(list(pair1=comp1,pair2=comp2,filt1=comp1.filt,
filt2=comp2.filt,glist=genelist,file1=f1,file2=f2))
}
tbls <- eventReactive(input$deButton,{get.data(input)})
output$dge.c1 <- DT::renderDataTable({
t1 <- tbls()$filt1
t1$symbol <- paste("<a href=http://www.genecards.org/cgi-bin/carddisp.pl?gene=",t1$symbol,'>',t1$symbol,"</a>",sep='')
t1$ensembl <- paste("<a href=http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=",t1$ensembl,'>',t1$ensembl,"</a>",sep='')
t1
},escape=FALSE,filter = 'top',options = list(lengthMenu = c(10, 25, 50, 200, -1)))
output$dge.c2 <- DT::renderDataTable({
t1 <- tbls()$filt2
t1$symbol <- paste("<a href=http://www.genecards.org/cgi-bin/carddisp.pl?gene=",t1$symbol,'>',t1$symbol,"</a>",sep='')
t1$ensembl <- paste("<a href=http://www.ensembl.org/Homo_sapiens/Gene/Summary?g=",t1$ensembl,'>',t1$ensembl,"</a>",sep='')
t1
},escape=FALSE,filter='top',options = list(lengthMenu = c(10, 25, 50, 200, -1)))
output$downloadC1 <- downloadHandler(
# filename is a function so that input$file1 is read when the download is requested
filename = function() {
paste(input$file1,".filt.txt",sep="")
},
content = function(file) {
write.table(tbls()$filt1,file,quote=FALSE,row.names=FALSE,sep='\t')
})
output$downloadC2 <- downloadHandler(
filename = function() {
paste(input$file2,".filt.txt",sep="")
},
content = function(file) {
write.table(tbls()$filt2,file,quote=FALSE,row.names=FALSE,sep='\t')
})
output$plot.pvalues <- renderPlot({
comp1 <- tbls()$pair1
comp2 <- tbls()$pair2
par(mfrow=c(3,2))
hist(na.omit(comp1$rawP),main='Raw Pvalues')
hist(na.omit(comp2$rawP),main='Raw Pvalues')
hist(na.omit(comp1$fdr),main='FDR')
hist(na.omit(comp2$fdr),main='FDR')
hist(na.omit(comp1$bonf),main='Bonferroni')
hist(na.omit(comp2$bonf),main='Bonferroni')
})
output$plot.fc <- renderPlot({