Commit 94fd351b authored by David Trudgian

Initial template workflow

2016-02-29
----------
Initial Version.
Copyright © 2016. The University of Texas Southwestern Medical Center
5323 Harry Hines Boulevard Dallas, Texas, 75390 Telephone 214-648-3111
# Astrocyte Example Workflow Package
This is an example workflow package for the BioHPC Astrocyte workflow engine.
Astrocyte is a system allowing workflows to be run easily from the web in a
push-button manner, taking advantage of the BioHPC compute cluster. Astrocyte
allows users to access this workflow package using a simple web interface,
created automatically from the definitions in this package.
## This Example Package
This workflow package provides:
1) A sample ChIP-Seq data analysis workflow, which uses BWA to align reads to
a reference genome, and MACS to call peaks. The workflow is written in the
[*Nextflow*](http://www.nextflow.io) workflow language. *Nextflow* is a
simple yet powerful workflow scripting language based on the *Groovy*
scripting language. It supports advanced features such as implicit
parallelization on the cluster - Nextflow will launch concurrent jobs for
each input file.
2) A sample *Shiny* visualization app, which provides a web-based tool for
visualizing results. *Shiny* is a framework to provide web interfaces to
data and analysis implemented in the *R* statistical language. *R* is a
powerful language for manipulating and interrogating data, and *Shiny* allows
analysis in R to be presented simply and easily as a web application.
3) Meta-data describing the workflow, its inputs, outputs, etc. The Astrocyte
web application and command-line runner use this meta-data to understand the
workflow, what input it needs, how the documentation is arranged, etc.
4) User-focused documentation, in *markdown* format, that will be displayed to
users in the Astrocyte web interface. Markdown is a simple plain-text based
syntax which is especially suited for writing documentation that will be
displayed on the web.
5) Developer-focused documentation, in this file - `README.md`. This
documentation should summarize features of the workflow package that are of
interest to anyone who would want to extend it, or use it as a template for
their own work.
## Workflow Package Layout
Workflow packages for Astrocyte are Git repositories, and have a common layout
which must be followed so that Astrocyte understands how to present them to
users. The folder structure, and names of key files listed below should not be
changed. Although a workflow package with a modified structure may work, it is
not guaranteed to be accepted by future versions of Astrocyte.
The following structure of files and directories is always present:
```
- docs/
    index.md
- test_data/
- vizapp/
    server.R
    ui.R
- workflow/
    - lib/
    - output/
    - scripts/
    main.nf
astrocyte_pkg.yml
CHANGES.md
LICENSE.md
README.md
```
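As a rough illustration, the layout above could be checked with a small shell function before a package is committed. This is only a sketch and is not part of Astrocyte: the function name `check_pkg_layout` is hypothetical, and it assumes that only the entries not marked optional in this README (so not `vizapp/` or `LICENSE.md`) are required.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not an Astrocyte tool): report entries missing from a
# workflow package directory. Prints one line per missing entry and returns
# non-zero if anything is missing.
check_pkg_layout() {
    local pkg="$1" missing=0 entry
    # Directories from the layout (vizapp/ is optional, so it is not checked).
    for entry in docs test_data workflow workflow/lib workflow/output workflow/scripts; do
        if [ ! -d "$pkg/$entry" ]; then
            echo "missing directory: $entry"
            missing=1
        fi
    done
    # Files from the layout (LICENSE.md is optional, so it is not checked).
    for entry in docs/index.md workflow/main.nf astrocyte_pkg.yml CHANGES.md README.md; do
        if [ ! -f "$pkg/$entry" ]; then
            echo "missing file: $entry"
            missing=1
        fi
    done
    return "$missing"
}
```

For example, `check_pkg_layout astrocyte_example_chipseq` would print nothing and return zero for a complete package.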
### Meta-Data
* `astrocyte_pkg.yml` - A file in the root directory of the package, which
contains the metadata describing the workflow in a human- and machine-readable
text format called *YAML*. This includes information about the workflow package
such as its name, synopsis, input parameters, outputs, etc.
See the documentation inside the example `astrocyte_pkg.yml` file for a
guide to specifying Astrocyte metadata.
### The Workflow
* `workflow/main.nf` - A *Nextflow* workflow file, which will be run by
Astrocyte using parameters provided by the user.
* `workflow/scripts` - A directory for any scripts (e.g. bash, python,
ruby scripts) that the `main.nf` workflow will call. This might be empty if
the workflow is implemented entirely in Nextflow. You should *not* include
large pieces of software here. Workflows should be designed to use *modules*
available on the BioHPC cluster. The modules a workflow needs will be defined
in the `astrocyte_pkg.yml` metadata file.
* `workflow/lib` - A directory for any Nextflow/Groovy libraries that might be
included by workflows using advanced features. Usually empty for simpler
workflows.
* `workflow/output` - An empty directory, into which any final output of the
workflow should be published using the `publishDir "$baseDir/output", mode: 'copy'`
directive inside a process.
To learn about the *Nextflow* language, take a look at this and other example
workflows, and refer to the [nextflow.io](http://www.nextflow.io) website.
Nextflow workflows used in an Astrocyte package must be written in a certain
way, with specific rules so that Astrocyte can run them successfully on the
cluster. See the *Workflow Requirements* section below for details.
### The Visualization App *(Optional)*
* `vizapp/` - A directory that will contain an *R Shiny* visualization app, if
required. The visualization app will be made available to the user via the
Astrocyte web interface. At minimum the directory requires the standard Shiny
`ui.R` and `server.R` files. The exact Shiny app structure is not
prescribed. Any R packages required by the Shiny app will be listed in the
`astrocyte_pkg.yml` metadata.
Shiny apps used in an Astrocyte package must be written in a certain
way, with specific rules so that Astrocyte can run them successfully, and find
data files to visualize. See the *Vizapp Requirements* section below for
details.
### User Documentation
* `docs/index.md` - The first page of user documentation, in *markdown*
format. Astrocyte will display this documentation to users of the workflow
package.
* `docs/...` - Any other documentation files. *Markdown* `.md` files will be
rendered for display on the web. Any images used in the documentation should
also be placed here.
### Developer Documentation
* `README.md` - Documentation for developers of the workflow giving a brief
overview and any important notes that are not for workflow users.
* `LICENSE.md` *(Optional)* - The license applied to the workflow package.
* `CHANGES.md` - A brief summary of changes made through time to the workflow.
### Testing
* `test_data/` - Every workflow package should include a minimal set of test
data that allows the workflow to be run, testing its features. The
`test_data/` directory is a location for test data files. Test data should be
kept as small as possible. If large datasets (over 20MB total) are unavoidable,
provide a `fetch_test_data.sh` bash script which obtains the data from an
external source.
* `test_data/fetch_test_data.sh` *(Optional)* - A bash script that fetches large
test data from an external source, placing it into the `test_data/` directory.
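A minimal sketch of what such a `fetch_test_data.sh` might contain is shown below. The function name `fetch_if_missing` and the example URL are hypothetical placeholders, and the sketch assumes `curl` is available on the system where the script runs.

```bash
#!/usr/bin/env bash
# Sketch of test_data/fetch_test_data.sh. The function name and the URL in
# the example call are placeholders, not real Astrocyte conventions.
fetch_if_missing() {
    local url="$1" dest="$2"
    # Skip the download when a non-empty copy is already in place.
    if [ -s "$dest" ]; then
        echo "already present: $dest"
        return 0
    fi
    # -f: fail on HTTP errors, -L: follow redirects, -sS: quiet but show errors
    curl -fsSL -o "$dest" "$url"
}

# Example call, commented out because the URL is hypothetical:
# fetch_if_missing "https://example.org/chipseq/sample.fastq" "$(dirname "$0")/sample.fastq"
```

Skipping files that already exist keeps repeated test runs from downloading the same data again.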
# Workflow Requirements & Testing
So that Astrocyte can successfully run a workflow for any BioHPC user, and
make efficient use of the Nucleus compute cluster, the Nextflow workflow must
be written according to some rules.
## Astrocyte / Nextflow Basics
A Nextflow workflow run by Astrocyte must be in a file named `workflow/main.nf`
within the workflow package. Do not use other filenames for your workflow.
When a workflow runs on the Astrocyte platform the work area will be created
dynamically and its path will not be known in advance. The `$baseDir`
variable can be used inside the Nextflow workflow to refer to this path.
Data files for analysis will be uploaded or linked into Astrocyte by users.
Workflows cannot access input data directly. **All input file names must be
accepted as workflow parameters**. Astrocyte will allow users to select input
files, and pass the parameter values to your workflow.
Output files for the user should be published to `$baseDir/output` using the Nextflow
directive `publishDir "$baseDir/output", mode: 'copy'` in a process block.
You can create a directory structure inside `$baseDir/output` if required to
organize the output files. Note that we use the 'copy' mode so that Nextflow's
work directories can be cleaned up by Astrocyte. By default Nextflow will link
published files from work directories, so output would be lost on cleanup if
`mode: 'copy'` is not used.
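The difference between linking and copying can be seen with a small shell experiment. This is only an illustration using temporary, made-up paths; it does not run Nextflow, and it simulates the cleanup of a work directory with `rm -rf`.

```bash
#!/usr/bin/env bash
# Sketch: a linked output is lost when the work directory is removed, while a
# copied output survives. All paths here are temporary and hypothetical.
demo=$(mktemp -d)
mkdir -p "$demo/work" "$demo/output"
echo "peak data" > "$demo/work/peaks.bed"

# Publish the same file once as a symbolic link and once as a real copy.
ln -s "$demo/work/peaks.bed" "$demo/output/linked.bed"
cp "$demo/work/peaks.bed" "$demo/output/copied.bed"

# Simulate cleanup of the work directory.
rm -rf "$demo/work"

# The -e test follows symbolic links, so it is false for a dangling link.
[ -e "$demo/output/linked.bed" ] && linked=present || linked=lost
[ -e "$demo/output/copied.bed" ] && copied=present || copied=lost
echo "linked: $linked, copied: $copied"

rm -rf "$demo"
```

After the simulated cleanup, only the copied file can still be read.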
Reference data paths, even in permanent locations such as
`/project/apps_database/iGenomes`, should not be hard-coded into workflows.
Provide parameters in your workflow for specifying reference files, and then
specify possible choices for those parameters in the `astrocyte_pkg.yml`
metadata file. If you need to make reference data available for a workflow,
please ask the BioHPC team to place it in the central `/project/apps_database`
area.
The `workflow/scripts` directory in a package is reserved for any scripts
(e.g. Perl, Python, or Bash scripts) that implement processing you don't want
to write or re-write in Nextflow. They should be called from `main.nf` as a
Nextflow *process*. The path to the scripts directory will be `$baseDir/scripts`
and this should be used instead of other relative or absolute paths within your
workflow. Don't put large, or even small applications here. Please use cluster
modules within your workflow, and ask BioHPC to install software as a module if
you need it.
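As an illustration of the kind of small helper that belongs in `workflow/scripts`, the sketch below counts the reads in a FASTQ file. The file name `count_reads.sh` and the function `count_reads` are hypothetical, and the script assumes an uncompressed FASTQ file with exactly four lines per record.

```bash
#!/usr/bin/env bash
# Hypothetical helper, e.g. workflow/scripts/count_reads.sh.
# Counts records in an uncompressed FASTQ file, assuming 4 lines per record.
count_reads() {
    local fastq="$1" lines
    lines=$(wc -l < "$fastq")
    echo $(( lines / 4 ))
}

# When run as a script, take the FASTQ path as the first argument.
if [ "$#" -ge 1 ]; then
    count_reads "$1"
fi
```

Inside a Nextflow process, such a script would be invoked through the scripts path described above, for example `bash $baseDir/scripts/count_reads.sh $fastq_file`.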
## Optimizations
To make sure that your workflow can be run efficiently on the BioHPC cluster
please:
* **Do** specify cpu and memory requirements for Nextflow processes using the
`cpus` and `memory` directives. Nextflow will use this information to schedule
tasks and complete a job as quickly as possible.
* **Do** split complex sections of a workflow into multiple Nextflow processes
so that they can be parallelized by the system.
* **Do** use software modules from the cluster, and specify exact versions for
modules when loading them in your workflow.
* **Do** thoroughly check the execution of your workflow using the command line
runner, before attempting to bring it into the Astrocyte web application.
* **Don't** use absolute paths for any files - see above.
* **Don't** work with any files outside of `$baseDir`, except permanent
reference data administered by BioHPC.
# Vizapp / Shiny Requirements
The visualization app will have access to any final output that was published
to the `$baseDir/output` location in the Nextflow workflow. This path will be
accessible as `Sys.getenv('outputDir')`.
Parameters specified when the workflow was launched will also be available as
environment variables - their name prefixed by `param-` e.g. the workflow
parameter `fastqs` will be available in R as `Sys.getenv('param-fastqs')`.
* **Don't** try to access any files outside of `outputDir` except permanent
reference data administered by BioHPC.
* **Do** list any CRAN or Bioconductor packages needed by the vizapp in the
`astrocyte_pkg.yml` metadata file.
* **Don't** do heavy processing in the vizapp. Shiny apps in Astrocyte share
resources, and are intended for basic visualization. You may wish to provide
instructions to users in your workflow package documentation directing them
to RStudio for follow-up work. Moderate I/O (e.g. scanning a large reference
file) is acceptable, as BioHPC systems have access to high performance
storage.
# Testing/Running the Workflow with the Astrocyte CLI
**Work in Progress - CLI not yet available on the cluster**
Workflows will usually be run from the Astrocyte web interface, by importing the
workflow package repository and making it available to users. During development,
you can use the Astrocyte CLI scripts to check, test, and run your workflow
against non-test data.
First load the `astrocyte` module on a BioHPC system.
To check the structure and syntax of the workflow package in the directory
`astrocyte_example_chipseq`:
```bash
$ astrocyte_cli check astrocyte_example_chipseq
```
To launch the workflow's defined tests against the included test data:
```bash
$ astrocyte_cli test astrocyte_example_chipseq
```
To run the workflow using specific data and parameters (a working directory will
be created):
```bash
$ astrocyte_cli run astrocyte_example_chipseq --parameter1 "value1" --parameter2 "value2"...
```
To run the Shiny visualization app against the test data:
```bash
$ astrocyte_cli shinytest astrocyte_example_chipseq
```
To run the Shiny visualization app against output from `astrocyte_cli run`,
which will be in the work directory created by `run`:
```bash
$ astrocyte_cli shiny astrocyte_example_chipseq
```
To generate the user-facing documentation for the workflow and display it in a
web browser:
```bash
$ astrocyte_cli docs astrocyte_example_chipseq
```
#
# metadata for the example astrocyte ChipSeq workflow package
#
# -----------------------------------------------------------------------------
# BASIC INFORMATION
# -----------------------------------------------------------------------------
# A unique identifier for the workflow package, text/underscores only
name: 'astrocyte_example'
# Who wrote this?
author: 'David Trudgian'
# A contact email address for questions
email: 'biohpc-help@utsouthwestern.edu'
# A more informative title for the workflow package
title: 'Astrocyte Example ChIPSeq Workflow'
# A summary of the workflow package in plain text
description: |
  This is an example workflow package for the BioHPC astrocyte workflow system.
  It implements a simple ChIPSeq analysis workflow using BWA and MACS, plus a
  simple R Shiny visualization application.
# -----------------------------------------------------------------------------
# DOCUMENTATION
# -----------------------------------------------------------------------------
# A list of documentation files in .md format that should be viewable from the
# web interface. These files are in the 'docs' subdirectory. The first file
# listed will be used as a documentation index and is index.md by convention
documentation_files:
- 'index.md'
# -----------------------------------------------------------------------------
# NEXTFLOW WORKFLOW CONFIGURATION
# -----------------------------------------------------------------------------
# Remember - The workflow file is always named 'workflow/main.nf'
# The workflow must publish all final output into $baseDir/output
# A list of cluster environment modules that this workflow requires to run.
# Specify versioned module names to ensure reproducibility.
workflow_modules:
- 'BWA/0.7.5'
- 'picard/1.127'
- 'macs/1.4.2'
# A list of parameters used by the workflow, defining how to present them,
# options etc in the web interface. For each parameter:
#
# REQUIRED INFORMATION
# id: The name of the parameter in the NEXTFLOW workflow
# type: The type of the parameter, one of:
# string - A free-format string
# integer - An integer
# real - A real number
# file - A single file from user data
# files - One or more files from user data
# select - A selection from a list of values
# required: true/false, must the parameter be entered/chosen?
# description: A user friendly description of the meaning of the parameter
#
# OPTIONAL INFORMATION
# default: A default value for the parameter (optional)
# min: Minimum value/characters/files for number/string/files types
# max: Maximum value/characters/files for number/string/files types
# regex: A regular expression that describes valid entries / filenames
#
# SELECT TYPE
# choices: A set of choices presented to the user for the parameter.
# Each choice is a pair of value and description, e.g.
#
# choices:
# - [ 'myval1', 'The first option']
# - [ 'myval2', 'The second option']
#
# NOTE - All parameters are passed to NEXTFLOW as strings... but they
# are validated by astrocyte using the information provided above
workflow_parameters:
  - id: fastq
    type: files
    required: true
    description: |
      One or more input FASTQ files from a ChIPSeq experiment
    regex: ".*(fastq|fq)"
    min: 1
  - id: index
    type: select
    choices:
      - [ '/project/apps_database/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa', 'UCSC hg19']
      - [ '/project/apps_database/iGenomes/Homo_sapiens/UCSC/hg18/Sequence/BWAIndex/genome.fa', 'UCSC hg18']
    required: true
    description: |
      Reference genome for BWA alignment
# -----------------------------------------------------------------------------
# SHINY APP CONFIGURATION
# -----------------------------------------------------------------------------
# Remember - The vizapp is always 'vizapp/server.R' and 'vizapp/ui.R'
# The workflow must publish all final output into $baseDir/output
# Name of the R module that the vizapp will run against
vizapp_r_module: 'R/3.2.1-Intel'
# List of any CRAN packages, not provided by the modules, that must be made
# available to the vizapp
vizapp_cran_packages:
- shiny
- shinyFiles
# List of any Bioconductor packages, not provided by the modules, that must be made
# available to the vizapp
vizapp_bioc_packages:
- chipseq
# Astrocyte ChIPSeq Example
This workflow carries out a simple ChIPSeq alignment and peak calling using
*BWA* and *MACS 1.4*. One or more FASTQ files containing reads from a ChIPSeq
experiment can be selected as input. For each file this workflow:
1. Aligns the reads to a selected genomic reference using BWA aln.
2. Converts BWA's native output into SAM format.
3. Sorts and indexes the SAM file, and converts into binary BAM format using
Picard.
4. Performs ChIPSeq peak calling using MACS 1.4, with simple `--nomodel` and
`--single-profile` options. Wig files are produced as well as standard
spreadsheet output.
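The sketch below lists the output file names these steps would be expected to publish for a single input file. It is not part of the workflow: the function name `expected_outputs` is hypothetical, and the naming patterns are inferred from the `output:` declarations in `workflow/main.nf`, so they may change if that file changes.

```bash
#!/usr/bin/env bash
# Sketch: print the published file names expected for one input FASTQ,
# following the naming patterns declared in workflow/main.nf.
# The function name is hypothetical.
expected_outputs() {
    local fastq_name sam bam
    fastq_name=$(basename "$1")
    sam="${fastq_name}.sam.gz"   # from the bwa_samse process
    bam="${sam}.bam"             # from the sam2bam process
    echo "$sam"
    echo "$bam"
    # Outputs declared by the macs14 process
    echo "${bam}_bwa_nomodel_MACS_wiggle"
    echo "${bam}_bwa_nomodel_peaks.bed"
    echo "${bam}_bwa_nomodel_peaks.xls"
    echo "${bam}_bwa_nomodel_summits.bed"
}
```

For example, `expected_outputs chip.fastq` lists six names, all starting with `chip.fastq.sam.gz`.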
## Workflow Parameters
* **fastq** - Choose one or more ChIPSeq read files to process. All should be
ChIP files - i.e. there is no control. Each file will be processed as an
independent sample.
* **index** - Choose a genomic index to use as a reference for alignment of
ChIPSeq reads. A variety of options are available for human and murine
samples.
## Visualization App
The example visualization app demonstrates integration of Shiny into Astrocyte
by implementing a simple file chooser that accesses the output of the workflow.
## Test Data
The test data directory of this workflow package includes a subset of reads from
Chr19 for a CTCF ChIP in a G1E cell line.
The data was originally made available as example data for the Galaxy ChIP-Seq
exercises at [https://usegalaxy.org/u/james/p/exercise-chip-seq](https://usegalaxy.org/u/james/p/exercise-chip-seq).
## Credits
This example workflow is derived from original scripts kindly contributed by the
Xu lab, Children's Research Institute at UT Southwestern.
# This example implements a simple file browser for accessing results.
library(shiny)
library(shinyFiles)
# Results are available in the directory specified by the outputDir environment
# variable, read by Sys.getenv
rootdir <- Sys.getenv('outputDir')
shinyServer(function(input, output, session) {
  # The backend for a simple file chooser, restricted to the
  # rootdir we obtained above.
  # See https://github.com/thomasp85/shinyFiles
  shinyFileChoose(input, 'files', roots=c('workflow'=rootdir), filetypes=c('', 'bed', 'xls','wig'), session=session)
})
library(shiny)
library(shinyFiles)
shinyUI(fluidPage(
  verticalLayout(
    # Application title
    titlePanel("Astrocyte Example"),
    wellPanel(
      helpText("This is a minimal example, demonstrating how
        a Shiny visualization application can access the output of a workflow.
        Here we provide a file browser using the shinyFiles package. Real
        Astrocyte vizapps would provide custom methods to access and visualize
        output."),
      helpText("The workflow output is in the directory set in the
        outputDir environment variable. This can be retrieved in R with the
        command Sys.getenv('outputDir')"),
      # A simple file browser within the workflow output directory
      # See https://github.com/thomasp85/shinyFiles
      shinyFilesButton('files', label='Browse workflow output', title='Please select a file', multiple=FALSE)
    )
  )
))
/*
* Copyright (c) 2016. The University of Texas Southwestern Medical Center
*
* This file is part of the BioHPC Workflow Platform
*
* Example ChIP-Seq analysis script, demonstrating the BioHPC Workflow Platform
*
* @authors
* David Trudgian <David.Trudgian@UTSouthwestern.edu>
*
*/
// Path to an input file, or a pattern for multiple inputs
// Note - $baseDir is the location of this workflow file main.nf
params.fastq = "$baseDir/../test_data/*.fastq"
// Path to the BWA Index (.fa file) that we are using for the analysis
params.index = "/project/apps_database/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa"
// First, get the list of fastqs; the parameter may be a pattern matching multiple files
fastqs = Channel.fromPath(params.fastq)
// Now find the path to the BWA index directory
index_path = file(params.index).parent
// And get the name of the actual index inside that directory
index_name = file(params.index).name
// bwa_aln
// Run BWA aln on a fastq file, to produce sai output
//
// Input - fastq_file is taken from the fastq channel
// - BWA index at $index_path/$index_name
//
// Output - pair of fastq & generated sai file into the alignments channel
process bwa_aln {
    // Tell Nextflow we will use 32 cpus here for BWA
    cpus 32
    input:
    file fastq_file from fastqs
    output:
    set file(fastq_file), file("${fastq_file.name}.sai") into alignments
    """
    module load BWA/0.7.5
    bwa aln $index_path/$index_name -t 32 $fastq_file > "${fastq_file.name}.sai"
    """
}
// bwa_samse
// Run bwa samse to produce sam.gz from an sai alignment
//
// Input - pair of fastq file and corresponding sai file, from alignments channel
//
// Output - .sam.gz into the samfiles channel, and baseDir/output
process bwa_samse {
    // bwa samse will use a single cpu core
    cpus 1
    // Publish the outputs we create here into the workflow output directory
    publishDir "$baseDir/output", mode: 'copy'
    input:
    set file(fastq_file), file(sai_file) from alignments
    output:
    file "${fastq_file.name}.sam.gz" into samfiles
    """
    module load BWA/0.7.5
    bwa samse -r "@RG\tID:${fastq_file.name}\tLB:${fastq_file.name}\tSM:${fastq_file.name}\tPL:ILLUMINA" $index_path/$index_name \
        $sai_file $fastq_file | gzip > "${fastq_file.name}.sam.gz"
    """
}
// sam2bam
// Convert SAM file to BAM file, sorting by co-ordinate and indexing
//
// Input - a sam file (possibly gzipped) from the samfiles channel
//
// Output - .bam into the bamfiles channel, and baseDir/output
process sam2bam {
    // Tell Nextflow picard will only use one cpu.
    // We are allocating 32GB to java though, so tell
    // Nextflow so it can assign the task appropriately.
    cpus 1
    memory '32GB'
    // Publish the outputs we create here into the workflow output directory
    publishDir "$baseDir/output", mode: 'copy'
    input:
    file sam_file from samfiles
    output:
    file "${sam_file.name}.bam" into bamfiles
    """
    module add picard/1.127
    java -Xmx32G -jar \$PICARD/picard.jar SortSam \
        INPUT="${sam_file}" \
        OUTPUT="${sam_file.name}.bam" \
        SORT_ORDER=coordinate \
        VALIDATION_STRINGENCY=LENIENT \
        CREATE_INDEX=true
    """
}
// macs
// Peak calling on a bam using MACS 1.4
//
// Input - a bam file, from the bamfiles channel
//
// Output - various wig and bed into baseDir/output
process macs14 {
    // Publish the outputs we create here into the workflow output directory
    publishDir "$baseDir/output", mode: 'copy'
    input:
    file bam_file from bamfiles
    output:
    file "${bam_file}_bwa_nomodel_MACS_wiggle"
    file "${bam_file}_bwa_nomodel_peaks.bed"
    file "${bam_file}_bwa_nomodel_peaks.xls"
    file "${bam_file}_bwa_nomodel_summits.bed"
    """
    module add macs/1.4.2
    macs14 -t ${bam_file} \
        --name ${bam_file}_bwa_nomodel \
        --nomodel \
        --wig \