Initial template workflow

Commit 94fd351b authored 9 years ago by David Trudgian
--- a/.idea/workspace.xml
+++ b/.idea/workspace.xml
--- a/CHANGES.md
+++ b/CHANGES.md
+2016-02-29
+----------
+
+Initial Version.
--- a/LICENSE.md
+++ b/LICENSE.md
+Copyright © 2016. The University of Texas Southwestern Medical Center
+5323 Harry Hines Boulevard Dallas, Texas, 75390 Telephone 214-648-3111
\ No newline at end of file
--- a/README.md
+++ b/README.md
+# Astrocyte Example Workflow Package
+
+This is an example workflow package for the BioHPC Astrocyte workflow engine.
+Astrocyte is a system allowing workflows to be run easily from the web in a
+push-button manner, taking advantage of the BioHPC compute cluster. Astrocyte
+allows users to access this workflow package using a simple web interface,
+created automatically from the definitions in this package.
+
+## This Example Package
+
+This workflow package provides:
+
+  1) A sample ChIP-Seq data analysis workflow, which uses BWA to align reads to
+   a reference genome, and MACS to call peaks. The workflow is written in the
+   [*Nextflow*](http://www.nextflow.io) workflow language. *Nextflow* is a
+   simple yet powerful workflow scripting language based on the *Groovy*
+   scripting language. It supports advanced features such as implicit
+   parallelization on the cluster - Nextflow will launch concurrent jobs for
+   each input file. 
+  
+  2) A sample *Shiny* visualization app, which provides a web-based tool for
+  visualizing results. *Shiny* is a framework to provide web interfaces to
+  data and analysis implemented in the *R* statistical language. *R* is a
+  powerful language for manipulating and interrogating data, and *Shiny* allows
+  analysis in R to be presented simply and easily as a web application.
+
+  3) Meta-data describing the workflow, its inputs, outputs, etc. The Astrocyte
+  web application and command-line runner use this meta-data to understand the
+  workflow, what input it needs, how the documentation is arranged, etc.
+  
+  4) User-focused documentation, in *markdown* format, that will be displayed to
+  users in the Astrocyte web interface. Markdown is a simple plain-text based
+  syntax which is especially suited for writing documentation that will be
+  displayed on the web.
+  
+  5) Developer-focused documentation, in this file - `README.md`. This
+  documentation should summarize features of the workflow package that are of
+  interest to anyone who would want to extend it, or use it as a template for
+  their own work.
+  
+## Workflow Package Layout
+
+Workflow packages for Astrocyte are Git repositories, and have a common layout
+which must be followed so that Astrocyte understands how to present them to
+users. The folder structure, and names of key files listed below should not be
+changed. Although a workflow package with a modified structure may work, it is
+not guaranteed to be accepted by future versions of Astrocyte.
+
+The following structure of files and directories is always present:
+
+```
+
+   - docs/
+       index.md
+   - test_data/ 
+   - vizapp/
+       server.R
+       ui.R
+   - workflow/
+       - lib/
+       - output/
+       - scripts/
+       main.nf
+   astrocyte_pkg.yml
+   CHANGES.md
+   LICENSE.md
+   README.md  
+
+```
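The required entries in the layout above can be checked mechanically before a package is shared. The following is a minimal sketch, not the behaviour of `astrocyte_cli check`, which is not described here. The function name `check_layout` is hypothetical, and `vizapp/` and `LICENSE.md` are skipped because this README marks them as optional.

```bash
# Sketch only: verifies the files and directories listed in the README layout.
# check_layout is a hypothetical helper, not part of Astrocyte.
check_layout() {
    pkg="$1"
    status=0

    # Files that the README lists as always present
    for f in astrocyte_pkg.yml README.md CHANGES.md docs/index.md workflow/main.nf; do
        if [ ! -f "$pkg/$f" ]; then
            echo "missing file: $f"
            status=1
        fi
    done

    # Directories that the README lists as always present
    for d in docs test_data workflow/lib workflow/output workflow/scripts; do
        if [ ! -d "$pkg/$d" ]; then
            echo "missing directory: $d"
            status=1
        fi
    done

    return "$status"
}
```

A call such as `check_layout astrocyte_example_chipseq && echo "layout OK"` reports each missing entry and returns a non-zero status if anything is absent.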
+
+
+### Meta-Data
+
+  * `astrocyte_pkg.yml` - A file in the root directory of the package, which 
+  contains the metadata describing the workflow in a human- and machine-readable
+  text format called *YAML*. This includes information about the workflow package
+  such as its name, synopsis, input parameters, outputs, etc.
+
+  See the documentation inside the example `astrocyte_pkg.yml` file for a
+  guide to specifying Astrocyte metadata.
+
+
+### The Workflow
+  
+  * `workflow/main.nf` - A *Nextflow* workflow file, which will be run by
+  Astrocyte using parameters provided by the user.
+  * `workflow/scripts` - A directory for any scripts (e.g. bash, python, 
+  ruby scripts) that the `main.nf` workflow will call. This might be empty if
+  the workflow is implemented entirely in Nextflow. You should *not* include
+  large pieces of software here. Workflows should be designed to use *modules*
+  available on the BioHPC cluster. The modules a workflow needs will be defined
+  in the `astrocyte_pkg.yml` metadata file.
+  * `workflow/lib` - A directory for any Nextflow/Groovy libraries that might be
+  included by workflows using advanced features. Usually empty for simpler
+  workflows.
+  * `workflow/output` - An empty directory, into which any final output of the
+  workflow should be published using the `publishDir "$baseDir/output", mode: 'copy'`
+  directive inside a process.
+
+  To learn about the *Nextflow* language, take a look at this and other example
+  workflows, and refer to the [nextflow.io](http://www.nextflow.io) website.
+
+  Nextflow workflows used in an Astrocyte package must be written in a certain
+  way, with specific rules so that Astrocyte can run them successfully on the
+  cluster. See the *Workflow Requirements* section below for details.
+
+
+### The Visualization App *(Optional)*
+
+  * `vizapp/` - A directory that will contain an *R Shiny* visualization app, if
+   required. The visualization app will be made available to the user via the
+   Astrocyte web interface. At minimum the directory requires the standard Shiny
+  `ui.R` and `server.R` files. The exact Shiny app structure is not 
+  prescribed. Any R packages required by the Shiny app will be listed in the
+  `astrocyte_pkg.yml` metadata.
+
+  Shiny apps used in an Astrocyte package must be written in a certain
+  way, with specific rules so that Astrocyte can run them successfully, and find
+  data files to visualize. See the *Vizapp Requirements* section below for
+  details.
+
+
+### User Documentation 
+
+  * `docs/index.md` - The first page of user documentation, in *markdown*
+  format. Astrocyte will display this documentation to users of the workflow
+  package.
+
+  * `docs/...` - Any other documentation files. *Markdown* `.md` files will be
+  rendered for display on the web. Any images used in the documentation should
+  also be placed here.
+
+  
+### Developer Documentation
+
+  * `README.md` - Documentation for developers of the workflow giving a brief
+  overview and any important notes that are not for workflow users.
+  * `LICENSE.md` *(Optional)* - The license applied to the workflow package.
+  * `CHANGES.md` - A brief summary of changes made through time to the workflow.
+
+### Testing
+
+  * `test_data/` - Every workflow package should include a minimal set of test
+  data that allows the workflow to be run, testing its features. The
+  `test_data/` directory is a location for test data files. Test data should be
+  kept as small as possible. If large datasets (over 20MB total) are unavoidable,
+  provide a `fetch_test_data.sh` bash script which obtains the data from an
+  external source.
+  * `test_data/fetch_test_data.sh` *(Optional)* - A bash script that fetches large
+  test data from an external source, placing it into the `test_data/` directory.
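One way to decide whether `fetch_test_data.sh` is needed is to measure the directory against the 20MB guideline. This is a minimal sketch, assuming a POSIX `du`. The helper name `test_data_within_limit` is hypothetical, and `du` reports disk usage, which can differ slightly from the sum of the file sizes.

```bash
# Hypothetical helper: succeeds only if the directory uses at most 20 MB on disk.
test_data_within_limit() {
    dir="$1"
    limit_kb=$((20 * 1024))
    # du -sk prints "<kilobytes><TAB><path>"; keep the first field
    used_kb=$(du -sk "$dir" | cut -f1)
    [ "$used_kb" -le "$limit_kb" ]
}
```

For example, `test_data_within_limit test_data || echo "consider fetch_test_data.sh"` prints the reminder only when the directory is over the limit.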
+
+
+# Workflow Requirements & Testing
+
+So that Astrocyte can successfully run a workflow for any BioHPC user, and
+make efficient use of the Nucleus compute cluster, the Nextflow workflow must
+be written according to some rules.
+
+## Astrocyte / Nextflow Basics
+
+A Nextflow workflow run by Astrocyte must be in a file named `workflow/main.nf`
+within the workflow package. Do not use other filenames for your workflow.
+
+When a workflow runs on the Astrocyte platform the work area will be created
+dynamically and its path will not be known in advance. The `$baseDir`
+variable can be used inside the Nextflow workflow to refer to this path.
+
+Data files for analysis will be uploaded or linked into Astrocyte by users. 
+Workflows cannot access input data directly. **All input file names must be
+accepted as workflow parameters**. Astrocyte will allow users to select input
+files, and pass the parameter values to your workflow. 
+
+Output files for the user should be published to `$baseDir/output` using the
+Nextflow directive `publishDir "$baseDir/output", mode: 'copy'` in a process
+block. You can create a directory structure inside `$baseDir/output` if required
+to organize the output files. Note that we use the `copy` mode so that Nextflow's
+work directories can be cleaned up by Astrocyte. By default Nextflow links
+published files from work directories, so output would be lost on cleanup if
+`mode: 'copy'` is not used.
+
+Reference data paths, even in permanent locations such as 
+`/project/apps_database/iGenomes`, should not be hard-coded into workflows.
+Provide parameters in your workflow for specifying reference files, and then
+specify possible choices for those parameters in the `astrocyte_pkg.yml` 
+metadata file. If you need to make reference data available for a workflow,
+please ask the BioHPC team to place it in the central `/project/apps_database`
+area.
+
+The `workflow/scripts` directory in a package is reserved for any scripts, 
+e.g. Perl, Python, Bash scripts which implement processing that you don't want
+to write or re-write in Nextflow. They should be called from `main.nf` as a
+Nextflow *process*. The path to the scripts directory will be `$baseDir/scripts` 
+and this should be used instead of other relative or absolute paths within your
+workflow. Don't put applications here, large or small. Please use cluster
+modules within your workflow, and ask BioHPC to install software as a module if
+you need it.
+
+
+## Optimizations
+
+To make sure that your workflow can be run efficiently on the BioHPC cluster
+please:
+
+  * **Do** specify CPU and memory requirements for Nextflow processes using the
+  `cpus` and `memory` directives. Nextflow will use this information to schedule
+  tasks and complete a job as quickly as possible.
+
+  * **Do** split complex sections of a workflow into multiple Nextflow processes
+  so that they can be parallelized by the system.
+
+  * **Do** use software modules from the cluster, and specify exact versions for
+  modules when loading them in your workflow.
+
+  * **Do** thoroughly check the execution of your workflow using the command line
+  runner, before attempting to bring it into the Astrocyte web application.
+
+  * **Don't** use absolute paths for any files - see above.
+
+  * **Don't** work with any files outside of `$baseDir`, except permanent
+  reference data administered by BioHPC.
+
+
+# Vizapp / Shiny Requirements
+
+The visualization app will have access to any final output that was published
+to the `$baseDir/output` location in the Nextflow workflow. This path will be
+accessible as `Sys.getenv('outputDir')`.
+
+Parameters specified when the workflow was launched will also be available as
+environment variables, with names prefixed by `param-`. For example, the workflow
+parameter `fastqs` will be available in R as `Sys.getenv('param-fastqs')`.
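Because these variable names contain a hyphen, a shell cannot set them with an ordinary `NAME=value` assignment, which matters when trying out a vizapp outside Astrocyte. The sketch below uses the standard `env` utility instead. The helper name `with_vizapp_env`, the parameter name `fastq`, and the example paths are assumptions for illustration.

```bash
# Hypothetical helper: run a command with Astrocyte-style environment variables.
# Usage: with_vizapp_env OUTPUT_DIR FASTQ_VALUE COMMAND [ARGS...]
with_vizapp_env() {
    output_dir="$1"
    fastq_value="$2"
    shift 2
    # env accepts names that the shell's own assignment syntax rejects
    env "outputDir=$output_dir" "param-fastq=$fastq_value" "$@"
}

# Confirm the hyphenated variable reaches a child process:
# this prints the value passed for the fastq parameter.
with_vizapp_env "$PWD/workflow/output" "test_data/sample.fastq" printenv param-fastq
```

Replacing `printenv param-fastq` with a command that starts the app, for example `Rscript -e 'shiny::runApp("vizapp")'`, would give `server.R` the same variables, assuming R and the *shiny* package are installed locally.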
+
+  * **Don't** try to access any files outside of `outputDir` except permanent
+  reference data administered by BioHPC.
+
+  * **Do** list any CRAN or Bioconductor packages needed by the vizapp in the
+  `astrocyte_pkg.yml` metadata file.
+
+  * **Don't** do heavy processing in the vizapp. Shiny apps in Astrocyte share
+  resources, and are intended for basic visualization. You may wish to provide
+  instructions to users in your workflow package documentation directing them
+  to RStudio for follow-up work. Moderate I/O (e.g. scanning a large reference
+  file) is acceptable, as BioHPC systems have access to high performance
+  storage.
+
+ 
+# Testing/Running the Workflow with the Astrocyte CLI
+
+**Work in Progress - CLI not yet available on the cluster**
+
+Workflows will usually be run from the Astrocyte web interface, by importing the
+workflow package repository and making it available to users. During development,
+you can use the Astrocyte CLI scripts to check, test, and run your workflow
+against non-test data.
+
+First load the `astrocyte` module on a BioHPC system.
+
+To check the structure and syntax of the workflow package in the directory
+`astrocyte_example_chipseq`:
+
+```bash
+$ astrocyte_cli check astrocyte_example_chipseq
+```
+
+To launch the workflow's defined tests against the included test data:
+
+```bash
+$ astrocyte_cli test astrocyte_example_chipseq
+```
+
+To run the workflow using specific data and parameters (a working directory will
+be created):
+
+```bash
+$ astrocyte_cli run astrocyte_example_chipseq --parameter1 "value1" --parameter2 "value2"...
+```
+
+To run the Shiny visualization app against `test_data`:
+
+```bash
+$ astrocyte_cli shinytest astrocyte_example_chipseq
+```
+
+To run the Shiny visualization app against output from `astrocyte_cli run`,
+which will be in the work directory created by `run`:
+
+```bash
+$ astrocyte_cli shiny astrocyte_example_chipseq
+```
+
+To generate the user-facing documentation for the workflow and display it in a
+web browser:
+
+```bash
+$ astrocyte_cli docs astrocyte_example_chipseq
+```
+
+
+
+
--- a/astrocyte_pkg.yml
+++ b/astrocyte_pkg.yml
+#
+# metadata for the example astrocyte ChipSeq workflow package
+#
+
+# -----------------------------------------------------------------------------
+# BASIC INFORMATION
+# -----------------------------------------------------------------------------
+
+# A unique identifier for the workflow package, text/underscores only
+name: 'astrocyte_example'
+# Who wrote this?
+author: 'David Trudgian'
+# A contact email address for questions
+email: 'biohpc-help@utsouthwestern.edu'
+# A more informative title for the workflow package
+title: 'Astrocyte Example ChIPSeq Workflow'
+# A summary of the workflow package in plain text
+description: |
+  This is an example workflow package for the BioHPC astrocyte workflow system.
+  It implements a simple ChIPSeq analysis workflow using BWA and MACS, plus a
+  simple R Shiny visualization application.
+
+# -----------------------------------------------------------------------------
+# DOCUMENTATION
+# -----------------------------------------------------------------------------
+
+# A list of documentation files in .md format that should be viewable from the
+# web interface. These files are in the 'docs' subdirectory. The first file
+# listed will be used as a documentation index and is index.md by convention
+documentation_files:
+  - 'index.md'
+
+# -----------------------------------------------------------------------------
+# NEXTFLOW WORKFLOW CONFIGURATION
+# -----------------------------------------------------------------------------
+
+# Remember - The workflow file is always named 'workflow/main.nf'
+#            The workflow must publish all final output into $baseDir/output
+
+# A list of cluster environment modules that this workflow requires to run.
+# Specify versioned module names to ensure reproducibility.
+workflow_modules:
+  - 'BWA/0.7.5'
+  - 'picard/1.127'
+  - 'macs/1.4.2'
+
+# A list of parameters used by the workflow, defining how to present them,
+# options etc in the web interface. For each parameter:
+#
+# REQUIRED INFORMATION
+#  id:         The name of the parameter in the NEXTFLOW workflow
+#  type:       The type of the parameter, one of:
+#                string    - A free-format string
+#                integer   - An integer
+#                real      - A real number
+#                file      - A single file from user data
+#                files     - One or more files from user data
+#                select    - A selection from a list of values
+#  required:    true/false, must the parameter be entered/chosen?
+#  description: A user friendly description of the meaning of the parameter
+#
+# OPTIONAL INFORMATION
+#  default:   A default value for the parameter (optional)
+#  min:       Minimum value/characters/files for number/string/files types
+#  max:       Maximum value/characters/files for number/string/files types
+#  regex:     A regular expression that describes valid entries / filenames
+#
+# SELECT TYPE
+#  choices:   A set of choices presented to the user for the parameter.
+#             Each choice is a pair of value and description, e.g.
+# 
+#             choices:
+#               - [ 'myval1', 'The first option']
+#               - [ 'myval2', 'The second option']
+#
+# NOTE - All parameters are passed to NEXTFLOW as strings... but they
+#        are validated by astrocyte using the information provided above
+
+workflow_parameters:
+
+  - id: fastq
+    type: files
+    required: true
+    description: |
+      One or more input FASTQ files from a ChIPSeq experiment
+    regex: ".*(fastq|fq)"
+    min: 1
+
+  - id: index
+    type: select
+    choices:
+      - [ '/project/apps_database/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa', 'UCSC hg19']
+      - [ '/project/apps_database/iGenomes/Homo_sapiens/UCSC/hg18/Sequence/BWAIndex/genome.fa', 'UCSC hg18']
+    required: true
+    description: |
+      Reference genome for BWA alignment
+
+# -----------------------------------------------------------------------------
+# SHINY APP CONFIGURATION
+# -----------------------------------------------------------------------------
+
+# Remember - The vizapp is always 'vizapp/server.R' 'vizapp/ui.R'
+#            The workflow must publish all final output into $baseDir/output
+
+# Name of the R module that the vizapp will run against
+vizapp_r_module: 'R/3.2.1-Intel'
+
+# List of any CRAN packages, not provided by the modules, that must be made
+# available to the vizapp
+vizapp_cran_packages:
+  - shiny
+  - shinyFiles
+
+# List of any Bioconductor packages, not provided by the modules, that must be made
+# available to the vizapp
+vizapp_bioc_packages:
+  - chipseq
\ No newline at end of file
--- a/docs/index.md
+++ b/docs/index.md
+# Astrocyte ChIPSeq Example
+
+This workflow carries out a simple ChIPSeq alignment and peak calling using 
+*BWA* and *MACS 1.4*. One or more FASTQ files containing reads from a ChIPSeq
+experiment can be selected as input. For each file this workflow:
+
+  1. Aligns the reads to a selected genomic reference using BWA aln.
+
+  2. Converts BWA's native output into SAM format.
+
+  3. Sorts and indexes the SAM file, and converts into binary BAM format using 
+  Picard.
+
+  4. Performs ChIPSeq peak calling using MACS 1.4, with simple `--nomodel` and 
+  `--single-profile` options. Wig files are produced as well as standard
+  spreadsheet output.
+
+
+## Workflow Parameters
+
+  * **fastq** - Choose one or more ChIPSeq read files to process. All should be
+  ChIP files - i.e. there is no control. Each file will be processed as an
+  independent sample.
+
+  * **index** - Choose a genomic index to use as a reference for alignment of
+  ChIPSeq reads. A variety of options are available for human and murine
+  samples.
+
+## Visualization App
+
+The example visualization app demonstrates integration of Shiny into astrocyte
+by implementing a simple file chooser that accesses the output of the workflow.
+
+## Test Data
+
+The test data directory of this workflow package includes a subset of reads from
+Chr19 for a CTCF ChIP in a G1E cell line.
+
+The data was originally made available as example data for the Galaxy ChIP-Seq
+exercises at <https://usegalaxy.org/u/james/p/exercise-chip-seq>.
+
+
+## Credits
+
+This example workflow is derived from original scripts kindly contributed by the
+Xu lab, Children's Research Institute at UT Southwestern.
+
+
--- a/test_data/.keep
+++ b/test_data/.keep
--- a/vizapp/.keep
+++ b/vizapp/.keep
--- a/vizapp/server.R
+++ b/vizapp/server.R
+# This example implements a simple file browser for accessing results.
+
+library(shiny)
+library(shinyFiles)
+
+# Results are available in the directory specified by the outputDir environment
+# variable, read by Sys.getenv
+
+rootdir <- Sys.getenv('outputDir')
+
+
+shinyServer(function(input, output, session) {
+
+    # The backend for a simple file chooser, restricted to the
+    # rootdir we obtained above.
+    # See https://github.com/thomasp85/shinyFiles
+
+    shinyFileChoose(input, 'files', roots=c('workflow'=rootdir), filetypes=c('', 'bed', 'xls','wig'), session=session)
+
+})
--- a/vizapp/ui.R
+++ b/vizapp/ui.R
+library(shiny)
+library(shinyFiles)
+
+
+shinyUI(fluidPage(
+
+  verticalLayout(
+
+    # Application title
+    titlePanel("Astrocyte Example"),
+
+    wellPanel(
+
+        helpText("This is a minimal example, demonstrating how
+        a Shiny visualization application can access the output of a workflow.
+        Here we provide a file browser using the shinyFiles package. Real
+        Astrocyte vizapps would provide custom methods to access and visualize
+        output."),
+
+        helpText("The workflow output is in the directory set in the
+        outputDir environment variable. This can be retrieved in R with the
+        command Sys.getenv('outputDir')"),
+
+        # A simple file browser within the workflow output directory
+        # See https://github.com/thomasp85/shinyFiles
+        shinyFilesButton('files', label='Browse workflow output', title='Please select a file', multiple=FALSE)
+
+    )
+  )
+))
\ No newline at end of file
--- a/workflow/lib/.keep
+++ b/workflow/lib/.keep
--- a/workflow/main.nf
+++ b/workflow/main.nf
+/*
+ * Copyright (c) 2016. The University of Texas Southwestern Medical Center
+ *
+ *   This file is part of the BioHPC Workflow Platform
+ *
+ * Example ChIP-Seq analysis script, demonstrating the BioHPC Workflow Platform
+ *
+ * @authors
+ * David Trudgian <David.Trudgian@UTSouthwestern.edu>
+ *
+ */
+
+// Path to an input file, or a pattern for multiple inputs
+// Note - $baseDir is the location of this workflow file main.nf
+params.fastq = "$baseDir/../test_data/*.fastq"
+// Path to the BWA Index (.fa file) that we are using for the analysis
+params.index = "/project/apps_database/iGenomes/Homo_sapiens/UCSC/hg19/Sequence/BWAIndex/genome.fa"
+
+// First, get the list of fastqs, might be using a pattern matching multiple
+fastqs = Channel.fromPath(params.fastq)
+// Now find the path to the BWA index directory
+index_path = file(params.index).parent
+// And get the name of the actual index inside that directory
+index_name = file(params.index).name
+
+// bwa_aln
+// Run BWA aln on a fastq file, to produce sai output
+//
+// Input   - fastq_file is taken from the fastq channel
+//         - BWA index at $index_path/$index_name
+//
+// Output  - pair of fastq & generated sai file into the alignments channel 
+process bwa_aln {
+
+    // Tell Nextflow we will use 32 cpus here for BWA
+    cpus 32
+
+    input:
+    file fastq_file from fastqs
+     
+    output:
+    set file(fastq_file), file("${fastq_file.name}.sai") into alignments
+
+ 
+    """
+    module load BWA/0.7.5
+    bwa aln $index_path/$index_name -t 32 $fastq_file > "${fastq_file.name}.sai"
+    """
+}
+
+// bwa_samse
+// Run bwa samse to produce sam.gz from an sai alignment
+//
+// Input   - pair of fastq file and corresponding sai file, from alignments channel
+//
+// Output  - .sam.gz into the samfiles channel, and baseDir/output
+process bwa_samse {
+
+    // bwa samse will use a single cpu core
+    cpus 1
+
+    // Publish the outputs we create here into the workflow output directory
+    publishDir "$baseDir/output", mode: 'copy'
+    
+    input:
+    set file(fastq_file), file(sai_file) from alignments 
+
+    output:
+    file "${fastq_file.name}.sam.gz" into samfiles
+ 
+    """
+    module load BWA/0.7.5
+    bwa samse -r "@RG\tID:${fastq_file.name}\tLB:${fastq_file.name}\tSM:${fastq_file.name}\tPL:ILLUMINA" $index_path/$index_name\
+    $sai_file $fastq_file | gzip > "${fastq_file.name}.sam.gz"
+    """
+}
+
+// sam2bam
+// Convert SAM file to BAM file, sorting by co-ordinate and indexing
+//
+// Input   - a sam file, (possibly gzipped) from the samfiles channel
+//
+// Output  - sorted .bam into the bamfiles channel, and baseDir/output
+process sam2bam {
+
+
+    // Tell Nextflow picard will only use one cpu.
+    // We are allocating 32GB to java though, so tell
+    // Nextflow so it can assign the task appropriately.
+    cpus 1
+    memory '32GB'
+
+    // Publish the outputs we create here into the workflow output directory
+    publishDir "$baseDir/output", mode: 'copy'
+    
+    input:
+    file sam_file from samfiles
+
+    output:
+    file "${sam_file.name}.bam" into bamfiles
+ 
+    """
+    module add picard/1.127
+    java -Xmx32G -jar \$PICARD/picard.jar SortSam \
+    INPUT="${sam_file}" \
+    OUTPUT="${sam_file.name}.bam" \
+    SORT_ORDER=coordinate \
+    VALIDATION_STRINGENCY=LENIENT \
+    CREATE_INDEX=true
+    """
+}
+
+// macs14
+// Peak calling on a bam using MACS 1.4
+//
+// Input   - a bam file, from the bamfiles channel
+//
+// Output  - various wig and bed into baseDir/output 
+process macs14 {
+
+    // Publish the outputs we create here into the workflow output directory
+    publishDir "$baseDir/output", mode: 'copy'
+    
+    input:
+    file bam_file from bamfiles
+
+    output:
+    file "${bam_file}_bwa_nomodel_MACS_wiggle"
+    file "${bam_file}_bwa_nomodel_peaks.bed"
+    file "${bam_file}_bwa_nomodel_peaks.xls"
+    file "${bam_file}_bwa_nomodel_summits.bed"
+ 
+    """
+    module add macs/1.4.2
+    macs14 -t ${bam_file} \
+      --name ${bam_file}_bwa_nomodel \
+      --nomodel \
+      --wig \
+      --single-profile \
+      -f BAM
+    """
+}
--- a/workflow/output/.keep
+++ b/workflow/output/.keep
--- a/workflow/scripts/.keep
+++ b/workflow/scripts/.keep