# Astrocyte Example Workflow Package


This is an example workflow package for the BioHPC Astrocyte workflow engine.
Astrocyte is a system allowing workflows to be run easily from the web in a
push-button manner, taking advantage of the BioHPC compute cluster. Astrocyte
allows users to access this workflow package using a simple web interface,
created automatically from the definitions in this package.

## This Example Package

This workflow package provides:

  1) A sample ChIP-Seq data analysis workflow, which uses BWA to align reads to
   a reference genome, and MACS to call peaks. The workflow is written in the
   *Nextflow* workflow language. *Nextflow* is a
   simple yet powerful workflow scripting language based on the *Groovy*
   scripting language. It supports advanced features such as implicit
   parallelization on the cluster - Nextflow will launch concurrent jobs for
   each input file. 
  2) A sample *Shiny* visualization app, which provides a web-based tool for
  visualizing results. *Shiny* is a framework to provide web interfaces to
  data and analysis implemented in the *R* statistical language. *R* is a
  powerful language for manipulating and interrogating data, and *Shiny* allows
  analysis in R to be presented simply and easily as a web application.

  3) Meta-data describing the workflow, its inputs, outputs, etc. The Astrocyte
  web application and command-line runner use this meta-data to understand the
  workflow, what input it needs, how the documentation is arranged, etc.
  4) User-focused documentation, in *markdown* format, that will be displayed to
  users in the Astrocyte web interface. Markdown is a simple plain-text based
  syntax which is especially suited for writing documentation that will be
  displayed on the web.
  5) Developer-focused documentation, in this file. This
  documentation should summarize features of the workflow package that are of
  interest to anyone who would want to extend it, or use it as a template for
  their own work.

## Workflow Package Layout

Workflow packages for Astrocyte are Git repositories, and have a common layout
which must be followed so that Astrocyte understands how to present them to
users. The folder structure, and names of key files listed below should not be
changed. Although a workflow package with a modified structure may work, it is
not guaranteed to be accepted by future versions of Astrocyte.

The following structure of files and directories is always present:


   - docs/
   - test_data/ 
   - vizapp/
   - workflow/
       - lib/
       - output/
       - scripts/
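
As a minimal sketch, the directory skeleton listed above could be created for a new package with a few shell commands. The `scaffold_pkg` function name is hypothetical and used only for illustration; it is not part of Astrocyte, and the sketch creates only the directories, not the metadata or workflow files.

```shell
#!/usr/bin/env bash
# Sketch: create the directory skeleton of an Astrocyte workflow package.
# scaffold_pkg is a hypothetical helper, not an Astrocyte command.

scaffold_pkg() {
    local pkg="$1"
    # mkdir -p creates missing parent directories and is safe to re-run.
    mkdir -p \
        "$pkg/docs" \
        "$pkg/test_data" \
        "$pkg/vizapp" \
        "$pkg/workflow/lib" \
        "$pkg/workflow/output" \
        "$pkg/workflow/scripts"
}

# Create the skeleton only when a target path is given on the command line.
if [ "$#" -gt 0 ]; then
    scaffold_pkg "$1"
fi
```

Saved as a script and run with a path argument, this leaves an empty package layout that can then be filled in with the files described in the sections below.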


### Meta-Data

  * `astrocyte_pkg.yml` - A file in the root directory of the package, which
  contains the metadata describing the workflow in a human- and machine-readable
  text format called *YAML*. This includes information about the workflow package
  such as its name, synopsis, input parameters, outputs, etc.

  See the documentation inside the example `astrocyte_pkg.yml` file for a
  guide to specifying Astrocyte metadata.

### The Workflow
  * `workflow/` - A *Nextflow* workflow file, which will be run by
  Astrocyte using parameters provided by the user.
  * `workflow/scripts` - A directory for any scripts (e.g. bash, python, 
  ruby scripts) that the workflow will call. This might be empty if
  the workflow is implemented entirely in Nextflow. You should *not* include
  large pieces of software here. Workflows should be designed to use *modules*
  available on the BioHPC cluster. The modules a workflow needs will be defined
  in the `astrocyte_pkg.yml` metadata file.
  * `workflow/lib` - A directory for any Nextflow/Groovy libraries that might be
  included by workflows using advanced features. Usually empty for simpler
  workflows.
  * `workflow/output` - An empty directory, into which any final output of the
  workflow should be published using the `publishDir "$baseDir/output", mode: 'copy'`
  directive inside a process.

  To learn about the *Nextflow* language, take a look at this and other example
  workflows, and refer to the *Nextflow* website.

  Nextflow workflows used in an Astrocyte package must be written in a certain
  way, with specific rules so that Astrocyte can run them successfully on the
  cluster. See the *Workflow Requirements* section below for details.

### The Visualization App *(Optional)*

  * `vizapp/` - A directory that will contain an *R Shiny* visualization app, if
   required. The visualization app will be made available to the user via the
   Astrocyte web interface. At minimum the directory requires the standard Shiny
  `ui.R` and `server.R` files. The exact Shiny app structure is not 
  prescribed. Any R packages required by the Shiny app will be listed in the
  `astrocyte_pkg.yml` metadata.

  Shiny apps used in an Astrocyte package must be written in a certain
  way, with specific rules so that Astrocyte can run them successfully, and find
  data files to visualize. See the *Vizapp Requirements* section below for
  details.

### User Documentation 

  * `docs/` - The first page of user documentation, in *markdown*
  format. Astrocyte will display this documentation to users of the workflow.

  * `docs/...` - Any other documentation files. *Markdown* `.md` files will be
  rendered for display on the web. Any images used in the documentation should
  also be placed here.

### Developer Documentation

  * `` - Documentation for developers of the workflow giving a brief
  overview and any important notes that are not for workflow users.
  * `` *(Optional)* - The license applied to the workflow package.
  * `` - A brief summary of changes made through time to the workflow.

### Testing

  * `test_data/` - Every workflow package should include a minimal set of test
  data that allows the workflow to be run, testing its features. The
  `test_data/` directory is a location for test data files. Test data should be
  kept as small as possible. If large datasets (over 20MB total) are unavoidable,
  provide a bash script which obtains the data from an external source.
  * `test_data/` *(Optional)* - A bash script that fetches large
  test data from an external source, placing it into the `test_data/` directory.
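
The 20MB guideline above can be checked before committing test data. The following is a rough sketch only: `check_test_data_size` is a hypothetical helper, not an Astrocyte command, and it assumes "20MB" means 20 × 1024 KB as reported by `du`.

```shell
#!/usr/bin/env bash
# Sketch: report whether a test data directory exceeds the 20MB guideline.
# check_test_data_size is a hypothetical helper, not part of Astrocyte.

check_test_data_size() {
    local dir="$1"
    local limit_kb=$((20 * 1024))   # assumption: 20MB taken as 20 * 1024 KB
    local size_kb

    if [ ! -d "$dir" ]; then
        echo "not a directory: $dir" >&2
        return 2
    fi

    # du -sk prints the total disk usage in kilobytes, then a tab and the path.
    size_kb="$(du -sk "$dir" | cut -f1)"

    if [ "$size_kb" -gt "$limit_kb" ]; then
        echo "test data is ${size_kb} KB, over the ${limit_kb} KB guideline" >&2
        return 1
    fi
    echo "test data is ${size_kb} KB, within the ${limit_kb} KB guideline"
}

# Check the package's own test_data/ directory when run from the package root.
if [ -d test_data ]; then
    check_test_data_size test_data
fi
```

A non-zero exit status suggests moving the large files to an external source and fetching them with a script, as described above.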

# Workflow Requirements & Testing

So that Astrocyte can successfully run a workflow for any BioHPC user, and
make efficient use of the Nucleus compute cluster, the Nextflow workflow must
be written according to some rules.

## Astrocyte / Nextflow Basics

A Nextflow workflow run by Astrocyte must be in a file named `workflow/`
within the workflow package. Do not use other filenames for your workflow.

When a workflow runs on the Astrocyte platform the work area will be created
dynamically and its path will not be known in advance. The `$baseDir`
variable can be used inside the Nextflow workflow to refer to this path.

Data files for analysis will be uploaded or linked into Astrocyte by users. 
Workflows cannot access input data directly. **All input file names must be
accepted as workflow parameters**. Astrocyte will allow users to select input
files, and pass the parameter values to your workflow. 

Output files for the user should be published to `$baseDir/output` using the
Nextflow directive `publishDir "$baseDir/output", mode: 'copy'` in a process block.
You can create a directory structure inside `$baseDir/output` if required to
organize the output files. Note that we use the 'copy' mode so that Nextflow's
work directories can be cleaned up by Astrocyte. By default Nextflow will link
published files from work directories, so output would be lost on cleanup if
`mode: 'copy'` is not used.

Reference data paths, even in permanent locations such as 
`/project/apps_database/iGenomes` should not be hard-coded into workflows.
Provide parameters in your workflow for specifying reference files, and then
specify possible choices for those parameters in the `astrocyte_pkg.yml` 
metadata file. If you need to make reference data available for a workflow,
please ask the BioHPC team to place it in the central `/project/apps_database`
directory.

The `workflow/scripts` directory in a package is reserved for any scripts
(e.g. Perl, Python, or Bash scripts) which implement processing that you don't
want to write or re-write in Nextflow. They should be called from the workflow
as a Nextflow *process*. The path to the scripts directory will be
`$baseDir/scripts`, and this should be used instead of other relative or
absolute paths within your workflow. Don't put applications here, whether large
or small. Please use cluster modules within your workflow, and ask BioHPC to
install software as a module if you need it.

## Optimizations

To make sure that your workflow can be run efficiently on the BioHPC cluster:

  * **Do** specify cpu and memory requirements for Nextflow processes using the
  `cpus` and `memory` directives. Nextflow will use this information to schedule
  tasks and complete a job as quickly as possible.

  * **Do** split complex sections of a workflow into multiple Nextflow processes
  so that they can be parallelized by the system.

  * **Do** use software modules from the cluster, and specify exact versions for
  modules when loading them in your workflow.

  * **Do** thoroughly check the execution of your workflow using the command line
  runner, before attempting to bring it into the Astrocyte web application.

  * **Don't** use absolute paths for any files - see above.

  * **Don't** work with any files outside of `$baseDir`, except permanent
  reference data administered by BioHPC.
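
One way to catch hard-coded paths before testing is a simple text search over the `workflow/` directory. This is only a heuristic sketch: `check_no_absolute_paths` is a hypothetical helper, not an Astrocyte command, and the pattern only flags quoted strings that begin with `/`, so it can miss some cases and flag harmless ones.

```shell
#!/usr/bin/env bash
# Sketch: flag quoted strings that look like absolute paths in workflow files.
# check_no_absolute_paths is a hypothetical helper, not part of Astrocyte.

check_no_absolute_paths() {
    local dir="$1"
    # Matches a single or double quote immediately followed by "/name...",
    # e.g. "/project/...". Paths built from "$baseDir/..." do not match,
    # because the quote is followed by "$" rather than "/".
    if grep -rnE "[\"']/[A-Za-z0-9_.]" "$dir"; then
        echo "possible hard-coded absolute paths found (listed above)" >&2
        return 1
    fi
    return 0
}

# Check the package's workflow/ directory when run from the package root.
if [ -d workflow ]; then
    check_no_absolute_paths workflow
fi
```

Any match is worth reviewing by hand: reference data locations should arrive as workflow parameters rather than appear as literals.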

# Vizapp / Shiny Requirements

The visualization app will have access to any final output that was published
to the `$baseDir/output` location in the Nextflow workflow. This path will be
accessible as `Sys.getenv('outputDir')`.

Parameters specified when the workflow was launched will also be available as
environment variables - their name prefixed by `param-` e.g. the workflow
parameter `fastqs` will be available in R as `Sys.getenv('param-fastqs')`.

  * **Don't** try to access any files outside of `outputDir` except permanent
  reference data administered by BioHPC.

  * **Do** list any CRAN or Bioconductor packages needed by the vizapp in the
  `astrocyte_pkg.yml` metadata file.

  * **Don't** do heavy processing in the vizapp. Shiny apps in Astrocyte share
  resources, and are intended for basic visualization. You may wish to provide
  instructions to users in your workflow package documentation directing them
  to RStudio for follow-up work. Moderate I/O (e.g. scanning a large reference
  file) is acceptable, as BioHPC systems have access to high performance
  storage.

# Testing/Running the Workflow with the Astrocyte CLI

**Work in Progress - CLI not yet available on the cluster**

Workflows will usually be run from the Astrocyte web interface, by importing the
workflow package repository and making it available to users. During development
you can use the Astrocyte CLI scripts to check, test, and run your workflow
against non-test data.

First load the `astrocyte` module on a BioHPC system.

To check the structure and syntax of the workflow package in the directory
`astrocyte_example_chipseq`:

    $ astrocyte_cli check astrocyte_example_chipseq

To launch the workflow's defined tests against the included test data:

    $ astrocyte_cli test astrocyte_example_chipseq

To run the workflow using specific data and parameters (a working directory
will be created):

    $ astrocyte_cli run astrocyte_example_chipseq --parameter1 "value1" --parameter2 "value2"...

To run the Shiny visualization app against `test_data`:

    $ astrocyte_cli shinytest astrocyte_example_chipseq

To run the Shiny visualization app against output from `astrocyte_cli run`,
which will be in the work directory created by `run`:

    $ astrocyte_cli shiny astrocyte_example_chipseq

To generate the user-facing documentation for the workflow and display it in a
web browser:

    $ astrocyte_cli docs astrocyte_example_chipseq