Astrocyte Example Workflow Package
This is an example workflow package for the BioHPC Astrocyte workflow engine. Astrocyte is a system allowing workflows to be run easily from the web in a push-button manner, taking advantage of the BioHPC compute cluster. Astrocyte allows users to access this workflow package using a simple web interface, created automatically from the definitions in this package.
This Example Package
This workflow package provides:
- A sample ChIP-Seq data analysis workflow, which uses BWA to align reads to a reference genome, and MACS to call peaks. The workflow is written in the Nextflow workflow language. Nextflow is a simple yet powerful workflow scripting language based on the Groovy scripting language. It supports advanced features such as implicit parallelization on the cluster - Nextflow will launch concurrent jobs for each input file.
- A sample Shiny visualization app, which provides a web-based tool for visualizing results. Shiny is a framework to provide web interfaces to data and analysis implemented in the R statistical language. R is a powerful language for manipulating and interrogating data, and Shiny allows analysis in R to be presented simply and easily as a web application.
- Metadata describing the workflow, its inputs, outputs, etc. The Astrocyte web application and command-line runner use this metadata to understand the workflow, what input it needs, how the documentation is arranged, etc.
- User-focused documentation, in markdown format, that will be displayed to users in the Astrocyte web interface. Markdown is a simple plain-text based syntax which is especially suited for writing documentation that will be displayed on the web.
- Developer-focused documentation, in this file - README.md. This documentation should summarize features of the workflow package that are of interest to anyone who would want to extend it, or use it as a template for their own work.
Workflow Package Layout
Workflow packages for Astrocyte are Git repositories, and have a common layout which must be followed so that Astrocyte understands how to present them to users. The folder structure and the names of the key files listed below should not be changed. Although a workflow package with a modified structure may work, it is not guaranteed to be accepted by future versions of Astrocyte.
The following structure of files and directories is always present:
- docs/
  - index.md
- test_data/
- vizapp/
  - server.R
  - ui.R
- workflow/
  - lib/
  - output/
  - scripts/
  - main.nf
- astrocyte_pkg.yml
- CHANGES.md
- LICENSE.md
- README.md
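
As a starting point for a new package, the layout above can be created as an empty skeleton. The following is only a sketch: the package name my_workflow_pkg is a hypothetical placeholder, and the files are created empty, so they still need real content before Astrocyte will accept the package.

```shell
#!/bin/sh
# Sketch: create the empty skeleton of an Astrocyte workflow package.
# "my_workflow_pkg" is a hypothetical default name, not an Astrocyte convention.
set -eu

pkg="${1:-my_workflow_pkg}"

# Directories from the required layout.
mkdir -p "$pkg/docs" "$pkg/test_data" "$pkg/vizapp" \
         "$pkg/workflow/lib" "$pkg/workflow/output" "$pkg/workflow/scripts"

# Key files from the required layout, created empty if they do not exist yet.
touch "$pkg/docs/index.md" \
      "$pkg/vizapp/server.R" "$pkg/vizapp/ui.R" \
      "$pkg/workflow/main.nf" \
      "$pkg/astrocyte_pkg.yml" "$pkg/CHANGES.md" "$pkg/LICENSE.md" "$pkg/README.md"

# List what was created.
find "$pkg" | sort
```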
Meta-Data
- astrocyte_pkg.yml - A file in the root directory of the package, which contains the metadata describing the workflow in a human- and machine-readable text format called YAML. This includes information about the workflow package such as its name, synopsis, input parameters, outputs, etc. See the documentation inside the example astrocyte_pkg.yml file for a guide to specifying Astrocyte metadata.
The Workflow
- workflow/main.nf - A Nextflow workflow file, which will be run by Astrocyte using parameters provided by the user.
- workflow/scripts - A directory for any scripts (e.g. bash, python, ruby scripts) that the main.nf workflow will call. This might be empty if the workflow is implemented entirely in Nextflow. You should not include large pieces of software here. Workflows should be designed to use modules available on the BioHPC cluster. The modules a workflow needs will be defined in the astrocyte_pkg.yml metadata file.
- workflow/lib - A directory for any Nextflow/Groovy libraries that might be included by workflows using advanced features. Usually empty for simpler workflows.
- workflow/output - An empty directory, into which any final output of the workflow should be published using the publishDir "$baseDir/output", mode: 'copy' directive inside a process.
To learn about the Nextflow language, take a look at this and other example workflows, and refer to the nextflow.io website.
Nextflow workflows used in an Astrocyte package must be written in a certain way, with specific rules so that Astrocyte can run them successfully on the cluster. See the Workflow Requirements section below for details.
The Visualization App (Optional)
- vizapp/ - A directory that will contain an R Shiny visualization app, if required. The visualization app will be made available to the user via the Astrocyte web interface. At minimum the directory requires the standard Shiny ui.R and server.R files. The exact Shiny app structure is not prescribed. Any R packages required by the Shiny app will be listed in the astrocyte_pkg.yml metadata.
Shiny apps used in an Astrocyte package must be written in a certain way, with specific rules so that Astrocyte can run them successfully, and find data files to visualize. See the Vizapp Requirements section below for details.
User Documentation
- docs/index.md - The first page of user documentation, in markdown format. Astrocyte will display this documentation to users of the workflow package.
- docs/... - Any other documentation files. Markdown .md files will be rendered for display on the web. Any images used in the documentation should also be placed here.
Developer Documentation
- README.md - Documentation for developers of the workflow, giving a brief overview and any important notes that are not for workflow users.
- LICENSE.md (Optional) - The license applied to the workflow package.
- CHANGES.md - A brief summary of changes made to the workflow over time.
Testing
- test_data/ - Every workflow package should include a minimal set of test data that allows the workflow to be run, testing its features. The test_data/ directory is a location for test data files. Test data should be kept as small as possible. If large datasets (over 20MB total) are unavoidable, provide a fetch_test_data.sh bash script which obtains the data from an external source.
- test_data/fetch_test_data.sh (Optional) - A bash script that fetches large test data from an external source, placing it into the test_data/ directory.
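
To decide whether bundled test data has outgrown the 20MB guideline, a small size check can help. This is a minimal sketch: the helper name test_data_too_large is hypothetical, and it measures disk usage with du, which can differ slightly from the sum of file sizes.

```shell
#!/bin/sh
# Sketch: report whether a test data directory exceeds a size limit.
# test_data_too_large is a hypothetical helper, not part of Astrocyte.

# Returns 0 (success) when the directory uses more than limit_mb megabytes
# of disk space; the limit defaults to the 20MB guideline.
test_data_too_large() {
    dir="$1"
    limit_mb="${2:-20}"
    # du -sk prints total disk usage in kilobytes in the first column.
    size_kb=$(du -sk "$dir" | awk '{print $1}')
    [ "$size_kb" -gt $((limit_mb * 1024)) ]
}
```

For example, `if test_data_too_large test_data; then echo "consider fetch_test_data.sh"; fi` prints the reminder only when the directory is over the limit.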
Workflow Requirements & Testing
So that Astrocyte can successfully run a workflow for any BioHPC user, and make efficient use of the Nucleus compute cluster, the Nextflow workflow must be written according to some rules.
Astrocyte / Nextflow Basics
A Nextflow workflow run by Astrocyte must be in a file named workflow/main.nf within the workflow package. Do not use other filenames for your workflow.
When a workflow runs on the Astrocyte platform the work area will be created dynamically and its path will not be known in advance. The $baseDir variable can be used inside the Nextflow workflow to refer to this path.
Data files for analysis will be uploaded or linked into Astrocyte by users. Workflows cannot access input data directly. All input file names must be accepted as workflow parameters. Astrocyte will allow users to select input files, and pass the parameter values to your workflow.
Output files for the user should be published to $baseDir/output using the Nextflow directive publishDir "$baseDir/output", mode: 'copy' in a process block.
You can create a directory structure inside $baseDir/output if required to organize the output files. Note that we use the 'copy' mode so that Nextflow's work directories can be cleaned up by Astrocyte. By default Nextflow will link published files from work directories, so output would be lost on cleanup if mode: 'copy' is not used.
Reference data paths, even in permanent locations such as /project/apps_database/iGenomes, should not be hard-coded into workflows. Provide parameters in your workflow for specifying reference files, and then specify possible choices for those parameters in the astrocyte_pkg.yml metadata file. If you need to make reference data available for a workflow, please ask the BioHPC team to place it in the central /project/apps_database area.
The workflow/scripts directory in a package is reserved for any scripts, e.g. Perl, Python, or Bash scripts, which implement processing that you don't want to write or re-write in Nextflow. They should be called from main.nf as a Nextflow process. The path to the scripts directory will be $baseDir/scripts and this should be used instead of other relative or absolute paths within your workflow. Don't put applications here, whether large or small. Please use cluster modules within your workflow, and ask BioHPC to install software as a module if you need it.
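
A helper in workflow/scripts should receive its input file names as arguments rather than hard-coding them. The sketch below illustrates the idea with a hypothetical count_reads function. It assumes an uncompressed FASTQ file with exactly four lines per record, which is not guaranteed for real data.

```shell
#!/bin/sh
# Sketch of a helper that could live in workflow/scripts/.
# count_reads is a hypothetical name; the input path is passed as an
# argument, so no file location is hard-coded.

count_reads() {
    fastq="$1"
    if [ ! -r "$fastq" ]; then
        echo "cannot read input file: $fastq" >&2
        return 1
    fi
    # Assumes four lines per FASTQ record, so reads = lines / 4.
    lines=$(wc -l < "$fastq")
    echo $((lines / 4))
}
```

Saved as, say, workflow/scripts/count_reads.sh with a final line `count_reads "$1"`, it could be invoked from a process in main.nf through the $baseDir/scripts path.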
Optimizations
To make sure that your workflow can be run efficiently on the BioHPC cluster please:
- Do specify CPU and memory requirements for Nextflow processes using the cpus and memory directives. Nextflow will use this information to schedule tasks and complete a job as quickly as possible.
- Do split complex sections of a workflow into multiple Nextflow processes so that they can be parallelized by the system.
- Do use containerized software, and specify exact versions. If the containers are home-made, provide the recipe files and build script.
- Do use software modules from the cluster, and specify exact versions for modules when loading them in your workflow.
- Do thoroughly check the execution of your workflow using the command line runner, before attempting to bring it into the Astrocyte web application.
- Don't use absolute paths for any files - see above.
- Don't work with any files outside of $baseDir, except permanent reference data administered by BioHPC.
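
One way to catch hard-coded absolute paths before submitting a package is a simple text search. The following is a heuristic sketch, not an Astrocyte feature: find_hardcoded_paths is a hypothetical helper that only flags quoted strings beginning with "/", so it can miss paths built in other ways and may flag harmless strings.

```shell
#!/bin/sh
# Sketch: flag quoted absolute paths in a workflow file.
# find_hardcoded_paths is a hypothetical helper, not part of Astrocyte.

find_hardcoded_paths() {
    # Print matching lines with their line numbers.
    # grep exits non-zero when nothing matches, so the function
    # succeeds only if at least one suspicious line was found.
    grep -nE "[\"']/[A-Za-z0-9_.]" "$1"
}
```

For example, `find_hardcoded_paths workflow/main.nf` lists each line that contains a string literal starting with a slash, while strings that start with $baseDir are not reported.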
Vizapp / Shiny Requirements
The visualization app will have access to any final output that was published to the $baseDir/output location in the Nextflow workflow. This path will be accessible as Sys.getenv('outputDir').
Parameters specified when the workflow was launched will also be available as environment variables, with their names prefixed by param-, e.g. the workflow parameter fastqs will be available in R as Sys.getenv('param-fastqs').
- Don't try to access any files outside of outputDir except permanent reference data administered by BioHPC.
- Do list any CRAN or Bioconductor packages needed by the vizapp in the astrocyte_pkg.yml metadata file.
- Don't do heavy processing in the vizapp. Shiny apps in Astrocyte share resources, and are intended for basic visualization. You may wish to provide instructions to users in your workflow package documentation directing them to RStudio for follow-up work. Moderate I/O (e.g. scanning a large reference file) is acceptable, as BioHPC systems have access to high performance storage.
Testing/Running the Workflow with the Astrocyte CLI
Workflows will usually be run from the Astrocyte web interface, by importing the workflow package repository and making it available to users. During development you can use the Astrocyte CLI scripts to check, test, and run your workflow against non-test data.
First load the astrocyte module on a BioHPC system.
To check the structure and syntax of the workflow package in the directory astrocyte_example_chipseq:
astrocyte_cli check astrocyte_example_chipseq
To launch the workflow's defined tests against the included test data (note: check the test_data directory to see whether the test data must first be downloaded manually):
astrocyte_cli test astrocyte_example_chipseq
To run the workflow using specific data and parameters (a working directory will be created):
astrocyte_cli run astrocyte_example_chipseq --parameter1 "value1" --parameter2 "value2"...
To install libraries for the Shiny visualization app:
astrocyte_cli viz-prepare astrocyte_example_chipseq
To run the Shiny visualization app against output from astrocyte_cli run/test, which will be in the workflow/output directory created by run:
astrocyte_cli viz astrocyte_example_chipseq
To generate the user-facing documentation for the workflow and display it in a web browser:
astrocyte_cli docs astrocyte_example_chipseq
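
The check and test steps above can be chained so that the tests only run once the package structure passes. This is a minimal sketch that assumes astrocyte_cli returns a non-zero exit status on failure, which is not confirmed here. The precheck_package function and the ASTROCYTE_CLI override variable are hypothetical names introduced for illustration.

```shell
#!/bin/sh
# Sketch: run "check" and then "test", stopping at the first failure.
# Assumes astrocyte_cli exits non-zero on failure (unverified).
# ASTROCYTE_CLI is a hypothetical override, e.g. for a dry run with echo.

precheck_package() {
    pkg_dir="$1"
    cli="${ASTROCYTE_CLI:-astrocyte_cli}"
    # "test" is only attempted when "check" succeeds.
    "$cli" check "$pkg_dir" && "$cli" test "$pkg_dir"
}
```

Running `precheck_package astrocyte_example_chipseq` after loading the astrocyte module would then perform both steps in order.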