Commit efbfda57 authored by David Trudgian

Update readme

parent d583db67
[![coverage report](https://git.biohpc.swmed.edu/biohpc/param_runner/badges/master/coverage.svg)](https://git.biohpc.swmed.edu/biohpc/param_runner/commits/master)
### STATUS - Alpha version, testing with initial users.
## Introduction
The BioHPC `param_runner` is a command line tool that runs a command multiple
times, exploring a defined parameter space and summarizing the results.
This tool uses a simple YAML configuration file to define a parameter space to
search exhaustively, and runs tasks in parallel by distributing them over a set
of allocated nodes on the BioHPC Nucleus cluster. Supported parameter spaces are:
* Arithmetic progressions of integers or real numbers.
* Geometric progressions of integers or real numbers.
* Defined lists of strings or other values.
The output of a command run with a certain parameter combination can be captured
for summary in tabular format by defining regular expressions that match against
the output of the command.
## Using the Parameter Runner on the Nucleus Cluster
3. Check the parameter .yml file: `param_runner check myexperiment.yml`
4. Submit to the cluster: `param_runner submit myexperiment.yml`
Output from the individual commands run will be stored in a 'param_output/'
subdirectory. Summary information from the commands executed will be tabulated
according to your summary configuration, into the file 'param_output/param_summary.csv'.
When `param_runner submit` is run it will submit a job to the cluster to perform
the parameter optimization. The job number will be reported on the command line:
```
...
INFO Submitted batch job 290006
...
```
You can monitor the status of the job using the `squeue` command, or the BioHPC
web portal. When the job starts to run it will create a log file in the
directory it was launched from, named according to the job ID that was reported
above.
```
param_runner_290006.out
```
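For example, with the job ID reported above (the job ID below is copied from the example submit output; `squeue` is the standard SLURM queue listing command):

```shell
# The job ID reported by `param_runner submit` (290006 in the example above)
jobid=290006

# The runner's log is written to the launch directory, named after the job ID
logfile="param_runner_${jobid}.out"
echo "$logfile"

# On the cluster you would then watch the queue and the log, e.g.:
#   squeue -u $USER
#   less "$logfile"
```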
You can examine this file with commands such as `cat` or `less`. Look for a line
stating the output directory for the optimization:
```
...
- Trace file and output will be at: test/test_data/param_runner_20170111-12131484158381
...
```
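The directory path can be pulled out of the log with standard tools; a sketch with `sed`, using the log line copied from the excerpt above:

```shell
# A log line as shown in the excerpt above
line="- Trace file and output will be at: test/test_data/param_runner_20170111-12131484158381"

# Strip everything up to and including the label, leaving just the path
outdir=$(printf '%s\n' "$line" | sed -n 's/.*output will be at: //p')
echo "$outdir"
```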
Inside this directory you will find the master `trace.txt` file that summarizes
the parameter exploration. There are also many xxx.out and xxx.err files, which
contain the full standard output and standard error from each individual command
run throughout the parameter exploration, e.g.:
```
...
100.err
100.out
101.err
101.out
102.err
102.out
103.err
103.out
104.err
104.out
105.err
105.out
...
```
The number in the filename corresponds to the index column in the master summary
`trace.txt`.
## Parameter File Format
The parameter file that defines the parameter exploration to be run is written
in a simple text based format called YAML. A simple introduction to YAML can be
found at the link below, but it should be easy to work from the example
parameter file given later.
https://learnxinyminutes.com/docs/yaml/
Whenever you write a parameter file be sure to `param_runner check` it before
you attempt to submit to the cluster. The check will list any problems with the
parameter file that you should correct before submitting to the cluster.
The example below includes documentation of each parameter, and can be
used as a starting point for your own parameter files.
```yaml
# The command to run, including any arguments that always stay the same,
# and will not be explored by the runner.
#
# e.g. in this example we run an imaginary program called `train_ann`,
# which could be a program to train an artificial neural network. We
# want to run it with various different training parameters, but the
# input and crossk arguments are always the same. This example is
# typical of a machine learning experiment.
command: train_ann --input train.set --crossk 10

# The standard output from a command will be collected into a file named:
# out_<param1 val>_<param2 val>_<param3_val>.....out
# The standard error from a command will be collected into a file named:
# out_<param1 val>_<param2 val>_<param3_val>.....err
# If summary entries are defined here we will look at the standard
# output each time the command is run, and try to extract result values
# using the regular expressions provided. Each summary entry has
# an id, which will become a column in our trace.txt summary file.
# The regex supplied must contain a single matching group. The value it
# matches will be written into the trace.txt file in the
# relevant column.
#
# Here we pull out the TPF and FPF values that our train_ann program
# reports when it has finished training our neural network.
summary:
  - id: True_Pos_Fraction
    regex: 'TPF: ([-+]?[0-9]*\.?[0-9])'
  - id: False_Pos_Fraction
    regex: 'FPF: ([-+]?[0-9]*\.?[0-9])'
# Now we define cluster options, describing how to run our commands on
# one or more nodes on the BioHPC Nucleus cluster.
# Cluster partition to use
partition: 256GB
# Total number of nodes to use
nodes: 4
# Number of CPUs required by each task. Here we request 4 cpus for each
# task. On the 256GB nodes there are 48 logical cores, so the runner
# can start 12 tasks in parallel per node, for a total of 48 concurrent
# tasks across the 4 nodes we have requested.
cpus_per_task: 4
# Time limit - you must specify a time limit for the job here, in the
# format <DAYS>-<HOURS>:<MINUTES>
# If the job reaches the limit it will be terminated, so please allow
# a margin of safety. Here we ask for 3 days.
time_limit: 3-00:00
# Now we configure the list of parameters we want to explore.
#
# For each parameter the following properties are required:
#
# id:          A short name for the parameter
# type:        One of int_range, real_range, or choice
# description: A human-readable description of the parameter
#
# The following properties are optional:
#
# flag: '--example'   A flag which should precede the value of the parameter
# optional: true      If true, we consider combinations excluding this parameter
#
# int_range and real_range types take parameters:
#
# min: 10 Minimum value for the parameter
# max: 1e4 Maximum value for the parameter
# step: 20    Amount to add at each step through the range
# scale: 10   Amount to multiply the value by at each step through the range
#
# e.g. for an arithmetic progression of 10 - 100 in steps of 5:
# min: 10
# max: 100
# step: 5
#
# e.g. for a geometric progression of 1 - 10000, multiplying by 10 each step:
# min: 1
# max: 10000
# scale: 10
#
#
# In the example below we explore a total of 128 parameter combinations,
# consisting of every combination of:
# - 16 different numbers of hidden nodes
# - 4 different regularization beta values
# - 2 different activation functions
parameters:

  # Even numbers from 2 up to 32 for the value of --hidden
  - id: 'hidden'
    type: "int_range"
    flag: '--hidden'
    min: 2
    max: 32
    step: 2
    description: "Number of hidden nodes"
  # 0.1, 1.0, 10, 100 for the value of --beta
  - id: 'beta'
    type: real_range
    flag: '-b'
    min: 0.1
    max: 100
    scale: 10
    description: Regularization beta parameter
  # Either step or sigmoid for the value of --activation
  - id: 'activation'
    type: choice
    flag: '--activation'
    values:
      - step
      - sigmoid
    description: Activation function
```
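Two pieces of the example can be checked locally before submitting: the range definitions and the summary regular expressions. A sketch using standard `seq`, `awk`, and `sed` (the sample `FPF:` line is invented for illustration):

```shell
# int_range with min 10, max 100, step 5 -> arithmetic progression
seq 10 5 100

# real_range with min 1, max 10000, scale 10 -> geometric progression
awk 'BEGIN { for (v = 1; v <= 10000; v *= 10) print v }'

# Apply the False_Pos_Fraction regex from the summary section to a sample
# output line; the captured group is what would land in the summary column
echo "FPF: 0.9" | sed -nE 's/.*FPF: ([-+]?[0-9]*\.?[0-9]).*/\1/p'
```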