# BioHPC param_runner

[![build status](https://git.biohpc.swmed.edu/s190450/param_runner/badges/master/build.svg)](https://git.biohpc.swmed.edu/s190450/param_runner/commits/master)
[![coverage report](https://git.biohpc.swmed.edu/s190450/param_runner/badges/master/coverage.svg)](https://git.biohpc.swmed.edu/s190450/param_runner/commits/master)
### STATUS - Alpha version, testing with initial users.

## Introduction
The BioHPC `param_runner` is a command-line tool for hyperparameter optimization:
it explores a defined parameter space and summarizes the results.
This tool uses a simple YAML configuration file to define a parameter space to
exhaustively search, and runs tasks in parallel by distributing them over a set
of allocated nodes on the BioHPC Nucleus cluster. Supported parameter spaces are:

  * Arithmetic progressions of integers or real numbers.
  * Geometric progressions of integers or real numbers.
  * Defined lists of strings or other values.

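For illustration, the two progression types expand as sketched below. This is a hypothetical sketch of the semantics, not param_runner's actual code; the `min`/`max`/`step`/`scale` names follow the parameter file format described later.

```python
def arithmetic_range(lo, hi, step):
    """Expand a range with a 'step': lo, lo+step, ... while <= hi."""
    values, v = [], lo
    while v <= hi:
        values.append(v)
        v += step
    return values

def geometric_range(lo, hi, scale):
    """Expand a range with a 'scale': lo, lo*scale, ... while <= hi."""
    values, v = [], lo
    while v <= hi:
        values.append(v)
        v *= scale
    return values

print(arithmetic_range(2, 32, 2))     # even numbers from 2 up to 32
print(geometric_range(1, 10000, 10))  # [1, 10, 100, 1000, 10000]
```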
The output of each command, run with a particular parameter combination, can be
captured and summarized in tabular format by defining regular expressions that
match against the command's output.
## Install Parameter Runner

1. Download the source code


    git clone https://git.biohpc.swmed.edu/s190450/param_runner.git 
2. Create a Python 3.6 environment and activate it


    conda create --name py36 python=3.6
    source activate py36

3. Install with pip  
    pip install .

4. Install Spearmint and its Python 2 environment


    param_runner init spearmint

5. Test your installation


    param_runner test
    
6. Show example files


    param_runner examples
    

## Uninstall Parameter Runner
    
    param_runner uninstall
    
Note: You can also uninstall param_runner with pip, but then you must delete Spearmint and its environment manually.


## Using the Parameter Runner on your own computer

  1. Arrange your data.  
      For the spearmint executor, a python script with the model to be optimized (e.g. braninpy) and
      a configuration file for spearmint (e.g. config.pb) are required.

      For the ray_tune executor, a python script with your Trainable class is required (e.g. hyperband_examples.py).
      Please note that for param_runner to optimize your Trainable class, the `redis_address=os.environ["RAY_HEAD_IP"]`
      and `resources_per_trial={'gpu': os.environ["NUM_GPUs"]}` options should be used in your `ray.init` and `tune.run`
      calls, respectively. See the example below:

      `ray.init(redis_address=os.environ["RAY_HEAD_IP"])`  
      `... ...`  
      `run(exp, scheduler=hyperband, resources_per_trial={'gpu': os.environ["NUM_GPUs"]})`  

      More details can be found by running `param_runner examples`, which lists all examples.
      
  2. Create a parameter .yml file (see below, "Parameter File Format" section)
  3. Check the parameter .yml file: `param_runner check myexperiment.yml`
  4. Run the job on your own computer: `param_runner run myexperiment.yml`
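Note that `os.environ` values are always strings, so if your Trainable needs a numeric GPU count it must be converted explicitly. A minimal sketch, assuming the `RAY_HEAD_IP` and `NUM_GPUs` variables mentioned in step 1 are set by param_runner:

```python
import os

def ray_settings_from_env():
    """Read the head address and GPU count assumed to be exported
    by param_runner as RAY_HEAD_IP and NUM_GPUs."""
    head_ip = os.environ["RAY_HEAD_IP"]
    num_gpus = int(os.environ["NUM_GPUs"])  # environment values are strings
    return head_ip, num_gpus

# Example with the variables set by hand:
os.environ["RAY_HEAD_IP"] = "198.51.100.1:6379"
os.environ["NUM_GPUs"] = "2"
print(ray_settings_from_env())  # ('198.51.100.1:6379', 2)
```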
  

## Using the Parameter Runner on the Nucleus Cluster

  1. Arrange your data. (see above)
  2. Create a parameter .yml file (see below, "Parameter File Format" section)
  3. Check the parameter .yml file: `param_runner check myexperiment.yml`
  4. Submit to the cluster: `param_runner submit myexperiment.yml`
  
When `param_runner submit` is run, it submits a job to the cluster to perform
the parameter optimization. The job number is reported on the command line:

```
...
INFO     Submitted batch job 290006
...
```

You can monitor the status of the job using the `squeue` command, or the BioHPC
web portal. When the job starts to run it will create a log file in the
directory it was launched from, named according to the job ID that was reported
above.

```
param_runner_290008.out
```

You can examine this file with commands such as `cat` or `less`. Look for a line
stating the output directory for the optimization:

```
...
 - Trace file and output will be at: test/test_data/param_runner_20170111-12131484158381
...
```

Inside this directory you will find the master `trace.txt` file that summarizes
the parameter exploration. There are also many xxx.out and xxx.err files, which
contain the full standard output and standard error from each individual command
that was run during the parameter exploration, e.g.:

```
...
100.err
100.out
101.err
101.out
102.err
102.out
103.err
103.out
104.err
104.out
105.err
105.out
...

```

The number in each filename corresponds to the index column in the master
summary `trace.txt`.
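For example, pairing a `trace.txt` index with its output files is just filename construction. A small sketch (the directory name is taken from the example above; helper name is hypothetical):

```python
import os

def outputs_for_index(run_dir, index):
    """Return the (stdout, stderr) file paths for one trace.txt index."""
    return (os.path.join(run_dir, f"{index}.out"),
            os.path.join(run_dir, f"{index}.err"))

print(outputs_for_index("test/test_data/param_runner_20170111-12131484158381", 100))
```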


## Parameter File Format

The parameter file that defines the parameter exploration to be run is written
in a simple text-based format called YAML. A simple introduction to YAML can be
found at the link below, but it should be easy to work from the example
parameter file given later.

https://learnxinyminutes.com/docs/yaml/
 
Whenever you write a parameter file be sure to `param_runner check` it before
you attempt to submit to the cluster. The check will list any problems with the
parameter file that you should correct before submitting to the cluster.

The example below includes documentation of each parameter, and can be
used as a starting point for your own parameter files.

``` yaml
# The command to run, including any arguments that always stay the same,
# and will not be explored by the runner.
# e.g. in this example we run an imaginary program called `train_ann`,
# which could be a program to train an artificial neural network. We
# want to run it with various different training parameters, but the
# input and crossk arguments are always the same. This example is
# typical of a machine learning experiment.
command: train_ann --input train.set --crossk 10
# An optional working directory. param_runner will cd to this directory
# before beginning to run commands.
# work_dir: /tmp/param_work_test

# If optional summary entries are defined here we will look at the standard
# output each time the command is run, and try to extract result values
# using regular expressions that are provided. Each summary entry has
# an id, which will become a column in our trace.txt summary file.
# The regex supplied must contain a single matching group. The value it
# matches against will be written into the trace.txt file in the
# relevant column.
#
# Here we will pull out the TPF and FPF values that our train_ann program
# reports when it has finished training our neural network.

summary:

  - id: True_Pos_Fraction
    regex: 'TPF: ([-+]?[0-9]*\.?[0-9]+)'

  - id: False_Pos_Fraction
    regex: 'FPF: ([-+]?[0-9]*\.?[0-9]+)'

# Now we define cluster options, describing how to run our commands on
# 1 or more nodes on the BioHPC nucleus cluster.

# Cluster partition to use
partition: 256GB

# Generic Resource (e.g. GPU request)
# OPTIONAL: Request a SLURM generic resource like a GPU
# gres: 'gpu:1'

# Total number of nodes to use
nodes: 4

# Number of CPUs required by each task. Here we request 4 cpus for each
# task. On the 256GB nodes there are 48 logical cores, so the runner
# can start 12 tasks in parallel per node, for a total of 48 concurrent
# tasks across the 4 nodes we have requested.
cpus_per_task: 4

# Time limit - you must specify a timelimit for the job here, in the
# format <DAYS>-<HOURS>:<MINUTES>
# If the job reaches the limit it will be terminated, so please allow
# a margin of safety. Here we ask for 3 days.
time_limit: 3-00:00
# Modules to load - an optional list of environment modules that the
# command you wish to execute needs to be loaded
modules:
  - matlab/2013a
  - python/2.7.x-anaconda


# Now we configure the list of parameters we want to explore.
#
# For each parameter the following properties are required:
#
#    type: 'int_range'        A range of integers -or-
#           'real_range'       A range of real numbers -or-
#           'choice'           A list of string options
#
#
#  The following properties are optional:
#
#    flag: '--example'        A flag which should precede the value of the parameter
#    optional: true           If true, we consider combinations excluding this parameter
#
#    substitution: '%1'       If substitution is specified, the parameter value
#                             will replace the placeholder supplied in the
#                             command, instead of being appended to the command
#
#  int_range and real_range types take parameters:
#
#     min: 10                 Minimum value for the parameter
#     max: 1e4                Maximum value for the parameter
#     step: 20                Amount to add at each step through the range
#     scale: 10               Amount to multiply value at each step through the range
#
#   e.g. for an arithmetic progression of 10 - 100 in steps of 5:
#       min: 10
#       max: 100
#       step: 5
#
#   e.g. for a geometric progression of 1, 10, 100, 1,000, 10,000 (scale factor 10)
#       min: 1
#       max: 10000
#       scale: 10
#
#
# In the example below we explore a total of 160 parameter combinations,
# consisting of every combination of:
#   - 16 different numbers of hidden nodes
#   - 4 different regularization beta values (plus, since the flag is
#     optional, runs omitting it entirely)
#   - 2 different activation functions


parameters:

# Even numbers from 2 up to 32 for the value of --hidden

  - id: 'hidden'
    type: "int_range"
    flag: '--hidden'
    min: 2
    max: 32
    step: 2
    description: "Number of hidden nodes"

# 0.1, 1.0, 10, 100 for the value of --beta

  - id: 'beta'
    type: real_range
    flag: '-b'
    optional: true
    min: 0.1
    max: 100
    scale: 10
    description: Regularization beta parameter

# Either step or sigmoid for the value of --activation

  - id: 'activation'
    type: choice
    flag: '--activation'
    values:
      - step
      - sigmoid
    description: Activation function
```
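As a sanity check on the size of a grid like the one above, you can multiply out the value counts yourself. This sketch assumes an optional parameter simply contributes one extra "omitted" case (represented as `None`) on top of its listed values:

```python
from itertools import product

hidden = list(range(2, 33, 2))             # 16 values: even numbers 2..32
beta = [0.1, 1.0, 10.0, 100.0] + [None]    # 4 values, plus omitting the optional flag
activation = ["step", "sigmoid"]           # 2 values

combinations = list(product(hidden, beta, activation))
print(len(combinations))  # 160
```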