HPO API

HPO Inputs

Params

An example of a fully constructed params argument:

from crayai import hpo

my_params = hpo.Params([['--learning_rate', 1e-4, (1e-6, 1)],
                        ['--dropout_rate', (0.1, 1)],
                        ['--optimizer', 'sgd', ['sgd', 'gd', 'adam']]])

class crayai.hpo.params.Params(params)

The Params class stores the hyperparameters to be optimized with a chosen strategy.

Parameters

params – List of lists of hyperparameter information describing the flags associated with hyperparameters, their default values, and their possible values.

Each element of the params list must follow one of these formats:

# Hyperparameter search space expressed as tuple of bounds
[flag_string, default_value, (lower_bound, upper_bound)]

# Hyperparameter search space expressed as list of values
[flag_string, default_value, [value1, value2, value3, .., valueN]]

# Hyperparameter search space without a default value chooses an initial value at random
[flag_string, [value1, value2, value3, .., valueN]]
[flag_string, (lower_bound, upper_bound)]
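The formats above can be told apart by the length of each entry and by whether the search space is a tuple or a list. The following is a minimal sketch of that distinction; `ParamSpec` and `parse_entry` are hypothetical names used for illustration and are not part of crayai.

```python
from typing import Any, NamedTuple, Optional


class ParamSpec(NamedTuple):
    flag: str
    default: Optional[Any]  # None means "no default given; pick one at random"
    space: Any              # tuple => (lower, upper) bounds, list => discrete values


def parse_entry(entry):
    """Split one params entry into flag, default, and search space (hypothetical helper)."""
    if len(entry) == 3:
        flag, default, space = entry
    elif len(entry) == 2:
        flag, space = entry
        default = None
    else:
        raise ValueError(f"expected 2 or 3 items, got {len(entry)}: {entry!r}")

    if isinstance(space, tuple):
        if len(space) != 2 or space[0] > space[1]:
            raise ValueError(f"bounds must be (lower_bound, upper_bound): {space!r}")
    elif isinstance(space, list):
        if not space:
            raise ValueError("list of values must not be empty")
    else:
        raise ValueError(f"search space must be a tuple or a list: {space!r}")
    return ParamSpec(flag, default, space)


specs = [parse_entry(e) for e in [
    ['--learning_rate', 1e-4, (1e-6, 1)],
    ['--dropout_rate', (0.1, 1)],
    ['--optimizer', 'sgd', ['sgd', 'gd', 'adam']],
]]
```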
detect_cycle(self, g) → bool

Determine if there is any cycle in the dependency graph

Args:

g: nx.DiGraph object

Returns:

bool value
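To illustrate what a cycle check on a dependency graph involves, here is a dependency-free sketch using depth-first search. The documented method takes an nx.DiGraph; this sketch instead assumes a plain dict that maps each node to its successors, and `has_cycle` is a hypothetical name, not the library's implementation.

```python
def has_cycle(graph):
    """Return True if the directed graph (dict: node -> list of successors) has a cycle."""
    WHITE, GRAY, BLACK = 0, 1, 2  # unvisited, on the current DFS path, finished
    color = {}

    def visit(node):
        color[node] = GRAY
        for nxt in graph.get(node, ()):
            state = color.get(nxt, WHITE)
            if state == GRAY:
                # Reached a node that is still on the current path: back edge.
                return True
            if state == WHITE and visit(nxt):
                return True
        color[node] = BLACK
        return False

    return any(color.get(n, WHITE) == WHITE and visit(n) for n in list(graph))


acyclic = {'-a': ['-b'], '-b': ['-e']}
cyclic = {'-a': ['-b'], '-b': ['-a']}
```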

get_dependency_graph_as_str(self) → str

Return the conditions concatenated into a single string, separated by semicolons (;).

get_ordered_param_when_valid(self) → List[str]

Return a list of ordered parameters when there is no cycle in the graph

get_params(self)

Get hyperparameters.

Returns:

dict(string, string): Each key is the name of a hyperparameter (e.g. --learning_rate), and each value is the value of that hyperparameter (e.g. 0.01).
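A dictionary of this shape maps naturally onto command-line arguments for the kernel program. The sketch below shows one way to do that; `build_command` is a hypothetical helper, and the "flag value" argument layout is an assumption, since how crayai joins flags and values is not described here.

```python
import shlex


def build_command(cmd, params):
    """Append each flag and its value to the base command (hypothetical helper)."""
    args = shlex.split(cmd)
    for flag, value in params.items():
        # Assumed layout: "<flag> <value>" as two separate arguments.
        args.extend([flag, str(value)])
    return args


command = build_command('python3 relative/path/to/train_model.py',
                        {'--learning_rate': '0.01', '--optimizer': 'adam'})
```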

serialize_graph(self, graph) → List[str]

Returns an ordered list of nodes. Assumes graph has no cycle
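Ordering parameters so that every parent comes before its dependents is a topological sort. As a rough illustration, the sketch below uses the standard library's graphlib (Python 3.9+) rather than the library's own code. The mapping mirrors a set of conditions where -b depends on -a and -c, and so on. Returning None when a cycle is found is an assumption made for this example only.

```python
from graphlib import CycleError, TopologicalSorter

# Each dependent parameter maps to the set of parent parameters it depends on.
depends_on = {
    '-b': {'-a', '-c'},
    '-f': {'-d'},
    '-e': {'-b'},
    '-a': {'-g'},
    '-c': {'-g'},
}


def ordered_params(depends_on):
    """Return parameters with parents first, or None if the graph has a cycle."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


order = ordered_params(depends_on)
```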

Evaluator

class crayai.hpo.evaluator.Evaluator(cmd, **kwargs)

The evaluator class defines how to evaluate a set of hyperparameters by running the kernel program (model training script) with command line arguments.

Parameters
cmd: str

Shell command used to evaluate the hyperparameters (without hyperparameter flags). For example, python relative/path/to/model.py. The path to the kernel script should be a relative path.

  • If src_path is defined, cmd should be set as if it is run from src_path.

  • If src_path is not defined, cmd should be set as if it is run from the current working directory.

src_path: str, optional (default="")

Path to source files. src_path must be defined in order to use run_path. If src_path="", all evaluations take place in the current working directory. Can be a relative or absolute path.

run_path: str, optional (default='run')

Top-level workspace directory in which subdirectories are created for running evaluations and generating log files. src_path must be set in order for run_path to be used; otherwise a warning will be generated and all runs will take place in the current working directory. Can be a relative or absolute path.

metric: Any, optional (default={'FoM': 1.0})

Dictionary (or string) of evaluation metrics and their weights to use during hyperparameter optimization. When multiple metrics are specified, their weighted sum is used. A string (e.g., ‘f1: ’) is treated as a single metric with 100% weight. Metric names are assumed to be unique strings and are searched for in the evaluation output. Using multiple metrics requires CrayAI to be built with regex support, i.e., CHPL_REGEXP=re2. With a single metric, the evaluation output is parsed using regex when CHPL_REGEXP=re2; otherwise, string matching is used.
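The weighted-sum behavior can be sketched as follows. This is an illustration only, not CrayAI's parser: `weighted_fom` is a hypothetical function, and the output format it expects (metric name, an optional ':' or '=', then a number, taking the first match) is an assumption.

```python
import re

# Matches integers, decimals, and scientific notation such as 1e-3.
NUMBER = r"([-+]?\d*\.?\d+(?:[eE][-+]?\d+)?)"


def weighted_fom(output, metric):
    """Combine metrics found in the evaluation output into one weighted sum."""
    if isinstance(metric, str):
        # A single metric name carries 100% of the weight.
        metric = {metric: 1.0}
    total = 0.0
    for name, weight in metric.items():
        match = re.search(re.escape(name) + r"\s*[:=]?\s*" + NUMBER, output)
        if match is None:
            raise ValueError(f"metric {name!r} not found in evaluation output")
        total += weight * float(match.group(1))
    return total


sample_output = """epoch 10 done
Metric_categorical_crossentropy: 0.40
Metric_weighted_categorical_crossentropy: 0.20
"""

fom = weighted_fom(sample_output, {
    'Metric_categorical_crossentropy': 0.5,
    'Metric_weighted_categorical_crossentropy': 0.5,
})
```

With equal weights of 0.5, the result here is the mean of the two reported values.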

checkpoint: str, optional (default="")

Path to checkpoint directory per workspace. Required for using PBT optimizer.

nodes: int, optional (default=0)

Number of nodes to allocate for distributed training. Ignored when using an existing allocation. Setting nodes>1 with workload_manager='local' will generate an error. Designates the number of pods when using Kubernetes.

  • If <=1, then allocate 1 node.

  • If >1, then allocate that many nodes.

nodes_per_eval: int, optional, (default=0)

Number of nodes to run for each evaluation. Only applicable if evaluation supports distributed execution. If nodes_per_eval is not set then one node will be used per evaluation unless nodes per evaluation is a hyperparameter.

num_parallel_evals: int, optional, (default=0)

Number of evaluations to run in parallel.

  • If 0, then run nodes/nodes_per_eval evaluations in parallel.

  • If >0, then run that many evaluations in parallel.
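The interaction between nodes, nodes_per_eval, and num_parallel_evals described above can be summarized in a few lines. In this sketch `parallel_evals` is a hypothetical helper, and the use of integer division with a floor of one evaluation is an assumption rather than documented behavior.

```python
def parallel_evals(nodes=0, nodes_per_eval=0, num_parallel_evals=0):
    """Estimate how many evaluations run in parallel under the documented defaults."""
    allocated = max(nodes, 1)          # nodes <= 1 allocates one node
    per_eval = max(nodes_per_eval, 1)  # unset means one node per evaluation
    if num_parallel_evals > 0:
        return num_parallel_evals      # an explicit setting takes precedence
    # Assumption: nodes / nodes_per_eval is rounded down, with at least one evaluation.
    return max(allocated // per_eval, 1)
```

For example, eight allocated nodes with two nodes per evaluation give four parallel evaluations under these assumptions.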

workload_manager: string, optional (default='auto')

Workload manager to be used for acquiring and managing allocations.

  • If ‘auto’, then detect workload manager; run locally if no workload_manager is found.

  • If ‘local’, then run locally without a workload manager.

  • If ‘slurm’, use slurm workload manager.

  • If ‘pbs’, use PBS workload manager.

  • If ‘k8s’, use Kubernetes as workload manager (must also use ‘k8s’ as the launcher).

launcher: string, optional (default='auto')

Launcher to be used for executing evaluations. This can be left as ‘auto’ unless using a different launcher than the workload manager.

  • If ‘auto’, then inherit from the workload_manager.

  • If ‘local’, then run with no launcher (locally).

  • If ‘slurm’, use slurm (srun) as launcher.

  • If ‘urika’, use urika (run_training) as launcher.

  • If ‘k8s’, use Kubernetes as the launcher (must also use ‘k8s’ as the workload manager).

workload_image: string, optional (default="")

The image containing the workload platform to be used in the evaluation (e.g. TensorFlow, PyTorch). This should be the specific name of the image in the registry, e.g. ‘shasta-tensorflowv1.15-ubuntu:latest’. Currently supported only when workload_manager is ‘k8s’, in which case it is required.

launch_args: string, optional (default="")

Flags to pass to launcher command.

alloc_jobid: int, optional, (default=0)

Job ID specifying which allocation to use. Currently only supports workload_manager='slurm'. Can be omitted if the job ID is available through environment variables such as SLURM_JOBID.

alloc_timeout: int, optional, (default=30)

Number of minutes requested in allocation.

alloc_args: string, optional (default="")

Additional arguments to be passed to the allocation command. For example alloc_args='-C haswell'.

timeout: real (default=0.0)

Time budget for all evaluations (minutes)

flag: dict, optional, (default={})

Flags used to pass information (e.g. nodes used in an evaluation) from the evaluator to the kernel script. Possible keys for flags are: ‘nodes_per_evaluation’.

For example, setting flag={'nodes_per_evaluation': '--N'} allows nodes per evaluation to be represented by the flag --N.

verbose: bool, optional, (default=False)

Enable verbose output for evaluation and job management.

num_retries: int, optional (default=0)

Number of times a failed evaluation can be re-attempted.

Examples

from crayai import hpo

# Local evaluation
evaluator = hpo.Evaluator('python3 relative/path/to/train_model.py',
                           workload_manager='local')

# Distributed evaluation, where Evaluator will allocate nodes
evaluator = hpo.Evaluator('python3 relative/path/to/train_model.py',
                           workload_manager='slurm',
                           nodes=4)

# Evaluation with multiple metrics
evaluator = hpo.Evaluator('python3 relative/path/to/train_model.py --FoM',
                           metric={
                               'Metric_categorical_crossentropy': 0.5,
                               'Metric_weighted_categorical_crossentropy': 0.5
                           })
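On the other side of the Evaluator sits the kernel program, which receives hyperparameters as command-line flags and reports its metric in its output. The following is a toy sketch of such a script. The flag names follow the Params example, the loss is made up, and the "FoM: <value>" print format is an assumption about what the metric search would match.

```python
import argparse


def main(argv=None):
    """Toy kernel script: parse hyperparameter flags and report a figure of merit."""
    parser = argparse.ArgumentParser(description="Toy kernel script")
    parser.add_argument('--learning_rate', type=float, default=1e-4)
    parser.add_argument('--dropout_rate', type=float, default=0.5)
    parser.add_argument('--optimizer', choices=['sgd', 'gd', 'adam'], default='sgd')
    args = parser.parse_args(argv)

    # Stand-in for real training: a made-up loss that is smallest at
    # learning_rate=0.01 and dropout_rate=0.3.
    fom = (args.learning_rate - 0.01) ** 2 + (args.dropout_rate - 0.3) ** 2

    # Print the metric name followed by its value so it can be found in the output.
    print(f"FoM: {fom}")
    return fom


# In a real script this call would sit under `if __name__ == '__main__':`
# and read flags from sys.argv; explicit arguments are passed here for illustration.
fom = main(['--learning_rate', '0.01', '--dropout_rate', '0.3'])
```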

Condition

An example of conditional hyperparameters:

from crayai import hpo
from crayai.hpo import condition

evaluator = hpo.Evaluator('python source/sin.py')

params = hpo.Params([["-a", 1.0, (1, 10.0)],
                     ["-b", 1.0, (1, 10.0)],
                     ["-c", 1.0, (1, 10.0)],
                     ["-d", 1.0, (1, 10.0)],
                     ["-e", 1.0, (1, 10.0)],
                     ["-f", 1.0, (1, 10.0)],
                     ["-g", 1.0, (1, 10.0)]])

conditions = [condition.greater_than('-b', '-a', 2),
              condition.greater_than('-b', '-c', 3),
              condition.less_than('-f', '-d', 2),
              condition.less_than('-e', '-b', 4),
              condition.less_than('-a', '-g', 5),
              condition.less_than('-c', '-g', 5)]

params.add_conditions(conditions)

class crayai.hpo.condition.Condition(dependent_param: str, parent_param: str, parent_param_value: Any, dependent_param_value: Any = None, operator: str = '')

Parent class for encoding dependency between hyperparameters

type_as_str(self, value)

Get type as a clean string such as ‘int’, ‘float’ or ‘list(str)’

class crayai.hpo.condition.equals(dependent_param: str, parent_param: str, parent_param_value: Any, dependent_param_value: Any = None, operator: str = '==')

Condition subclass where a hyperparameter is used only when parent hyperparameter meets equality condition (e.g., b | a == 1)

class crayai.hpo.condition.greater_than(dependent_param: str, parent_param: str, parent_param_value: Any, dependent_param_value: Any = None, operator: str = '>')

Condition subclass where a hyperparameter is used only when parent hyperparameter is greater than a given value (e.g., b | a > 5.0)

class crayai.hpo.condition.inside(dependent_param: str, parent_param: str, parent_param_value: Any, dependent_param_value: Any = None, operator: str = 'in')

Condition subclass where a hyperparameter is used only when the parent hyperparameter takes a value from a given list of values (e.g., b | a in [1, 2, 3, 4])

class crayai.hpo.condition.less_than(dependent_param: str, parent_param: str, parent_param_value: Any, dependent_param_value: Any = None, operator: str = '<')

Condition subclass where a hyperparameter is used only when parent hyperparameter is less than a given value (e.g., b | a < 10.0)

class crayai.hpo.condition.not_equals(dependent_param: str, parent_param: str, parent_param_value: Any, dependent_param_value: Any = None, operator: str = '!=')

Condition subclass where a hyperparameter is used only when parent hyperparameter meets inequality condition (e.g., b | a != 1)
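The semantics of these condition classes can be illustrated with a small stand-alone sketch that decides whether a dependent hyperparameter is active for a given set of parent values. The tuple representation and the `is_active` helper are hypothetical. Treating several conditions on the same dependent parameter as a logical AND is also an assumption.

```python
import operator

# One comparison function per operator string used by the condition classes.
OPS = {
    '==': operator.eq,
    '!=': operator.ne,
    '>': operator.gt,
    '<': operator.lt,
    'in': lambda value, options: value in options,
}


def is_active(dependent, conditions, values):
    """Return True if every condition on `dependent` holds for the current values.

    conditions: list of (dependent_param, parent_param, operator, parent_param_value)
    values: dict mapping parameter flags to their current values
    """
    return all(
        OPS[op](values[parent], parent_value)
        for dep, parent, op, parent_value in conditions
        if dep == dependent
    )


conditions = [
    ('-b', '-a', '>', 2),   # -b is used only when -a > 2
    ('-b', '-c', '>', 3),   # ... and only when -c > 3
    ('-e', '-b', '<', 4),   # -e is used only when -b < 4
]
values = {'-a': 5.0, '-b': 6.0, '-c': 3.5}
```

A parameter with no conditions attached is always treated as active in this sketch.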

HPO Strategies

Genetic HPO

Population-based Training

Population-based training (PBT) requires enabling checkpointing, which requires some modifications to the model training program code, as well as additional arguments to the Optimizer constructor.

class crayai.hpo.genetic_optimizer.GeneticOptimizer(evaluator=None, **kwargs)

Genetic optimizer

Employs a genetic search over hyperparameters to minimize the figure of merit (FoM). Hyperparameters can be optimized from a list of values (e.g. learning rate selected from a list of values [1e-4, 1e-3, 1e-2, 1e-1]) or a range of values (e.g. any real number between (1e-4, 1e-1)). If hyperparameters are optimized from a list of values, then only values in the list are searched and mutations occur with respect to their indices. For example, a hyperparameter at index 20 in a list of 100 elements will have a higher chance of mutating to a nearby index, such as 21 or 19.
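One way to picture index-based mutation that favors nearby values is to draw a small offset around the current index. The sketch below does this with a rounded Gaussian offset. The distribution, the `spread` parameter, and the clamping at the list ends are all assumptions chosen for illustration, not the optimizer's actual mutation rule.

```python
import random


def mutate_index(index, size, spread=2.0, rng=random):
    """Return an index near `index`, clamped to the valid range [0, size - 1]."""
    # Small offsets are much more likely than large ones under a Gaussian.
    offset = round(rng.gauss(0.0, spread))
    if offset == 0:
        # Step to a direct neighbor instead of staying in place.
        offset = rng.choice([-1, 1])
    return min(max(index + offset, 0), size - 1)


values = [i / 100 for i in range(100)]   # a list of 100 candidate values
rng = random.Random(0)                   # seeded for repeatability
new_index = mutate_index(20, len(values), rng=rng)
new_value = values[new_index]
```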

Parameters
  • evaluator (Evaluator) – Evaluator instance

  • generations (int) – Number of generations. Defaults to 1000.

  • num_demes (int) – Number of distinct demes (populations). Defaults to 4.

  • pop_size (int) – Number of individuals per deme. Total number of individuals per generation is num_demes * pop_size. Defaults to 64.

  • mutation_rate (float) – Probability of mutation per hyperparameter during creation of next generation. Can be 0.0 to 1.0. Defaults to 0.05 (5%).

  • crossover_rate (float) – Probability of crossover per hyperparameter during creation of next generation. Can be 0.0 to 1.0. Defaults to 0.33 (33%).

  • migration_interval (float) – Interval of migration between demes. Defaults to 5.

  • mul_mutation_bounds (list) – Bounds on mutation percentages. Index 0 is the upper bound of a small mutation, index 1 is a lower bound on large mutation, and index 2 is an upper bound on large mutation. Defaults to [0.01, 0.1, 0.2] ([1%, 10%, 20%]).

  • add_mutation_bounds (list) – Bounds on addition percentages. Index 0 is the upper bound of a small mutation, index 1 is a lower bound on large mutation, and index 2 is an upper bound on large mutation. Defaults to [0.03, 0.03, 0.13] ([3%, 3%, 13%]).

  • name (str) – Experiment name used as prefix for log filenames to record results of optimization. Defaults to empty string "".

  • verbose (bool) – Enable verbose output. Defaults to False.

best_fom = None

Figure of merit associated with the best hyperparameters

best_params = None

Best set of hyperparameters found

optimize(self, params)

Optimize the input hyperparameters using this strategy

Parameters

params (Params) – Hyperparameters to optimize