This tutorial demonstrates how you can efficiently tune hyperparameters for a model using HyperDrive, Azure ML’s hyperparameter tuning functionality. You will train a Keras model on the CIFAR10 dataset, automate hyperparameter exploration, launch parallel jobs, log your results, and find the best run.
Hyperparameters are adjustable settings, chosen before training, that control how a model is trained. Learning rate, number of epochs, and batch size are all examples of hyperparameters.
Using brute-force methods to find the optimal hyperparameter values can be time-consuming, and poorly performing runs waste money. To avoid this, HyperDrive automates hyperparameter exploration by launching several parallel runs with different configurations and finding the configuration that gives the best performance on your primary metric.
Let’s get started with the example to see how it works!
If you don’t have access to an Azure ML workspace, follow the setup tutorial to configure and create a workspace.
The setup for your development work in this tutorial includes the following actions:
Instantiate a workspace object from your existing workspace. The following code will load the workspace details from a config.json file if you previously wrote one out with write_workspace_config().
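A minimal sketch of this step, assuming a config.json file for your workspace is available in the working directory or one of its parents:

```r
library(azuremlsdk)

# Load the workspace details from a previously written config.json
ws <- load_workspace_from_config()
```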
Or, you can retrieve a workspace by directly specifying your workspace details:
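A minimal sketch of the direct lookup. The three strings are placeholders that you would replace with your own workspace details:

```r
library(azuremlsdk)

# Placeholders: substitute your own workspace name, subscription ID,
# and resource group
ws <- get_workspace(name = "<your workspace name>",
                    subscription_id = "<your subscription ID>",
                    resource_group = "<your resource group>")
```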
An Azure ML experiment tracks a grouping of runs, typically from the same training script. Create an experiment to track hyperparameter tuning runs for the Keras model.
If you would like to track your runs in an existing experiment, simply pass that experiment’s name to the name parameter of experiment().
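For example, assuming `ws` holds the workspace object loaded earlier. The experiment name here is an illustrative choice, not a required value:

```r
# "hyperdrive-cifar10" is a hypothetical experiment name; any valid name works
exp <- experiment(workspace = ws, name = "hyperdrive-cifar10")
```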
By using Azure Machine Learning Compute (AmlCompute), a managed service, data scientists can train machine learning models on clusters of Azure virtual machines. In this tutorial, you create a GPU-enabled cluster as your training environment. The code below creates the compute cluster for you if it doesn’t already exist in your workspace.
You may need to wait a few minutes for your compute cluster to be provisioned if it doesn’t already exist.
cluster_name <- "gpucluster"
compute_target <- get_compute(ws, cluster_name = cluster_name)
if (is.null(compute_target)) {
  vm_size <- "STANDARD_NC6"
  compute_target <- create_aml_compute(workspace = ws,
                                       cluster_name = cluster_name,
                                       vm_size = vm_size,
                                       max_nodes = 4)
  wait_for_provisioning_completion(compute_target, show_output = TRUE)
}
A training script called cifar10_cnn.R has been provided for you in the hyperparameter-tune-with-keras folder.
In order to leverage HyperDrive, the training script for your model must log the relevant metrics during model training. When you configure the hyperparameter tuning run, you specify the primary metric to use for evaluating run performance. You must log this metric so it is available to the hyperparameter tuning process.
To log the required metrics, do the following inside the training script:
Import the azuremlsdk package:
library(azuremlsdk)
Take the hyperparameters as command-line arguments to the script. This is necessary so that when HyperDrive carries out the hyperparameter sweep, it can run the training script with different values for the hyperparameters, as defined by the search space.
Use the log_metric_to_run() function to log the hyperparameters and the primary metric:
log_metric_to_run("batch_size", batch_size)
...
log_metric_to_run("epochs", epochs)
...
log_metric_to_run("lr", lr)
...
log_metric_to_run("decay", decay)
...
log_metric_to_run("Loss", results[[1]])
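The command-line step above can be sketched as follows. This is not the provided cifar10_cnn.R script: the helper `parse_hyperparameters()` and the default values are hypothetical, and the sketch assumes the arguments arrive as `--name value` pairs.

```r
# Hypothetical helper: read "--name value" pairs from the command line and
# overwrite the matching entries in a list of defaults.
parse_hyperparameters <- function(args, defaults) {
  params <- defaults
  i <- 1
  while (i < length(args)) {
    name <- sub("^--", "", args[i])          # strip the leading dashes
    if (name %in% names(params)) {
      params[[name]] <- as.numeric(args[i + 1])
    }
    i <- i + 2                               # move to the next pair
  }
  params
}

# Illustrative defaults (assumed values, not taken from the tutorial script)
defaults <- list(batch_size = 32, epochs = 100, lr = 1e-4, decay = 1e-6)
params <- parse_hyperparameters(commandArgs(trailingOnly = TRUE), defaults)

batch_size <- params$batch_size
epochs <- params$epochs
lr <- params$lr
decay <- params$decay
```

Parsing by name rather than by position keeps the script working when a hyperparameter is left out of the sweep and falls back to its default.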
An Azure ML estimator encapsulates the run configuration information needed for executing a training script on the compute target. Azure ML runs execute as containerized jobs on the specified compute target. The estimator defines the configuration for each of the child runs that the parent HyperDrive run will launch.
To create the estimator, define the following:
The directory that contains the scripts needed for training (source_directory). All the files in this directory are uploaded to the cluster node(s) for execution. The directory must contain your training script and any additional scripts required.
The training script to execute (entry_script).
The compute target (compute_target), in this case the AmlCompute cluster you created earlier.
The environment required for the run; see the hyperparameter-tune-with-keras/ folder for reference. See the r_environment() reference for the full set of configurable options.
To kick off hyperparameter tuning in Azure ML, you will need to configure a HyperDrive run, which will in turn launch individual child runs of the training script with the corresponding hyperparameter values.
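Each child run uses the estimator built from the settings listed above. A minimal sketch, assuming `compute_target` is the cluster created earlier. The environment name "keras-env" is a placeholder, and a real run would also need to give the environment an image or packages that provide Keras and TensorFlow, which are not shown here:

```r
# Hypothetical environment name; configure the image or packages your
# training script needs via the other r_environment() options
env <- r_environment(name = "keras-env")

est <- estimator(source_directory = "hyperparameter-tune-with-keras",
                 entry_script = "cifar10_cnn.R",
                 compute_target = compute_target,
                 environment = env)
```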
In this experiment, we will use four hyperparameters: batch size, number of epochs, learning rate, and decay. To begin tuning, we must define the range of values to explore and how those values will be distributed. This is called a parameter space definition and can be created with discrete or continuous ranges.
Discrete hyperparameters are specified as a choice among discrete values represented as a list.
Advanced discrete hyperparameters can also be specified using a distribution. The following distributions are supported:
quniform(low, high, q)
qloguniform(low, high, q)
qnormal(mu, sigma, q)
qlognormal(mu, sigma, q)
Continuous hyperparameters are specified as a distribution over a continuous range of values. The following distributions are supported:
uniform(low, high)
loguniform(low, high)
normal(mu, sigma)
lognormal(mu, sigma)
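To make the distribution names concrete, the following base R sketch draws samples with the semantics described in the Azure ML documentation: the log variants exponentiate a uniform or normal draw, and the q variants round a draw to a multiple of q. The `sample_*` helpers are hypothetical illustrations, not part of azuremlsdk and not the service's implementation.

```r
set.seed(42)

# Continuous distributions
sample_uniform    <- function(n, low, high) runif(n, min = low, max = high)
sample_loguniform <- function(n, low, high) exp(runif(n, min = low, max = high))
sample_normal     <- function(n, mu, sigma) rnorm(n, mean = mu, sd = sigma)
sample_lognormal  <- function(n, mu, sigma) exp(rnorm(n, mean = mu, sd = sigma))

# Quantized (discrete) variants: round each draw to the nearest multiple of q
quantize <- function(x, q) round(x / q) * q
sample_quniform <- function(n, low, high, q) quantize(sample_uniform(n, low, high), q)
sample_qnormal  <- function(n, mu, sigma, q) quantize(sample_normal(n, mu, sigma), q)

# Example draws: five values between 0 and 10 restricted to multiples of 2,
# and five values whose logarithm is uniform on [-6, -2]
sample_quniform(5, 0, 10, 2)
sample_loguniform(5, -6, -2)
```

Note that for the log variants, low and high bound the logarithm of the sampled value, not the value itself.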
Here, we will use the random_parameter_sampling() function to define the search space for each hyperparameter. batch_size and epochs will be chosen from discrete sets, while lr and decay will be drawn from continuous distributions. If you wish to fix a script parameter’s value, simply remove it from your sampling function list, and it will be excluded from tuning and kept at the value assigned to it in the estimator step.
Other available sampling function options are grid_parameter_sampling() and bayesian_parameter_sampling().
To prevent resource waste, Azure ML can detect and terminate poorly performing runs. HyperDrive will do this automatically if you specify an early termination policy.
Here, you will use bandit_policy(), which terminates any run whose primary metric is not within the specified slack factor of the best-performing training run.
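A minimal sketch of the policy definition. The slack factor of 0.15 is an illustrative value, chosen here only to show the call:

```r
# Terminate runs whose primary metric falls outside a 15% slack
# relative to the best run so far (0.15 is an assumed example value)
policy <- bandit_policy(slack_factor = 0.15)
```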
Other termination policy options are median_stopping_policy() and truncation_selection_policy().
If no policy is provided, all runs will continue to completion regardless of performance.
Now, you can create a HyperDriveConfig object to define your HyperDrive run. Along with the sampling and policy definitions, you need to specify the name of the primary metric that you want to track and whether to maximize or minimize it. The primary_metric_name must match the name of the primary metric you logged in your training script. max_total_runs specifies the total number of child runs to launch. See the hyperdrive_config() reference for the full set of configurable parameters.
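A sketch of the configuration, assuming the sampling definition, termination policy, and estimator from the earlier steps are stored in variables named `sampling`, `policy`, and `est` (assumed names). The metric name "Loss" matches the value logged by the training script, and a loss is minimized:

```r
hd_config <- hyperdrive_config(hyperparameter_sampling = sampling,
                               primary_metric_name = "Loss",
                               primary_metric_goal = primary_metric_goal("MINIMIZE"),
                               max_total_runs = 4,
                               policy = policy,
                               estimator = est)
```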
Finally, submit the experiment to run on your cluster. The parent HyperDrive run will launch the individual child runs. submit_experiment() returns a HyperDriveRun object that you will use to interact with the run. In this tutorial, since the cluster you created scales to a maximum of 4 nodes, all 4 child runs will be launched in parallel.
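A minimal sketch of the submission, assuming the experiment object is named `exp` and the HyperDrive configuration is named `hd_config` (both names are assumptions carried over from the earlier steps):

```r
# Returns a HyperDriveRun object for the parent run
hyperdrive_run <- submit_experiment(exp, hd_config)
```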
You can view the HyperDrive run’s details as a table. Clicking the “Web View” link provided will bring you to Azure Machine Learning studio, where you can monitor the run in the UI.
Wait until hyperparameter tuning is complete before you run more code.
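The two steps above can be sketched as follows, assuming `hyperdrive_run` is the object returned by submit_experiment() and that your installed azuremlsdk version exports plot_run_details():

```r
# Show the run details as a table, including the "Web View" link
plot_run_details(hyperdrive_run)

# Block until the parent run and its child runs have finished
wait_for_run_completion(hyperdrive_run, show_output = TRUE)
```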
Finally, you can view and compare the metrics collected during all of the child runs!
# Get the metrics of all the child runs
child_run_metrics <- get_child_run_metrics(hyperdrive_run)
child_run_metrics
# Get the child run objects sorted in descending order by the best primary metric
child_runs <- get_child_runs_sorted_by_primary_metric(hyperdrive_run)
child_runs
# Directly get the run object of the best performing run
best_run <- get_best_run_by_primary_metric(hyperdrive_run)
# Get the metrics of the best performing run
metrics <- get_run_metrics(best_run)
metrics
The metrics variable will include the values of the hyperparameters that resulted in the best performing run.