Suggested Protocol

Here we provide a suggested protocol for benchmarking deep learning optimizers more rigorously. Some of the steps are discussed in the DeepOBS paper; others were derived in the Master's thesis of Aaron Bahde.

Decide on a Framework

DeepOBS versions >= 1.2.0 support both TensorFlow and PyTorch. We ran some basic experiments to check whether the two frameworks can be used interchangeably. For now, we strongly recommend NOT comparing benchmarks (with DeepOBS) across frameworks. Currently, we only provide baselines for PyTorch.

You can choose between PyTorch and TensorFlow by switching the import statements:

# for example import the standard runner from the pytorch submodule
from deepobs.pytorch.runners import StandardRunner
# or from the tensorflow submodule
from deepobs.tensorflow.runners import StandardRunner

Create a New Run Script

To benchmark a new optimization method, a new run script has to be written. A more detailed description can be found in the Simple Example and in the API sections for TensorFlow (Standard Runner) and PyTorch (Standard Runner). Essentially, all that is needed is the optimizer class itself and a list of its hyperparameters. For the Momentum optimizer in TensorFlow, for example, this looks like:

"""Example run script using StandardRunner."""

import tensorflow as tf

from deepobs import tensorflow as tfobs

optimizer_class = tf.train.MomentumOptimizer
hyperparams = {
    "learning_rate": {"type": float},
    "momentum": {"type": float, "default": 0.99},
    "use_nesterov": {"type": bool, "default": False},
}

runner = tfobs.runners.StandardRunner(optimizer_class, hyperparams)
runner.run()

And in PyTorch:

"""Example run script using StandardRunner."""

from torch.optim import SGD

from deepobs import pytorch as pt

optimizer_class = SGD
hyperparams = {
    "lr": {"type": float},
    "momentum": {"type": float, "default": 0.99},
    "nesterov": {"type": bool, "default": False},
}

runner = pt.runners.StandardRunner(optimizer_class, hyperparams)
runner.run()

(Possibly) Write Your Own Runner

You should first try to run your optimizer with one of the implemented runner classes. If this does not work, e.g. because your optimizer needs additional access to the training loop, you have to write your own runner class. We provide a description of how to do this: How to Write Customized Runner
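
As a rough orientation, a customized PyTorch runner is a subclass of the runner base class that implements its own training loop. The sketch below only illustrates the overall shape; the base class import path, the training signature, and the required output dictionary are assumptions here and should be taken from the How to Write Customized Runner guide:

"""Sketch of a customized runner (names and signature are assumptions)."""

from deepobs.pytorch.runners.runner import PTRunner  # assumed base class location


class MyCustomRunner(PTRunner):
    def __init__(self, optimizer_class, hyperparameter_names):
        super(MyCustomRunner, self).__init__(optimizer_class, hyperparameter_names)

    def training(self, tproblem, hyperparams, num_epochs, **training_params):
        # Assumed hook: implement your own training loop here, with whatever
        # additional access to the test problem your optimizer needs, and
        # return the per-epoch metrics dictionary that DeepOBS expects
        # (see the guide for the exact keys).
        raise NotImplementedError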

Identify Tunable Hyperparameters

We suggest deciding which hyperparameters of your optimizer need to be tuned before starting the benchmark. For every test problem you should tune exactly the same hyperparameters, with the same resources and the same tuning method. This avoids overfitting the hyperparameters to specific test problems.
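
One way to make this choice explicit is in the run script itself: expose only the hyperparameters you intend to tune and fix all others via defaults. The concrete split below (tuning only the learning rate while fixing momentum and Nesterov) is merely an illustration:

from torch.optim import SGD

from deepobs.pytorch.runners import StandardRunner

optimizer_class = SGD
hyperparams = {
    "lr": {"type": float},  # tuned on every test problem
    "momentum": {"type": float, "default": 0.9},  # fixed, not tuned
    "nesterov": {"type": bool, "default": False},  # fixed, not tuned
}

runner = StandardRunner(optimizer_class, hyperparams)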

Decide on a Tuning Method

We provide three tuning classes in DeepOBS. You should use one of them:

# Grid Search
# Bayesian optimization with a Gaussian process surrogate
# Random Search
from deepobs.tuner import GP, GridSearch, RandomSearch

Ideally, you use the same tuning method that we used for the baselines. At the moment this is grid search.

Specify the Tuning Domain

Prospective users of your optimizer expect you to provide information about how to tune it in any application. Therefore, you should provide promising search domains. They should be the same for all test problems, since users do not know the link between your optimizer's hyperparameters and their application. In DeepOBS, you can use the tuning specifications of each tuner class. This is an example for the Momentum optimizer in PyTorch:

import numpy as np
from scipy.stats.distributions import binom, uniform
from torch.optim import SGD

from deepobs.pytorch.runners import StandardRunner
from deepobs.tuner import GP, GridSearch, RandomSearch
from deepobs.tuner.tuner_utils import log_uniform

# define optimizer
optimizer_class = SGD
hyperparams = {
    "lr": {"type": float},
    "momentum": {"type": float},
    "nesterov": {"type": bool},
}

### Grid Search ###
# The discrete values to construct a grid for.
grid = {
    "lr": np.logspace(-5, 2, 6),
    "momentum": [0.5, 0.7, 0.9],
    "nesterov": [False, True],
}

# Make sure to set the number of resources to the grid size. For grid search, this is just a sanity check.
tuner = GridSearch(
    optimizer_class,
    hyperparams,
    grid,
    runner=StandardRunner,
    ressources=6 * 3 * 2,
)

### Random Search ###
# Define the distributions to sample from
distributions = {
    "lr": log_uniform(-5, 2),
    "momentum": uniform(0.5, 0.5),
    "nesterov": binom(1, 0.5),
}

# Allow 36 random evaluations.
tuner = RandomSearch(
    optimizer_class,
    hyperparams,
    distributions,
    runner=StandardRunner,
    ressources=36,
)

### Bayesian Optimization ###
# The bounds for the suggestions
bounds = {"lr": (-5, 2), "momentum": (0.5, 1), "nesterov": (0, 1)}


# Corresponds to rescaling the kernel in log space.
def lr_transform(lr):
    return 10 ** lr


# Nesterov is discrete but will be suggested as a continuous value.
def nesterov_transform(nesterov):
    return bool(round(nesterov))


# The transformations of the search space. The momentum parameter does not need a transformation.
transformations = {"lr": lr_transform, "nesterov": nesterov_transform}

tuner = GP(
    optimizer_class,
    hyperparams,
    bounds,
    runner=StandardRunner,
    ressources=36,
    transformations=transformations,
)

Bound the Tuning Resources

The tuning of your optimizer's hyperparameters should never use more instances than were used for the baselines; fewer is always better. For our current baselines, we used 20 instances for each optimizer on each test problem. Use the ressources argument in the tuner class instantiation to limit them.
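
For example, a random search that stays within the baseline budget of 20 instances per test problem could look like this (a sketch reusing the definitions from the tuning example above):

from torch.optim import SGD

from deepobs.pytorch.runners import StandardRunner
from deepobs.tuner import RandomSearch
from deepobs.tuner.tuner_utils import log_uniform

optimizer_class = SGD
hyperparams = {"lr": {"type": float}}

# Sample the learning rate log-uniformly, as in the example above.
distributions = {"lr": log_uniform(-5, 2)}

# ressources=20 caps the search at the baseline budget.
tuner = RandomSearch(
    optimizer_class,
    hyperparams,
    distributions,
    runner=StandardRunner,
    ressources=20,
)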

Report Stochasticity

To get an understanding of the robustness of the optimizer against training noise, we recommend rerunning the best hyperparameter instance of your optimizer with 10 different random seeds. The tuning classes can automatically take care of this:

import numpy as np
from torch.optim import SGD

from deepobs.pytorch.runners import StandardRunner
from deepobs.tuner import GridSearch

# define optimizer
optimizer_class = SGD
hyperparams = {"lr": {"type": float}}

### Grid Search ###
# The discrete values to construct a grid for.
grid = {"lr": np.logspace(-5, 2, 6)}

# init tuner class
tuner = GridSearch(
    optimizer_class, hyperparams, grid, runner=StandardRunner, ressources=6
)

# tune on the quadratic test problem and automatically rerun the best instance with 10 different seeds.
tuner.tune("quadratic_deep", rerun_best_setting=True)

Run on a Variety of Test Problems

Benchmark results can vary a lot across test problems. We recommend running your optimizer on as many test problems as possible, but (of course) focusing on the ones we use for the baselines. We provide a 'small' test set and a 'large' test set that, in our opinion, reflect a good variety of test problems. They are accessible as global variables in DeepOBS. One way to use them is to automatically tune your optimizer on each recommended test problem:

import numpy as np
from torch.optim import SGD

from deepobs.config import get_small_test_set
from deepobs.pytorch.runners import StandardRunner
from deepobs.tuner import GridSearch

# define optimizer
optimizer_class = SGD
hyperparams = {"lr": {"type": float}}

### Grid Search ###
# The discrete values to construct a grid for.
grid = {"lr": np.logspace(-5, 2, 6)}

# init tuner class
tuner = GridSearch(
    optimizer_class, hyperparams, grid, runner=StandardRunner, ressources=6
)

# get the small test set and automatically tune on each of the contained test problems
small_testset = get_small_test_set()
tuner.tune_on_testset(
    small_testset, rerun_best_setting=True
)  # kwargs are passed on to the tune() method

Plot Results

To visualize the final results, use the Analyzer API. We recommend including a plot of the hyperparameter sensitivity and plotting your optimizer's performance against the baselines:

from deepobs.analyzer.analyze import (plot_hyperparameter_sensitivity,
                                      plot_optimizer_performance)

# plot your optimizer against baselines
plot_optimizer_performance(
    "/<path to your results folder>/<test problem>/<your optimizer>",
    reference_path="<path to the baselines>/<test problem>/SGD",
)

# plot the hyperparameter sensitivity (here we use the learning rate sensitivity of the SGD baseline)
plot_hyperparameter_sensitivity(
    "<path to the baselines>/<test problem>/SGD",
    hyperparam="lr",
    xscale="log",
    plot_std=True,
)

Report Measures for Speed

DeepOBS measures the speed of your optimizer as the fraction of epochs it needs to reach the convergence performance of the baselines. This measure is included automatically in the overview table generated by the Analyzer. Additionally, you can calculate an estimate of the wall-clock time overhead compared to SGD. More details can be found in the DeepOBS paper.

from torch.optim import Adam

from deepobs.analyzer.analyze import estimate_runtime, plot_results_table
from deepobs.pytorch.runners import StandardRunner

# plot the overview table which contains the speed measure for iterations
plot_results_table(
    "<path to your results>",
    conv_perf_file="<path to the convergence performance file of the baselines>",
)

# briefly run your optimizer against SGD to estimate wall-clock time overhead, here we use Adam as an example
estimate_runtime(
    framework="pytorch",
    runner_cls=StandardRunner,
    optimizer_cls=Adam,
    optimizer_hp={"lr": {"type": float}},
    optimizer_hyperparams={"lr": 0.1},
)