5. HYPERPARAMETERS AND VALIDATION SETS
Most machine learning algorithms have several settings that we can use to control
the behavior of the learning algorithm. These settings are called
hyperparameters.
The values of hyperparameters are not adapted by the learning algorithm itself.
The degree of the polynomial, which acts as a capacity hyperparameter, is one example.
The λ value used to control the strength of weight decay is another example of
a hyperparameter.
Sometimes a setting is chosen to be a hyperparameter that the learning
algorithm does not learn because it is difficult to optimize.
More frequently, the setting must be a hyperparameter because it is not
appropriate to learn that hyperparameter on the training set. This applies to all
hyperparameters that control model capacity.
If learned on the training set, such hyperparameters would always choose the
maximum possible model capacity, resulting in overfitting.
For example,
we can always fit the training set better with a higher degree polynomial and a
weight decay setting of λ = 0 than we could with a lower degree polynomial and a
positive weight decay setting.
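A minimal numpy sketch of this effect is below; the toy data, the degrees, and the λ values are illustrative assumptions, not anything prescribed by the text. Training error never worsens as the degree grows, and any positive weight decay can only raise it.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=20)                  # toy inputs (assumed)
y = np.sin(3.0 * x) + 0.1 * rng.standard_normal(20)  # toy targets (assumed)

def train_mse(degree, lam):
    """Closed-form ridge (weight decay) fit of a polynomial; returns training MSE."""
    X = np.vander(x, degree + 1)                      # polynomial features
    w = np.linalg.solve(X.T @ X + lam * np.eye(degree + 1), X.T @ y)
    return np.mean((X @ w - y) ** 2)

for degree in (1, 3, 9):
    print(degree,
          train_mse(degree, lam=0.0),   # no weight decay: lowest training error
          train_mse(degree, lam=1.0))   # weight decay raises training error
```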
To solve this problem, we need a validation set of examples that the training
algorithm does not observe.
It is important that the test examples are not used in any way to make choices
about the model, including its hyperparameters.
For this reason, no example from the test set can be used in the validation set.
Therefore, we always construct the validation set from the training data.
Specifically, we split the training data into two disjoint subsets.
One of these subsets is used to learn the parameters.
The other subset is our validation set, used to estimate the generalization error
during or after training, allowing for the hyperparameters to be updated
accordingly.
The subset of data used to learn the parameters is still typically called the
training set, even though this may be confused with the larger pool of data used
for the entire training process.
The subset of data used to guide the selection of hyperparameters is called the
validation set.
Typically one uses about 80% of the training data for training and 20% for
validation.
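As a concrete illustration, here is one way such a split might look in numpy; the function name and array conventions are assumptions for the sketch, not part of the text.

```python
import numpy as np

def split_train_val(X, y, val_fraction=0.2, seed=0):
    """Randomly partition (X, y) into disjoint training and validation subsets."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))        # shuffle so the split is random
    n_val = int(len(X) * val_fraction)   # e.g. 20% held out for validation
    val_idx, train_idx = idx[:n_val], idx[n_val:]
    return X[train_idx], y[train_idx], X[val_idx], y[val_idx]

# X_tr, y_tr, X_val, y_val = split_train_val(X_train_all, y_train_all)
```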
Since the validation set is used to “train” the hyperparameters, the validation set
error will underestimate the generalization error, though typically by a smaller
amount than the training error.
After all hyperparameter optimization is complete, the generalization error may
be estimated using the test set.
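Putting the pieces together, a sketch of the overall workflow might look as follows. The `fit` and `mse` helpers and the candidate λ grid are hypothetical stand-ins, and the arrays are assumed to come from a split like the one above plus a held-out test set.

```python
import numpy as np

def fit(X, y, lam):
    """Hypothetical learner: closed-form ridge (weight decay) regression."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

def mse(w, X, y):
    return np.mean((X @ w - y) ** 2)

best_lam, best_err = None, float("inf")
for lam in (0.0, 0.01, 0.1, 1.0):        # candidate weight-decay settings (assumed)
    w = fit(X_tr, y_tr, lam)             # parameters learned on the training subset
    err = mse(w, X_val, y_val)           # validation error guides the choice
    if err < best_err:
        best_lam, best_err = lam, err

w_final = fit(X_tr, y_tr, best_lam)
test_err = mse(w_final, X_test, y_test)  # consulted only once, after all tuning
```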
Cross-Validation
Dividing the dataset into a fixed training set and a fixed test set can be
problematic if it results in the test set being small.
A small test set implies statistical uncertainty around the estimated average
test error, making it difficult to claim that algorithm A works better than
algorithm B on the given task.
When the dataset has hundreds of thousands of examples or more, this is not
a serious issue.
When the dataset is too small, alternative procedures enable one to use
all of the examples in the estimation of the mean test error, at the price of
increased computational cost.
These procedures are based on the idea of repeating the training and testing
computation on different randomly chosen subsets or splits of the original
dataset.
The most common of these is the k-fold cross-validation procedure: the dataset is partitioned into k non-overlapping subsets, and the test error is estimated by averaging over k trials, where trial i uses the i-th subset as the held-out test set and the remaining data for training.
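A minimal sketch of that procedure, assuming `fit` and `mse` helpers like those above; every example lands in the held-out split of exactly one fold, so all examples contribute to the estimate of the mean test error.

```python
import numpy as np

def k_fold_error(X, y, k=5, lam=0.0, seed=0):
    """Average held-out error over k non-overlapping splits of (X, y)."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, k)        # k roughly equal, disjoint subsets
    errors = []
    for i in range(k):
        held_out = folds[i]               # fold i plays the role of the test set
        train = np.concatenate([folds[j] for j in range(k) if j != i])
        w = fit(X[train], y[train], lam)
        errors.append(mse(w, X[held_out], y[held_out]))
    return np.mean(errors)                # estimate of the mean test error
```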