The goal of the package `mlexperiments` is to provide an extensible framework for reproducible machine learning experiments, namely:

- `mlexperiments::MLTuneParameters`, to optimize the hyperparameters in a k-fold cross-validation with one of the two strategies, grid search or Bayesian optimization (using the `ParBayesianOptimization` R package)
- `mlexperiments::MLCrossValidation`, to validate one hyperparameter setting
- `mlexperiments::MLNestedCV`, which basically combines the two experiments above to perform a hyperparameter optimization on an inner CV loop, and to validate the best hyperparameter setting on an outer CV loop

The package provides a minimal shell for these experiments, and, with few adjustments, users can prepare different learner algorithms so that they can be used with `mlexperiments`.
This vignette will go through the steps that are necessary to prepare a new learner.
In general, the learner class exposes four methods that can be defined:

- `$fit`: A wrapper around the private function `fun_fit`, which needs to be defined for every learner. The return value of this function is the fitted model.
- `$predict`: A wrapper around the private function `fun_predict`, which needs to be defined for every learner. The function must accept the three arguments `model`, `newdata`, and `ncores` and is a wrapper around the respective learner's predict-function. In order to allow the passing of further arguments, the ellipsis (`...`) can be used. The function should return the prediction results.
- `$cross_validation`: A wrapper around the private function `fun_optim_cv`, which needs to be defined when hyperparameters should be optimized with a grid search (required for use with `mlexperiments::MLTuneParameters` and `mlexperiments::MLNestedCV`).
- `$bayesian_scoring_function`: A wrapper around the private function `fun_bayesian_scoring_function`, which needs to be defined when hyperparameters should be optimized with a Bayesian process (required for use with `mlexperiments::MLTuneParameters` and `mlexperiments::MLNestedCV`).

In the following, we will go through the steps to prepare the algorithm `class::knn()` for use with `mlexperiments` (the same code is also implemented in the package and ready to use as `mlexperiments::LearnerKnn`).
## The `fit` Method

This method must take the arguments `x`, `y`, `ncores`, and `seed`, as well as the ellipsis (`...`); arguments that parameterize the learner are passed to the function via the latter. The `fit` method should include one call to fit a model of the algorithm, and it should finally return the fitted model.
```r
knn_fit <- function(x, y, ncores, seed, ...) {
  kwargs <- list(...)
  stopifnot("k" %in% names(kwargs))
  args <- kdry::list.append(list(train = x, cl = y), kwargs)
  args$prob <- TRUE
  set.seed(seed)
  fit <- do.call(class::knn, args)
  return(fit)
}
```
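As a quick, illustrative sanity check (not part of the learner code itself), `knn_fit()` can be called directly. The sketch below assumes that the packages `class` and `kdry` are installed and uses the iris data purely as a stand-in:

```r
X <- as.matrix(iris[, 1:4])
y <- iris$Species
# k and test are not formal arguments of knn_fit(); they are forwarded
# to class::knn() via the ellipsis
model <- knn_fit(x = X, y = y, ncores = 1L, seed = 123L, k = 5L, test = X)
```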
## The `predict` Method

This method must take the arguments `model`, `newdata`, and `ncores`, as well as the ellipsis (`...`). It is a wrapper around the respective algorithm's `predict()` function; specific arguments required to parameterize it can be passed via the ellipsis. The experiments `mlexperiments::MLCrossValidation` and `mlexperiments::MLNestedCV` both have the field `$predict_args` to define a list that is further passed on to the `predict` method's ellipsis. In contrast, when this method needs to be further parameterized during the hyperparameter tuning (`mlexperiments::MLTuneParameters`), those parameters must be defined within the `cross_validation` method (see below). The returned value of the `predict` method should be a vector with the predictions.
```r
knn_predict <- function(model, newdata, ncores, ...) {
  kwargs <- list(...)
  stopifnot("type" %in% names(kwargs))
  if (kwargs$type == "response") {
    return(model)
  } else if (kwargs$type == "prob") {
    # there is no knn-model but the probabilities predicted for the test data
    return(attributes(model)$prob)
  }
}
```
The implementation of `class::knn()` is special in some ways and differs from that of other algorithms. One of these peculiarities is that `class::knn()` does not return a fitted model but instead returns the predicted values directly. Depending on the value of the argument `prob`, these results also include the probabilities of the predicted classes.
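Continuing the illustrative iris sketch from above, this peculiarity can be seen directly: the "model" already holds the predictions for the cases passed as `test`, and `knn_predict()` merely selects either the classes or the `prob` attribute:

```r
# newdata has no effect here, since class::knn() already fixed the
# predictions for the `test` cases at fit time
preds <- knn_predict(model = model, newdata = X, ncores = 1L,
                     type = "response")
probs <- knn_predict(model = model, newdata = X, ncores = 1L, type = "prob")
head(preds)
head(probs)
```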
## The `cross_validation` Method

The purpose of this function is to perform a k-fold cross-validation for one specific hyperparameter setting. The function must take the arguments `x`, `y`, `params` (a list of hyperparameters), `fold_list` (to define the cross-validation folds), `ncores`, and `seed`. Finally, the function must return a named list with at least one item called `metric_optim_mean`, which contains the cross-validated error metric.
```r
knn_optimization <- function(x, y, params, fold_list, ncores, seed) {
  stopifnot(is.list(params), "k" %in% names(params))
  # initialize a dataframe to store the results
  results_df <- data.table::data.table(
    "fold" = character(0),
    "metric" = numeric(0)
  )
  # we do not need test here as it is defined explicitly below
  params[["test"]] <- NULL
  # loop over the folds
  for (fold in names(fold_list)) {
    # get row-ids of the current fold
    train_idx <- fold_list[[fold]]
    # create learner-arguments
    args <- kdry::list.append(
      list(
        x = kdry::mlh_subset(x, train_idx),
        test = kdry::mlh_subset(x, -train_idx),
        y = kdry::mlh_subset(y, train_idx),
        use.all = FALSE,
        ncores = ncores,
        seed = seed
      ),
      params
    )
    set.seed(seed)
    cvfit <- do.call(knn_fit, args)
    # optimize error rate
    FUN <- metric("ce") # nolint
    err <- FUN(
      predictions = knn_predict(
        model = cvfit,
        newdata = kdry::mlh_subset(x, -train_idx),
        ncores = ncores,
        type = "response"
      ),
      ground_truth = kdry::mlh_subset(y, -train_idx)
    )
    results_df <- data.table::rbindlist(
      l = list(results_df, list("fold" = fold, "validation_metric" = err)),
      fill = TRUE
    )
  }
  res <- list("metric_optim_mean" = mean(results_df$validation_metric))
  return(res)
}
```
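To illustrate, this function can also be called directly, outside of an experiment class. The following sketch assumes that `mlexperiments` is attached (so that `metric()` is available) and that `train_x` and `train_y` exist as created further below in this vignette:

```r
# create a fold list with in-sample row indices, as required by the function
fold_list <- splitTools::create_folds(y = train_y, k = 3, seed = 123L)
res <- knn_optimization(
  x = train_x,
  y = train_y,
  params = list(k = 20L, l = 0),
  fold_list = fold_list,
  ncores = 1L,
  seed = 123L
)
res$metric_optim_mean # mean classification error across the folds
```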
## The `bayesian_scoring_function` Method

This function can be thought of as a "gatekeeper" that takes a newly suggested hyperparameter configuration from the Bayesian process and forwards this configuration on to a call of the `cross_validation` method (see above) in order to evaluate this specific setting. However, some peculiarities must be considered in this regard:

- The function needs to take the hyperparameters that should be optimized as function arguments (I generally use the ellipsis (`...`); however, the hyperparameters can also be defined explicitly as arguments). Note that, in the example below, the objects `x`, `y`, `seed`, and `method_helper` are not function arguments; `mlexperiments` takes care that they are available in the environment in which the scoring function is evaluated (see also the `environment` and `cluster_export` fields below).
- When using `strategy = "bayesian"`, the package is configured in a way that the Bayesian process is parallelized, so parallel threads evaluate different hyperparameter settings simultaneously (see ParBayesianOptimization's Readme for more details). Therefore, the call to the `cross_validation` method must explicitly specify `ncores = 1L` so as not to request more resources than are available.
- The value returned from the Bayesian scoring function must be a named list that contains the optimization metric as the item `Score`. As described above, the value returned from `cross_validation` is already a named list that contains the optimization metric as the item `metric_optim_mean`. As this item is required later on internally by the `mlexperiments` package, its value is simply copied and saved under the new name "Score" to meet the requirements of `ParBayesianOptimization`. Note: `mlexperiments` already takes care of the direction of the optimization metric, which is handled depending on the learner's initialization argument `metric_optimization_higher_better`, so no changes should be made here to ensure correct functioning.
```r
knn_bsF <- function(...) { # nolint
  params <- list(...)
  # call to knn_optimization here with ncores = 1, since the Bayesian search
  # is parallelized already / "FUN is fitted n times in m threads"
  set.seed(seed)#, kind = "L'Ecuyer-CMRG")
  bayes_opt_knn <- knn_optimization(
    x = x,
    y = y,
    params = params,
    fold_list = method_helper$fold_list,
    ncores = 1L, # important, as bayesian search is already parallelized
    seed = seed
  )
  ret <- kdry::list.append(
    list("Score" = bayes_opt_knn$metric_optim_mean),
    bayes_opt_knn
  )
  return(ret)
}
```
More details on the package `ParBayesianOptimization` and on how to define the Bayesian scoring function can be found in its package vignette.

For the parallelization of the Bayesian process, all required functions must be exported to the cluster. To facilitate this, a simple wrapper function can be created that returns a character vector of all custom functions that are called from within the Bayesian scoring function. The following function shows the objects that need to be exported for the `LearnerKnn` to work correctly:
```r
# define the objects / functions that need to be exported to each cluster
# for parallelizing the Bayesian optimization.
knn_ce <- function() {
  c("knn_optimization", "knn_fit", "knn_predict", "metric", ".format_xy")
}
```
Finally, all of these created functions need to be integrated into a learner object. This is basically done by overwriting the placeholders in an R6 learner that inherits from `mlexperiments::MLLearnerBase`. The placeholders are:
Name | Type | Description |
---|---|---|
`private$fun_fit` | function | A function to fit a model of the respective algorithm. The function must return the fitted model. |
`private$fun_predict` | function | A function to predict the outcome in new data. The returned value of the `predict` method should be a vector with the predictions. |
`private$fun_optim_cv` | function | A function to perform a k-fold cross-validation for one hyperparameter setting. The function must return a named list with at least one item called `metric_optim_mean`, which contains the cross-validated error metric. |
`private$fun_bayesian_scoring_function` | function | A function that is defined according to the requirements of the `ParBayesianOptimization` R package. It must return a named list that contains the optimization metric as the item `Score`. |
`self$environment` | field | The environment where to search for the objects that need to be exported to a parallel cluster (required for Bayesian optimization). When the R6 learner is part of an R package, you can write the name of the R package here. Otherwise, `-1L` (the global environment) might be suitable as long as all objects that are defined in the field `cluster_export` are available from the global environment. |
`self$cluster_export` | field | A character vector with the names of objects that need to be exported to each node of a parallel cluster when performing a Bayesian optimization. |
These assignments should be done in the `initialize()` function. The following code example shows the assignment of the previously created functions to the respective functions and fields of the newly created R6 class `LearnerKnn`:
```r
LearnerKnn <- R6::R6Class( # nolint
  classname = "LearnerKnn",
  inherit = mlexperiments::MLLearnerBase,
  public = list(
    initialize = function() {
      if (!requireNamespace("class", quietly = TRUE)) {
        stop(
          paste0(
            "Package \"class\" must be installed to use ",
            "'learner = \"LearnerKnn\"'."
          ),
          call. = FALSE
        )
      }
      super$initialize(
        metric_optimization_higher_better = FALSE # classification error
      )
      private$fun_fit <- knn_fit
      private$fun_predict <- knn_predict
      private$fun_optim_cv <- knn_optimization
      private$fun_bayesian_scoring_function <- knn_bsF

      self$environment <- "mlexperiments"
      self$cluster_export <- knn_ce()
    }
  )
)
```
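At this point, the learner can already be exercised on its own via the public wrappers described at the beginning of this vignette. The following sketch is only an illustration: it assumes that `$fit` and `$predict` forward their arguments unchanged to the private functions defined above, and it again uses the iris data as a stand-in:

```r
learner <- LearnerKnn$new()
X <- as.matrix(iris[, 1:4])
y <- iris$Species
# assumed to forward k and test to knn_fit() / class::knn()
model <- learner$fit(x = X, y = y, ncores = 1L, seed = 123L, k = 5L,
                     test = X)
preds <- learner$predict(model = model, newdata = X, ncores = 1L,
                         type = "response")
table(preds, y)
```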
Please note that `metric_optimization_higher_better` is set to `FALSE` here when initializing the super-class. This is because the error rate was chosen as the optimization metric (`FUN <- metric("ce")`) when defining the `cross_validation` function above.
Now, the learner is put together and ready to be used with `mlexperiments`.

First of all, load the data, transform it into a matrix, and define the training data and the target variable.
```r
library(mlexperiments)
library(mlbench)
data("DNA")
dataset <- DNA |>
  data.table::as.data.table() |>
  na.omit()

seed <- 123
feature_cols <- colnames(dataset)[1:180]

train_x <- model.matrix(
  ~ -1 + .,
  dataset[, .SD, .SDcols = feature_cols]
)
train_y <- dataset[, get("Class")]

ncores <- ifelse(
  test = parallel::detectCores() > 4,
  yes = 4L,
  no = ifelse(
    test = parallel::detectCores() < 2L,
    yes = 1L,
    no = parallel::detectCores()
  )
)
if (isTRUE(as.logical(Sys.getenv("_R_CHECK_LIMIT_CORES_")))) {
  # on cran
  ncores <- 2L
}
```
For the Bayesian hyperparameter optimization, a grid with some hyperparameter combinations must be defined, which is used to initialize the Bayesian process. Furthermore, the borders (allowed extreme values) of the hyperparameters that are actually optimized need to be defined in a list. Finally, further arguments that are passed to the function `ParBayesianOptimization::bayesOpt()` can be defined as well.
```r
param_list_knn <- expand.grid(
  k = seq(4, 68, 8),
  l = 0,
  test = parse(text = "fold_test$x")
)

knn_bounds <- list(k = c(2L, 80L))

optim_args <- list(
  iters.n = ncores,
  kappa = 3.5,
  acq = "ucb"
)
```
Here, another peculiarity of `class::knn()` becomes visible: when fitting a model, one needs to provide the argument `test`, a matrix of test set cases. In order to have the correct test set cases selected throughout the cross-validation, this argument needs to be specified as an expression, which is then evaluated before passing the arguments on to the `fit` function.

Generally speaking, this is a feature implemented in `mlexperiments`: when specifying an expression as a learner argument (either via the R6 classes' fields `learner_args` or `parameter_grid`), this expression is evaluated before passing the argument list on to the fitting functions.
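As an aside, a minimal sketch of this mechanism (the list below is illustrative only and not used elsewhere in this vignette; `fold_test$x` is the same internal object referenced in the parameter grid above):

```r
# illustrative only: an expression as a learner argument; mlexperiments
# evaluates it right before the argument list reaches the fit function,
# so that `test` always contains the current fold's test cases
example_learner_args <- list(
  k = 20L,
  l = 0,
  test = parse(text = "fold_test$x")
)
```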
In order to execute the parameter tuning, the created objects need to be assigned to the corresponding fields of the R6 class `mlexperiments::MLTuneParameters`:
```r
knn_tune_bayesian <- mlexperiments::MLTuneParameters$new(
  learner = LearnerKnn$new(),
  strategy = "bayesian",
  ncores = ncores,
  seed = seed
)

knn_tune_bayesian$parameter_bounds <- knn_bounds
knn_tune_bayesian$parameter_grid <- param_list_knn
knn_tune_bayesian$split_type <- "stratified"
knn_tune_bayesian$optim_args <- optim_args

# set data
knn_tune_bayesian$set_data(
  x = train_x,
  y = train_y
)

results <- knn_tune_bayesian$execute(k = 3)
#>
#> Registering parallel backend using 4 cores.

head(results)
#>    Epoch setting_id  k gpUtility acqOptimum inBounds Elapsed      Score metric_optim_mean errorMessage l
#> 1:     0          1  4        NA      FALSE     TRUE   2.153 -0.2247332         0.2247332           NA 0
#> 2:     0          2 12        NA      FALSE     TRUE   2.274 -0.1600753         0.1600753           NA 0
#> 3:     0          3 20        NA      FALSE     TRUE   2.006 -0.1381042         0.1381042           NA 0
#> 4:     0          4 28        NA      FALSE     TRUE   2.329 -0.1403013         0.1403013           NA 0
#> 5:     0          5 36        NA      FALSE     TRUE   2.109 -0.1315129         0.1315129           NA 0
#> 6:     0          6 44        NA      FALSE     TRUE   2.166 -0.1258632         0.1258632           NA 0
```
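The best hyperparameter setting identified by the tuning can afterwards be retrieved from the `$results` field. This is only a pointer, assuming the Bayesian tuner exposes the same `best.setting` item that is used for the grid tuner further below:

```r
# retrieve the best setting found during the Bayesian tuning
knn_tune_bayesian$results$best.setting
```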
To carry out the hyperparameter optimization with a grid search, only the `parameter_grid` is required:
```r
knn_tune_grid <- mlexperiments::MLTuneParameters$new(
  learner = LearnerKnn$new(),
  strategy = "grid",
  ncores = ncores,
  seed = seed
)

knn_tune_grid$parameter_grid <- param_list_knn
knn_tune_grid$split_type <- "stratified"

# set data
knn_tune_grid$set_data(
  x = train_x,
  y = train_y
)

results <- knn_tune_grid$execute(k = 3)
#>
#> Parameter settings [=====================>---------------------------------------------------------------------------] 2/9 ( 22%)
#> Parameter settings [===============================>-----------------------------------------------------------------] 3/9 ( 33%)
#> Parameter settings [==========================================>------------------------------------------------------] 4/9 ( 44%)
#> Parameter settings [=====================================================>-------------------------------------------] 5/9 ( 56%)
#> Parameter settings [================================================================>--------------------------------] 6/9 ( 67%)
#> Parameter settings [==========================================================================>----------------------] 7/9 ( 78%)
#> Parameter settings [=====================================================================================>-----------] 8/9 ( 89%)
#> Parameter settings [=================================================================================================] 9/9 (100%)

head(results)
#>    setting_id metric_optim_mean  k l
#> 1:          1         0.2187696  4 0
#> 2:          2         0.1597615 12 0
#> 3:          3         0.1349655 20 0
#> 4:          4         0.1406152 28 0
#> 5:          5         0.1318267 36 0
#> 6:          6         0.1258632 44 0
```
For the cross-validation experiments (`mlexperiments::MLCrossValidation` and `mlexperiments::MLNestedCV`), a named list with the in-sample row indices of the folds is required.
```r
fold_list <- splitTools::create_folds(
  y = train_y,
  k = 3,
  type = "stratified",
  seed = seed
)
str(fold_list)
#> List of 3
#>  $ Fold1: int [1:2124] 1 2 3 4 5 7 9 10 11 12 ...
#>  $ Fold2: int [1:2124] 1 2 3 6 8 9 11 13 16 17 ...
#>  $ Fold3: int [1:2124] 4 5 6 7 8 10 12 14 15 16 ...
```
Furthermore, a specific hyperparameter setting that should be validated with the cross-validation needs to be selected:
```r
knn_cv <- mlexperiments::MLCrossValidation$new(
  learner = LearnerKnn$new(),
  fold_list = fold_list,
  seed = seed
)

best_grid_result <- knn_tune_grid$results$best.setting
best_grid_result
#> $setting_id
#> [1] 9
#>
#> $k
#> [1] 68
#>
#> $l
#> [1] 0
#>
#> $test
#> expression(fold_test$x)

knn_cv$learner_args <- best_grid_result[-1]

knn_cv$predict_args <- list(type = "response")
knn_cv$performance_metric <- metric("bacc")
knn_cv$return_models <- TRUE

# set data
knn_cv$set_data(
  x = train_x,
  y = train_y
)

results <- knn_cv$execute()
#>
#> CV fold: Fold1
#>
#> CV fold: Fold2
#> CV progress [====================================================================>-----------------------------------] 2/3 ( 67%)
#>
#> CV fold: Fold3
#> CV progress [========================================================================================================] 3/3 (100%)
#>

head(results)
#>     fold performance  k l
#> 1: Fold1   0.8912781 68 0
#> 2: Fold2   0.8832388 68 0
#> 3: Fold3   0.8657147 68 0
```
Last but not least, the hyperparameter optimization and validation can be combined in a nested cross-validation. In each fold of the so-called "outer" cross-validation loop, the hyperparameters are optimized on the in-sample observations with one of the two strategies: Bayesian optimization or grid search. Both of these strategies are again implemented with a "nested" ("inner") cross-validation. The best hyperparameter setting as identified by the inner cross-validation is then used to fit a model on all in-sample observations of the outer cross-validation loop, which is finally validated on the respective out-of-sample observations.
The experiment classes must be parameterized as described above.
```r
knn_cv_nested_bayesian <- mlexperiments::MLNestedCV$new(
  learner = LearnerKnn$new(),
  strategy = "bayesian",
  fold_list = fold_list,
  k_tuning = 3L,
  ncores = ncores,
  seed = seed
)

knn_cv_nested_bayesian$parameter_grid <- param_list_knn
knn_cv_nested_bayesian$parameter_bounds <- knn_bounds
knn_cv_nested_bayesian$split_type <- "stratified"
knn_cv_nested_bayesian$optim_args <- optim_args

knn_cv_nested_bayesian$predict_args <- list(type = "response")
knn_cv_nested_bayesian$performance_metric <- metric("bacc")

# set data
knn_cv_nested_bayesian$set_data(
  x = train_x,
  y = train_y
)

results <- knn_cv_nested_bayesian$execute()
#>
#> CV fold: Fold1
#>
#> Registering parallel backend using 4 cores.
#>
#> CV fold: Fold2
#> CV progress [====================================================================>-----------------------------------] 2/3 ( 67%)
#>
#> Registering parallel backend using 4 cores.
#>
#> CV fold: Fold3
#> CV progress [========================================================================================================] 3/3 (100%)
#>
#> Registering parallel backend using 4 cores.

head(results)
#>     fold performance  k l
#> 1: Fold1   0.8912781 68 0
#> 2: Fold2   0.8832388 68 0
#> 3: Fold3   0.8657147 68 0
```
```r
knn_cv_nested_grid <- mlexperiments::MLNestedCV$new(
  learner = LearnerKnn$new(),
  strategy = "grid",
  fold_list = fold_list,
  k_tuning = 3L,
  ncores = ncores,
  seed = seed
)

knn_cv_nested_grid$parameter_grid <- param_list_knn
knn_cv_nested_grid$split_type <- "stratified"

knn_cv_nested_grid$predict_args <- list(type = "response")
knn_cv_nested_grid$performance_metric <- metric("bacc")

# set data
knn_cv_nested_grid$set_data(
  x = train_x,
  y = train_y
)

results <- knn_cv_nested_grid$execute()
#>
#> CV fold: Fold1
#>
#> Parameter settings [=====================>---------------------------------------------------------------------------] 2/9 ( 22%)
#> Parameter settings [===============================>-----------------------------------------------------------------] 3/9 ( 33%)
#> Parameter settings [==========================================>------------------------------------------------------] 4/9 ( 44%)
#> Parameter settings [=====================================================>-------------------------------------------] 5/9 ( 56%)
#> Parameter settings [================================================================>--------------------------------] 6/9 ( 67%)
#> Parameter settings [==========================================================================>----------------------] 7/9 ( 78%)
#> Parameter settings [=====================================================================================>-----------] 8/9 ( 89%)
#> Parameter settings [=================================================================================================] 9/9 (100%)
#> CV fold: Fold2
#> CV progress [====================================================================>-----------------------------------] 2/3 ( 67%)
#>
#> Parameter settings [=====================>---------------------------------------------------------------------------] 2/9 ( 22%)
#> Parameter settings [===============================>-----------------------------------------------------------------] 3/9 ( 33%)
#> Parameter settings [==========================================>------------------------------------------------------] 4/9 ( 44%)
#> Parameter settings [=====================================================>-------------------------------------------] 5/9 ( 56%)
#> Parameter settings [================================================================>--------------------------------] 6/9 ( 67%)
#> Parameter settings [==========================================================================>----------------------] 7/9 ( 78%)
#> Parameter settings [=====================================================================================>-----------] 8/9 ( 89%)
#> Parameter settings [=================================================================================================] 9/9 (100%)
#> CV fold: Fold3
#> CV progress [========================================================================================================] 3/3 (100%)
#>
#> Parameter settings [=====================>---------------------------------------------------------------------------] 2/9 ( 22%)
#> Parameter settings [===============================>-----------------------------------------------------------------] 3/9 ( 33%)
#> Parameter settings [==========================================>------------------------------------------------------] 4/9 ( 44%)
#> Parameter settings [=====================================================>-------------------------------------------] 5/9 ( 56%)
#> Parameter settings [================================================================>--------------------------------] 6/9 ( 67%)
#> Parameter settings [==========================================================================>----------------------] 7/9 ( 78%)
#> Parameter settings [=====================================================================================>-----------] 8/9 ( 89%)
#> Parameter settings [=================================================================================================] 9/9 (100%)

head(results)
#>     fold performance  k l
#> 1: Fold1   0.8959736 52 0
#> 2: Fold2   0.8832388 68 0
#> 3: Fold3   0.8657147 68 0
```