# Learning from data

This page is designed to explain how ?lpdf objects can be used for automated learning of hyperparameters and general fitting of predictors.

library(outerbase)

Let’s set things up for a 3 dimensional example.

om = new(outermod)
d = 3
setcovfs(om, rep("mat25pow",d))
knotlist = list(seq(0.01,0.99,by=0.02),
seq(0.01,0.99,by=0.02),
seq(0.01,0.99,by=0.02))
setknot(om, knotlist)

## Hyperparameter impact

The values of covariance function hyperparameters are extremely important for successful near-interpolation of surfaces. This impact is felt in outerbase and this section is designed to illustrate that.

Consider the first four basis functions plotted below.

sampsize = 30
design1d = seq(1/(2*sampsize),1-1/(2*sampsize),1/sampsize)
x = cbind(design1d,sample(design1d),sample(design1d))
ob = new(outerbase, om, x)
basis_func0 = ob$getbase(1) matplot(x[,1],basis_func0[,1:4], type='l', ylab="func", xlab="first dim") The hyperparameters will now be changed in a way that we know will change these basis functions. Note that ob$build is required after updating the hyperparameters for it to take effect as ob does not know when om is updated.

hyp0 = gethyp(om)
hyp0 = 3 #changing the power on first parameter
om$updatehyp(hyp0) ob$build() #rebuild after updatehyp

This leads to an asymmetric basis function set for this first dimension because of the power transform in ?covf_mat25pow.

basis_func1 = ob$getbase(1) matplot(x[,1],basis_func0[,1:4], type='l', ylab="func", xlab="first dim", main="original hyperparameters") matplot(x[,1],basis_func1[,1:4], type='l', ylab="func", xlab="first dim", main="new hyperparameters")  ## lpdf for learning The core building block for outerbase learning is the base class ?lpdf, log probability density functions. This base class forms the backbone behind learning using statistical models. Instances of this class allow us to optimize coefficients, infer on uncertainty and learn hyperparameters of covariance functions. A small dataset can illustrate (almost) all core concepts related to lpdf. y = obtest_borehole3d(x) The length of the hyperparameters is 2 dimensions for each covariance function for a total of 6 hyperparameters. gethyp(om) #> inpt1.scale inpt1.power inpt2.scale inpt2.power inpt3.scale inpt3.power #> 0 3 0 0 0 0 hyp0 = c(-0.5,0,-0.5,0,-0.5,0) om$updatehyp(hyp0)

We will use 60 terms to build our approximation.

terms = om$selectterms(60) The idea is to build a loglik object that represent that log likelihood of our data given the model and coefficients. We will begin with ?loglik_std, although this model is not recommended for speed reasons. We can initialize it and check that we can get gradients with respect to coefficients, covariance hyperparameters, and parameters of the lpdf object itself. loglik = new(loglik_std, om, terms, y, x) coeff0 = rep(0,loglik$nterms)
loglik$update(coeff0) # update it to get gradients loglik$val
#>  -160.7842
head(loglik$grad) # dim 60 for number of coeffients #> [,1] #> [1,] 0.1046944 #> [2,] -1.0283251 #> [3,] -0.3458546 #> [4,] 2.9306071 #> [5,] -0.4560907 #> [6,] -0.1768621 A reasonable statistical model also needs prior on the coefficients. This tells us what distribution we expect on the coefficients. logpr = new(logpr_gauss, om, terms) To make the handling of these two objects loglik and logpr easier, the ?lpdfvec class is helpful to tie the objects together. It will share the hyperparameter vector between them, so they need to be based on the same outermod object. But it will concatenate the parameters. logpdf = new(lpdfvec, loglik, logpr) para0 = getpara(logpdf) para0 #> noisescale coeffscale #> 3.035664 6.000000 para0 = 4 logpdf$updatepara(para0)
getpara(logpdf)
#> noisescale coeffscale
#>   3.035664   4.000000

The coefficients coeff are considered ancillary parameters that need to be optimized out (or something more sophisticated, hint on current research). For this class, it is easiest to do this via lpdf$optnewton, which takes a single Newton step to optimize the coefficients. logpdf$optnewton()

Some test data will help illustrate prediction.

testsampsize = 1000
xtest = matrix(runif(testsampsize*d),ncol=d)
ytest = obtest_borehole3d(xtest)

We can see the predictive accuracy using the ?predictor class which automatically pulls correct information out of loglik to design predictions.

predt = new(predictor,loglik)
predt$update(xtest) yhat = as.vector(predt$mean())
varpred = as.vector(predt$var()) plot(yhat,ytest, xlab="prediction", ylab="actual") hist((ytest-yhat)/sqrt(varpred), main="standarized test residuals", xlab = "standarized test residuals")  ## lpdf and hyperparameters The main value in this approach is the automated pulling of important gradients related to covariance hyperparameters and model parameters. logpdf$optnewton()
logpdf$gradhyp # dim 6 for all hyperparameter #> [,1] #> [1,] -1.2753801 #> [2,] 0.1352539 #> [3,] 7.4960689 #> [4,] 0.8817342 #> [5,] 8.5644385 #> [6,] 0.9453822 logpdf$gradpara   # dim 2 since 2 parameters
#>            [,1]
#> [1,] -18.283833
#> [2,]  -3.063965

This allows us to use custom functions to learn these hyperparameters to give maximum predictive power. The goal right now is a single point estimate of these hyperparameters. One has to be very careful to keep these in good ranges, and the call below will make sure to return -inf if there is a problem with the chosen hyperparameters.

totobj = function(parlist) { #my optimization function for tuning
regpara = logpdf$paralpdf(parlist$para) # get regularization for lpdf
reghyp = om$hyplpdf(parlist$hyp) # get regularization for om
if(is.finite(regpara) && is.finite(reghyp)) { # if they pass
om$updatehyp(parlist$hyp)        # update hyperparameters
logpdf$updateom() # update the outerbase inside logpdf$updatepara(parlist$para) # update parameter logpdf$optnewton()            # do opt

gval = parlist #match structure
gval$hyp = -logpdf$gradhyp-om$hyplpdf_grad(parlist$hyp)
gval$para = -logpdf$gradpara-logpdf$paralpdf_grad(parlist$para)
list(val = -logpdf$val-reghyp-regpara, gval = gval) } else list(val = Inf, gval = parlist) } This works by querying the objects themselves to check if the parameters are reasonable. This package provides a custom deployment of BFGS in ?BFGS_std to optimize functions like this. parlist = list(para = getpara(logpdf), hyp = gethyp(om)) totobj(parlist) #>$val
#>  86.02803
#>
#> $gval #>$gval$para #> [,1] #> [1,] 18.283833 #> [2,] 2.563965 #> #>$gval$hyp #> [,1] #> [1,] -4.0817628 #> [2,] -0.1352539 #> [3,] -12.8532117 #> [4,] -0.8817342 #> [5,] -13.9215814 #> [6,] -0.9453822 opth = BFGS_std(totobj, parlist, verbose=3) # #> #> ########started BFGS####### #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 0 86.028 NA NA 0.1 #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 1 83.146 -2.88146 -45.2956 0.1 #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 2 78.1594 -4.98609 -0.235414 0.50357 #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 3 60.5026 -17.6542 -31.1706 0.539329 #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 4 55.385 -5.11704 -2.77443 0.573679 #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 5 46.786 -8.59786 -7.32502 0.606459 #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 6 38.3531 -8.43183 -5.27123 0.637561 #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 7 33.1031 -5.24912 -9.31017 0.666913 #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 8 32.2324 -0.870293 -9.3074 0.694484 #> iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate #> 9 31.5246 -0.707647 -1.5764 0.720271 #> num iter: 10 obj start: 86.02803 obj end: 31.49121 #> final learning rate: 0.7442976 #> approx lower bound (not achieved): 31.43811 #> #########finished BFGS######## Then we just have to update out parameters and re-optimize, and we can check that we are at least must closer to stationary point. totobj(opth$parlist)
#> $val #>  31.49121 #> #>$gval
#> $gval$para
#>            [,1]
#> [1,] -0.5105922
#> [2,] -1.4425483
#>
#> $gval$hyp
#>             [,1]
#> [1,]  0.02649228
#> [2,]  0.06837665
#> [3,] -0.40623766
#> [4,]  0.06382342
#> [5,] -0.38065302
#> [6,]  0.19144968

These steps can all be nicely wrapped up in another function ?BFGS_lpdf, which is a simpler call with the same result. Note because of some things not fully understood, these numbers will not match exactly above, but they will be quite close and functionally the same.

opth = BFGS_lpdf(om, logpdf,
parlist=parlist,
verbose = 3, newt= TRUE)
#>
#> ########started BFGS#######
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        0    86.028           NA           NA           0.1
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        1    83.146     -2.88146     -45.2956           0.1
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        2   78.1594     -4.98609    -0.235414       0.50357
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        3   60.5026     -17.6542     -31.1706      0.539329
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        4    55.385     -5.11704     -2.77443      0.573679
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        5    46.786     -8.59786     -7.32502      0.606459
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        6   38.3531     -8.43183     -5.27123      0.637561
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        7   33.1031     -5.24912     -9.31017      0.666913
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        8   32.2324    -0.870293      -9.3074      0.694484
#>  iter.no obj.value wolfe.cond.1 wolfe.cond.2 learning.rate
#>        9   31.5246    -0.707647      -1.5764      0.720271
#> num iter: 10  obj start: 86.02803  obj end: 31.49121
#> final learning rate: 0.7442976
#> approx lower bound (not achieved): 31.43811
#> #########finished BFGS########

The revised predictions are then built after hyperparameter optimization where we find improved predictive accuracy in nearly every category. Here we can see a much better alignment between predictions and actual alongside a better plot of standardized residuals (more like a standard normal distribution).

predtt = new(predictor,loglik)
predtt$update(xtest) yhat = as.vector(predtt$mean())
varpred = as.vector(predtt\$var())

plot(ytest,yhat)
hist((ytest-yhat)/sqrt(varpred), main="standarized test residuals",
xlab = "standarized test residuals")  