Introduction to SuperML

Manish Saraswat

2019-01-04

The SuperML R package is designed to unify the model training process in R, much like Python does. People often spend a lot of time searching for packages and figuring out the syntax for training machine learning models in R; this is especially apparent in users who frequently switch between R and Python. This package provides a Python scikit-learn style interface (fit, predict) to train models faster.
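
To give a quick flavour of that interface, here is a minimal sketch of the fit/predict workflow (train_data and test_data are hypothetical data.tables with a column named "target"):

# every trainer exposes $new(), $fit() and $predict()
rf <- RFTrainer$new(n_estimators = 100)
rf$fit(X = train_data, y = "target")
preds <- rf$predict(df = test_data)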

In addition to building machine learning models, there are handy functionalities for feature engineering.
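
For example, the LabelEncoder class (used later in this tutorial) converts character columns to integers with a scikit-learn style fit/transform API. A toy sketch:

# fit learns the label-to-integer mapping, transform applies it
lbl <- LabelEncoder$new()
lbl$fit(c("red", "green", "blue", "green"))
lbl$transform(c("green", "blue"))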

This ambitious package is my ongoing effort to help the R community build ML models easily and quickly in R.

Install

You can install the latest CRAN version using (recommended):

install.packages("superml")

You can install the development version directly from GitHub using:

devtools::install_github("saraswatmks/superml")

Examples - Machine Learning Models

This package uses existing R packages to build machine learning models. In this tutorial, we’ll use the data.table R package for all data manipulation tasks.
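
If you are new to data.table, its dt[i, j, by] syntax appears throughout this tutorial; a one-line refresher on a made-up table:

library(data.table)
dt <- data.table(x = c(1, 2, 3, 4), grp = c("a", "a", "b", "b"))
dt[x > 1, .(mean_x = mean(x)), by = grp]  # filter rows, then aggregate within groups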

Regression Data

We’ll quickly prepare the data set so it’s ready to be served for model training.

load("../data/reg_train.rda")
# if the above doesn't work, you can try: load("reg_train.rda")

library(data.table)
library(caret)
#> Loading required package: lattice
#> Loading required package: ggplot2
library(superml)
#> Loading required package: R6
library(kableExtra)
library(Metrics)
#> 
#> Attaching package: 'Metrics'
#> The following objects are masked from 'package:caret':
#> 
#>     precision, recall

kable(head(reg_train, 10)) %>%
  scroll_box(width = "100%", height = "300px")
(table output: the first 10 rows of reg_train, 81 columns of house attributes from Id through SaleCondition, plus the target column SalePrice)

split <- createDataPartition(y = reg_train$SalePrice, p = 0.7)
xtrain <- reg_train[split$Resample1]
xtest <- reg_train[!split$Resample1]
# remove features with more than 90% missing values
# we will also remove the Id column because it doesn't
# contain any useful information
na_cols <- colSums(is.na(xtrain)) / nrow(xtrain)
na_cols <- names(na_cols[which(na_cols > 0.9)])

xtrain[, c(na_cols, "Id") := NULL]
xtest[, c(na_cols, "Id") := NULL]

# encode categorical variables
cat_cols <- names(xtrain)[sapply(xtrain, is.character)]

for(c in cat_cols){
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains NA values. Imputing NA with 'NA' 
#> (message repeated for each encoded column)

# remove noise columns
noise <- c('GrLivArea','TotalBsmtSF')

xtrain[, c(noise) := NULL]
xtest[, c(noise) := NULL]

# fill missing values with -1
xtrain[is.na(xtrain)] <- -1
xtest[is.na(xtest)] <- -1

KNN Regression

knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'reg')
knn$fit(train = xtrain, test = xtest, y = 'SalePrice')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type='raw')
rmse(actual = xtest$SalePrice, predicted=labels)
#> [1] 5517.411

SVM Regression

# least-squares SVM for regression (type = "ls")
svm <- SVMTrainer$new(type = "ls")
svm$fit(xtrain, 'SalePrice')
#> Removing invalid columns. The names should not start with a number: 1stFlrSF,2ndFlrSF,3SsnPorch
pred <- svm$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 34516.13

Simple Regression

lf <- LMTrainer$new(family="gaussian")
lf$fit(X = xtrain, y = "SalePrice")
summary(lf$model)
#> 
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -348560   -14786     -618    12638   217495  
#> 
#> Coefficients:
#>                 Estimate Std. Error t value Pr(>|t|)    
#> (Intercept)   -2.284e+06  1.547e+06  -1.477 0.140123    
#> MSSubClass    -8.554e+01  6.572e+01  -1.301 0.193421    
#> MSZoning       2.818e+02  1.411e+03   0.200 0.841784    
#> LotFrontage   -2.791e+01  3.368e+01  -0.829 0.407503    
#> LotArea        2.761e-01  1.509e-01   1.830 0.067623 .  
#> Street        -5.565e+04  2.055e+04  -2.708 0.006892 ** 
#> LotShape      -1.492e+03  2.002e+03  -0.745 0.456236    
#> LandContour    2.793e+03  2.143e+03   1.304 0.192665    
#> Utilities     -6.427e+04  3.343e+04  -1.923 0.054806 .  
#> LotConfig      1.502e+03  1.063e+03   1.413 0.158130    
#> LandSlope      5.832e+03  5.571e+03   1.047 0.295470    
#> Neighborhood  -2.862e+02  2.020e+02  -1.416 0.156957    
#> Condition1    -2.634e+03  8.796e+02  -2.994 0.002824 ** 
#> Condition2    -2.429e+03  4.164e+03  -0.583 0.559909    
#> BldgType      -2.294e+03  2.764e+03  -0.830 0.406867    
#> HouseStyle     2.063e+02  9.941e+02   0.208 0.835639    
#> OverallQual    1.456e+04  1.378e+03  10.564  < 2e-16 ***
#> OverallCond    6.761e+03  1.201e+03   5.630 2.38e-08 ***
#> YearBuilt      5.409e+02  7.887e+01   6.858 1.26e-11 ***
#> YearRemodAdd   8.737e+01  7.926e+01   1.102 0.270629    
#> RoofStyle      6.421e+03  1.949e+03   3.294 0.001023 ** 
#> RoofMatl      -3.760e+04  3.973e+03  -9.464  < 2e-16 ***
#> Exterior1st   -1.720e+03  6.438e+02  -2.672 0.007668 ** 
#> Exterior2nd    1.869e+03  6.179e+02   3.024 0.002562 ** 
#> MasVnrType     4.639e+03  1.612e+03   2.879 0.004082 ** 
#> MasVnrArea     2.500e+01  7.432e+00   3.364 0.000798 ***
#> ExterQual      1.575e+03  2.342e+03   0.673 0.501381    
#> ExterCond      1.249e+03  2.355e+03   0.530 0.596122    
#> Foundation    -2.028e+03  1.744e+03  -1.163 0.245210    
#> BsmtQual       6.699e+03  1.530e+03   4.379 1.33e-05 ***
#> BsmtCond      -3.194e+03  1.849e+03  -1.728 0.084363 .  
#> BsmtExposure   1.133e+03  8.694e+02   1.304 0.192677    
#> BsmtFinType1  -1.473e+03  7.050e+02  -2.090 0.036911 *  
#> BsmtFinSF1     1.929e+01  5.932e+00   3.251 0.001190 ** 
#> BsmtFinType2   1.788e+02  1.373e+03   0.130 0.896396    
#> BsmtFinSF2     1.196e+01  1.145e+01   1.044 0.296809    
#> BsmtUnfSF      7.807e+00  5.589e+00   1.397 0.162741    
#> Heating       -7.083e+02  3.262e+03  -0.217 0.828160    
#> HeatingQC     -1.448e+03  1.390e+03  -1.042 0.297795    
#> CentralAir     7.169e+03  5.572e+03   1.287 0.198490    
#> Electrical     2.958e+03  1.978e+03   1.496 0.135019    
#> `1stFlrSF`     5.727e+01  7.244e+00   7.906 7.36e-15 ***
#> `2ndFlrSF`     5.095e+01  6.095e+00   8.360 2.22e-16 ***
#> LowQualFinSF   3.968e+00  2.178e+01   0.182 0.855466    
#> BsmtFullBath   8.272e+03  2.946e+03   2.808 0.005083 ** 
#> BsmtHalfBath  -2.294e+03  4.444e+03  -0.516 0.605775    
#> FullBath       8.174e+03  3.218e+03   2.540 0.011231 *  
#> HalfBath      -6.042e+02  2.993e+03  -0.202 0.840077    
#> BedroomAbvGr  -7.474e+03  1.921e+03  -3.891 0.000107 ***
#> KitchenAbvGr  -2.916e+04  5.940e+03  -4.909 1.08e-06 ***
#> KitchenQual    9.395e+03  1.784e+03   5.267 1.71e-07 ***
#> TotRmsAbvGrd   4.822e+03  1.383e+03   3.487 0.000511 ***
#> Functional    -4.269e+03  1.419e+03  -3.007 0.002704 ** 
#> Fireplaces    -4.591e+03  2.727e+03  -1.684 0.092581 .  
#> FireplaceQu    3.733e+03  1.390e+03   2.686 0.007352 ** 
#> GarageType     1.801e+03  1.308e+03   1.376 0.169025    
#> GarageYrBlt   -1.002e+00  4.676e+00  -0.214 0.830379    
#> GarageFinish   1.446e+03  1.470e+03   0.983 0.325648    
#> GarageCars     1.411e+04  3.395e+03   4.155 3.54e-05 ***
#> GarageArea    -6.969e+00  1.137e+01  -0.613 0.540048    
#> GarageQual     2.293e+03  2.931e+03   0.782 0.434192    
#> GarageCond    -7.950e+02  2.957e+03  -0.269 0.788099    
#> PavedDrive    -1.471e+03  3.072e+03  -0.479 0.632228    
#> WoodDeckSF     3.517e+01  8.970e+00   3.921 9.44e-05 ***
#> OpenPorchSF    1.330e+01  1.682e+01   0.791 0.429295    
#> EnclosedPorch  1.675e+01  1.782e+01   0.940 0.347490    
#> `3SsnPorch`    1.839e+01  3.125e+01   0.588 0.556421    
#> ScreenPorch    8.479e+01  1.971e+01   4.302 1.87e-05 ***
#> PoolArea       1.273e+01  2.728e+01   0.467 0.640909    
#> Fence         -1.434e+03  1.355e+03  -1.058 0.290293    
#> MiscVal       -5.677e-01  1.815e+00  -0.313 0.754468    
#> MoSold         2.421e+02  3.779e+02   0.640 0.522020    
#> YrSold         4.812e+02  7.699e+02   0.625 0.532145    
#> SaleType       2.692e+03  1.213e+03   2.220 0.026647 *  
#> SaleCondition  6.574e+02  1.317e+03   0.499 0.617920    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for gaussian family taken to be 961763883)
#> 
#>     Null deviance: 6.7644e+12  on 1023  degrees of freedom
#> Residual deviance: 9.1271e+11  on  949  degrees of freedom
#> AIC: 24161
#> 
#> Number of Fisher Scoring iterations: 2
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 42085.83

Lasso Regression

lf <- LMTrainer$new(family = "gaussian", alpha=1, lambda = 1000)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 39251.53

Ridge Regression

lf <- LMTrainer$new(family = "gaussian", alpha=0)
lf$fit(X = xtrain, y = "SalePrice")
predictions <- lf$predict(df = xtest)
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 39048.85

Linear Regression with CV

lf <- LMTrainer$new(family = "gaussian")
lf$cv_model(X = xtrain, y = 'SalePrice', nfolds = 5, parallel = FALSE)
#> Computation done.
predictions <- lf$cv_predict(df = xtest)
coefs <- lf$get_importance()
rmse(actual = xtest$SalePrice, predicted = predictions)
#> [1] 39868.51

Random Forest

# classification = 0 trains the forest in regression mode
rf <- RFTrainer$new(n_estimators = 500, classification = 0)
rf$fit(X = xtrain, y = "SalePrice")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>               tmp.order.tmp..decreasing...TRUE..
#> OverallQual                         853779674918
#> GarageCars                          540979092620
#> 1stFlrSF                            524075273991
#> GarageArea                          470339867542
#> YearBuilt                           389244470841
#> FullBath                            316880539159
#> BsmtFinSF1                          302814894689
#> GarageYrBlt                         299711687944
#> TotRmsAbvGrd                        222563877576
#> 2ndFlrSF                            195898795460
#> ExterQual                           191977400578
#> LotArea                             187452249182
#> YearRemodAdd                        174720881654
#> KitchenQual                         153588077672
#> Fireplaces                          139497407084
#> FireplaceQu                         135849726989
#> BsmtQual                            130356448141
#> MasVnrArea                          116484015593
#> Foundation                          111896573727
#> LotFrontage                         102110064822
#> OpenPorchSF                          95971655823
#> BsmtFinType1                         91272652787
#> BsmtUnfSF                            72496983186
#> WoodDeckSF                           60744463856
#> BedroomAbvGr                         47606931429
#> HeatingQC                            46648901825
#> GarageType                           46294317752
#> RoofStyle                            46232079711
#> Neighborhood                         45190484978
#> Exterior2nd                          37110890894
#> HalfBath                             36835880001
#> MoSold                               36165618613
#> MSSubClass                           35471695037
#> OverallCond                          34034719437
#> HouseStyle                           32011515205
#> GarageFinish                         28439661504
#> Exterior1st                          27414958735
#> BsmtFullBath                         23095348746
#> YrSold                               23032533124
#> SaleCondition                        21796081615
#> LotShape                             20844040940
#> BsmtExposure                         20493537059
#> PoolArea                             18153017479
#> MasVnrType                           16526542389
#> LotConfig                            16440419662
#> MSZoning                             15354466600
#> ScreenPorch                          14894425924
#> CentralAir                           14083855484
#> SaleType                             13252524029
#> LandContour                          12787679270
#> BldgType                             11920439520
#> EnclosedPorch                        11833958367
#> GarageCond                           11496596053
#> BsmtCond                             10602907496
#> Condition1                            9880847832
#> KitchenAbvGr                          9506013503
#> GarageQual                            9468060341
#> Fence                                 9028409858
#> LandSlope                             8876637237
#> ExterCond                             8784784300
#> BsmtFinType2                          6324964125
#> PavedDrive                            5700637874
#> BsmtFinSF2                            5460122633
#> Functional                            4918683792
#> RoofMatl                              2946224767
#> Electrical                            2909961298
#> MiscVal                               2600616431
#> 3SsnPorch                             1714230264
#> BsmtHalfBath                          1462471611
#> LowQualFinSF                          1332849288
#> Heating                               1219512331
#> Street                                 883151232
#> Condition2                             525051591
#> Utilities                               21047446
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 32191.64

Xgboost

xgb <- XGBTrainer$new(objective = "reg:linear"
                      , n_estimators = 500
                      , eval_metric = "rmse"
                      , maximize = F
                      , learning_rate = 0.1
                      ,max_depth = 6)
xgb$fit(X = xtrain, y = "SalePrice", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:180448.046875    val-rmse:175184.421875 
#> [51] train-rmse:8857.614258  val-rmse:32693.087891 
#> [101]    train-rmse:4940.895020  val-rmse:32328.214844 
#> [151]    train-rmse:3089.641357  val-rmse:32272.939453 
#> [201]    train-rmse:1999.938965  val-rmse:32301.414062 
#> [251]    train-rmse:1348.789185  val-rmse:32299.994141 
#> [301]    train-rmse:918.838562   val-rmse:32285.113281 
#> [351]    train-rmse:624.947327   val-rmse:32284.828125 
#> [401]    train-rmse:413.633789   val-rmse:32282.865234 
#> [451]    train-rmse:297.058624   val-rmse:32280.492188 
#> [500]    train-rmse:208.659836   val-rmse:32278.873047
pred <- xgb$predict(xtest)
rmse(actual = xtest$SalePrice, predicted = pred)
#> [1] 32278.87

Grid Search

xgb <- XGBTrainer$new(objective="reg:linear")

gst <-GridSearchCV$new(trainer = xgb,
                             parameters = list(n_estimators = c(10,50), max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'))
gst$fit(xtrain, "SalePrice")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:142583.875000 
#> [10] train-rmse:14989.334961
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144377.453125 
#> [10] train-rmse:15865.090820
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143024.562500 
#> [10] train-rmse:16595.208984
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:142583.875000 
#> [50] train-rmse:3297.738525
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:144377.453125 
#> [50] train-rmse:4253.146973
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143024.562500 
#> [50] train-rmse:4584.068848
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143632.109375 
#> [10] train-rmse:30646.046875
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:145269.359375 
#> [10] train-rmse:30542.128906
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143839.015625 
#> [10] train-rmse:27930.759766
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143632.109375 
#> [50] train-rmse:17434.611328
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:145269.359375 
#> [50] train-rmse:16815.578125
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-rmse:143839.015625 
#> [50] train-rmse:15829.471680
gst$best_iteration()
#> $n_estimators
#> [1] 10
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0
#> 
#> $accuracy_sd
#> [1] 0
#> 
#> $auc_avg
#> [1] NaN
#> 
#> $auc_sd
#> [1] NA

Random Search

rf <- RFTrainer$new()
# as above, accuracy/auc are classification metrics and are not
# meaningful for this regression target
rst <- RandomSearchCV$new(trainer = rf,
                          parameters = list(n_estimators = c(10, 50),
                                            max_depth = c(5, 2)),
                          n_folds = 3,
                          scoring = c('accuracy', 'auc'),
                          n_iter = 3)
rst$fit(xtrain, "SalePrice")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#> 
#> $max_depth
#> [1] 2
#> 
#> $accuracy_avg
#> [1] 0.0127107
#> 
#> $accuracy_sd
#> [1] 0.006822147
#> 
#> $auc_avg
#> [1] NaN
#> 
#> $auc_sd
#> [1] NA

Binary Classification Data

Here, we will solve a simple binary classification problem: predicting which passengers survived the Titanic shipwreck. The idea is to demonstrate how to use this package to solve classification problems.

Data Preparation

# load classification data
load('../data/cla_train.rda')
# if the above doesn't work, you can try: load("cla_train.rda")

kable(head(cla_train, 10)) %>%
  scroll_box(width = "100%", height = "300px")
PassengerId | Survived | Pclass | Name                                                | Sex    | Age | SibSp | Parch | Ticket           | Fare    | Cabin | Embarked
1           | 0        | 3      | Braund, Mr. Owen Harris                             | male   | 22  | 1     | 0     | A/5 21171        | 7.2500  |       | S
2           | 1        | 1      | Cumings, Mrs. John Bradley (Florence Briggs Thayer) | female | 38  | 1     | 0     | PC 17599         | 71.2833 | C85   | C
3           | 1        | 3      | Heikkinen, Miss. Laina                              | female | 26  | 0     | 0     | STON/O2. 3101282 | 7.9250  |       | S
4           | 1        | 1      | Futrelle, Mrs. Jacques Heath (Lily May Peel)        | female | 35  | 1     | 0     | 113803           | 53.1000 | C123  | S
5           | 0        | 3      | Allen, Mr. William Henry                            | male   | 35  | 0     | 0     | 373450           | 8.0500  |       | S
6           | 0        | 3      | Moran, Mr. James                                    | male   | NA  | 0     | 0     | 330877           | 8.4583  |       | Q
7           | 0        | 1      | McCarthy, Mr. Timothy J                             | male   | 54  | 0     | 0     | 17463            | 51.8625 | E46   | S
8           | 0        | 3      | Palsson, Master. Gosta Leonard                      | male   | 2   | 3     | 1     | 349909           | 21.0750 |       | S
9           | 1        | 3      | Johnson, Mrs. Oscar W (Elisabeth Vilhelmina Berg)   | female | 27  | 0     | 2     | 347742           | 11.1333 |       | S
10          | 1        | 2      | Nasser, Mrs. Nicholas (Adele Achem)                 | female | 14  | 1     | 0     | 237736           | 30.0708 |       | C

# split the data
split <- createDataPartition(y = cla_train$Survived, p = 0.7)
xtrain <- cla_train[split$Resample1]
xtest <- cla_train[!split$Resample1]

# encode categorical variables - shorter way
for(c in c('Embarked','Sex','Cabin')){
    lbl <- LabelEncoder$new()
    lbl$fit(c(xtrain[[c]], xtest[[c]]))
    xtrain[[c]] <- lbl$transform(xtrain[[c]])
    xtest[[c]] <- lbl$transform(xtest[[c]])
}
#> The data contains blank values. Imputing them with 'NA' 
#> (message repeated for each encoded column)

# impute missing values
xtrain[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]
xtest[, Age := replace(Age, is.na(Age), median(Age, na.rm = T))]

# drop these features
to_drop <- c('PassengerId','Ticket','Name')

xtrain <- xtrain[,-c(to_drop), with=F]
xtest <- xtest[,-c(to_drop), with=F]

Now, our data is ready to be served for model training. Let’s do it.

KNN Classification

knn <- KNNTrainer$new(k = 2, prob = TRUE, type = 'class')
knn$fit(train = xtrain, test = xtest, y = 'Survived')
probs <- knn$predict(type = 'prob')
labels <- knn$predict(type='raw')
auc(actual = xtest$Survived, predicted=labels)
#> [1] 0.6776491

Naive Bayes Classification

nb <- NBTrainer$new()
nb$fit(xtrain, 'Survived')
pred <- nb$predict(xtest)
auc(actual = xtest$Survived, predicted=pred)
#> [1] 0.710828

SVM Classification

# predicts probabilities - must specify mc_type ("OvA_hinge", "AvA_hinge")
svm <- SVMTrainer$new(predict.prob = TRUE, type = "bc", mc_type = "OvA_hinge")
svm$fit(xtrain, 'Survived')
pred <- svm$predict(xtest)
auc(actual = xtest$Survived, predicted=pred[,2])
#> [1] 0.784916

# predicts labels
svm <- SVMTrainer$new(predict.prob = FALSE, type = "bc")
svm$fit(xtrain, 'Survived')
pred <- svm$predict(xtest)
auc(actual = xtest$Survived, predicted=pred)
#> [1] 0.7381008

Logistic Regression

lf <- LMTrainer$new(family="binomial")
lf$fit(X = xtrain, y = "Survived")
summary(lf$model)
#> 
#> Call:
#> stats::glm(formula = f, family = self$family, data = X, weights = self$weights)
#> 
#> Deviance Residuals: 
#>     Min       1Q   Median       3Q      Max  
#> -2.0647  -0.5139  -0.3550   0.5659   2.5979  
#> 
#> Coefficients:
#>              Estimate Std. Error z value Pr(>|z|)    
#> (Intercept)  1.882835   0.667638   2.820  0.00480 ** 
#> Pclass      -0.991285   0.198153  -5.003 5.66e-07 ***
#> Sex          3.014839   0.250533  12.034  < 2e-16 ***
#> Age         -0.050270   0.010402  -4.833 1.35e-06 ***
#> SibSp       -0.376242   0.132598  -2.837  0.00455 ** 
#> Parch       -0.137521   0.146524  -0.939  0.34796    
#> Fare         0.001671   0.002794   0.598  0.54981    
#> Cabin        0.017868   0.005923   3.017  0.00256 ** 
#> Embarked     0.076637   0.148818   0.515  0.60657    
#> ---
#> Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
#> 
#> (Dispersion parameter for binomial family taken to be 1)
#> 
#>     Null deviance: 823.56  on 623  degrees of freedom
#> Residual deviance: 495.21  on 615  degrees of freedom
#> AIC: 513.21
#> 
#> Number of Fisher Scoring iterations: 5
predictions <- lf$predict(df = xtest)
auc(actual = xtest$Survived, predicted = predictions)
#> [1] 0.7930805

Lasso Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=1)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7981181

Ridge Logistic Regression

lf <- LMTrainer$new(family="binomial", alpha=0)
lf$cv_model(X = xtrain, y = "Survived", nfolds = 5, parallel = FALSE)
#> Computation done.
pred <- lf$cv_predict(df = xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7937464

Random Forest

rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 3)
rf$fit(X = xtrain, y = "Survived")

pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                               83.179331
#> Fare                              49.530245
#> Age                               44.026986
#> Cabin                             27.806777
#> Pclass                            22.210427
#> SibSp                             13.742906
#> Parch                              9.837351
#> Embarked                           6.777296

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7512739

Xgboost

xgb <- XGBTrainer$new(objective = "binary:logistic"
                      , n_estimators = 500
                      , eval_metric = "auc"
                      , maximize = T
                      , learning_rate = 0.1
                      ,max_depth = 6)
xgb$fit(X = xtrain, y = "Survived", valid = xtest)
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-auc:0.910654  val-auc:0.815229 
#> [51] train-auc:0.977882  val-auc:0.803243 
#> [101]    train-auc:0.990142  val-auc:0.808280 
#> [151]    train-auc:0.994508  val-auc:0.807354 
#> [201]    train-auc:0.996520  val-auc:0.809352 
#> [251]    train-auc:0.997454  val-auc:0.809496 
#> [301]    train-auc:0.998147  val-auc:0.808309 
#> [351]    train-auc:0.998554  val-auc:0.808136 
#> [401]    train-auc:0.998796  val-auc:0.809120 
#> [451]    train-auc:0.999060  val-auc:0.809699 
#> [500]    train-auc:0.999104  val-auc:0.809699

pred <- xgb$predict(xtest)
auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.8096989

Grid Search

xgb <- XGBTrainer$new(objective="binary:logistic")
gst <-GridSearchCV$new(trainer = xgb,
                             parameters = list(n_estimators = c(10,50),
                             max_depth = c(5,2)),
                             n_folds = 3,
                             scoring = c('accuracy','auc'))
gst$fit(xtrain, "Survived")
#> [1] "entering grid search"
#> [1] "In total, 4 models will be trained"
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.129808 
#> [10] train-error:0.098558
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.105769 
#> [10] train-error:0.088942
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.098558 
#> [10] train-error:0.069712
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.129808 
#> [50] train-error:0.038462
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.105769 
#> [50] train-error:0.036058
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.098558 
#> [50] train-error:0.036058
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.177885 
#> [10] train-error:0.153846
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.201923 
#> [10] train-error:0.137019
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.182692 
#> [10] train-error:0.115385
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.177885 
#> [50] train-error:0.110577
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.201923 
#> [50] train-error:0.096154
#> converting the data into xgboost format..
#> starting with training...
#> [1]  train-error:0.182692 
#> [50] train-error:0.081731
gst$best_iteration()
#> $n_estimators
#> [1] 10
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0
#> 
#> $accuracy_sd
#> [1] 0
#> 
#> $auc_avg
#> [1] 0.883034
#> 
#> $auc_sd
#> [1] 0.0242347

Random Search

rf <- RFTrainer$new()
rst <- RandomSearchCV$new(trainer = rf,
                          parameters = list(n_estimators = c(10, 50),
                                            max_depth = c(5, 2)),
                          n_folds = 3,
                          scoring = c('accuracy', 'auc'),
                          n_iter = 3)
rst$fit(xtrain, "Survived")
#> [1] "In total, 3 models will be trained"
rst$best_iteration()
#> $n_estimators
#> [1] 50
#> 
#> $max_depth
#> [1] 5
#> 
#> $accuracy_avg
#> [1] 0.849359
#> 
#> $accuracy_sd
#> [1] 0.0264787
#> 
#> $auc_avg
#> [1] 0.8279856
#> 
#> $auc_sd
#> [1] 0.02242134

Let’s create a new feature based on the target variable using target encoding, and test a model with it.
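
The idea behind smoothed target encoding: replace each category with its target mean shrunk toward the global mean, so that rare categories don’t receive extreme values. A hand-rolled sketch of the formula (not superml’s exact internals; m is a hypothetical smoothing weight):

toy <- data.table(g = c("a", "a", "a", "b"), y = c(1, 0, 1, 1))
m <- 10                          # hypothetical smoothing weight
global_mean <- mean(toy$y)
# (category_sum + m * global_mean) / (category_count + m)
toy[, enc := (sum(y) + m * global_mean) / (.N + m), by = g]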

# add target encoding features
xtrain[, feat_01 := smoothMean(train_df = xtrain,
                        test_df = xtest,
                        colname = "Embarked",
                        target = "Survived")$train[[2]]]
xtest[, feat_01 := smoothMean(train_df = xtrain,
                               test_df = xtest,
                               colname = "Embarked",
                               target = "Survived")$test[[2]]]

# train a random forest
rf <- RFTrainer$new(n_estimators = 500, classification = 1, max_features = 4)
rf$fit(X = xtrain, y = "Survived")
pred <- rf$predict(df = xtest)
rf$get_importance()
#>          tmp.order.tmp..decreasing...TRUE..
#> Sex                               85.213569
#> Fare                              51.676287
#> Age                               47.071256
#> Cabin                             28.804936
#> Pclass                            22.431287
#> SibSp                             13.735815
#> Parch                              9.643044
#> feat_01                            4.449812
#> Embarked                           4.385365

auc(actual = xtest$Survived, predicted = pred)
#> [1] 0.7512739