50 Supervised Learning

This is a very brief introduction to machine learning using the rtemis package, which includes functions for visualization, data preprocessing, unsupervised learning, and supervised learning.

50.1 Installation

Install the remotes package, if you don't already have it:

install.packages("remotes")

Install rtemis:

remotes::install_github("egenn/rtemis")

rtemis uses a large number of packages under the hood. Since you are unlikely to need all of them, they are not installed by default. Each time an rtemis function is called, a dependency check is run and a message is printed if any required packages are missing.

For this short tutorial, start by installing ranger, if it is not already installed:

install.packages("ranger")

Load rtemis:

library(rtemis)

50.2 Data Input for Supervised Learning

All rtemis supervised learning functions begin with s_ ("supervised").
They accept the same first four arguments:
x, y, x.test, y.test
but are flexible and also allow you to provide combined (x, y) and (x.test, y.test) data frames, as explained below.

50.2.1 Scenario 1: (x.train, y.train, x.test, y.test)

In the most straightforward case, provide each feature set and outcome individually:

  • x: Training set features
  • y: Training set outcome
  • x.test: Testing set features (Optional)
  • y.test: Testing set outcome (Optional)
x <- rnormmat(nrow = 200, ncol = 10, seed = 2019)
w <- rnorm(10)
y <- x %*% w + rnorm(200)
res <- resample(y, seed = 2020)
.:Resampling Parameters
    n.resamples: 10
      resampler: strat.sub
   stratify.var: y
        train.p: 0.75
   strat.n.bins: 4
09-28-23 21:16:40 Created 10 stratified subsamples [resample]

x.train <- x[res$Subsample_1, ]
x.test <- x[-res$Subsample_1, ]
y.train <- y[res$Subsample_1]
y.test <- y[-res$Subsample_1]
mod_glm <- s_GLM(x.train, y.train, x.test, y.test)
09-26-23 23:14:36 Hello, egenn [s_GLM]

.:Regression Input Summary
Training features: 147 x 10 
 Training outcome: 147 x 1 
 Testing features: 53 x 10 
  Testing outcome: 53 x 1 

09-26-23 23:14:37 Training GLM... [s_GLM]

.:GLM Regression Training Summary
    MSE = 0.84 (91.88%)
   RMSE = 0.92 (71.51%)
    MAE = 0.75 (69.80%)
      r = 0.96 (p = 5.9e-81)
   R sq = 0.92

.:GLM Regression Testing Summary
    MSE = 1.22 (89.03%)
   RMSE = 1.10 (66.88%)
    MAE = 0.90 (66.66%)
      r = 0.94 (p = 2.5e-26)
   R sq = 0.89
09-26-23 23:14:37 Completed in 0.01 minutes (Real: 0.44; User: 0.42; System: 0.02) [s_GLM]

50.2.2 Scenario 2: (x.train, x.test)

You can provide the training and testing sets as a single data.frame each, where the last column is the outcome. In this case, x is the full training data and y the full testing data:

  • x: data.frame(x.train, y.train)
  • y: data.frame(x.test, y.test)
x <- rnormmat(nrow = 200, ncol = 10, seed = 2019)
w <- rnorm(10)
y <- x %*% w + rnorm(200)
dat <- data.frame(x, y)
res <- resample(dat, seed = 2020)
09-28-23 21:16:40 Input contains more than one columns; will stratify on last [resample]
.:Resampling Parameters
    n.resamples: 10
      resampler: strat.sub
   stratify.var: y
        train.p: 0.75
   strat.n.bins: 4
09-28-23 21:16:40 Created 10 stratified subsamples [resample]

dat_train <- dat[res$Subsample_1, ]
dat_test <- dat[-res$Subsample_1, ]
mod_glm <- s_GLM(dat_train, dat_test)
09-26-23 23:14:37 Hello, egenn [s_GLM]

.:Regression Input Summary
Training features: 147 x 10 
 Training outcome: 147 x 1 
 Testing features: 53 x 10 
  Testing outcome: 53 x 1 

09-26-23 23:14:37 Training GLM... [s_GLM]

.:GLM Regression Training Summary
    MSE = 0.84 (91.88%)
   RMSE = 0.92 (71.51%)
    MAE = 0.75 (69.80%)
      r = 0.96 (p = 5.9e-81)
   R sq = 0.92

.:GLM Regression Testing Summary
    MSE = 1.22 (89.03%)
   RMSE = 1.10 (66.88%)
    MAE = 0.90 (66.66%)
      r = 0.94 (p = 2.5e-26)
   R sq = 0.89
09-26-23 23:14:37 Completed in 1.3e-04 minutes (Real: 0.01; User: 0.01; System: 1e-03) [s_GLM]

The dataPrepare() function checks the data dimensions, determines whether the data were provided as separate feature and outcome sets or as combined data frames, and verifies that the correct numbers of cases and features were supplied.

In either scenario, Regression will be performed if the outcome is numeric and Classification if the outcome is a factor.
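
For example, converting the outcome (last) column of the Scenario 2 data to a factor makes the same s_GLM() call perform classification instead of regression. The snippet below is only an illustrative sketch; the median split into two classes is an arbitrary choice for demonstration:

dat_bin <- dat
# Binarize the outcome (last column) so that it becomes a two-level factor
dat_bin[[ncol(dat_bin)]] <- factor(
  ifelse(dat_bin[[ncol(dat_bin)]] > median(dat_bin[[ncol(dat_bin)]]), "high", "low")
)
res_bin <- resample(dat_bin, seed = 2020)
# Same call as before, but a classification model is now trained
mod_cls <- s_GLM(dat_bin[res_bin$Subsample_1, ], dat_bin[-res_bin$Subsample_1, ])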

50.3 Regression

50.3.1 Check Data with check_data()

x <- rnormmat(nrow = 500, ncol = 50, seed = 2019)
w <- rnorm(50)
y <- x %*% w + rnorm(500)
dat <- data.frame(x, y)
res <- resample(dat)
09-28-23 21:16:40 Input contains more than one columns; will stratify on last [resample]
.:Resampling Parameters
    n.resamples: 10
      resampler: strat.sub
   stratify.var: y
        train.p: 0.75
   strat.n.bins: 4
09-28-23 21:16:40 Created 10 stratified subsamples [resample]

dat_train <- dat[res$Subsample_1, ]
dat_test <- dat[-res$Subsample_1, ]

check_data(x)
  x: A data.table with 500 rows and 50 columns

  Data types
  * 50 numeric features
  * 0 integer features
  * 0 factors
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 0 missing values

  Recommendations
  * Everything looks good 

50.3.2 Single Model

mod <- s_GLM(dat_train, dat_test)
09-26-23 23:14:37 Hello, egenn [s_GLM]

.:Regression Input Summary
Training features: 374 x 50 
 Training outcome: 374 x 1 
 Testing features: 126 x 50 
  Testing outcome: 126 x 1 

09-26-23 23:14:37 Training GLM... [s_GLM]

.:GLM Regression Training Summary
    MSE = 1.02 (97.81%)
   RMSE = 1.01 (85.18%)
    MAE = 0.81 (84.62%)
      r = 0.99 (p = 1.3e-310)
   R sq = 0.98

.:GLM Regression Testing Summary
    MSE = 0.98 (97.85%)
   RMSE = 0.99 (85.35%)
    MAE = 0.76 (85.57%)
      r = 0.99 (p = 2.7e-105)
   R sq = 0.98
09-26-23 23:14:37 Completed in 2.3e-04 minutes (Real: 0.01; User: 0.01; System: 2e-03) [s_GLM]

50.3.3 Crossvalidated Model

mod <- train(dat, alg = "glm")
09-26-23 23:14:37 Hello, egenn [train]

.:Regression Input Summary
Training features: 500 x 50 
 Training outcome: 500 x 1 

09-26-23 23:14:37 Training Generalized Linear Model on 10 stratified subsamples... [train]
09-26-23 23:14:37 Outer resampling plan set to sequential [resLearn]

.:train GLM
Mean MSE of 10 stratified subsamples: 1.22
Mean MSE reduction: 97.50%
09-26-23 23:14:37 Completed in 4.2e-03 minutes (Real: 0.25; User: 0.23; System: 0.02) [train]

Use the describe() function to get a summary in (plain) English:

mod$describe()
Regression was performed using Generalized Linear Model. Model generalizability was assessed using 10 stratified subsamples. The mean R-squared across all testing set resamples was 0.97.
mod$plot()

50.4 Classification

50.4.1 Check Data

data(Sonar, package = "mlbench")
check_data(Sonar)
  Sonar: A data.table with 208 rows and 61 columns

  Data types
  * 60 numeric features
  * 0 integer features
  * 1 factor, which is not ordered
  * 0 character features
  * 0 date features

  Issues
  * 0 constant features
  * 0 duplicate cases
  * 0 missing values

  Recommendations
  * Everything looks good 
res <- resample(Sonar)
09-26-23 23:14:38 Input contains more than one columns; will stratify on last [resample]
.:Resampling Parameters
    n.resamples: 10 
      resampler: strat.sub 
   stratify.var: y 
        train.p: 0.75 
   strat.n.bins: 4 
09-26-23 23:14:38 Using max n bins possible = 2 [strat.sub]
09-26-23 23:14:38 Created 10 stratified subsamples [resample]

sonar_train <- Sonar[res$Subsample_1, ]
sonar_test <- Sonar[-res$Subsample_1, ]

50.4.2 Single Model

mod <- s_Ranger(sonar_train, sonar_test)
09-26-23 23:14:38 Hello, egenn [s_Ranger]

09-26-23 23:14:38 Imbalanced classes: using Inverse Frequency Weighting [dataPrepare]

.:Classification Input Summary
Training features: 155 x 60 
 Training outcome: 155 x 1 
 Testing features: 53 x 60 
  Testing outcome: 53 x 1 

.:Parameters
   n.trees: 1000 
      mtry: NULL 

09-26-23 23:14:38 Training Random Forest (ranger) Classification with 1000 trees... [s_Ranger]

.:Ranger Classification Training Summary
                   Reference 
        Estimated  M   R   
                M  83   0
                R   0  72

                   Overall  
      Sensitivity  1      
      Specificity  1      
Balanced Accuracy  1      
              PPV  1      
              NPV  1      
               F1  1      
         Accuracy  1      
              AUC  1      

  Positive Class:  M 

.:Ranger Classification Testing Summary
                   Reference 
        Estimated  M   R   
                M  25  11
                R   3  14

                   Overall  
      Sensitivity  0.8929 
      Specificity  0.5600 
Balanced Accuracy  0.7264 
              PPV  0.6944 
              NPV  0.8235 
               F1  0.7812 
         Accuracy  0.7358 
              AUC  0.8643 

  Positive Class:  M 
09-26-23 23:14:38 Completed in 2.3e-03 minutes (Real: 0.14; User: 0.21; System: 0.03) [s_Ranger]
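
The Parameters printout above lists the hyperparameters used by s_Ranger (n.trees and mtry). As a minimal sketch, assuming these can be passed as arguments under the same names shown in the printout, you could, for example, grow fewer trees and fix mtry:

# Illustrative only: argument names assumed from the Parameters printout above
mod_rf <- s_Ranger(sonar_train, sonar_test, n.trees = 500, mtry = 10)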

50.4.3 Crossvalidated Model

mod <- train(Sonar)
09-26-23 23:14:38 Hello, egenn [train]

.:Classification Input Summary
Training features: 208 x 60 
 Training outcome: 208 x 1 

09-26-23 23:14:38 Training Ranger Random Forest on 10 stratified subsamples... [train]
09-26-23 23:14:38 Outer resampling plan set to sequential [resLearn]

.:train Ranger
Mean Balanced Accuracy of 10 stratified subsamples: 0.83
09-26-23 23:14:39 Completed in 0.02 minutes (Real: 1.15; User: 2.06; System: 0.34) [train]

mod$describe()
Classification was performed using Ranger Random Forest. Model generalizability was assessed using 10 stratified subsamples. The mean Balanced Accuracy across all testing set resamples was 0.83.
mod$plot()

mod$plotROC()

mod$plotPR()

50.4.4 Evaluation of a Binary Classifier

[Figure: Evaluation metrics for a binary classifier]
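
As a reminder, these metrics can be computed directly from the 2 x 2 confusion matrix. The base-R sketch below reuses the counts from the Ranger testing summary above (positive class M):

tp <- 25; fn <- 3   # reference M: estimated M (true positives), estimated R (false negatives)
fp <- 11; tn <- 14  # reference R: estimated M (false positives), estimated R (true negatives)

sensitivity <- tp / (tp + fn)                         # 0.8929
specificity <- tn / (tn + fp)                         # 0.5600
balanced_accuracy <- (sensitivity + specificity) / 2  # 0.7264
ppv <- tp / (tp + fp)                                 # 0.6944
npv <- tn / (tn + fn)                                 # 0.8235
f1 <- 2 * ppv * sensitivity / (ppv + sensitivity)     # 0.7812
accuracy <- (tp + tn) / (tp + fp + fn + tn)           # 0.7358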

50.5 Understanding Overfitting

Overfitting occurs when a model fits noise in the outcome rather than the underlying signal. To make this clear, consider the following example:

Assume a random variable x:

set.seed(2020)
x <- sort(rnorm(500))

and a data-generating function fn():

fn <- function(x) 12 + x^5

The true y is therefore equal to fn(x):

y_true <- fn(x)

However, assume y is recorded with some noise, in this case Gaussian:

y_noise <- fn(x) + rnorm(500, sd = sd(y_true))

We plot:

mplot3_xy(x, list(y_noise = y_noise, y_true = y_true))

We want to find a model that best approximates y_true, but we only know y_noise.

A maximally overfitted model would fit the noise perfectly:

mplot3_xy(x, list(Overfitted_model = y_noise, Ideal_model = y_true), type = "l")

An example of an SVM set up to overfit heavily:

mplot3_xy(x, y_noise, 
          fit = "svm",
          fit.params = list(kernel = "radial", cost = 100, gamma = 100))

An example of a good approximation of fn using a GAM with penalized splines:

mplot3_xy(x, y_noise, fit = "gam")
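
To quantify, rather than just visualize, overfitting, the following base-R sketch (an illustrative addition, not part of the rtemis workflow) fits polynomials of increasing degree to a training split of (x, y_noise) and compares training and testing mean squared error:

set.seed(2021)
train_idx <- sample(500, 375)  # 75% training split, chosen arbitrarily for illustration
dat_of <- data.frame(x = x, y = y_noise)

degrees <- c(1, 3, 5, 10, 20)
errors <- t(sapply(degrees, function(degree) {
  fit <- lm(y ~ poly(x, degree), data = dat_of[train_idx, ])
  c(degree = degree,
    train_mse = mean((dat_of$y[train_idx] - fitted(fit))^2),
    test_mse = mean((dat_of$y[-train_idx] - predict(fit, newdata = dat_of[-train_idx, ]))^2))
}))
errors
# Training MSE keeps decreasing as degree grows, while testing MSE stops improving
# (and typically worsens) beyond the degree needed to capture the signal.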

50.6 rtemis Documentation

For more information on using rtemis, see the rtemis online documentation and vignettes.