51 Unsupervised Learning

Unsupervised learning aims to learn relationships within a dataset without focusing on a particular outcome. You will often hear of unsupervised learning being performed on unlabeled data; to be clear, that means it does not use labels to guide learning, whether labels are available or not. You might, for example, perform unsupervised learning ahead of supervised learning, as we shall see later (a small sketch of that pattern follows below). Unsupervised learning includes a number of approaches, most of which can be divided into two categories:

Clustering: group cases based on their similarity
Decomposition / Dimensionality reduction: project the data to a new, often lower-dimensional, space

In rtemis, clustering algorithms begin with c_ and decomposition/dimensionality reduction algorithms begin with d_.
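
For instance, you can feed the output of a decomposition into a supervised model. Here is a minimal base-R sketch of that pattern, using stats::prcomp and MASS::lda rather than the rtemis API:

# Unsupervised step: project the iris features with PCA
pc <- prcomp(iris[, 1:4], scale. = TRUE)

# Supervised step: classify species from the first two PC scores
fit <- MASS::lda(pc$x[, 1:2], grouping = iris$Species)

# Training-set accuracy (illustration only; no resampling)
mean(predict(fit, pc$x[, 1:2])$class == iris$Species)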

51.1 Decomposition / Dimensionality Reduction

Use decomSelect() to get a listing of available decomposition algorithms:

decomSelect()
.:decomSelect
rtemis supports the following decomposition algorithms:

    Name                                   Description
   H2OAE                               H2O Autoencoder
 H2OGLRM                H2O Generalized Low-Rank Model
     ICA                Independent Component Analysis
  Isomap                                        Isomap
    KPCA           Kernel Principal Component Analysis
     LLE                      Locally Linear Embedding
     MDS                      Multidimensional Scaling
     NMF             Non-negative Matrix Factorization
     PCA                  Principal Component Analysis
    SPCA           Sparse Principal Component Analysis
     SVD                  Singular Value Decomposition
    TSNE   t-distributed Stochastic Neighbor Embedding
    UMAP Uniform Manifold Approximation and Projection

We can further divide decomposition algorithms into linear methods (e.g. PCA, ICA, NMF) and nonlinear dimensionality reduction methods, also known as manifold learning (e.g. LLE, t-SNE, UMAP).
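
As a taste of manifold learning, here is a minimal sketch using the Rtsne package directly; it is independent of the rtemis API, and the perplexity value is just a common default:

# t-SNE embedding of the iris features with the Rtsne package
library(Rtsne)
set.seed(2023)
xm <- as.matrix(iris[, 1:4])
emb <- Rtsne(xm, dims = 2, perplexity = 30,
             check_duplicates = FALSE)  # iris contains one duplicated row
plot(emb$Y, col = iris$Species,
     xlab = "t-SNE 1", ylab = "t-SNE 2", main = "t-SNE on iris")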

51.1.1 Principal Component Analysis (PCA)

x <- iris[, 1:4]
iris_PCA <- d_PCA(x)
09-26-23 23:03:17 Hello, egenn [d_PCA]
09-26-23 23:03:17 ||| Input has dimensions 150 rows by 4 columns, [d_PCA]
09-26-23 23:03:17     interpreted as 150 cases with 4 features. [d_PCA]
09-26-23 23:03:17 Performing Principal Component Analysis... [d_PCA]
09-26-23 23:03:17 Completed in 8.3e-05 minutes (Real: 0.01; User: 3e-03; System: 1e-03) [d_PCA]

mplot3_xy(iris_PCA$projections.train[, 1], 
          iris_PCA$projections.train[, 2], 
          group = iris$Species,
          xlab = "1st PCA component", 
          ylab = "2nd PCA component", 
          main = "PCA on iris")

51.1.2 Independent Component Analysis (ICA)

iris_ICA <- d_ICA(x, k = 2)
09-26-23 23:03:17 Hello, egenn [d_ICA]
09-26-23 23:03:17 ||| Input has dimensions 150 rows by 4 columns, [d_ICA]
09-26-23 23:03:17     interpreted as 150 cases with 4 features. [d_ICA]
09-26-23 23:03:17 Running Independent Component Analysis... [d_ICA]
09-26-23 23:03:17 Completed in 6e-04 minutes (Real: 0.04; User: 2e-03; System: 1e-03) [d_ICA]

mplot3_xy(iris_ICA$projections.train[, 1], 
          iris_ICA$projections.train[, 2], 
          group = iris$Species,
          xlab = "1st ICA component", 
          ylab = "2nd ICA component", 
          main = "ICA on iris")

51.1.3 Non-negative Matrix Factorization (NMF)

iris_NMF <- d_NMF(x, k = 2)
09-26-23 23:03:17 Hello, egenn [d_NMF]
09-26-23 23:03:18 ||| Input has dimensions 150 rows by 4 columns, [d_NMF]
09-26-23 23:03:18     interpreted as 150 cases with 4 features. [d_NMF]
09-26-23 23:03:18 Running Non-negative Matrix Factorization... [d_NMF]
09-26-23 23:03:18 Completed in 0.02 minutes (Real: 1.04; User: 0.73; System: 0.04) [d_NMF]

mplot3_xy(iris_NMF$projections.train[, 1], 
          iris_NMF$projections.train[, 2], 
          group = iris$Species,
          xlab = "1st NMF component", 
          ylab = "2nd NMF component", 
          main = "NMF on iris")

51.2 Clustering

Use clustSelect() to get a listing of available clustering algorithms:

clustSelect()
.:clustSelect
rtemis supports the following clustering algorithms:

      Name                                                 Description
    CMeans                                    Fuzzy C-means Clustering
    DBSCAN Density-based spatial clustering of applications with noise
       EMC                         Expectation Maximization Clustering
    HARDCL                                   Hard Competitive Learning
    HOPACH     Hierarchical Ordered Partitioning And Collapsing Hybrid
 H2OKMeans                                      H2O K-Means Clustering
    KMeans                                          K-Means Clustering
 MeanShift                                       Mean Shift Clustering
      NGAS                                       Neural Gas Clustering
       PAM                                 Partitioning Around Medoids
      PAMK               Partitioning Around Medoids with k estimation
      SPEC                                         Spectral Clustering

Let’s cluster the iris dataset. We will also reuse an NMF decomposition, as above, to project the data to 2 dimensions for plotting.
We’ll use two of the most popular clustering algorithms: K-means and PAM, a.k.a. K-medoids.

x <- iris[, 1:4]
iris_NMF <- d_NMF(x, k = 2)
09-26-23 23:03:18 Hello, egenn [d_NMF]
09-26-23 23:03:18 ||| Input has dimensions 150 rows by 4 columns, [d_NMF]
09-26-23 23:03:18     interpreted as 150 cases with 4 features. [d_NMF]
09-26-23 23:03:18 Running Non-negative Matrix Factorization... [d_NMF]
09-26-23 23:03:19 Completed in 4.4e-03 minutes (Real: 0.26; User: 0.26; System: 0.01) [d_NMF]

51.2.1 K-Means

iris_KMeans <- c_KMeans(x, k = 3)
09-26-23 23:03:19 Hello, egenn [c_KMeans]
09-26-23 23:03:19 Performing K-means Clustering with k = 3... [c_KMeans]
09-26-23 23:03:19 Completed in 2e-03 minutes (Real: 0.12; User: 0.09; System: 0.01) [c_KMeans]

mplot3_xy(iris_NMF$projections.train[, 1], iris_NMF$projections.train[, 2],
          group = iris_KMeans$clusters.train,
          xlab = "1st NMF component", 
          ylab = "2nd NMF component", 
          main = "KMEANS on iris")

51.2.2 Partitioning Around Medoids with k estimation (PAMK)

iris_PAMK <- c_PAMK(x, krange = 3:10)
09-26-23 23:03:19 Hello, egenn [c_PAMK]
09-26-23 23:03:19 Partitioning Around Medoids... [c_PAMK]
09-26-23 23:03:19 Estimated optimal number of clusters: 3 [c_PAMK]
09-26-23 23:03:19 Completed in 0.01 minutes (Real: 0.56; User: 0.39; System: 0.02) [c_PAMK]

mplot3_xy(iris_NMF$projections.train[, 1], iris_NMF$projections.train[, 2],
          group = iris_PAMK$clusters.train,
          xlab = "1st NMF component", 
          ylab = "2nd NMF component", 
          main = "PAMK on iris")

51.3 See also